Recently, I conducted a session on Python where I walked through implementing a kNN classifier. Close to the end of the session, we got to how succinct Python can be, and I proceeded to reduce our code to the absolute minimum number of lines possible. The impromptu code-golfing exercise led me to an interesting realization - you can write a kNN classifier in one line of Python. A line short enough (126 characters) to fit into a tweet!
Yep, I checked:
Fig 1. A tweet-sized classifier.
In retrospect, this does not seem surprising given certain features of the language. It's just one of those things you don't normally think about; you wouldn't want to implement an algorithm in one, potentially confusing, line. Frankly, you shouldn't - in the interest of the sanity of the poor souls who might need to read your code someday.
But notwithstanding coding etiquette, we are going to take a look at this since a one-liner kNN is, well, pretty awesome!
In the next couple of sections I provide some context to the problem and the solution. I have made these sections collapsible (this doesn't seem to work with the mobile version of the site), since they are not the focus of the post. If you already know what a kNN classifier is, and understand Python maps/lambdas, skip ahead to the section "Initial setup".
What is a kNN classifier?
A classifier looks at a data point provided to it and declares it to belong to a particular class. A common example of this is a spam classifier, which scans through the text of a mail and declares it to belong to one of two classes: spam or no-spam. If you use Gmail, you have already seen a very powerful spam classifier in action - using a classifier is how Gmail knows what goes into your spam folder.
A classifier is initially presented with data points along with their correct labels - so that it can identify associative patterns between the data points and the labels. This dataset is known as training data. For ex, in the case of spam, a classifier might note that words like free, lottery etc. strongly correlate with a mail being spam. Once the training phase is done, the classifier is ready to tag unseen instances, known as test data points. Beyond the training phase, the classifier may not need to keep the training dataset around - instead, it can rely solely on the patterns it has learned.
The kNN classifier, however, is a lazy classifier - it does not bother with learning associative patterns. It keeps all of the training dataset around (at least, the simplest avatar of the classifier does so), and for a test point, finds points similar to it in the training dataset, and outputs the dominant label among the similar points. This is where it derives its name from - its classification is based on the labels of the k nearest neighbours of the test point.
I can almost describe the philosophy of the kNN classifier with this pithy quote:
You are the company you keep.
In machine learning, a data point is typically described as a vector of features (also known as attributes or dimensions). For ex, to describe a mail, the features of interest might be the presence or absence of certain words; these could be my features:
- has the word free
- has the word lottery
- has the word weekend
And a mail described by the vector [1, 1, 0] would have the words free and lottery in it - indicated by the 1s - but would not have the word weekend, as indicated by the 0.
The number of features is termed the dimensionality of the data. In the above toy example, we were dealing with 3-dimensional vectors.
Given that every data point is depicted by an n-dimensional vector, we can plot the point in an n-dimensional coordinate system. Yes, we can't see stuff beyond 3 dimensions, but in theory, we can think of a data point being plotted so. This helps us visualize what a kNN classifier does. In the following figure the coloured points belong to the training dataset - some have the label red, some green. We want to find the label for the test point, shown with a "?". If we assume k = 3 (smaller circle), the dominant label is red (red to green = 2:1), and that is what we declare. If k = 5 (larger circle), green dominates (red to green = 2:3) and that is what we declare as the result.
Fig 2. The kNN classifier
The above figure helps make a significant point - the accuracy of a kNN classifier depends on k. We won't worry about the right value of k in this post - we'll assume k is provided to us.
Note that, in the figure above, we implicitly assumed that the neighbours of a point are defined by the Euclidean distance - the distance one gets between two points when measuring with a ruler. But a different way of measuring distances/similarities may make sense for a problem.
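To make this concrete, the Euclidean distance itself fits in a lambda - a quick sketch for illustration (note that it is a distance, not a similarity: smaller means more alike, so a kNN built on it would keep the k smallest values rather than the k largest):

```python
# Euclidean distance between two equal-length vectors
euclidean = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

print(euclidean([0, 0], [3, 4]))  # 5.0
```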
So what does the algorithm for kNN classification look like? A naive implementation would have the following steps (see the sketch right after this list):
- Calculate the similarity of the test point with each of the training points
- Get the labels of the k most similar points
- Declare the label of the test point to be the majority vote label
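Before golfing anything down, here is a minimal multi-line sketch of those three steps - the function name knn_classify is mine, and it assumes the train/sim/k/test conventions fixed in the "Initial setup" section below:

```python
from collections import Counter

def knn_classify(test, train, sim, k):
    # Step 1: similarity of the test point with each training point,
    # carrying the training label along
    scored = [(sim(vec, test), label) for vec, label in train]
    # Step 2: labels of the k most similar points
    top_k_labels = [label for _, label in sorted(scored)[-k:]]
    # Step 3: the majority vote label among those k
    return Counter(top_k_labels).most_common(1)[0][0]
```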
The one-liner I am interested in must implement the above steps.
A quick primer on maps and lambdas
map() is one of the key constructs that make the one-liner possible. This is what a simple map() call looks like:
map(function, iterable)
The map() call takes two arguments - a function as the first argument, and an iterable as its second argument. An iterable is anything that can be iterated on, like a list; for our purposes we will exclusively consider lists. As the name might suggest, map() creates a mapping of all items in the provided list using the function passed in as its first argument - the mapped versions of the elements are returned as a new list. (That is Python 2 behaviour; in Python 3, map() returns a lazy iterator instead.)
As an example, consider the list of integers a = [4, 1, 2]. Consider the function call: map(str, a). Here, I am passing the built-in function str - which converts its argument into a string - as the first argument. This returns a new list ['4', '1', '2'], which consists of mapped versions of the elements in the old list - string representations of the integers. Note that we pass in the function name as the first argument - str - and not str(). The latter would pass in the result of the function call (a blank string in this case), which is not callable, and the map() call would fail.
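As a runnable snippet (Python 2 shown; under Python 3, wrap the map() call in list() to see the values):

```python
a = [4, 1, 2]
strings = map(str, a)  # each integer mapped to its string form
print(strings)         # ['4', '1', '2']
```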
While map() is a handy construct in itself, it seems that the first argument to it - the mapping function - must be defined elsewhere. For ex, in the above example, we used str(), which is defined in the standard libraries. Fortunately, it turns out that this is not the only possibility, thanks to lambda expressions.
A simple lambda expression looks like this:
lambda [parameter list]: function body
Think of a lambda as a function that is defined on the fly. The parameter list specifies the parameters that go into the function, and the result of executing the function body (specified on the same line) is returned as the result of the expression. Again, for a detailed explanation, consult the docs.
For ex, the line lambda x: x*x defines a function that takes one argument x, and returns its product with itself. Note that the lambda does not have a function name (unless you assign the lambda expression to a variable). Thus, it's a "use-and-throw" function - it exists where it is defined, and since it has no name, it cannot be called later. The reason why it's so valuable is that it can be used as a valid function, including in calls to map().
Consider, again, the list of integers a = [4, 1, 2]. And consider this map call: map(lambda x: x*x, a). This returns [16, 1, 4] since the lambda expression serves as the mapping function.
There are a bunch of functions in Python, other than map(), that accept functions as arguments, and using lambdas with these can often help you write elegant and concise code. For ex, the function filter() has a function signature similar to map():
filter(function, iterable)
The output of filter() is a list that contains only those values from the original list (its second argument) for which the supplied function (the first argument) returns a truthy value. For ex, filter(lambda x: x%2==0, a) would return [4, 2] - the elements of the original list that are divisible by 2 - since the lambda expression returns True for even integers. We will meet yet another such function, max(), while analysing the one-liner.
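Here are the two lambda-powered calls from this primer side by side (again with Python 2's list semantics; use list(...) around the calls under Python 3):

```python
a = [4, 1, 2]
print(map(lambda x: x * x, a))          # [16, 1, 4] - the lambda is the mapping function
print(filter(lambda x: x % 2 == 0, a))  # [4, 2]     - the lambda is the predicate
```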
Initial setup
I will assume that these variables are defined for us:
- train - is a list with the training data. Every element in this list is a tuple with the first entry as the feature vector and the second entry as the label. So, if I had the following 4-dimensional data-points in my training data:
- [2, 4, 3, 5], label = 0
- [5, 7, 4, 7], label = 1
- [6, 8, 2, 3], label = 1
- [5, 7, 2, 4], label = 0
then train = [([2, 4, 3, 5], 0), ([5, 7, 4, 7], 1), ([6, 8, 2, 3], 1), ([5, 7, 2, 4], 0)].
- sim(x, y) - the similarity function. Takes two data-points as arguments and returns a numeric value for similarity. For our example, we use the dot-product as a measure of similarity. The dot-product of two vectors [a, b, c] and [p, q, r] is a*p + b*q + c*r.
- k - the number of nearest neighbours we need the labels from. For our example, we will assume k = 3.
- test - the test data point, which we are interested in classifying. This is a list of the feature values. We use [1, 2, 3, 4] as an example test point.
- I am also assuming that the Python collections module is imported into the current namespace, i.e. that the statement "from collections import *" has been executed (I know this sounds like cheating, but hold off on judging me till you have read the next section ...)
Note that 1-4 are external to the classifier and have to be provided anyway - I am just fixing the variable names and structures.
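For concreteness, here is one way to set all of this up - a sketch, with sim() spelled out as the dot-product described in point 2:

```python
from collections import *  # brings Counter into the current namespace

# Training data: (feature vector, label) tuples
train = [([2, 4, 3, 5], 0), ([5, 7, 4, 7], 1), ([6, 8, 2, 3], 1), ([5, 7, 2, 4], 0)]

# Similarity: the dot-product of two equal-length vectors
def sim(x, y):
    return sum(a * b for a, b in zip(x, y))

k = 3                # number of nearest neighbours to vote
test = [1, 2, 3, 4]  # the point we want to classify
```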
The One (line ...)
And finally, [drum roll] the magic incantation ...
max(Counter(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:])).items(), key = lambda x:x[1])[0]
If I were to decompose the above line into logical steps, this is how I would go about it:
- Step 1: map(lambda x: (sim(x[0], test), x[1]), train)
- Step 2: sorted( Step 1 )[-k:]
- Step 3: map(lambda x: x[1], Step 2 )
- Step 4: Counter( Step 3 ).items()
- Step 5: max( Step 4 , key = lambda x:x[1])[0]
Explanation
Step 1: Get all similarities, keep the labels around
map(lambda x: (sim(x[0], test), x[1]), train)
In the first step, we call map() on the training data. The lambda expression returns a tuple with the first entry sim(x[0], test) and the second entry x[1]. Consider the fact that train is a list of tuples itself, and thus the argument to the lambda is a tuple. For ex, the first time the lambda function is called, its argument is ([2, 4, 3, 5], 0). Thus, for the lambda, x[0] is [2, 4, 3, 5] and x[1] is 0. Hence, the first entry of the returned tuple is the similarity of the training data point to the test data point, while the second entry is the label of the training data point.
Given the sample train and sim, this first step produces the following list:
[(39, 0), (59, 1), (40, 1), (41, 0)]
As a side-note, it is quite interesting that the dot-product can itself be written as a lambda using the zip() function:
lambda list_1, list_2: sum(map(lambda x: x[0]*x[1], zip(list_1, list_2)))
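To see it at work on the first training point:

```python
dot = lambda list_1, list_2: sum(map(lambda x: x[0] * x[1], zip(list_1, list_2)))
print(dot([2, 4, 3, 5], [1, 2, 3, 4]))  # 39 - matching the first similarity above
```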
Step 2: Find the top k points
sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:]
We now sort the output of the last step using the sorted() function. Since we are sorting tuples, and we have not mentioned a key, the sort happens on the first entry of each tuple - the similarity to the test point. By default, the sort is in increasing order, so the output so far looks like this:
[(39, 0), (40, 1), (41, 0), (59, 1)]
You can sort in decreasing order too - by passing in the argument reverse=True - but that's more keystrokes :)
Now, we extract the tuples with the highest k similarities using [-k:]. The "-" in Python's list indexing notation implies the counting is done from the tail of the list - here, we get the portion of the list starting with the kth element from the tail and running till the last element. This gives us the tuples with the k highest similarities. Thus, this is the output of this step:
[(40, 1), (41, 0), (59, 1)]
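In isolation, the sort-and-slice of this step looks like this:

```python
scored = [(39, 0), (59, 1), (40, 1), (41, 0)]  # output of Step 1
print(sorted(scored)[-3:])  # [(40, 1), (41, 0), (59, 1)] - the k = 3 most similar
```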
Step 3: Get the labels for these top k points
map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:])
Since we have the tuples corresponding to the top-k similar points, we really don't need to keep the similarities around any more. Another map() call does the trick - look at the lambda expression here: it only returns the value at index 1 of each tuple. Therefore, as output from this step, we have just the labels:
[1, 0, 1]
Step 4: Get label frequencies
Counter(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:])).items()
We now need to find the label with the highest frequency of occurrence among the k closest points. We use the Counter data structure from the collections module. A Counter is a special kind of dict which, when initialized with a list, uses the unique elements in the list as keys and their frequencies of occurrence as the corresponding values. Thus, Counter([1,1,2,3,2,2,1]) gives me the dict {1: 3, 2: 3, 3: 1}. This is exactly what we need. Initializing a Counter with the output of the last step - [1, 0, 1] - gives us this dict:
{1: 2, 0: 1}
Calling items() on a dict returns a list of tuples, each of whose first entry is a key and whose second entry is the corresponding value. Hence, the output of this step is (the order of the tuples is not guaranteed, but that does not matter for the next step):
[(0, 1), (1, 2)]
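The Step 4 machinery in isolation:

```python
from collections import Counter

labels = [1, 0, 1]        # output of Step 3
counts = Counter(labels)  # Counter({1: 2, 0: 1})
print(counts.items())     # [(0, 1), (1, 2)] in Python 2; a dict view in Python 3
```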
Time to address the elephant in the room, I guess :) You could argue that using something from a module does not make for a strict one-liner. And in a sense that is true. I could have avoided using Counter, and instead used tricks like these. Technically, we would have still ended up with a one-liner, but Counter expresses our objective in a clear manner. Thus, not using Counter does not mean that you cannot have a one-liner, only that the result won't be pretty.
Note that since I am calling Counter directly, my import should be "from collections import *". If you want a cleaner namespace, you might want to do "import collections" and refer to Counter as "collections.Counter".
Really close now ...
Step 5: Get the most common label
max(Counter(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:])).items(), key = lambda x:x[1])[0]
All that is left now is to extract the label with the highest frequency. That's easy with Python's max() function. Note that our tuples (output from the last step) have the label as the first entry and the frequency of the label as the second entry. Hence, in the max() call, we provide the key argument, which makes max() pick the element with the largest second entry. Yes, with a lambda expression - they really do get around. Thus, max() returns me this:
(1, 2)
Of which, I extract the label by specifying the index [0]. This gives me my final answer:
1
And this ... is the label we are looking for [melodramatic bow] :)
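Putting it all together as a runnable script, using the setup fixed earlier (with sim written inline as a lambda this time):

```python
from collections import *

train = [([2, 4, 3, 5], 0), ([5, 7, 4, 7], 1), ([6, 8, 2, 3], 1), ([5, 7, 2, 4], 0)]
sim = lambda x, y: sum(p * q for p, q in zip(x, y))  # dot-product similarity
k, test = 3, [1, 2, 3, 4]

# The one-liner
print(max(Counter(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:])).items(), key = lambda x:x[1])[0])  # 1
```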
Regression and a related hack
While I was writing this post, I realized that the one-liner for kNN-based regression is actually shorter - 92 characters. Also purer - it does not rely on modules that need importing. In a regression problem we are interested in predicting continuous values, as against discrete labels in classification problems. For ex, you may want to determine the price of a piece of real estate based on features like proximity to a hospital, distance from an airport, proximity to industries etc. The training data has prices per data point instead of labels. In kNN-based regression, you would look at the values of the k neighbours and return their average.
Here goes:
sum(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:]))/(1.0*k)
If you understand the classifier, understanding this should be easy. The argument to sum() - extracting the values of the k most similar points - is the same as Steps 1-3 of the classifier; the sum( ... )/(1.0*k) wrapper, which averages those values, is the part specific to regression. Note that I need to multiply the denominator by 1.0 to make sure Python does not perform an integer round-off in the division (this matters in Python 2; in Python 3, / on integers already produces a float).
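With the same setup as before (the 0/1 labels now read as values to average), the regression one-liner evaluates like so:

```python
train = [([2, 4, 3, 5], 0), ([5, 7, 4, 7], 1), ([6, 8, 2, 3], 1), ([5, 7, 2, 4], 0)]
sim = lambda x, y: sum(p * q for p, q in zip(x, y))
k, test = 3, [1, 2, 3, 4]

# Average of the values of the k most similar points
print(sum(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:]))/(1.0*k))  # 0.666... - mean of [1, 0, 1]
```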
Now, here is an interesting relationship between kNN regression and classification - if you have only two labels, 0 and 1, then averaging the labels and rounding off the result is equivalent to majority voting. For example, consider some sample outputs from Step 3 above:
| Top k labels | Avg of labels | Avg rounded | Majority vote |
| --- | --- | --- | --- |
| 1, 0, 1 | 0.67 | 1 | 1 |
| 0, 0, 1, 0 | 0.25 | 0 | 0 |
| 1, 1, 0, 0, 1 | 0.6 | 1 | 1 |
This does not work with more than two labels - with labels like 0, 1 and 2, the average can round off to a label that never occurs among the neighbours.
Thus, as long as the labels in the classification problem are the numbers 0 and 1, we can use the following expression:
int(round(sum(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:]))/(1.0*k)))
The inner portion is the expression for regression we saw earlier; the int(round( ... )) wrapper is what classification needs. round() is a built-in in Python which does not need us to import anything (yay!); however, since it returns a float (in Python 2, at least), we explicitly need to cast to int (bummer ...).
This version of the one-line classifier is 104 characters long.
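And a quick check that the hack agrees with the full classifier on our running example:

```python
train = [([2, 4, 3, 5], 0), ([5, 7, 4, 7], 1), ([6, 8, 2, 3], 1), ([5, 7, 2, 4], 0)]
sim = lambda x, y: sum(p * q for p, q in zip(x, y))
k, test = 3, [1, 2, 3, 4]

# Average the 0/1 labels, round off to the majority label, cast back to int
print(int(round(sum(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:]))/(1.0*k))))  # 1
```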
Conclusions
A final tally of character counts for the various one-liners. Additionally, mentioned in parentheses, are the sizes we would see if we weren't worried about readability, i.e. one-letter variable names and no spaces:
- kNN classification - 126 characters (107)
- kNN regression - 92 characters (77)
- kNN classification hack - 104 characters (89)
Pretty impressive numbers - my respect for Python just went up a notch!
I am pretty sure these are not the only possible implementations of the one-liners. For starters, you might favour list comprehensions over map(), or do away with the Counter to create a 'purer' one-liner. If you have something interesting, let me know in a comment!
Although great as an intellectual challenge, I don't think the one-liner is suitable for serious use. Good classification libraries may have optimized implementations of kNN - for ex, using kd-trees or ball trees - which are preferable. For example, this library.