Random thoughts.: June 2014

In 2007, when I had begun dipping my toes into Machine Learning (ML), the area was little known. When people asked me what I worked on, I usually had to begin by explaining what ML is. Occasionally, I would say Artificial Intelligence (AI) to save on the explanation. In fact, I wasn't sure then whether I wanted to do my research in ML or move to something more mainstream, like Databases or Algorithms (I didn't move). So I find the semi-celebrity status ML has gained over the last 6 years interesting, to say the least.

Unfortunately, I think the new-found fame has worked both ways; while it is nice to see ML being recognized, I also notice what seems to be the makings of a bubble: there is a sudden demand for ML skills, mismatched expectations of what ML can do for an organization (mismatched with respect to the effort that they are willing to put in), and a notion of how delightfully easy making good models is (for all the good work MOOCs are doing, this is a sad side-effect).

In this rather long post, I want to set right a few of those misleading first impressions. This post is for you if you haven't got your hands dirty with using ML on real-world problems yet - for example, if you are new to the area, or maybe you are planning to start a data science team but have little or no prior experience in the area.

Do the math. Literally. Let's say you want to build the fastest car in the world. Would you be able to do so if you are scared of studying the machinery under the hood? Obviously not. Machine Learning involves a fair amount of mathematics. If you want to be good, train yourself to understand the math. And to extend to the general case, know the underlying algorithm where (much) math is not involved.

I have seen people think along these lines: Let’s pick one of the libraries/tools available and do some ML. I don’t really care what goes on inside the algorithms - I can do pretty well without the math.

In the right context, this way of working is fine. I have done this a few times myself. The right context here is that you don’t have the time to invest in learning, or time for implementing some custom algorithm yourself, but you do want some prediction capability.

Where this starts becoming a problem is when you start seeing this stopgap solution as the solution - you tend to develop a mismatched expectation that this is going to lead you to the best possible solution for your problem, or that it is always going to work. Acknowledge that, with this approach, you are only going to get something that works – where “works” could mean anything from “just about works” to “works perfectly”. And you may not have much control over where your solution falls in this spectrum.

The only way to be really good is to know the math so that you can pick the right algorithms and squeeze the last bit of performance out of them.

Confidential: path to enlightenment

Humans can beat your algorithms. This should not really be surprising. Take the task of searching for an image similar to one given to you. Today, you can almost always outperform a computer. Try out Google Search by Image to see what I mean - the best results are the ones that are identical or very similar to your input. Or appear on same/similar pages as your input. Beyond that, the results go haywire.

Google, Logan Lerman and Jennifer Connelly are different people.

Look at the results of a search for Jennifer Connelly. The image I used in the left screenshot seems to be a popular one - it appears on multiple websites that have articles about her or her wallpapers. The visually similar images returned as part of the search results are quite accurate. In fact, 5 of the 9 results seem to be from the same interview/session. But if I pick an image that is not so popular, as in the screenshot on the right, Google throws in a few dudes in its results.

This is not just one cherry-picked pathological example; here is what Richard Szeliski, a noted researcher in the field of computer vision, has to say:

It may be many years before computers can name and outline all of the objects in a photograph with the same skill as a two year old child.

You use machine learning not because you can do better than humans (in some cases this may be true). You use ML when the scale that a computer can handle is more important than the accuracy a human can provide. You are good at image search, but can you sift through a billion images in a second?

Applying ML to a real-world problem implies you are willing to sacrifice some accuracy to be able to handle scale. Scale might be important to you because it makes business sense. For example, if you wanted to set up image search as a service, would you rather be handling 1,000 customers a day with 100% accuracy (a hired staff working behind the scenes) or 1 million customers with 70% accuracy? This is really a trick question - there is no clear answer here, since it depends on your objectives, business model, etc. My point is that there is a case to be made for scale.

Sometimes your ability to handle scale can also indirectly improve your accuracy. An example:

1. You create a search engine that is not as accurate as humans but can crawl the whole internet to give you good enough results to begin with.
2. This makes it a very popular tool – because at least your tool can look at the whole of the vast WWW.
3. As more and more people use it, you keep noting which links people click among the ones you have provided as search results.
4. You start ranking higher the ones clicked most. And lo! – now you have something that progressively starts approximating human search behavior.
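
The feedback loop in the steps above can be sketched in a few lines. This is only an illustration, not how any real search engine works: the URLs, the click log and the `rerank` helper are all made up for the example.

```python
from collections import Counter

def rerank(results, click_counts):
    """Order results by observed clicks, most-clicked first.

    sorted() is stable, so results with equal click counts keep the
    order the engine originally gave them.
    """
    return sorted(results, key=lambda url: -click_counts[url])

# Hypothetical initial ranking produced by the (imperfect) search engine.
initial_results = ["a.example", "b.example", "c.example", "d.example"]

# Hypothetical click log: each entry is the result some user clicked.
click_log = ["c.example", "a.example", "c.example",
             "d.example", "c.example", "a.example"]

# A Counter returns 0 for results nobody clicked.
click_counts = Counter(click_log)
reranked = rerank(initial_results, click_counts)
print(reranked)
```

The engine never gets "smarter" here; it simply borrows human judgement at scale, which is the point of the example.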

Moral of the story - know when to invest in ML. More importantly, set your expectations right about the returns on the investment.

And this is not a glass of pepsi

Patterns detected by many algorithms cannot be conveniently interpreted. Take Artificial Neural Networks (ANN), Support Vector Machines (SVM), and Random Forests (RF). These are well established algorithms that are known to do well on a variety of problems but, in general, the model they learn cannot be represented as easy-to-understand rules. Say you are trying to predict estate prices based on various factors like the average income in the locality, proximity of the estate to important landmarks in a city, the number of hospitals and schools in the locality, etc. If you use an SVM on the problem, in most cases, you won't be able to infer something like this from the model:
The price of the estate is directly proportional to the number of hospitals within a 3km radius of the estate.

This often leads to confusion on these lines:
How can we trust this black-box model when we really don’t know what it is doing?

Some thoughts:
- Decouple the following ideas: (1) "can be easily interpreted by humans", (2) "reliable". Algorithms like ANNs have a strong theoretical base - if you want to understand what the model spits out, the way to do it is not to look for rules, but to go back and look at this theory. Yes, the paradigm is different - instead of picking out rules from a model and looking at them saying "Yes, this one looks reasonable", the idea is to have a proof that says that whatever comes out of the learning phase, under certain circumstances, has such-and-such error bounds and would perform in such-and-such manner.
  
  This is, in a way, reliable in a much more objective manner - the proof puts down what's to be expected in black and white.
  
  One of the ideas used by the winning team of the famous Netflix prize was Restricted Boltzmann Machines. Not exactly very interpretable and yet it raked in a hefty $1M prize. Reconsider - do you really need interpretability? :)
- Interpretability offers you a comfort zone. Convince yourself to step out of it because we have a simple test to measure accuracy - and that's what really matters. Keep a test set of data that is not used by the algorithm to train. Find out how accurate the algorithm is on this data. This is as reliable and objective as it gets.
  
  I guess what I am driving at is whenever you find yourself not comfortable with a model that cannot be interpreted, ask yourself whether you really need to interpret it. And the answer cannot be "for my own comfort". Isn't high accuracy sufficient? Isn't a good model better than an interpretable but not-so-good a model? In some cases, you might really need models which need to be explained - perhaps, to your clients, as part of your findings - this is probably the only time you should start seriously thinking about interpretability.
Worry about interpretability when you really need it. Don't expect a good model to be also conveniently interpretable.
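
The held-out test described above can be sketched as follows. Everything here is an assumption made for illustration: the synthetic one-dimensional data, the 70/30 split, and the toy "model" (a single threshold). The same procedure applies unchanged to a model you cannot interpret.

```python
import random

random.seed(0)

def make_example():
    """Hypothetical 1-D data: label is 1 when x > 0.5, with 10% of labels flipped."""
    x = random.random()
    label = int(x > 0.5)
    if random.random() < 0.1:
        label = 1 - label
    return x, label

data = [make_example() for _ in range(1000)]

# Hold out 30% of the data; the model never sees it during training.
random.shuffle(data)
split = int(0.7 * len(data))
train, test = data[:split], data[split:]

def accuracy(threshold, examples):
    """Fraction of examples for which 'x > threshold' matches the label."""
    correct = sum(int(x > threshold) == y for x, y in examples)
    return correct / len(examples)

# "Training": pick the threshold that does best on the training set only.
candidates = [i / 100 for i in range(101)]
best_threshold = max(candidates, key=lambda t: accuracy(t, train))

print("train accuracy:", accuracy(best_threshold, train))
print("test accuracy: ", accuracy(best_threshold, test))
```

The number to report is the second one: it is measured on data that played no part in choosing the model.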

Use the right metrics.


Do you remember the compass that Captain Jack Sparrow uses in Pirates of the Caribbean? The needle points to whatever the person using the compass wants most. The compass is a fantastic tool, but useless, if you don't know what you want.

ML is like the compass, but it helps you a lot better if you can express exactly what you want to measure.

Sorry for butting in with a cheesy analogy. Hopefully, it helps me make my point.

Consider this problem: you are running a car dealership. You make a convincing sales pitch to everyone who walks into your store. Additionally, for certain customers, chosen from among those who have agreed to fill in some kind of a form (where you ask them about their income, current mode of transport, name of place of work, age, etc.), you throw in something extra: you let them borrow a trial car for a day, with fuel expenses paid. You believe that this set of customers will make a purchase with some cajoling.

How do you pick this set of people? Let's say you train an ML model based on the information entered in the form, to make the binary prediction: likely to purchase (we call these our positive labels) or not likely to purchase (negative labels). You make the trial-car offer to people your model assigns positive labels to. Clearly, a lot rides on the performance of this algorithm.

Let's assume that, typically, 5% of the people visiting your store buy a car. Let's look at some metrics you may consider to assess the value of your model:
1. How about good old accuracy? This measures the percentage of predictions you get correct overall. If I have a model that only predicts "not likely to purchase" I would be wrong only 5% of the time. So, my model is 95% accurate! That should normally be great, but it makes no sense for my business since my model doesn't identify anyone as a potential purchaser anymore.
2. We are interested in people who are likely to buy - so let's use a model that has a high accuracy in predicting a positive label for people with a positive label in our data. Now, you could have a model that always says "likely to purchase", which will give you a 100% accuracy within the group of positively labelled people. This is no good either - think about all those free trials you would be giving away to the 95% who aren't going to buy!
3. Obviously, we need a way to balance out the problems we just saw. We want the model to find the people we can sell to, so that we don't miss them. But this should not happen at the cost of identifying non-purchasers as purchasers - because then we would just be giving away way too many free trials.
A good metric to use here is the F1-score, which is defined in terms of precision and recall. In our context:
$$\text{Precision} = \frac{\text{people identified as purchasers who are actually purchasers}}{\text{people identified as purchasers}}$$
$$\text{Recall} = \frac{\text{people identified as purchasers who are actually purchasers}}{\text{actual purchasers}}$$
$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Note how points 1 and 2 (from the previous list) are taken care of by precision and recall. In particular, look at how the F1 score penalizes both low precision and low recall. For clarity, plots of F1 (z-axis) as a function of precision and recall (x and y axes) are shown below. The four plots correspond to the same relationship, but show the plot from different perspectives. Swatches of the same color represent approximately the same F1 score. Observe that the plotted surface seems "taped" to the x and y axes, signifying that if either precision or recall is zero, the F1-score is zero.

F1 score as a function of precision and recall
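
To make the dealership numbers concrete, here is a small sketch that computes accuracy, precision, recall and F1 for three models. The store with 1000 visitors (50 of them buyers), the "middle" model's predictions, and the convention of reporting 0 when a ratio is undefined are all assumptions made for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1).

    Assumed convention: a ratio with a zero denominator is reported as 0.0.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical store: 1000 visitors, 5% of whom actually buy.
y_true = [1] * 50 + [0] * 950

always_no = [0] * 1000    # the model from point 1
always_yes = [1] * 1000   # the model from point 2
# A made-up model that flags 40 of the 50 buyers and 60 of the 950 non-buyers.
middle = [1] * 40 + [0] * 10 + [1] * 60 + [0] * 890

for name, y_pred in [("always no", always_no),
                     ("always yes", always_yes),
                     ("middle", middle)]:
    p, r, f1 = precision_recall_f1(y_true, y_pred)
    print(f"{name:10s} accuracy={accuracy(y_true, y_pred):.3f} "
          f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

The "always no" model keeps its 95% accuracy but gets an F1 of 0, and the "always yes" model gets perfect recall but an F1 below 0.1; only the model that balances the two scores reasonably.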

Picking the right metric is important to make sure that your ML model is solving a problem that translates into value for you. There are a host of standard metrics you can pick from, for example F-score, lift, ROC area, average precision, precision/recall break-even point, squared error, and cross-entropy. Of course, you might want to consider using a custom metric if that suits you well.

Always have a baseline. Some problems are harder than others. Whether 90% accuracy is good, depends on the problem.

My favorite example is Sentence Boundary Disambiguation (SBD) - the task is to identify elements/tokens in a given text that indicate sentence boundaries i.e. points where a sentence ends. Here, a token of text is anything not a whitespace. For example, the text (a quote by Leonardo Da Vinci):

In rivers, the water that you touch is the last of what has passed and the first of that which comes; so with present time.

has 28 tokens: 25 words and 3 punctuation symbols. We need to identify which tokens among these 28 represent sentence boundaries. There is only one in this sentence - the "." at the end. This is a binary classification problem - for each token, we only need to specify whether it is a sentence boundary or not. The catch here is that boundary indicators have other uses too - for example, "." is also used for abbreviations - and also, there is more than one sentence boundary indicator - "?", "!", etc.

Is a model with an F1-score of 90% a good model for this task? Let's start by looking at some very simple "models" and how they perform. In the table below, I have listed 4 such very-non-ML-ish models for the SBD task, and their performance on text from the Brown corpus. Performance is characterized by precision, recall and F1 scores (a positive label for a token is when we identify it as indicating a sentence boundary).

| Description of model | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|
| Sentence boundary if element is "." | 100.0 | 86.06 | 92.51 |
| If element is one of "?", ".", "!" | 94.35 | 91.54 | 92.92 |
| If element is "." and it hasn't occurred as one of last 3 elements | 100.0 | 84.97 | 91.87 |
| If element is one of "?", ".", "!" and none of the last 3 elements are "?", "." or "!" | 94.26 | 84.58 | 89.16 |

As you can well see, it is hard not to score around 90% on this task! And I wrote the code that generated the above data while writing this post - all 4 models in about 50 lines of Python.
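
For a feel of what such baselines look like, here is a minimal sketch of the four rules from the table. It is not the code that produced the numbers above: the tiny hand-labelled token list stands in for the Brown corpus, and the reading of "last 3 elements" as the three tokens immediately before the current one is my assumption.

```python
TERMINATORS = {".", "?", "!"}

def rule_period(tokens, i):
    """Boundary if the token is '.'."""
    return tokens[i] == "."

def rule_any(tokens, i):
    """Boundary if the token is one of '?', '.', '!'."""
    return tokens[i] in TERMINATORS

def rule_period_spaced(tokens, i):
    """Boundary if the token is '.' and no '.' occurs in the 3 tokens before it."""
    return tokens[i] == "." and "." not in tokens[max(0, i - 3):i]

def rule_any_spaced(tokens, i):
    """Boundary if the token is a terminator and none of the 3 tokens before it is."""
    previous = tokens[max(0, i - 3):i]
    return tokens[i] in TERMINATORS and not TERMINATORS.intersection(previous)

def evaluate(rule, tokens, gold):
    """Return (precision, recall, F1) of a rule against gold boundary indices."""
    predicted = {i for i in range(len(tokens)) if rule(tokens, i)}
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical toy text: "Mr. Smith left. Why? He was late!"
tokens = ["Mr", ".", "Smith", "left", ".", "Why", "?", "He", "was", "late", "!"]
gold = {4, 6, 10}  # indices of the true sentence boundaries (hand-labelled)

for rule in (rule_period, rule_any, rule_period_spaced, rule_any_spaced):
    p, r, f1 = evaluate(rule, tokens, gold)
    print(f"{rule.__name__:20s} precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

On eleven tokens the scores mean little; the point is that each baseline is one line of logic, so there is no excuse for not having one.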

A baseline tells you how effective an absolutely commonsensical model is. This helps you better guide your efforts: do you want to spend time implementing a model that the literature says would give you 95% accuracy, when there is something reasonably good you can do in a short time? Is your accuracy of 60% necessarily a bad score? Not if you have a reasonable baseline that gives you 30%. How far have you come from your baseline accuracy with your current model? If it's not much, you may want to revisit your hypotheses about the data so that you can pick a better model. For all real problems, having a reasonable baseline is a must.

Try it out. Models make various implicit or explicit assumptions about the underlying data - for example, about what the separation boundary between the classes looks like, the size of the data set, the distribution of noise, whether sequence matters, etc. Since you wouldn't have all the information about your data that could help you validate these assumptions (if you knew so much you probably wouldn't be modeling the data), you might end up using a model that doesn't suit your data. And the only way to know (at least, in most cases) that there is a mismatch is by actually trying it out on the data. Building a good model can be an extremely empirical process. So, be careful not to become complacent only because a model looks good on paper. It might not work the same way for your data.

Bias-Variance tradeoff, dimensionality and cross-validation. These are topics that deserve longer discussions and I am not going to talk about them much here. I fear that a short discussion would either confuse or impart a false sense of security. If you work with real data, I cannot stress enough the importance, and relevance, of learning about the bias-variance tradeoff, the effects of dimensionality and the practice of cross-validation. What I am going to do, however, is give you a teaser of the pitfalls that you are vulnerable to if you are unaware of these ideas.

Say your friend comes to you with data to be classified. You have, at your disposal, implementations of the Naive Bayes algorithm and C4.5 - a type of decision tree. You know nothing about how the data was generated but your friend tells you that he is almost sure that the data has a complex separation boundary between the positive and negative classes. So, which algorithm do you use? Clearly C4.5, right? After all, it is well known that it can come up with complex boundaries while Naive Bayes can only get you linear boundaries.

Here's the surprising bit: with C4.5, you can do worse than with the Naive Bayes classifier, even though the data is not linearly separable.

What did you miss? You forgot to take into account the fact that Naive Bayes has a high bias and decision trees suffer from high variance. As a result, the size of the dataset affects their relative performance.

Something with a high bias is very rigid in its idea of the separation boundary between the positive and negative classes. Naive Bayes comes up with a separation boundary that is a line - throwing more data will only help it find a better line, but it cannot find anything non-linear i.e. say if the optimal separation boundary looks like a circle, well, you are out of luck.

A high variance classifier can find complex boundaries but is sensitive to the precise distribution of points. This is a problem because data is often contaminated with noise, i.e. instances in your data set that are not a consequence of the phenomena you want to study/model. In the car sales pitch problem discussed previously, the forms in which people filled in an incorrect age would result in noise. A high variance classifier would try to find a separation boundary which correctly labels the noisy points too. This is not a smart thing to do since the noise that you saw in the training data (data used to build our model) may not show up in the test data (data on which we are making our predictions based on the model built on training data) - it's noise after all - leading to inaccurate predictions.

The figure below shows these problems. In the left panel, which shows the training data, we can see how a high variance model overfits and finds a boundary that includes the noisy green points - which exist far inside the "red region". The high bias classifier is unaffected, but then, it is also a bad model, since the ideal boundary seems quadratic. In the right panel, we see that in the test data the noisy points are absent, and instead, there are red points present in the region as we might expect. The high variance model predicts everything in this region as green, however, since that is what it has learnt from the training data. The high bias model labels a lot of the green points as red, because it could never fit the data correctly in the first place. That it was not affected by noise is an artifact of its general inflexibility.

Also note that even the ideal boundary misclassifies one red point (bottom left-most) - the lesson is no one is expected to get everything right, we are merely in the business of minimizing errors.

Left panel: Training Data. Right panel: Test Data.

The graph below shows how the accuracies of our models change as the size of the dataset increases. With little training data, the high variance decision tree classifier fits noise - so much so that the Naive Bayes model, which is downright wrong for this data (remember the data in the example is not linearly separable), outperforms it. A large amount of data effectively tames the decision tree to focus on the dominant patterns in the data only (i.e. not noise) and its performance becomes better. We also discussed that throwing more data at a high-bias classifier doesn't help things much - accordingly, we see that the Naive Bayes line plateaus after a while, whereas more data definitely helps the decision tree become progressively better.

Accuracy of Naive Bayes vs Decision tree, as a function of dataset size. Source.

This scenario should tell you that some of what you think of as well-informed decisions can turn out to be catastrophically wrong (duh! life!) - what is a better classifier can depend on how much data you have (among other things). Now, go read up on these topics as much as you can :)
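
If you want to poke at this yourself, here is a self-contained sketch of the experiment. To keep it dependency-free it does not use Naive Bayes or C4.5: a nearest-centroid classifier stands in for the high-bias model (its boundary is a straight line) and a 1-nearest-neighbour classifier stands in for the high-variance one. The parabolic boundary, the 5% label noise and the dataset sizes are all assumptions; whether and where the two curves cross depends on them and on the random seed.

```python
import random

def true_label(x, y):
    """Hypothetical ground truth: a quadratic (parabolic) class boundary."""
    return int(y > x * x - 1 / 3)

def sample(n, noise, rng):
    """Draw n points uniformly from [-1, 1]^2; flip each label with probability `noise`."""
    points = []
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        label = true_label(x, y)
        if rng.random() < noise:
            label = 1 - label
        points.append((x, y, label))
    return points

def centroid_classifier(train):
    """High bias: predict the class whose mean point is closer (a linear boundary)."""
    means = {}
    for c in (0, 1):
        members = [(x, y) for x, y, label in train if label == c]
        if members:
            means[c] = (sum(p[0] for p in members) / len(members),
                        sum(p[1] for p in members) / len(members))
    def predict(x, y):
        return min(means, key=lambda c: (x - means[c][0]) ** 2 + (y - means[c][1]) ** 2)
    return predict

def nearest_neighbour_classifier(train):
    """High variance: copy the label of the closest training point, noise included."""
    def predict(x, y):
        nearest = min(train, key=lambda p: (x - p[0]) ** 2 + (y - p[1]) ** 2)
        return nearest[2]
    return predict

def accuracy(predict, test):
    return sum(predict(x, y) == label for x, y, label in test) / len(test)

rng = random.Random(0)
test = sample(500, noise=0.0, rng=rng)  # test data without the noisy labels
results = {}
for n in (10, 50, 200, 1000):
    train = sample(n, noise=0.05, rng=rng)
    results[n] = (accuracy(centroid_classifier(train), test),
                  accuracy(nearest_neighbour_classifier(train), test))
    print(f"n={n:5d}  high-bias={results[n][0]:.3f}  high-variance={results[n][1]:.3f}")
```

Try raising the noise level or shrinking the training sets and watch how the comparison changes; that sensitivity is exactly the pitfall described above.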

Literature survey is important. If you are new to ML you may have those moments when it seems you've hit upon an original idea to solve a problem. I will be harsh here and tell you that it's most likely not as original as you think it is. ML has been around for a while - research on neural networks started in the late 50s! - and in the last couple of decades, the field has developed at a reeling pace. What that means for you is that most common "use-cases", and many not so common ones, have been thought about, and solutions published or patented.

One way I like to put this is, to be original, you are now competing with people across space and time. In your class in school, for an idea to seem brilliant, you only had to come up with something that didn't strike your 30-odd peers in a given amount of time. Here, however, your peer group is composed of every living human being who is working on ML (and would publish/patent) - they are distributed across the globe and some of them probably have the luxury of dedicating more time to solving ML problems than you do. And this is only the space part. Your competitors are also the people who contributed research to the discipline in the last 50-60 years. This is the time part. Needless to say, with such a massive demographic that spans space and time, for something to be original is tough. Especially so if you are unaware of what the existing literature consists of.

The moral here is that when you come across a problem, do your literature survey first. Chances are your problem has already been studied. Even if that's not the case, a good literature survey can provide a logical/mathematical structure to tease your problem into, enabling quick and systematic progress. Don't reinvent the wheel. And remember that Google Scholar and CiteSeerX are your friends.

I feel our approach to ML problems (or, really, any problem) should be in the same spirit of humility with which E. W. Dijkstra, computer scientist extraordinaire and a Turing award winner, suggested we approach problems in computer programming:

Edsger W. Dijkstra

We shall do a much better programming job, provided that we approach the task with a full appreciation of its tremendous difficulty, provided that we stick to modest and elegant programming languages, provided that we respect the intrinsic limitations of the human mind and approach the task as Very Humble Programmers. (Source: The Humble Programmer)

Working with data is a lot of dirty work. This point is especially directed to people who want to move into Data Mining or Machine Learning job profiles. In most (not all) of these cases there is the glamorous notion involved that they will be working on cool learning algorithms all day. Truth is, working with data can be hard and inelegant. And sometimes, plain boring. All those algorithms ML teaches you can be applied only if you have captured the required data in the first place, cleaned it up (data capturing is hardly ever perfect and bereft of noise) and transformed it into a format that your ML tools can use. This is not pretty work and takes up most of your time. This is often the 'straw that breaks the camel's back', so to speak - people who were enamored with working on ML become disgruntled with the area when they are faced up-close with these aspects of their job.

My advice to people who want to move into this line of work is to set your expectations right. Find someone whose day job is using ML, preferably in the workplace you are interested in, and ask them what their daily routine consists of.

The following awesome CnH comic strip aptly sums up my sentiments - only replace "science" with "ML":

Solutions evolve. It is interesting that although we take a mature approach of iterative development for software, in the case of modeling data, a similar mindset is not as prevalent. We tend to commit mistakes of both omission and commission. Either we give the problem too little attention and expect the first few tries to completely nail it, or we tend to get into a purist-academic research mode where we don't want to commit to a solution until we deem it perfect. I feel that the software model of iterative development fits very nicely with how ML models should be built in the industry:
- You need to have a long-term commitment to your data modeling problem. Don't expect to dedicate 2 days to your problem and get a model that is optimal, robust, processor friendly, has low data complexity, etc.
- Don't get stuck trying to come up with the best model in the first go. Think versions. Keep revisiting your solution and making it better. This has the added advantage that your first (few) version(s) will help set up a proper workflow/pipeline/interface: your model is presumably one component of a product or a service, and the first (few) model(s) can help you smooth out the various kinks around how it precisely plugs into the larger system.

The man inside the machine is not going away soon. (Warning - this is a subjective viewpoint) Although the last few years have seen immense progress in the field of ML (I am occasionally still surprised that voice based search is a reality and runs off my mobile phone) we are not yet at a place where we can let algorithms learn from data in cruise-control mode. You need human involvement to pick the best algorithm from the plethora available out there if you want anything to be done in a reasonable amount of time. I have heard people lament that the whole learning process is not "automatic", i.e. you can't throw whatever data you have into a black-box and have an optimal model come out. But then the problem we are solving is so complex!

Different phenomena produce data that have different underlying idiosyncrasies, and to model these effectively, many learning algorithms come halfway by making some assumptions of their own so they don't have to rediscover everything from scratch. How, then, is the right algorithm to be matched to the problem, accounting for your various constraints like time, accuracy, computational resources, etc? The missing element is you - you need to pick the right algorithm for your problem, much like picking the right key from a keychain for a lock.

In theory, yes, you could have an algorithm that loops through all possible separation boundaries to pick an optimal one - but there are an infinite (this is not an exaggeration) number of them out there, and you would need forever to solve a single problem. So that you don't have to suffer through this rigmarole, you rely on the experience and judgement of a (good) data scientist to pick a promising subset to try out, from the universe of models. Unless there is a major theoretical revolution in the field, say on a Newtonian or Einsteinian scale, or alternatively, parallel processing becomes far more parallel and affordable, enough so that we can brute force our way out, in the general case the need for human involvement stays.

Phew!


Sunday, June 15, 2014

Machine Learning - what it is, what is not
