## Sunday, July 31, 2016

### Solving hard problems with time travel!

If you find the title a tad cheesy and link-baitish, in my defence I would say that this is really exciting stuff. Stuff with time travel and cats! OK, maybe one cat. Who is also, probably, not relevant to the plot.

In a recent discussion on algorithm run-times, I faintly recalled reading that you could solve NP-complete problems if you could time-travel. Yes, time travel. Sure enough, some amount of Googling later, I found the relevant arguments by Hans Moravec. Here is an article by him. He also discusses this in his book, Robot: Mere Machine to Transcendent Mind; unfortunately, I could not find an affordable copy. I present the arguments from a small portion of the article in this post.

Here's a high-level flow of what we're going to talk about:
• We discuss interesting modifications you can make to certain simple circuits if you could travel in time.
• Then we see how those modifications can be used to solve certain kinds of problems.
• And then we build on those ideas to solve some super tough algorithmic problems.

### Circuits with feedback loops

Logic gates take as inputs binary signals - each a 0 or a 1 - and produce a binary output. We are interested in the

1. amplifier (or buffer), which does not change the input signal, and
2. the inverter or the NOT gate which inverts the input.

Both these gates take in only one input. The following figure shows how they are represented - the fat end is the input and the pointed end is the output. The NOT gate looks like an amplifier with a, um, snot-nose.

 Fig 1. The amplifier and NOT gates

There is a time delay between consuming an input and producing the output (typically in nanoseconds). We will denote this by $\Delta t$.

Consider what would happen if, after sending in an initial input signal, we immediately connect the output to the input and hold that configuration. This arrangement is known as a feedback loop. This is shown below for the NOT gate; the amplifier circuit looks similar. Note the device labelled "measure" - we're interested in measuring output at this point in the circuit.

 Fig 2. A feedback loop with the NOT gate
The case for the amplifier seems pretty non-controversial: if we started with an input of 1, the output would be 1, which again feeds in, again becomes the output, and so on. All 1s. No problem.

The case for the NOT gate does not seem so simple. Say we start with a signal of value 1, i.e. the first input to the gate is a 1. This is transformed into a 0 after $\Delta t$. This 0 immediately becomes the new input (we assume no delays in the wire; the only delay is in the gate), and another $\Delta t$ later, 1 is the new output. This then becomes the new input, leading to an output of 0 another $\Delta t$ later. So after every duration of $\Delta t$, the output switches state, i.e. it alternates between 1 and 0, starting with a 0. How rapidly this switching happens depends on the time delay - smaller values of $\Delta t$ lead to faster switching.

This is illustrated below. We start observing our feedback circuit at time $t_1$. The readings obtained by the measuring device are shown as solid circles, green for 1 and red for 0. Intervals of length $\Delta t$ are marked. The signal value in the wire between these intervals of measurement is also shown - as adjacent colored rectangles under the time axis. After each duration of length $\Delta t$, the output becomes the new input immediately - this is shown by small tight loops with the appropriate color.
 Fig 3. Behavior with positive time delay
The timeline for the amplifier holds no surprises. For the NOT gate, we see a green rectangle giving way to a red circle and a red loop, denoting inversion of the signal and this inverted signal becoming the new input. This alternation goes on because of how the circuit is built, i.e. the alternation exists by construction.
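With an ordinary (positive) delay, the feedback behaviour is easy to simulate classically. Here is a minimal sketch - the function names and the one-update-per-$\Delta t$ model are my own simplification:

```python
def simulate_feedback(gate, initial, ticks):
    """Feed a gate's output back into its input, one update per delta-t."""
    outputs, signal = [], initial
    for _ in range(ticks):
        signal = gate(signal)   # one gate delay elapses
        outputs.append(signal)  # what the measuring device reads
    return outputs

amplifier = lambda s: s      # buffer: output equals input
not_gate = lambda s: 1 - s   # inverter

print(simulate_feedback(amplifier, 1, 5))  # [1, 1, 1, 1, 1]
print(simulate_feedback(not_gate, 1, 5))   # [0, 1, 0, 1, 0]
```

As expected, the amplifier reads all 1s, and the NOT gate alternates, starting with a 0.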

What would you measure if the speed of switching becomes infinitely fast? 0? 1? Or something midway like 0.5?

### Time travel

Before we answer that question, let's first ask how you make the switching infinitely fast. And I am not using the term "infinitely" merely as a metaphor.

One way is time travel. Which, of course, is as easy as taking a stroll in the park. If you could send signals into the past, you could add a negative time delay unit to the circuit (reminds me of reannual plants - Terry Pratchett, anyone?). The role of this unit is simple on paper - carry its input unchanged to a time in the past.

With a unit like this plugged into the circuit, as shown in the following figure, what are the implications?
 Fig 4. Negative delay!!!
Consider a negative delay that conveys the signal back to the instant $t_1$. In case of the NOT circuit, you might see yourself reasoning in this manner: "1 goes in, 0 comes out, which goes into the past implying that 1 hadn't gone in, 0 had. So 1 should come out. Which, alas, goes into the past and becomes the original input, and so 0 indeed should be the output ...".

Clearly we need a diagram for the sake of sanity.

The following figure shows the timeline in the spirit of the previous diagrams. But the negative delay causes interesting differences. We again start monitoring our circuit at $t_1$ and start with a 1/green signal (iteration 1 in the diagram). Till the gate is done inverting the signal, i.e. in the interval from $t_1$ to $t_1+\Delta t$, we see the 1 signal in the wire - represented by the green rectangle. At $t_1+\Delta t$, we measure a 0 - red circle - and this signal feeds back into the gate as the new input in the past, i.e. as if 0 was the input to begin with. The diagram shows the input arc stretching back to just around $t_1$.

 Fig 5. Quantum superposition for the NOT gate
Beyond this, iteration 2 begins, and a similar sequence of steps executes and leads to future iterations. While in Fig 3 the iterations progressed from left to right on the time axis, here they are all observed within the same time span, $t_1$ to $t_1+\Delta t$ - hence they are shown stacked vertically. The mind-warping catch here is that, like before, all these iterations happen by construction, and hence you have an infinite number of both 0s and 1s, from different iterations, as legal values of the measurement at $t_1+\Delta t$. All of them are supposed to show up at the same precise instant.

So, what does the device measure?

Say hello to the interesting notion of quantum superposition. The idea is that at the subatomic level entities can be in multiple states simultaneously. Here, the signal, which we assume to be a result of phenomena at the subatomic level, is both a 0 and a 1. Quantum mechanics also tells us that superposition exists until we measure the signal which leads to decoherence or its collapse: this is geek-speak for "one state is forced on the entity when we measure its state".

 Fig 6. A cat as promised. Because I used the word "quantum" and I like cats.
Here's where I ask you to ignore the specifics of what constitutes the "signal" and the "measuring device". Note how I haven't mentioned the signals are electrical in nature. I want you to think of all this time-travel business logically - 1/0 signals may be represented by positive/negative electrical charges, by light beams of opposite polarization, or something else entirely. As the original article by Moravec points out, what you end up measuring, using the device that suspiciously looks like an ammeter (it is not; it's a symbolic measurement device), depends a lot on these specifics.

But we are going to assume a few things. We are going to assume that the measurement shown by the device is an aggregate property of the stuff that constitutes the signal over its different iterations. So individual entities in this stuff can be either in a state of 0 or 1, and the device reports a value that's an average of these values. If most of these entities are at a value 0, you read a value close to 0. If 50% of them are at a value 0 and the remaining 50% are at 1, you read 0.5 off the device.

In our case, when the device tries to probe each of these entities to see if they are a 0 or a 1, the aforementioned quantum collapse occurs. Some entities manifest as a 0, some as a 1. In what ratio? Given the circuit setup, we have no reason to believe one state is produced more than the other - so we have an exact ratio of 50:50. And the device measures 0.5.

You good? You aren't looking at me like this right now?

### Iterative problem solving

You're probably thinking "All fine, but wasn't this guy schmoozing about problem-solving a while back?" - fret not, we are getting to that bit now.

There are many problems where a solution is arrived at using an iterative algorithm. We will, in general, represent the problems we are interested in as follows: find the value of $x$ that minimizes $f(x)$. The common steps such an iterative algorithm takes are:

1. The solution to the problem is arrived at via multiple iterations. Every iteration makes a new estimate of the solution.
2. To start with, before iteration 1, we assume our estimate of the solution is an arbitrary value $x_0$.
3. Iteration $i$ takes in the current guess $x_{i-1}$ and uses it to come up with a new guess $x_i$. In most cases, the new guess is also a better guess, i.e. $f(x_{i-1}) \geq f(x_i)$. Here, $i=1, 2, 3, ...$
4. The solution to the problem, denoted by $x^*$, where $f(x^*)$ is the minimum value of $f(x)$, is a fixed point of the algorithm: an iteration that consumes $x^*$ also produces $x^*$. So, if we find the solution in iteration $k$, i.e. $x_k = x^*$, then $x_k = x_{k+1} = x_{k+2} = ... = x^*$.
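As a concrete (and entirely hypothetical) instance of Steps 1-4, here is gradient descent minimizing $f(x) = (x - 3)^2$; the minimizer $x^* = 3$ is a fixed point of the iteration:

```python
def f(x):
    return (x - 3) ** 2

def iterate(x_prev, step=0.25):
    """Step 3: turn the current guess into a better one by moving
    against the gradient of f. The minimizer x* = 3 maps to itself."""
    return x_prev - step * 2 * (x_prev - 3)

x = 10.0             # Step 2: arbitrary starting guess x0
for _ in range(50):  # Step 1: iterate many times
    x = iterate(x)

print(round(x, 6))   # 3.0 - the fixed point x* (Step 4)
```

Feeding $x^*$ back in reproduces $x^*$, which is exactly the property the circuit will exploit.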
Now let's go back to the circuits we have been talking about. What if, instead of the NOT gate, we had a complicated bit of circuitry that performed an iteration for us i.e. it performed Step 3 above? We would get the ball rolling by inputting $x_0$, this would produce $x_1$, which would produce $x_2$, and so on, till we reach $x^*$. Beyond this point the circuit only produces $x^*$. The measuring device measures the current value of $x_i$ in an iteration in this modified setup. Adapting the previous diagram, this is what would happen:

 Fig 7. Stabilizing at the fixed point , x*
Here's the million dollar question - what would the device show? We measured 0.5 in the previous case; what would it read now?

Remember we assumed that the device reports the average of everything it sees. Here, it sees a bunch of different values of $x_i$ to begin with, but once we hit $x^*$, we only get this value from that point onward; we see an infinitude of $x^*$ values. What is the average of a set of numbers that has an infinite number of $x^*$ values and a finite number of other values? Yes, the device would measure $x^*$.
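To see this more precisely: if the first $m$ iterations contribute some finite sum $c$, and the next $n$ iterations all contribute $x^*$, the average is

$$\frac{c + n\,x^*}{m + n} \;\longrightarrow\; x^* \quad \text{as } n \to \infty,$$

so the finitely many pre-convergence guesses get washed out entirely.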

Take a while to think about what we've achieved. We start the circuit at $t_1$ with some arbitrary $x_0$ as input, and $\Delta t$ time later the device coolly reads out the final answer $x^*$. All the iterations that we are supposed to wait for ... are gone. Poof! The iterations are hidden from our sense of time, thanks to our hypothetical ability to travel in time.

This is a good place to discuss what happens when an iterative algorithm does not find an answer to a problem. There is no $x^*$, and hence the algorithm keeps generating a bunch of values ad infinitum. It is important that our measuring device indicate this "no-answer" state, to help us distinguish it from cases where there is an answer. For example, it would be bad if $x$ could be any number, the iterative algorithm somehow got stuck generating 3 specific values in a round-robin fashion, and our device showed us the average of these 3 numbers. We would be misled into believing this to be the answer.

Whether a measuring device can do justice here depends on the implementation. In the specific case of the problem that we see in the next section, we will, fortunately, have a way to recognize these "no-answer" results.

### Solving NP complete problems

Finally, with all that elaborate setup, we are ready for the pièce de résistance. We take our mechanism one step further now and look at how it can be used to solve NP-complete (NPC) problems. This class of problems is of significant interest in the computer science community.

The hallmark of NPC problems is that there is no known fast algorithm to solve them. If there are $K$ candidate solutions to a problem, then, as far as we know, you need to examine essentially all $K$ of them to identify the best - no known shortcuts. In fact, efficient solutions have so notoriously resisted discovery that there is a million dollar prize for anyone who finds one. I wasn't kidding when I said we are going to solve hard problems.

As an aside, another interesting characteristic of NPC problems is that if you know how to efficiently solve one particular NPC problem, you can adapt that solution to solve any NPC problem. I mention this now because, although we are going to pick one NPC problem for illustration, we can be confident that all NPC problems are taken care of.

A popular example of an NPC problem is the Travelling Salesman Problem (TSP). A salesman is to visit each of $n$ cities, starting at a particular city, and come back to the starting city; and he must pick the shortest route. Fig 8 shows 4 cities A, B, C and D, and the distances between them.

 Fig 8. Travelling Salesman Problem. Source:[1]
Some possible routes and their lengths are:
• A-B-D-C-A, 108
• A-C-B-D-A, 141
• A-B-C-D-A, 97
It so happens that 97 is the best you can do here. What makes this problem NPC is that the only known way to solve it in general is to enumerate the possible routes and identify the shortest one - naively, all $4! = 24$ permutations of the 4 cities (fewer if you account for the fixed start and the direction of travel). For $n$ cities this is $n!$ possibilities (this is your $K$ here), and while that works out for the above example with $n=4$, the number of possible routes increases really fast with increasing $n$.
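The brute-force enumeration for Fig 8 fits in a few lines of Python. The pairwise distances below are my reading of the figure, chosen to be consistent with the three route lengths listed above:

```python
from itertools import permutations

# Pairwise distances (consistent with the route lengths quoted above)
dist = {('A', 'B'): 20, ('A', 'C'): 42, ('A', 'D'): 35,
        ('B', 'C'): 30, ('B', 'D'): 34, ('C', 'D'): 12}

def d(u, v):
    return dist[min(u, v), max(u, v)]

def route_length(route):
    """Length of the closed tour visiting the cities of route in order."""
    return sum(d(route[i], route[(i + 1) % len(route)])
               for i in range(len(route)))

# Fix the start at A and enumerate orderings of the remaining cities
tours = [('A',) + p for p in permutations('BCD')]
best = min(tours, key=route_length)
print(best, route_length(best))  # ('A', 'B', 'C', 'D') 97
```

Six orderings are nothing; the point of the post is what happens when it's $30!$ instead.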
 This wouldn't have been, umm, "complete" without an xkcd comic strip.

To give you an idea: if I wanted to solve this problem for just the districts within the state of Karnataka, India - which has 30 districts - I would need to look through $30!$ routes. That is $2.6 \times 10^{32}$ routes! Add to it the districts in the state of Maharashtra, for a total of 66 districts, and we are now looking at $66! \approx 5 \times 10^{92}$ routes. You know why that is scary? The number of atoms in the visible universe is $\sim 10^{81}$. Good luck with finding the best route.

But we are not going to worry because we have time-travel on our side ;)

To retrofit our setup from the last section, we think of the solution estimates $x_i$ as routes, and $f(x_i)$ as returning the length of a route $x_i$. We also add a new component to the setup - a knob that can be set to a number, say $d$. The problem-solving circuitry, which performs the iteration in Step 3, is programmed to work in the following manner in iteration $i$:
1. If for the input route $x_{i-1}$, we have $f(x_{i-1}) \leq d$, output the same route i.e. show $x_{i-1}$ as output.
2. Else, generate the next permutation $x_i$ and return this. Here, the "next" permutation is given by some ordering like the alphabetical ordering.
Clearly, if there is a possible route with length less than or equal to $d$, it's a fixed point for the algorithm - all iterations beyond the one that first sees such a route keep returning it. The measuring device, which now sees an infinitude of instances of this route, would only display this route. If such a route does not exist - maybe because $d$ is too small a number, for example $d=10$ for Fig 8 - then the system becomes "unsteady" in the sense that all permutations sort of show up for the device to measure. If the device has a 16-segment display to show routes, we would probably see "weird" characters in the display due to multiple symbols trying to appear on top of each other.

 Mayday ... Mayday ... Mayday! Source: [2]
This is how we solve this problem then: we start by holding the knob at a small value, where we presumably see an "unsteady" state. We keep turning it up slowly, till we see a proper route. The minimum $d$ for which this happens is our answer. This is easy (and fast) to get right by trial and error.
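Lacking an actual negative delay unit, we can at least sketch the iteration logic classically (the route list and helper names here are hypothetical; a real circuit would generate permutations in hardware):

```python
def settle(routes, length, d):
    """Classical stand-in for the time-travel circuit at knob value d:
    return the fixed point (a route with length <= d) if one exists,
    else None - the 'unsteady' state on the display."""
    def step(route):
        if length(route) <= d:
            return route                      # fixed point reached
        i = routes.index(route)
        return routes[(i + 1) % len(routes)]  # next permutation in order

    x = routes[0]
    for _ in range(len(routes) + 1):  # enough steps to see every route
        nxt = step(x)
        if nxt == x:
            return x
        x = nxt
    return None

# Toy usage with the Fig 8 route lengths (start fixed at A)
lengths = {('A','B','C','D'): 97, ('A','B','D','C'): 108,
           ('A','C','B','D'): 141, ('A','C','D','B'): 108,
           ('A','D','B','C'): 141, ('A','D','C','B'): 97}
routes = sorted(lengths)
print(settle(routes, lengths.get, d=10))  # None - "unsteady"
print(settle(routes, lengths.get, d=97))  # a route of length 97
```

Turning the knob up from a small $d$, the first value at which the display steadies - 97 here - is the answer.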

Finer details:

1. For TSP, you can always start with $d=0$.
2. How large can $d$ be? There is an easy heuristic to determine this. Pick any route at random and calculate its length, say $d'$ - this is as large as $d$ ever needs to be. Why? If you have not picked the shortest route, which is likely, then you will find your solution well before hitting $d'$ as you work your way up from $d=0$. If, by chance, you have picked the shortest route, then you will find your solution when your knob is exactly at $d'$. In either case, you don't need to go beyond $d'$.
3. The elephant-in-the-room question: in what increments should you increase $d$? This depends on the precision of the numbers in the problem. In the above TSP example, all distances are integers, and hence the shortest distance will also be an integer - so you could turn up the knob in units of $1$ when moving from $d=0$ to $d=d'$. In the generic case, if your distances have 2 significant digits after the decimal, like 54.25 km, then it makes sense to move up in increments of $0.01$.
Is this fast? Yes, very! For every position of the knob, you have your answer, or lack of it - the "unsteady" state - in $\Delta t$ time. In this $\Delta t$ time, all the circuit is doing is calculating the length of a route, which is a bunch of look-ups and sums; on a modern computer this would take much less than a second. A human would see an answer instantaneously for the knob held at a particular position.
4. Come to think of it, since you now have an upper bound $d'$, you don't even have to move up linearly from $d=0$ to $d=d'$. You can do a binary search. This is a significant saving. If the number of knob positions you explore between two consecutive integers is $p$ ($p=1$ in the TSP example, and $p=100$ for the example with 2 significant digits after the decimal), then instead of trying the knob at $pd'$ positions, you can make do with only $O(\log pd')$ positions.
5. Of course, the "knob" here is symbolic. In a simpler life, you can always have a program turn the metaphorical knob.
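The binary search from point 4 above can be sketched like so, with the circuit abstracted into a hypothetical oracle steady(d) that reports whether the display settles at knob value d:

```python
def min_threshold(steady, lo, hi):
    """Smallest integer knob value d in [lo, hi] with steady(d) True,
    assuming steady is monotone (False below the answer, True at and
    above it)."""
    while lo < hi:
        mid = (lo + hi) // 2
        if steady(mid):
            hi = mid       # a route of length <= mid exists; try smaller
        else:
            lo = mid + 1   # unsteady at mid; the answer lies above
    return lo

# Hypothetical oracle built from the Fig 8 route lengths
route_lengths = [97, 108, 141, 108, 141, 97]
steady = lambda d: min(route_lengths) <= d
print(min_threshold(steady, 0, 150))  # 97
```

For finer precision, run the same search over multiples of the increment (e.g. hundredths), which is where the $O(\log pd')$ count comes from.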
Pretty cool, right? The original article is a lot more detailed and talks about a few more problem-solving use-cases if you're interested.

[1] By Sdo - self-made using xfig, CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=715485

## Saturday, December 19, 2015

### Clustering: ELKI over Weka

I think I have found my favorite tool when it comes to clustering - ELKI. Lots of algorithms, minimal interface, and well implemented. I have a few cribs about the UI/UX, but overall it's one of the best options out there. It also helps that it has a great implementation of one of my favorite algorithms - OPTICS.

If you are in the habit of using Weka, for small tasks or maybe because you started off with it, be wary of using it for clustering. Recently, I had to use Weka for cluster analysis for legacy reasons, and I am far from being a happy customer. This short post provides some instances where Weka and ELKI gave different results on a problem.

### DBSCAN

The following images show the clusters discovered in the same dataset using eps = 0.5 and minpts = 5 (these are parameters to the DBSCAN algorithm). The original dataset has 31 clusters. Different clusters discovered are represented by different colours.

 Left: Weka - 1 cluster, right: ELKI - 24 clusters
As you can see, for the same settings of parameters Weka thinks of the whole dataset as one cluster, while ELKI discovers 24 clusters. In fact this happens to be true for a range of these parameters. The following heatmaps show the number of clusters as eps and minpts are varied.

 Number of clusters. Left: Weka, right: ELKI
Weka sees a maximum of 2 clusters - as denoted by the white strip to the extreme left of the plot. ELKI sees quite some variation - at the bottom left there are around 3000 clusters, while for the rest of the plot it is close to 1 (I know it looks like 0, but the shades for 0 and 1 aren't distinguishable visually and the minimum number of clusters can be 1).

If you think about it, what ELKI reports makes sense - at some setting of the parameters you would expect each point in the dataset to be a cluster. Since there are 3100 points in this dataset, ELKI sees as many clusters in the bottom left corner. Hence, Weka seems to report incorrect results.
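This degenerate behaviour is easy to verify from first principles: with minpts = 1 every point is a core point, and DBSCAN reduces to finding connected components of the "within eps" graph. A toy from-scratch version (not Weka or ELKI code, just an illustration):

```python
from math import dist  # Euclidean distance, Python 3.8+

def dbscan_minpts1(points, eps):
    """DBSCAN with minpts = 1: clusters are the connected components of
    the graph linking points within eps of each other."""
    labels, cluster = [None] * len(points), 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        stack, labels[i] = [i], cluster
        while stack:  # flood-fill the component
            u = stack.pop()
            for v in range(len(points)):
                if labels[v] is None and dist(points[u], points[v]) <= eps:
                    labels[v] = cluster
                    stack.append(v)
        cluster += 1
    return labels

points = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0)]
print(dbscan_minpts1(points, eps=0.5))   # [0, 0, 1, 1, 2]
print(dbscan_minpts1(points, eps=0.01))  # every point its own cluster
print(dbscan_minpts1(points, eps=100))   # one big cluster
```

A tiny eps yields one cluster per point - the behaviour ELKI shows at the bottom-left of its heatmap - while a huge eps merges everything into a single cluster.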

I tried comparing Weka and ELKI on another dataset, this time comparing cluster purity for a range of parameters. Again, in the case of Weka we see that the purity doesn't vary much, whereas for ELKI it varies over a wide range (as you'd expect).

 Cluster purity. Left: Weka, right: ELKI
The responses to this question on stackoverflow suggest that this happens because Weka imposes automatic normalization of distances. Even if this is true, the implementation is problematic - normalization should not be a silent default, and the documentation does not warn you about it.

### k-means

I was a little surprised that the k-means implementation in Weka is buggy. This is one of the simpler algorithms out there!

On the same dataset I've used above, running k-means with k=32 doesn't terminate. It does not terminate even when you set the maximum number of iterations to 500. I had the clustering run for around 40 min before I decided to kill it.

ELKI gave me this in under a second:

 k-means, with k = 32
Unlike the problems with DBSCAN, this problem seems to be dataset-specific. For a bunch of datasets I tried, k-means worked as expected.

### Performance

I guess a discussion on performance is moot if your results are incorrect. But here are some numbers (from the ELKI website) further strengthening the case for ELKI. I have highlighted DBSCAN and OPTICS since those were the algorithms I was interested in. The other numbers are equally impressive.

 ELKI Benchmarks

With this we come to end of this short post. If you haven't used ELKI yet, I hope I have convinced you to give it a shot!

## Monday, August 17, 2015

### Mori and Pop Girl - how do you fight someone who knows the future?

I recently came across the book The Watchmaker of Filigree Street. While it is an enjoyable read overall, the bit that interested me is a particular skill of one of the lead characters, Mori. Mori can see the future. Mori's faculties are so sensitive that people do not have to do things to make a particular future feasible; they only need to intend to do certain things. Mori would know what future those intentions would lead to if they were executed. Also, since there are multiple possibilities for how the future can roll out, Mori is good at guessing only when one of the possibilities seems dominant. For example, Mori can guess that a dice is about to fall, but he cannot guess what face it would show because the outcome is truly random.

Now switch to the movie Push (2009), starring Chris Evans when he still wasn't Captain America. It is about mutants with different kinds of superpowers. We are introduced to a category called Watchers. Watchers can see the future. Here too, we have the notion of many possible futures, and what a Watcher sees changes based on what happens in the present. We meet the "Pop Girl", a powerful Watcher, who, like Mori, only needs people to decide on doing something before she can see the relevant future.

(You are a computer geek if you thought of the word "stack" just because I have said 'push' and 'pop' in the same paragraph. Like I did. ;) )

The reason I mention these characters together is that both stories have people with no clairvoyant powers whatsoever trying to outsmart them. I think this is an interesting setup. Think about it - your enemy is prescient; when you even think of a strategy to fight them, they already know. How do you fight someone like that?

Interestingly, both stories deal with this differently:

1. Mori vs Grace: Grace relies on randomness. When she travels she allows coin flips to do a lot of the decision-making. In a particular part of the story she needs a package to be carried by a device (trying hard to avoid spoilers here) - and she succeeds because the device has the capability to occasionally move randomly. Mori knows something is up, even that there is a package on the move, but he is at a loss to guess precisely. Remember how he can see dice falling, but can't guess outcomes? This is exactly what happens now.

2. Pop Girl vs Nick: Nick relies on not knowing the plan, or having his team know of the plan, till the very last moment. Knowing the plan leads to intentions, and intentions lead to the Pop Girl seeing the future. So if you have a plan that is already set in motion, but you have nothing to do with it since you do not know about it yet, Pop Girl does not see it as a future you are involved in.

How does Nick pull this off? He thinks up a plan - on the other side Pop Girl starts seeing a future - writes letters to the members in his team, including himself, detailing out the part of the plan they are to execute, with instructions to everyone to read their letters at predetermined times. Pop Girl does not see the whole plan yet. Nick then has his memory wiped out starting from the time when he thought of the plan. The memory-erasure is done by another kind of mutant - who is also instructed to hand over the self-addressed letter to Nick just after the erasure session.

Once Nick's memory is gone, so are his intentions around the now-forgotten plan - and Pop Girl stops seeing a definitive future for Nick or his team. At a later point in time, when everyone has seen his or her part of the plan, it is possibly too late for the Pop Girl to do anything (she can tune into a future immediately, but she cannot translocate immediately). Note that just reading a few instructions in a letter is probably not as potent as realizing how they fit into the overall plan - blind instructions do show you some kind of a future, but your understanding of how the whole thing works, which leads to very specific intentions, is better fodder for Pop Girl. This adds another layer of vagueness that a Watcher must contend with.
Both stories leave many questions unanswered and, having to do with playing around with time, (possibly) have loopholes. For example:
• Mori's visions of a future where he has picked up a new skill imbue him with that skill now. So if in the future he is to speak flawless English, he starts speaking English now. How very Grandfather Paradox-y!
• Why doesn't Pop Girl fall back to using her visions from the time before Nick had himself erased? Also, erasing would have been an intention - so she would have known that her visions before the erasing were good to go on with.

I am sure one can come up with possible explanations. But leaving details aside, and thinking of these ideas as only high-level suggestions instead of fleshed-out strategies, I liked how two different approaches - randomness and "just-in-time" plans were explored in the stories.

## Saturday, August 16, 2014

### kNN classifier - in one line of Python. Fits into a Tweet.

Recently, I conducted a session on Python where I walked through implementing a kNN classifier. Close to the end of the session, we got to how succinct Python can be, and I proceeded to reduce our code to the absolute minimum number of lines possible. The impromptu code-golfing exercise led me to an interesting realization - you can write a kNN classifier in one line of Python. A line short enough (126 characters) to fit into a tweet!

Yep, I checked:

 Fig 1. A tweet-sized classifier.

In retrospect, this does not seem surprising given certain features of the language. It's just one of those things you don't normally think about; you wouldn't want to implement an algorithm in one, potentially confusing, line. Frankly, you shouldn't - in the interest of the sanity of the poor souls who might need to read your code someday.

But notwithstanding coding etiquette, we are going to take a look at this since a one-liner k-NN is, well, pretty awesome!
In the next couple of sections I provide some context to the problem and the solution; they are not the focus of the post. If you already know what a kNN classifier is, and understand Python maps/lambdas, skip ahead to the section "Initial setup".


### Initial setup

I will assume that these variables are defined for us:
1. train - is a list with the training data. Every element in this list is a tuple with the first entry as the feature vector and the second entry as the label. So, if I had the following 4-dimensional data-points in my training data:
1. [2, 4, 3, 5], label = 0
2. [5, 7, 4, 7], label = 1
3. [6, 8, 2, 3], label = 1
4. [5, 7, 2, 4], label = 0
then train = [([2, 4, 3, 5], 0), ([5, 7, 4, 7], 1), ([6, 8, 2,3], 1), ([5, 7, 2, 4], 0)].
2. sim(x, y) - the similarity function. Takes two data-points as arguments and returns a numeric value for similarity. For our example, we use the dot-product as a measure of similarity. The dot-product of two vectors [a, b, c] and [p, q, r] is a*p + b*q + c*r.
3. k - the number of nearest neighbours we need the labels from. For our example, we will assume k = 3.
4. test - the test data point, which we are interested in classifying. This is a list of the feature values. We use [1, 2, 3, 4] as an example test point.
5. I am also assuming that the Python collections module is imported into the current namespace i.e. this statement "from collections import *" has been executed (I know this sounds like cheating, but hold on judging me till you have read the next section ...)
Note that 1-4 are external to the classifier and have to be provided anyway - I am just fixing the variable names and structures.
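For concreteness, here is the setup above as runnable code (variable names exactly as fixed in points 1-5), so the one-liner in the next section can be pasted right below it:

```python
from collections import *

# 1. Training data: (feature vector, label) tuples
train = [([2, 4, 3, 5], 0), ([5, 7, 4, 7], 1),
         ([6, 8, 2, 3], 1), ([5, 7, 2, 4], 0)]

# 2. Similarity: the dot product of the two vectors
def sim(x, y):
    return sum(a * b for a, b in zip(x, y))

k = 3                # 3. number of nearest neighbours
test = [1, 2, 3, 4]  # 4. the point to classify

print([sim(x[0], test) for x in train])  # [39, 59, 40, 41]
```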

### The One (line ...)

And finally, [drum roll] the magic incantation ...
max(Counter(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:])).items(),  key = lambda x:x[1])[0]

If I were to decompose the above line into logical steps, this is how I would go about it (alongside each step is the portion of the one-liner doing the work):
1. Get all similarities, keep the labels around: map(lambda x: (sim(x[0], test), x[1]), train)
2. Find the top k points: sorted(...)[-k:]
3. Get the labels for these top k points: map(lambda x: x[1], ...)
4. Get label frequencies: Counter(...).items()
5. Pick the label with the highest frequency: max(..., key = lambda x:x[1])[0]

### Explanation

Step 1: Get all similarities, keep the labels around

max(Counter(map(lambda x: x[1], sorted(**map(lambda x: (sim(x[0], test), x[1]), train)**)[-k:])).items(),  key = lambda x:x[1])[0]

In the first step, we call map with the training data. The lambda expression returns a tuple with the first entry sim(x[0], test) and the second entry x[1]. Consider the fact that train is a list of tuples itself, and thus the argument to the lambda is a tuple. For example, the first time the lambda function is called, it is with the argument ([2, 4, 3, 5], 0). Thus, for the lambda, x[0] is [2, 4, 3, 5] and x[1] is 0. Hence, the first entry of the returned tuple is the similarity of the training data point to the test data point, while the second entry is the label of the training data point.

Given the sample train and sim, this first step produces the following list:
[(39, 0), (59, 1), (40, 1),  (41, 0)]
As a side-note, it is quite interesting that the dot-product can itself be written as a lambda using the zip() function:
lambda list_1, list_2: sum(map(lambda x: x[0]*x[1],  zip(list_1, list_2)))
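To see this lambda in action, here is a quick check. The training point [2, 4, 3, 5] is the one from the walkthrough above; the test point [1, 2, 3, 4] is hypothetical, chosen so the similarity comes out to 39 as in the sample output:

```python
# Dot-product similarity as a one-line lambda using zip().
sim = lambda list_1, list_2: sum(map(lambda x: x[0] * x[1], zip(list_1, list_2)))

# [2, 4, 3, 5] is the first training point from the walkthrough;
# [1, 2, 3, 4] is a made-up test point. 2*1 + 4*2 + 3*3 + 5*4 = 39.
print(sim([2, 4, 3, 5], [1, 2, 3, 4]))  # 39
```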

Step 2: Find the top k points

sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:]

We now sort the output of the last step by similarity, using the sorted() function. Since we are sorting tuples and have not specified a key, the sort is on the first entry of each tuple - the similarity to the test point. By default, the sort is in increasing order, so the output so far looks like this:
[(39, 0), (40, 1), (41, 0), (59, 1)]
You can sort in decreasing order too - by passing in the argument reverse=True, but that's more keystrokes :)

Now, we extract the tuples with the k highest similarities using [-k:]. The "-" in Python's list indexing notation means the counting is done from the tail of the list - here, we get the portion of the list starting with the kth element from the tail, up to the last element. This gives us the tuples with the k highest similarities. Thus, this is the output of this step:
[(40, 1), (41, 0), (59, 1)]
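As a quick sanity check of this step, starting from the Step 1 output listed above, with k = 3:

```python
sims = [(39, 0), (59, 1), (40, 1), (41, 0)]  # output of Step 1
k = 3

# Tuples sort on their first entry by default, i.e. by similarity,
# and [-k:] keeps the k largest.
top_k = sorted(sims)[-k:]
print(top_k)  # [(40, 1), (41, 0), (59, 1)]
```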

Step 3: Get the labels for these top k points

map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:])

Since we have the tuples corresponding to the top-k similar points, we really don't need to keep the similarities around any more. Another map() call does the trick - look at the lambda expression here: it only returns the value at index 1 of the tuple. Therefore, as output from this step, we have just the labels:
[1, 0, 1]

Step 4: Get label frequencies

Counter(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:])).items()

We now need to find the label with the highest frequency of occurrence in the k closest points. We use the Counter data-structure from the collections module. The Counter is a special kind of dict which, when initialized with a list, uses the unique elements in the list as keys and the frequency of their occurrence as the corresponding value.

Thus, Counter([1,1,2,3,2,2,1]) gives me the dict {1: 3, 2: 3, 3: 1}. This is exactly what we need. Initializing a Counter with the output of the last step - [1, 0, 1] - gives us this dict:
{1: 2, 0: 1}
Calling items() on a dict returns a list of tuples, each of whose 1st entry is a key, and the 2nd entry is the corresponding value. Hence, the output of this step is:
[(0, 1), (1, 2)]
Time to address the elephant in the room, I guess :) You could argue that using something from a module does not make for a strict one-liner. And in a sense that is true. I could have avoided using Counter and instead used tricks like these. Technically, we would have still ended up with a one-liner, but Counter expresses our objective in a clear manner. Thus, not using Counter does not mean that you cannot have a one-liner, only that the result won't be pretty.

Note that since I am calling Counter directly, my import should be "from collections import *". If you want a cleaner namespace, you might want to do "import collections" and refer to Counter as "collections.Counter".
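Here is this step on its own, with a proper import. One caveat: the order in which items() returns the tuples can differ across Python versions, so I sort the pairs for a stable display - the max() in the next step doesn't care about the order anyway:

```python
from collections import Counter

labels = [1, 0, 1]        # output of Step 3
counts = Counter(labels)  # frequencies: {1: 2, 0: 1}

# items() ordering varies across Python versions; sorted() pins it down.
print(sorted(counts.items()))  # [(0, 1), (1, 2)]
```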

Really close now ...

Step 5: Get the most common label

max(Counter(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:])).items(),  key = lambda x:x[1])[0]

All that is left now is to extract the label with the highest frequency. That's easy with the max() function in Python. Note that our tuples  (output from the last step) have the label as the first entry and the frequency of the label as the second entry. Hence, in the max() call, we provide the key argument which finds the maximum valued element in the list based on the second entries in the tuples. Yes, with a lambda expression. They really do get around. Thus, max() returns me this:
(1, 2)
Of this, I extract the label by specifying the index - [0]. This gives me my final answer: 1
And this ... is the label we are looking for [melodramatic bow] :)
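Putting it all together, the snippet below runs the full one-liner end to end. Only the first training point, ([2, 4, 3, 5], 0), appears in the walkthrough above; the test point and the remaining training points are hypothetical, chosen so that the intermediate outputs match the ones shown at each step:

```python
from collections import Counter

# Hypothetical sample data, consistent with the walkthrough; only the
# first training point is from the post itself.
train = [([2, 4, 3, 5], 0), ([3, 5, 6, 7], 1), ([2, 5, 4, 4], 1), ([3, 3, 4, 5], 0)]
test = [1, 2, 3, 4]
k = 3
sim = lambda list_1, list_2: sum(map(lambda x: x[0] * x[1], zip(list_1, list_2)))

# The one-liner: similarities -> top k -> labels -> frequencies -> winner.
label = max(Counter(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:])).items(), key=lambda x: x[1])[0]
print(label)  # 1
```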

### Regression and a related hack

While I was writing this post, I realized that the one-liner for kNN based regression is actually shorter - 92 characters. Also purer - it does not rely on modules that need importing. In a regression problem we are interested in predicting continuous values - as against discrete labels in classification problems. For example, you may want to determine the price of a piece of real estate based on features like proximity to a hospital, distance from an airport, proximity to industries, etc. The training data has prices per data point instead of labels.

In kNN based regression, you would look at the values of the k neighbours and return their average.

Here goes:
sum(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:]))/(1.0*k)
If you understand the classifier, understanding this should be easy. The sorting and top-k extraction is the same as in the classifier; the surrounding sum(...)/(1.0*k), which averages the values of the k nearest neighbours, is specific to regression. Note that I need to multiply the denominator by "1.0" to make sure Python does not perform an integer round-off in the division.
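A quick check of the regression one-liner, using a hypothetical training set and test point (only the first training point comes from the walkthrough earlier), with the 0/1 labels standing in as the values to be averaged:

```python
train = [([2, 4, 3, 5], 0), ([3, 5, 6, 7], 1), ([2, 5, 4, 4], 1), ([3, 3, 4, 5], 0)]
test = [1, 2, 3, 4]  # hypothetical test point
k = 3
sim = lambda list_1, list_2: sum(map(lambda x: x[0] * x[1], zip(list_1, list_2)))

# Average of the values of the k most similar points: (1 + 0 + 1) / 3.
pred = sum(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:])) / (1.0 * k)
print(pred)  # ~0.667
```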

Now, here is an interesting relationship between kNN regression and classification - if you have only two labels, 0 and 1, then if you average the labels and round-off, it is equivalent to majority voting. For example, consider some sample outputs from Step 3 above:

| Top k labels | Avg of labels | Avg rounded | Majority vote |
|---|---|---|---|
| 1, 0, 1 | 0.67 | 1 | 1 |
| 0, 0, 1, 0 | 0.25 | 0 | 0 |
| 1, 1, 0, 0, 1 | 0.6 | 1 | 1 |

This does not work with more than two labels.

Thus, as long as the labels in the classification problem are the numbers 0 and 1, we can use the following expression:
int(round(sum(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:]))/(1.0*k)))

The inner expression is the one for regression we saw earlier; the wrapping int(round(...)) is what is needed for classification. round() is a built-in in Python which does not need us to import anything (yay!); however, since it returns a float, we explicitly need to cast to int (bummer...).
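Running this on the same kind of hypothetical 0/1-labelled sample data (only the first training point comes from the walkthrough earlier) shows the average-and-round hack agreeing with majority voting:

```python
train = [([2, 4, 3, 5], 0), ([3, 5, 6, 7], 1), ([2, 5, 4, 4], 1), ([3, 3, 4, 5], 0)]
test = [1, 2, 3, 4]  # hypothetical test point
k = 3
sim = lambda list_1, list_2: sum(map(lambda x: x[0] * x[1], zip(list_1, list_2)))

# Labels of the 3 nearest points are [1, 0, 1]: average ~0.67, which
# rounds to 1 - the same answer majority voting gives.
label = int(round(sum(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:])) / (1.0 * k)))
print(label)  # 1
```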

This version of the one-line classifier is 104 characters long.

### Conclusions

A final tally of character counts in the various one-liners. Additionally, mentioned in parentheses, are the sizes we would see if we weren't worried about readability i.e. one letter variable names, no spaces:
• kNN classification - 126 characters (107)
• kNN regression - 92 characters (77)
• kNN classification hack - 104 characters (89)

Pretty impressive numbers - my respect for Python just went up a notch!

I am pretty sure these are not the only implementations of the one-liners possible. For starters, you might favour list comprehensions over map(), do away with the Counter to create a 'purer' one-liner, etc. If you have something interesting, let me know in a comment!

Although great as an intellectual challenge, I don't think the one-liner is suitable for serious use. Good classification libraries may have optimized implementations of kNN - for example, using k-d trees or ball trees - which are preferable. For example, this library.

## Sunday, June 15, 2014

### Machine Learning - what it is, what is not

In 2007, when I had begun dipping my toes into Machine Learning (ML), the area was little known. When people asked me what I worked on, I usually had to begin by explaining what ML is. Occasionally, I would say Artificial Intelligence (AI) to save on the explanation.  In fact, I wasn't sure then whether I wanted to do my research in ML or move to something more mainstream, like Databases or Algorithms (I didn't move). So I find the semi-celebrity status ML has gained over the last 6 years interesting, to say the least.

Unfortunately, I think the new-found fame has worked both ways; while it is nice to see ML being recognized, I also notice what seems to be the makings of a bubble: there is a sudden demand for ML skills, mismatched expectations of what ML can do for an organization (mismatched wrt the effort that they are willing to put in), and a notion of how delightfully easy building good models is (for all the good work MOOCs are doing, this is a sad side-effect).

In this rather long post, I want to set right a few of those misleading first-impressions. This post is for you if you haven't got your hands dirty with using ML on real-world problems yet - for example, if you are new to the area, or maybe you are planning to start a data science team with little or no prior experience in the area.

1. Do the math. Literally. Let's say you want to build the fastest car in the world. Would you be able to do so if you are scared of studying the machinery under the hood? Obviously not. Machine Learning involves a fair amount of mathematics. If you want to be good, train yourself to understand the math. And, where (much) math is not involved, know the underlying algorithm.

I have seen people think along these lines: Let’s pick one of the libraries/tools available and do some ML. I don’t really care what goes on inside the algorithms - I can do pretty well without the math.

In the right context, this way of working is fine. I have done this a few times myself. The right context here is that you don’t have the time to invest in learning, or time for implementing some custom algorithm yourself, but you do want some prediction capability.

Where this starts becoming a problem is when you start seeing this stopgap solution as the solution - you develop the mismatched expectation that it is going to lead you to the best possible solution for your problem, or that it is always going to work. Acknowledge that, with this approach, you are only going to get something that works – where “works” could mean anything from “just about works” to “works perfectly”. And you may not have much control over where your solution falls in this spectrum.

The only way to be really good is to know the math so that you can pick the right algorithms and squeeze the last bit of performance out of them.

 Confidential: path to enlightenment

2. Humans can beat your algorithms. This should not really be surprising. Take the task of searching for an image similar to one given to you. Today, you can almost always outperform a computer. Try out Google Search by Image to see what I mean - the best results are the ones that are identical or very similar to your input, or appear on the same or similar pages as your input. Beyond that, the results go haywire.

 Google, Logan Lerman and Jennifer Connelly are different people.
Look at the results of a search for Jennifer Connelly. The image I used in the left screenshot seems to be a popular one - it appears on multiple websites that have articles about her or her wallpapers. The visually similar images returned as part of the search results are quite accurate. In fact, 5 of the 9 results seem to be from the same interview/session. But if I pick an image that is not so popular, as in the screenshot on the right, Google throws in a few dudes in its results.

This is not just one cherry-picked pathological example; here is what Richard Szeliski, a noted researcher in the field of computer vision, has to say:

It may be many years before computers can name and outline all of the objects in a photograph with the same skill as a two-year-old child.
You don't use machine learning because it can do better than humans (though in some cases this may be true). You use ML when the scale that a computer can handle is more important than the accuracy a human can provide. You are good at image search, but can you sift through a billion images in a second?

Applying ML to a real-world problem implies you are willing to sacrifice some accuracy to be able to handle scale. Scale might be important to you because it makes business sense. For example, if you wanted to set up image search as a service, would you rather be handling 1000 customers a day with 100% accuracy (a hired staff working behind the scenes) or 1 million customers with 70% accuracy? This is really a trick question - there is no clear answer; it depends on your objectives, business model, etc. My point is that there is a case to be made for scale.

Sometimes your ability to handle scale can also indirectly improve your accuracy. An example:
• You create a search engine that is not as accurate as humans but can crawl the whole internet to give you good enough results to begin with.
• This makes it a very popular tool – because at least your tool can look at the whole of the vast WWW.
• As more and more people use it you keep noting which links people click among the ones you have provided as the search result.
• You start ranking higher the ones clicked most. And lo! – now you have something that progressively starts approximating human search behavior.

Moral of the story - know when to invest in ML. More importantly, set your expectations right about the returns on the investment.

 And this is not a glass of Pepsi

3. Patterns detected by many algorithms cannot be conveniently interpreted. Take Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), and Random Forests (RFs). These are well-established algorithms that are known to do well on a variety of problems but, in general, the model they learn cannot be represented as easy-to-understand rules. Say you are trying to predict estate prices based on various factors like the average income in the locality, proximity of the estate to important landmarks in a city, the number of hospitals and schools in the locality, etc. If you use an SVM on the problem, in most cases, you won't be able to infer something like this from the model:
The price of the estate is directly proportional to the number of hospitals within a 3km radius of the estate.

This often leads to confusion on these lines:
How can we trust this black-box model when we really don’t know what it is doing?

Some thoughts:
• Decouple the following ideas: (1) "can be easily interpreted by humans", (2) "reliable". Algorithms like ANNs have a strong theoretical base - if you want to understand what the model spits out, the way to do it is not to look for rules, but to go back and look at this theory. Yes, the paradigm is different - instead of picking out rules from a model and looking at them saying "Yes, this one looks reasonable", the idea is to have a proof that says that whatever comes out of the learning phase, under certain circumstances, has such-and-such error bounds and would perform in such-and-such manner.

This is, in a way, reliable in a much more objective manner - the proof puts down what's to be expected in black and white.

One of the ideas used by the winning team of the famous Netflix prize was Restricted Boltzmann Machines. Not exactly very interpretable, and yet it raked in a hefty \$1M prize. Reconsider - do you really need interpretability? :)

• Interpretability offers you a comfort zone. Convince yourself to step out of it because we have a simple test to measure accuracy - and that's what really matters. Keep a test set of data that is not used by the algorithm to train. Find out how accurate the algorithm is on this data. This is as reliable and objective as it gets.

I guess what I am driving at is: whenever you find yourself uncomfortable with a model that cannot be interpreted, ask yourself whether you really need to interpret it. And the answer cannot be "for my own comfort". Isn't high accuracy sufficient? Isn't a good model better than an interpretable but not-so-good model? In some cases, you might really need models which can be explained - perhaps to your clients, as part of your findings - this is probably the only time you should start seriously thinking about interpretability.

Worry about interpretability when you really need it. Don't expect a good model to be also conveniently interpretable.

4. Use the right metrics.

 Source
Do you remember the compass that Captain Jack Sparrow uses in Pirates of the Caribbean? The needle points to whatever the person using the compass wants most. The compass is a fantastic tool, but useless, if you don't know what you want.

ML is like that compass: it serves you a lot better if you can express exactly what you want to measure.

Sorry for butting in with a cheesy analogy; hopefully it helps me make my point.

Consider this problem: you are running a car dealership. You make a convincing sales pitch to everyone who walks into your store. Additionally, for certain customers, who are among the ones who have agreed to fill in some kind of form (where you ask them about their income, current mode of transport, name of place of work, age, etc.), you throw in something extra: you let them borrow a trial car for a day, with fuel expenses paid. You believe that this set of customers will make a purchase with some cajoling.

How do you pick this set of people? Let's say you train an ML model based on the information entered in the form, to make the binary prediction: likely to purchase (we call these our positive labels) or not likely to purchase (negative labels). You make the trial-car offer to people your model assigns positive labels to. Clearly, a lot rides on the performance of this algorithm.

Let's assume that, typically, 5% of the people visiting your store buy a car. Let's look at some metrics you may consider to assess the value of your model:

1. How about good old accuracy? This measures the percentage of predictions you get correct overall. If I have a model that only predicts "not likely to purchase" I would be wrong only 5% of the time. So, my model is 95% accurate! That should normally be great, but it makes no sense for my business since my model doesn't identify anyone as a potential purchaser anymore.
2. We are interested in people who are likely to buy - so let's use a model that has a high accuracy in predicting a positive label for people with a positive label in our data. Now, you could have a model that always says "likely to purchase", which will give you 100% accuracy within the group of positively labelled people. This is no good either - think about all those free trials you would be giving away to the 95% who aren't going to buy!
3. Obviously, we need a way to balance out the problems we just saw. We want the model to catch as many actual purchasers as possible, so that we don't miss the people we can sell to. But this should not happen at the cost of identifying non-purchasers as purchasers - because then we would just be giving away way too many free trials.
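The two degenerate models above are easy to put numbers on. Assuming, for illustration, 1000 visitors of whom 5% (50) actually buy:

```python
visitors = 1000
buyers = 50  # 5% of visitors actually purchase

# Model (a): always predict "not likely to purchase".
# Right on the 950 non-buyers, wrong on the 50 buyers - 95% accurate,
# yet it identifies no purchasers at all.
accuracy_a = (visitors - buyers) / visitors
print(accuracy_a)  # 0.95

# Model (b): always predict "likely to purchase".
# 100% accuracy within the group of purchasers, but every one of the
# non-buyers gets a free trial.
free_trials_wasted = visitors - buyers
print(free_trials_wasted)  # 950
```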

A good metric to use here is the F1-score, which is defined in terms of precision and recall. In our context: $$Precision = \frac{people \; identified \; as \; purchasers \; who \; are \; actually \; purchasers}{people \; identified \; as \; purchasers}$$
$$Recall = \frac{people \; identified \; as \; purchasers \; who \; are \; actually \; purchasers}{actual \; purchasers}$$ $$F1= \frac{2 \cdot Precision \cdot Recall}{Precision+Recall}\;$$
Note how points 1 and 2 (from the previous list) are taken care of by precision and recall. In particular, look at how the F1 score penalizes both low precision and low recall. For clarity, plots of F1 (z-axis) as a function of precision and recall (x and y axes) are shown below. The four plots correspond to the same relationship, but show the plot from different perspectives. Swatches of the same color represent approximately the same F1 score. Observe that the plotted surface seems "taped" to the x and y axes, signifying that if either precision or recall is zero, the F1-score is zero.
 F1 score as a function of precision and recall
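The "taped to the axes" behaviour is easy to verify directly from the definition (a minimal sketch, with illustrative values):

```python
def f1(precision, recall):
    # F1 is zero whenever either component is zero - the surface is
    # "taped" to the x and y axes.
    if precision == 0 or recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.0, 0.9))  # 0.0 - perfect recall cannot rescue zero precision
print(f1(0.9, 0.9))  # ~0.9 - F1 matches precision and recall when they agree
```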

Picking the right metric is important to make sure that your ML model is solving a problem that translates into value for you. There is a host of standard metrics you can pick from - e.g. F-score, lift, ROC area, average precision, precision/recall break-even point, squared error, cross-entropy. Of course, you might want to consider using a custom metric if that suits you well.

5. Always have a baseline. Some problems are harder than others. Whether 90% accuracy is good depends on the problem.

My favorite example is Sentence Boundary Disambiguation (SBD) - the task is to identify elements/tokens in a given text that indicate sentence boundaries i.e. points where a sentence ends. Here, a token of text is anything that is not whitespace. For example, the text (a quote by Leonardo da Vinci):

In rivers, the water that you touch is the last of what has passed and the first of that which comes; so with present time.

has 28 tokens; 25 words and 3 punctuation symbols. We need to identify which tokens among these 28 represent sentence boundaries. There is only one in this sentence - the "." at the end. This is a binary classification problem - for each token, we only need to specify whether it is a sentence boundary or not. The catch here is that boundary indicators have other uses too - for example, "." is also used for abbreviations - and also, there is more than one sentence boundary indicator - "?", "!", etc.

Is a model with an F1-score of 90% a good model for this task? Let's start by looking at some very simple "models" and how they perform. In the table below, I have listed 4 such very-non-ML-ish models for the SBD task, and their performance on text from the Brown corpus. Performance is characterized by precision, recall and F1 scores (a positive label for a token is when we identify it as indicating a sentence boundary).

| Description of Model | Precision | Recall | F1-score |
|---|---|---|---|
| Sentence boundary if element is "." | 100.0 | 86.06 | 92.51 |
| If element is one of "?", ".", "!" | 94.35 | 91.54 | 92.92 |
| If element is "." and it hasn't occurred as one of last 3 elements | 100.0 | 84.97 | 91.87 |
| If element is one of "?", ".", "!" and none of the last 3 elements are "?", "." or "!" | 94.26 | 84.58 | 89.16 |

As you can well see, it is hard not to score around 90% on this task!  And I wrote the code that generated the above data while writing this post - all 4 models in about 50 lines of Python.
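A sketch of what such a baseline plus its scoring might look like. The toy sentence and its tokenization here are made up for illustration - and, unlike the corpus text behind the table, this toy deliberately splits the abbreviation period off "Mr" to show where a naive baseline loses precision:

```python
def baseline_boundaries(tokens):
    # Simplest baseline: a token is a sentence boundary iff it is "."
    return [tok == "." for tok in tokens]

def precision_recall_f1(predicted, actual):
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: the "." after "Mr" is an abbreviation, not a boundary.
tokens = ["Mr", ".", "Smith", "arrived", "."]
actual = [False, False, False, False, True]

p, r, f = precision_recall_f1(baseline_boundaries(tokens), actual)
print(p, r)  # 0.5 1.0 - it flags the abbreviation too, but misses nothing
```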

A baseline tells you how effective an absolutely commonsensical model is. This helps you better guide your efforts: do you want to spend time implementing a model that the literature says would give you 95% accuracy, when there is something reasonably good you can do in a short time? Is your accuracy of 60% necessarily a bad score? Not if you have a reasonable baseline that gives you 30%. How far have you come from your baseline accuracy with your current model? If it's not much, you may want to revisit your hypotheses about the data so that you can pick a better model. For all real problems, having a reasonable baseline is a must.

6. Try it out. Models make various implicit or explicit assumptions about the underlying data - for example, about what the separation boundary between the classes looks like, the size of the data set, the distribution of noise, whether sequence matters, etc. Since you wouldn't have all the information about your data that could validate these assumptions (if you knew so much, you probably wouldn't be modeling the data), you might end up using a model that doesn't suit your data. And the only way to know (at least, in most cases) that there is a mismatch is by actually trying it out on the data. Building a good model can be an extremely empirical process. So, be careful not to become complacent just because a model looks good on paper. It might not work the same way for your data.

7. Bias-Variance tradeoff, dimensionality and cross-validation. These are topics that deserve longer discussions and I am not going to talk about them much here. I fear that a short discussion would either confuse or impart a false sense of security. If you work with real data, I cannot stress enough the importance, and relevance, of learning about the bias-variance tradeoff, the effects of dimensionality and the practice of cross-validation. What I am going to do, however, is give you a teaser of the pitfalls you are vulnerable to if you are unaware of these ideas.

Say your friend comes to you with data to be classified. You have, at your disposal, implementations of the Naive Bayes algorithm and C4.5 - a type of decision tree. You know nothing about how the data was generated, but your friend tells you that he is almost sure that the data has a complex separation boundary between the positive and negative classes. So, which algorithm do you use? Clearly C4.5, right? Since it is well known that it can come up with complex boundaries, while Naive Bayes can only get you linear boundaries.

Here's the surprising bit: You can do worse than the Naive Bayes classifier. Even though the data is not linearly separable.

What did you miss? You forgot to take into account the fact that Naive Bayes has a high bias and decision trees suffer from high variance. And as a result, the size of the dataset affects their relative performance.

Something with a high bias is very rigid in its idea of the separation boundary between the positive and negative classes. Naive Bayes comes up with a separation boundary that is a line - throwing more data at it will only help it find a better line, but it cannot find anything non-linear. If the optimal separation boundary looks like a circle, well, you are out of luck.

A high variance classifier can find complex boundaries but is sensitive to the precise distribution of points. This is a problem because data is often contaminated with noise i.e. instances in your data set that are not a consequence of the phenomenon you want to study/model. In the car sales problem discussed previously, the forms in which people filled in an incorrect age would result in noise. A high variance classifier would try to find a separation boundary which correctly labels the noisy points too. This is not a smart thing to do, since the noise that you saw in the training data (data used to build our model) may not show up in the test data (data on which we are making our predictions based on the model built on training data) - it's noise, after all - leading to inaccurate predictions.

The figure below shows these problems. In the left panel, which shows the training data, we can see how a high variance model overfits and finds a boundary that includes the noisy green points - which lie far inside the "red region". The high bias classifier is unaffected, but then, it is also a bad model, since the ideal boundary seems quadratic. In the right panel, we see that in the test data the noisy points are absent, and instead, there are red points present in the region, as we might expect. The high variance model, however, predicts everything in this region as green - since that is what it has learnt from the training data. The high bias model labels a lot of the green points as red, because it could never fit the data correctly in the first place. That it was not affected by noise is an artifact of its general inflexibility.

Also note that even the ideal boundary misclassifies one red point (bottom left-most) - the lesson is that no one is expected to get everything right; we are merely in the business of minimizing errors.

 Training Data
 Test Data

The graph below shows how the accuracies of our models change as the size of the dataset increases. With little training data, the high variance decision tree classifier fits noise - so much so that the Naive Bayes model, which is downright wrong for this data (remember, the data in the example is not linearly separable), outperforms it. A high amount of data effectively tames the decision tree to focus only on the dominant patterns in the data (i.e. not noise), and its performance becomes better. We also discussed that throwing more data at a high-bias classifier doesn't help much - accordingly, we see that the Naive Bayes line plateaus after a while, whereas more data definitely helps the decision tree become progressively better.

 Accuracy of Naive Bayes vs Decision tree, as a function of dataset size. Source.

This scenario should tell you that some of what you think of as well-informed decisions can turn out to be catastrophically wrong (duh! life!) - which classifier is better can depend on how much data you have (among other things). Now, go read up on these topics as much as you can :)
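If you want to see the noise-fitting effect for yourself, here is a tiny simulation; the data generator, the two toy models and all the numbers in it are made up for illustration. A "memorizing" 1-nearest-neighbour model stands in for the high-variance classifier, and a single rigid threshold stands in for the high-bias one:

```python
import random

random.seed(0)

# Toy 1-D data: the true label is 1 iff x > 0.5, with 10% label noise.
def make_data(n):
    data = []
    for _ in range(n):
        x = random.random()
        y = 1 if x > 0.5 else 0
        if random.random() < 0.1:
            y = 1 - y  # noise: not a consequence of the true phenomenon
        data.append((x, y))
    return data

train, test = make_data(30), make_data(1000)

# High-bias model: one rigid threshold, fixed at the mean of the training x's.
mean_x = sum(x for x, _ in train) / len(train)
def threshold_predict(x):
    return 1 if x > mean_x else 0

# High-variance model: 1-nearest neighbour - it memorizes every training
# point, noisy ones included.
def nn_predict(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

# The memorizer is perfect on the (noisy) training data - the very
# definition of fitting noise - but that perfection does not carry
# over to the test data.
print(accuracy(nn_predict, train))  # 1.0
print(accuracy(nn_predict, test), accuracy(threshold_predict, test))
```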

8. Literature survey is important. If you are new to ML, you may have those moments when it seems you've hit upon an original idea to solve a problem. I will be harsh here and tell you that it's most likely not as original as you think it is. ML has been around for a while - research on neural networks started in the late 50s! - and in the last couple of decades, the field has adopted a reeling pace of development. What that means for you is that most common "use-cases", and many not-so-common ones, have been thought about, and solutions published or patented.

One way I like to put this is: to be original, you are now competing with people across space and time. In your class in school, for an idea to seem brilliant, you had only to come up with something that didn't strike your 30-odd peers in a given amount of time. Here, however, your peer group is composed of every living human being who is working on ML (and would publish/patent) - they are distributed across the globe and some of them probably have the luxury of dedicating more time to solving ML problems than you do. And this is only the space part. Your competitors are also the people who contributed research to the discipline in the last 50-60 yrs. This is the time part. Needless to say, with such a massive demographic that spans space and time, it is tough for something to be original. Especially so if you are unaware of what the existing literature consists of.

The moral here is that when you come across a problem, do your literature survey first. Chances are your problem has already been studied. Even if that's not the case, a good literature survey can provide a logical/mathematical structure to tease your problem into, enabling quick and systematic progress. Don't reinvent the wheel. And remember that Google Scholar and CiteSeerX are your friends.

I feel our approach to ML problems (or, really, any problem) should be in the same spirit of humility with which E. W. Dijkstra, computer scientist extraordinaire and Turing award winner, suggested we approach problems in computer programming:

 Edsger W. Dijkstra

We shall do a much better programming job, provided that we approach the task with a full appreciation of its tremendous difficulty, provided that we stick to modest and elegant programming languages, provided that we respect the intrinsic limitations of the human mind and approach the task as Very Humble Programmers. ( Source: The Humble Programmer )

9. Working with data is a lot of dirty work. This point is especially directed at people who want to move into Data Mining or Machine Learning job profiles. In most (not all) of these cases there is the glamorous notion that they will be working on cool learning algorithms all day. Truth is, working with data can be hard and inelegant. And sometimes, plain boring. All those algorithms ML teaches you can be applied only if you have captured the required data in the first place, cleaned it up (data capture is hardly ever perfect and bereft of noise) and transformed it into a format that your ML tools can use. This is not pretty work and takes up most of your time. This is often the 'straw that breaks the camel's back', so to speak - people who were enamored with working on ML become disgruntled with the area when they are faced up close with these aspects of their job.

My advice to people who want to move into this line of work is: set your expectations right. Find someone whose day job is using ML, preferably in the kind of workplace you are interested in, and ask them what their daily routine consists of.

The following awesome CnH comic strip aptly sums up my sentiments - only replace "science" with "ML":

3. Solutions evolve. It is interesting that although we take a mature, iterative approach to developing software, a similar mindset is not as prevalent when modeling data. We tend to commit mistakes of both omission and commission: either we give the problem too little attention and expect the first few tries to completely nail it, or we get into a purist-academic research mode where we don't want to commit to a solution until we deem it perfect. I feel the s/w model of iterative development fits very nicely with how ML models should be built in the industry:
• You need to have a long-term commitment to your data modeling problem. Don't expect to dedicate 2 days to your problem and get a model that is optimal, robust, processor friendly, has low data complexity, etc.
• Don't get stuck trying to come up with the best model in the first go. Think versions. Keep revisiting your solution and making it better. This has the added advantage that your first (few) version(s) will help set up a proper workflow/pipeline/interface - in a product or a service your model is presumably one component, and the first (few) model(s) can help you smooth out the various kinks around how it plugs into the larger system.
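A toy illustration of thinking in versions, with invented data: as long as each version of the model obeys the same fit/predict interface, the surrounding pipeline never changes while the model inside it improves. Version 1 is a crude baseline; version 2 replaces it without touching anything else:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]   # roughly y = 2x, made up for the example

class MeanModel:
    """Version 1: a baseline that just predicts the average of y."""
    def fit(self, xs, ys):
        self.mean = sum(ys) / len(ys)
        return self
    def predict(self, x):
        return self.mean

class LineModel:
    """Version 2: ordinary least-squares fit of y = a*x + b."""
    def fit(self, xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        self.a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                  / sum((x - mx) ** 2 for x in xs))
        self.b = my - self.a * mx
        return self
    def predict(self, x):
        return self.a * x + self.b

def mse(model):
    """The evaluation harness - identical for every version."""
    return sum((model.predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

v1 = MeanModel().fit(xs, ys)
v2 = LineModel().fit(xs, ys)
print(mse(v1), mse(v2))   # version 2 is a clear improvement over version 1
```

The baseline is not wasted work: it pins down the interface, gives you a number to beat, and surfaces plumbing problems early, before you have invested in a sophisticated model.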

4. The man inside the machine is not going away soon. (Warning - this is a subjective viewpoint.) Although the last few years have seen immense progress in the field of ML (I am occasionally still surprised that voice-based search is a reality and runs off my mobile phone), we are not yet at a place where we can let algorithms learn from data in cruise-control mode. You need human involvement to pick the best algorithm from the plethora available out there if you want anything done in a reasonable amount of time. I have heard people lament that the whole learning process is not "automatic", i.e. you can't throw whatever data you have into a black box and out comes an optimal model. But then the problem we are solving is so complex!

Different phenomena produce data that have different underlying idiosyncrasies, and to model these effectively, many learning algorithms come halfway by making some assumptions of their own so they don't have to rediscover everything from scratch. How, then, is the right algorithm to be matched to the problem, accounting for your various constraints like time, accuracy, computational resources, etc? The missing element is you - you need to pick the right algorithm for your problem, much like picking the right key from a keychain for a lock.

In theory, yes, you could have an algorithm that loops through all possible separation boundaries and picks an optimal one - but there are an infinite (this is not an exaggeration) number of them out there, and you would need forever to solve a single problem. So that you don't have to suffer through this rigmarole, you rely on the experience and judgement of a (good) data scientist to pick a promising subset to try out from the universe of models. Unless there is a major theoretical revolution in the field, say on a Newtonian or Einsteinian scale, or parallel processing becomes far more parallel and affordable, enough that we can brute-force our way out, the need for human involvement, in the general case, is here to stay.
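The "keychain" idea can be sketched in a few lines. In this toy example (the data and the candidate models are invented), a human has pre-selected a small set of promising candidates, and a simple holdout evaluation picks the best key for this particular lock - rather than searching the infinite universe of models:

```python
import random

# Invented data: y is roughly 3x plus noise.
random.seed(0)
data = [(x, 3.0 * x + random.gauss(0, 0.5)) for x in range(20)]
train, hold = data[:15], data[15:]

# Each candidate is a fitting routine that returns a predict function.
def fit_constant(pts):
    """Predict the training mean everywhere."""
    c = sum(y for _, y in pts) / len(pts)
    return lambda x: c

def fit_nearest(pts):
    """Predict the y of the nearest training x (1-nearest-neighbour)."""
    return lambda x: min(pts, key=lambda p: abs(p[0] - x))[1]

def fit_line(pts):
    """Ordinary least-squares line."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    a = (sum((x - mx) * (y - my) for x, y in pts)
         / sum((x - mx) ** 2 for x, _ in pts))
    b = my - a * mx
    return lambda x: a * x + b

# The human contribution: a small, promising subset of the model universe.
candidates = [("constant", fit_constant), ("nearest", fit_nearest),
              ("line", fit_line)]

def holdout_error(fit):
    """Mean squared error on data the model was not trained on."""
    predict = fit(train)
    return sum((predict(x) - y) ** 2 for x, y in hold) / len(hold)

best_name, best_fit = min(candidates, key=lambda c: holdout_error(c[1]))
print(best_name)
```

The search itself is mechanical; the judgement lies entirely in choosing which three keys went on the keychain in the first place - and that is exactly the part that resists automation.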

Phew!