Saturday, August 16, 2014

kNN classifier - in one line of Python. Fits into a Tweet.

Recently, I conducted a session on Python where I walked through implementing a kNN classifier. Close to the end of the session, we got to how succinct Python can be, and I proceeded to reduce our code to the absolute minimum number of lines possible. The impromptu code-golfing exercise led me to an interesting realization - you can write a kNN classifier in one line of Python. A line short enough (126 characters) to fit into a tweet!

Yep, I checked:

Fig 1. A tweet-sized classifier.

In retrospect, this does not seem surprising given certain features of the language. It's just one of those things you don't normally think about; you wouldn't want to implement an algorithm in one, potentially confusing, line. Frankly, you shouldn't, in the interest of the sanity of the poor souls who might need to read your code someday.

But notwithstanding coding etiquette, we are going to take a look at this, since a one-line kNN is, well, pretty awesome!

In the next couple of sections I provide some context to the problem and the solution. I have made these sections collapsible (doesn't seem to work with the mobile version of the site), since they are not the focus of the post. If you already know what a kNN classifier is, and understand Python maps/lambdas, skip ahead to the section "Initial setup".

 

What is a kNN classifier?


 

A quick primer on maps and lambdas


 

Initial setup

I will assume that these variables are defined for us:
  1. train - a list with the training data. Every element in this list is a tuple whose first entry is the feature vector and whose second entry is the label. So, if I had the following 4-dimensional data points in my training data:
    1. [2, 4, 3, 5], label = 0
    2. [5, 7, 4, 7], label = 1
    3. [6, 8, 2, 3], label = 1
    4. [5, 7, 2, 4], label = 0
    then train = [([2, 4, 3, 5], 0), ([5, 7, 4, 7], 1), ([6, 8, 2, 3], 1), ([5, 7, 2, 4], 0)].
  2. sim(x, y) - the similarity function. Takes two data-points as arguments and returns a numeric value for similarity. For our example, we use the dot-product as a measure of similarity. The dot-product of two vectors [a, b, c] and [p, q, r] is a*p + b*q + c*r.
  3. k - the number of nearest neighbours we need the labels from. For our example, we will assume k = 3.
  4. test - the test data point, which we are interested in classifying. This is a list of the feature values. We use [1, 2, 3, 4] as an example test point.
  5. I am also assuming that the Python collections module is imported into the current namespace, i.e. the statement "from collections import *" has been executed (I know this sounds like cheating, but hold off on judging me till you have read the next section ...)
Note that 1-4 are external to the classifier and have to be provided anyway - I am just fixing the variable names and structures.
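For concreteness, here is a minimal sketch of this setup in plain Python (no external dependencies); the dot-product sim below is just the choice made for this example:

from collections import *

# training data: (feature vector, label) tuples
train = [([2, 4, 3, 5], 0), ([5, 7, 4, 7], 1), ([6, 8, 2, 3], 1), ([5, 7, 2, 4], 0)]
test = [1, 2, 3, 4]   # the point we want to classify
k = 3

# similarity = dot product of the two feature vectors
def sim(x, y):
    return sum(a * b for a, b in zip(x, y))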

The One (line ...)

And finally, [drum roll] the magic incantation ...
max(Counter(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:])).items(),  key = lambda x:x[1])[0]

If I were to decompose the above line into logical steps, this is how I would go about it (each step lists the portion of the expression it corresponds to, with "..." standing for the output of the previous step):
  1. Get all similarities, keeping the labels around: map(lambda x: (sim(x[0], test), x[1]), train)
  2. Find the top k points: sorted( ... )[-k:]
  3. Get the labels for these top k points: map(lambda x: x[1], ... )
  4. Get label frequencies: Counter( ... ).items()
  5. Get the most common label: max( ... , key = lambda x:x[1])[0]


Explanation


Step 1: Get all similarities, keep the labels around 

The portion of the one-liner covered up to this step: map(lambda x: (sim(x[0], test), x[1]), train)

In the first step, we call map() on the training data. The lambda expression returns a tuple whose first entry is sim(x[0], test) and whose second entry is x[1]. Remember that train is itself a list of tuples, so the argument to the lambda is a tuple. For example, the first call to the lambda is with the argument ([2, 4, 3, 5], 0); for this call, x[0] is [2, 4, 3, 5] and x[1] is 0. Hence, the first entry of the returned tuple is the similarity of the training data point to the test data point, while the second entry is the label of the training data point.

Given the sample train and sim, this first step produces the following list:
[(39, 0), (59, 1), (40, 1),  (41, 0)]
As a side-note, it is quite interesting that the dot-product can itself be written as a lambda using the zip() function:
lambda list_1, list_2: sum(map(lambda x: x[0]*x[1],  zip(list_1, list_2)))
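As a quick check of that zip-based version on the sample data (dot_sim is just a name I am picking here for illustration):

dot_sim = lambda list_1, list_2: sum(map(lambda x: x[0]*x[1], zip(list_1, list_2)))
print(dot_sim([2, 4, 3, 5], [1, 2, 3, 4]))   # 39, matching the first entry in the list above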



Step 2: Find the top k points

The portion of the one-liner covered up to this step: sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:]

We now sort whatever we got out of the last step by similarity. We use the sorted() function for this purpose. Since it is sorting tuples and we have not supplied a key, it sorts on the first entry of each tuple - the similarity to the test point. By default, the sort happens in increasing order, so the output so far looks like this:
[(39, 0), (40, 1),  (41, 0), (59, 1)]
You can sort in decreasing order too - by passing in the argument reverse=True, but that's more keystrokes :)

Now, we extract the tuples with the k highest similarities using [-k:]. The "-" in Python's slicing notation means counting is done from the tail of the list - here, we get the portion of the list from the kth element from the end through the last element. This gives us the tuples with the k highest similarities. Thus, this is the output of this step:
[(40, 1), (41, 0), (59, 1)]
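If you want to sanity-check these intermediate results yourself, the first two steps can be run on their own (this assumes the setup sketch from earlier; sorted() accepts the map object in Python 3 just as well):

scored = map(lambda x: (sim(x[0], test), x[1]), train)   # step 1: (similarity, label) pairs
top_k = sorted(scored)[-k:]                              # step 2: the k most similar, in increasing order
print(top_k)   # [(40, 1), (41, 0), (59, 1)]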



Step 3: Get the labels for these top k points

The portion of the one-liner covered up to this step: map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:])

Since we have the tuples corresponding to the top-k similar points, we really don't need to keep the similarities around any more. Another map() call does the trick - look at the lambda expression here: it only returns the value at index 1 of the tuple. Therefore, as output from this step, we have just the labels:
[1, 0, 1]
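Continuing the sanity-check snippet from the previous step:

labels = map(lambda x: x[1], top_k)   # step 3: drop the similarities, keep only the labels
print(list(labels))                   # [1, 0, 1]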



Step 4: Get label frequencies  

The portion of the one-liner covered up to this step: Counter(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:])).items()


We now need to find the label with the highest frequency of occurrence in the k closest points. We use the Counter data-structure from the collections module. The Counter is a special kind of dict which, when initialized with a list, uses the unique elements in the list as keys and the frequency of their occurrence as the corresponding value.

Thus, Counter([1,1,2,3,2,2,1]) gives me the dict {1: 3, 2: 3, 3: 1}. This is exactly what we need. Initializing a Counter with the output of the last step - [1, 0, 1] - gives us this dict:
{1: 2, 0: 1}
Calling items() on a dict returns a list of tuples, each of whose first entry is a key and whose second entry is the corresponding value. Hence, the output of this step is:
[(0, 1), (1, 2)]
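The same thing, standalone (the exact ordering of the tuples may differ between Python versions; in Python 3, items() returns a view rather than a list, which the max() in the next step handles just as well):

counts = Counter([1, 0, 1])
print(counts)                 # Counter({1: 2, 0: 1})
print(list(counts.items()))   # [(0, 1), (1, 2)] or [(1, 2), (0, 1)], depending on the version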
Time to address the elephant in the room, I guess :) You could argue that using something from a module does not make for a strict one-liner. And in a sense that is true. I could have avoided using Counter and instead used tricks like these. Technically, we would have still ended up with a one-liner, but Counter expresses our objective in a clear manner. Thus, not using Counter does not mean that you cannot have a one-liner, only that the result won't be pretty.

Note that since I am calling Counter directly, my import should be "from collections import *". If you want a cleaner namespace, you might want to do "import collections" and refer to Counter as "collections.Counter"

Really close now ...



Step 5: Get the most common label 

The portion of the one-liner covered up to this step (which is now the whole thing): max(Counter(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:])).items(),  key = lambda x:x[1])[0]

All that is left now is to extract the label with the highest frequency. That's easy with Python's max() function. Note that our tuples (output from the last step) have the label as the first entry and the frequency of the label as the second entry. Hence, in the max() call, we provide the key argument, which makes max() compare elements by the second entries of the tuples. Yes, with a lambda expression. They really do get around. Thus, max() returns me this:
(1, 2)
Of which, I extract the label by specifying the index - [0]. This gives me my final answer: 1
And this ... is the label we are looking for [melodramatic bow] :)
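If the one-liner feels too dense, here is the same logic unrolled into a readable function. This is only a sketch: the name knn_classify and the use of Counter.most_common() are my own choices, not part of the original one-liner.

def knn_classify(train, test, sim, k):
    # (similarity, label) pair for every training point
    scored = [(sim(features, test), label) for features, label in train]
    # labels of the k most similar training points
    top_k_labels = [label for _, label in sorted(scored)[-k:]]
    # most frequent label among those k
    return Counter(top_k_labels).most_common(1)[0][0]

print(knn_classify(train, test, sim, k))   # 1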


Regression and a related hack

While I was writing this post, I realized that the one-liner for kNN-based regression is actually shorter - 92 characters. It is also purer - it does not rely on modules that need importing. In a regression problem we are interested in predicting continuous values, as opposed to discrete labels in classification problems. For example, you may want to determine the price of a piece of real estate based on features like proximity to a hospital, distance from an airport, proximity to industries, etc. The training data has prices per data point instead of labels.

In kNN based regression, you would look at the values of the k neighbours and return their average.

Here goes:
sum(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:]))/(1.0*k)
If you understand the classifier, understanding this should be easy. The inner portion - map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:]) - is the same as the corresponding part of the classifier. The surrounding sum(...)/(1.0*k) is what is specific to regression. Note that I need to multiply the denominator by 1.0 to make sure Python does not perform an integer round-off in the division.
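Unrolled in the same way as the classifier above (knn_regress is again just an illustrative name of my own):

def knn_regress(train, test, sim, k):
    scored = [(sim(features, test), value) for features, value in train]
    top_k_values = [value for _, value in sorted(scored)[-k:]]
    return sum(top_k_values) / float(k)   # average of the k nearest values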

Now, here is an interesting relationship between kNN regression and classification - if you have only two labels, 0 and 1, then averaging the labels and rounding off is equivalent to majority voting. For example, consider some sample outputs from Step 3 above:

Top k labels      Avg of labels    Avg rounded    Majority vote
1, 0, 1           0.67             1              1
0, 0, 1, 0        0.25             0              0
1, 1, 0, 0, 1     0.6              1              1


This does not work with more than two labels.
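A quick way to check the average-and-round trick against the majority-vote column of the table above:

label_lists = ([1, 0, 1], [0, 0, 1, 0], [1, 1, 0, 0, 1])
print([int(round(sum(labels) / (1.0 * len(labels)))) for labels in label_lists])   # [1, 0, 1]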

Thus, as long as the labels in the classification problem are the numbers 0 and 1, we can use the following expression:
int(round(sum(map(lambda x: x[1], sorted(map(lambda x: (sim(x[0], test), x[1]), train))[-k:]))/(1.0*k)))

The sum(...)/(1.0*k) portion is the expression for regression we saw earlier; the surrounding int(round(...)) is what is needed for classification. round() is a built-in in Python which does not need us to import anything (yay!); however, since it returns a float, we explicitly need to cast to int (bummer ...).

This version of the one-line classifier is 104 characters long.

Conclusions

A final tally of character counts in the various one-liners. Additionally, mentioned in parentheses, are the sizes we would see if we weren't worried about readability i.e. one letter variable names, no spaces:
  • kNN classification - 126 characters (107)
  • kNN regression - 92 characters (77)
  • kNN classification hack - 104 characters (89)  

Pretty impressive numbers - my respect for Python just went up a notch!

I am pretty sure these are not the only possible implementations of the one-liners. For starters, you might favour list comprehensions over map(), do away with the Counter to create a 'purer' one-liner, etc. If you have something interesting, let me know in a comment!

Although great as an intellectual challenge, I don't think the one-liner is suitable for serious use. Good classification libraries may have optimized implementations of kNN - for example, using k-d trees or ball trees - which are preferable. For example, this library.