Sunday, June 15, 2014

Machine Learning - what it is, what it is not

In 2007, when I had begun dipping my toes into Machine Learning (ML), the area was little known. When people asked me what I worked on, I usually had to begin by explaining what ML is. Occasionally, I would say Artificial Intelligence (AI) to save on the explanation.  In fact, I wasn't sure then whether I wanted to do my research in ML or move to something more mainstream, like Databases or Algorithms (I didn't move). So I find the semi-celebrity status ML has gained over the last 6 years interesting, to say the least.

Unfortunately, I think the new-found fame has cut both ways; while it is nice to see ML being recognized, I also notice what seems to be the makings of a bubble: a sudden demand for ML skills, mismatched expectations of what ML can do for an organization (mismatched with respect to the effort they are willing to put in), and a notion of how delightfully easy building good models is (for all the good work MOOCs are doing, this is a sad side effect).

In this rather long post, I want to set right a few of those misleading first impressions. This post is for you if you haven't yet got your hands dirty using ML on real-world problems - for example, if you are new to the area, or if you are planning to start a data science team with little or no prior experience in it.

  1. Do the math. Literally. Let's say you want to build the fastest car in the world. Would you be able to do so if you are scared of studying the machinery under the hood? Obviously not. Machine Learning involves a fair amount of mathematics. If you want to be good, train yourself to understand the math. And, to extend this to the general case, know the underlying algorithm even where (much) math is not involved.

    I have seen people think along these lines: Let’s pick one of the libraries/tools available and do some ML. I don’t really care what goes on inside the algorithms - I can do pretty well without the math.

    In the right context, this way of working is fine. I have done this a few times myself. The right context here is that you don't have the time to invest in learning the internals, or to implement some custom algorithm yourself, but you do want some prediction capability.

    Where this starts becoming a problem is when you start seeing this stopgap solution as the solution - you tend to develop a mismatched expectation that it is going to lead you to the best possible solution for your problem, or that it is always going to work. Acknowledge that, with this approach, you are only going to get something that works - where "works" could mean anything from "just about works" to "works perfectly" - and you may not have much control over where your solution falls in this spectrum.

    The only way to be really good is to know the math so that you can pick the right algorithms and squeeze the last bit of performance out of them. 




    Confidential: path to enlightenment




  2. Humans can beat your algorithms. This should not really be surprising. Take the task of searching for images similar to a given one. Today, you can almost always outperform a computer at it. Try out Google Search by Image to see what I mean - the best results are the ones that are identical or very similar to your input, or appear on the same or similar pages as your input. Beyond that, the results go haywire.

    Google, Logan Lerman and Jennifer Connelly are different people.
    Look at the results of a search for Jennifer Connelly. The image I used in the left screenshot seems to be a popular one - it appears on multiple websites that have articles about her or her wallpapers. The visually similar images returned as part of the search results are quite accurate. In fact, 5 of the 9 results seem to be from the same interview/session. But if I pick an image that is not so popular, as in the screenshot on the right, Google throws in a few dudes in its results.

    This is not just one cherry-picked pathological example; here is what Richard Szeliski, a noted researcher in the field of computer vision, has to say:

        "It may be many years before computers can name and outline all of the objects in a photograph with the same skill as a two year old child."

    You use machine learning not because a computer can do better than humans (though in some cases that may be true). You use ML when the scale a computer can handle is more important than the accuracy a human can provide. You are good at image search, but can you sift through a billion images in a second?

    Applying ML to a real-world problem implies you are willing to sacrifice some accuracy to be able to handle scale. Scale might be important to you because it makes business sense. For example, if you wanted to set up image search as a service, would you rather handle 1,000 customers a day with 100% accuracy (a hired staff working behind the scenes), or 1 million customers with 70% accuracy? This is really a trick question - there is no clear answer; it depends on your objectives, business model, etc. My point is that there is a case to be made for scale.

    Sometimes your ability to handle scale can also indirectly improve your accuracy. An example:
    • You create a search engine that is not as accurate as humans but can crawl the whole internet to give you good enough results to begin with.
    • This makes it a very popular tool – because at least your tool can look at the whole of the vast WWW.
    • As more and more people use it, you keep noting which of the links you return as search results actually get clicked.
    • You start ranking higher the ones clicked most. And lo! – now you have something that progressively starts approximating human search behavior.


    Moral of the story - know when to invest in ML. More importantly, set your expectations right about the returns on the investment.

    And this is not a glass of Pepsi.


  3. Patterns detected by many algorithms cannot be conveniently interpreted. Take Artificial Neural Networks (ANN), Support Vector Machines (SVM) and Random Forests (RF). These are well-established algorithms that are known to do well on a variety of problems but, in general, the model they learn cannot be represented as easy-to-understand rules. Say you are trying to predict estate prices based on various factors like the average income in the locality, the proximity of the estate to important landmarks in the city, the number of hospitals and schools in the locality, etc. If you use an SVM on the problem, in most cases you won't be able to infer something like this from the model:
        The price of the estate is directly proportional to the number of hospitals within a 3km radius of the estate.

    This often leads to confusion on these lines:
         How can we trust this black-box model when we really don’t know what it is doing?

    Some thoughts:
    • Decouple the following two ideas: (1) "can be easily interpreted by humans" and (2) "reliable". Algorithms like ANNs have a strong theoretical base - if you want to understand what a model spits out, the way to do it is not to look for rules, but to go back and look at this theory. Yes, the paradigm is different - instead of picking out rules from a model and looking at them saying "Yes, this one looks reasonable", the idea is to have a proof that says that whatever comes out of the learning phase, under certain circumstances, has such-and-such error bounds and will perform in such-and-such manner.

      This is, in a way, reliable in a much more objective manner - the proof puts down what's to be expected in black and white.

      One of the ideas used by the winning team of the famous Netflix Prize was Restricted Boltzmann Machines. Not exactly very interpretable, and yet it raked in a hefty $1M prize. Reconsider - do you really need interpretability? :)


    • Interpretability offers you a comfort zone. Convince yourself to step out of it, because we have a simple, objective test for what really matters - accuracy. Keep aside a test set of data that the algorithm never sees during training, and find out how accurate the model is on this data (see the sketch after this list). This is as reliable and objective as it gets.

      I guess what I am driving at is: whenever you find yourself uncomfortable with a model that cannot be interpreted, ask yourself whether you really need to interpret it - and the answer cannot be "for my own comfort". Isn't high accuracy sufficient? Isn't a good model better than an interpretable but not-so-good model? In some cases you might genuinely need models that can be explained - perhaps to your clients, as part of your findings - and that is probably the only time you should start seriously thinking about interpretability.

    Worry about interpretability when you really need it. Don't expect a good model to be also conveniently interpretable.
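
    To make the test-set idea concrete (the sketch promised above), here is a minimal example. I am using scikit-learn and a synthetic dataset purely as assumptions for illustration - any ML library with a train/test split will do:

        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import accuracy_score
        from sklearn.model_selection import train_test_split

        # Synthetic data standing in for your real problem.
        X, y = make_classification(n_samples=2000, random_state=0)

        # Hold out 20% of the data; the model never sees it during training.
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

        # A hard-to-interpret but typically strong model.
        model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
        print(accuracy_score(y_test, model.predict(X_test)))

    The number printed at the end is an objective measure of quality, obtained without interpreting the model at all.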

  4. Use the right metrics.

    Do you remember the compass that Captain Jack Sparrow uses in Pirates of the Caribbean? The needle points to whatever the person using the compass wants most. The compass is a fantastic tool, but useless if you don't know what you want.

    ML is like that compass - it serves you a lot better if you can express exactly what you want to measure.

    Sorry for butting in with a cheesy analogy. Hopefully, it will help me make my point.

    Consider this problem: you are running a car dealership. You make a convincing sales pitch to everyone who walks into your store. Additionally, for certain customers - among the ones who have agreed to fill in some kind of a form (where you ask them about their income, current mode of transport, place of work, age, etc.) - you throw in something extra: you let them borrow a trial car for a day, with fuel expenses paid. You believe that this set of customers will make a purchase with some cajoling.

    How do you pick this set of people? Let's say you train an ML model, based on the information entered in the form, to make a binary prediction: likely to purchase (we call these our positive labels) or not likely to purchase (negative labels). You make the trial-car offer to people your model assigns positive labels to. Clearly, a lot rides on the performance of this model.

    Let's assume that, typically, 5% of the people visiting your store buy a car. Let's look at some metrics you may consider to assess the value of your model:

    1. How about good old accuracy? This measures the percentage of predictions you get correct overall. If I have a model that only ever predicts "not likely to purchase", I would be wrong only 5% of the time. So my model is 95% accurate! That would normally be great, but it makes no sense for my business, since this model doesn't identify anyone as a potential purchaser at all.
    2. We are interested in people who are likely to buy - so let's use a metric that measures how often the model predicts a positive label for people who actually have a positive label in our data. But now you could have a model that always says "likely to purchase", which gives you 100% accuracy within the group of positively labelled people. This is no good either - think of all the free trials you would be giving away to the 95% who aren't going to buy!
    3. Obviously, we need a way to balance out the two problems we just saw. We want to catch as many of the actual potential purchasers as possible, so we don't miss people we can sell to. But this should not come at the cost of identifying non-purchasers as purchasers - because then we would be giving away way too many free trials.


    A good metric to use here is the F1-score, which is defined in terms of precision and recall. In our context: $$Precision = \frac{people \; identified \; as \; purchasers \; who \; are \; actually \; purchasers}{people \; identified \; as \; purchasers}$$
    $$Recall = \frac{people \; identified \; as \; purchasers \; who \; are \; actually \; purchasers}{actual \; purchasers}$$ $$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision+Recall}$$
    Note how the problems in points 1 and 2 (from the list above) are taken care of by precision and recall respectively. In particular, look at how the F1-score penalizes both low precision and low recall. For clarity, plots of F1 (z-axis) as a function of precision and recall (x and y axes) are shown below. The four plots correspond to the same relationship, but show it from different perspectives. Swatches of the same color represent approximately the same F1-score. Observe that the plotted surface seems "taped" to the x and y axes, signifying that if either precision or recall is zero, the F1-score is zero.
    F1 score as a function of precision and recall
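
    As a quick sanity check of these definitions, here is a minimal Python sketch (my own, not from any library) that computes all three quantities from raw counts:

        def precision_recall_f1(tp, fp, fn):
            # tp: purchasers correctly identified as purchasers
            # fp: non-purchasers wrongly identified as purchasers
            # fn: purchasers the model missed
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
            return precision, recall, f1

        # The "always says likely to purchase" model from point 2, on 100 visitors
        # of whom 5 actually buy: perfect recall, terrible precision, and F1 punishes it.
        print(precision_recall_f1(tp=5, fp=95, fn=0))  # (0.05, 1.0, ~0.095)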

    Picking the right metric is important to make sure that your ML model is solving a problem that translates into value for you. There is a host of standard metrics to pick from - e.g. F-score, lift, ROC area, average precision, precision/recall break-even point, squared error, cross-entropy. Of course, you might want to consider a custom metric if that suits your problem better.

  5. Always have a baseline. Some problems are harder than others; whether 90% accuracy is good depends on the problem.

    My favorite example is Sentence Boundary Disambiguation (SBD) - the task is to identify the tokens in a given text that indicate sentence boundaries, i.e. points where a sentence ends. Here, a token is any piece of text that is not whitespace. For example, the text (a quote by Leonardo da Vinci):

          In rivers, the water that you touch is the last of what has passed and the first of that which comes; so with present time.

    has 28 tokens: 25 words and 3 punctuation symbols. We need to identify which tokens among these 28 represent sentence boundaries. There is only one in this text - the "." at the end. This is a binary classification problem - for each token, we only need to specify whether or not it is a sentence boundary. The catch is that boundary indicators have other uses too - for example, "." is also used in abbreviations - and there is more than one boundary indicator: "?", "!", etc.

    Is a model with an F1-score of 90% a good model for this task? Let's start by looking at some very simple "models" and how they perform. In the table below, I have listed 4 such very-non-ML-ish models for the SBD task, and their performance on text from the Brown corpus. Performance is characterized by precision, recall and F1 scores (a token gets a positive label when we identify it as indicating a sentence boundary).


    Description of model                                                               | Precision | Recall | F1-score
    Sentence boundary if token is "."                                                  | 100.0     | 86.06  | 92.51
    If token is one of "?", ".", "!"                                                   | 94.35     | 91.54  | 92.92
    If token is "." and "." hasn't occurred as one of the last 3 tokens                | 100.0     | 84.97  | 91.87
    If token is one of "?", ".", "!" and none of the last 3 tokens are "?", "." or "!" | 94.26     | 84.58  | 89.16


    As you can see, it is hard not to score around 90% on this task! And I wrote the code that generated the above numbers while writing this post - all 4 models in about 50 lines of Python.
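
    That code isn't reproduced here, but the four rules look roughly like this (a sketch, under the assumption that the corpus text is already split into tokens):

        END = {".", "?", "!"}

        # Each model answers: is the token at position i a sentence boundary?
        def model1(tokens, i):
            return tokens[i] == "."

        def model2(tokens, i):
            return tokens[i] in END

        def model3(tokens, i):
            return tokens[i] == "." and "." not in tokens[max(0, i - 3):i]

        def model4(tokens, i):
            return tokens[i] in END and not (END & set(tokens[max(0, i - 3):i]))

    Scoring any of these against gold-standard boundary labels is then just a matter of counting true and false positives, as in the F1 sketch from the previous point.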

    A baseline tells you how effective an absolutely commonsensical model is. This helps you better direct your efforts: do you want to spend time implementing a model that the literature says would give you 95% accuracy, when there is something reasonably good you can do in a short time? Is your accuracy of 60% necessarily a bad score? Not if you have a reasonable baseline that gives you 30%. How far have you come from your baseline accuracy with your current model? If it's not much, you may want to revisit your hypotheses about the data so that you can pick a better model. For all real problems, having a reasonable baseline is a must.

  6. Try it out. Models make various implicit or explicit assumptions about the underlying data - for example, about what the separation boundary between the classes looks like, the size of the dataset, the distribution of noise, whether sequence matters, etc. Since you won't have all the information about your data that would let you validate these assumptions (if you knew so much, you probably wouldn't need to model the data), you might end up using a model that doesn't suit your data. And the only way to know (at least, in most cases) that there is a mismatch is by actually trying the model out on the data. Building a good model can be an extremely empirical process. So be careful not to become complacent just because a model looks good on paper. It might not work the same way for your data.

  7. Bias-variance tradeoff, dimensionality and cross-validation. These are topics that deserve longer discussions, and I am not going to talk about them much here; I fear a short discussion would either confuse or impart a false sense of security. If you work with real data, I cannot stress enough the importance, and relevance, of learning about the bias-variance tradeoff, the effects of dimensionality and the practice of cross-validation. What I am going to do, however, is give you a teaser of the pitfalls you are vulnerable to if you are unaware of these ideas.

    Say a friend comes to you with data to be classified. You have, at your disposal, implementations of the Naive Bayes algorithm and of C4.5 - a type of decision tree. You know nothing about how the data was generated, but your friend tells you he is almost sure the data has a complex separation boundary between the positive and negative classes. So, which algorithm do you use? Clearly C4.5, right? After all, it is well known that decision trees can come up with complex boundaries, while Naive Bayes can only get you linear boundaries.

    Here's the surprising bit: You can do worse than the Naive Bayes classifier. Even though the data is not linearly separable.

    What did you miss? You forgot to take into account the fact that Naive Bayes has high bias, and that decision trees suffer from high variance. As a result, the size of the dataset affects their relative performance.

    Something with high bias is very rigid in its idea of the separation boundary between the positive and negative classes. Naive Bayes comes up with a separation boundary that is a line - throwing more data at it will only help it find a better line, but it cannot find anything non-linear; if the optimal separation boundary looks like a circle, well, you are out of luck.

    A high-variance classifier can find complex boundaries, but is sensitive to the precise distribution of points. This is a problem because data is often contaminated with noise, i.e. instances in your dataset that are not a consequence of the phenomenon you want to study/model. In the car dealership problem discussed previously, forms in which people filled in an incorrect age would result in noise. A high-variance classifier tries to find a separation boundary that correctly labels the noisy points too. This is not a smart thing to do, since the noise you saw in the training data (the data used to build the model) may not show up in the test data (the data on which we make predictions using the model) - it's noise, after all - leading to inaccurate predictions.

    The figure below shows these problems. In the left panel, which shows the training data, we can see how a high-variance model overfits and finds a boundary that includes the noisy green points - which lie far inside the "red region". The high-bias classifier is unaffected, but then it is also a bad model, since the ideal boundary seems quadratic. In the right panel, we see that in the test data the noisy points are absent and, instead, there are red points in that region, as we might expect. The high-variance model, however, predicts everything in this region as green - since that is what it learnt from the training data. The high-bias model labels a lot of the green points as red, because it could never fit the data correctly in the first place. That it was not affected by noise is an artifact of its general inflexibility.

    Also note that even the ideal boundary misclassifies one red point (the bottom left-most one) - the lesson is that no one is expected to get everything right; we are merely in the business of minimizing errors.

    Training data (left) and test data (right)


    The graph below shows how the accuracies of our models change as the size of the dataset increases. With little training data, the high-variance decision tree classifier fits noise - so much so that the Naive Bayes model, which is downright wrong for this data (remember, the data in the example is not linearly separable), outperforms it. A large amount of data effectively tames the decision tree into focusing only on the dominant patterns in the data (i.e. not the noise), and its performance improves. We also discussed that throwing more data at a high-bias classifier doesn't help much - accordingly, we see that the Naive Bayes line plateaus after a while, whereas more data definitely helps the decision tree become progressively better.


Accuracy of Naive Bayes vs Decision tree, as a function of dataset size. Source.

This scenario should tell you that some of what you think of as well-informed decisions can turn out to be catastrophically wrong (duh! life!) - which classifier is better can depend on how much data you have (among other things). Now, go read up on these topics as much as you can :)
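
If you want to see this effect for yourself, here is a minimal sketch using scikit-learn (my own illustration; make_moons is just a stand-in for "non-linearly separable data with noise", and whether you reproduce the crossover exactly depends on the data and the noise level):

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    # Non-linearly separable data with label-blurring noise.
    X, y = make_moons(n_samples=20000, noise=0.3, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Train both classifiers on increasingly large slices of the training set.
    for n in (50, 200, 1000, 5000, len(X_train)):
        nb = GaussianNB().fit(X_train[:n], y_train[:n])
        dt = DecisionTreeClassifier(random_state=0).fit(X_train[:n], y_train[:n])
        print(f"n={n:5d}  NB={nb.score(X_test, y_test):.3f}  DT={dt.score(X_test, y_test):.3f}")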

  8. Literature survey is important. If you are new to ML, you may have those moments when it seems you've hit upon an original idea to solve a problem. I will be harsh here and tell you that it's most likely not as original as you think. ML has been around for a while - research on neural networks started in the late 50s! - and over the last couple of decades the field has developed at a reeling pace. What that means for you is that most common "use-cases", and many not-so-common ones, have been thought about, and solutions published or patented.

    One way I like to put this is: to be original, you are now competing with people across space and time. In your class at school, for an idea to seem brilliant, you only had to come up with something that didn't strike your 30-odd peers within a given amount of time. Here, however, your peer group is composed of every living human being working on ML (who would publish/patent) - they are distributed across the globe, and some of them probably have the luxury of dedicating more time to solving ML problems than you do. And this is only the space part. Your competitors also include the people who contributed research to the discipline over the last 50-60 years. This is the time part. Needless to say, against such a massive demographic spanning space and time, it is tough for something to be original. Especially so if you are unaware of what the existing literature contains.

    The moral here is that when you come across a problem, do your literature survey first. Chances are your problem has already been studied. Even if that's not the case, a good literature survey can provide a logical/mathematical structure to tease your problem into, enabling quick and systematic progress. Don't reinvent the wheel. And remember that Google Scholar and CiteSeerX are your friends.

    I feel our approach to ML problems (or, really, any problem) should be in the same spirit of humility with which E. W. Dijkstra, computer scientist extraordinaire and a Turing Award winner, suggested we approach problems in computer programming:

    Edsger W. Dijkstra

       "We shall do a much better programming job, provided that we approach the task with a full appreciation of its tremendous difficulty, provided that we stick to modest and elegant programming languages, provided that we respect the intrinsic limitations of the human mind and approach the task as Very Humble Programmers." (Source: The Humble Programmer)


      




  9. Working with data is a lot of dirty work. This point is especially directed at people who want to move into Data Mining or Machine Learning job profiles. In most (not all) of these cases there is the glamorous notion that they will be working on cool learning algorithms all day. The truth is, working with data can be hard and inelegant. And sometimes plain boring. All those algorithms ML teaches you can be applied only once you have captured the required data in the first place, cleaned it up (data capture is hardly ever perfect and free of noise) and transformed it into a format your ML tools can use. This is not pretty work, and it takes up most of your time. It is often the straw that breaks the camel's back, so to speak - people who were enamored with working on ML become disgruntled with the area when they face these aspects of the job up close.

    My advice to people who want to move into this line of work: set your expectations right. Find someone whose day job is using ML, preferably in the kind of workplace you are interested in, and ask him what his daily routine consists of.

    The following awesome CnH comic strip aptly sums up my sentiments - only replace "science" with "ML":

  10. Solutions evolve. It is interesting that although we take a mature, iterative approach to developing software, a similar mindset is not as prevalent when modeling data. We tend to commit mistakes of both omission and commission: either we give the problem too little attention and expect the first few tries to completely nail it, or we get into a purist-academic research mode where we don't want to commit to a solution until we deem it perfect. I feel the software model of iterative development fits very nicely with how ML models should be built in industry:
    • You need to have a long-term commitment to your data modeling problem. Don't expect to dedicate 2 days to your problem and get a model that is optimal, robust, processor friendly, has low data complexity, etc.
    • Don't get stuck trying to come up with the best model in the first go. Think versions. Keep revisiting your solution and making it better. This has the added advantage that your first few versions will help you set up a proper workflow/pipeline/interface - in a product or a service, your model is presumably one component of a larger system, and the first few models can help you smooth out the various kinks around how it plugs into that system.


  11. The man inside the machine is not going away soon. (Warning - this is a subjective viewpoint.) Although the last few years have seen immense progress in the field of ML (I am occasionally still surprised that voice-based search is a reality and runs off my mobile phone), we are not yet at a place where we can let algorithms learn from data in cruise-control mode. You need human involvement to pick the best algorithm from the plethora available out there if you want anything done in a reasonable amount of time. I have heard people lament that the whole learning process is not "automatic", i.e. that you can't throw whatever data you have into a black box and have an optimal model come out. But then, the problem we are solving is so complex!

    Different phenomena produce data that have different underlying idiosyncrasies, and to model these effectively, many learning algorithms come halfway by making some assumptions of their own so they don't have to rediscover everything from scratch. How, then, is the right algorithm to be matched to the problem, accounting for your various constraints like time, accuracy, computational resources, etc? The missing element is you - you need to pick the right algorithm for your problem, much like picking the right key from a keychain for a lock.

    In theory, yes, you could have an algorithm that loops through all possible separation boundaries to pick an optimal one - but there is an infinite number of them (this is not an exaggeration), and you would need forever to solve a single problem. So that you don't have to suffer through this rigmarole, you rely on the experience and judgement of a (good) data scientist to pick a promising subset of the universe of models to try out. Unless there is a major theoretical revolution in the field, say on a Newtonian or Einsteinian scale, or parallel processing becomes far more parallel and affordable - enough that we can brute-force our way out - the need for human involvement, in the general case, stays.


Phew!


Wednesday, December 25, 2013

A play

Going through some old documents I came across my first attempt at writing a play. Maybe a little embarrassing now, but I am afraid I will lose the document sooner or later. This is for the record. I remember that I couldn't settle on a name for the play, and the file that I found is called "ytbn" - short for "yet to be named". As far as I remember, it never was named.




Introduction:

The play depicts an incident in a society that dictatorially enforces conformity of deeds and ideas. This society is loosely fashioned after the one portrayed by George Orwell in his book, “1984”.



Characters:

The salesman (our protagonist)
The shopkeeper
Merry, assistant to the shopkeeper
Agents of the government



The curtains open to a melancholy setting of a neighborhood. It is evening. Our protagonist, a salesman, walks down a street lined with shabby-looking houses. He is apparently new to the neighborhood, since he walks slowly and lets his eyes inspect every signboard he notices. Shortly he comes near a lamp post and is momentarily shocked to see two men, dressed in black overcoats, standing beneath it. Their surreptitious presence makes him pause, but with a voluntary effort he resumes his pace, only to stop again after a few brisk strides. He has found what he was looking for. His eyes travel over the signboard of a shop. The shop is small, and almost looks as if it has been forcefully tucked into an incredibly narrow space between two houses by some superhuman cause. The sign reads, “The Curio Shop”. Without further deliberation the salesman walks in.

            The shop is dimly lit. In the prevailing semi-darkness it is difficult to see any merchandise that the shop might deal in. Presently a person appears and addresses the salesman.

New character/Shopkeeper: Good evening! I am the shopkeeper. How may I help you? (Smiles pleasantly)

Salesman: I am a salesman. I am looking for a few things (clears his throat) that I might find here – so I have been told.

Shopkeeper: (Pleasantly smiling) Yes, of course! We sell quite a few things here that are hard to come by otherwise.

He gently sways his hand in a gesture of welcome towards his left. The light adjusts to reveal an array of mantelpieces.

Shopkeeper: (Continuing, with unmistakable excitement in his voice) Mantelpieces like these, Sir, I assure you, you will find nowhere. We also have some exquisite furniture, and certain other things I would like to show you, should you have the leisure to look at them!

Salesman: (Shuffles his feet and looks uncomfortable) Er..yes. (Pauses) But this is not what I came here for.

Shopkeeper: (Narrowing his eyes, but keeping his smile intact) I am not sure what you are talking about. Perhaps, you would want to quote a reference?

Salesman: Frank.

Shopkeeper: (Pauses, but quite commendably still holds his smile) Ah, yes. This way please.

The shopkeeper leads the salesman to a wall in an inner room. Placing his palms together on the wall, he gently slides them towards himself. A faint click… a hidden door!
The door opens to reveal another badly lit room, the contents of which are not immediately visible.

Shopkeeper: Before we enter ….(looks about himself)…Merry! Merry!

Another man appears. He is unremarkable in every way but for the calmly morose expression he wears.

Shopkeeper: (To the salesman) This is Merry, my assistant.

Merry looks at the guest but doesn’t attempt a smile. It is almost as if Merry and the shopkeeper complement each other as far as their expressions go. The shopkeeper is a perfect salesman with a ready smile at his disposal. He seems almost incapable of any other expression, something that can be said equally well for Merry, who, however, is a picture of gloom. They are like two poles of emotions in a balanced conversation, embodied apart.

Shopkeeper: Merry would watch the shop while we are inside. I apologize for my earlier behavior. You are obviously aware of how things stand today.


The salesman, the shopkeeper and Merry stop mid-air in their actions, as another part of the stage lights up to depict their thoughts at the moment. Here we see a bonfire of books and paintings, tended by men wearing black overcoats. Nearby, a person, presumably the owner of the property that feeds the spiteful fire, is being forcefully dragged away by more men in black overcoats. He resists, but his face is painfully calm, proclaiming that this was inevitable, and he knew that such a day must arrive. This distressing sight stays for a moment and then vanishes, with the portion of the stage returning to darkness, as our three characters become animate again.

The shopkeeper and salesman walk into the newly revealed room while Merry stays outside. Inside the room, we now see shelves lined with transparent masks.

Shopkeeper: (Smiling) These masks, as you would know, have no features, only expressions. When you wear them, they take on the features of the face underneath, but the expression is theirs. A happy mask makes you look happy; a sad mask makes you look sad. Always.

The salesman reaches inside his pocket to draw out his purse to pay the shopkeeper.

Shopkeeper: (Smiling) Which one will it be then?

Salesman: The one that smiles, of course.

Taking down a mask from the shelf, the shopkeeper wraps it in a piece of brown paper.

Shopkeeper: Here it is. Please hide it well when you carry it outside. Our shop is already being watched.

The salesman and the shopkeeper leave the inner room and the sliding door is pushed back into place. Merry, with his unchanging morose look, who has been waiting at the other side, disappears as soon as they return.

The salesman carefully tucks the wrapped parcel into his jacket and makes to leave the shop.

Salesman: Thank you!

Shopkeeper: You are welcome! But be very discreet about this little affair.

Smiling a knowing smile, the salesman walks out. He makes brisk pace on his way back. As he nears the lamp post, he spots the two men in black overcoats, still present. Distracted, he tries to take a quick step, trips, and falls. The impact throws the brown package away from his person. The wrapping was not well done, and a part of the mask is clearly visible in the package now lying on the ground.

The men become suddenly alert, and lose no time in running to the shop once they notice the now ill-disguised contents of the package. The salesman is paralyzed with fear and, not knowing how to react to this new development, watches panic-stricken in the direction of the shop.

The men disappear into the shop. A while later, Merry and the shopkeeper are brutally dragged out into the street. As they shout and scream and their limbs flail wildly in the air, the salesman notices that the shopkeeper can’t help but still smile pleasantly. Merry, too, like a vivid contradiction to the world around him, to the reaction of his body and to his very name, wears a mask of calm moroseness.

CURTAIN

Sunday, April 7, 2013

Thoughts on Online Learning

When online learning became a reality for me with MIT OpenCourseware, I was excited. I could watch Gilbert Strang teach Linear Algebra for free and learn at my pace. So when Coursera and Udacity were launched I was as happy as the next guy, if not more. I work with Machine Learning at my job and it is extremely valuable to me that I can watch related videos online, for free, with the content taught by some of the most prominent researchers/practitioners in their respective areas.

But it doesn't seem to have worked out very well for me. And apparently not for a lot of other people either. Personally, I am back to reading books. With a year of Coursera/Udacity behind us, I am seeing articles, like these, discussing the shortfalls. In this post, I want to discuss where these courses fell short for me.

To be clear, this is an extremely subjective article, discussing things from my point of view - which is why it is not called "What's wrong with online learning?".

What I find missing

My first two online courses were Linear Algebra by Gilbert Strang, offered by MIT OpenCourseWare, and Machine Learning taught by Andrew Ng, offered by Stanford Engineering Everywhere. When I initially started going through the material, I wasn't doing much beyond watching the lectures and occasionally scribbling notes. Although the courses filled me with an immediate sense of achievement, I realized that my ability to recall finer details later was essentially broken. In most cases, when coming across a concept elsewhere, I could relate to it at a high level - but dwelling on the workings was difficult, and I often had to go back to the videos for a quick revision.

At this point, other problems would crop up - videos are not textbooks whose layouts you remember, which makes skimming and visual searching difficult. They don't have indexes to facilitate quick look-ups. I would typically narrow things down to a couple of specific videos (one, if lucky) and then rummage about with the playbar till I found what I was looking for. Not to mention that I needed a computer and my video lectures around for that. Thus, not only was I unable to recall specifics, but digging them out was no walk in the park either.

To fix this, I re-visited the lectures and prepared extensive notes this time. That worked out amazingly well for me - I remembered most of the stuff, and had a ready reference to consult (my notes).


Fig 1. Notes from Machine Learning

But this came at the cost of a huge amount of learning time - if a particular video lecture was an hour long, listening to it and making notes seemed to take roughly twice that time, assuming I took detailed notes, and didn't merely write down what I heard and saw, but phrased it in my own words.

Given that I was never much of a note-taker in class, and still did well on most of my courses, I had to wonder - why is it different now? Here are a few things I missed in my online learning experience:
  1. Active participation in the learning - in classes, I could ask questions. Equally important - others could ask questions. All during the class. The opportunity to ask questions worked to my benefit in two significant ways:

    • It kept me engaged. At some level, it felt as if the class was collectively and organically working towards discovering something.

    • It prevented technical debt. Usually that term is used in the context of software development, but I feel it is equally applicable here. One shoe doesn't fit all - so if the material prepared by the lecturer falls short of satisfying a student, he can immediately ask a question to cover the gap. With video lectures, my options were to pause the video and Google something up right away, or to make a note and let these pile up for later. Either way, it wasn't the best of experiences for subjects I really cared about.
  2. Learning in a class, for want of a better term, is an immersive experience. This helps recall and motivation. You remember more than just the content from the class - you remember people, places, a comment that made you laugh, etc. Sometimes, these help you recall stuff better. You also see real people sitting next to you trying to get good at the same thing as you. If you want to discuss a specific concept with them, it is easier to do it verbally than by posting on a forum. There is collective motivation to learn, and communication with peers is easy.
  3. If you want to go back to revising something you can refer to your notes, the lecturer's notes or the textbook the class is following. With video lectures this is tough.
In a nutshell, for online courses, I don't think the content registers too well because they don't do great on the first couple of fronts. And what doesn't register should at least be easily searchable - unfortunately, that doesn't happen either.

I mentioned I have moved back to textbooks. Why am I learning better with them? With regard to the first two factors, I think textbooks actually do worse than online learning. But they do way better when it comes to the ease of revisiting material - and that ease aids reinforcement, which, over a period of time, sort of makes up for the relative lack of engagement.

It hit me when I started taking the course on Probabilistic Graphical Models by Daphne Koller. The lectures were detailed - I loved them - but the need for taking detailed notes increased too. Given my full-time job, it was becoming difficult - I was not sure how long I could keep up.


Fig 2. Notes from Probabilistic Graphical Models

One weekend I ended up with a sprained neck trying to take notes for a bunch of lectures. I gave up, re-enrolled when the course was offered the next time, only to drop out again for similar reasons. The gamification (deadlines) made me feel stupid and was demotivating. I realized that this wasn't sustainable for me and went back to books.

To sum up the comparison: textbooks provide me with a medium I can easily revisit, and the time I am not spending creating such an easily searchable medium myself - which, for video lectures, means taking extensive notes - makes up for the other benefits video lectures have to offer.

Possible solutions

To be frank, I don't know what changes would make online courses better for me. It's hard to know without actually trying. Personally, I should probably focus on only one course at a time and accept the fact that it is going to take me much more than the stipulated time to finish it.

However, if I were to take guesses as to what might make courses better, this would be my list:
  1. Make lectures more interactive - this would promote engagement. Interaction in courses exists - the operative word is "more". Better engagement would help recall and drive down the need for extensive note-taking. Note that this does not completely do away with the need for taking notes - it only eliminates the need to do it in painstakingly copious amounts.

    Interaction is one way to promote engagement. There might be others too. Perhaps, something on the lines of what the Head First guys did for programming? They made reading programming books as easy as reading story books. Is there a way to make watching video lectures as effortless as watching a movie? Or, playing a game - like this experiment from UCSD.
  2. Make them searchable - somehow. Find a way to let me revisit the course for specific topics with ease. Transcribe the speech of the lecturer and index the video timeline with words from the transcription, if that's what it takes! (A sketch of this idea follows this list.)
  3. I like how Erik Demaine prepares his courses.

    Fig 3. Geometric Folding Algorithms by Erik Demaine

    Fig 3. is from his course on Geometric Folding Algorithms. There is a lecture video on the page, a frame with handwritten notes (shown in a black box) and another frame with slides (shown in a red box). The video, the notes and the slides are synchronized: when Erik moves ahead with his lecture, the slides and notes change pages too.

    This helps the learner associate textual content - notes/slides - that can be searched (probably more easily after printing) with the actual lectures. This is a clever way of maintaining the engagement videos give us, while making the content easy to revisit later without actually revisiting the video. It all depends on how strong you can make the association between the text content and the video in the presentation.

    InfoQ and videolectures.net do this to some extent too.
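
Since I brought up transcription-based search (the sketch promised in point 2 above): the core idea is a simple inverted index from words to timestamps. Assuming you already have a transcript as (timestamp, text) segments - the segments and words here are purely illustrative - it could look like this:

    from collections import defaultdict

    def build_transcript_index(segments):
        # segments: list of (start_time_in_seconds, text) pairs
        index = defaultdict(list)
        for start, text in segments:
            for word in text.lower().split():
                index[word].append(start)
        return index

    segments = [(0.0, "today we look at eigenvalues"),
                (95.0, "an eigenvalue of a matrix satisfies ...")]
    index = build_transcript_index(segments)
    print(index["eigenvalues"])  # [0.0] - timestamps to jump to in the video

A real implementation would need stemming and fuzzier matching, but even this much would let me jump straight to "that part where he explains X".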

Wednesday, November 9, 2011

The Best Card Trick


... is the title of an article [PDF] from the Mathematical Intelligencer that I recently came across, which, well, discusses a namesake card trick. As it turned out, the somewhat presumptuous title notwithstanding, the trick happens to be pretty interesting. Not to mention it gives me enough material to gab about and break out of my posting hiatus (before slipping into another :) ). So, here we go ...


A bit of history

The original article offers close to 1.5 pages of history if you are interested - consider this section a "tl;dr"-ed version.

  • The trick appeared in the book Math Miracles by Wallace Lee, where it is credited to William Fitch Cheney Jr., a.k.a. "Fitch".
  • The article is by Michael Kleber.

Here's where I would have ideally ended this section (a bit of history: check), but Fitch having been the interesting character that he was, it would be unfair if I didn't mention the following facts about him before we move on:

  1. If a name like "Fitch" inspires the image  of an ordinary performer of card tricks, or worse still, a street magician, you couldn't be more wrong - Fitch was awarded the first PhD in mathematics from MIT in 1927.

  2. As a teacher, Fitch seems to have been unusually entertaining:

     "... he enjoyed introducing magic effects into the classroom, both to illustrate points and to assure his students' attentiveness. He also trained himself to be ambidextrous (although naturally left-handed), and amazed his classes with his ability to write equations simultaneously with both hands, meeting in the center at the "equals" sign."
      --- from a description of Fitch by his son, Bill Cheney

Trick, and a mathematical treat

To appreciate how the trick works under the hood, let's start by looking at it from the point of view of the audience. If you were at the receiving end of the trick, this is what you would see:

(1) The magician holds out a normal deck of 52 cards to you, and asks you to pick 5 cards.

(2) You do so and hand them over to his assistant.

Say, this is what you pick:


(3) Of the 5 cards handed over to her, she hands 4 of them to the magician. Say, these:



(4) He looks at the 4 cards handed to him, and with < insert favorite dramatic gesture here > announces the 5th card you had selected.




Take a while to think about what happened here. Assuming a normal, well-shuffled deck and no underhand techniques, the magician's objective is to look at the 4 cards handed to him by his assistant and infer the 5th. With absolutely no prior knowledge of the selection on the part of the magician (in other tricks this is typically arranged for by using a deck with more cards of one kind than another, or by stacking it in some particular way, etc.), how exactly is the assistant to indicate the rank and suit of the 5th card using the 4 cards she hands over?

What messages can you pass with 4 cards?

A possible way seems to be arranging 4 cards in a way that identifies the 5th card i.e. you could use the ordering of 4 cards to encode the suit and rank of the 5th card.

But there is a catch here - 4 cards can be arranged in 4! = 24 different ways, whereas the 5th card could be any one of the remaining 52 - 4 = 48 cards. Apparently we do not have enough arrangements of 4 cards to indicate the 5th card. One card more, and we would have been good to go - but, the technique doesn't seem workable as-is.

In essence, the trick is not a sleight-of-hand-ish trick - it boils down to compressing and conveying information. Fitch being a mathematician probably has something to do with this.

So, which 4 cards does the assistant hand over to the magician? And how do these indicate the 5th card?


The workings

We saw how we fall short when we try to use orderings for our purpose; this is where the trick gets interesting. Instead of abandoning the approach completely, it looks deeper at the kind of objects we are dealing with. Each card is marked with a suit and a rank - can these attributes help us in squeezing in more information? Apparently, yes:
  1. Since there are just 4 suits, but you have picked 5 cards, there are at least 2 cards of the same suit in your selection (hello, pigeonhole principle!). In the above example these are 3 ♠ and 10 ♠. As per a convention decided upon beforehand by the assistant and the magician, the assistant holds back one of the cards of the repeated suit and, in the arrangement she hands to the magician, positions the other card of that suit first. Looking at the first card, the magician now only needs to guess the rank of the unseen card - its suit is the same as that of the first card.

    Look at the cards in step 3 above; 10 ♠ is the first card handed over. The magician knows that the 5th card is also a spade.

    But with the first card used to reveal just the suit, we have only 3 cards left to indicate the rank of the 5th card - which could be any of the remaining 12 ranks of that suit. And 3 cards can be arranged in only 3! = 6 ways.

    Ouch. Again.

  2. Since our strategy is already piggybacking on the repetition of suits, it's worth asking: among the cards of the repeated suit, is there a good way to pick the one the assistant slides into the first position?

    In the above example, this is equivalent to asking, between 3 ♠ and 10 ♠, which card should be put in the first position?

    Is there an optimal choice here? Optimal in a way that helps us work around the limitation we encountered above - i.e. can choosing the 1st card wisely let us use just 3 cards to indicate the rank of the 5th card?

    Turns out, the answer is yes. Here's a short, math-savvy hint: modular arithmetic.

    In simple words, here, this means that we do not directly indicate the rank of the 5th card, but we indicate what number you should add to the rank of the 1st card to obtain the rank of the 5th card. Except that we agree to a way of handling the addition when it goes beyond 13 - in such cases, we simply start counting at 1 again.

    This can be easily understood by imagining the ranks to be laid out in a circle. Adding x to a number y here is equivalent to going x steps around the circle, clockwise, starting from y.



    So, 10+5 is 2. 13 + 1 is 1. And, of course, 2 + 3 is 5.

    How does this help in working around our earlier limitation? Consider the shorter distance between any 2 points on the circle: since the two ways around the circle add up to 13, the shorter way is never greater than 6. This means that of any 2 points on the circle, I can always pick one such that you wouldn't need to add a number greater than 6 to reach the other. This is important, since the remaining 3 cards - the ones at the 2nd, 3rd and 4th positions - can be arranged in exactly 3! = 6 different ways.

    Take the points 3 and 10, for example (the ranks of our cards with the suit ♠). If you were given 3, you would need to add 7 to get to 10. But if you were given 10, you would only need to add 6 to get to 3. This is why the assistant passes 10 ♠ to the magician in our example, and not 3 ♠ (and arranges the other cards to indicate the number - 6 - to be added; we'll get to the arrangement in a while).

    Thus, it turns out that if we pick the right card of the repeated suit to hand to the magician, the ordering of the other 3 cards is sufficient to indicate the rank of the unseen card. This completes our strategy.

Here, to implement point 2, one could follow any system to encode numbers by orderings, as long as the magician and his assistant have agreed upon it. A simple way could be to use the ordering of suits and ranks used in the game of bridge:
  1. Ordering of suits, highest to lowest: ♠, ♥, ♦, ♣

  2. Within a suit, ordering of ranks, highest to lowest: Ace (A), King (K), Queen (Q), Jack (J), 10, 9, 8, 7, 6, 5, 4, 3, 2.
For our cards in positions 2nd to 4th ( 8 ♥, K ♦, A ♣), this implies the following encoding for all possible arrangements (assuming a "lower" ordering corresponds to a lower number):

  • A ♣, K ♦, 8 ♥ → 1

  • A ♣, 8 ♥, K ♦ → 2

  • K ♦, A ♣, 8 ♥ → 3

  • K ♦, 8 ♥, A ♣ → 4

  • 8 ♥, A ♣, K ♦ → 5

  • 8 ♥, K ♦, A ♣ → 6
Thus, looking at the ordering of these 3 cards, our magician adds 6 to the rank of the first card, which is 10. This tells him that the 5th card has rank 3 (remember, he already knows that its suit is ♠).
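
If you'd like to play with the scheme, here is a minimal Python sketch of the whole strategy (my own toy implementation, not from the article). Cards are (rank, suit) tuples, with ranks 1-13 (1 being the Ace) and suits ordered as in bridge:

    from itertools import permutations

    SUITS = ["C", "D", "H", "S"]  # clubs < diamonds < hearts < spades (bridge order)

    def card_key(card):
        # Total order on cards: bridge suit order first, then rank (Ace high).
        rank, suit = card
        return (SUITS.index(suit), 14 if rank == 1 else rank)

    def encode(five_cards):
        # Assistant: choose the hidden card and the ordering of the other four.
        by_suit = {}
        for card in five_cards:
            by_suit.setdefault(card[1], []).append(card)
        a, b = next(cards for cards in by_suit.values() if len(cards) >= 2)[:2]
        # Pick first/hidden so the clockwise distance first -> hidden is 1..6.
        if (b[0] - a[0]) % 13 <= 6:
            first, hidden, dist = a, b, (b[0] - a[0]) % 13
        else:
            first, hidden, dist = b, a, (a[0] - b[0]) % 13
        # Encode dist (1..6) as a permutation of the remaining three cards,
        # using lexicographic order over the bridge ordering, as in the post.
        rest = sorted((c for c in five_cards if c not in (first, hidden)), key=card_key)
        return [first, *list(permutations(rest))[dist - 1]], hidden

    def decode(four_cards):
        # Magician: recover the hidden card from the four cards shown.
        first, rest = four_cards[0], four_cards[1:]
        dist = list(permutations(sorted(rest, key=card_key))).index(tuple(rest)) + 1
        return ((first[0] + dist - 1) % 13 + 1, first[1])  # wrap around the 13-rank circle

    # The example from the post: the assistant is handed 3S, 10S, 8H, KD, AC.
    shown, hidden = encode([(3, "S"), (10, "S"), (8, "H"), (13, "D"), (1, "C")])
    print(shown)          # [(10, 'S'), (8, 'H'), (13, 'D'), (1, 'C')]
    print(hidden)         # (3, 'S')
    print(decode(shown))  # (3, 'S') - the magician names the hidden card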


In summary

The following image tries to sum up the reasoning of the magician:


You can blame Google for the crummy resizing.


Stage advice ...

Some practical advice from the article if you are indeed planning on performing the trick:

  1. If determining the number that the ordering of the 2nd, 3rd and 4th cards corresponds to, in the manner discussed above, seems a little clumsy, the author of the article suggests his own variation:

    "... place the smallest card in the left/middle/right position to encode 12/34/56 respectively, placing medium before or after large to indicate the first or second number in each pair"

    As an example, the ordering "A ♣, 8 ♥, K ♦" would imply the number 2, since the smallest card, A ♣, is in the first position (so this encodes either 1 or 2) and the medium card, K ♦, follows the large card, 8 ♥ (which makes it 2).

  2. This one comes straight from Fitch. If you are repeating the trick for an audience, say more than a few times, there is the risk that someone will notice that one of the cards of a repeated suit is always in the first position. In such a case, place the card that indicates the suit in the (i mod 4)th position for the ith performance (where position "0" stands for the 4th position).

    Thus, for the 1st performance this card is in position 1, for the 2nd performance this is in position 2, for the 7th performance, assuming your audience is still around and awake, this goes in at position 3 (since, 7 mod 4 = 3).

Generalization

If you go through the various steps above, the ideas involved seem to just about fit in: simple ordering seems deficient? - repeated suit to the rescue; 3 cards are insufficient to indicate a rank? - modular arithmetic alert.

A happy-go-lucky patchwork of clever ideas, as it were ...

... which, well, really is fair game. However, the key contribution of the article is that it convinces you this is not all there is to it: it analyzes these seemingly ad-hoc steps rigorously and suggests extensions of the trick. Unfortunately, this is not something I will go into in this post, since I want to keep it fairly non-technical. For those who are interested, going through the original article is strongly recommended. I intend to follow up this post with another one focused only on the generalizations, at a later point in time.



Friday, November 6, 2009

The Perfect Murder




A murder mystery is solved when the detective notices a circumstance that stands out from the rest. If circumstances were cocktail sticks, this is the perfect murder; given their flawless conformance - no one cocktail stick stands out! - our detective is at a loss to discover the culprit.


Source Of Image: alastairlevy.net