Saturday, December 19, 2015

Clustering: ELKI over Weka

I think I have found my favorite tool when it comes to clustering - ELKI. Lots of algorithms, minimal interface, and well implemented. I have a few cribs about the UI/UX, but overall its one of the best options out there. It also helps that it has a great implementation of one of my favorite algorithms - OPTICS.


If you are in the habit of using Weka, for small tasks or maybe because you started off with it, be wary of using it for clustering. Recently,I had to use Weka for cluster analysis for legacy reasons, and I am far from being a happy customer. This short post provides some instances where Weka and ELKI gave different results on a problem.

DBSCAN

The following images show the clusters discovered in the same dataset using eps = 0.5 and minpts = 5 (these are parameters to the DBSCAN algorithm). The original dataset has 31 clusters. Different clusters discovered are represented by different colours.


Left: Weka - 1 cluster, right: ELKI - 24 clusters
As you can see, for the same settings of parameters Weka thinks of the whole dataset as one cluster, while ELKI discovers 24 clusters. In fact this happens to be true for a range of these parameters. The following heatmaps show the number of clusters as eps and minpts are varied.

Number of clusters. Left: Weka, right: ELKI
Weka sees a maximum of 2 clusters - as denoted by the white strip to the extreme left of the plot. ELKI sees quite some variation - at the bottom left there are around 3000 clusters, while for the rest of the plot it is close to 1 (I know it looks like 0, but the shades for 0 and 1 aren't distinguishable visually and the minimum number of clusters can be 1).

If you think about it what ELKI reports makes sense - at some setting of the parameters you would expect each point in the dataset to be a cluster. Since there are 3100 points in this dataset, ELKI sees as many clusters in the bottom left corner. Hence, Weka seems to report incorrect results.

I tried comparing Weka and ELKI on another dataset, this time comparing cluster purity for a range of  parameters. Again, in the case of Weka we see that the purity doesn't vary much, whereas for ELKI it varies over a wide range (as you'd expect).

Cluster purity. Left: Weka, right: ELKI
The responses to this question on stackoverflow suggest that this happens in Weka because automatic normalization of distances is imposed. Even if this is true, the implementation is incorrect - normalization here should not be a default and the documentation does not warn you.

k-means

I was a little surprised that the k-means implementation in Weka is buggy. This is one of the simpler algorithms out there!

On the same dataset I've used above, running k-means with k=32 doesn't terminate. It does not terminate even when you set the maximum number of iterations to 500. I had the clustering run for around 40 min before I decided to kill it.

ELKI gave me this in under a second:


k-means, with k = 32
Unlike the problems with DBSCAN, this problem seems to be dataset-specific. For a bunch of datasets I tried, k-means worked as expected.

Performance

I guess a discussion on performance is moot if your results are incorrect. But here are some numbers (from the ELKI website) further strengthening the case for ELKI. I have highlighted DBSCAN and OPTICS since those were the algorithms I was interested in. The other numbers are equally impressive.

ELKI Benchmarks

With this we come to end of this short post. If you haven't used ELKI yet, I hope I have convinced you to give it a shot!



No comments :