Finding common clusters

Tag: algorithm , cluster-analysis , data-mining Author: zj532424029 Date: 2014-04-01

I've come across the following problem:

Imagine we have a set of n samples that we want to classify into k classes labeled 1 to k. We run M different clustering algorithms and get M different outputs. The catch is that the same cluster can be given a different label in each output.

How do we find the common clusters between all outputs? The obvious solution, I think, is to run over all possible pairs of samples, checking whether they are classified the same in each output. That gives a complexity of O(n^2*M).

Can we do better (maybe if we add some assumptions)?



I'll give an example. We have 4 samples, k=2, M=3, and we got the following outputs (rows are samples, columns are outputs):

A 1 1 2
B 1 1 2
C 2 2 1
D 1 1 1

Then the only common cluster is (A,B), since it's the only pair that is classified the same in all outputs.
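A minimal sketch of the naive O(n^2*M) pairwise check described above, run on the example data (the variable and function names are my own, not from the question):

```python
from itertools import combinations

# Each row gives one sample's labels across the M = 3 outputs (the example above).
labels = {"A": (1, 1, 2), "B": (1, 1, 2), "C": (2, 2, 1), "D": (1, 1, 1)}

def together_in_all(labels):
    """O(n^2 * M): a pair is 'common' iff the two samples share a label in every output."""
    pairs = []
    for x, y in combinations(sorted(labels), 2):
        if all(a == b for a, b in zip(labels[x], labels[y])):
            pairs.append((x, y))
    return pairs

print(together_in_all(labels))  # [('A', 'B')]
```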

Other Answer1

For each sample, think of the outputs of the M clustering algorithms as the M characters of a string. Now you have n strings of length M and you need to find duplicates. One practical way of doing this is to compute a hash code for each string - in fact you could build a table mapping each hash code to the list of strings with that hash code. Strings with different hash codes must be different. If you have a collection of strings with the same hash code, compare each of them with the first string in the collection. If they are all the same, you have quickly confirmed that the hash code did not produce misleading collisions. If they are not all the same, you have one sub-collection equal to the first string, and another sub-collection on which you repeat the comparisons.

If the hash code does not produce misleading collisions you can divide the strings into clusters in linear time. If it does, you may take quadratic time as above.
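A short sketch of the grouping this answer describes, using a Python dict (whose keys are hashed exactly as described, with collisions resolved by comparison), on the question's example data:

```python
# Each sample's "string" is its tuple of labels across the M outputs.
# Two samples share a common cluster iff their tuples are identical.
labels = {"A": (1, 1, 2), "B": (1, 1, 2), "C": (2, 2, 1), "D": (1, 1, 1)}

groups = {}
for sample, signature in labels.items():
    groups.setdefault(signature, []).append(sample)

# Every group with more than one member is a common cluster.
common = [g for g in groups.values() if len(g) > 1]
print(common)  # [['A', 'B']]
```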

A linear time solution which may be impractical is to concatenate the strings, separating them by a character not seen so far, and then run a linear-time suffix tree or suffix array creation program. This will sort the strings into tree or array order, and you can then find divisions between clusters by going through the strings in order comparing each string with the next string in order.
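Suffix tree/array construction is involved to implement; as a simpler stand-in that produces the same final grouping (at O(n*M*log n) rather than linear time), one can plainly sort the n signatures and split wherever adjacent entries differ, as the last sentence describes:

```python
labels = {"A": (1, 1, 2), "B": (1, 1, 2), "C": (2, 2, 1), "D": (1, 1, 1)}

# Sort samples by signature, then split wherever adjacent signatures differ.
ordered = sorted(labels, key=labels.get)
clusters, current = [], [ordered[0]]
for prev, nxt in zip(ordered, ordered[1:]):
    if labels[prev] == labels[nxt]:
        current.append(nxt)        # same signature: same cluster
    else:
        clusters.append(current)   # signature changed: close the cluster
        current = [nxt]
clusters.append(current)
print(clusters)  # [['D'], ['A', 'B'], ['C']]
```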


Thanks, but if I understand you correctly, lines 1 and 3 in my example will result in different hashes, even though we would like them to be treated as similar.
Can you explain what lines 1 and 3 are and why you want them to be the same? In the question you label the examples A, B, C, and D, and you say that the only common cluster is A and B. My solution will say that A is "112" and B is "112", so these are the same, but C is "221" and D is "111", and these are different.
These lines mean that A and B were classified together in the same cluster in all outputs, although they were assigned a different label.

Other Answer2

Min-Hash can be used to efficiently estimate the similarity between two clusters. Its time is linear in the number of elements, so it'd bring your runtime down to O(n*k^2*j) (where j is the number of hash functions used by Min-Hash, and higher values give more accurate results).
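A rough sketch of a Min-Hash estimate of the Jaccard similarity between two clusters (the random affine hash family and all names here are my own choices for illustration, not a specific library's API):

```python
import random

def minhash_similarity(set_a, set_b, j=128, seed=0):
    """Estimate Jaccard(set_a, set_b) using j random hash functions, O(j * |set|)."""
    rng = random.Random(seed)
    p = 2_147_483_647  # a Mersenne prime, used as the hash modulus
    matches = 0
    for _ in range(j):
        a, b = rng.randrange(1, p), rng.randrange(0, p)
        h = lambda x: (a * hash(x) + b) % p
        # The probability that the two sets share the same minimum hash
        # equals their Jaccard index, so the match rate estimates it.
        if min(map(h, set_a)) == min(map(h, set_b)):
            matches += 1
    return matches / j

print(minhash_similarity({1, 2, 3, 4}, {1, 2, 3, 4}))  # 1.0
```

Higher j gives a more accurate estimate, matching the O(n*k^2*j) trade-off mentioned above.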


Thanks. I'm a bit familiar with such methods; however, I was hoping to find an exact algorithm.

Other Answer3

From what I get, you need to check whether any two outputs are actually structurally similar, but you can only think of an O(n^2) algorithm to do that. If your problem is the above one, then an optimization is as follows :-

Pseudo Code :-

int arr1[] = {1, 1, 2, 2};   // labels from output 1
int arr2[] = {2, 2, 1, 1};   // labels from output 2

list sets1[k];   // sets1[c] holds the samples labeled c+1 in output 1
list sets2[k];   // likewise for output 2

for(int i=0;i<n;i++) {
    // the cluster containing sample i must hold the same samples in both outputs
    if(!compare(sets1[arr1[i]-1], sets2[arr2[i]-1]))
        return false;
}
return true;

Time Complexity :-

The compare function visits each element of arr1 and arr2 at most once, hence the whole check is O(n).

Edit :-

Further, if you need to find all outputs that are structurally similar in less than O(M^2*n) time, then :-

1. Calculate the sets for all M outputs.
2. Calculate a hash for each set using a standard hash function.
3. If two sets are equal, then their hashes are also equal with high probability.
4. Sort the k hashes of each output in O(k*logk).
5. Group equivalent outputs using a hash map in O(M*logM).

Overall Complexity :- O(n*M) for the set calculation, O(M*k*logk) for sorting the hashes, and O(M*logM) for grouping similar outputs, hence O(M*(n + k*logk + logM))
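The five steps can be sketched as follows; here Python's order-independent hashing of a frozenset stands in for steps 2-4 (hash each cluster, then combine the k cluster hashes in an order-independent way), and all names are my own:

```python
def canonical(output):
    """Steps 1-4: the label-independent form of one output - its set of clusters."""
    clusters = {}
    for sample, label in enumerate(output):
        clusters.setdefault(label, set()).add(sample)
    return frozenset(frozenset(c) for c in clusters.values())

# Step 5: group the M outputs by canonical form using a hash map.
outputs = [[1, 1, 2, 2], [2, 2, 1, 1], [1, 2, 1, 2]]
groups = {}
for idx, out in enumerate(outputs):
    groups.setdefault(canonical(out), []).append(idx)

# Outputs 0 and 1 partition the samples identically despite swapped labels.
print(list(groups.values()))  # [[0, 1], [2]]
```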


My problem is to find the samples that are structurally the same in all outputs. Theoretically there can be none. Is that what your code does? Thanks.
@Roy If by "structurally the same" you mean that every element is grouped with the same elements in both outputs, then the code finds such similarity.

Other Answer4

Analyze the clusters, not the data points, for example by computing the contingency table.

This will easily get you down to O(M*k*k*n) plus O(n log n) for sorting your cluster contents once, if you don't have them presorted already (for efficient intersection).
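A minimal sketch of the contingency table between two clusterings (sample counts per pair of labels; function name is my own):

```python
from collections import Counter

def contingency(labels_a, labels_b):
    """table[(i, j)] = number of samples with label i in A and label j in B."""
    return Counter(zip(labels_a, labels_b))

# Label-swapped but structurally identical clusterings: all mass sits on two cells.
table = contingency([1, 1, 2, 2], [2, 2, 1, 1])
print(table[(1, 2)], table[(2, 1)])  # 2 2
```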

I don't think it pays off to analyze k-means results though. On reasonably complex data, they are only as good as random convex partitionings.


Thanks. I'm not really sure, though, what you mean by "analyze the clusters". I'm also curious about your other comment, although I'm not exactly analyzing k-means results.
Compare clusters via the contingency table; do not compare single points.