Hi Jiri,

Have you thought about using the gap statistic to find the natural number of clusters?

It will add some time to your calculation, but the bootstrapping phase, which estimates the expected within-cluster sum of squares under a null reference distribution, can be easily parallelised.
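For concreteness, a minimal sketch of the gap statistic (Tibshirani et al.'s formulation, with uniform reference sets drawn from the bounding box of the data) using NumPy and scikit-learn; the function names and parameters here are illustrative, not from the thread. The reference fits in the loop are independent, which is the part that parallelises trivially:

```python
import numpy as np
from sklearn.cluster import KMeans

def within_ss(X, k, seed=0):
    """Total within-cluster sum of squares for a k-means fit."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return km.inertia_

def gap_statistic(X, k, n_refs=10, seed=0):
    """Gap(k) = mean_b log(W*_kb) - log(W_k), where the W*_kb come from
    uniform reference data drawn over the bounding box of X."""
    rng = np.random.default_rng(seed)
    log_wk = np.log(within_ss(X, k, seed))
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Each reference fit is independent -> easy to farm out to workers.
    ref_log_w = [np.log(within_ss(rng.uniform(lo, hi, size=X.shape), k, seed))
                 for _ in range(n_refs)]
    return np.mean(ref_log_w) - log_wk
```

The usual rule is then to pick the smallest k such that Gap(k) >= Gap(k+1) minus the standard error of the reference dispersions at k+1.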

Best regards,

Alastair




---- Jiří Nádvorník wrote ----

Hi All,

 

I’m trying to solve a clustering problem on my astronomical data: repeated observations of stars on images, with random errors, which effectively produces clusters with Gaussian distributions. The clusters contain from 1 to roughly 1,000 points each and are all approximately the same size (~1 arcsec). So much for the introduction.

 

I have already tried simple methods on this problem, with poor results. Our regions are very dense, and the size of the clusters is often (for roughly 20% of the observations) comparable to the distances between them. Overlaps occur as well, of course.

 

So I decided to use a more robust algorithm and compare the results. Given the size of our dataset (roughly 4e8 observations, roughly 8e6 real objects), I am also counting on massive parallelization of the algorithm, with the focus on saving wall-clock time.

 

I have one simple question for you. What algorithm would you use for the clustering?

 

I would like to use K-means or EM for Gaussian mixtures, because the data are naturally Gaussian distributed. But then I would have to specify the number of clusters in advance – what algorithm would you use for that?
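One common way to choose k for a Gaussian mixture is to fit several candidate values and keep the one minimising BIC. A minimal scikit-learn sketch of that idea (the helper name and the k range are illustrative assumptions, not part of the original question):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def best_k_by_bic(X, k_max=10, seed=0):
    """Fit GMMs for k = 1..k_max and return the k with the lowest BIC.

    BIC penalises extra components, so it trades fit quality against
    model complexity instead of always preferring larger k."""
    bics = []
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        bics.append(gmm.bic(X))
    return int(np.argmin(bics)) + 1
```

For 4e8 observations this scan would of course be run per spatial region, not globally, and the fits for different k are independent and can run in parallel.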

 

Or should I use a more complicated Bayesian algorithm that does not need the number of clusters specified in advance?
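A Dirichlet-process mixture is one such Bayesian option: it takes only an upper bound on the number of components and drives the weights of unneeded ones toward zero. A minimal sketch with scikit-learn's BayesianGaussianMixture (a tooling assumption for illustration, not a recommendation from the thread):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def dp_cluster(X, max_components=20, seed=0):
    """Fit a truncated Dirichlet-process GMM. max_components is only a
    cap: components with negligible posterior weight are effectively
    switched off, so the number of clusters is inferred from the data."""
    bgm = BayesianGaussianMixture(
        n_components=max_components,
        weight_concentration_prior_type="dirichlet_process",
        random_state=seed,
    ).fit(X)
    return bgm.predict(X), bgm.weights_
```

The variational fit is more expensive per point than plain K-means, so on a 4e8-point dataset it would likely only be practical per region after a cheap spatial pre-partitioning.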

 

Thank you very much for your answers.

 

Jiri Nadvornik

Astronomical Institute AV CR

Stellar department

Czech Republic

nadvornik.ji@gmail.com