Clustering

3 Mar 2015

Hi All,

I'm trying to solve a clustering problem on my astronomical data. The data
are repeated observations of stars on images, with random errors, so each
star effectively forms a cluster with a Gaussian distribution. These clusters
contain from 1 to about 1000 points and are all approximately the same size
(~1 arcsec). So much for the introduction.

I already tried to solve this problem with simple methods, but with poor
results. Our regions are very dense, and the size of the clusters is often
(for about 20 % of the observations) comparable to the distances between
them. Overlaps, of course, occur too.

So I decided to use a more robust algorithm and compare the results. Given
the size of our dataset (about 4e8 observations of about 8e6 real objects),
I am also counting on massive parallelization of the algorithm, with the
goal of reducing wall-clock time.

I have one simple question for you. What algorithm would you use for the
clustering? 

I would like to use K-means or EM Gaussian mixtures, because the data are
naturally Gaussian distributed. But then I would have to specify the number
of clusters in advance. What algorithm would you use for that?
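
For illustration, one common way to pick the number of clusters is to fit
the model for several candidate values and keep the one with the lowest
Bayesian information criterion (BIC). Below is a minimal pure-Python sketch
on synthetic data, not something tuned for the full dataset. It assumes a
known, shared isotropic sigma (motivated by the roughly equal cluster sizes)
and uses k-means hard assignments as a stand-in for full EM. The function
names `kmeans`, `bic` and `choose_k` and all numeric values are made up for
the example.

```python
import math
import random


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point seeding."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing centre.
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster went empty
                centers[j] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return centers, labels


def bic(points, centers, labels, sigma):
    """BIC of a 2D isotropic Gaussian mixture with known sigma.

    Uses the hard-assignment (classification) likelihood, which is an
    approximation to the full mixture likelihood.
    """
    n = len(points)
    counts = [labels.count(j) for j in range(len(centers))]
    log_lik = 0.0
    for p, lab in zip(points, labels):
        d2 = (p[0] - centers[lab][0]) ** 2 + (p[1] - centers[lab][1]) ** 2
        log_lik += (math.log(counts[lab] / n)
                    - math.log(2 * math.pi * sigma ** 2)
                    - d2 / (2 * sigma ** 2))
    # Free parameters: two coordinates per centre plus the mixing weights.
    n_params = 2 * len(centers) + (len(centers) - 1)
    return -2 * log_lik + n_params * math.log(n)


def choose_k(points, k_max, sigma):
    """Return the k in 1..k_max with the lowest BIC, plus all scores."""
    scores = {}
    for k in range(1, k_max + 1):
        centers, labels = kmeans(points, k)
        scores[k] = bic(points, centers, labels, sigma)
    return min(scores, key=scores.get), scores


# Synthetic test field: three well-separated "stars", 50 observations each.
rng = random.Random(0)
true_centers = [(0.0, 0.0), (20.0, 0.0), (0.0, 20.0)]
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
          for cx, cy in true_centers for _ in range(50)]
best_k, scores = choose_k(points, k_max=6, sigma=1.0)
print("selected number of clusters:", best_k)
```

Because every candidate value has to be fitted separately, a scan like this
is only realistic on small patches of sky, not on the whole catalogue at
once.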

Or should I use some more complicated Bayesian algorithm that does not need
the number of clusters as input?
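
As one concrete example of that second kind of method, here is a minimal
sketch of DP-means, a k-means-like algorithm obtained as a hard-assignment
limit of a Dirichlet-process Gaussian mixture. Instead of a cluster count it
takes a squared-distance threshold, and it opens a new cluster whenever a
point is farther than that from every existing centre. The threshold
`lam = 8.0 ** 2`, the function name `dp_means` and the synthetic data are
assumptions made for the example. With overlapping clusters the threshold
would need real tuning.

```python
import math
import random


def dp_means(points, lam, iters=100):
    """DP-means: k-means that opens a new cluster when a point is too far.

    lam is the squared-distance threshold (an assumed tuning parameter).
    """
    n = len(points)
    # Start with a single cluster at the global mean.
    centers = [(sum(x for x, _ in points) / n, sum(y for _, y in points) / n)]
    labels = [-1] * n
    for _ in range(iters):
        changed = False
        for i, (x, y) in enumerate(points):
            d2 = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
            j = min(range(len(centers)), key=d2.__getitem__)
            if d2[j] > lam:
                # Too far from every centre: this point seeds a new cluster.
                centers.append((x, y))
                j = len(centers) - 1
            if labels[i] != j:
                labels[i] = j
                changed = True
        if not changed:
            break
        # Move each non-empty cluster's centre to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = (sum(px for px, _ in members) / len(members),
                              sum(py for _, py in members) / len(members))
    # Drop clusters that ended up empty and renumber the labels.
    used = sorted(set(labels))
    remap = {old: new for new, old in enumerate(used)}
    return [centers[j] for j in used], [remap[lab] for lab in labels]


# Synthetic test field: three well-separated "stars", 50 observations each.
rng = random.Random(0)
true_centers = [(0.0, 0.0), (20.0, 0.0), (0.0, 20.0)]
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
          for cx, cy in true_centers for _ in range(50)]
centers, labels = dp_means(points, lam=8.0 ** 2)
print(len(centers), "clusters found")
```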

Thank you very much for your answers.

Jiri Nadvornik

Astronomical Institute AV CR 

Stellar department

Czech Republic

nadvornik.ji@gmail.com
