Hi All,
I’m trying to solve a clustering problem on my astronomical data. The data are simply repeated observations of stars on images, with random errors, so they effectively form clusters with Gaussian distributions. These clusters contain from 1 to roughly 1000 points and are approximately the same size (~1 arcsec). So much for the introduction.
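For concreteness, here is a minimal Python sketch of mock data of that shape; the field size, error scale, and point counts are purely illustrative assumptions, not our real catalogue:

import numpy as np

rng = np.random.default_rng(42)

n_objects = 50                                     # "true" number of sources (assumption)
centres = rng.uniform(0, 60, size=(n_objects, 2))  # centres in a 60x60 arcsec field (assumption)
sigma = 0.5                                        # per-axis Gaussian error in arcsec (assumption)

points = []
for c in centres:
    n_obs = rng.integers(1, 100)                   # detections per object; real data: 1 to ~1000
    points.append(c + rng.normal(scale=sigma, size=(n_obs, 2)))
observations = np.vstack(points)                   # flat list of detections to be clustered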
I have already tried to solve this problem with simple methods, but with poor results. Our regions are very dense, and the size of the clusters is often (for roughly 20 % of the observations) comparable to the distances between them. Overlaps are, of course, happening too.
So I decided to use a more robust algorithm and compare the results. Given the size of our dataset (roughly 4e8 observations, roughly 8e6 real objects), I am also counting on massive parallelization of the algorithm, with the focus on saving wall-clock time.
I have one simple question for you. What algorithm would you use for the clustering?
I would like to use K-means or EM Gaussian mixtures, because the data are naturally Gaussian distributed. But then I would have to specify the number of clusters in advance – what algorithm would you use for that?
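One common way to avoid fixing that number by hand is to fit Gaussian mixtures for a range of candidate component counts and keep the fit with the lowest BIC. A minimal scikit-learn sketch, assuming the observations array from the snippet above and an illustrative candidate range:

from sklearn.mixture import GaussianMixture

candidates = range(40, 61)                             # candidate cluster counts (illustrative)
fits = [GaussianMixture(n_components=k,
                        covariance_type="spherical",   # clusters are round and of similar size
                        random_state=0).fit(observations)
        for k in candidates]
best = min(fits, key=lambda gm: gm.bic(observations))
labels = best.predict(observations)
print("BIC-selected number of clusters:", best.n_components)

Fitting one mixture per candidate count obviously will not scale to 4e8 points in one go, so in practice something like this would have to run per sky region, which at least parallelizes naturally.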
Or should I use some more complicated Bayesian algorithm that does not need the number of clusters specified up front?
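A sketch of that direction is scikit-learn's BayesianGaussianMixture with a Dirichlet-process prior, which only needs an upper bound on the number of components and leaves the unused ones with negligible weight; the cap of 100 below is an assumption and would be far too small for a real field with ~8e6 objects:

from sklearn.mixture import BayesianGaussianMixture
import numpy as np

dpgmm = BayesianGaussianMixture(
    n_components=100,                      # upper bound only, not the true count
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="spherical",
    max_iter=500,
    random_state=0,
).fit(observations)

labels = dpgmm.predict(observations)
print("Effective number of clusters:", np.unique(labels).size)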
Thank you very much for your answers.
Jiri Nadvornik
Astronomical Institute AV CR
Stellar department
Czech Republic