Hi,

I've had my eyes on MonetDB for quite some time and the NumPy implementation really pushed me over the edge (as in: this is the right solution for us). A big shout-out to Mark Raasveldt, both for the function itself and for the support and bug fixes along the way.

We have a fairly small dataset which probably won't exceed 10 million rows and 500 columns. We do flight data analysis, so each row can be thought of as a flight, and the columns are data points connected to it. We then use NumPy to prepare the data for graphs and limit our selection using standard SQL. We generally SELECT between 1 and 5 columns and use the WHERE clause on *another* 1 to 5 columns to select our rows.

However, as we develop more analyses we connect more and more data points (columns) to each row. There are a few ways we could deal with this:

1. Store the data in a column-less format (such as JSON) and re-create the MonetDB database whenever we need to add or remove columns. MonetDB is really fast at reading data in, and we have no problem automating this process. This is what we've done so far in our proof of concept (a sketch of the reload follows below).

2. Store everything in MonetDB and use schema alterations as new columns are added (requires altering a live database).

3. Split the data into multiple tables (one per analysis) and use JOIN to select from them. Instead of one huge table we get several small ones, but we still need schema alterations for anything besides adding an analysis.

4. Use a lookup-table layout with three columns: ID, NAME, VALUE. Instead of 10 million rows and 500 columns we would have 10 million × 500 rows, one row per data point. Since MonetDB excels at columnar storage, putting everything into one table with only three columns seems like a very bad fit. Nevertheless, it's a thought.

Before I spend more time doing proof-of-concepts on scenarios 2 and 3, I just wanted to ask: Am I missing something? Is there any obvious solution I am not covering? Is there any way to have "open" columns (i.e. dynamic columns instead of a fixed schema, where columns are created upon insert)?

Again, thank you so much for the work every one of you is doing on MonetDB. I hope to start contributing as I learn more (probably just adding documentation short-term, possibly committing fixes in the long-term).

Regards,
Niklas
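For readers following along, the reload in option 1 boils down to MonetDB's bulk loader. A minimal sketch, assuming a CSV dump generated from the JSON source; the table, columns, and file path are hypothetical:

    -- Re-create the table whenever the column set changes,
    -- then bulk-load from the flat file generated from the JSON source.
    DROP TABLE flights;
    CREATE TABLE flights (
        flight_id       INTEGER,
        departure_delay DOUBLE,
        fuel_burn       DOUBLE
        -- ... one column per data point
    );

    -- MonetDB's bulk loader; declaring the record count up front
    -- lets it preallocate storage and speeds up the load.
    COPY 10000000 RECORDS INTO flights
    FROM '/tmp/flights.csv'
    USING DELIMITERS ',', '\n', '"';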
On Nov 24, 2015, at 13:08, Niklas Bivald wrote:

> Hi,
> I've had my eyes on MonetDB for quite some time and the NumPy implementation really pushed me over the edge (as in: this is the right solution for us). A big shout-out to Mark Raasveldt, both for the function itself and for the support and bug fixes along the way. We have a fairly small dataset which probably won't exceed 10 million rows and 500 columns. We do flight data analysis, so each row can be thought of as a flight, and the columns are data points connected to it. We then use NumPy to prepare the data for graphs and limit our selection using standard SQL. We generally SELECT between 1 and 5 columns and use the WHERE clause on *another* 1 to 5 columns to select our rows.
> However, as we develop more analyses we connect more and more data points (columns) to each row. There are a few ways we could deal with this:
> 1. Store the data in a column-less format (such as JSON) and re-create the MonetDB database whenever we need to add or remove columns. MonetDB is really fast at reading data in, and we have no problem automating this process. This is what we've done so far in our proof of concept.
>
> 2. Store everything in MonetDB and use schema alterations as new columns are added (requires altering a live database).
Hai Niklas,

Whether this is a good idea for your application depends on how often you need to add new columns, and on the level of concurrency. A schema alteration will at least block the whole table, if not the whole schema (I'm not so sure about this).
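For concreteness, option 2 would amount to one ALTER per new data point, each of which takes the lock described above. A sketch with a hypothetical column name:

    -- Add a new data point to the live table; existing rows get NULL.
    ALTER TABLE flights ADD COLUMN turbulence_index DOUBLE;

    -- Backfill separately if the analysis needs a default value.
    UPDATE flights SET turbulence_index = 0.0
    WHERE turbulence_index IS NULL;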
> 3. Split the data into multiple tables (one per analysis) and use JOIN to select from them. Instead of one huge table we get several small ones, but we still need schema alterations for anything besides adding an analysis.
Same as above. Depending on your queries, the JOINs may not be a problem.
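A sketch of what option 3's layout could look like, with one narrow table per analysis keyed on a shared flight id (all names hypothetical); adding an analysis then becomes a CREATE TABLE rather than an ALTER TABLE:

    CREATE TABLE flights (
        flight_id       INTEGER PRIMARY KEY,
        departure_delay DOUBLE
    );
    CREATE TABLE fuel_analysis (
        flight_id   INTEGER,
        fuel_burn   DOUBLE,
        fuel_margin DOUBLE
    );

    -- Typical query shape: a few columns out, a few columns filtered.
    SELECT f.departure_delay, a.fuel_burn
    FROM flights f
    JOIN fuel_analysis a ON a.flight_id = f.flight_id
    WHERE a.fuel_margin < 0.1;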
> 4. Use a lookup-table layout with three columns: ID, NAME, VALUE. Instead of 10 million rows and 500 columns we would have 10 million × 500 rows, one row per data point. Since MonetDB excels at columnar storage, putting everything into one table with only three columns seems like a very bad fit. Nevertheless, it's a thought.
Largely depends on your queries. MonetDB doesn't have a preference for the number of columns in a table, but if your table is extremely long, it might take some time (also depending on whether and how the data are sorted) for the first select to filter out the rows actually needed by the query...
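To make that trade-off concrete: in the ID/NAME/VALUE (entity-attribute-value) layout of option 4, 10 million flights times 500 data points is 5 billion rows, and a query that filters on one data point while returning another turns into a self-join (names hypothetical):

    CREATE TABLE datapoints (
        id    INTEGER,
        name  VARCHAR(64),
        value DOUBLE
    );

    -- "SELECT fuel_burn WHERE departure_delay > 30" becomes:
    SELECT v.value AS fuel_burn
    FROM datapoints v
    JOIN datapoints w ON w.id = v.id
    WHERE v.name = 'fuel_burn'
      AND w.name = 'departure_delay'
      AND w.value > 30;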
> Before I spend more time doing proof-of-concepts on scenarios 2 and 3, I just wanted to ask:
> Am I missing something? Is there any obvious solution I am not covering? Is there any way to have "open" columns (i.e. dynamic columns instead of a fixed schema, where columns are created upon insert)?
AFAIK, this is not on the agenda.
> Again, thank you so much for the work every one of you is doing on MonetDB. I hope to start contributing as I learn more (probably just adding documentation short-term, possibly committing fixes in the long-term).
Thank you very much for trying MonetDB. Any contribution will be highly appreciated.

Regards,
Jennie
> Regards,
> Niklas