[MonetDB-users] General Questions on Columnar DB Design and Querying
Hi all,

I've managed to get MonetDB running and loaded with a chunk of data (thanks to some great help from the list) and am now trying to figure out the "right" way to make use of it. I don't have a whole lot of exposure to columnar stores, and am therefore not sure if I'm doing things in an idiomatically correct way. I have a couple of general questions about columnar DBs, and hope it's appropriate to ask them here even though they aren't specific to MonetDB. If these questions aren't considered appropriate for this list, I apologize in advance and will happily move them to a different forum (but would appreciate pointers to where such a forum might be).

Two questions are general, and one is fairly specific. First the general:

1. Does working with a columnar store fundamentally change the way one would structure the tables? I'm somewhat familiar with star-schema designs, where a central fact table holds the entire attribute set as rows of indexes, with supporting lookup tables for each column type, but it seems like columnar stores achieve the same thing without actually requiring the developer to specifically build those structures. Tables are constructed in a de-normalized RDBMS manner and the underlying DB takes care of the vectorization. Is this a valid description, or am I fundamentally misunderstanding the basics?

2. If there are significant differences in the way columnar stores should be designed, can folks point me to any resources (web, book, article, etc.) to help educate me on the topic?

Now the specific: I have several sets of time-series data modeled as follows:

    create table sensordata (
        sensor varchar(32),
        epoch_ts bigint,
        reading double,
        CONSTRAINT SENSORDATA_PKEY PRIMARY KEY (sensor, epoch_ts)
    );

I've populated that data, and now want to query data for a subset of sensors and generate a table that shows the value of the sensors at each point in time. I'm not sure what the right way to do this is. My desired output would be:

    epoch_ts   sensor1   sensor2   delta
    --------   -------   -------   -----
    <time1>    23.5      24.1       0.6
    <time2>    19.3      25.2       5.9
    ...
    <timeN>    25.1      19.3      -5.8

I tried to build that query as follows but end up with NULL results (note: I'm in no way a SQL pro, so it might be really ugly):

    select a.epoch_ts as epoch,
           a.reading as sensor1_reading,
           b.reading as sensor2_reading,
           b.reading - a.reading as delta
    from (select epoch_ts, reading from sensordata where sensor='sensor1') a,
         (select epoch_ts, reading from sensordata where sensor='sensor2') b
    where b.epoch_ts = a.epoch_ts
    order by epoch asc;

Is this a sane way to approach the problem, or am I completely off track here? Each subselect returns results, but they obviously aren't joining correctly, and it feels like it might simply be the wrong way to try to build the result set. I appreciate any direction you might provide. Thanks!
Hi Sean, On Sat, Nov 12, 2011 at 07:00:18PM -0600, Sean McNamara wrote:
> Hi all,
> I've managed to get MonetDB running and loaded with a chunk of data (thanks to some great help from the list) and am now trying to figure out the "right" way to make use of it. I don't have a whole lot of exposure to columnar stores, and am therefore not sure if I'm doing things in an idiomatically correct way. I have a couple of general questions about columnar DBs, and hope it's appropriate to ask them here even though they aren't specific to MonetDB. If these questions aren't considered appropriate for this list, I apologize in advance and will happily move them to a different forum (but would appreciate pointers to where such a forum might be).
>
> Two questions are general, and one is fairly specific. First the general:
>
> 1. Does working with a columnar store fundamentally change the way one would structure the tables? I'm somewhat familiar with star-schema designs, where a central fact table holds the entire attribute set as rows of indexes, with supporting lookup tables for each column type, but it seems like columnar stores achieve the same thing without actually requiring the developer to specifically build those structures. Tables are constructed in a de-normalized RDBMS manner and the underlying DB takes care of the vectorization. Is this a valid description, or am I fundamentally misunderstanding the basics?
In general, the physical storage (and processing) model of a DBMS should not influence the logical schema design. The latter should IMHO mainly be driven by application characteristics and requirements. Schema normalization in general, and star schemas in particular, aim at avoiding redundant storage of data. This is IMHO independent of whether the underlying DBMS uses a row-store or a column-store model. De-normalization results in redundant storage of identical data with both row-stores and column-stores, which is usually not desirable.
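To make the normalization point concrete with the sensor example from this thread: the sensordata table stores the sensor name redundantly with every reading, and a normalized variant would factor it out into its own table. This is only a sketch; the table and column names (sensors, readings, sensor_id) are illustrative, not from the thread:

```sql
-- Sensor names stored once, referenced by a compact key.
create table sensors (
    sensor_id int primary key,
    name      varchar(32)
);

-- One row per reading; the varchar name is not repeated here.
create table readings (
    sensor_id int,
    epoch_ts  bigint,
    reading   double,
    CONSTRAINT READINGS_PKEY PRIMARY KEY (sensor_id, epoch_ts),
    FOREIGN KEY (sensor_id) REFERENCES sensors (sensor_id)
);
```

Whether this pays off depends on the workload, not on the row- vs. column-store question: the redundancy exists (or is avoided) in either storage model.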
> 2. If there are significant differences in the way columnar stores should be designed, can folks point me to any resources (web, book, article, etc.) to help educate me on the topic?
Generally not (see above). Specific cases might benefit from specific designs, but that depends on the actual DBMS, the application needs, data and workload characteristics, and can thus hardly be generalized.
> Now the specific:
> I have several sets of time-series data modeled as follows:
>
>     create table sensordata (
>         sensor varchar(32),
>         epoch_ts bigint,
>         reading double,
>         CONSTRAINT SENSORDATA_PKEY PRIMARY KEY (sensor, epoch_ts)
>     );
>
> I've populated that data, and now want to query data for a subset of sensors and generate a table that shows the value of the sensors at each point in time. I'm not sure what the right way to do this is. My desired output would be:
>
>     epoch_ts   sensor1   sensor2   delta
>     --------   -------   -------   -----
>     <time1>    23.5      24.1       0.6
>     <time2>    19.3      25.2       5.9
>     ...
>     <timeN>    25.1      19.3      -5.8
>
> I tried to build that query as follows but end up with NULL results (note: I'm in no way a SQL pro, so it might be really ugly):
>
>     select a.epoch_ts as epoch,
>            a.reading as sensor1_reading,
>            b.reading as sensor2_reading,
>            b.reading - a.reading as delta
>     from (select epoch_ts, reading from sensordata where sensor='sensor1') a,
>          (select epoch_ts, reading from sensordata where sensor='sensor2') b
>     where b.epoch_ts = a.epoch_ts
>     order by epoch asc;
>
> Is this a sane way to approach the problem, or am I completely off track here? Each subselect returns results, but they obviously aren't joining correctly, and it feels like it might simply be the wrong way to try to build the result set.
works fine with me (see below) --- provided the time stamps are indeed identical for both sensors ...

sql>create table sensordata (
more>    sensor varchar(32),
more>    epoch_ts bigint,
more>    reading double,
more>    CONSTRAINT SENSORDATA_PKEY PRIMARY KEY (sensor, epoch_ts)
more>);
operation successful (237.863ms)
sql>insert into sensordata values ('sensor1',1,1.1);
1 affected row (52.619ms)
sql>insert into sensordata values ('sensor1',2,2.1);
1 affected row (80.506ms)
sql>insert into sensordata values ('sensor1',3,3.1);
1 affected row (21.852ms)
sql>insert into sensordata values ('sensor1',4,5.1);
1 affected row (23.926ms)
sql>insert into sensordata values ('sensor2',1,1.2);
1 affected row (22.177ms)
sql>insert into sensordata values ('sensor2',2,2.2);
1 affected row (48.279ms)
sql>insert into sensordata values ('sensor2',3,3.2);
1 affected row (23.667ms)
sql>insert into sensordata values ('sensor2',5,5.2);
1 affected row (24.000ms)
sql>select * from sensordata;
+---------+----------+--------------------+
| sensor  | epoch_ts | reading            |
+=========+==========+====================+
| sensor1 |        1 | 1.1000000000000001 |
| sensor1 |        2 | 2.1000000000000001 |
| sensor1 |        3 | 3.1000000000000001 |
| sensor1 |        4 | 5.0999999999999996 |
| sensor2 |        1 | 1.2                |
| sensor2 |        2 | 2.2000000000000002 |
| sensor2 |        3 | 3.2000000000000002 |
| sensor2 |        5 | 5.2000000000000002 |
+---------+----------+--------------------+
8 tuples (3.385ms)
sql>select a.epoch_ts as epoch,
more>       a.reading as sensor1_reading,
more>       b.reading as sensor2_reading,
more>       b.reading - a.reading as delta
more>from
more>     (select epoch_ts, reading from sensordata where sensor='sensor1') a,
more>     (select epoch_ts, reading from sensordata where sensor='sensor2') b
more>where b.epoch_ts = a.epoch_ts
more>order by epoch asc;
+-------+--------------------+--------------------+----------------------+
| epoch | sensor1_reading    | sensor2_reading    | delta                |
+=======+====================+====================+======================+
|     1 | 1.1000000000000001 | 1.2                | 0.099999999999999867 |
|     2 | 2.1000000000000001 | 2.2000000000000002 | 0.10000000000000009  |
|     3 | 3.1000000000000001 | 3.2000000000000002 | 0.10000000000000009  |
+-------+--------------------+--------------------+----------------------+
3 tuples (38.191ms)

Stefan
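Note the caveat about identical time stamps in action above: epochs 4 (sensor1 only) and 5 (sensor2 only) are silently dropped by the inner join. One possible way to keep unmatched epochs is a full outer join with COALESCE on the timestamp. This is a sketch, not from the thread, and assumes the MonetDB version in use supports FULL OUTER JOIN; the delta is NULL wherever one side is missing:

```sql
select coalesce(a.epoch_ts, b.epoch_ts) as epoch,
       a.reading as sensor1_reading,
       b.reading as sensor2_reading,
       b.reading - a.reading as delta   -- NULL when either reading is absent
from      (select epoch_ts, reading from sensordata where sensor='sensor1') a
full outer join
          (select epoch_ts, reading from sensordata where sensor='sensor2') b
       on b.epoch_ts = a.epoch_ts
order by epoch asc;
```

With the sample data above this would return five rows (epochs 1-5) instead of three, with NULLs in the sensor1 column at epoch 5 and in the sensor2 column at epoch 4.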
> I appreciate any direction you might provide.
> Thanks!
--
| Stefan.Manegold @ CWI.nl | DB Architectures  (INS1) |
| http://CWI.nl/~manegold/ | Science Park 123  (L321) |
| Tel.: +31 (0)20 592-4212 | 1098 XG Amsterdam (NL)   |
participants (2)
- Sean McNamara
- Stefan Manegold