[MonetDB-users] Bulk Data Extraction & Embedded mserver5
Hi,

I was wondering if you can give me some ideas on how to retrieve data from MonetDB at very high throughput. The query will be very simple, issued against a single table. This table contains billions of records, and I'd like to retrieve the data at a rate of at least one million records per second. From my naive tests with mclient running on the same 16-core machine as mserver5, I am only able to extract the data at about 20,000 records per second with the Feb2010 release.

As a baseline case, with the same data in compressed (gzip'd) ASCII form stored in a regular file on the same (SAN) file system as MonetDB, I am able to read at the desired speed of one million records per second.

I understand that there is communication overhead between mserver5 and mclient (or whatever client I use). Therefore, one possibility is to embed my application within mserver5. The embedded application basically just needs to be able to issue a SQL (or even MAL) query against the enclosing mserver5 and process the result set. If this is a viable approach, I'd like some guidance on where the hooks are.

Thanks.

Hering Cheng
Hi Hering,

MonetDB is a full-fledged DBMS, not just a plain data store. Hence, just like any other DBMS, its purpose is not merely to store (large amounts of) data and retrieve it as-is in its entirety --- say, SELECT * FROM table; --- (that is indeed what we have file systems and plain files for), but to process and analyse the data using (possibly complex) queries that yield results that are usually significantly smaller than the stored data volume.

Basically, I may assume that you do not intend to present data at a rate of 1M records per second to the (human) user, do you? Consequently, I may assume that your application does do some kind of processing of the retrieved data, right?

The whole purpose of a DBMS is to provide you with the tools and knowledge to do such processing on large amounts of data very efficiently. Hence, instead of aiming at retrieving all data as-is at high throughput from the DBMS to post-process it in your application, you should rather aim at having the DBMS do all (or at least most) of that processing, by formulating the processing task in a DBMS query language like SQL (or MAL, if SQL does not provide sufficient functionality and you are not limited to using a standard language). Basically, pushing (most of) the data processing logic of your application into the DBMS and retrieving only a small ("useful" and "usable") result is the recommended way to use a DBMS efficiently and effectively.

Stefan

ps: Documentation about embedded use of MonetDB is available on our website:
http://monetdb.cwi.nl/MonetDB/Documentation/Embedded-Server.html
http://monetdb.cwi.nl/SQL/Documentation/Embedded-Server.html
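To make this concrete, here is a minimal sketch of "pushing the processing into the DBMS", assuming a hypothetical table readings(ts TIMESTAMP, sensor INTEGER, val DOUBLE) that is not part of this thread:

    -- Hypothetical table, for illustration only:
    --   CREATE TABLE readings (ts TIMESTAMP, sensor INTEGER, val DOUBLE);

    -- Pulling everything out sends every row over the client protocol:
    --   SELECT * FROM readings;

    -- Pushing the aggregation into the server returns only a tiny result:
    SELECT sensor,
           COUNT(*) AS n,
           AVG(val) AS avg_val,
           MIN(val) AS min_val,
           MAX(val) AS max_val
    FROM readings
    GROUP BY sensor;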
Thank you, Stefan, for the advice.

I apologize if I gave the impression that I am using MonetDB as a mere data store. I do realize that one should leverage the analytical capabilities of the DBMS engine as much as possible before getting the results out. Part of my system will indeed make extensive use of MonetDB's amazing query capability.

Unfortunately, another part of my application deals with what is known as "complex event processing" (CEP), whose most fundamental building block is the use of fixed or sliding time windows. An example would be to calculate averages of a column over a 5-minute window. Needless to say, the data are time series. The trick is to meld the CEP logic with MonetDB. Ideally, I would enhance MonetDB with this capability, but I deem it to be beyond my own ability. Thus my question about how to feed data from MonetDB into my logic in bulk.
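For context, a minimal sketch of how far plain SQL gets for the fixed-window case, assuming a hypothetical table ticks(ts TIMESTAMP, val DOUBLE); a truly sliding window is harder and would need something like a range self-join or a user-defined function:

    -- Hypothetical time-series table, for illustration only:
    --   CREATE TABLE ticks (ts TIMESTAMP, val DOUBLE);

    -- Average of val per fixed (tumbling) 5-minute window, built from
    -- standard EXTRACT fields; integer division of the minute by 5
    -- yields the window number 0..11 within each hour.
    SELECT EXTRACT(YEAR   FROM ts)     AS y,
           EXTRACT(MONTH  FROM ts)     AS mo,
           EXTRACT(DAY    FROM ts)     AS d,
           EXTRACT(HOUR   FROM ts)     AS h,
           EXTRACT(MINUTE FROM ts) / 5 AS win5,
           AVG(val)                    AS avg_val,
           COUNT(*)                    AS n
    FROM ticks
    GROUP BY EXTRACT(YEAR  FROM ts), EXTRACT(MONTH FROM ts),
             EXTRACT(DAY   FROM ts), EXTRACT(HOUR  FROM ts),
             EXTRACT(MINUTE FROM ts) / 5
    ORDER BY y, mo, d, h, win5;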
The problem very much is that the speed you see now is roughly what you're going to get at max. The upcoming Jun2010 release will give you some improvements once you connect to the database from the same machine using a C program based on the Mapi library or just wrapped around mclient, but in fact our protocol for sending over the data is anything but optimised for speed when transmitting large volumes. It's very hard to get around this, and you will quickly end up deep down in the engine, reaching for the BATs and trying to serialise them yourself, if it's all about speed. But perhaps a better approach is to translate your CEP queries into something that is helped by user-defined functions, either in MAL or maybe even C, if you really have only a few operations that need to be performed very fast.
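As a rough illustration of the user-defined-function idea, kept at the SQL level for brevity (the function name win5 and the table ticks are hypothetical; a MAL or C UDF would express the same operation at a lower level and faster), the window bucketing from the CEP example could be packaged so that it runs inside the server:

    -- Hypothetical SQL-level helper, for illustration only.
    CREATE FUNCTION win5(ts TIMESTAMP)
    RETURNS INTEGER
    BEGIN
        RETURN EXTRACT(MINUTE FROM ts) / 5;  -- 5-minute bucket within the hour
    END;

    -- The bucketing now runs inside the server, next to the data:
    SELECT win5(ts) AS w, AVG(val) AS avg_val
    FROM ticks
    GROUP BY win5(ts);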
Hi Fabian, Hering,

Would it not be possible to COPY INTO a file? Performance of that will most likely be higher than going through Mapi, though it carries the disadvantages that the file will be available on the server machine only and it still converts to ASCII (which your application will then most likely parse back).

Otherwise, as Fabian hints, if you want insane speed, then yes, you go into shady kernel regions... you could select into a temporary table and then obtain the BAT file names. If you are just interested in fixed-size numeric types (32- or 64-bit signed integers or 32- or 64-bit doubles), then this is simple, as the .tail file contains just that one column of data as a raw one-dimensional array. Strings are less obvious.

Peter
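For reference, a minimal sketch of the COPY INTO route (table, columns, and file path are hypothetical; the exact syntax and options are in the "Copy into File" section of the SQL documentation cited below):

    -- Dump a query result to a file on the server machine; the path is
    -- hypothetical and must be writable by mserver5.
    COPY SELECT ts, val FROM ticks
    INTO '/tmp/ticks_dump.csv'
    USING DELIMITERS ',', '\n';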
On Thu, May 27, 2010 at 04:47:54PM +0200, Peter Boncz wrote:
Would it not be possible to COPY INTO a file? Performance of that will most likely be higher than going through Mapi [...]
Indeed that is possible; cf., "3.1.6.5 Copy into File" at http://monetdb.cwi.nl/SQL/Documentation/Data-Manipulation.html#Data-Manipula...
Otherwise, as Fabian hints, if you want insane speed, then yes you go into shady kernel regions... you could select into a temporary table, and then obtain the BAT file names [...]
It might be a bit scary to read directly from the data files in MonetDB's dbfarm while they are under the control of MonetDB (less scary if there are no concurrent updates on the data while the external program is reading the files), but it is indeed possible (surely on operating/file systems that allow multiple processes to open the same file concurrently ;-)).

As a side note, while not (yet?) available in open-source MonetDB, you also might find our research papers on stream processing with MonetDB interesting:
http://repository.cwi.nl/search/fullrecord.php?publnr=14291
http://repository.cwi.nl/search/fullrecord.php?publnr=14315
http://repository.cwi.nl/search/fullrecord.php?publnr=12229
As said, this is still very much ongoing research and far from "production ready".

Stefan
Other than taking the embedded route, can someone give me some guidance on how I can hook into the internals of MonetDB so I can "intervene" before the result set of a SQL query is sent to clients?

Basically, I want to insert my logic within MonetDB so that the (initial) result set that would have been sent to the clients (mclient, ODBC, JDBC) is intercepted by my code and re-processed. The output of my logic will then be forwarded to the clients as usual. Since my logic resides within MonetDB, I should be able to receive the initial result set at millions of records per second.

I'd like some hints on which source files I should look at to get this to work.

Thanks a lot.

Hering
Hering Cheng wrote:
I was wondering if you can give me some ideas on how to retrieve data from MonetDB at very high throughput. [...] I am only able to extract the data at about 20,000 records per second with the Feb2010 release.

Without an indication of the table structure (a few columns or thousands), a skeleton of the SQL query you are firing, and a synopsis of how your application interface is set up, an answer is hard to give.
Scanning 1M records inside the kernel is not a big deal. See http://www.cwi.nl/~mk/ontimeReport for scanning >100M rows
As a baseline case, with the same data in compressed (gzip'd) ASCII form stored in a regular file on the same (SAN) file system as MonetDB, I am able to read at the desired speed of one million records per second.
A DBMS is not a file store!
Thank you, Martin. The report is very impressive. I only wish that it were more widely disseminated. I thought it would be great if it were accessible from your "Benchmark" (http://monetdb.cwi.nl/SQL/Benchmark/index.html) page.
participants (5)
- Fabian Groffen
- Hering Cheng
- Martin Kersten
- Peter Boncz
- Stefan Manegold