Requirement gathering and questions about binary bulk loading.
Hello fellow MonetDB'ers,

The Pentaho MonetDB Bulk Loader transformation step is now working pretty well in our tests. http://ci.pentaho.com/view/Data%20Integration/job/Kettle contains a continuous integration build of the software if anyone wants to play with it: New Transformation -> Steps -> Bulk Loading (category) -> MonetDB Bulk Loader. Thank you for the previous information about enabling log output from the MonetDB JDBC driver. After killing a stalled "monetdb start ~/sg-farm" thread and restarting it, MonetDB decided to work really well.

It was suggested early on that I take a look at the binary bulk loading approach to get the fastest results. Now that I have a streaming ASCII version of a bulk loader done for Pentaho Data Integration, I wanted to come back to the binary approach and see what that might look like and what to be aware of.

I read http://www.monetdb.org/Documentation/Cookbooks/SQLrecipies/BinaryBulkLoad

It looks like the requirements are:

1. Write a physical file per column to the disk the MonetDB farm exists on.
2. Tables with DATEs, TIMESTAMPs or UTF-8 characters (e.g. Simplified Chinese) are verboten and must be handled with COPY INTO ... USING DELIMITERS.

Assuming the above is a correct understanding, given the command below:

copy into Tmp from ('path_to_file_i', 'path_to_file_f', 'path_to_file_s');

1. If run from a computer separate from the machine running the MonetDB server, would 'path_to_file_i' be a pathname relative to the MonetDB farm root directory on the remote server machine?
2. This process should work for appending new data onto an existing table, right?
3. Am I out of luck if my table has any of the verboten data types mentioned above mixed in with supported binary bulk-loadable types, aka C-array style data types?
4. If not out of luck for a mixed-type table, what is the procedure to load the table?

In practice I imagine I will not need to fret too much about DATEs needing the COPY INTO ... USING DELIMITERS ASCII streaming approach, because they are likely part of a dimension table with integer keys matching integers on a fact table, which is far more likely to have the supported types MonetDB binary bulk loading likes.

Is it true that MonetDB does not handle joins well, but instead prefers single, denormalized tables with many columns?
From what you know of the internals of MonetDB, is the effort to make some kind of streaming binary bulk loader worth it in terms of the execution time the database spends converting to native binary types? The documentation used the words 'slightly faster', but I do not have a feel for when 'slightly' makes a big difference.
Thanks again for considering these questions. Brandon
On 2013-02-06 06:51, Brandon Jackson wrote:
Hello fellow MonetDB'ers,
The Pentaho MonetDB Bulk Loader transformation step is now working pretty well in our tests. http://ci.pentaho.com/view/Data%20Integration/job/Kettle contains a continuous integration build of the software if anyone wanted to play with it. New Transformation -> Steps -> Bulk Loading (category)-> MonetDB Bulk Loader Thank you for the previous information about enabling log output from the MonetDB JDBC driver. After killing a stalled "monetdb start ~/sg-farm" thread and restarting it, MonetDB decided to work really well.
It was suggested early on that I take a look at the binary bulk loading approach to get the fastest results. Now that I have a streaming ascii version of a bulk loader done for Pentaho Data Integration, I wanted to come back to the binary approach and see what that might look like and what to be aware of.
I read http://www.monetdb.org/Documentation/Cookbooks/SQLrecipies/BinaryBulkLoad
It looks like that page is out of date. See below.
It looks like the requirements are:
1. Write a physical file per column to the disk the monetdb farm exists on.
Correct. And make sure you use the correct binary format for the column.
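For a fixed-width column that means the file is just the raw values back to back, no separators, in the exact width of the column type and in the server's native byte order. A minimal sketch of what the Pentaho step might do for an INT column, assuming a little-endian server and made-up file name and values:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class IntColumnWriter {
        public static void main(String[] args) throws IOException {
            int[] values = {1, 2, 3, 42};  // made-up column data
            // Raw 4-byte integers, back to back, no separators.
            // LITTLE_ENDIAN is an assumption: it must match the byte order
            // of the machine the MonetDB server runs on.
            ByteBuffer buf = ByteBuffer.allocate(values.length * 4)
                                       .order(ByteOrder.LITTLE_ENDIAN);
            for (int v : values) {
                buf.putInt(v);
            }
            try (FileOutputStream out = new FileOutputStream("/data/farm/tmp_i.bin")) {
                out.write(buf.array());
            }
        }
    }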
2. Tables with DATES, TIMESTAMPS or UTF-8 characters ie: Simplified Chinese are verboten and must be handled with COPY INTO .... USING DELIMITERS
Dates and timestamps are stored as integers (int and lng, respectively), so if you convert the data to the expected value, you can, at least in principle, use the binary approach. String data (technically, any variable-sized data) cannot be done using the same trick as we do with fixed-sized data. However, the COPY BINARY INTO SQL extension will still copy in string data; it's just not as efficient as the binary method for fixed-sized data.
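If you want to go that route for a TIMESTAMP column, the mechanics are the same as for any fixed-width column, only with 8-byte (lng) values. The sketch below deliberately leaves the actual conversion as a placeholder, since the exact integer encoding the server expects is not spelled out here and would need to be verified against your server version:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class LngColumnWriter {
        // Placeholder only: the real mapping from a timestamp to MonetDB's
        // internal lng value is not shown here and must be checked against
        // the server before relying on it.
        static long toInternalValue(long epochMillis) {
            return epochMillis;  // assumption for illustration, not the real encoding
        }

        public static void main(String[] args) throws IOException {
            long[] input = {0L, 1360133460000L};  // made-up input data
            ByteBuffer buf = ByteBuffer.allocate(input.length * 8)
                                       .order(ByteOrder.LITTLE_ENDIAN);  // assumed server byte order
            for (long t : input) {
                buf.putLong(toInternalValue(t));
            }
            try (FileOutputStream out = new FileOutputStream("/data/farm/tmp_ts.bin")) {
                out.write(buf.array());
            }
        }
    }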
Assuming the above is a correct understanding, given the command below: copy into Tmp from ('path_to_file_i', 'path_to_file_f', 'path_to_file_s');
The command these days is: COPY BINARY INTO table FROM 'path_to_file_i', 'path_to_file_f', 'path_to_file_s';
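Since you are already talking to the server over JDBC from Pentaho, issuing the statement once the files are in place is straightforward; a sketch with a made-up connection URL, table name and paths:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class BinaryCopy {
        public static void main(String[] args) throws SQLException {
            // Hypothetical connection details; use your own database name and credentials.
            String url = "jdbc:monetdb://localhost:50000/demo";
            try (Connection conn = DriverManager.getConnection(url, "monetdb", "monetdb");
                 Statement st = conn.createStatement()) {
                // Absolute paths as seen by the server process, one file per column.
                st.execute("COPY BINARY INTO tmp FROM "
                         + "'/data/farm/tmp_i.bin', "
                         + "'/data/farm/tmp_f.bin', "
                         + "'/data/farm/tmp_s.txt'");
            }
        }
    }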
1. If run from a computer separate from the machine running the monetdb server, would the 'path_to_file_i' be a pathname relative to the monetdb farm root directory on the remote server machine?
Since the server attaches the files (apart from the string data) by doing a rename() system call, the files must be on the same file system partition as the database itself.
2. This process should work for appending new data on to an existing table, right?
Looking at the MAL plan (using the EXPLAIN prefix to the SQL query), an append is done, so the answer appears to be yes.
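You can check that yourself by prefixing the statement with EXPLAIN and reading back the plan; a small sketch, reusing the made-up connection details from the previous example:

    import java.sql.*;

    public class ExplainBinaryCopy {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:monetdb://localhost:50000/demo", "monetdb", "monetdb");
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "EXPLAIN COPY BINARY INTO tmp FROM '/data/farm/tmp_i.bin'")) {
                // Each row is one line of the MAL plan; the append calls show
                // that the data is added to the existing table contents.
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }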
3. Am I out of luck if my table has any of the verboten data types mentioned above mixed in with supported binary bulkloadable types; aka C-array style data types?
Strings can be handled. The content of the file is expected to be one value per line: no quotes, no embedded newlines. There are also no checks that the data is proper UTF-8 or that it isn't too long for the size of the column.
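So the string column file is an ordinary text file, one UTF-8 encoded value per line; a minimal sketch with made-up values and file name:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;

    public class StringColumnWriter {
        public static void main(String[] args) throws IOException {
            String[] values = {"alpha", "beta", "简化字"};  // made-up column data
            try (Writer out = new OutputStreamWriter(
                     new FileOutputStream("/data/farm/tmp_s.txt"), StandardCharsets.UTF_8)) {
                for (String v : values) {
                    // One value per line, no quoting; the value itself must not
                    // contain a newline, and the server does not validate length or encoding.
                    out.write(v);
                    out.write('\n');
                }
            }
        }
    }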
4. If not out of luck for a mixed type table, what is the procedure to load the table?
Same thing: COPY BINARY INTO table FROM '/full/path1', '/full/path2', ...; this works as long as you only have integral types (you must get the width of the column right!), floating point (float or double, again make sure you have the correct width), and strings.
In practice I imagine I will not need to fret too much about DATES needing a COPY INTO .... USING DELIMITERS ascii streaming approach because it is likely part of a dimension table with integer keys matching integers on a fact table, which is far more likely to have the supported types MonetDB Binary Bulk-loading likes.
Is it true that MonetDB does not handle joins well, but instead prefers single, denormalized tables with many columns?
MonetDB uses knowledge about primary/foreign key relationships when coming up with an execution plan. Also, if the data is normalized, the database is probably smaller, so less I/O, so faster.
From what you know of the internals of MonetDB, is the effort to make some kind of streaming binary bulk loader worth it in terms of the execution time the database spends converting to native binary types? The documentation used the words 'slightly faster', but I do not have a feel for when 'slightly' makes a big difference.
I guess the "slightly" refers to the fact that data still needs to be written to disk by the application that produces the data. If you already have the data in CSV format, it may well be faster to let MonetDB do the parsing, since it uses multiple threads to do part of the work. If you have an application that produces the data, if you can stream the data in CSV format easily, that may also be fast since MonetDB will parse that data in parallel (both to the application and using multiple threads itself).
Thanks again for considering these questions.
Brandon
-- Sjoerd Mullender