Requirement gathering and questions about binary bulk loading.
Hello fellow MonetDB'ers,

The Pentaho MonetDB Bulk Loader transformation step is now working pretty well in our tests. http://ci.pentaho.com/view/Data%20Integration/job/Kettle contains a continuous integration build of the software if anyone wants to play with it: New Transformation -> Steps -> Bulk Loading (category) -> MonetDB Bulk Loader. Thank you for the previous information about enabling log output from the MonetDB JDBC driver. After killing a stalled "monetdb start ~/sg-farm" thread and restarting it, MonetDB decided to work really well.

It was suggested early on that I take a look at the binary bulk loading approach to get the fastest results. Now that I have a streaming ASCII version of a bulk loader done for Pentaho Data Integration, I wanted to come back to the binary approach and see what that might look like and what to be aware of.

I read http://www.monetdb.org/Documentation/Cookbooks/SQLrecipies/BinaryBulkLoad

It looks like the requirements are:

1. Write a physical file per column to the disk the MonetDB farm exists on.
2. Tables with DATEs, TIMESTAMPs or UTF-8 characters (e.g. Simplified Chinese) are verboten and must be handled with COPY INTO ... USING DELIMITERS.

Assuming the above is a correct understanding, given the command below:

copy into Tmp from ('path_to_file_i', 'path_to_file_f', 'path_to_file_s');

1. If run from a computer separate from the machine running the MonetDB server, would 'path_to_file_i' be a pathname relative to the MonetDB farm root directory on the remote server machine?
2. This process should work for appending new data onto an existing table, right?
3. Am I out of luck if my table has any of the verboten data types mentioned above mixed in with supported binary bulk-loadable types, aka C-array style data types?
4. If not out of luck for a mixed-type table, what is the procedure to load the table?

In practice I imagine I will not need to fret too much about DATEs needing the COPY INTO ... USING DELIMITERS ASCII streaming approach, because they are likely part of a dimension table with integer keys matching integers on a fact table, which is far more likely to have the supported types MonetDB binary bulk loading likes.

Is it true that MonetDB does not handle joins well, but instead prefers single, denormalized tables with many columns?
From what you know of the internals of MonetDB, is the effort to make some kind of streaming binary bulk loader worth it in terms of the execution time the database spends converting to native binary types? The documentation used the words 'slightly faster', but I do not have a feel for when 'slightly' makes a big difference.
Thanks again for considering these questions. Brandon
On 2013-02-06 06:51, Brandon Jackson wrote:
Hello fellow MonetDB'ers,
The Pentaho MonetDB Bulk Loader transformation step is now working pretty well in our tests. http://ci.pentaho.com/view/Data%20Integration/job/Kettle contains a continuous integration build of the software if anyone wanted to play with it. New Transformation -> Steps -> Bulk Loading (category)-> MonetDB Bulk Loader Thank you for the previous information about enabling log output from the MonetDB JDBC driver. After killing a stalled "monetdb start ~/sg-farm" thread and restarting it, MonetDB decided to work really well.
It was suggested early on that I take a look at the binary bulk loading approach to get the fastest results. Now that I have a streaming ascii version of a bulk loader done for Pentaho Data Integration, I wanted to come back to the binary approach and see what that might look like and what to be aware of.
I read http://www.monetdb.org/Documentation/Cookbooks/SQLrecipies/BinaryBulkLoad
It looks like that page is out of date. See below.
It looks like the requirements are:
1. Write a physical file per column to the disk the monetdb farm exists on.
Correct. And make sure you use the correct binary format for the column.
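For a fixed-width column that means the file is just the raw values back to back, no separators, in the exact width of the column type and in the server's native byte order. A minimal sketch of what the Pentaho step might do for an INT column, assuming a little-endian server and made-up file name and values:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class IntColumnWriter {
        public static void main(String[] args) throws IOException {
            int[] values = {1, 2, 3, 42};  // made-up column data
            // Raw 4-byte integers, back to back, no separators.
            // LITTLE_ENDIAN is an assumption: it must match the byte order
            // of the machine the MonetDB server runs on.
            ByteBuffer buf = ByteBuffer.allocate(values.length * 4)
                                       .order(ByteOrder.LITTLE_ENDIAN);
            for (int v : values) {
                buf.putInt(v);
            }
            try (FileOutputStream out = new FileOutputStream("/data/farm/tmp_i.bin")) {
                out.write(buf.array());
            }
        }
    }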
2. Tables with DATES, TIMESTAMPS or UTF-8 characters ie: Simplified Chinese are verboten and must be handled with COPY INTO .... USING DELIMITERS
Dates and timestamps are stored as integers (int and lng, respectively), so if you convert the data to the expected value, you can, at least in principle, use the binary approach. String data (technically, any variable-sized data) cannot be done using the same trick as we do with fixed-sized data. However, the COPY BINARY INTO SQL extension will still copy in string data; it's just not as efficient as the binary method for fixed-sized data.
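If you want to go that route for a TIMESTAMP column, the mechanics are the same as for any fixed-width column, only with 8-byte (lng) values. The sketch below deliberately leaves the actual conversion as a placeholder, since the exact integer encoding the server expects is not spelled out here and would need to be verified against your server version:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class LngColumnWriter {
        // Placeholder only: the real mapping from a timestamp to MonetDB's
        // internal lng value is not shown here and must be checked against
        // the server before relying on it.
        static long toInternalValue(long epochMillis) {
            return epochMillis;  // assumption for illustration, not the real encoding
        }

        public static void main(String[] args) throws IOException {
            long[] input = {0L, 1360133460000L};  // made-up input data
            ByteBuffer buf = ByteBuffer.allocate(input.length * 8)
                                       .order(ByteOrder.LITTLE_ENDIAN);  // assumed server byte order
            for (long t : input) {
                buf.putLong(toInternalValue(t));
            }
            try (FileOutputStream out = new FileOutputStream("/data/farm/tmp_ts.bin")) {
                out.write(buf.array());
            }
        }
    }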
Assuming the above is a correct understanding, given the command below: copy into Tmp from ('path_to_file_i', 'path_to_file_f', 'path_to_file_s');
The command these days is: COPY BINARY INTO table FROM 'path_to_file_i', 'path_to_file_f', 'path_to_file_s';
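Since you are already talking to the server over JDBC from Pentaho, issuing the statement once the files are in place is straightforward; a sketch with a made-up connection URL, table name and paths:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class BinaryCopy {
        public static void main(String[] args) throws SQLException {
            // Hypothetical connection details; use your own database name and credentials.
            String url = "jdbc:monetdb://localhost:50000/demo";
            try (Connection conn = DriverManager.getConnection(url, "monetdb", "monetdb");
                 Statement st = conn.createStatement()) {
                // Absolute paths as seen by the server process, one file per column.
                st.execute("COPY BINARY INTO tmp FROM "
                         + "'/data/farm/tmp_i.bin', "
                         + "'/data/farm/tmp_f.bin', "
                         + "'/data/farm/tmp_s.txt'");
            }
        }
    }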
1. If run from a computer separate from the machine running the monetdb server, would the 'path_to_file_i' be a pathname relative to the monetdb farm root directory on the remote server machine?
Since the server attaches the files (apart from the string data) by doing a rename() system call, the files must be on the same file system partition as the database itself.
2. This process should work for appending new data on to an existing table, right?
Looking at the MAL plan (using the EXPLAIN prefix to the SQL query), an append is done, so the answer appears to be yes.
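You can check that yourself by prefixing the statement with EXPLAIN and reading back the plan; a small sketch, reusing the made-up connection details from the previous example:

    import java.sql.*;

    public class ExplainBinaryCopy {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:monetdb://localhost:50000/demo", "monetdb", "monetdb");
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                     "EXPLAIN COPY BINARY INTO tmp FROM '/data/farm/tmp_i.bin'")) {
                // Each row is one line of the MAL plan; the append calls show
                // that the data is added to the existing table contents.
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }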
3. Am I out of luck if my table has any of the verboten data types mentioned above mixed in with supported binary bulkloadable types; aka C-array style data types?
Strings can be handled. The content of the file is expected to be one value per line: no quotes, no embedded newlines. There are also no checks that the data is proper UTF-8 or that it isn't too long for the size of the column.
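So the string column file is an ordinary text file, one UTF-8 encoded value per line; a minimal sketch with made-up values and file name:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;

    public class StringColumnWriter {
        public static void main(String[] args) throws IOException {
            String[] values = {"alpha", "beta", "简化字"};  // made-up column data
            try (Writer out = new OutputStreamWriter(
                     new FileOutputStream("/data/farm/tmp_s.txt"), StandardCharsets.UTF_8)) {
                for (String v : values) {
                    // One value per line, no quoting; the value itself must not
                    // contain a newline, and the server does not validate length or encoding.
                    out.write(v);
                    out.write('\n');
                }
            }
        }
    }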
4. If not out of luck for a mixed type table, what is the procedure to load the table?
Same thing: COPY BINARY INTO table FROM '/full/path1', '/full/path2', ...; this works as long as you only have integral types (you must get the width of the column right!), floating point (float or double, again make sure you have the correct width), and strings.
In practice I imagine I will not need to fret too much about DATES needing a COPY INTO .... USING DELIMITERS ascii streaming approach because it is likely part of a dimension table with integer keys matching integers on a fact table, which is far more likely to have the supported types MonetDB Binary Bulk-loading likes.
Is it true that MonetDB does not handle joins well, but instead prefers single, denormalized tables with many columns?
MonetDB uses knowledge about primary/foreign key relationships when coming up with an execution plan. Also, if the data is normalized, the database is probably smaller, so less I/O, so faster.
From what you know of the internals of MonetDB, is the effort to make some kind of streaming binary bulk loader worth it in terms of the execution time the database spends converting to native binary types? The documentation used the words 'slightly faster', but I do not have a feel for when 'slightly' makes a big difference.
I guess the "slightly" refers to the fact that data still needs to be written to disk by the application that produces the data. If you already have the data in CSV format, it may well be faster to let MonetDB do the parsing, since it uses multiple threads to do part of the work. If you have an application that produces the data, if you can stream the data in CSV format easily, that may also be fast since MonetDB will parse that data in parallel (both to the application and using multiple threads itself).
Thanks again for considering these questions.
Brandon
-- Sjoerd Mullender