Hi Lynn,

To understand your problem in more detail, could you possibly share the schema of your table (i.e., the types of all columns), as well as the output of the following two commands on your machine:

    df -h
    ps auxwwwf | grep mserver5

These would help me/us understand where your data and dbfarm are located. Feel free to send this information privately to me in case you think this list is "too public" --- I'd appreciate it, though, if you'd keep the general discussion on this list for others to jump in and/or benefit.

Best, Stefan

----- On Mar 9, 2017, at 6:15 PM, Lynn Carol Johnson lcj34@cornell.edu wrote:
Regarding the “copy” suggestion: I have a folder /workdir/lcj34/monetdb/savedBinaries_fullGenome and a folder /workdir/lcj34/monetdb/fullGenome_binariesToLoad.
There are two folders because MonetDB used to delete (“steal”) the binaries when it was done with them. Because I always drop the table and reload it, the “saved” dir has all the previously working binaries. I add the new ones to it, copy everything to the folder from which they will be loaded, and run the load from there. There are no differences in the files, and they are on the same file system, same disk, etc.
I appreciate your help with this.
On 3/9/17, 12:00 PM, "users-list on behalf of Stefan Manegold" wrote:

Hi Lynn,
see my comments inline below.
----- On Mar 9, 2017, at 5:50 PM, Lynn Carol Johnson lcj34@cornell.edu wrote:
Hi Stefan, Martin –
We are open to any suggestions.
Stefan, I understand your reasons for making the change from “steal” to “copy”. Is it possible to have both options?
We might consider that ...
Before loading the new tables I do a “copy *.bin” from the “saved” directory to the “loading” directory. This takes less than an hour. Are there multiple steps involved in the MonetDB copy that result in it not finishing overnight?
Are your "saved" and "loading" directories on different file systems (or even different I/O subsystems / disks / RAIDs)? Could you try loading directly from "saved" rather than copying to "loading" and loading from there? That might help (a lot?) in case "saved" is on another I/O subsystem than your dbfarm, while "loading" is on the same one as your dbfarm ...
From the perspective of a user who deals with very large data, a big selling point of MonetDB is the ability to create data a column at a time and load it using the COPY BINARY command. I have often bragged about MonetDB’s ability to reload the entire table of 2 billion rows and 100+ columns in 3-4 minutes. I would still brag if it took an hour.
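For concreteness, a toy version of the cycle I run looks roughly like this (table, column, and file names here are made up; our real table has 135 columns, with one native-format .bin file per column, listed in the same order as the columns in the CREATE TABLE):

    -- toy reload cycle; names are made up, the real table has 135 columns
    DROP TABLE genome_scores;
    CREATE TABLE genome_scores (chromosome INT, position INT, score1 REAL);
    COPY BINARY INTO genome_scores FROM (
      '/workdir/lcj34/monetdb/fullGenome_binariesToLoad/chromosome.bin',
      '/workdir/lcj34/monetdb/fullGenome_binariesToLoad/position.bin',
      '/workdir/lcj34/monetdb/fullGenome_binariesToLoad/score1.bin'
    );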
Our database contains a row for each of the 2 billion positions on the maize genome. Each row has a chromosome and position column, as well as many other traits scored for that position. We expect to continue to add columns as more data is curated. The data is then used for genomic analysis. While I’m not presented with new data every week, I generally gather data once a month that needs to be added. Having the DB down for several days to reload can be problematic.
In fact, I recently made the case in the group that we need (more) efficient way(s) to extend a table by one (or more) column(s) and load extra data into those column(s) efficiently, i.e., without first loading into a new table and then copying into the ("alter table"-extended) table using a join ... Your scenario confirms that we should increase the priority of pursuing this direction; now we only need to find the time / manpower to implement it ...
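To spell out the workaround I mean, it looks roughly like this sketch (all names made up; the new column's values are first bulk-loaded into a staging table, then copied into the extended table via a join):

    -- sketch of today's workaround for adding one column to an existing table
    CREATE TABLE staging (chromosome INT, position INT, new_score REAL);
    COPY BINARY INTO staging FROM (
      '/path/to/chromosome.bin', '/path/to/position.bin', '/path/to/new_score.bin'
    );
    ALTER TABLE genome_scores ADD COLUMN new_score REAL;
    UPDATE genome_scores
       SET new_score = (SELECT s.new_score
                          FROM staging s
                         WHERE s.chromosome = genome_scores.chromosome
                           AND s.position = genome_scores.position);
    DROP TABLE staging;

The join-based UPDATE over billions of rows is exactly the expensive part we would like to avoid.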
Let me know if there is other data I can collect to help analyze this problem.
I might / will, but need some more time (I'm "on the road" as we speak ...)
Best, Stefan
Thanks - Lynn
On 3/9/17, 7:49 AM, "users-list on behalf of Martin Kersten" wrote:

Hi Stefan, Lynn,
One way to move forward is to move the old functionality into the DataVault. There, a directory of MonetDB-compliant binary data can potentially be handled without the overhead incurred by copying.
On 09/03/2017 13:41, Stefan Manegold wrote:
Hi Lynn,
I understood and understand that your machine is exclusively used by MonetDB.
My question was more to investigate (using iostat & top) what CPU and/or I/O load MonetDB creates while loading the data; in other words, trying to understand whether the experienced performance degradation can be related to either increased I/O or CPU activity. After all, copying 726 GB does cost some time, especially since this does not fit into RAM (and thus into file system caches) ...
I also understood and understand that you're loading into empty tables. The reasons for us to discontinue the "move feature" include that (a) it only worked if both the to-be-loaded data and the dbfarm are on the same filesystem, and (b) it only worked when loading into empty tables. While both seem to (must) hold for you to get the fast performance, they do not hold in many other cases. Moreover, people did not like the fact that (some of) the files they loaded from "were gone" after the load, even if the load did not succeed ...
You could consider downgrading (well, most probably you'd then first have to dump your entire DB(s) and re-load them after the downgrade), but it's very unlikely that we revive the move feature; hence, you'd never be able to upgrade again.
I'd prefer that we could understand where the significant degradation comes from, and whether it's something we could fix ...
Best, Stefan
ps: your dbfarm is not necessarily where your MonetDB installation (program) is, but rather where you find the merovingian.log
----- On Mar 9, 2017, at 1:26 PM, Lynn Carol Johnson lcj34@cornell.edu wrote:
Hi Stefan –
Copying the machine administrators on this email.
The machine on which MonetDB lives has 256 GB RAM, 24 fast cores, 16 TB of RAID10 storage, and 1 TB of SSD scratch space. This machine is used mostly for MonetDB. When I run “top” I see little else running on it. No other I/O going on. Access to this machine is limited, and I generally know who other than myself is on it.
Before loading new columns I ALWAYS do a “drop table”. Then I run “create table …” with the new columns added to the list. I follow this with a “COPY BINARY INTO …” (as in the toy sketch I gave above). So MonetDB is always loading into an empty table.
On disk, the binary files take up 726 GB of space.
Jarek – I load from a directory in /workdir/lcj34/monetdbFiles. MonetDB lives in /local/monetdb. Are these on the same file system?
Stefan: I ran the last load command around 3:30 yesterday afternoon and it was still not complete this morning. We’ve been very happy with MonetDB, but this is quite troubling. Is it a problem to downgrade MonetDB from Dec2016-SP2 back to Jun2016-SP2?
Thanks - Lynn
On 3/9/17, 2:00 AM, "users-list on behalf of Stefan Manegold" wrote:

Hi Lynn Carol,
COPY BINARY INTO now indeed copies the data, and that can add some cost, in particular in case your data is large and your dbfarm is on the same I/O system, or even the same (single) disk.
To better understand your situation, could you possibly share how big the (binary) data is that you load (i.e., how much disk space do the 135 files with 2 billion values each occupy), and whether your I/O system is a single hard disk, a RAID system, or an SSD? Also, how much RAM does your machine have?
Given that you used to exploit the old COPY BINARY INTO's "move feature", I assume your data and your dbfarm are on the same filesystem.
(NB. in the old version, we could do the "move trick" only when bulk-loading into an empty table; when loading more data into a non-empty table, we also had to copy the data ...)
In case your machine has more than one filesystem, each on a different hard disk / SSD / RAID, you could try to have your to-be-loaded data on one and your dbfarm on the other, spreading the I/O load over both (one mostly reading, the other mostly writing).
You can also inspect your system's I/O activity during the load, e.g., using iostat.
Best, Stefan
----- On Mar 8, 2017, at 10:12 PM, Lynn Carol Johnson lcj34@cornell.edu wrote:
BTW, my 135 binary files together are 726 GB. I note the Dec2016 release notes say:
BATattach now copies the input file instead of "stealing" it.
Could this be why it’s gone from 3 minutes to over 3 hours to load this data? My files and MonetDB are on the same machine – no network access. And “top” shows nothing of significance running on the machine except mserver5.
I loved the speed with which I could add new columns to my table by dropping it, re-creating the table, and running COPY BINARY INTO again. I’m hoping you have ideas to get this back, or an idea of what could be wrong.
Thanks - Lynn
From: users-list on behalf of Lynn Carol Johnson
Reply-To: Communication channel for MonetDB users
Date: Wednesday, March 8, 2017 at 3:19 PM
To: Communication channel for MonetDB users
Subject: DEC2016-SP2 and BINARY bulk load

Hi all –
I have always used the COPY BINARY INTO … command to load my 2.0 billion rows of genetic data into a MonetDB table. With 135 columns, it has been blindingly fast.
Last week I moved from the Jun2016-SP2 release to Dec2016-SP2. My binary loads are taking WAY longer. I killed one after 3 hours (via “call sys.stop(pid)” so it could clean up properly). I then started the load again, thinking perhaps the problem was related to the new columns I was adding.
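(For reference, this is how I stop a load so it can clean up, assuming the sys.queue() / sys.stop() monitoring functions; the tag number here is made up:)

    SELECT * FROM sys.queue();  -- find the tag of the long-running statement
    CALL sys.stop(42);          -- 42 stands in for the tag sys.queue() reported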
I have since dropped the table and remade it using the same data and scripts that worked in just over 3 minutes in February on the Jun2016-SP2 release. It is really chugging along – I’m up to 30 minutes and counting. I don’t have access to the SQL log files, but the merovingian.log shows nothing.
I do notice that previously the binary files, once loaded, were removed from the loading directory. This does not happen now. Were these files previously “moved”, and are they now copied?
Has anyone seen this performance issue with the Dec2016-SP2 COPY BINARY INTO … command?
Thanks - Lynn
--
| Stefan.Manegold@CWI.nl | DB Architectures (DA)   |
| www.CWI.nl/~manegold/  | Science Park 123 (L321) |
| +31 (0)20 592-4212     | 1098 XG Amsterdam (NL)  |