Re: [Monetdb-developers] Performance of COPY INTO: data source & constraint checking

24 Jul 2009

      I come back to this issue with some more interesting results.

I did mention in the past that 'COPY INTO ... FROM stdin ..' seems to be
a lot slower than 'COPY INTO ... FROM <filename> ...'

Actually, I have just found something that may shift the problem away
from the stdin itself.

Look at the following two methods of restoring a previously dumped
(rather small) database. 

RESTORE METHOD 1 (far too slow):

$ mclient < backup.sql
[ 62	]
[ 70199	]
[ 288	]
[ 70322	]
[ 69	]
[ 423	]
[ 280596	]
[ 4302	]
[ 49038	]
[ 1	]
[ 10	]
[ 6943	]
[ 6943	]
[ 3424	]
[ 3424	]
[ 3424	]
[ 1	]
[ 6943	]
[ 3424	]
TIME elapsed: 68 seconds

RESTORE METHOD 2:

$ echo "\< backup.sql" | mclient
[ 62	]
[ 70199	]
[ 288	]
[ 70322	]
[ 69	]
[ 423	]
[ 280596	]
[ 4302	]
[ 49038	]
[ 1	]
[ 10	]
[ 6943	]
[ 6943	]
[ 3424	]
[ 3424	]
[ 3424	]
[ 1	]
[ 6943	]
[ 3424	]
TIME elapsed: 1 seconds

Can anyone explain this difference?

One further consideration: Method 2 works fine, but cannot be used to
load gzipped dumps. Maybe this can be a feature request: '\<' should
accept gzipped input.

Roberto

On Mon, 2009-05-18 at 13:03 +0200, Stefan Manegold wrote:
...
On Mon, May 18, 2009 at 11:57:57AM +0200, Stefan de Konink wrote:
...
On Mon, 18 May 2009, Stefan Manegold wrote:
...
In case you came to your conclusion from experimental/emperical study, I
would be very interested to know about the experiemntes performed and the
results measured.
In that case, I would be verythankful if you could share the informative
details with us, i.e. detailed descriptions of
- experimental setup (HW, SW, datasets, workload/queries)
- measured results (what was measured how, and what are the results?)
- (your) discussion/analysis/interpretation of the results
I thought the MADAM meeting was for this.
Ok. fine.
... we should then revive the ideas of keeping note of MADAMs (e.g., on a
Wiki or alike) as reference for those who cannot be present ...
...
...
Could you please elaborate on whether you consider you setup/schema
"representaive" and/or whether/to which extend you think that your results
for one specific case could be generalized to other cases?
Technically the only thing I have to prove to get my code reverted is that
there are cases where late contraints hurt performance.
IMHO, our task should be the following:
- analyze whether there is indeed a significat performance difference
  between early and late constraint checking;
  (5% do not justify any further time to be spent on this)
- and if so, analyze under which circumstances which altertative performs
  (significatly) better or worse than the other;
- and (only) then try to find out where the (significat) performance
  difference comes from
- and (only) then 
  + either try to eliminate the difference;
  + or 
    * make dump/restore use the most suitable alternative (as far as
      possible with reasonable effort);
    * and document our experiences to give adivce for users
Stefan
...
Stefan
-- 
| M.Sc. Roberto Cornacchia
| CWI (Centrum voor Wiskunde en Informatica)
| Science Park 123, 1098XG Amsterdam, The Netherlands
| tel: +31 20 592 4322 , http://www.cwi.nl/~roberto