[MonetDB-users] copy from stdin oddness
Hi,
This is on CentOS-5/x86_64 running:
[kbsingh@koala data]$ rpm -qa | grep -i Monet
MonetDB-1.22.0-1.el5.kb.oid64
MonetDB5-server-5.4.0-1.el5.kb.oid64
MonetDB-SQL-2.22.0-1.el5.kb.oid64
MonetDB-client-1.22.0-1.el5.kb.oid64
MonetDB-SQL-server5-2.22.0-1.el5.kb.oid64
server started with:
[root@koala ~]# mserver5 --dbinit="include sql;"
on the client:
[kbsingh@koala data]$ echo "\d sample" | mclient -lsql -ukb -Pkb
CREATE TABLE "kb"."sample" (
        "id"      char(33),
        "sdate"   char(20),
        "key"     varchar(255),
        "country" char(2),
        "region"  char(2),
        "city"    varchar(50),
        "pid"     int
);
[kbsingh@koala data]$ echo "select 'hello';" | mclient -lsql -ukb -Pkb
% . # table_name
% single_value # name
% char # type
% 5 # length
[ "hello" ]
[kbsingh@koala data]$ echo "copy 1000 records into sample from '/home/kbsingh/data/data/1000.utf8' using delimiters '\t','\n','';" | mclient -lsql -ukb -Pkb
[ 1000 ]
also:
[kbsingh@koala data]$ mclient -lsql -ukb -Pkb -s "copy 1000 records into sample from '/home/kbsingh/data/data/1000.utf8' using delimiters '\t','\n','';"
[ 1000 ]
So it works so far. However:
[kbsingh@koala data]$ cat 1000.utf8 | mclient -lsql -ukb -Pkb -s "copy 1000 records into sample from STDIN using delimiters '\t','\n','';"
MAPI  = kb@localhost:50000
QUERY = copy 1000 records into sample from STDIN using delimiters '\t','\n','';
ERROR = !SQLException:sql:missing sep line 0 field 0
!SQLException:importTable:failed to import table
Given that mclient is on the right side of the pipe, the data is surely being made available on stdin, so why does mclient fail like this?
Also, adding an OFFSET makes mclient just die quietly, like this:
[kbsingh@koala data]$ cat 1000.utf8 | mclient -lsql -ukb -Pkb -s "copy 5 offset 1000 records into sample from STDIN using delimiters '\t','\n','';"
[kbsingh@koala data]$
So, what am I getting so wrong here? The load chain I need to set up will look like this:
zcat andy1.gz | iconv -f latin1 -t utf8 | mclient ......
That compressed file is 62 GB; the chances of me being able to expand it and run the load from a specific filename are zero.
-- Karanbir Singh : http://www.karan.org/ : 2522219@icq
Hello Karanbir,
This sounds like a BOM (Byte Order Mark, http://unicode.org/faq/utf_bom.html#BOM) not being dealt with correctly.
If you try:
xxd /home/kbsingh/data/data/1000.utf8 | head
does it start with 'EF BB BF'?
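(A quick way to look at just the first three bytes:)
$ head -c 3 /home/kbsingh/data/data/1000.utf8 | xxd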
A little experiment (on the HEAD) reveals a bug in mclient: it does not correctly handle the optional BOM at the beginning of its input:
$ cat selectWithBOM.py
print "\xEF\xBB\xBFSELECT 1;"
$ python selectWithBOM.py > queryWithBOM.sql
$ xxd queryWithBOM.sql
0000000: efbb bf53 454c 4543 5420 313b 0a         ...SELECT 1;.
$ cat queryWithBOM.sql
SELECT 1;
$ echo "SELECT 1;" | mclient -lsql
% . # table_name
% single_value # name
% tinyint # type
% 1 # length
[ 1 ]
$ cat queryWithBOM.sql | mclient -lsql
(Hangs)
I guess a bug should be filed.
If your data starts with the BOM, a workaround would be to strip the
first three bytes of your data (as the BOM is not very meaningful when
using UTF-8).
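If it does, something like this should do the trick (an untested sketch, reusing the queryWithBOM.sql file from the experiment above):
$ # drop the first three bytes unconditionally
$ tail -c +4 queryWithBOM.sql | mclient -lsql
$ # or strip the BOM only when it is actually present (GNU sed understands \xHH)
$ sed '1s/^\xEF\xBB\xBF//' queryWithBOM.sql | mclient -lsql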
Greetings,
Wouter
Hi Wouter,

Wouter Alink wrote:
Hello Karanbir,
This sounds like a BOM (Byte Order Mark, http://unicode.org/faq/utf_bom.html#BOM) is not dealt with correctly.
That's interesting, and not something I'd considered at all. However:
If you try:
xxd /home/kbsingh/data/data/1000.utf8 | head
does it start with 'EF BB BF'?
[kbsingh@koala ~]$ xxd /home/kbsingh/data/data/1000.utf8 | head
0000000: 3664 6266 6339 6431 6635 3464 3137 3366  6dbfc9d1f54d173f
0000010: 6130 3962 6664 6131 3965 3566 6335 3062  a09bfda19e5fc50b
So that does not seem to be the issue in this case.
A little experiment (on the head) reveals a bug in mclient (it does not handle correctly the optional BOM at the beginning of the input):
$ cat selectWithBOM.py
print "\xEF\xBB\xBFSELECT 1;"
$ python selectWithBOM.py > queryWithBOM.sql
$ xxd queryWithBOM.sql
0000000: efbb bf53 454c 4543 5420 313b 0a         ...SELECT 1;.
$ cat queryWithBOM.sql
SELECT 1;
$ echo "SELECT 1;" | mclient -lsql
% . # table_name
% single_value # name
% tinyint # type
% 1 # length
[ 1 ]
$ cat queryWithBOM.sql | mclient -lsql
(Hangs)
I guess a bug should be filed.
Good call. Should I go ahead and file that using your test case, or would you like to file the bug report yourself? The only reason I'm hesitant is that, while this issue does seem real, it's not one my data suffers from here.
If your data starts with the BOM, a workaround would be to strip the first three bytes of your data (as the BOM is not very meaningful when using UTF-8).
I don't think that's the case here, so what are the workaround options available? Essentially: I need to load about 600 to 700 GB worth of data that's going to be delivered to me in a .gz file, and expanding that to raw text is not something I'd like to consider unless that was the _only_ way to get the data loaded here.
-- Karanbir Singh : http://www.karan.org/ : 2522219@icq
Hello Karanbir,
Well, that can't be the cause then. :) I will file the bug report regarding the BOM.
When I try the command exactly as you wrote it (with both '-s' and a pipe), I get:
$ echo 'CREATE TABLE aap (a int);' | mclient -lsql -dtest
$ cat data.dat
1
2
3
4
5
$ N=4; head -n $N data.dat | mclient -lsql -dtest -s "copy $N records into aap from STDIN;"
MAPI  = monetdb@localhost:50000
QUERY = copy 4 records into aap from STDIN;
ERROR = !SQLException:sql:value ';' while parsing ';' from line 0 field 0 not inserted, expecting type int
!SQLException:importTable:failed to import table
It seems that mclient is confused about what to read first (the statement or stdin); perhaps it is a bug? I think an mclient guru might be able to answer this.
I also tried the following (and this seems to work fine):
$ (N=4; echo "copy $N records into aap from STDIN;"; head -n $N data.dat) | mclient -lsql -dtest
[ 4 ]
$ mclient -lsql -dtest -s "select * from aap;"
% sys.aap # table_name
% a # name
% int # type
% 1 # length
[ 1 ]
[ 2 ]
[ 3 ]
[ 4 ]
So this could be another workaround? Have you tried this already?
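Applied to the load chain from your first mail, it would become something like this (an untested sketch; table name, delimiters, and credentials are taken from your earlier mails, and $N stands for the real record count, which I do not know):
$ N=1000   # placeholder: the actual number of records in andy1.gz
$ ( echo "copy $N records into sample from STDIN using delimiters '\t','\n','';"; \
    zcat andy1.gz | iconv -f latin1 -t utf8 ) | mclient -lsql -ukb -Pkb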
Wouter
p.s. Otherwise, I guess another workaround would be to create a (temporary) pipe on your filesystem (but I'm not sure whether that works):
$ mkfifo /tmp/workaroundpipe
$ (cat mydata > /tmp/workaroundpipe) &
$ mclient -lsql -s "copy x records into x from '/tmp/workaroundpipe';"
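Fleshed out for your gzipped file, that might look like this (again an untested sketch; note that the file path in COPY ... FROM is read by the server, so this only helps while mserver5 runs on the same machine, and $N again stands for the real record count):
$ mkfifo /tmp/workaroundpipe
$ ( zcat andy1.gz | iconv -f latin1 -t utf8 > /tmp/workaroundpipe ) &
$ mclient -lsql -ukb -Pkb -s "copy $N records into sample from '/tmp/workaroundpipe' using delimiters '\t','\n','';"
$ rm /tmp/workaroundpipe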
Hi Wouter,

Wouter Alink wrote:
Well, that can't be the cause then. :) I will file the bug-report regarding the BOM.
ta!
It seems that mclient is confused about what to read first (statement or stdin) and perhaps it is a bug? I think an mclient guru might be able to answer this?
I've now moved to the latest release set, Feb-2009, and still have the same issue with both a pipe and stdin from the CLI when used in combination with -s; so it's safe to assume the 'issue' persists.
$ (N=4; echo "copy $N records into aap from STDIN;"; head -n $N data.dat) | mclient -lsql -dtest
Various other options seem to work if the copy command goes in with the data via STDIN. Just to be as close to your case as possible:
$ time ( echo "copy 632074442 records into sample from STDIN using delimiters '\t','\n','';" ; zcat web_tracker20090420.txt.gz ) | recode iso8859-1..utf-8 | mclient -lsql -ukb -Pkb
[ 632074442 ]

real    689m22.520s
user    165m59.511s
sys     313m45.179s
So this could be another workaround? Have you tried this already?
Have now :) Well, I did manage to get the stream working with the SQL going in along with the data :) Thanks for your help on this matter.
- KB