Hi Wouter,

Wouter Alink wrote:
> Hello Karanbir,
>
> This sounds like a BOM (Byte Order Mark,
> http://unicode.org/faq/utf_bom.html#BOM) is not dealt with correctly.
That's interesting, and not something I'd considered at all. However:
> If you try:
>
>   xxd /home/kbsingh/data/data/1000.utf8 | head
>
> does it start with 'EF BB BF'?
[kbsingh@koala ~]$ xxd /home/kbsingh/data/data/1000.utf8 | head
0000000: 3664 6266 6339 6431 6635 3464 3137 3366  6dbfc9d1f54d173f
0000010: 6130 3962 6664 6131 3965 3566 6335 3062  a09bfda19e5fc50b

So that does not seem to be the issue in this case.
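For anyone following along, this check can be scripted rather than eyeballed in xxd output. A minimal sketch (the function name is mine, not anything from mclient or MonetDB):

```python
# Check whether a file begins with the UTF-8 byte order mark (EF BB BF).
def starts_with_bom(path):
    with open(path, "rb") as f:
        return f.read(3) == b"\xef\xbb\xbf"

# Example (path is a placeholder):
# starts_with_bom("/home/kbsingh/data/data/1000.utf8")
```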
> A little experiment (on the head) reveals a bug in mclient: it does not
> correctly handle the optional BOM at the beginning of its input:
> $ cat selectWithBOM.py
> print "\xEF\xBB\xBFSELECT 1;"
> $ python selectWithBOM.py > queryWithBOM.sql
> $ xxd queryWithBOM.sql
> 0000000: efbb bf53 454c 4543 5420 313b 0a         ...SELECT 1;.
> $ cat queryWithBOM.sql
> SELECT 1;
> $ echo "SELECT 1;" | mclient -lsql
> % .            # table_name
> % single_value # name
> % tinyint      # type
> % 1            # length
> [ 1 ]
> $ cat queryWithBOM.sql | mclient -lsql
> (Hangs)
> I guess a bug should be filed.
Good call. Should I go ahead and file one using your test case here, or would you like to file the bug report yourself? The only reason I hesitate is that while this issue clearly exists, it doesn't appear to be the one my data suffers from here.
> If your data starts with the BOM, a workaround would be to strip the
> first three bytes of your data (as the BOM is not very meaningful when
> using UTF-8).
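For reference, that stripping workaround might look something like this (a sketch only; the helper name is my own):

```python
# Drop a leading UTF-8 BOM (EF BB BF) if present; otherwise return
# the data unchanged.
def strip_bom(data):
    bom = b"\xef\xbb\xbf"
    return data[len(bom):] if data.startswith(bom) else data
```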
I don't think that's the case here, so what workaround options are available? Essentially: I need to load about 600 to 700 GB worth of data that's going to be delivered to me in a .gz file, and expanding that to raw text is not something I'd like to consider unless that were the _only_ way to get the data loaded here.

-- 
Karanbir Singh : http://www.karan.org/ : 2522219@icq
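P.S. In case it helps, a rough sketch of how the .gz could be streamed without ever expanding it on disk; the function name and the BOM handling here are assumptions on my part, not something tested against mclient:

```python
import gzip

# Yield lines from a gzipped file, decompressing on the fly and
# stripping a leading UTF-8 BOM from the very first line if present.
def stream_sql(path):
    with gzip.open(path, "rb") as f:
        first = True
        for line in f:
            if first and line.startswith(b"\xef\xbb\xbf"):
                line = line[3:]
            first = False
            yield line
```

The output of such a generator could then be piped into mclient from a small driver script, instead of materialising the raw text first.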