Hi Wouter,

Wouter Alink wrote:
> Hello Karanbir,
>
> This sounds like a BOM (Byte Order Mark,
> http://unicode.org/faq/utf_bom.html#BOM) is not dealt with correctly.
That's interesting, and not something I'd considered at all. However:
> If you try:
>
>   xxd /home/kbsingh/data/data/1000.utf8 | head
>
> does it start with 'EF BB BF'?
[kbsingh@koala ~]$ xxd /home/kbsingh/data/data/1000.utf8 | head
0000000: 3664 6266 6339 6431 6635 3464 3137 3366  6dbfc9d1f54d173f
0000010: 6130 3962 6664 6131 3965 3566 6335 3062  a09bfda19e5fc50b

So that does not seem to be the issue in this case.
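For anyone following along, this check can be scripted rather than eyeballed in xxd output. A minimal sketch (the function name is mine, not anything from mclient or MonetDB):

```python
# Check whether a file begins with the UTF-8 byte order mark (EF BB BF).
def starts_with_bom(path):
    with open(path, "rb") as f:
        return f.read(3) == b"\xef\xbb\xbf"

# Example (path is a placeholder):
# starts_with_bom("/home/kbsingh/data/data/1000.utf8")
```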
> A little experiment (on the head) reveals a bug in mclient: it does not
> correctly handle the optional BOM at the beginning of its input:
> $ cat selectWithBOM.py
> print "\xEF\xBB\xBFSELECT 1;"
> $ python selectWithBOM.py > queryWithBOM.sql
> $ xxd queryWithBOM.sql
> 0000000: efbb bf53 454c 4543 5420 313b 0a         ...SELECT 1;.
> $ cat queryWithBOM.sql
> SELECT 1;
> $ echo "SELECT 1;" | mclient -lsql
> % .            # table_name
> % single_value # name
> % tinyint      # type
> % 1            # length
> [ 1 ]
> $ cat queryWithBOM.sql | mclient -lsql
> (Hangs)
> I guess a bug should be filed.
Good call. Should I go ahead and file one using your test case here, or would you like to file the bug report yourself? The only reason I hesitate is that while this issue clearly exists, it doesn't appear to be the one my data suffers from here.
> If your data starts with the BOM, a workaround would be to strip the
> first three bytes of your data (as the BOM is not very meaningful when
> using UTF-8).
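For reference, that stripping workaround might look something like this (a sketch only; the helper name is my own):

```python
# Drop a leading UTF-8 BOM (EF BB BF) if present; otherwise return
# the data unchanged.
def strip_bom(data):
    bom = b"\xef\xbb\xbf"
    return data[len(bom):] if data.startswith(bom) else data
```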
I don't think that's the case here, so what workaround options are available? Essentially: I need to load about 600 to 700 GB worth of data that's going to be delivered to me in a .gz file, and expanding that to raw text is not something I'd like to consider unless that were the _only_ way to get the data loaded here.

-- 
Karanbir Singh : http://www.karan.org/ : 2522219@icq
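P.S. In case it helps, a rough sketch of how the .gz could be streamed without ever expanding it on disk; the function name and the BOM handling here are assumptions on my part, not something tested against mclient:

```python
import gzip

# Yield lines from a gzipped file, decompressing on the fly and
# stripping a leading UTF-8 BOM from the very first line if present.
def stream_sql(path):
    with gzip.open(path, "rb") as f:
        first = True
        for line in f:
            if first and line.startswith(b"\xef\xbb\xbf"):
                line = line[3:]
            first = False
            yield line
```

The output of such a generator could then be piped into mclient from a small driver script, instead of materialising the raw text first.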