[MonetDB-users] Copy into issues with mixed ascii / hex data
I have a tab-delimited data file that is mostly ascii, but has a few hex characters in it. For example: ... 3263640152 ABC<99> ... where <99> is a hex value for (tm) - trademark. When I try and load this data file into monetdb, I get the following error: [root@bidbtest ~]# /usr/local/MonetDB/bin/mclient -lsql --database=myschema -t -s "COPY 20603 records into test_table from '/tmp/testdata.txt' using delimiters '\t','\n','\"' null as '';" user(root):monetdb password: MAPI = (monetdb) /usr/local/MonetDB/var/MonetDB5/dbfarm/mapi_socket QUERY = COPY 20603 records into test_table from '/tmp/testdata.txt' using delimiters '\t','\n','"' null as ''; ERROR = !SQLException:importTable:value 'failed to import table Timer 790.415 msec The table def is: CREATE TABLE creative_dim ( id bigint NOT NULL, headline varchar(100) DEFAULT NULL ) How can I go about loading this file? Ideally, I would like to get these characters loaded as is. I saw that mclient has a -E flag that accepts an encoding type. However, I'm not sure what to set it to for a file with ascii / hex data. A complicating factor is that I don't know all the hex characters that are in my file. I luckily spotted the <99>/tm one above as it was one of the first lines in the file. But after replacing all instances of that character with the empty string, I still get a similar error to what I pasted above. I manually paged through the file and found different hex characters. It isn't practical for me to page through an entire file and monetdb doesn't tell me which line number the problem is on. By the way, I am using MonetDB v5.20.5. Thanks for the help! -- View this message in context: http://old.nabble.com/Copy-into-issues-with-mixed-ascii---hex-data-tp3023894... Sent from the monetdb-users mailing list archive at Nabble.com.
On 2010-11-17 20:20, paulr135 wrote:
I have a tab-delimited data file that is mostly ascii, but has a few hex characters in it. For example: ... 3263640152 ABC<99> ...
where <99> is a hex value for (tm) - trademark.
When I try and load this data file into monetdb, I get the following error: [root@bidbtest ~]# /usr/local/MonetDB/bin/mclient -lsql --database=myschema -t -s "COPY 20603 records into test_table from '/tmp/testdata.txt' using delimiters '\t','\n','\"' null as '';" user(root):monetdb password: MAPI = (monetdb) /usr/local/MonetDB/var/MonetDB5/dbfarm/mapi_socket QUERY = COPY 20603 records into test_table from '/tmp/testdata.txt' using delimiters '\t','\n','"' null as ''; ERROR = !SQLException:importTable:value 'failed to import table Timer 790.415 msec
The table def is: CREATE TABLE creative_dim ( id bigint NOT NULL, headline varchar(100) DEFAULT NULL )
How can I go about loading this file? Ideally, I would like to get these characters loaded as is.
I saw that mclient has a -E flag that accepts an encoding type. However, I'm not sure what to set it to for a file with ascii / hex data.
A complicating factor is that I don't know all the hex characters that are in my file. I luckily spotted the <99>/tm one above as it was one of the first lines in the file. But after replacing all instances of that character with the empty string, I still get a similar error to what I pasted above. I manually paged through the file and found different hex characters. It isn't practical for me to page through an entire file and monetdb doesn't tell me which line number the problem is on.
By the way, I am using MonetDB v5.20.5.
Thanks for the help!
The MonetDB server only accepts text in the UTF-8 encoding. Mclient can translate from encodings to UTF-8 using the -E option. However, when you use the COPY INTO command with an actual file name, the file is read directly by the server and not by mclient. This means that such a file must be encoded using UTF-8. Your file seems to be encoded in one of the Windows formats such as cp1252. If you want to load this file using a COPY INTO command, you will need to convert the file to UTF-8 using e.g. the iconv program: iconv -f cp1252 -t utf-8 < /tmp/testdata.txt > /tmp/testdata-utf8.txt and then use that instead. Another thing you can try is to do this (all on one line): mclient -lsql --database=myschema -t -i -s "COPY 20603 records into test_table from stdin using delimiters '\t','\n','\"' null as ''" < /tmp/testdata.txt I added a -i option (that's important) to your command, and I removed the semicolon (I'm not sure that's needed, but it certainly won't harm--mclient adds one) and then mclient can read the file and convert it for you. I hope this helps. -- Sjoerd Mullender
Sjoerd Mullender-2 wrote:
mclient -lsql --database=myschema -t -i -s "COPY 20603 records into test_table from stdin using delimiters '\t','\n','\"' null as ''" < /tmp/testdata.txt
Did you mean to add the -E flag to the command above? I get errors if I don't put it in (mclient spits out a few rows of data and tells me it "failed to import table". I tried it with that flag and it partially worked: [root@bidbtest ~]# mclient -E cp1252 -lsql --database=myschema -t -i -s "COPY 20603 records into test_table from stdin using delimiters '\t','\n','\"' null as ''" < /tmp/testdata.txt user:monetdb password: [ 20547 ] Timer 5824.428 msec The obvious issue is that I am missing 56 records from my file - each of those records I'm missing have hex data in them. This didn't work for all of my files, however. For some larger files, I get [root@bidbtest import]# mclient -E cp1252 -lsql --database=myschema -t -i -s "COPY 2642786 records into a_click from stdin using delimiters '\t','\n','\"' null as '\N'" < /tmp/a_click.tbl user:monetdb password: Timer 7.856 msec MAPI = (monetdb) /usr/local/MonetDB/var/MonetDB5/dbfarm/mapi_socket ...(some data rows are spit out)... ERROR = !SQLException:importTable:value 'http://neuilly\\frintranetguest:inxo-03del@frintranet.fr.deloitte.co...' from line 13494 field 6 not inserted, expecting type str !failed to import table !current transaction is aborted (please ROLLBACK) Timer 4636.574 msec MAPI = (monetdb) /usr/local/MonetDB/var/MonetDB5/dbfarm/mapi_socket ...(hundreds of data rows are spit out)... I checked the file on line 13494 and field 6 is of type string (3rd line, below). ... 2 127228269 0D4B69FA512A11DFAE5196C7A77DC2E3 0DC2EC5AA6CB 2062958 http://www.islamweb.net/services/currency/default.htm \N 2 127228667 2276781A473E11DF9CCA79E3CD83EF10 0DBF68E35593 2062957 http://www.sayaraat.com/change.htm \N 2 127228546 C4B18426512F11DF85C472B3CD83EF10 0DC2F5E9CBB3 2062956 http://neuilly\\frintranetguest:inxo-03del@frintranet.fr.deloitte.co... \N ... I originally thought that there was a problem with the "\\" in the file, so I used sed to delete all lines with "\\". However, I get a similar error message as above. Sjoerd Mullender-2 wrote:
Your file seems to be encoded in one of the Windows formats such as cp1252. If you want to load this file using a COPY INTO command, you will need to convert the file to UTF-8 using e.g. the iconv program: iconv -f cp1252 -t utf-8 < /tmp/testdata.txt > /tmp/testdata-utf8.txt and then use that instead.
You are correct - I dug around and found out that the source for this data was a mysql table which has its charset set to latin1 - which is the same as cp1252 (http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html) I also tried the iconv method with little success. [root@bidbtest import]# iconv -f cp1252 -t utf-8//IGNORE -c < a_click.tbl > a_click.tbl_cp1252 [root@bidbtest import]# /usr/local/MonetDB/bin/mclient -lsql --database=myschema -t -s "COPY 2642786 records into a_click from '/tmp/a_click.tbl_cp1252' using delimiters '\t','\n','\"' null as '\N';" user(root):monetdb password: MAPI = (monetdb) /usr/local/MonetDB/var/MonetDB5/dbfarm/mapi_socket QUERY = COPY 2642786 records into a_click from '/tmp/a_click.tbl_cp1252' using delimiters '\t','\n','"' null as '\N'; ERROR = !SQLException:importTable:missing separator ' ' line 159725 field 5 !failed to import table Timer 368.344 msec Again, line 159725 didn't seem like it had any mistakes in the data (3rd line, below) ... 2 127150530 EC33DA849AB02E4F4F7D5A72EAA680BD 0DB70FA4FF8A 1916744 http://translate.googleusercontent.com/translate_c?hl=pt-BR&sl=en&u... \N 2 127148183 7A4C09B627E021CCCFA0ADD15E7FAA43 0DB6B3FF70C3 1916743 http://www.myfootprint.org/en/visitor_information/ \N 2 127150355 C596B7337BD4A8BF1690269D1584CA87 0DB708CF4917 1916742 http://www.portugal-linha.pt/cambios/ \N ... -- View this message in context: http://old.nabble.com/Copy-into-issues-with-mixed-ascii---hex-data-tp3023894... Sent from the monetdb-users mailing list archive at Nabble.com.
On 2010-11-19 20:22, paulr135 wrote:
Sjoerd Mullender-2 wrote:
mclient -lsql --database=myschema -t -i -s "COPY 20603 records into test_table from stdin using delimiters '\t','\n','\"' null as ''" < /tmp/testdata.txt
Did you mean to add the -E flag to the command above? I get errors if I
Yes.
don't put it in (mclient spits out a few rows of data and tells me it "failed to import table". I tried it with that flag and it partially worked: [root@bidbtest ~]# mclient -E cp1252 -lsql --database=myschema -t -i -s "COPY 20603 records into test_table from stdin using delimiters '\t','\n','\"' null as ''" < /tmp/testdata.txt user:monetdb password: [ 20547 ] Timer 5824.428 msec
The obvious issue is that I am missing 56 records from my file - each of those records I'm missing have hex data in them.
This didn't work for all of my files, however. For some larger files, I get [root@bidbtest import]# mclient -E cp1252 -lsql --database=myschema -t -i -s "COPY 2642786 records into a_click from stdin using delimiters '\t','\n','\"' null as '\N'" < /tmp/a_click.tbl user:monetdb password: Timer 7.856 msec MAPI = (monetdb) /usr/local/MonetDB/var/MonetDB5/dbfarm/mapi_socket ...(some data rows are spit out)... ERROR = !SQLException:importTable:value 'http://neuilly\\frintranetguest:inxo-03del@frintranet.fr.deloitte.co...' from line 13494 field 6 not inserted, expecting type str !failed to import table !current transaction is aborted (please ROLLBACK) Timer 4636.574 msec MAPI = (monetdb) /usr/local/MonetDB/var/MonetDB5/dbfarm/mapi_socket ...(hundreds of data rows are spit out)...
I checked the file on line 13494 and field 6 is of type string (3rd line, below). ... 2 127228269 0D4B69FA512A11DFAE5196C7A77DC2E3 0DC2EC5AA6CB 2062958 http://www.islamweb.net/services/currency/default.htm \N 2 127228667 2276781A473E11DF9CCA79E3CD83EF10 0DBF68E35593 2062957 http://www.sayaraat.com/change.htm \N 2 127228546 C4B18426512F11DF85C472B3CD83EF10 0DC2F5E9CBB3 2062956 http://neuilly\\frintranetguest:inxo-03del@frintranet.fr.deloitte.co... \N ...
I originally thought that there was a problem with the "\\" in the file, so I used sed to delete all lines with "\\". However, I get a similar error message as above.
This error you would also get if the string is too long, i.e. longer than the number of characters that are allowed according to the database schema. I would agree if you said that the error message lacks in clarity.
Sjoerd Mullender-2 wrote:
Your file seems to be encoded in one of the Windows formats such as cp1252. If you want to load this file using a COPY INTO command, you will need to convert the file to UTF-8 using e.g. the iconv program: iconv -f cp1252 -t utf-8 < /tmp/testdata.txt > /tmp/testdata-utf8.txt and then use that instead.
You are correct - I dug around and found out that the source for this data was a mysql table which has its charset set to latin1 - which is the same as cp1252 (http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html)
I also tried the iconv method with little success. [root@bidbtest import]# iconv -f cp1252 -t utf-8//IGNORE -c < a_click.tbl > a_click.tbl_cp1252 [root@bidbtest import]# /usr/local/MonetDB/bin/mclient -lsql --database=myschema -t -s "COPY 2642786 records into a_click from '/tmp/a_click.tbl_cp1252' using delimiters '\t','\n','\"' null as '\N';" user(root):monetdb password: MAPI = (monetdb) /usr/local/MonetDB/var/MonetDB5/dbfarm/mapi_socket QUERY = COPY 2642786 records into a_click from '/tmp/a_click.tbl_cp1252' using delimiters '\t','\n','"' null as '\N'; ERROR = !SQLException:importTable:missing separator ' ' line 159725 field 5 !failed to import table Timer 368.344 msec
Again, line 159725 didn't seem like it had any mistakes in the data (3rd line, below) ... 2 127150530 EC33DA849AB02E4F4F7D5A72EAA680BD 0DB70FA4FF8A 1916744 http://translate.googleusercontent.com/translate_c?hl=pt-BR&sl=en&u... \N 2 127148183 7A4C09B627E021CCCFA0ADD15E7FAA43 0DB6B3FF70C3 1916743 http://www.myfootprint.org/en/visitor_information/ \N 2 127150355 C596B7337BD4A8BF1690269D1584CA87 0DB708CF4917 1916742 http://www.portugal-linha.pt/cambios/ \N ...
-- Sjoerd Mullender
participants (2)
-
paulr135
-
Sjoerd Mullender