Bulk loading file with non-UTF8 characters
I need to bulk load some files with characters like ñ and é. I would do this using the LATIN1 encoding. I read that bulk loading from file assumes UTF8, but bulk loading from STDIN will use encoding of mclient. I would like to know: (a) How do I determine and change the client encoding? (b) What is an example for piping STDIN to mclient? Regarding the latter, I tried: cat /my/file.csv | mclient -d mydb -u monetdb -s "COPY INTO mytable FROM (STDIN) USING DELIMITERS ',', '\n' , '"' NULL AS '';" I wasn't sure how the embedded " would be handled, but the terminal basically became unresponsive until I ctl-C'ed out of it. Note the the file was similar in size to several others (without non-UTF8 characters) that loaded in seconds within mclient. Best, --Lee -- Lee Hachadoorian Assistant Professor in Geography, Dartmouth College http://geospatial.commons.gc.cuny.edu http://freecity.commons.gc.cuny.edu
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2013-12-06 10:38, Lee Hachadoorian wrote:
I need to bulk load some files with characters like ñ and é. I would do this using the LATIN1 encoding. I read that bulk loading from file assumes UTF8, but bulk loading from STDIN will use encoding of mclient. I would like to know:
(a) How do I determine and change the client encoding? (b) What is an example for piping STDIN to mclient?
Regarding the latter, I tried:
cat /my/file.csv | mclient -d mydb -u monetdb -s "COPY INTO mytable FROM (STDIN) USING DELIMITERS ',', '\n' , '"' NULL AS '';"
I wasn't sure how the embedded " would be handled, but the terminal basically became unresponsive until I ctl-C'ed out of it. Note the the file was similar in size to several others (without non-UTF8 characters) that loaded in seconds within mclient.
Best, --Lee
See [1] for recipes for bulk loading, including the relevant invocation of mclient. The default character set that mclient uses depends on the environment. You can specify a specific character set using the -E option. See the manual at [2]. Quoting commands is an art. Look at the command in [1] how you can do it correctly. Hint: when quoting using double quotes ("), you need to use a backslash (\) to quote embedded backslashes and double quotes. [1] http://www.monetdb.org/Documentation/Cookbooks/SQLrecipes/LoadingBulkData [2] http://www.monetdb.org/Documentation/mclient-man-page - -- Sjoerd Mullender -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.15 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQCVAwUBUqGr8D7g04AjvIQpAQJuvAP/V1QHpkmSCw7LvU8/yny5pbPXlrbX4/08 kj/r0d/uy6GdpBj2NQSECmBtleBntDzU025fY57nqge6QnwjC97vFdQuunjSbC7s wTzA5CCICTh9zUP0vM87K0dN/GpNVbcCQBVG0haf492eF6Tn7Xh41g4I6+N6o8+D vAXpsveTSLc= =WXvD -----END PGP SIGNATURE-----
On Fri, Dec 6, 2013 at 5:50 AM, Sjoerd Mullender
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On 2013-12-06 10:38, Lee Hachadoorian wrote:
I need to bulk load some files with characters like ñ and é. I would do this using the LATIN1 encoding. I read that bulk loading from file assumes UTF8, but bulk loading from STDIN will use encoding of mclient. I would like to know:
(a) How do I determine and change the client encoding? (b) What is an example for piping STDIN to mclient?
Regarding the latter, I tried:
cat /my/file.csv | mclient -d mydb -u monetdb -s "COPY INTO mytable FROM (STDIN) USING DELIMITERS ',', '\n' , '"' NULL AS '';"
I wasn't sure how the embedded " would be handled, but the terminal basically became unresponsive until I ctl-C'ed out of it. Note the the file was similar in size to several others (without non-UTF8 characters) that loaded in seconds within mclient.
Best, --Lee
See [1] for recipes for bulk loading, including the relevant invocation of mclient. The default character set that mclient uses depends on the environment. You can specify a specific character set using the -E option. See the manual at [2].
Quoting commands is an art. Look at the command in [1] how you can do it correctly. Hint: when quoting using double quotes ("), you need to use a backslash (\) to quote embedded backslashes and double quotes.
[1] http://www.monetdb.org/Documentation/Cookbooks/SQLrecipes/LoadingBulkData [2] http://www.monetdb.org/Documentation/mclient-man-page
- -- Sjoerd Mullender
Thanks Sjoerd. OK, now the file is getting to mclient, and I am at least at the point where I am getting the same error message as when I COPY ... FROM 'file' within mclient. $ mclient -E LATIN9 -d universe -u monetdb -s "COPY INTO acs2010_5yr.geoheader FROM STDIN USING DELIMITERS ',', '\\n', '\"'" - < /data/lee/postgres_upload/acs2010_5yr/All_Geographies_Not_Tracts_Block_Groups/g20105az.csv password: value from line 10560 field 50 not inserted, expecting type varchar(200) failed to import table current transaction is aborted (please ROLLBACK) syntax error, unexpected sqlINT in: "04" I tried it both with the terminal's character encoding set to UTF8 and set to LATIN9. Does that make a difference? When I sed the problematic line with the terminal in LATIN9, it is human-readable, while in UTF8 it is not. At this point I could just use iconv on the file, but I'm would like to know how to get this working. Best, --Lee
On 2013-12-06 20:01, Lee Hachadoorian wrote:
On Fri, Dec 6, 2013 at 5:50 AM, Sjoerd Mullender
mailto:sjoerd@acm.org> wrote: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On 2013-12-06 10:38, Lee Hachadoorian wrote: > I need to bulk load some files with characters like ñ and é. I > would do this using the LATIN1 encoding. I read that bulk loading > from file assumes UTF8, but bulk loading from STDIN will use > encoding of mclient. I would like to know: > > (a) How do I determine and change the client encoding? (b) What is > an example for piping STDIN to mclient? > > Regarding the latter, I tried: > > cat /my/file.csv | mclient -d mydb -u monetdb -s "COPY INTO mytable > FROM (STDIN) USING DELIMITERS ',', '\n' , '"' NULL AS '';" > > I wasn't sure how the embedded " would be handled, but the > terminal basically became unresponsive until I ctl-C'ed out of it. > Note the the file was similar in size to several others (without > non-UTF8 characters) that loaded in seconds within mclient. > > Best, --Lee >
See [1] for recipes for bulk loading, including the relevant invocation of mclient. The default character set that mclient uses depends on the environment. You can specify a specific character set using the -E option. See the manual at [2].
Quoting commands is an art. Look at the command in [1] how you can do it correctly. Hint: when quoting using double quotes ("), you need to use a backslash (\) to quote embedded backslashes and double quotes.
[1] http://www.monetdb.org/Documentation/Cookbooks/SQLrecipes/LoadingBulkData [2] http://www.monetdb.org/Documentation/mclient-man-page
- -- Sjoerd Mullender
Thanks Sjoerd. OK, now the file is getting to mclient, and I am at least at the point where I am getting the same error message as when I COPY ... FROM 'file' within mclient.
$ mclient -E LATIN9 -d universe -u monetdb -s "COPY INTO acs2010_5yr.geoheader FROM STDIN USING DELIMITERS ',', '\\n', '\"'" - < /data/lee/postgres_upload/acs2010_5yr/All_Geographies_Not_Tracts_Block_Groups/g20105az.csv password: value from line 10560 field 50 not inserted, expecting type varchar(200) failed to import table current transaction is aborted (please ROLLBACK) syntax error, unexpected sqlINT in: "04"
I tried it both with the terminal's character encoding set to UTF8 and set to LATIN9. Does that make a difference? When I sed the problematic line with the terminal in LATIN9, it is human-readable, while in UTF8 it is not.
The -E option overrides the setting mclient takes from the environment (i.e. the terminal setting). The first message is important (value from line 10560 field 50 not inserted, expecting type varchar(200)). The second message about the int "04" is just because the server reads the next line that happens to start with 04 after it encountered the error and decides that that is not valid SQL. I turns out you also need to specify the -i option to mclient for it to translate the input using the encoding. Without -i, mclient takes a more efficient path, but that path doesn't involve translating the character encoding. It can be argued that's a bug.
At this point I could just use iconv on the file, but I'm would like to know how to get this working.
Best, --Lee
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- Sjoerd Mullender
On Fri, Dec 6, 2013 at 3:11 PM, Sjoerd Mullender
On 2013-12-06 20:01, Lee Hachadoorian wrote:
On Fri, Dec 6, 2013 at 5:50 AM, Sjoerd Mullender
mailto:sjoerd@acm.org> wrote: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On 2013-12-06 10:38, Lee Hachadoorian wrote: > I need to bulk load some files with characters like ñ and é. I > would do this using the LATIN1 encoding. I read that bulk loading > from file assumes UTF8, but bulk loading from STDIN will use > encoding of mclient. I would like to know: > > (a) How do I determine and change the client encoding? (b) What is > an example for piping STDIN to mclient? > > Regarding the latter, I tried: > > cat /my/file.csv | mclient -d mydb -u monetdb -s "COPY INTO mytable > FROM (STDIN) USING DELIMITERS ',', '\n' , '"' NULL AS '';" > > I wasn't sure how the embedded " would be handled, but the > terminal basically became unresponsive until I ctl-C'ed out of it. > Note the the file was similar in size to several others (without > non-UTF8 characters) that loaded in seconds within mclient. > > Best, --Lee >
See [1] for recipes for bulk loading, including the relevant invocation of mclient. The default character set that mclient uses depends on the environment. You can specify a specific character set using the -E option. See the manual at [2].
Quoting commands is an art. Look at the command in [1] how you can do it correctly. Hint: when quoting using double quotes ("), you need to use a backslash (\) to quote embedded backslashes and double quotes.
[1]
http://www.monetdb.org/Documentation/Cookbooks/SQLrecipes/LoadingBulkData
[2] http://www.monetdb.org/Documentation/mclient-man-page
- -- Sjoerd Mullender
Thanks Sjoerd. OK, now the file is getting to mclient, and I am at least at the point where I am getting the same error message as when I COPY ... FROM 'file' within mclient.
$ mclient -E LATIN9 -d universe -u monetdb -s "COPY INTO acs2010_5yr.geoheader FROM STDIN USING DELIMITERS ',', '\\n', '\"'" - <
/data/lee/postgres_upload/acs2010_5yr/All_Geographies_Not_Tracts_Block_Groups/g20105az.csv
password: value from line 10560 field 50 not inserted, expecting type varchar(200) failed to import table current transaction is aborted (please ROLLBACK) syntax error, unexpected sqlINT in: "04"
I tried it both with the terminal's character encoding set to UTF8 and set to LATIN9. Does that make a difference? When I sed the problematic line with the terminal in LATIN9, it is human-readable, while in UTF8 it is not.
The -E option overrides the setting mclient takes from the environment (i.e. the terminal setting). The first message is important (value from line 10560 field 50 not inserted, expecting type varchar(200)). The second message about the int "04" is just because the server reads the next line that happens to start with 04 after it encountered the error and decides that that is not valid SQL.
I turns out you also need to specify the -i option to mclient for it to translate the input using the encoding. Without -i, mclient takes a more efficient path, but that path doesn't involve translating the character encoding. It can be argued that's a bug.
That worked. Thanks Sjoerd. --Lee
participants (3)
-
Lee Hachadoorian
-
Sjoerd Mullender
-
Sjoerd Mullender