On 2013-12-06 20:01, Lee Hachadoorian wrote:
> On Fri, Dec 6, 2013 at 5:50 AM, Sjoerd Mullender <sjoerd@acm.org
The -E option overrides the setting mclient takes from the environment> <mailto:sjoerd@acm.org>> wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 2013-12-06 10:38, Lee Hachadoorian wrote:
> > I need to bulk load some files with characters like ñ and é. I
> > would do this using the LATIN1 encoding. I read that bulk loading
> > from file assumes UTF8, but bulk loading from STDIN will use
> > encoding of mclient. I would like to know:
> >
> > (a) How do I determine and change the client encoding? (b) What is
> > an example for piping STDIN to mclient?
> >
> > Regarding the latter, I tried:
> >
> > cat /my/file.csv | mclient -d mydb -u monetdb -s "COPY INTO mytable
> > FROM (STDIN) USING DELIMITERS ',', '\n' , '"' NULL AS '';"
> >
> > I wasn't sure how the embedded " would be handled, but the
> > terminal basically became unresponsive until I ctl-C'ed out of it.
> > Note the the file was similar in size to several others (without
> > non-UTF8 characters) that loaded in seconds within mclient.
> >
> > Best, --Lee
> >
>
> See [1] for recipes for bulk loading, including the relevant
> invocation of mclient.
> The default character set that mclient uses depends on the
> environment. You can specify a specific character set using the -E
> option. See the manual at [2].
>
> Quoting commands is an art. Look at the command in [1] how you can do
> it correctly. Hint: when quoting using double quotes ("), you need to
> use a backslash (\) to quote embedded backslashes and double quotes.
>
> [1]
> http://www.monetdb.org/Documentation/Cookbooks/SQLrecipes/LoadingBulkData
> [2] http://www.monetdb.org/Documentation/mclient-man-page
>
> - --
> Sjoerd Mullender
>
>
> Thanks Sjoerd. OK, now the file is getting to mclient, and I am at least
> at the point where I am getting the same error message as when I COPY
> ... FROM 'file' within mclient.
>
> $ mclient -E LATIN9 -d universe -u monetdb -s "COPY INTO
> acs2010_5yr.geoheader FROM STDIN USING DELIMITERS ',', '\\n', '\"'" - <
> /data/lee/postgres_upload/acs2010_5yr/All_Geographies_Not_Tracts_Block_Groups/g20105az.csv
> password:
> value from line 10560 field 50 not inserted, expecting type varchar(200)
> failed to import table
> current transaction is aborted (please ROLLBACK)
> syntax error, unexpected sqlINT in: "04"
>
> I tried it both with the terminal's character encoding set to UTF8 and
> set to LATIN9. Does that make a difference? When I sed the problematic
> line with the terminal in LATIN9, it is human-readable, while in UTF8 it
> is not.
(i.e. the terminal setting).
The first message is important (value from line 10560 field 50 not
inserted, expecting type varchar(200)). The second message about the
int "04" is just because the server reads the next line that happens to
start with 04 after it encountered the error and decides that that is
not valid SQL.
I turns out you also need to specify the -i option to mclient for it to
translate the input using the encoding. Without -i, mclient takes a
more efficient path, but that path doesn't involve translating the
character encoding. It can be argued that's a bug.