MonetDB: bulk insert failed due to weird characters
Hi, I'm using "COPY INTO" to insert some data into MonetDB. The csv that I'm copying from has some weird characters like \367\251\240 in some string fields and it broke the insertion. The error that I got is: !value from line x field y not inserted, expecting type clob Does anyone know how to get around this problem? The version I'm using is MonetDB Database Server Toolkit v1.1 (Oct2014-SP4). Thanks. -- Tri Vuong
Hi, internally, MonetDB only handles UTF-8 encoded (textual) data. Thus, as documented [1], for bulk data load from a file, MonetDB requires the file to use UTF-8 encoding, while for bulk data load via STDIN, MonetDB requires the data to use the same encoding as mclient (see also [2]). If your data is not in the required encoding, you'd first need to convert it before loading it into MonetDB. Best, Stefan [1] https://www.monetdb.org/Documentation/Manuals/SQLreference/CopyInto [2] https://www.monetdb.org/Documentation/mclient-man-page ----- On Jul 23, 2015, at 6:42 PM, Tri Vuong trivektor@gmail.com wrote:
Hi,
I'm using "COPY INTO" to insert some data into MonetDB. The csv that I'm copying from has some weird characters like \367\251\240 in some string fields and it broke the insertion. The error that I got is:
!value from line x field y not inserted, expecting type clob
Does anyone know how to get around this problem? The version I'm using is MonetDB Database Server Toolkit v1.1 (Oct2014-SP4). Thanks.
-- Tri Vuong
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
A less cryptic error message would be useful here though... perhaps you[1] should consider: - Mention that clob stands for "string with unbounded length" - Mention that the reason the data is not a proper clob is an encoding problem - Mention which encoding is expected. - (Maybe) Mention the position in the octet stream at which the decoding failure occurred - (Maybe) Provide the context of decoded characters, e.g. "The quick brown fox ... etc ... lazy d\xDE\xAD\xBE\xEF" - Repeat Stefan's explanation on this wiki page: https://www.monetdb.org/Documentation/Manuals/SQLreference/BuiltinTypes Eyal [1] - Soon I'll need to start writing "we", I suppose, and maybe start owning a few of these issues :-) On 23/07/2015 22:41, Stefan Manegold wrote:
Hi,
internally, MonetDB only handles UTF-8 encoded (textual) data.
Thus, as documented [1], for bulk data load from a file, MonetDB requires the file to use UTF-8 encoding, while for bulk data load via STDIN, MonetDB requires the data to use the same encoding as mclient (see also [2]).
If your data is not in the required encoding, you'd first need to convert it before loading it into MonetDB.
Best, Stefan
[1] https://www.monetdb.org/Documentation/Manuals/SQLreference/CopyInto [2] https://www.monetdb.org/Documentation/mclient-man-page
----- On Jul 23, 2015, at 6:42 PM, Tri Vuong trivektor@gmail.com wrote:
Hi,
I'm using "COPY INTO" to insert some data into MonetDB. The csv that I'm copying from has some weird characters like \367\251\240 in some string fields and it broke the insertion. The error that I got is:
!value from line x field y not inserted, expecting type clob
Does anyone know how to get around this problem? The version I'm using is MonetDB Database Server Toolkit v1.1 (Oct2014-SP4). Thanks.
-- Tri Vuong
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
A less cryptic error message would be useful here though... perhaps you[1] should consider:
Indeed there is room for improvement of the error message; however, not everything is trivial and straight forward, I'll give some hits inlined below
- Mention that clob stands for "string with unbounded length"
we give that standard (abbreviated) type name, not the type description.
- Mention that the reason the data is not a proper clob is an encoding problem
well, that's not necessarily clear; it could also be arbitrary binary data ...
- Mention which encoding is expected.
We could do that, but the server always expects UTF-8, so it will alwasy say 'UTF-8 expected'; the server is not aware if the client does conversion from locale encoding to UTF-8.
- (Maybe) Mention the position in the octet stream at which the decoding failure occurred
Yes, I'd love that, also for many other bulk-loading error messages
- (Maybe) Provide the context of decoded characters, e.g. "The quick brown fox ... etc ... lazy d\xDE\xAD\xBE\xEF"
We could consider that --- but then we have to show the non-UTF-8 characters as hex or octal values, and I'm sure users will complain that their provided data does not contain these (literal) character sequences ...
- Repeat Stefan's explanation on this wiki page:
https://www.monetdb.org/Documentation/Manuals/SQLreference/BuiltinTypes
indeed --- at least mention that strings are UTF-8 strings
Eyal
[1] - Soon I'll need to start writing "we", I suppose, and maybe start owning a few of these issues :-)
Looking forward to ;-) Stefan
On 23/07/2015 22:41, Stefan Manegold wrote:
Hi,
internally, MonetDB only handles UTF-8 encoded (textual) data.
Thus, as documented [1], for bulk data load from a file, MonetDB requires the file to use UTF-8 encoding, while for bulk data load via STDIN, MonetDB requires the data to use the same encoding as mclient (see also [2]).
If your data is not in the required encoding, you'd first need to convert it before loading it into MonetDB.
Best, Stefan
[1] https://www.monetdb.org/Documentation/Manuals/SQLreference/CopyInto [2] https://www.monetdb.org/Documentation/mclient-man-page
----- On Jul 23, 2015, at 6:42 PM, Tri Vuong trivektor@gmail.com wrote:
Hi,
I'm using "COPY INTO" to insert some data into MonetDB. The csv that I'm copying from has some weird characters like \367\251\240 in some string fields and it broke the insertion. The error that I got is:
!value from line x field y not inserted, expecting type clob
Does anyone know how to get around this problem? The version I'm using is MonetDB Database Server Toolkit v1.1 (Oct2014-SP4). Thanks.
-- Tri Vuong
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
On 23/07/15 22:40, Stefan Manegold wrote:
- (Maybe) Mention the position in the octet stream at which the decoding failure occurred
Yes, I'd love that, also for many other bulk-loading error messages
That should have been covered in the release candidate as the rejects() table containing detailed context information about all failed row.
- (Maybe) Provide the context of decoded characters, e.g. "The quick brown fox ... etc ... lazy d\xDE\xAD\xBE\xEF"
We could consider that --- but then we have to show the non-UTF-8 characters as hex or octal values, and I'm sure users will complain that their provided data does not contain these (literal) character sequences ...
- Repeat Stefan's explanation on this wiki page:
https://www.monetdb.org/Documentation/Manuals/SQLreference/BuiltinTypes
indeed --- at least mention that strings are UTF-8 strings
regards, Martin
So I've converted all of the strings in my csv to UTF-8 but Monet was still
complaining
!value
'\357\322\226\257\240bO\271\23340\344j\231I06Wdn\277fx\334-\320\366\323'
from line 43255 field 12 not inserted, expecting type clob
I think these are valid UTF-8 characters ïÒ¯ bO¹40äjI06Wdn¿fxÜ-ÐöÓ. No?
On Thu, Jul 23, 2015 at 12:41 PM, Stefan Manegold
Hi,
internally, MonetDB only handles UTF-8 encoded (textual) data.
Thus, as documented [1], for bulk data load from a file, MonetDB requires the file to use UTF-8 encoding, while for bulk data load via STDIN, MonetDB requires the data to use the same encoding as mclient (see also [2]).
If your data is not in the required encoding, you'd first need to convert it before loading it into MonetDB.
Best, Stefan
[1] https://www.monetdb.org/Documentation/Manuals/SQLreference/CopyInto [2] https://www.monetdb.org/Documentation/mclient-man-page
----- On Jul 23, 2015, at 6:42 PM, Tri Vuong trivektor@gmail.com wrote:
Hi,
I'm using "COPY INTO" to insert some data into MonetDB. The csv that I'm copying from has some weird characters like \367\251\240 in some string fields and it broke the insertion. The error that I got is:
!value from line x field y not inserted, expecting type clob
Does anyone know how to get around this problem? The version I'm using is MonetDB Database Server Toolkit v1.1 (Oct2014-SP4). Thanks.
-- Tri Vuong
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- Tri Vuong
If I encode your characters in UTF-8 I get (in hex): %D6%B3%C2%AF%D6%B3%E2%80%99%D6%B2%E2%80%93%D6%B2%C2%AF%20bO%D6%B2%C2%B9%D6%B2%E2%80%BA40%D6%B3%E2%82%AAj%D6%B2%E2%84%A2I06Wdn%D6%B2%C2%BFfx%D6%B3%C2%9C-%D6%B3%C2%90%D6%B3%C2%B6%D6%B3%E2%80%9C the start of this, in octal, are the following octets: 326 263 302 257 326 263 342 which are not quite what you've listed. Eyal On 24/07/2015 00:23, Tri Vuong wrote:
So I've converted all of the strings in my csv to UTF-8 but Monet was still complaining
!value '\357\322\226\257\240bO\271\23340\344j\231I06Wdn\277fx\334-\320\366\323' from line 43255 field 12 not inserted, expecting type clob
I think these are valid UTF-8 charactersֲ ֳ¯ֳ’ֲ–ֲ¯ bOֲ¹ֲ›40ֳ₪jֲ™I06Wdnֲ¿fxֳ-ֳֳ¶ֳ“. No?
On Thu, Jul 23, 2015 at 12:41 PM, Stefan Manegold
mailto:Stefan.Manegold@cwi.nl> wrote: Hi,
internally, MonetDB only handles UTF-8 encoded (textual) data.
Thus, as documented [1], for bulk data load from a file, MonetDB requires the file to use UTF-8 encoding, while for bulk data load via STDIN, MonetDB requires the data to use the same encoding as mclient (see also [2]).
If your data is not in the required encoding, you'd first need to convert it before loading it into MonetDB.
Best, Stefan
[1] https://www.monetdb.org/Documentation/Manuals/SQLreference/CopyInto [2] https://www.monetdb.org/Documentation/mclient-man-page
----- On Jul 23, 2015, at 6:42 PM, Tri Vuong trivektor@gmail.com mailto:trivektor@gmail.com wrote:
> Hi, > > > > I'm using "COPY INTO" to insert some data into MonetDB. The csv that I'm copying > from has some weird characters like \367\251\240 in some string fields and it > broke the insertion. The error that I got is: > > !value from line x field y not inserted, expecting type clob > > Does anyone know how to get around this problem? The version I'm using is > MonetDB Database Server Toolkit v1.1 (Oct2014-SP4). Thanks. > > -- > Tri Vuong > > _______________________________________________ > users-list mailing list > users-list@monetdb.org mailto:users-list@monetdb.org > https://www.monetdb.org/mailman/listinfo/users-list
-- | Stefan.Manegold@CWI.nl | DB Architecturesֲ ֲ (DA) | | www.CWI.nl/~manegold/ http://www.CWI.nl/~manegold/ֲ | Science Park 123 (L321) | | +31 (0)20 592-4212 tel:%2B31%20%280%2920%20592-4212ֲ ֲ ֲ | 1098 XG Amsterdamֲ (NL) | _______________________________________________ users-list mailing list users-list@monetdb.org mailto:users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- Tri Vuong
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
participants (4)
-
Eyal Rozenberg
-
Martin Kersten
-
Stefan Manegold
-
Tri Vuong