Drastic speed improvement for str.toLower() / str.toUpper
Hello, I had reported in https://www.monetdb.org/bugzilla/show_bug.cgi?id=3549 that string case conversion is very inefficient in MonetDB. I had a look at the code. For each UTF8 character it performs a hash lookup from the origin case bat and finds the corresponding character in the destination case bat. However, this is an overkill for ASCII characters: - for letters, [A-Z] + 32 = [a-z] - all other ASCII characters stay the same With the assumption that single-byte characters are very frequent in most texts, it makes sense to invest in a simple test and perform the hash lookup only for multi-byte characters. I tested this on 831MB (over 360K tuples) of standard English text: - original str.toLower/str.toUpper: 101 seconds (8 MB/s) - modified version: 3.6 seconds (230 MB/s) I guess that even when the text is highly multi-byte oriented the added test wouldn't hurt that much. A side-observation perhaps worth investigating is why that hash lookup is so expensive. Please find my patch in attachment. Roberto
A correction. By mistake, I had measured the original version with sequential_pipe and the modified version with default_pipe, thus not fair. However, now it becomes even more interesting. Original: - sequential_pipe: 101s (8MB/s) - default_pipe: 214s (3.9 MB/s) Modified: - sequential_pipe: 15s (55 MB/s) - default_pipe: 3.6s (230 MB/s) So parallel execution hurts the original version, while it speeds up the modified version. Roberto On 14 January 2016 at 17:34, Roberto Cornacchia < roberto.cornacchia@gmail.com> wrote:
Hello,
I had reported in https://www.monetdb.org/bugzilla/show_bug.cgi?id=3549 that string case conversion is very inefficient in MonetDB.
I had a look at the code. For each UTF8 character it performs a hash lookup from the origin case bat and finds the corresponding character in the destination case bat.
However, this is an overkill for ASCII characters: - for letters, [A-Z] + 32 = [a-z] - all other ASCII characters stay the same
With the assumption that single-byte characters are very frequent in most texts, it makes sense to invest in a simple test and perform the hash lookup only for multi-byte characters.
I tested this on 831MB (over 360K tuples) of standard English text: - original str.toLower/str.toUpper: 101 seconds (8 MB/s) - modified version: 3.6 seconds (230 MB/s)
I guess that even when the text is highly multi-byte oriented the added test wouldn't hurt that much.
A side-observation perhaps worth investigating is why that hash lookup is so expensive.
Please find my patch in attachment.
Roberto
I've just checked a change in that is similar to what you did here. I've changed the order in which the tests are done, since I think my order is more efficient. On 01/14/2016 05:34 PM, Roberto Cornacchia wrote:
Hello,
I had reported in https://www.monetdb.org/bugzilla/show_bug.cgi?id=3549 that string case conversion is very inefficient in MonetDB.
I had a look at the code. For each UTF8 character it performs a hash lookup from the origin case bat and finds the corresponding character in the destination case bat.
However, this is an overkill for ASCII characters: - for letters, [A-Z] + 32 = [a-z] - all other ASCII characters stay the same
With the assumption that single-byte characters are very frequent in most texts, it makes sense to invest in a simple test and perform the hash lookup only for multi-byte characters.
I tested this on 831MB (over 360K tuples) of standard English text: - original str.toLower/str.toUpper: 101 seconds (8 MB/s) - modified version: 3.6 seconds (230 MB/s)
I guess that even when the text is highly multi-byte oriented the added test wouldn't hurt that much.
A side-observation perhaps worth investigating is why that hash lookup is so expensive.
Please find my patch in attachment.
Roberto
_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list
-- Sjoerd Mullender
You're right, your version is slightly faster in my test set.
Roberto
On 14 January 2016 at 22:08, Sjoerd Mullender
I've just checked a change in that is similar to what you did here. I've changed the order in which the tests are done, since I think my order is more efficient.
On 01/14/2016 05:34 PM, Roberto Cornacchia wrote:
Hello,
I had reported in https://www.monetdb.org/bugzilla/show_bug.cgi?id=3549 that string case conversion is very inefficient in MonetDB.
I had a look at the code. For each UTF8 character it performs a hash lookup from the origin case bat and finds the corresponding character in the destination case bat.
However, this is an overkill for ASCII characters: - for letters, [A-Z] + 32 = [a-z] - all other ASCII characters stay the same
With the assumption that single-byte characters are very frequent in most texts, it makes sense to invest in a simple test and perform the hash lookup only for multi-byte characters.
I tested this on 831MB (over 360K tuples) of standard English text: - original str.toLower/str.toUpper: 101 seconds (8 MB/s) - modified version: 3.6 seconds (230 MB/s)
I guess that even when the text is highly multi-byte oriented the added test wouldn't hurt that much.
A side-observation perhaps worth investigating is why that hash lookup is so expensive.
Please find my patch in attachment.
Roberto
_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list
-- Sjoerd Mullender
_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list
participants (2)
-
Roberto Cornacchia
-
Sjoerd Mullender