Hello,
I had a look at the code. For each UTF-8 character it performs a hash lookup in the origin-case BAT and finds the corresponding character in the destination-case BAT.
However, this is overkill for ASCII characters:
- for letters, [A-Z] + 32 = [a-z] (and [a-z] - 32 = [A-Z] for toUpper)
- all other ASCII characters stay the same
Since single-byte characters are very frequent in most texts, it makes sense to invest in a simple test and perform the hash lookup only for multi-byte characters (see the sketch below).
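To make the idea concrete, here is a minimal sketch in C. The function names are mine, not the ones in the actual code, and the multi-byte branch just copies bytes, standing in for the real hash lookup between the case BATs:

#include <stddef.h>

/* Number of bytes in a UTF-8 sequence, derived from its first byte. */
static size_t
utf8_len(unsigned char c)
{
    if (c < 0xC0)
        return 1;   /* ASCII (or a stray continuation byte) */
    if (c < 0xE0)
        return 2;
    if (c < 0xF0)
        return 3;
    return 4;
}

/* Lower-case src into dst (dst at least len bytes); returns bytes written.
 * The multi-byte branch is a placeholder for the hash lookup from the
 * origin-case BAT into the destination-case BAT. */
static size_t
tolower_ascii_fastpath(const unsigned char *src, size_t len,
                       unsigned char *dst)
{
    size_t i = 0, o = 0;

    while (i < len) {
        unsigned char c = src[i];
        if (c < 0x80) {
            /* single byte: only [A-Z] changes, by adding 32 */
            dst[o++] = (c >= 'A' && c <= 'Z') ? (unsigned char)(c + 32) : c;
            i++;
        } else {
            /* multi-byte sequence: fall back to the (expensive) lookup */
            size_t n = utf8_len(c);
            while (n-- > 0 && i < len)
                dst[o++] = src[i++];
        }
    }
    return o;
}

The extra test is a single compare-and-branch per character, which is why I expect it to cost little even when the fast path is rarely taken.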
I tested this on 831 MB (over 360K tuples) of standard English text:
- original str.toLower/str.toUpper: 101 seconds (8 MB/s)
- modified version: 3.6 seconds (230 MB/s)
I guess that even for text dominated by multi-byte characters, the added test wouldn't hurt much.
A side observation perhaps worth investigating is why that hash lookup is so expensive in the first place.
Please find my patch attached.
Roberto