How do varchar/text columns work?
Hi,

I have been working on MonetDB for quite some time and am trying to understand how string columns work, so that I can (try to) import string columns as fast as possible. The tool is external and generates MonetDB binary-compatible columns (and may also be used to optimize tables in offline mode).

From reading the code, here is my understanding. A string column is kept in two parts, x.tail and x.theap (where x is a number). x.tail contains pointers into x.theap. x.theap has two regions: the first contains an 8 KB hash table, used to detect collisions, and the second contains the string entries. Each entry consists of 8 zero bytes, an 8-byte hash, and a variable-length, null-terminated string aligned on an 8-byte boundary (I am talking about the x64 architecture). The next string follows, and so on.

The problem I am facing is that the tool works and generates valid columns only as long as x.tail contains pointers addressing x.theap positions less than 255. The moment the tail entries need 2 bytes, MonetDB gets confused. As I understand it, tail pointers are variable-sized for efficiency: if there are only a few small strings, the pointers fit in less memory and do not waste 8 bytes per entry. However, I am unable to understand how MonetDB figures out which entry size is currently in use:

  - for small string heaps, it is 1 byte (tested)
  - for somewhat bigger string heaps, it is 2 bytes (tested)
  - for large string tables, it is 4 bytes (tested)
  - for very large (>4 GB) string tables, it can be 8 bytes??? (this case has not been tested)

I understand that it has something to do with bat->T->shift and bat->T->width, but it still doesn't work, so I am definitely missing something.

So, my questions are:

1) Is it an efficient method (not talking about space efficiency) to have var-sized tail pointers? It might involve lots of casting and so on during calculations.
2) How can I disable var-sized tail pointers (if that is possible), so that all pointers are 4 or 8 bytes long? Is there any compile-time or runtime option?
3) If 2) is not possible, how does MonetDB calculate the tail pointer width?

For example, I am using the following code to initialize a new BAT with my tail and theap pointers:

    cap = num_of_tail_entries;
    bs = BATcreatedesc(TYPE_void, TYPE_str, 1);
    bn = &bs->B;
    BATsetdims(bn);
    BATkey(bn, TRUE);
    BATsetcapacity(bn, cap);
    BATsetcount(bn, cap);

    bn->tsorted = 0;
    bn->trevsorted = 0;
    bn->tdense = 0;
    bn->tkey = 0;
    bn->tvarsized = 1;

    // try to set shift to 1 and thus width to 2 bytes
    //bn->T->shift=1; // this property seems to be used to calculate tail ptr size
    //bn->T->width=2; // this property seems to be used to calculate tail ptr size
    bn->T->varsized = 1;

Then I load both heap and theap.

The above code works, and I am able to query the table, as long as the tail pointers are 1 byte long. However, as soon as the pointer size is 2 bytes, it fails, so I must need to set something in the BAT initialization to tell it the tail pointer size.

Any clues are welcome (if more information is required, please ask).

Thanks for any help.

Rgds,
mike
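The variable-width tail entries described above can be illustrated with a small standalone C sketch. It is not MonetDB code: the helper names (offset_width, offset_shift, put_offset, get_offset) are hypothetical, and it assumes the width is simply the smallest of 1, 2, 4 or 8 bytes that can hold the largest heap offset, with the shift being log2 of the width. MonetDB's real encoding may differ, for example by storing offsets relative to a base position, so the exact thresholds here are an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: smallest entry width (bytes) that can hold max_offset. */
static unsigned offset_width(uint64_t max_offset)
{
    if (max_offset <= UINT8_MAX)
        return 1;
    if (max_offset <= UINT16_MAX)
        return 2;
    if (max_offset <= UINT32_MAX)
        return 4;
    return 8;
}

/* Hypothetical helper: shift = log2(width), so entry i starts at i << shift. */
static unsigned offset_shift(unsigned width)
{
    unsigned shift = 0;
    while ((1u << shift) < width)
        shift++;
    return shift;
}

/* Store heap offset `off` as entry i of a tail array with the given width. */
static void put_offset(void *tail, unsigned width, size_t i, uint64_t off)
{
    unsigned char *p = (unsigned char *)tail + i * width;

    assert(offset_width(off) <= width); /* the offset must fit in the entry */
    switch (width) {
    case 1: { uint8_t v = (uint8_t)off; memcpy(p, &v, sizeof v); break; }
    case 2: { uint16_t v = (uint16_t)off; memcpy(p, &v, sizeof v); break; }
    case 4: { uint32_t v = (uint32_t)off; memcpy(p, &v, sizeof v); break; }
    default: memcpy(p, &off, sizeof off); break;
    }
}

/* Read entry i back, widening it to 64 bits regardless of the stored width. */
static uint64_t get_offset(const void *tail, unsigned width, size_t i)
{
    const unsigned char *p = (const unsigned char *)tail + i * width;

    switch (width) {
    case 1: { uint8_t v; memcpy(&v, p, sizeof v); return v; }
    case 2: { uint16_t v; memcpy(&v, p, sizeof v); return v; }
    case 4: { uint32_t v; memcpy(&v, p, sizeof v); return v; }
    default: { uint64_t v; memcpy(&v, p, sizeof v); return v; }
    }
}
```

The point of the sketch is that the reader of the tail file cannot infer the entry width from the bytes alone: writer and reader have to agree on one width (and the matching shift) for the whole column, which is why the BAT descriptor has to carry that information.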
Problem solved: it was a typo that I kept missing. Fixing it solved the problem.

Thanks,
mike

(In reply to Mike's message of 09/16/2014 09:20 PM.)
_______________________________________________
developers-list mailing list
developers-list@monetdb.org
https://www.monetdb.org/mailman/listinfo/developers-list