hej stefan,
it was indeed more a general question. no direct problem. but thanks for the detailed answer.
we can indeed not avoid overallocation at index building. however, once such index tables are filled, nothing will be added anymore, and there would be enough time at that point for any kind of memory optimization. however, that seems not necessary according to what you just explained.
best -henning
Stefan Manegold wrote:
[I once again felt free to share this with the community ...]
Henning,
in case BAT capacities are significantly larger than their actual content (count), this might indeed have a negative influence on performance. (1) In case the BAT is memory-mapped when loaded, it "only" blocks some more address space than strictly necessary (no problem on 64-bit systems, potentially a problem on 32-bit systems); (2) in case the BAT is malloced when loaded, it also occupies some more memory than strictly necessary (a potential problem on both 64- & 32-bit systems).
However, unless there is some accurate estimation, it is often hard (or virtually impossible) to "guess" a BAT's size before filling it; hence, a "generous" initial size allocation is good to avoid expensive BAT extends.
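The trade-off described here can be illustrated with a small simulation. This is only a sketch in Python, not MonetDB code: the function name `fill_cost`, the doubling growth factor, and the assumption that every extend copies all existing entries are made up for illustration and do not reflect the actual allocation policy of the kernel.

```python
def fill_cost(n_items, initial_capacity, growth=2.0):
    """Append n_items to a growable array and report what it cost.

    Returns (extends, copied, final_capacity), where `copied` counts the
    entries moved under the pessimistic assumption that every extend
    relocates the whole array.
    """
    capacity = max(1, initial_capacity)
    count = 0
    extends = 0
    copied = 0
    for _ in range(n_items):
        if count == capacity:
            # Array is full: grow it and (hypothetically) copy everything.
            copied += count
            capacity = max(capacity + 1, int(capacity * growth))
            extends += 1
        count += 1
    return extends, copied, capacity


n = 100_000
for initial in (1024, n, 2 * n):
    extends, copied, capacity = fill_cost(n, initial)
    spare = capacity - n
    print(f"initial={initial:>7} extends={extends} copied={copied} spare={spare}")
```

A small initial guess pays with repeated extends and copying, an exact guess costs nothing extra, and a generous guess avoids extends at the price of spare capacity that remains after filling.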
In your case, I'm not sure which BATs you're talking about. The shredder-generated pre_* (actually rid_*) BATs need to be allocated before reading the document; hence, there is no knowledge about the number of nodes in the document, and as far as I can tell no trivial way to estimate this accurately. Hence, the shredder needs to guess something --- JanF can tell more, I guess...
In case of the TIJAH indices, I have no clue at all, how/where/when they are built and whether there might be better information available to not overallocate but allocate only just enough space. You or some of your colleagues in Twente should know all the details.
Finally, is there any concrete case where you actually experienced any problems due to "over-allocation", or are you just wondering?
Stefan
ps: in the case given below, the batsize just fits the BAT's capacity; only the count is smaller than the capacity (obviously, it cannot be larger) --- if you want/need to know why, you better ask him/her who allocated/created/filled the "tj_DFLT_FT_INDEX_size1" BAT ...
On Mon, Oct 15, 2007 at 02:36:46PM +0200, Henning Rode wrote:
hej stefan,
sorry that i did not answer earlier. i just wanted to report the actual sizes of pf/tijah indices in a paper. so that is done now.
still, i was asking myself whether it might have any influence on performance that BAT capacities are so much higher than the actual BAT counts. this is of course handy when we still want to add new entries, but once we have indexed a collection, we usually only query it.
in case of our "pre_size" BAT this difference between BATsize and BATdsksize can easily be 250MB or more.
best -henning
mil>var t := bat("tj_DFLT_FT_INDEX_size1");
mil>t.count().print();
+------------+
| 203091470  |
+------------+
mil>t.capacity().print();
+------------+
| 260898816  |
+------------+
mil>t.batsize().print();
+------------+
| 1043599360 |
+------------+
mil>t.batdsksize().print();
+------------+
| 812366848  |
+------------+
mil>var x := t.copy();
mil>x.count().print();
+------------+
| 203091470  |
+------------+
mil>x.capacity().print();
+------------+
| 260898816  |
+------------+
mil>x.batsize().print();
+------------+
| 1043599360 |
+------------+
mil>x.batdsksize().print();
+------------+
| 812366848  |
+------------+
mil>t.access(BAT_READ);
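To put the numbers from this session in perspective, the over-allocation can be worked out with a few lines of arithmetic. The Python sketch below is only illustrative: it assumes that batsize() and batdsksize() report bytes and that a single fixed-width column dominates the size, neither of which is confirmed in this thread.

```python
# Values copied from the MIL session for "tj_DFLT_FT_INDEX_size1".
count = 203_091_470        # entries actually stored
capacity = 260_898_816     # entries the BAT has room for
batsize = 1_043_599_360    # reported size (assumed: bytes)
batdsksize = 812_366_848   # reported on-disk size (assumed: bytes)

spare_entries = capacity - count
fill = count / capacity

# Rough width of one entry, assuming one fixed-width column dominates.
bytes_per_entry = batsize // capacity

# Gap between the two reported sizes.
overhead_bytes = batsize - batdsksize
overhead_mib = overhead_bytes / 2**20

print(f"spare entries : {spare_entries}")
print(f"fill ratio    : {fill:.1%}")
print(f"bytes/entry   : {bytes_per_entry}")
print(f"size gap      : {overhead_bytes} bytes ({overhead_mib:.1f} MiB)")
```

Under these assumptions the BAT is a bit more than three quarters full, and the gap between batsize and batdsksize is in the same range as the figure mentioned for the "pre_size" BAT.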
Stefan Manegold wrote:
Henning,
you should also check & report b.capacity(), i.e.,
b.count(); b.capacity(); b.info().reverse().like("batBuns").like("size").print(); b.batsize(); b.batdsksize();
var c := b.copy();
c.count(); c.capacity(); c.info().reverse().like("batBuns").like("size").print(); c.batsize(); c.batdsksize();
Stefan
On Sat, Oct 06, 2007 at 07:43:28PM +0200, Stefan Manegold wrote:
[felt free to cc the monetdb-developers list as more people might be interested or want to contribute]
Henning,
are you just "concerned" or are you having concrete problems with the bat sizes?
In any case, to give any reasonable answer we'd need to know more about the details. In particular, how large is the BAT you're talking about?
I.e., with "b" being your BAT and "c := b.copy()", please check & report
b.count(); b.info().reverse().like("batBuns").like("size").print(); b.batsize(); b.batdsksize();
c.count(); c.info().reverse().like("batBuns").like("size").print(); c.batsize(); c.batdsksize();
Stefan
On Fri, Oct 05, 2007 at 01:47:01PM +0200, Henning Rode wrote:
hej stefan,
thanks for the answer. so in conclusion, the over-allocation of memory is quite normal, and nothing to worry about.
i was more surprised that the copied BAT still has this considerable over-allocation of memory, though it exactly knows how many entries it needs to hold.
groeten -henning
--
| Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl |
| CWI, P.O.Box 94079  | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4312       |