Re: [Monetdb-developers] [Monetdb-checkins] MonetDB/src/gdk gdk_posix.mx, Feb2010, 1.176.2.21, 1.176.2.22 gdk_storage.mx, Feb2010, 1.149.2.32, 1.149.2.33
Peter,

I have some questions to make sure I understand your new code correctly:

1)
I don't see any place in the hash code (at least not in gdk_search.mx) where
the "free" element of a hash heap is set (or used) other than the
initialization to 0 in HEAPalloc; thus, I guess, "free" for hash heaps is
always 0; hence, shouldn't we use "size" instead of "free" for the madvise &
preload size of hash heaps (as we did in the original BATpreload/BATaccess
code)?

2)
Am I right that for string heaps you conclude from a strong order correlation
between the offset heap and the string heap (due to sequential load/insertion)
that also the first and last BUN in the offset heap point to the "first" and
"last" string in the string heap? Well, indeed, since access is to be
considered at page-size granularity, this might be reasonable ...

3)
(This was the same in the previous version of the code.)
For BUN heaps, in case of views (slices), the base pointer of the view's heap
might not be the same as the parent's heap; in fact, it might not be
page-aligned. If I understand the MT_mmap_tab[] array correctly, it identifies
heaps by the page-aligned base pointer of the parent's heap. Hence, BATaccess()
on a slice view BAT with a non-aligned heap->base pointer calls
MT_mmap_inform() (through access_heap()) with a non-aligned heap->base, which
is not found in MT_mmap_tab[], and hence MT_mmap_inform() does nothing with
that heap. With preload==1 it thus does not register the posix_madvise() call
that access_heap() does. Consequently, with preload==-1, MT_mmap_inform() will
never reset the advice set via slice views, unless there is (also) access to
the original parent's heap (i.e., with a page-aligned heap->base pointer).
I just noticed this, but do not yet understand whether, and if so which,
consequences this might have ...

Stefan

On Thu, Feb 18, 2010 at 10:39:22PM +0000, Peter Boncz wrote:
Update of /cvsroot/monetdb/MonetDB/src/gdk
In directory sfp-cvsdas-1.v30.ch3.sourceforge.com:/tmp/cvs-serv28734

Modified Files:
      Tag: Feb2010
	gdk_posix.mx gdk_storage.mx
Log Message:
did experimentation with sequential mmap I/O.
- on very fast subsystems (such as 16xssd) it is three times slower than optimally tuned direct I/O (1GB/s vs 3GB/s)
- with fewer disks the difference is smaller (e.g. 140 vs 200MB/s)
regrettably, nothing helped to get it higher.

the below checkin makes the following changes:
- simplified BATaccess code by separating out a routine
- made BATaccess more precise in what to preload (only BUNfirst-BUNlast)
- observe that large string heaps have a high sequential correlation,
  hence always WILLNEED fetching is overkill
- move the madvise() call back to BATaccess at the start of the access, but removing
  the advise is done in vmtrim, as you need the overview when the last user is away.
- the basic advise is SEQUENTIAL (i.e., decent I/O)
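For illustration only (not MonetDB code; advise_range is a hypothetical name):
the page-alignment step that issuing madvise at access start requires. A heap
pointer, e.g. of a view, can land mid-page, while posix_madvise() wants a
page-aligned start address, so round the base down and stretch the length to
cover the last partially-hit page:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* advise an arbitrary sub-range of a memory-mapped file */
static int advise_range(void *base, size_t len, int advice)
{
	size_t page = (size_t) sysconf(_SC_PAGESIZE);
	size_t addr = (size_t) base;
	size_t aligned = addr & ~(page - 1);	/* round base down to a page */
	size_t alen = ((addr - aligned) + len + page - 1) & ~(page - 1);
	int ret = posix_madvise((void *) aligned, alen, advice);

	if (ret)	/* posix_madvise returns the error number directly */
		fprintf(stderr, "posix_madvise: %s\n", strerror(ret));
	return ret;
}

e.g. advise_range(heap_base + first_offset, used_bytes, POSIX_MADV_SEQUENTIAL)
advises exactly the pages the range touches, no matter how heap_base is aligned.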
Index: gdk_storage.mx
===================================================================
RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_storage.mx,v
retrieving revision 1.149.2.32
retrieving revision 1.149.2.33
diff -u -d -r1.149.2.32 -r1.149.2.33
--- gdk_storage.mx	18 Feb 2010 01:04:11 -0000	1.149.2.32
+++ gdk_storage.mx	18 Feb 2010 22:39:08 -0000	1.149.2.33
@@ -697,156 +697,95 @@
 	return BATload_intern(i);
 }
 @- BAT preload
-To avoid random disk access to large (memory-mapped) BATs it may help to issue a preload
-request.
-Of course, it does not make sense to touch more then we can physically accomodate.
+To avoid random disk access to large (memory-mapped) BATs it may help to issue a preload request.
+Of course, it does not make sense to touch more then we can physically accomodate (budget).
 @c
-size_t
-BATaccess(BAT *b, int what, int advise, int preload) {
-	size_t *i, *limit;
-	size_t v1 = 0, v2 = 0, v3 = 0, v4 = 0;
-	size_t step = MT_pagesize()/sizeof(size_t);
-	size_t pages = (size_t) (0.8 * MT_npages());
-
-	assert(advise==MMAP_NORMAL||advise==MMAP_RANDOM||advise==MMAP_SEQUENTIAL||advise==MMAP_WILLNEED||advise==MMAP_DONTNEED);
-
-	/* VAR heaps (inherent random access) */
-	if ( what&USE_HEAD && b->H->vheap && b->H->vheap->base ) {
-		if (b->H->vheap->storage != STORE_MEM && b->H->vheap->size > MT_MMAP_TILE) {
-			MT_mmap_inform(b->H->vheap->base, b->H->vheap->size, preload, MMAP_WILLNEED, 0);
-		}
-		if (preload > 0 && pages > 0) {
-			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->vheap\n", BATgetId(b), advise);
-			limit = (size_t *) (b->H->vheap->base + b->H->vheap->free) - 4 * step;
-			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
-			i = (size_t *) (((size_t)b->H->vheap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
-			for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
-				v1 += *i;
-				v2 += *(i + step);
-				v3 += *(i + 2*step);
-				v4 += *(i + 3*step);
-			}
-			limit += 4 * step;
-			for (; i <= limit && pages > 0; i+= step, pages--) {
-				v1 += *i;
-			}
+/* modern linux tends to use 128K readaround = 64K readahead
+ * changes have been going on in 2009, towards true readahead
+ * http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/mm/readahead.c
+ *
+ * Peter Feb2010: I tried to do prefetches further apart, to trigger multiple readahead
+ *                units in parallel, but it does improve performance visibly
+ */
+static size_t access_heap(str id, str hp, Heap *h, char* base, size_t sz, int touch, int preload, int advise) {
+	size_t v0 = 0, v1 = 0, v2 = 0, v3 = 0, v4 = 0, v5 =0, v6 = 0, v7 = 0, page = MT_pagesize();
+	int t = GDKms();
+	if (h->storage != STORE_MEM && h->size > MT_MMAP_TILE) {
+		MT_mmap_inform(h->base, h->size, preload, advise, 0);
+		if (preload > 0) {
+			void* alignedbase = (void*) (((size_t) base) & ~(page-1));
+			size_t alignedsz = (sz + (page-1)) & ~(page-1);
+			int ret = posix_madvise(alignedbase, sz, advise);
+			if (ret) THRprintf(GDKerr, "#MT_mmap_inform: posix_madvise(file=%s, base="PTRFMT", len="SZFMT"MB, advice=%d) = %d\n",
+				h->filename, PTRFMTCAST alignedbase, alignedsz >> 20, advise, errno);
 		}
 	}
-	if ( what&USE_TAIL && b->T->vheap && b->T->vheap->base ) {
-		if (b->T->vheap->storage != STORE_MEM && b->T->vheap->size > MT_MMAP_TILE) {
-			MT_mmap_inform(b->T->vheap->base, b->T->vheap->size, preload, MMAP_WILLNEED, 0);
-		}
-		if (preload > 0 && pages > 0) {
-			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->vheap\n", BATgetId(b), advise);
-			limit = (size_t *) (b->T->vheap->base + b->T->vheap->free - sizeof(size_t)) - 4 * step;
-			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
-			i = (size_t *) (((size_t)b->T->vheap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
-			for (; i <= limit && pages > 3; i+= 4*step, pages-= 4) {
-				v1 += *i;
-				v2 += *(i + step);
-				v3 += *(i + 2*step);
-				v4 += *(i + 3*step);
-			}
-			limit += 4 * step;
-			for (; i <= limit && pages > 0; i+= step, pages--) {
-				v1 += *i;
-			}
+	if (touch && preload > 0) {
+		/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
+		size_t *lo = (size_t *) (((size_t) base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
+		size_t *hi = (size_t *) (base + sz);
+		for (hi -= 8*page; lo <= hi; lo += 8*page) {
+			/* try to trigger loading of multiple pages without blocking */
+			v0 += lo[0*page]; v1 += lo[1*page]; v2 += lo[2*page]; v3 += lo[3*page];
+			v4 += lo[4*page]; v5 += lo[5*page]; v6 += lo[6*page]; v7 += lo[7*page];
 		}
+		for (hi += 7*page; lo <= hi; lo +=page) v0 += *lo;
 	}
+	IODEBUG THRprintf(GDKout,"#BATpreload(%s->%s,preload=%d,sz=%dMB,%s) = %dms \n", id, hp, preload, (int) (sz>>20),
+		(advise==BUF_WILLNEED)?"WILLNEED":(advise==BUF_SEQUENTIAL)?"SEQUENTIAL":"UNKNOWN", GDKms()-t);
+	return v0+v1+v2+v3+v4+v5+v6+v7;
+}
 
-	/* BUN heaps (no need to preload for sequential access) */
-	if ( what&USE_HEAD && b->H->heap.base ) {
-		if (b->H->heap.storage != STORE_MEM && b->H->heap.size > MT_MMAP_TILE) {
-			MT_mmap_inform(b->H->heap.base, b->H->heap.size, preload, advise, 0);
-		}
-		if (preload > 0 && pages > 0 && advise != MMAP_SEQUENTIAL) {
-			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->heap\n", BATgetId(b), advise);
-			limit = (size_t *) (Hloc(b, BUNlast(b)) - sizeof(size_t)) - 4 * step;
-			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
-			i = (size_t *) (((size_t)Hloc(b, BUNfirst(b)) + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
-			for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
-				v1 += *i;
-				v2 += *(i + step);
-				v3 += *(i + 2*step);
-				v4 += *(i + 3*step);
-			}
-			limit += 4 * step;
-			for (; i <= limit && pages > 0; i+= step, pages--) {
-				v1 += *i;
-			}
-		}
-	}
-	if ( what&USE_TAIL && b->T->heap.base ) {
-		if (b->T->heap.storage != STORE_MEM && b->T->heap.size > MT_MMAP_TILE) {
-			MT_mmap_inform(b->T->heap.base, b->T->heap.size, preload, advise, 0);
+size_t
+BATaccess(BAT *b, int what, int advise, int preload) {
+	ssize_t budget = (ssize_t) (0.8 * MT_npages());
+	size_t v = 0, sz;
+	str id = BATgetId(b);
+	BATiter bi = bat_iterator(b);
+
+	assert(advise==MMAP_NORMAL||advise==MMAP_RANDOM||advise==MMAP_SEQUENTIAL||advise==MMAP_WILLNEED||advise==MMAP_DONTNEED);
+	if (BATcount(b) == 0) return 0;
+
+	/* HASH indices (inherent random access). handle first as they *will* be access randomly (one can always hope for locality on the other heaps) */
+	if ( what&USE_HHASH || what&USE_THASH ) {
+		gdk_set_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
+		if ( what&USE_HHASH && b->H->hash && b->H->hash->heap && b->H->hash->heap->base) {
+			budget -= sz = (b->H->hash->heap->free > (size_t) budget)?budget:(ssize_t)b->T->hash->heap->free;
+			v += access_heap(id, "hhash", b->H->hash->heap, b->H->hash->heap->base, sz, 1, preload, MMAP_WILLNEED);
 		}
-		if (preload > 0 && pages > 0 && advise != MMAP_SEQUENTIAL) {
-			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->heap\n", BATgetId(b), advise);
-			limit = (size_t *) (Tloc(b, BUNlast(b)) - sizeof(size_t)) - 4 * step;
-			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
-			i = (size_t *) (((size_t)Tloc(b, BUNfirst(b)) + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
-			for (; i <= limit && pages > 3; i+= 4*step, pages-= 4) {
-				v1 += *i;
-				v2 += *(i + step);
-				v3 += *(i + 2*step);
-				v4 += *(i + 3*step);
-			}
-			limit += 4 * step;
-			for (; i <= limit && pages > 0; i+= step, pages--) {
-				v1 += *i;
-			}
+		if ( what&USE_THASH && b->T->hash && b->T->hash->heap && b->T->hash->heap->base) {
+			budget -= sz = (b->T->hash->heap->free > (size_t) budget)?budget:(ssize_t)b->T->hash->heap->free;
+			v += access_heap(id, "thash", b->T->hash->heap, b->T->hash->heap->base, sz, 1, preload, MMAP_WILLNEED);
		}
+		gdk_unset_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
 	}
 
-	/* HASH indices (inherent random access) */
-	if ( what&USE_HHASH || what&USE_THASH )
-		gdk_set_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
-	if ( what&USE_HHASH && b->H->hash && b->H->hash->heap && b->H->hash->heap->base ) {
-		if (b->H->hash->heap->storage != STORE_MEM && b->H->hash->heap->size > MT_MMAP_TILE) {
-			MT_mmap_inform(b->H->hash->heap->base, b->H->hash->heap->size, preload, MMAP_WILLNEED, 0);
+	/* we only touch stuff that is going to be read randomly (WILLNEED). Note varheaps are sequential wrt to the references, or small */
+	if ( what&USE_HEAD) {
+		if (b->H->heap.base) {
+			char *lo = BUNhloc(bi, BUNfirst(b)), *hi = BUNhloc(bi, BUNlast(b)-1);
+			budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
+			v += access_heap(id, "hbuns", &b->H->heap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
 		}
-		if (preload > 0 && pages > 0) {
-			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->hash\n", BATgetId(b), advise);
-			limit = (size_t *) (b->H->hash->heap->base + b->H->hash->heap->size - sizeof(size_t)) - 4 * step;
-			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
-			i = (size_t *) (((size_t)b->H->hash->heap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
-			for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
-				v1 += *i;
-				v2 += *(i + step);
-				v3 += *(i + 2*step);
-				v4 += *(i + 3*step);
-			}
-			limit += 4 * step;
-			for (; i <= limit && pages > 0; i+= step, pages--) {
-				v1 += *i;
-			}
+		if (b->H->vheap && b->H->vheap->base) {
+			char *lo = BUNhead(bi, BUNfirst(b)), *hi = BUNhead(bi, BUNlast(b)-1);
+			budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
+			v += access_heap(id, "hheap", b->H->vheap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
 		}
 	}
-	if ( what&USE_THASH && b->T->hash && b->T->hash->heap && b->T->hash->heap->base ) {
-		if (b->T->hash->heap->storage != STORE_MEM && b->T->hash->heap->size > MT_MMAP_TILE) {
-			MT_mmap_inform(b->T->hash->heap->base, b->T->hash->heap->size, preload, MMAP_WILLNEED, 0);
+	if ( what&USE_TAIL) {
+		if (b->T->heap.base) {
+			char *lo = BUNtloc(bi, BUNfirst(b)), *hi = BUNtloc(bi, BUNlast(b)-1);
+			budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
+			v += access_heap(id, "tbuns", &b->T->heap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
 		}
-		if (preload > 0 && pages > 0) {
-			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->hash\n", BATgetId(b), advise);
-			limit = (size_t *) (b->T->hash->heap->base + b->T->hash->heap->size - sizeof(size_t)) - 4 * step;
-			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
-			i = (size_t *) (((size_t)b->T->hash->heap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
-			for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
-				v1 += *i;
-				v2 += *(i + step);
-				v3 += *(i + 2*step);
-				v4 += *(i + 3*step);
-			}
-			limit += 4 * step;
-			for (; i <= limit && pages > 0; i+= step, pages--) {
-				v1 += *i;
-			}
+		if (b->T->vheap && b->T->vheap->base) {
+			char *lo = BUNtail(bi, BUNfirst(b)), *hi = BUNtail(bi, BUNlast(b)-1);
+			budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
+			v += access_heap(id, "theap", b->T->vheap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
 		}
 	}
-	if ( what&USE_HHASH || what&USE_THASH )
-		gdk_unset_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
-
-	return v1 + v2 + v3 + v4;
+	return v;
 }
 @}
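For illustration, a simplified standalone variant of the idea behind the touch
loop in access_heap() above (touch_pages is a hypothetical name; it strides in
bytes rather than size_t words): read one byte per page, several pages per
unrolled iteration, so the kernel can keep multiple readahead windows in
flight; the returned sum only stops the compiler from dropping the loads.

#include <stddef.h>
#include <unistd.h>

size_t touch_pages(const char *base, size_t sz)
{
	size_t page = (size_t) sysconf(_SC_PAGESIZE);
	size_t v0 = 0, v1 = 0, v2 = 0, v3 = 0;
	const char *lo = base, *hi = base + sz;

	/* touch 4 pages per iteration to fault them in concurrently */
	for (; lo + 4 * page <= hi; lo += 4 * page) {
		v0 += (size_t) lo[0 * page];
		v1 += (size_t) lo[1 * page];
		v2 += (size_t) lo[2 * page];
		v3 += (size_t) lo[3 * page];
	}
	for (; lo < hi; lo += page)	/* leftover pages, one at a time */
		v0 += (size_t) *lo;
	return v0 + v1 + v2 + v3;
}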
Index: gdk_posix.mx
===================================================================
RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_posix.mx,v
retrieving revision 1.176.2.21
retrieving revision 1.176.2.22
diff -u -d -r1.176.2.21 -r1.176.2.22
--- gdk_posix.mx	18 Feb 2010 01:03:55 -0000	1.176.2.21
+++ gdk_posix.mx	18 Feb 2010 22:38:53 -0000	1.176.2.22
@@ -909,10 +909,8 @@
 		unload = MT_mmap_tab[i].usecnt == 0;
 	}
 	(void) pthread_mutex_unlock(&MT_mmap_lock);
-	if (i >= 0 && preload > 0)
-		ret = posix_madvise(base, len, advise);
-	else if (unload)
-		ret = posix_madvise(base, len, MMAP_NORMAL);
+	if (unload)
+		ret = posix_madvise(base, len, BUF_SEQUENTIAL);
 	if (ret) {
 		stream_printf(GDKerr, "#MT_mmap_inform: posix_madvise(file=%s, fd=%d, base="PTRFMT", len="SZFMT"MB, advice=%d) = %d\n",
 			(i >= 0 ? MT_mmap_tab[i].path : ""), (i >= 0 ? MT_mmap_tab[i].fd : -1),
-- 
| Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl |
| CWI,  P.O.Box 94079 | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4199       |
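The gdk_posix.mx hunk above only resets the advice once the mapping's last
user is gone. A minimal sketch of that bookkeeping, with hypothetical names
(struct mapping, mapping_inform) and assuming the caller serializes access to
the table entry, as MT_mmap_inform() does under MT_mmap_lock:

#include <stddef.h>
#include <sys/mman.h>

struct mapping { void *base; size_t len; int usecnt; };

/* preload > 0: a user starts accessing the mapping;
 * preload < 0: a user is done with it */
static int mapping_inform(struct mapping *m, int preload)
{
	m->usecnt += (preload > 0) ? 1 : -1;
	if (preload < 0 && m->usecnt == 0)
		/* last user gone: revert to cheap sequential readahead */
		return posix_madvise(m->base, m->len, POSIX_MADV_SEQUENTIAL);
	return 0;
}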
Hi Stefan,

Thanks, indeed in all areas improvements are needed:

1) indeed (scary use of free!); this should be corrected.

2) typically yes. I do recall now that BATfetchjoin heap sharing will
invalidate the otherwise always applying order correlation. If we have a way
to detect that a heap is shared, we should treat those shared string heaps as
WILLNEED.

3) also correct. MT_mmap_find() could easily find entries by range overlap;
then inform would find the relevant heap.

Finally, sequential advise now will not trigger preloading, but I actually
think it can help (if you have enough memory). Maybe prefetch sequential heaps
until some limit, like Martin suggests, e.g. 1/(4*threads) of memory.

Peter
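To make the closing suggestion concrete, a small sketch (hypothetical
preload_budget; _SC_PHYS_PAGES is a common but non-POSIX sysconf name) of
capping the prefetch volume at both 80% of RAM, as BATaccess does, and
memory/(4*threads), as Martin suggests:

#include <stddef.h>
#include <unistd.h>

size_t preload_budget(int nthreads)
{
	size_t page = (size_t) sysconf(_SC_PAGESIZE);
	size_t phys = (size_t) sysconf(_SC_PHYS_PAGES);
	size_t cap80 = (size_t) (0.8 * phys) * page;		/* 80% of RAM */
	/* divide pages first to avoid overflow on 32-bit */
	size_t capthr = (phys / (4 * (size_t) nthreads)) * page;	/* mem/(4*threads) */

	return capthr < cap80 ? capthr : cap80;
}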
On Fri, Feb 19, 2010 at 01:46:23AM +0100, Peter Boncz wrote:
> Hi Stefan
>
> Thanks, indeed in all areas improvements are needed:
> 1) indeed (scary use of free!) this should be corrected

Done.

> 2) typically yes. I do recall now that BATfetchjoin heap sharing will
> invalidate the otherwise always applying order correlation. If we have a way
> to detect that a heap is shared, we should treat those shared string heaps
> as WILLNEED.

I'll leave that for tomorrow, or later ...

> 3) also correct. The MT_mmap_find() could easily find entries by range
> overlap, then inform would find the relevant heap

something like this, I suppose:

Index: MonetDB/src/gdk/gdk_posix.mx
===================================================================
RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_posix.mx,v
retrieving revision 1.176.2.22
diff -u -r1.176.2.22 gdk_posix.mx
--- MonetDB/src/gdk/gdk_posix.mx	18 Feb 2010 22:38:53 -0000	1.176.2.22
+++ MonetDB/src/gdk/gdk_posix.mx	19 Feb 2010 01:18:34 -0000
@@ -587,7 +587,8 @@
 	int i, prev = MT_MMAP_BUFSIZE;
 
 	for (i = MT_mmap_first; i >= 0; i = MT_mmap_tab[i].next) {
-		if (MT_mmap_tab[i].base == base) {
+		if (MT_mmap_tab[i].base <= (char*) base &&
+		    (char*) base < MT_mmap_tab[i].base + MT_mmap_tab[i].len) {
 			return prev;
 		}
 		prev = i;

> Finally, now sequential advise will not trigger preloading, but I actually
> think it can help (if you have enough memory). Maybe prefetch sequential
> heaps until some limit, like Martin suggests, e.g. 1/4*threads of memory.

indeed ...

Stefan
-- 
| Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl |
| CWI,  P.O.Box 94079 | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4199       |
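The effect of the patch above, as a standalone sketch with hypothetical names
(struct map_entry, find_mapping): a view's heap->base points somewhere inside
the parent's mapping, so an equality test on base misses it, while a
containment test on [base, base+len) finds the entry.

#include <stddef.h>

struct map_entry { char *base; size_t len; };

/* return the index of the mapping containing addr, or -1 */
static int find_mapping(const struct map_entry *tab, int n, const char *addr)
{
	int i;

	for (i = 0; i < n; i++)
		if (tab[i].base <= addr && addr < tab[i].base + tab[i].len)
			return i;
	return -1;
}

With a parent mapping registered as tab[0] = { base, len }, a slice view whose
heap->base equals base + 4711 is still found, whereas a base-equality lookup
would return -1 and the advice would never be registered or reset.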
Seems ok:

> Index: MonetDB/src/gdk/gdk_posix.mx
> ===================================================================
> RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_posix.mx,v
> retrieving revision 1.176.2.22
> diff -u -r1.176.2.22 gdk_posix.mx
> --- MonetDB/src/gdk/gdk_posix.mx	18 Feb 2010 22:38:53 -0000	1.176.2.22
> +++ MonetDB/src/gdk/gdk_posix.mx	19 Feb 2010 01:18:34 -0000
> @@ -587,7 +587,8 @@
>  	int i, prev = MT_MMAP_BUFSIZE;
>  
>  	for (i = MT_mmap_first; i >= 0; i = MT_mmap_tab[i].next) {
> -		if (MT_mmap_tab[i].base == base) {
> +		if (MT_mmap_tab[i].base <= (char*) base &&
> +		    (char*) base < MT_mmap_tab[i].base + MT_mmap_tab[i].len) {
>  			return prev;
>  		}
>  		prev = i;
> > - on very fast subsystems (such as 16xssd) it is three times slower than > optimally tuned direct I/O (1GB/s vs 3GB/s) > > - with less disks the difference is smaller (e.g. 140 vs 200MB/s) > > regrettably, nothing helped to get it higher. > > > > the below checkin makes the following changes: > > - simplified BATaccess code by separating out routine > > - made BATaccess more precies in what to preload (ionly BUNfirst-BUNlast) > > - observe that large string heaps have a high sequential correletaion > > hense always WILLNEED fetching is overkill > > - move the madvise() call back to BATaccess at the start of the access but > removing > > the advise is done in vmtrim, as you need the overview when the last > user is away. > > - the basic advise is SEQUENTIAL (ie decent I/O) > > > > > > > > Index: gdk_storage.mx > > =================================================================== > > RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_storage.mx,v > > retrieving revision 1.149.2.32 > > retrieving revision 1.149.2.33 > > diff -u -d -r1.149.2.32 -r1.149.2.33 > > --- gdk_storage.mx 18 Feb 2010 01:04:11 -0000 1.149.2.32 > > +++ gdk_storage.mx 18 Feb 2010 22:39:08 -0000 1.149.2.33 > > @@ -697,156 +697,95 @@ > > return BATload_intern(i); > > } > > @- BAT preload > > -To avoid random disk access to large (memory-mapped) BATs it may help to > issue a preload > > -request. > > -Of course, it does not make sense to touch more then we can physically > accomodate. > > +To avoid random disk access to large (memory-mapped) BATs it may help to > issue a preload request. > > +Of course, it does not make sense to touch more then we can physically > accomodate (budget). > > @c > > -size_t > > -BATaccess(BAT *b, int what, int advise, int preload) { > > - size_t *i, *limit; > > - size_t v1 = 0, v2 = 0, v3 = 0, v4 = 0; > > - size_t step = MT_pagesize()/sizeof(size_t); > > - size_t pages = (size_t) (0.8 * MT_npages()); > > - > > - > assert(advise==MMAP_NORMAL||advise==MMAP_RANDOM||advise==MMAP_SEQUENTIAL||ad > vise==MMAP_WILLNEED||advise==MMAP_DONTNEED); > > - > > - /* VAR heaps (inherent random access) */ > > - if ( what&USE_HEAD && b->H->vheap && b->H->vheap->base ) { > > - if (b->H->vheap->storage != STORE_MEM && b->H->vheap->size > > MT_MMAP_TILE) { > > - MT_mmap_inform(b->H->vheap->base, b->H->vheap->size, > preload, MMAP_WILLNEED, 0); > > - } > > - if (preload > 0 && pages > 0) { > > - IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): > H->vheap\n", BATgetId(b), advise); > > - limit = (size_t *) (b->H->vheap->base + > b->H->vheap->free) - 4 * step; > > - /* we need to ensure alignment, here, as b might be > a view and heap.base of views are not necessarily aligned */ > > - i = (size_t *) (((size_t)b->H->vheap->base + > sizeof(size_t) - 1) & (~(sizeof(size_t) - 1))); > > - for (; i <= limit && pages > 3 ; i+= 4*step, pages-= > 4) { > > - v1 += *i; > > - v2 += *(i + step); > > - v3 += *(i + 2*step); > > - v4 += *(i + 3*step); > > - } > > - limit += 4 * step; > > - for (; i <= limit && pages > 0; i+= step, pages--) > { > > - v1 += *i; > > - } > > +/* modern linux tends to use 128K readaround = 64K readahead > > + * changes have been going on in 2009, towards true readahead > > + * http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/mm/readahead.c > > + * > > + * Peter Feb2010: I tried to do prefetches further apart, to trigger > multiple readahead > > + * units in parallel, but it does improve performance > visibly > > + */ > > +static size_t access_heap(str id, str hp, Heap *h, char* base, size_t sz, > int touch, int 
4 * step;
-			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
-			i = (size_t *) (((size_t)b->T->vheap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
-			for (; i <= limit && pages > 3; i+= 4*step, pages-= 4) {
-				v1 += *i;
-				v2 += *(i + step);
-				v3 += *(i + 2*step);
-				v4 += *(i + 3*step);
-			}
-			limit += 4 * step;
-			for (; i <= limit && pages > 0; i+= step, pages--) {
-				v1 += *i;
-			}
+	if (touch && preload > 0) {
+		/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
+		size_t *lo = (size_t *) (((size_t) base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
+		size_t *hi = (size_t *) (base + sz);
+		for (hi -= 8*page; lo <= hi; lo += 8*page) {
+			/* try to trigger loading of multiple pages without blocking */
+			v0 += lo[0*page]; v1 += lo[1*page]; v2 += lo[2*page]; v3 += lo[3*page];
+			v4 += lo[4*page]; v5 += lo[5*page]; v6 += lo[6*page]; v7 += lo[7*page];
 		}
+		for (hi += 7*page; lo <= hi; lo += page) v0 += *lo;
 	}
+	IODEBUG THRprintf(GDKout,"#BATpreload(%s->%s,preload=%d,sz=%dMB,%s) = %dms\n", id, hp, preload, (int) (sz>>20),
+		(advise==BUF_WILLNEED)?"WILLNEED":(advise==BUF_SEQUENTIAL)?"SEQUENTIAL":"UNKNOWN", GDKms()-t);
+	return v0+v1+v2+v3+v4+v5+v6+v7;
+}

-	/* BUN heaps (no need to preload for sequential access) */
-	if ( what&USE_HEAD && b->H->heap.base ) {
-		if (b->H->heap.storage != STORE_MEM && b->H->heap.size > MT_MMAP_TILE) {
-			MT_mmap_inform(b->H->heap.base, b->H->heap.size, preload, advise, 0);
-		}
-		if (preload > 0 && pages > 0 && advise != MMAP_SEQUENTIAL) {
-			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->heap\n", BATgetId(b), advise);
-			limit = (size_t *) (Hloc(b, BUNlast(b)) - sizeof(size_t)) - 4 * step;
-			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
-			i = (size_t *) (((size_t)Hloc(b, BUNfirst(b)) + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
-			for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
-				v1 += *i;
-				v2 += *(i + step);
-				v3 += *(i + 2*step);
-				v4 += *(i + 3*step);
-			}
-			limit += 4 * step;
-			for (; i <= limit && pages > 0; i+= step, pages--) {
-				v1 += *i;
-			}
-		}
-	}
-	if ( what&USE_TAIL && b->T->heap.base ) {
-		if (b->T->heap.storage != STORE_MEM && b->T->heap.size > MT_MMAP_TILE) {
-			MT_mmap_inform(b->T->heap.base, b->T->heap.size, preload, advise, 0);
+size_t
+BATaccess(BAT *b, int what, int advise, int preload) {
+	ssize_t budget = (ssize_t) (0.8 * MT_npages());
+	size_t v = 0, sz;
+	str id = BATgetId(b);
+	BATiter bi = bat_iterator(b);
+
+	assert(advise==MMAP_NORMAL||advise==MMAP_RANDOM||advise==MMAP_SEQUENTIAL||advise==MMAP_WILLNEED||advise==MMAP_DONTNEED);
+	if (BATcount(b) == 0) return 0;
+
+	/* HASH indices (inherent random access). handle first as they *will* be access randomly (one can always hope for locality on the other heaps) */
+	if ( what&USE_HHASH || what&USE_THASH ) {
+		gdk_set_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
+		if ( what&USE_HHASH && b->H->hash && b->H->hash->heap && b->H->hash->heap->base) {
+			budget -= sz = (b->H->hash->heap->free > (size_t) budget)?budget:(ssize_t)b->T->hash->heap->free;
+			v += access_heap(id, "hhash", b->H->hash->heap, b->H->hash->heap->base, sz, 1, preload, MMAP_WILLNEED);
 		}
-		if (preload > 0 && pages > 0 && advise != MMAP_SEQUENTIAL) {
-			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->heap\n", BATgetId(b), advise);
-			limit = (size_t *) (Tloc(b, BUNlast(b)) - sizeof(size_t)) - 4 * step;
-			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
-			i = (size_t *) (((size_t)Tloc(b, BUNfirst(b)) + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
-			for (; i <= limit && pages > 3; i+= 4*step, pages-= 4) {
-				v1 += *i;
-				v2 += *(i + step);
-				v3 += *(i + 2*step);
-				v4 += *(i + 3*step);
-			}
-			limit += 4 * step;
-			for (; i <= limit && pages > 0; i+= step, pages--) {
-				v1 += *i;
-			}
+		if ( what&USE_THASH && b->T->hash && b->T->hash->heap && b->T->hash->heap->base) {
+			budget -= sz = (b->T->hash->heap->free > (size_t) budget)?budget:(ssize_t)b->T->hash->heap->free;
+			v += access_heap(id, "thash", b->T->hash->heap, b->T->hash->heap->base, sz, 1, preload, MMAP_WILLNEED);
 		}
+		gdk_unset_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
 	}

-	/* HASH indices (inherent random access) */
-	if ( what&USE_HHASH || what&USE_THASH )
-		gdk_set_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
-	if ( what&USE_HHASH && b->H->hash && b->H->hash->heap && b->H->hash->heap->base ) {
-		if (b->H->hash->heap->storage != STORE_MEM && b->H->hash->heap->size > MT_MMAP_TILE) {
-			MT_mmap_inform(b->H->hash->heap->base, b->H->hash->heap->size, preload, MMAP_WILLNEED, 0);
+	/* we only touch stuff that is going to be read randomly (WILLNEED). Note varheaps are sequential wrt to the references, or small */
+	if ( what&USE_HEAD) {
+		if (b->H->heap.base) {
+			char *lo = BUNhloc(bi, BUNfirst(b)), *hi = BUNhloc(bi, BUNlast(b)-1);
+			budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
+			v += access_heap(id, "hbuns", &b->H->heap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
 		}
-		if (preload > 0 && pages > 0) {
-			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->hash\n", BATgetId(b), advise);
-			limit = (size_t *) (b->H->hash->heap->base + b->H->hash->heap->size - sizeof(size_t)) - 4 * step;
-			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
-			i = (size_t *) (((size_t)b->H->hash->heap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
-			for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
-				v1 += *i;
-				v2 += *(i + step);
-				v3 += *(i + 2*step);
-				v4 += *(i + 3*step);
-			}
-			limit += 4 * step;
-			for (; i <= limit && pages > 0; i+= step, pages--) {
-				v1 += *i;
-			}
+		if (b->H->vheap && b->H->vheap->base) {
+			char *lo = BUNhead(bi, BUNfirst(b)), *hi = BUNhead(bi, BUNlast(b)-1);
+			budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
+			v += access_heap(id, "hheap", b->H->vheap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
 		}
 	}
-	if ( what&USE_THASH && b->T->hash && b->T->hash->heap && b->T->hash->heap->base ) {
-		if (b->T->hash->heap->storage != STORE_MEM && b->T->hash->heap->size > MT_MMAP_TILE) {
-			MT_mmap_inform(b->T->hash->heap->base, b->T->hash->heap->size, preload, MMAP_WILLNEED, 0);
+	if ( what&USE_TAIL) {
+		if (b->T->heap.base) {
+			char *lo = BUNtloc(bi, BUNfirst(b)), *hi = BUNtloc(bi, BUNlast(b)-1);
+			budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
+			v += access_heap(id, "tbuns", &b->T->heap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
 		}
-		if (preload > 0 && pages > 0) {
-			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->hash\n", BATgetId(b), advise);
-			limit = (size_t *) (b->T->hash->heap->base + b->T->hash->heap->size - sizeof(size_t)) - 4 * step;
-			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
-			i = (size_t *) (((size_t)b->T->hash->heap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
-			for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
-				v1 += *i;
-				v2 += *(i + step);
-				v3 += *(i + 2*step);
-				v4 += *(i + 3*step);
-			}
-			limit += 4 * step;
-			for (; i <= limit && pages > 0; i+= step, pages--) {
-				v1 += *i;
-			}
+		if (b->T->vheap && b->T->vheap->base) {
+			char *lo = BUNtail(bi, BUNfirst(b)), *hi = BUNtail(bi, BUNlast(b)-1);
+			budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
+			v += access_heap(id, "theap", b->T->vheap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
 		}
 	}
-	if ( what&USE_HHASH || what&USE_THASH )
-		gdk_unset_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
-
-	return v1 + v2 + v3 + v4;
+	return v;
 }
 @}
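As an aside for readers outside GDK: the touch loop in access_heap above boils down to reading one word per page, so that each read faults its page in while the kernel's readahead fetches the pages behind it. A minimal stand-alone sketch of that pattern follows; touch_pages is an illustrative name, not part of the MonetDB source.

#include <stddef.h>

/* Read one word per page of [base, base+len): each read faults in its
 * page, and the kernel readahead fetches the following pages while we
 * move on.  The checksum is returned so the reads cannot be optimized
 * away.  Illustrative helper only. */
size_t
touch_pages(const char *base, size_t len, size_t pagesize)
{
	size_t sum = 0;
	const volatile char *p = (const volatile char *) base;
	const char *end = base + len;

	while ((const char *) p < end) {
		sum += *p;		/* fault in this page */
		p += pagesize;
	}
	return sum;
}

access_heap unrolls this eight pages at a time in the hope of keeping several readahead units busy at once.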
Index: gdk_posix.mx
===================================================================
RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_posix.mx,v
retrieving revision 1.176.2.21
retrieving revision 1.176.2.22
diff -u -d -r1.176.2.21 -r1.176.2.22
--- gdk_posix.mx	18 Feb 2010 01:03:55 -0000	1.176.2.21
+++ gdk_posix.mx	18 Feb 2010 22:38:53 -0000	1.176.2.22
@@ -909,10 +909,8 @@
 		unload = MT_mmap_tab[i].usecnt == 0;
 	}
 	(void) pthread_mutex_unlock(&MT_mmap_lock);
-	if (i >= 0 && preload > 0)
-		ret = posix_madvise(base, len, advise);
-	else if (unload)
-		ret = posix_madvise(base, len, MMAP_NORMAL);
+	if (unload)
+		ret = posix_madvise(base, len, BUF_SEQUENTIAL);
 	if (ret) {
 		stream_printf(GDKerr, "#MT_mmap_inform: posix_madvise(file=%s, fd=%d, base="PTRFMT", len="SZFMT"MB, advice=%d) = %d\n",
 			(i >= 0 ? MT_mmap_tab[i].path : ""), (i >= 0 ? MT_mmap_tab[i].fd : -1),

--
| Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl |
| CWI, P.O.Box 94079  | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4199       |
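The gdk_posix.mx hunk above defers resetting the advice to the moment the last registered user of a mapping goes away. Schematically, with all locking and table bookkeeping left out, that logic looks like the sketch below; mapping and mapping_release are illustrative names, not MonetDB's MT_mmap_tab machinery.

#include <stddef.h>
#include <sys/mman.h>

typedef struct {
	void  *base;	/* page-aligned start of the mapping */
	size_t len;
	int    usecnt;	/* registered users of this mapping */
} mapping;

/* Drop one user of a mapping; once the last user is gone, fall back to
 * the neutral sequential advice so one consumer's WILLNEED cannot
 * outlive it.  Returns the posix_madvise() result (0 on success). */
int
mapping_release(mapping *m)
{
	if (--m->usecnt == 0)
		return posix_madvise(m->base, m->len, POSIX_MADV_SEQUENTIAL);
	return 0;
}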
I applied the patches of Peter as of 1AM and started the SF100 run. It gave a
segfault after 10 minutes, but for once I did not attend Q1 to 'see/feel'
processing. Rebuilding now with all patches of this night.

Peter Boncz wrote:
> Hi Stefan,
>
> Thanks, indeed in all areas improvements are needed:
> 1) indeed (scary use of "free"!) this should be corrected
> 2) typically yes. I do recall now that BATfetchjoin heap sharing will
>    invalidate the otherwise always applying order correlation. If we have
>    a way to detect that a heap is shared, we should treat those shared
>    string heaps as WILLNEED.
> 3) also correct. MT_mmap_find() could easily find entries by range
>    overlap; then MT_mmap_inform() would find the relevant heap.
>
> Finally, now sequential advise will not trigger preloading, but I actually
> think it can help (if you have enough memory). Maybe prefetch sequential
> heaps until some limit, like Martin suggests, e.g. 1/4*threads of memory.
>
> Peter
>
> -----Original Message-----
> From: Stefan Manegold [mailto:Stefan.Manegold@cwi.nl]
> Sent: Friday, 19 February 2010 1:34
> To: monetdb-developers@lists.sourceforge.net; Peter Boncz
> Cc: monetdb-checkins@lists.sourceforge.net
> Subject: Re: [Monetdb-checkins] MonetDB/src/gdk gdk_posix.mx, Feb2010,
> 1.176.2.21, 1.176.2.22 gdk_storage.mx, Feb2010, 1.149.2.32, 1.149.2.33
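Peter's point 1 is easy to illustrate: for hash heaps the "free" field stays 0, so the preload length must be clamped via "size" instead. A minimal sketch of such budget clamping, assuming a byte-granular budget as in the diff above; clamp_to_budget is an illustrative name, not a GDK function. Note, in passing, that the committed USE_HHASH branch reads b->T->hash->heap->free where b->H seems intended.

#include <stddef.h>
#include <sys/types.h>		/* ssize_t */

/* Clamp a heap's preload length to the remaining byte budget and charge
 * it; for hash heaps the "size" field must be used, as "free" stays 0.
 * Illustrative helper only. */
size_t
clamp_to_budget(size_t heap_size, ssize_t *budget)
{
	size_t sz;

	if (*budget <= 0)
		return 0;
	sz = (heap_size > (size_t) *budget) ? (size_t) *budget : heap_size;
	*budget -= (ssize_t) sz;
	return sz;
}

The hhash branch would then read sz = clamp_to_budget(b->H->hash->heap->size, &budget); consistently on the H side.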
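Peter's point 3, sketched: if MT_mmap_tab lookups match by address-range overlap instead of exact base equality, a slice view's interior (and possibly unaligned) heap->base still resolves to its parent's mapping, which addresses Stefan's third question above. The entry struct below is trimmed to the fields visible in the diff, and find_mapping is a hypothetical name, not MonetDB's MT_mmap_find.

#include <stddef.h>

/* Simplified stand-in for an MT_mmap_tab entry: a registered mapping
 * [base, base+len).  The real table carries more bookkeeping (fd,
 * path, usecnt). */
typedef struct {
	char  *base;	/* page-aligned base of the parent mapping */
	size_t len;
} mmap_entry;

/* Return the index of the mapping whose range contains [p, p+sz), or
 * -1 if none does.  An interior pointer from a slice view then still
 * finds its parent, which an exact base-pointer match would miss. */
int
find_mapping(mmap_entry *tab, int n, const char *p, size_t sz)
{
	int i;

	for (i = 0; i < n; i++)
		if (p >= tab[i].base && p + sz <= tab[i].base + tab[i].len)
			return i;
	return -1;
}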
participants (3)
- Martin Kersten
- Peter Boncz
- Stefan Manegold