Peter, I have some questions to make sure I understand your new code correctly:

1) I don't see any place in the hash code (at least not in gdk_search.mx) where the "free" element of a hash heap is set (or used) other than the initialization to 0 in HEAPalloc; thus, I guess, "free" for hash heaps is always 0. Hence, shouldn't we use "size" instead of "free" for the madvise & preload size of hash heaps (as we did in the original BATpreload/BATaccess code)?

2) Am I right that for string heaps you conclude from a strong order correlation between the offset heap and the string heap (due to sequential load/insertion) that also the first and last BUN in the offset heap point to the "first" and "last" string in the string heap? Well, indeed, since access is to be considered at page-size granularity, this might be reasonable ...

3) (This was the same in the previous version of the code.) For BUN heaps, in case of views (slices), the base pointer of the view's heap might not be the same as the parent's heap; in fact, it might not even be page-aligned. If I understand the MT_mmap_tab[] array correctly, it identifies heaps by the page-aligned base pointer of the parent's heap. Hence, BATaccess() on a slice view BAT with a non-aligned heap->base pointer calls MT_mmap_inform() (through access_heap()) with a non-aligned heap->base, which is not found in MT_mmap_tab[], and hence MT_mmap_inform() does nothing with that heap. With preload==1 it thus does not register the posix_madvise() call that access_heap() does. Consequently, with preload==-1, MT_mmap_inform() will never reset the advice set via slice views, unless there is (also) access to the original parent's heap (i.e., with page-aligned heap->base pointer). I just noticed this, but do not yet understand whether, and if so which, consequences this might have ...

Stefan

On Thu, Feb 18, 2010 at 10:39:22PM +0000, Peter Boncz wrote:
Update of /cvsroot/monetdb/MonetDB/src/gdk
In directory sfp-cvsdas-1.v30.ch3.sourceforge.com:/tmp/cvs-serv28734
Modified Files:
      Tag: Feb2010
	gdk_posix.mx gdk_storage.mx
Log Message:
did experimentation with sequential mmap I/O.
- on very fast subsystems (such as 16x SSD) it is three times slower than optimally tuned direct I/O (1 GB/s vs 3 GB/s)
- with fewer disks the difference is smaller (e.g., 140 vs 200 MB/s)
regrettably, nothing helped to get it higher.
the below checkin makes the following changes:
- simplified the BATaccess code by separating out a routine (access_heap)
- made BATaccess more precise in what to preload (only BUNfirst-BUNlast)
- observed that large string heaps have a high sequential correlation, hence always fetching with WILLNEED is overkill
- moved the madvise() call back to BATaccess at the start of the access, but removing the advice is done in vmtrim, as you need the overview when the last user is away.
- the basic advice is SEQUENTIAL (i.e., decent I/O)
Index: gdk_storage.mx
===================================================================
RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_storage.mx,v
retrieving revision 1.149.2.32
retrieving revision 1.149.2.33
diff -u -d -r1.149.2.32 -r1.149.2.33
--- gdk_storage.mx	18 Feb 2010 01:04:11 -0000	1.149.2.32
+++ gdk_storage.mx	18 Feb 2010 22:39:08 -0000	1.149.2.33
@@ -697,156 +697,95 @@
 	return BATload_intern(i);
 }
 @- BAT preload
-To avoid random disk access to large (memory-mapped) BATs it may help to issue a preload
-request.
-Of course, it does not make sense to touch more then we can physically accomodate.
+To avoid random disk access to large (memory-mapped) BATs it may help to issue a preload request.
+Of course, it does not make sense to touch more then we can physically accomodate (budget).
 @c
-size_t
-BATaccess(BAT *b, int what, int advise, int preload) {
-	size_t *i, *limit;
-	size_t v1 = 0, v2 = 0, v3 = 0, v4 = 0;
-	size_t step = MT_pagesize()/sizeof(size_t);
-	size_t pages = (size_t) (0.8 * MT_npages());
-
-	assert(advise==MMAP_NORMAL||advise==MMAP_RANDOM||advise==MMAP_SEQUENTIAL||advise==MMAP_WILLNEED||advise==MMAP_DONTNEED);
-
-	/* VAR heaps (inherent random access) */
-	if ( what&USE_HEAD && b->H->vheap && b->H->vheap->base ) {
-		if (b->H->vheap->storage != STORE_MEM && b->H->vheap->size > MT_MMAP_TILE) {
-			MT_mmap_inform(b->H->vheap->base, b->H->vheap->size, preload, MMAP_WILLNEED, 0);
-		}
-		if (preload > 0 && pages > 0) {
-			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->vheap\n", BATgetId(b), advise);
-			limit = (size_t *) (b->H->vheap->base + b->H->vheap->free) - 4 * step;
-			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
-			i = (size_t *) (((size_t)b->H->vheap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
-			for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
-				v1 += *i;
-				v2 += *(i + step);
-				v3 += *(i + 2*step);
-				v4 += *(i + 3*step);
-			}
-			limit += 4 * step;
-			for (; i <= limit && pages > 0; i+= step, pages--) {
-				v1 += *i;
-			}
+/* modern linux tends to use 128K readaround = 64K readahead
+ * changes have been going on in 2009, towards true readahead
+ * http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/mm/readahead.c
+ *
+ * Peter Feb2010: I tried to do prefetches further apart, to trigger multiple readahead
+ * units in parallel, but it does improve performance visibly
+ */
+static size_t access_heap(str id, str hp, Heap *h, char* base, size_t sz, int touch, int preload, int advise) {
+	size_t v0 = 0, v1 = 0, v2 = 0, v3 = 0, v4 = 0, v5 =0, v6 = 0, v7 = 0, page = MT_pagesize();
+	int t = GDKms();
+	if (h->storage != STORE_MEM && h->size > MT_MMAP_TILE) {
+		MT_mmap_inform(h->base, h->size, preload, advise, 0);
+		if (preload > 0) {
+			void* alignedbase = (void*) (((size_t) base) & ~(page-1));
+			size_t alignedsz = (sz + (page-1)) & ~(page-1);
+			int ret = posix_madvise(alignedbase, sz, advise);
+			if (ret) THRprintf(GDKerr, "#MT_mmap_inform: posix_madvise(file=%s, base="PTRFMT", len="SZFMT"MB, advice=%d) = %d\n",
+				h->filename, PTRFMTCAST alignedbase, alignedsz >> 20, advise, errno);
 		}
 	}
-	if ( what&USE_TAIL && b->T->vheap && b->T->vheap->base ) {
-		if (b->T->vheap->storage != STORE_MEM && b->T->vheap->size > MT_MMAP_TILE) {
-			MT_mmap_inform(b->T->vheap->base, b->T->vheap->size, preload, MMAP_WILLNEED, 0);
-		}
-		if (preload > 0 && pages > 0) {
-			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->vheap\n", BATgetId(b), advise);
-			limit = (size_t *) (b->T->vheap->base + b->T->vheap->free - sizeof(size_t)) - 4 * step;
-			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
-			i = (size_t *) (((size_t)b->T->vheap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
-			for (; i <= limit && pages > 3; i+= 4*step, pages-= 4) {
-				v1 += *i;
-				v2 += *(i + step);
-				v3 += *(i + 2*step);
-				v4 += *(i + 3*step);
-			}
-			limit += 4 * step;
-			for (; i <= limit && pages > 0; i+= step, pages--) {
-				v1 += *i;
-			}
+	if (touch && preload > 0) {
+		/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
+		size_t *lo = (size_t *) (((size_t) base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
+		size_t *hi = (size_t *) (base + sz);
+		for (hi -= 8*page; lo <= hi; lo += 8*page) {
+			/* try to trigger loading of multiple pages without blocking */
+			v0 += lo[0*page]; v1 += lo[1*page]; v2 += lo[2*page]; v3 += lo[3*page];
+			v4 += lo[4*page]; v5 += lo[5*page]; v6 += lo[6*page]; v7 += lo[7*page];
		}
+		for (hi += 7*page; lo <= hi; lo +=page) v0 += *lo;
 	}
+	IODEBUG THRprintf(GDKout,"#BATpreload(%s->%s,preload=%d,sz=%dMB,%s) = %dms \n", id, hp, preload, (int) (sz>>20),
+		(advise==BUF_WILLNEED)?"WILLNEED":(advise==BUF_SEQUENTIAL)?"SEQUENTIAL":"UNKNOWN", GDKms()-t);
+	return v0+v1+v2+v3+v4+v5+v6+v7;
+}
-	/* BUN heaps (no need to preload for sequential access) */
-	if ( what&USE_HEAD && b->H->heap.base ) {
-		if (b->H->heap.storage != STORE_MEM && b->H->heap.size > MT_MMAP_TILE) {
-			MT_mmap_inform(b->H->heap.base, b->H->heap.size, preload, advise, 0);
-		}
-		if (preload > 0 && pages > 0 && advise != MMAP_SEQUENTIAL) {
-			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->heap\n", BATgetId(b), advise);
-			limit = (size_t *) (Hloc(b, BUNlast(b)) - sizeof(size_t)) - 4 * step;
-			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
-			i = (size_t *) (((size_t)Hloc(b, BUNfirst(b)) + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
-			for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
-				v1 += *i;
-				v2 += *(i + step);
-				v3 += *(i + 2*step);
-				v4 += *(i + 3*step);
-			}
-			limit += 4 * step;
-			for (; i <= limit && pages > 0; i+= step, pages--) {
-				v1 += *i;
-			}
-		}
-	}
-	if ( what&USE_TAIL && b->T->heap.base ) {
-		if (b->T->heap.storage != STORE_MEM && b->T->heap.size > MT_MMAP_TILE) {
-			MT_mmap_inform(b->T->heap.base, b->T->heap.size, preload, advise, 0);
+size_t
+BATaccess(BAT *b, int what, int advise, int preload) {
+	ssize_t budget = (ssize_t) (0.8 * MT_npages());
+	size_t v = 0, sz;
+	str id = BATgetId(b);
+	BATiter bi = bat_iterator(b);
+
+	assert(advise==MMAP_NORMAL||advise==MMAP_RANDOM||advise==MMAP_SEQUENTIAL||advise==MMAP_WILLNEED||advise==MMAP_DONTNEED);
+	if (BATcount(b) == 0) return 0;
+
+	/* HASH indices (inherent random access). handle first as they *will* be access randomly (one can always hope for locality on the other heaps) */
+	if ( what&USE_HHASH || what&USE_THASH ) {
+		gdk_set_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
+		if ( what&USE_HHASH && b->H->hash && b->H->hash->heap && b->H->hash->heap->base) {
+			budget -= sz = (b->H->hash->heap->free > (size_t) budget)?budget:(ssize_t)b->T->hash->heap->free;
+			v += access_heap(id, "hhash", b->H->hash->heap, b->H->hash->heap->base, sz, 1, preload, MMAP_WILLNEED);
 		}
-		if (preload > 0 && pages > 0 && advise != MMAP_SEQUENTIAL) {
-			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->heap\n", BATgetId(b), advise);
-			limit = (size_t *) (Tloc(b, BUNlast(b)) - sizeof(size_t)) - 4 * step;
-			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
-			i = (size_t *) (((size_t)Tloc(b, BUNfirst(b)) + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
-			for (; i <= limit && pages > 3; i+= 4*step, pages-= 4) {
-				v1 += *i;
-				v2 += *(i + step);
-				v3 += *(i + 2*step);
-				v4 += *(i + 3*step);
-			}
-			limit += 4 * step;
-			for (; i <= limit && pages > 0; i+= step, pages--) {
-				v1 += *i;
-			}
+		if ( what&USE_THASH && b->T->hash && b->T->hash->heap && b->T->hash->heap->base) {
+			budget -= sz = (b->T->hash->heap->free > (size_t) budget)?budget:(ssize_t)b->T->hash->heap->free;
+			v += access_heap(id, "thash", b->T->hash->heap, b->T->hash->heap->base, sz, 1, preload, MMAP_WILLNEED);
 		}
+		gdk_unset_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
 	}
-	/* HASH indices (inherent random access) */
-	if ( what&USE_HHASH || what&USE_THASH )
-		gdk_set_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
-	if ( what&USE_HHASH && b->H->hash && b->H->hash->heap && b->H->hash->heap->base ) {
-		if (b->H->hash->heap->storage != STORE_MEM && b->H->hash->heap->size > MT_MMAP_TILE) {
-			MT_mmap_inform(b->H->hash->heap->base, b->H->hash->heap->size, preload, MMAP_WILLNEED, 0);
+	/* we only touch stuff that is going to be read randomly (WILLNEED). Note varheaps are sequential wrt to the references, or small */
+	if ( what&USE_HEAD) {
+		if (b->H->heap.base) {
+			char *lo = BUNhloc(bi, BUNfirst(b)), *hi = BUNhloc(bi, BUNlast(b)-1);
+			budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
+			v += access_heap(id, "hbuns", &b->H->heap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
 		}
-		if (preload > 0 && pages > 0) {
-			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->hash\n", BATgetId(b), advise);
-			limit = (size_t *) (b->H->hash->heap->base + b->H->hash->heap->size - sizeof(size_t)) - 4 * step;
-			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
-			i = (size_t *) (((size_t)b->H->hash->heap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
-			for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
-				v1 += *i;
-				v2 += *(i + step);
-				v3 += *(i + 2*step);
-				v4 += *(i + 3*step);
-			}
-			limit += 4 * step;
-			for (; i <= limit && pages > 0; i+= step, pages--) {
-				v1 += *i;
-			}
+		if (b->H->vheap && b->H->vheap->base) {
+			char *lo = BUNhead(bi, BUNfirst(b)), *hi = BUNhead(bi, BUNlast(b)-1);
+			budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
+			v += access_heap(id, "hheap", b->H->vheap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
 		}
 	}
-	if ( what&USE_THASH && b->T->hash && b->T->hash->heap && b->T->hash->heap->base ) {
-		if (b->T->hash->heap->storage != STORE_MEM && b->T->hash->heap->size > MT_MMAP_TILE) {
-			MT_mmap_inform(b->T->hash->heap->base, b->T->hash->heap->size, preload, MMAP_WILLNEED, 0);
+	if ( what&USE_TAIL) {
+		if (b->T->heap.base) {
+			char *lo = BUNtloc(bi, BUNfirst(b)), *hi = BUNtloc(bi, BUNlast(b)-1);
+			budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
+			v += access_heap(id, "tbuns", &b->T->heap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
 		}
-		if (preload > 0 && pages > 0) {
-			IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->hash\n", BATgetId(b), advise);
-			limit = (size_t *) (b->T->hash->heap->base + b->T->hash->heap->size - sizeof(size_t)) - 4 * step;
-			/* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
-			i = (size_t *) (((size_t)b->T->hash->heap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
-			for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
-				v1 += *i;
-				v2 += *(i + step);
-				v3 += *(i + 2*step);
-				v4 += *(i + 3*step);
-			}
-			limit += 4 * step;
-			for (; i <= limit && pages > 0; i+= step, pages--) {
-				v1 += *i;
-			}
+		if (b->T->vheap && b->T->vheap->base) {
+			char *lo = BUNtail(bi, BUNfirst(b)), *hi = BUNtail(bi, BUNlast(b)-1);
+			budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
+			v += access_heap(id, "theap", b->T->vheap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
 		}
 	}
-	if ( what&USE_HHASH || what&USE_THASH )
-		gdk_unset_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
-
-	return v1 + v2 + v3 + v4;
+	return v;
 }
 @}
Index: gdk_posix.mx
===================================================================
RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_posix.mx,v
retrieving revision 1.176.2.21
retrieving revision 1.176.2.22
diff -u -d -r1.176.2.21 -r1.176.2.22
--- gdk_posix.mx	18 Feb 2010 01:03:55 -0000	1.176.2.21
+++ gdk_posix.mx	18 Feb 2010 22:38:53 -0000	1.176.2.22
@@ -909,10 +909,8 @@
 		unload = MT_mmap_tab[i].usecnt == 0;
 	}
 	(void) pthread_mutex_unlock(&MT_mmap_lock);
-	if (i >= 0 && preload > 0)
-		ret = posix_madvise(base, len, advise);
-	else if (unload)
-		ret = posix_madvise(base, len, MMAP_NORMAL);
+	if (unload)
+		ret = posix_madvise(base, len, BUF_SEQUENTIAL);
 	if (ret) {
 		stream_printf(GDKerr, "#MT_mmap_inform: posix_madvise(file=%s, fd=%d, base="PTRFMT", len="SZFMT"MB, advice=%d) = %d\n",
 			(i >= 0 ? MT_mmap_tab[i].path : ""), (i >= 0 ? MT_mmap_tab[i].fd : -1),
_______________________________________________
Monetdb-checkins mailing list
Monetdb-checkins@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/monetdb-checkins
-- 
| Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl |
| CWI, P.O.Box 94079  | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4199       |