19 Feb 2010, 2:29 a.m.
On Fri, Feb 19, 2010 at 01:46:23AM +0100, Peter Boncz wrote:
> Hi Stefan
>
> Thanks, indeed in all areas improvements are needed:
> 1) indeed (scary use of free!) this should be corrected

Done.

> 2) typically yes. I do recall now that BATfetchjoin heap sharing will
> invalidate the otherwise always applying order correlation. If we have a way
> to detect that a heap is shared, we should treat those shared string heaps
> as WILLNEED.

I'll leave that for tomorrow, or later ...

> 3) also correct. The MT_mmap_find() could easily find entries by range
> overlap, then inform would find the relevant heap

something like this, I suppose:

Index: MonetDB/src/gdk/gdk_posix.mx
===================================================================
RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_posix.mx,v
retrieving revision 1.176.2.22
diff -u -r1.176.2.22 gdk_posix.mx
--- MonetDB/src/gdk/gdk_posix.mx 18 Feb 2010 22:38:53 -0000 1.176.2.22
+++ MonetDB/src/gdk/gdk_posix.mx 19 Feb 2010 01:18:34 -0000
@@ -587,7 +587,8 @@
     int i, prev = MT_MMAP_BUFSIZE;

     for (i = MT_mmap_first; i >= 0; i = MT_mmap_tab[i].next) {
-        if (MT_mmap_tab[i].base == base) {
+        if (MT_mmap_tab[i].base <= (char*) base &&
+            (char*) base < MT_mmap_tab[i].base + MT_mmap_tab[i].len) {
             return prev;
         }
         prev = i;

> Finally, now sequential advise will not trigger preloading, but I actually
> think it can help (if you have enough memory). Maybe prefetch sequential
> heaps until some limit, like Martin suggests, e.g. 1/4*threads of memory.

indeed ...
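
The range test in the patch can be tried out in isolation. What follows is only a minimal sketch under stated assumptions, not the gdk_posix.mx code: struct map_entry and find_by_range() are hypothetical stand-ins for an MT_mmap_tab[] entry and MT_mmap_find(), only the base, len and next fields are modelled, and the function returns the index of the matching entry rather than its predecessor as the patched loop does. Addresses are compared as integers so that the comparison stays well defined for pointers into different mappings.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one mapping-table entry. */
struct map_entry {
    char *base;   /* start address of the mapping */
    size_t len;   /* length of the mapping in bytes */
    int next;     /* index of the next entry in the list, -1 at the end */
};

/* Walk the list starting at 'first' and return the index of the entry
 * whose half-open range [base, base+len) contains p, or -1 if no
 * entry contains it. */
static int
find_by_range(const struct map_entry *tab, int first, const void *p)
{
    uintptr_t q = (uintptr_t) p;
    int i;

    for (i = first; i >= 0; i = tab[i].next) {
        uintptr_t lo = (uintptr_t) tab[i].base;
        uintptr_t hi = lo + tab[i].len;

        if (lo <= q && q < hi)
            return i;
    }
    return -1;
}
```

With a lookup like this, a view whose heap base points somewhere inside the parent's mapping resolves to the parent's entry, whereas an exact comparison on the base pointer would miss it.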
Stefan

> Peter
>
> -----Original Message-----
> From: Stefan Manegold [mailto:Stefan.Manegold@cwi.nl]
> Sent: Friday, 19 February 2010 1:34
> To: monetdb-developers@lists.sourceforge.net; Peter Boncz
> Cc: monetdb-checkins@lists.sourceforge.net
> Subject: * Re: [Monetdb-checkins] MonetDB/src/gdk gdk_posix.mx, Feb2010,
> 1.176.2.21, 1.176.2.22 gdk_storage.mx, Feb2010, 1.149.2.32, 1.149.2.33
>
> Peter,
>
> I have some questions to make sure I understand your new code correctly:
>
> 1)
> I don't see any place in the hash code (at least not in gdk_search.mx)
> where the "free" element of a hash heap is set (or used) other than the
> initialization to 0 in HEAPalloc;
> thus, I guess, "free" for hash heaps is always 0;
> hence, shouldn't we use "size" instead of "free" for the madvise & preload
> size of hash heaps (as we did in the original BATpreload/BATaccess code)?
>
> 2)
> Am I right that for string heaps you conclude from a strong order
> correlation between the off-heap and the string heap (due to sequential
> load/insertion) that also the first and last BUN in the offset point to the
> "first" and "last" string in the string heap?
> Well, indeed, since access is to be considered in page size granularity,
> this might be reasonable ...
>
>
> 3)
> (This was the same in the previous version of the code)
> For BUN heaps, in case of views (slices), the base pointer of the view's
> heap might not be the same as the parent's heap, in fact, it might not be
> page-aligned.
> If I understand the MT_mmap_tab[] array correctly, it identifies heaps by
> the page-aligned base pointer of the parent's heap.
> Hence, BATaccess() on a slice view BAT with non-aligned heap->base
> pointer calls MT_mmap_inform() (through access_heap()) with a non-aligned
> heap->base, which is not found in MT_mmap_tab[], and hence MT_mmap_inform()
> does nothing with that heap. With preload==1 it hence does not register the
> posix_madvise() call that access_heap() does. Consequently, with
> preload==-1, MT_mmap_inform() will never reset the advise set via slice
> views, unless there is (also) access to the original parent's heap (i.e.,
> with page-aligned heap->base pointer).
> I just noticed this, but do not yet understand whether, and if so which,
> consequences this (might) have ...
>
>
> Stefan
>
>
> On Thu, Feb 18, 2010 at 10:39:22PM +0000, Peter Boncz wrote:
> > Update of /cvsroot/monetdb/MonetDB/src/gdk
> > In directory sfp-cvsdas-1.v30.ch3.sourceforge.com:/tmp/cvs-serv28734
> >
> > Modified Files:
> >     Tag: Feb2010
> >     gdk_posix.mx gdk_storage.mx
> > Log Message:
> > did experimentation with sequential mmap I/O.
> > - on very fast subsystems (such as 16xssd) it is three times slower than
> >   optimally tuned direct I/O (1GB/s vs 3GB/s)
> > - with fewer disks the difference is smaller (e.g. 140 vs 200MB/s)
> > regrettably, nothing helped to get it higher.
> >
> > the below checkin makes the following changes:
> > - simplified BATaccess code by separating out routine
> > - made BATaccess more precise in what to preload (only BUNfirst-BUNlast)
> > - observe that large string heaps have a high sequential correlation
> >   hence always WILLNEED fetching is overkill
> > - move the madvise() call back to BATaccess at the start of the access but
> >   removing the advise is done in vmtrim, as you need the overview when the
> >   last user is away.
> > - the basic advise is SEQUENTIAL (ie decent I/O)
> >
> >
> >
> > Index: gdk_storage.mx
> > ===================================================================
> > RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_storage.mx,v
> > retrieving revision 1.149.2.32
> > retrieving revision 1.149.2.33
> > diff -u -d -r1.149.2.32 -r1.149.2.33
> > --- gdk_storage.mx 18 Feb 2010 01:04:11 -0000 1.149.2.32
> > +++ gdk_storage.mx 18 Feb 2010 22:39:08 -0000 1.149.2.33
> > @@ -697,156 +697,95 @@
> >      return BATload_intern(i);
> >  }
> >  @- BAT preload
> > -To avoid random disk access to large (memory-mapped) BATs it may help to issue a preload
> > -request.
> > -Of course, it does not make sense to touch more then we can physically accomodate.
> > +To avoid random disk access to large (memory-mapped) BATs it may help to issue a preload request.
> > +Of course, it does not make sense to touch more then we can physically accomodate (budget).
> >  @c
> > -size_t
> > -BATaccess(BAT *b, int what, int advise, int preload) {
> > -    size_t *i, *limit;
> > -    size_t v1 = 0, v2 = 0, v3 = 0, v4 = 0;
> > -    size_t step = MT_pagesize()/sizeof(size_t);
> > -    size_t pages = (size_t) (0.8 * MT_npages());
> > -
> > -    assert(advise==MMAP_NORMAL||advise==MMAP_RANDOM||advise==MMAP_SEQUENTIAL||advise==MMAP_WILLNEED||advise==MMAP_DONTNEED);
> > -
> > -    /* VAR heaps (inherent random access) */
> > -    if ( what&USE_HEAD && b->H->vheap && b->H->vheap->base ) {
> > -        if (b->H->vheap->storage != STORE_MEM && b->H->vheap->size > MT_MMAP_TILE) {
> > -            MT_mmap_inform(b->H->vheap->base, b->H->vheap->size, preload, MMAP_WILLNEED, 0);
> > -        }
> > -        if (preload > 0 && pages > 0) {
> > -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->vheap\n", BATgetId(b), advise);
> > -            limit = (size_t *) (b->H->vheap->base + b->H->vheap->free) - 4 * step;
> > -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
> > -            i = (size_t *) (((size_t)b->H->vheap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
> > -            for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
> > -                v1 += *i;
> > -                v2 += *(i + step);
> > -                v3 += *(i + 2*step);
> > -                v4 += *(i + 3*step);
> > -            }
> > -            limit += 4 * step;
> > -            for (; i <= limit && pages > 0; i+= step, pages--) {
> > -                v1 += *i;
> > -            }
> > +/* modern linux tends to use 128K readaround = 64K readahead
> > + * changes have been going on in 2009, towards true readahead
> > + * http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/mm/readahead.c
> > + *
> > + * Peter Feb2010: I tried to do prefetches further apart, to trigger multiple readahead
> > + *                units in parallel, but it does improve performance visibly
> > + */
> > +static size_t access_heap(str id, str hp, Heap *h, char* base, size_t sz, int touch, int preload, int advise) {
> > +    size_t v0 = 0, v1 = 0, v2 = 0, v3 = 0, v4 = 0, v5 =0, v6 = 0, v7 = 0, page = MT_pagesize();
> > +    int t = GDKms();
> > +    if (h->storage != STORE_MEM && h->size > MT_MMAP_TILE) {
> > +        MT_mmap_inform(h->base, h->size, preload, advise, 0);
> > +        if (preload > 0) {
> > +            void* alignedbase = (void*) (((size_t) base) & ~(page-1));
> > +            size_t alignedsz = (sz + (page-1)) & ~(page-1);
> > +            int ret = posix_madvise(alignedbase, sz, advise);
> > +            if (ret) THRprintf(GDKerr, "#MT_mmap_inform: posix_madvise(file=%s, base="PTRFMT", len="SZFMT"MB, advice=%d) = %d\n",
> > +                    h->filename, PTRFMTCAST alignedbase, alignedsz >> 20, advise, errno);
> >          }
> >      }
> > -    if ( what&USE_TAIL && b->T->vheap && b->T->vheap->base ) {
> > -        if (b->T->vheap->storage != STORE_MEM && b->T->vheap->size > MT_MMAP_TILE) {
> > -            MT_mmap_inform(b->T->vheap->base, b->T->vheap->size, preload, MMAP_WILLNEED, 0);
> > -        }
> > -        if (preload > 0 && pages > 0) {
> > -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->vheap\n", BATgetId(b), advise);
> > -            limit = (size_t *) (b->T->vheap->base + b->T->vheap->free - sizeof(size_t)) - 4 * step;
> > -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
> > -            i = (size_t *) (((size_t)b->T->vheap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
> > -            for (; i <= limit && pages > 3; i+= 4*step, pages-= 4) {
> > -                v1 += *i;
> > -                v2 += *(i + step);
> > -                v3 += *(i + 2*step);
> > -                v4 += *(i + 3*step);
> > -            }
> > -            limit += 4 * step;
> > -            for (; i <= limit && pages > 0; i+= step, pages--) {
> > -                v1 += *i;
> > -            }
> > +    if (touch && preload > 0) {
> > +        /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
> > +        size_t *lo = (size_t *) (((size_t) base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
> > +        size_t *hi = (size_t *) (base + sz);
> > +        for (hi -= 8*page; lo <= hi; lo += 8*page) {
> > +            /* try to trigger loading of multiple pages without blocking */
> > +            v0 += lo[0*page]; v1 += lo[1*page]; v2 += lo[2*page]; v3 += lo[3*page];
> > +            v4 += lo[4*page]; v5 += lo[5*page]; v6 += lo[6*page]; v7 += lo[7*page];
> >          }
> > +        for (hi += 7*page; lo <= hi; lo +=page) v0 += *lo;
> >      }
> > +    IODEBUG THRprintf(GDKout,"#BATpreload(%s->%s,preload=%d,sz=%dMB,%s) = %dms \n", id, hp, preload, (int) (sz>>20),
> > +        (advise==BUF_WILLNEED)?"WILLNEED":(advise==BUF_SEQUENTIAL)?"SEQUENTIAL":"UNKNOWN", GDKms()-t);
> > +    return v0+v1+v2+v3+v4+v5+v6+v7;
> > +}
> >
> > -    /* BUN heaps (no need to preload for sequential access) */
> > -    if ( what&USE_HEAD && b->H->heap.base ) {
> > -        if (b->H->heap.storage != STORE_MEM && b->H->heap.size > MT_MMAP_TILE) {
> > -            MT_mmap_inform(b->H->heap.base, b->H->heap.size, preload, advise, 0);
> > -        }
> > -        if (preload > 0 && pages > 0 && advise != MMAP_SEQUENTIAL) {
> > -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->heap\n", BATgetId(b), advise);
> > -            limit = (size_t *) (Hloc(b, BUNlast(b)) - sizeof(size_t)) - 4 * step;
> > -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
> > -            i = (size_t *) (((size_t)Hloc(b, BUNfirst(b)) + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
> > -            for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
> > -                v1 += *i;
> > -                v2 += *(i + step);
> > -                v3 += *(i + 2*step);
> > -                v4 += *(i + 3*step);
> > -            }
> > -            limit += 4 * step;
> > -            for (; i <= limit && pages > 0; i+= step, pages--) {
> > -                v1 += *i;
> > -            }
> > -        }
> > -    }
> > -    if ( what&USE_TAIL && b->T->heap.base ) {
> > -        if (b->T->heap.storage != STORE_MEM && b->T->heap.size > MT_MMAP_TILE) {
> > -            MT_mmap_inform(b->T->heap.base, b->T->heap.size, preload, advise, 0);
> > +size_t
> > +BATaccess(BAT *b, int what, int advise, int preload) {
> > +    ssize_t budget = (ssize_t) (0.8 * MT_npages());
> > +    size_t v = 0, sz;
> > +    str id = BATgetId(b);
> > +    BATiter bi = bat_iterator(b);
> > +
> > +    assert(advise==MMAP_NORMAL||advise==MMAP_RANDOM||advise==MMAP_SEQUENTIAL||advise==MMAP_WILLNEED||advise==MMAP_DONTNEED);
> > +    if (BATcount(b) == 0) return 0;
> > +
> > +    /* HASH indices (inherent random access). handle first as they *will* be access randomly (one can always hope for locality on the other heaps) */
> > +    if ( what&USE_HHASH || what&USE_THASH ) {
> > +        gdk_set_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
> > +        if ( what&USE_HHASH && b->H->hash && b->H->hash->heap && b->H->hash->heap->base) {
> > +            budget -= sz = (b->H->hash->heap->free > (size_t) budget)?budget:(ssize_t)b->T->hash->heap->free;
> > +            v += access_heap(id, "hhash", b->H->hash->heap, b->H->hash->heap->base, sz, 1, preload, MMAP_WILLNEED);
> >          }
> > -        if (preload > 0 && pages > 0 && advise != MMAP_SEQUENTIAL) {
> > -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->heap\n", BATgetId(b), advise);
> > -            limit = (size_t *) (Tloc(b, BUNlast(b)) - sizeof(size_t)) - 4 * step;
> > -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
> > -            i = (size_t *) (((size_t)Tloc(b, BUNfirst(b)) + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
> > -            for (; i <= limit && pages > 3; i+= 4*step, pages-= 4) {
> > -                v1 += *i;
> > -                v2 += *(i + step);
> > -                v3 += *(i + 2*step);
> > -                v4 += *(i + 3*step);
> > -            }
> > -            limit += 4 * step;
> > -            for (; i <= limit && pages > 0; i+= step, pages--) {
> > -                v1 += *i;
> > -            }
> > +        if ( what&USE_THASH && b->T->hash && b->T->hash->heap && b->T->hash->heap->base) {
> > +            budget -= sz = (b->T->hash->heap->free > (size_t) budget)?budget:(ssize_t)b->T->hash->heap->free;
> > +            v += access_heap(id, "thash", b->T->hash->heap, b->T->hash->heap->base, sz, 1, preload, MMAP_WILLNEED);
> >          }
> > +        gdk_unset_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
> >      }
> >
> > -    /* HASH indices (inherent random access) */
> > -    if ( what&USE_HHASH || what&USE_THASH )
> > -        gdk_set_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
> > -    if ( what&USE_HHASH && b->H->hash && b->H->hash->heap && b->H->hash->heap->base ) {
> > -        if (b->H->hash->heap->storage != STORE_MEM && b->H->hash->heap->size > MT_MMAP_TILE) {
> > -            MT_mmap_inform(b->H->hash->heap->base, b->H->hash->heap->size, preload, MMAP_WILLNEED, 0);
> > +    /* we only touch stuff that is going to be read randomly (WILLNEED). Note varheaps are sequential wrt to the references, or small */
> > +    if ( what&USE_HEAD) {
> > +        if (b->H->heap.base) {
> > +            char *lo = BUNhloc(bi, BUNfirst(b)), *hi = BUNhloc(bi, BUNlast(b)-1);
> > +            budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
> > +            v += access_heap(id, "hbuns", &b->H->heap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
> >          }
> > -        if (preload > 0 && pages > 0) {
> > -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->hash\n", BATgetId(b), advise);
> > -            limit = (size_t *) (b->H->hash->heap->base + b->H->hash->heap->size - sizeof(size_t)) - 4 * step;
> > -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
> > -            i = (size_t *) (((size_t)b->H->hash->heap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
> > -            for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
> > -                v1 += *i;
> > -                v2 += *(i + step);
> > -                v3 += *(i + 2*step);
> > -                v4 += *(i + 3*step);
> > -            }
> > -            limit += 4 * step;
> > -            for (; i <= limit && pages > 0; i+= step, pages--) {
> > -                v1 += *i;
> > -            }
> > +        if (b->H->vheap && b->H->vheap->base) {
> > +            char *lo = BUNhead(bi, BUNfirst(b)), *hi = BUNhead(bi, BUNlast(b)-1);
> > +            budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
> > +            v += access_heap(id, "hheap", b->H->vheap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
> >          }
> >      }
> > -    if ( what&USE_THASH && b->T->hash && b->T->hash->heap && b->T->hash->heap->base ) {
> > -        if (b->T->hash->heap->storage != STORE_MEM && b->T->hash->heap->size > MT_MMAP_TILE) {
> > -            MT_mmap_inform(b->T->hash->heap->base, b->T->hash->heap->size, preload, MMAP_WILLNEED, 0);
> > +    if ( what&USE_TAIL) {
> > +        if (b->T->heap.base) {
> > +            char *lo = BUNtloc(bi, BUNfirst(b)), *hi = BUNtloc(bi, BUNlast(b)-1);
> > +            budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
> > +            v += access_heap(id, "tbuns", &b->T->heap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
> >          }
> > -        if (preload > 0 && pages > 0) {
> > -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->hash\n", BATgetId(b), advise);
> > -            limit = (size_t *) (b->T->hash->heap->base + b->T->hash->heap->size - sizeof(size_t)) - 4 * step;
> > -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
> > -            i = (size_t *) (((size_t)b->T->hash->heap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
> > -            for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
> > -                v1 += *i;
> > -                v2 += *(i + step);
> > -                v3 += *(i + 2*step);
> > -                v4 += *(i + 3*step);
> > -            }
> > -            limit += 4 * step;
> > -            for (; i <= limit && pages > 0; i+= step, pages--) {
> > -                v1 += *i;
> > -            }
> > +        if (b->T->vheap && b->T->vheap->base) {
> > +            char *lo = BUNtail(bi, BUNfirst(b)), *hi = BUNtail(bi, BUNlast(b)-1);
> > +            budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
> > +            v += access_heap(id, "theap", b->T->vheap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
> >          }
> >      }
> > -    if ( what&USE_HHASH || what&USE_THASH )
> > -        gdk_unset_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
> > -
> > -    return v1 + v2 + v3 + v4;
> > +    return v;
> >  }
> >  @}
> >
> >
> > Index: gdk_posix.mx
> > ===================================================================
> > RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_posix.mx,v
> > retrieving revision 1.176.2.21
> > retrieving revision 1.176.2.22
> > diff -u -d -r1.176.2.21 -r1.176.2.22
> > --- gdk_posix.mx 18 Feb 2010 01:03:55 -0000 1.176.2.21
> > +++ gdk_posix.mx 18 Feb 2010 22:38:53 -0000 1.176.2.22
> > @@ -909,10 +909,8 @@
> >          unload = MT_mmap_tab[i].usecnt == 0;
> >      }
> >      (void) pthread_mutex_unlock(&MT_mmap_lock);
> > -    if (i >= 0 && preload > 0)
> > -        ret = posix_madvise(base, len, advise);
> > -    else if (unload)
> > -        ret = posix_madvise(base, len, MMAP_NORMAL);
> > +    if (unload)
> > +        ret = posix_madvise(base, len, BUF_SEQUENTIAL);
> >      if (ret) {
> >          stream_printf(GDKerr, "#MT_mmap_inform: posix_madvise(file=%s, fd=%d, base="PTRFMT", len="SZFMT"MB, advice=%d) = %d\n",
> >              (i >= 0 ? MT_mmap_tab[i].path : ""), (i >= 0 ? MT_mmap_tab[i].fd : -1),

--
| Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl |
| CWI, P.O.Box 94079  | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4199       |
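
The alignment concern raised under 3) comes down to simple address arithmetic, which can be sketched on its own. The snippet below is an illustration under stated assumptions, not code from gdk_storage.mx: page_align_range() and struct page_range are hypothetical names, and the page size is assumed to be a power of two. It rounds the start of a sub-range (such as a view's slice of the parent heap) down to a page boundary and rounds its end up, so that the resulting length also covers the slack in front of the unaligned start.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: a page-aligned address range. */
struct page_range {
    uintptr_t start;  /* rounded down to a page boundary */
    size_t len;       /* multiple of the page size covering [addr, addr+sz) */
};

/* Compute the smallest page-aligned range that covers the sz bytes
 * starting at addr.  'page' must be a power of two. */
static struct page_range
page_align_range(uintptr_t addr, size_t sz, size_t page)
{
    struct page_range r;
    uintptr_t mask = (uintptr_t) page - 1;
    uintptr_t end = (addr + sz + mask) & ~mask;  /* round the end up */

    r.start = addr & ~mask;                      /* round the start down */
    r.len = (size_t) (end - r.start);
    return r;
}
```

A caller could then hand r.start and r.len to posix_madvise(). Rounding the end rather than the size matters for unaligned slices: a range of 2 bytes starting one byte before a page boundary touches two pages, which rounding the size alone would not capture.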