19 Feb
2010
7:19 a.m.
I applied Peter's patches as of 1 AM and started the SF100 run. It gave a segfault after 10 minutes, but for once I did not attend Q1 to 'see/feel' the processing. Rebuilding now with all of tonight's patches.

Peter Boncz wrote:
> Hi Stefan
>
> Thanks, indeed in all areas improvements are needed:
> 1) indeed (scary use of free!) this should be corrected
> 2) typically yes. I do recall now that BATfetchjoin heap sharing will
> invalidate the otherwise always applying order correlation. If we have a way
> to detect that a heap is shared, we should treat those shared string heaps
> as WILLNEED.
> 3) also correct. The MT_mmap_find() could easily find entries by range
> overlap, then inform would find the relevant heap
>
> Finally, now sequential advise will not trigger preloading, but I actually
> think it can help (if you have enough memory). Maybe prefetch sequential
> heaps until some limit, like Martin suggests, e.g. 1/4*threads of memory.
>
> Peter
>
> -----Original Message-----
> From: Stefan Manegold [mailto:Stefan.Manegold@cwi.nl]
> Sent: vrijdag 19 februari 2010 1:34
> To: monetdb-developers@lists.sourceforge.net; Peter Boncz
> Cc: monetdb-checkins@lists.sourceforge.net
> Subject: Re: [Monetdb-checkins] MonetDB/src/gdk gdk_posix.mx, Feb2010,
> 1.176.2.21, 1.176.2.22 gdk_storage.mx, Feb2010, 1.149.2.32, 1.149.2.33
>
> Peter,
>
> I have some questions to make sure I understand your new code correctly:
>
> 1)
> I don't see any place in the hash code (at least not in gdk_search.mx)
> where the "free" element of a hash heap is set (or used) other than the
> initialization to 0 in HEAPalloc;
> thus, I guess, "free" for hash heaps is always 0;
> hence, shouldn't we use "size" instead of "free" for the madvise & preload
> size of hash heaps (as we did in the original BATpreload/BATaccess code)?
>
> 2)
> Am I right that for string heaps you conclude from a strong order
> correlation between the offset heap and the string heap (due to sequential
> load/insertion) that also the first and last BUN in the offset heap point
> to the "first" and "last" string in the string heap?
> Well, indeed, since access is to be considered at page-size granularity,
> this might be reasonable ...
>
> 3)
> (This was the same in the previous version of the code.)
> For BUN heaps, in case of views (slices), the base pointer of the view's
> heap might not be the same as the parent's heap; in fact, it might not be
> page-aligned.
> If I understand the MT_mmap_tab[] array correctly, it identifies heaps by
> the page-aligned base pointer of the parent's heap.
> Hence, BATaccess() on a slice view BAT with a non-aligned heap->base
> pointer calls MT_mmap_inform() (through access_heap()) with a non-aligned
> heap->base, which is not found in MT_mmap_tab[], and hence MT_mmap_inform()
> does nothing with that heap. With preload==1 it hence does not register the
> posix_madvise() call that access_heap() does. Consequently, with
> preload==-1, MT_mmap_inform() will never reset the advice set via slice
> views, unless there is (also) access to the original parent's heap (i.e.,
> one with a page-aligned heap->base pointer).
> I just noticed this, but do not yet understand whether, and if so which,
> consequences this might have ...
>
> Stefan
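To make Peter's point 3 concrete for myself, a minimal sketch of a range-overlap lookup over MT_mmap_tab[] — the entry layout and the helper name below are my assumptions, not the actual GDK code. The point is that a view's heap->base lies strictly inside its parent's mapping, so a containment test matches where the current exact base-pointer comparison does not:

    #include <stddef.h>

    /* assumed entry layout, modelled on the discussion above */
    typedef struct {
        char   *base;    /* page-aligned base of the parent's mapping */
        size_t  len;     /* length of the mapping */
        int     usecnt;  /* number of registered users */
    } MT_mmap_entry;

    #define MT_MMAP_MAX 1024
    static MT_mmap_entry MT_mmap_tab[MT_MMAP_MAX];
    static int MT_mmap_cnt = 0;

    /* return the index of the mapping that contains [p, p+len), or -1 */
    static int
    MT_mmap_find_overlap(const char *p, size_t len)
    {
        int i;

        for (i = 0; i < MT_mmap_cnt; i++) {
            const char *lo = MT_mmap_tab[i].base;
            const char *hi = lo + MT_mmap_tab[i].len;

            if (p >= lo && p + len <= hi)
                return i;
        }
        return -1;
    }

With such a lookup, MT_mmap_inform() called on a slice view would resolve to the parent's entry, which would also answer my question 3 above.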
> On Thu, Feb 18, 2010 at 10:39:22PM +0000, Peter Boncz wrote:
>> Update of /cvsroot/monetdb/MonetDB/src/gdk
>> In directory sfp-cvsdas-1.v30.ch3.sourceforge.com:/tmp/cvs-serv28734
>>
>> Modified Files:
>>      Tag: Feb2010
>>      gdk_posix.mx gdk_storage.mx
>> Log Message:
>> did experimentation with sequential mmap I/O.
>> - on very fast subsystems (such as 16xssd) it is three times slower than
>>   optimally tuned direct I/O (1GB/s vs 3GB/s)
>> - with less disks the difference is smaller (e.g. 140 vs 200MB/s);
>>   regrettably, nothing helped to get it higher.
>>
>> the below checkin makes the following changes:
>> - simplified BATaccess code by separating out a routine
>> - made BATaccess more precise in what to preload (only BUNfirst-BUNlast)
>> - observe that large string heaps have a high sequential correlation,
>>   hence always WILLNEED fetching is overkill
>> - moved the madvise() call back to BATaccess at the start of the access,
>>   but removing the advise is done in vmtrim, as you need the overview when
>>   the last user is away.
>> - the basic advise is SEQUENTIAL (i.e. decent I/O)
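Before wading into the diff itself: the new BATaccess() below clamps preloading with a budget of 0.8 * MT_npages() and charges every heap against it before handing the heap to access_heap(). A minimal sketch of that bookkeeping, with helper names of my own invention — sysconf(_SC_PHYS_PAGES) stands in for MT_npages() on Linux, and Peter's alternative cap of memory/(4*threads) would slot into the same place:

    #include <stddef.h>
    #include <unistd.h>

    /* total preload budget in pages: 0.8 of physical memory */
    static size_t
    preload_budget_pages(void)
    {
        return (size_t) (0.8 * (double) sysconf(_SC_PHYS_PAGES));
    }

    /* charge one heap against the budget; returns how many bytes of
     * this heap we may still advise/touch */
    static size_t
    charge_budget(size_t *budget_pages, size_t heap_bytes)
    {
        size_t page  = (size_t) sysconf(_SC_PAGESIZE);
        size_t pages = (heap_bytes + page - 1) / page;

        if (pages > *budget_pages)
            pages = *budget_pages;
        *budget_pages -= pages;
        return pages * page;
    }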
>> Index: gdk_storage.mx
>> ===================================================================
>> RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_storage.mx,v
>> retrieving revision 1.149.2.32
>> retrieving revision 1.149.2.33
>> diff -u -d -r1.149.2.32 -r1.149.2.33
>> --- gdk_storage.mx    18 Feb 2010 01:04:11 -0000    1.149.2.32
>> +++ gdk_storage.mx    18 Feb 2010 22:39:08 -0000    1.149.2.33
>> @@ -697,156 +697,95 @@
>>      return BATload_intern(i);
>>  }
>>  @- BAT preload
>> -To avoid random disk access to large (memory-mapped) BATs it may help to issue a preload
>> -request.
>> -Of course, it does not make sense to touch more then we can physically accomodate.
>> +To avoid random disk access to large (memory-mapped) BATs it may help to issue a preload request.
>> +Of course, it does not make sense to touch more then we can physically accomodate (budget).
>>  @c
>> -size_t
>> -BATaccess(BAT *b, int what, int advise, int preload) {
>> -    size_t *i, *limit;
>> -    size_t v1 = 0, v2 = 0, v3 = 0, v4 = 0;
>> -    size_t step = MT_pagesize()/sizeof(size_t);
>> -    size_t pages = (size_t) (0.8 * MT_npages());
>> -
>> -    assert(advise==MMAP_NORMAL||advise==MMAP_RANDOM||advise==MMAP_SEQUENTIAL||advise==MMAP_WILLNEED||advise==MMAP_DONTNEED);
>> -
>> -    /* VAR heaps (inherent random access) */
>> -    if ( what&USE_HEAD && b->H->vheap && b->H->vheap->base ) {
>> -        if (b->H->vheap->storage != STORE_MEM && b->H->vheap->size > MT_MMAP_TILE) {
>> -            MT_mmap_inform(b->H->vheap->base, b->H->vheap->size, preload, MMAP_WILLNEED, 0);
>> -        }
>> -        if (preload > 0 && pages > 0) {
>> -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->vheap\n", BATgetId(b), advise);
>> -            limit = (size_t *) (b->H->vheap->base + b->H->vheap->free) - 4 * step;
>> -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
>> -            i = (size_t *) (((size_t)b->H->vheap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
>> -            for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
>> -                v1 += *i;
>> -                v2 += *(i + step);
>> -                v3 += *(i + 2*step);
>> -                v4 += *(i + 3*step);
>> -            }
>> -            limit += 4 * step;
>> -            for (; i <= limit && pages > 0; i+= step, pages--) {
>> -                v1 += *i;
>> -            }
>> +/* modern linux tends to use 128K readaround = 64K readahead
>> + * changes have been going on in 2009, towards true readahead
>> + * http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/mm/readahead.c
>> + *
>> + * Peter Feb2010: I tried to do prefetches further apart, to trigger multiple readahead
>> + *                units in parallel, but it does improve performance visibly
>> + */
>> +static size_t access_heap(str id, str hp, Heap *h, char* base, size_t sz, int touch, int preload, int advise) {
>> +    size_t v0 = 0, v1 = 0, v2 = 0, v3 = 0, v4 = 0, v5 =0, v6 = 0, v7 = 0, page = MT_pagesize();
>> +    int t = GDKms();
>> +    if (h->storage != STORE_MEM && h->size > MT_MMAP_TILE) {
>> +        MT_mmap_inform(h->base, h->size, preload, advise, 0);
>> +        if (preload > 0) {
>> +            void* alignedbase = (void*) (((size_t) base) & ~(page-1));
>> +            size_t alignedsz = (sz + (page-1)) & ~(page-1);
>> +            int ret = posix_madvise(alignedbase, sz, advise);
>> +            if (ret) THRprintf(GDKerr, "#MT_mmap_inform: posix_madvise(file=%s, base="PTRFMT", len="SZFMT"MB, advice=%d) = %d\n",
>> +                h->filename, PTRFMTCAST alignedbase, alignedsz >> 20, advise, errno);
>>          }
>>      }
>> -    if ( what&USE_TAIL && b->T->vheap && b->T->vheap->base ) {
>> -        if (b->T->vheap->storage != STORE_MEM && b->T->vheap->size > MT_MMAP_TILE) {
>> -            MT_mmap_inform(b->T->vheap->base, b->T->vheap->size, preload, MMAP_WILLNEED, 0);
>> -        }
>> -        if (preload > 0 && pages > 0) {
>> -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->vheap\n", BATgetId(b), advise);
>> -            limit = (size_t *) (b->T->vheap->base + b->T->vheap->free - sizeof(size_t)) - 4 * step;
>> -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
>> -            i = (size_t *) (((size_t)b->T->vheap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
>> -            for (; i <= limit && pages > 3; i+= 4*step, pages-= 4) {
>> -                v1 += *i;
>> -                v2 += *(i + step);
>> -                v3 += *(i + 2*step);
>> -                v4 += *(i + 3*step);
>> -            }
>> -            limit += 4 * step;
>> -            for (; i <= limit && pages > 0; i+= step, pages--) {
>> -                v1 += *i;
>> -            }
>> +    if (touch && preload > 0) {
>> +        /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
>> +        size_t *lo = (size_t *) (((size_t) base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
>> +        size_t *hi = (size_t *) (base + sz);
>> +        for (hi -= 8*page; lo <= hi; lo += 8*page) {
>> +            /* try to trigger loading of multiple pages without blocking */
>> +            v0 += lo[0*page]; v1 += lo[1*page]; v2 += lo[2*page]; v3 += lo[3*page];
>> +            v4 += lo[4*page]; v5 += lo[5*page]; v6 += lo[6*page]; v7 += lo[7*page];
>>          }
>> +        for (hi += 7*page; lo <= hi; lo +=page) v0 += *lo;
>>      }
>> +    IODEBUG THRprintf(GDKout,"#BATpreload(%s->%s,preload=%d,sz=%dMB,%s) = %dms \n", id, hp, preload, (int) (sz>>20),
>> +        (advise==BUF_WILLNEED)?"WILLNEED":(advise==BUF_SEQUENTIAL)?"SEQUENTIAL":"UNKNOWN", GDKms()-t);
>> +    return v0+v1+v2+v3+v4+v5+v6+v7;
>> +}
>>
>> -    /* BUN heaps (no need to preload for sequential access) */
>> -    if ( what&USE_HEAD && b->H->heap.base ) {
>> -        if (b->H->heap.storage != STORE_MEM && b->H->heap.size > MT_MMAP_TILE) {
>> -            MT_mmap_inform(b->H->heap.base, b->H->heap.size, preload, advise, 0);
>> -        }
>> -        if (preload > 0 && pages > 0 && advise != MMAP_SEQUENTIAL) {
>> -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->heap\n", BATgetId(b), advise);
>> -            limit = (size_t *) (Hloc(b, BUNlast(b)) - sizeof(size_t)) - 4 * step;
>> -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
>> -            i = (size_t *) (((size_t)Hloc(b, BUNfirst(b)) + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
>> -            for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
>> -                v1 += *i;
>> -                v2 += *(i + step);
>> -                v3 += *(i + 2*step);
>> -                v4 += *(i + 3*step);
>> -            }
>> -            limit += 4 * step;
>> -            for (; i <= limit && pages > 0; i+= step, pages--) {
>> -                v1 += *i;
>> -            }
>> -        }
>> -    }
>> -    if ( what&USE_TAIL && b->T->heap.base ) {
>> -        if (b->T->heap.storage != STORE_MEM && b->T->heap.size > MT_MMAP_TILE) {
>> -            MT_mmap_inform(b->T->heap.base, b->T->heap.size, preload, advise, 0);
>> +size_t
>> +BATaccess(BAT *b, int what, int advise, int preload) {
>> +    ssize_t budget = (ssize_t) (0.8 * MT_npages());
>> +    size_t v = 0, sz;
>> +    str id = BATgetId(b);
>> +    BATiter bi = bat_iterator(b);
>> +
>> +    assert(advise==MMAP_NORMAL||advise==MMAP_RANDOM||advise==MMAP_SEQUENTIAL||advise==MMAP_WILLNEED||advise==MMAP_DONTNEED);
>> +    if (BATcount(b) == 0) return 0;
>> +
>> +    /* HASH indices (inherent random access). handle first as they *will* be access randomly (one can always hope for locality on the other heaps) */
>> +    if ( what&USE_HHASH || what&USE_THASH ) {
>> +        gdk_set_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
>> +        if ( what&USE_HHASH && b->H->hash && b->H->hash->heap && b->H->hash->heap->base) {
>> +            budget -= sz = (b->H->hash->heap->free > (size_t) budget)?budget:(ssize_t)b->T->hash->heap->free;
>> +            v += access_heap(id, "hhash", b->H->hash->heap, b->H->hash->heap->base, sz, 1, preload, MMAP_WILLNEED);
>>          }
>> -        if (preload > 0 && pages > 0 && advise != MMAP_SEQUENTIAL) {
>> -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->heap\n", BATgetId(b), advise);
>> -            limit = (size_t *) (Tloc(b, BUNlast(b)) - sizeof(size_t)) - 4 * step;
>> -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
>> -            i = (size_t *) (((size_t)Tloc(b, BUNfirst(b)) + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
>> -            for (; i <= limit && pages > 3; i+= 4*step, pages-= 4) {
>> -                v1 += *i;
>> -                v2 += *(i + step);
>> -                v3 += *(i + 2*step);
>> -                v4 += *(i + 3*step);
>> -            }
>> -            limit += 4 * step;
>> -            for (; i <= limit && pages > 0; i+= step, pages--) {
>> -                v1 += *i;
>> -            }
>> +        if ( what&USE_THASH && b->T->hash && b->T->hash->heap && b->T->hash->heap->base) {
>> +            budget -= sz = (b->T->hash->heap->free > (size_t) budget)?budget:(ssize_t)b->T->hash->heap->free;
>> +            v += access_heap(id, "thash", b->T->hash->heap, b->T->hash->heap->base, sz, 1, preload, MMAP_WILLNEED);
>>          }
>> +        gdk_unset_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
>>      }
>>
>> -    /* HASH indices (inherent random access) */
>> -    if ( what&USE_HHASH || what&USE_THASH )
>> -        gdk_set_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
>> -    if ( what&USE_HHASH && b->H->hash && b->H->hash->heap && b->H->hash->heap->base ) {
>> -        if (b->H->hash->heap->storage != STORE_MEM && b->H->hash->heap->size > MT_MMAP_TILE) {
>> -            MT_mmap_inform(b->H->hash->heap->base, b->H->hash->heap->size, preload, MMAP_WILLNEED, 0);
>> +    /* we only touch stuff that is going to be read randomly (WILLNEED). Note varheaps are sequential wrt to the references, or small */
>> +    if ( what&USE_HEAD) {
>> +        if (b->H->heap.base) {
>> +            char *lo = BUNhloc(bi, BUNfirst(b)), *hi = BUNhloc(bi, BUNlast(b)-1);
>> +            budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
>> +            v += access_heap(id, "hbuns", &b->H->heap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
>>          }
>> -        if (preload > 0 && pages > 0) {
>> -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->hash\n", BATgetId(b), advise);
>> -            limit = (size_t *) (b->H->hash->heap->base + b->H->hash->heap->size - sizeof(size_t)) - 4 * step;
>> -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
>> -            i = (size_t *) (((size_t)b->H->hash->heap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
>> -            for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
>> -                v1 += *i;
>> -                v2 += *(i + step);
>> -                v3 += *(i + 2*step);
>> -                v4 += *(i + 3*step);
>> -            }
>> -            limit += 4 * step;
>> -            for (; i <= limit && pages > 0; i+= step, pages--) {
>> -                v1 += *i;
>> -            }
>> +        if (b->H->vheap && b->H->vheap->base) {
>> +            char *lo = BUNhead(bi, BUNfirst(b)), *hi = BUNhead(bi, BUNlast(b)-1);
>> +            budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
>> +            v += access_heap(id, "hheap", b->H->vheap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
>>          }
>>      }
>> -    if ( what&USE_THASH && b->T->hash && b->T->hash->heap && b->T->hash->heap->base ) {
>> -        if (b->T->hash->heap->storage != STORE_MEM && b->T->hash->heap->size > MT_MMAP_TILE) {
>> -            MT_mmap_inform(b->T->hash->heap->base, b->T->hash->heap->size, preload, MMAP_WILLNEED, 0);
>> +    if ( what&USE_TAIL) {
>> +        if (b->T->heap.base) {
>> +            char *lo = BUNtloc(bi, BUNfirst(b)), *hi = BUNtloc(bi, BUNlast(b)-1);
>> +            budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
>> +            v += access_heap(id, "tbuns", &b->T->heap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
>>          }
>> -        if (preload > 0 && pages > 0) {
>> -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->hash\n", BATgetId(b), advise);
>> -            limit = (size_t *) (b->T->hash->heap->base + b->T->hash->heap->size - sizeof(size_t)) - 4 * step;
>> -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
>> -            i = (size_t *) (((size_t)b->T->hash->heap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
>> -            for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
>> -                v1 += *i;
>> -                v2 += *(i + step);
>> -                v3 += *(i + 2*step);
>> -                v4 += *(i + 3*step);
>> -            }
>> -            limit += 4 * step;
>> -            for (; i <= limit && pages > 0; i+= step, pages--) {
>> -                v1 += *i;
>> -            }
>> +        if (b->T->vheap && b->T->vheap->base) {
>> +            char *lo = BUNtail(bi, BUNfirst(b)), *hi = BUNtail(bi, BUNlast(b)-1);
>> +            budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
>> +            v += access_heap(id, "theap", b->T->vheap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
>>          }
>>      }
>> -    if ( what&USE_HHASH || what&USE_THASH )
>> -        gdk_unset_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
>> -
>> -    return v1 + v2 + v3 + v4;
>> +    return v;
>>  }
>>  @}
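One thing to remember about the posix_madvise() call in access_heap() above: the advice must be applied at page granularity, and a view's base pointer is generally not page-aligned (my question 3). The checked-in code computes alignedsz but then passes sz to posix_madvise(); rounding the length as well would look like this (a sketch, the helper name is mine):

    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* advise a possibly unaligned [base, base+sz) range by rounding
     * the base down and the length up to page boundaries */
    static int
    advise_range(char *base, size_t sz, int advice)
    {
        size_t page  = (size_t) sysconf(_SC_PAGESIZE);
        size_t shift = (size_t) base & (page - 1);
        void  *abase = base - shift;
        size_t asz   = (sz + shift + page - 1) & ~(page - 1);

        /* posix_madvise() returns the error number directly, not via errno */
        return posix_madvise(abase, asz, advice);
    }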
>>
>>
>> Index: gdk_posix.mx
>> ===================================================================
>> RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_posix.mx,v
>> retrieving revision 1.176.2.21
>> retrieving revision 1.176.2.22
>> diff -u -d -r1.176.2.21 -r1.176.2.22
>> --- gdk_posix.mx    18 Feb 2010 01:03:55 -0000    1.176.2.21
>> +++ gdk_posix.mx    18 Feb 2010 22:38:53 -0000    1.176.2.22
>> @@ -909,10 +909,8 @@
>>          unload = MT_mmap_tab[i].usecnt == 0;
>>      }
>>      (void) pthread_mutex_unlock(&MT_mmap_lock);
>> -    if (i >= 0 && preload > 0)
>> -        ret = posix_madvise(base, len, advise);
>> -    else if (unload)
>> -        ret = posix_madvise(base, len, MMAP_NORMAL);
>> +    if (unload)
>> +        ret = posix_madvise(base, len, BUF_SEQUENTIAL);
>>      if (ret) {
>>          stream_printf(GDKerr, "#MT_mmap_inform: posix_madvise(file=%s, fd=%d, base="PTRFMT", len="SZFMT"MB, advice=%d) = %d\n",
>>              (i >= 0 ? MT_mmap_tab[i].path : ""), (i >= 0 ? MT_mmap_tab[i].fd : -1),
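For my own notes, the core prefetch trick in access_heap() is the touch loop: read one word per page so the kernel faults the page in (the old BATaccess stepped one page at a time; the new code deliberately strides further apart to trigger several readahead windows in parallel). A standalone sketch under my own names, assuming base is size_t-aligned and sz a multiple of the page size, as access_heap arranges:

    #include <stddef.h>
    #include <unistd.h>

    static size_t
    touch_pages(const char *base, size_t sz)
    {
        size_t words = (size_t) sysconf(_SC_PAGESIZE) / sizeof(size_t);
        const size_t *lo = (const size_t *) base;
        const size_t *hi = (const size_t *) (base + sz);
        size_t v = 0;

        /* unrolled: fault in eight pages per iteration */
        for (; lo + 8 * words <= hi; lo += 8 * words)
            v += lo[0] + lo[1 * words] + lo[2 * words] + lo[3 * words]
               + lo[4 * words] + lo[5 * words] + lo[6 * words] + lo[7 * words];
        /* tail: one touch per remaining page */
        for (; lo < hi; lo += words)
            v += *lo;
        /* returning the sum keeps the compiler from optimising the loads away */
        return v;
    }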