This is great news. Thank you very much for sharing!
----- Original Message -----
> w.r.t. the slow hash index creation, I did find an OS setting that improved
> things dramatically - first hash building query is now ~3 minutes rather
> than 4.5 hrs.
>
> I found that my /proc/sys/vm/dirty_background_ratio setting was 0, and so the
> OS kicked in and started writing as soon as pages were dirty. This gets in
> the way of the page fault servicing. I set it to 20, which now means that
> the OS doesn't start flushing for a while and so mserver gets to max out CPU
> dirtying up pages during the thash creation. A little bit later, the flusher
> kicks in and mserver slows down to 0.1 CPU. Flushing completes, mserver
> picks up to 100%, etc... We make a lot quicker progress.
On our Fedora 18 machines, the default dirty_background_ratio appears to be 10.
I don't right now, we I hope we can collect all related experiences and knowledge
> I'm still confused by why my OS doesn't seem to want to keep all the pages of
> thash in memory. If I look in /proc/{pid}/numa_maps I can see that during
> the thash creation, for the thash mem mapped file the 'active' page count
> (which I believe is the memory resident portion) seems to not want to
> increase to more than ~0.25 of the total 'mapped' page count. My assumption
> is that if this proportion was closer to 1.0, then my memory access wouldn't
> be causing page faults in the first place. I've got lots of RAM spare, and
> am using interleaved numa settings, so I don't understand what's preventing
> active growing. Does anyone have an idea, or know how I can diagnose this?
in this thread and turn it into general advice.
Thanks, again, for your patience and willingness to experiment and share your
experiences!
Stefan
> -Will
>
>
>
>
> On 5 December 2013 19:57, Will Muldrew < will.muldrew@gmail.com > wrote:
>
>
>
>
> No index!
> On 5 Dec 2013 17:05, "Stefan Manegold" < Stefan.Manegold@cwi.nl > wrote:
>
>
> hm, that's with the index in place?
>
> If so:
> good for our optimizer,
> but unfortunately not so good for you ...
>
> If not:
> - good for you!
> - impressive for our "execution engine" (scan)
> (- "embarrassing" for our optimizer ... well, it's a minor detail ...)
>
> ;-)
>
> Stefan
>
> ----- Original Message -----
> > Sorry!
> >
> > sql>select count(*) from t_order2 where orderId between 'AAAAAA' and
> > 'AAAAAA';
> > +------+
> > | L1 |
> > +======+
> > | 0 |
> > +------+
> > 1 tuple (596.202ms)
> >
> >
> >
> > On 5 December 2013 16:49, Stefan Manegold < Stefan.Manegold@cwi.nl > wrote:
> >
> >
> > Hi Will,
> >
> > thank for reminding me: the optimizer still sees your query as point query
> > and thus prevents parallelization in order to build/use the hash index.
> >
> > Maybe you can "fool" the optimizer (and I hope noone else is reading what
> > I'm
> > writing now ... ;-)) by "disguising" the point predicate as range
> > predicate,
> > say, "where orderId between 'FOO' and 'FOO'" --- in fact, I rather hope
> > (for
> > us)
> > that is will not work ...
> >
> > Stefan
> >
> > ----- Original Message -----
> > > Hi Stefan
> > >
> > > I've added that hash = FALSE patch, and now my simple query returns
> > > reliably
> > > in 6 seconds. It's still a lot slower than the indexed one (obviously),
> > > and
> > > 12x slower than the one with the time > '1970-01-01' optimiser nudge
> > > (which
> > > causes parallelism), but it'll do for now.
> > >
> > > Thanks
> > >
> > > -Will
> > >
> > >
> > > On 5 December 2013 16:32, Will Muldrew < will.muldrew@gmail.com > wrote:
> > >
> > >
> > >
> > > Hi Stefan
> > >
> > > Here are my version details:
> > >
> > > MonetDB-11.15.17
> > > CentOS release 6.3
> > > kernel 3.6.2
> > >
> > > I think I might proceed with a patched version with the hash disabled.
> > > That's
> > > totally fine for me right now, though obviously it'd be nice to be on the
> > > trunk!
> > >
> > > W.r.t. a less random access hash generation. This is all very hand-wavey
> > > but
> > > how about: 1) generate a sequence of (hash,oid) pairs to some file or
> > > mmapped area. 2). use some well-known "external sort" algo to sort these
> > > pairs by hash (e.g. split, qsort, merge). 3). build your final hash by
> > > iterating through this sorted datastructure.
> > >
> > > Your memory writing would then be sequential and you'd get good page hit
> > > characteristics. You just need to choose a decent sort algo - ideally
> > > which
> > > works transparently over memmapped data.
> > >
> > > What do you think? I might even have a crack at it myself :)
> > >
> > > -Will
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On 5 December 2013 15:54, Stefan Manegold < Stefan.Manegold@cwi.nl >
> > > wrote:
> > >
> > >
> > > Hi Will,
> > >
> > > could you please also share with us which version of MonetDB you are
> > > using,
> > > and OS you are running?
> > >
> > > Pleass find more comments & questions inlined below.
> > >
> > > ----- Original Message -----
> > > > Hi
> > > >
> > > > I've been investigating very slow queries on my 350M row dataset when
> > > > selecting for equality on a non-unique string id. e.g.
> > > >
> > > > select count(*) from t_order where orderId = 'FOO';
> > >
> > > Is your orderId column unique, or are there duplicate values?
> > >
> > > > This query takes ~4.5 hours to build a thash index. Subsequent repeats
> > > > hit
> > > > the cache and are sub milli.
> > > >
> > > > select count(*) from t_order where time > '1970-01-01' and orderId =
> > > > 'FOO'
> > > >
> > > > This second one is consistently finished in a few seconds or less, and
> > > > doesn't create any indexes.
> > > >
> > > > The first query is a bit of a show-stopper. It spends a huge about of
> > > > time
> > > > in
> > > > BAThash (gdk_search.c), and incurs a lot of page faults while reading
> > > > and
> > > > writing at random across a multi GB mmaped thash file. Additionally
> > > > these
> > > > page faults are hampered by the OS trying to write dirty pages out in
> > > > the
> > > > background. I've got plenty of RAM overall (130GB), but the 'active'
> > > > page
> > > > proportion for the thash file seems to be stuck at about 25% - giving
> > > > me
> > > > a
> > > > 0.75 probability of a miss.
> > > >
> > > > * Is there some way to make my OS less keen to evict pages? (swappiness
> > > > =
> > > > 0
> > > > already) I should have plenty of room to have the whole thash file
> > > > resident.
> > > > Additionally, allowing it to sit on dirty pages longer would reduce the
> > > > total write IO.
> > >
> > > In fact, if there is sufficient memory in total, there would be no need
> > > to
> > > flush
> > > dirty (non-persistent) pages at all, but we have not yet found a good way
> > > to
> > > hit
> > > this to the OS --- madvice() did not have much/any useful efficet for us
> > > on
> > > Linux.
> > >
> > > Maybe some of the issues dicsussed here play a role:
> > > http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
> > > ?
> > >
> > > While we have not yet encountered the NUMA problem reported there,
> > > we have encountered the transparent huge page problems, and usually
> > > disable them as also suggested in the above posting.
> > >
> > > > * Is there some way to simply prevent large thash creation? For many
> > > > applications I'd rather have slower consistent queries, then incur a
> > > > massively slow query following a restart (not great in a prod
> > > > environment!)
> > >
> > > Not easily right now.
> > > However, you can either add the "fake" time predicate as you did above,
> > > which (in this case) triggers MonetDB's optimizer to push the time
> > > predicate
> > > (either because it's in timestamp (int) rather then varchar (string), or
> > > because the time column happens to be ordered) below the orderId
> > > predicate
> > > and this avoid the hash build,
> > > or change the following line(s) in gdk/gdk_select.c (in case you built
> > > MonetDB yourself from source):
> > >
> > > hash = b->batPersistence == PERSISTENT &&
> > > (size_t) ATOMsize(b->ttype) > sizeof(BUN) / 4 &&
> > > BATcount(b) * (ATOMsize(b->ttype) + 2 * sizeof(BUN)) < GDK_mem_maxsize /
> > > 2;
> > > ->
> > > hash = FALSE;
> > >
> > > We'll consider whether we can provide a better solution in a future
> > > release.
> > >
> > > > * Is there some way to generate the hash index with a more sympathetic
> > > > algo
> > > > that doesn't degrade so steeply. e.g some hash + sort. (I have
> > > > literally
> > > > no
> > > > experience with this!)
> > >
> > > We'll also look into this. However, building a hash index inherently
> > > incurs
> > > random access, so chances are slim ...
> > >
> > > Best,
> > > Stefan
> > >
> > > > -Will Muldrew
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > users-list mailing list
> > > > users-list@monetdb.org
> > > > https://www.monetdb.org/mailman/listinfo/users-list
> > > >
> > >
> > > --
> > > | Stefan.Manegold@CWI.nl | DB Architectures (DA) |
> > > | www.CWI.nl/~manegold/ | Science Park 123 (L321) |
> > > | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
> > >
> > > _______________________________________________
> > > users-list mailing list
> > > users-list@monetdb.org
> > > https://www.monetdb.org/mailman/listinfo/users-list
> > >
> > >
> > >
> > > _______________________________________________
> > > users-list mailing list
> > > users-list@monetdb.org
> > > https://www.monetdb.org/mailman/listinfo/users-list
> > >
> >
> > --
> > | Stefan.Manegold@CWI.nl | DB Architectures (DA) |
> > | www.CWI.nl/~manegold/ | Science Park 123 (L321) |
> > | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
> >
> > _______________________________________________
> > users-list mailing list
> > users-list@monetdb.org
> > https://www.monetdb.org/mailman/listinfo/users-list
> >
> >
> > _______________________________________________
> > users-list mailing list
> > users-list@monetdb.org
> > https://www.monetdb.org/mailman/listinfo/users-list
> >
>
> --
> | Stefan.Manegold@CWI.nl | DB Architectures (DA) |
> | www.CWI.nl/~manegold/ | Science Park 123 (L321) |
> | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
>
> _______________________________________________
> users-list mailing list
> users-list@monetdb.org
> https://www.monetdb.org/mailman/listinfo/users-list
>
>
> _______________________________________________
> users-list mailing list
> users-list@monetdb.org
> https://www.monetdb.org/mailman/listinfo/users-list
>
--
| Stefan.Manegold@CWI.nl | DB Architectures (DA) |
| www.CWI.nl/~manegold/ | Science Park 123 (L321) |
| +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
_______________________________________________
users-list mailing list
users-list@monetdb.org
https://www.monetdb.org/mailman/listinfo/users-list