Hi Stefan

[13:24:23] $ cat /proc/sys/vm/dirty_background_bytes
0

Think it was that way before (I know it's mutually exclusive with the ratio)

The new setting didn't appear to have any effect on the active ratio.

-Will



On 6 December 2013 12:08, Stefan Manegold <Stefan.Manegold@cwi.nl> wrote:
Will et al.,

a value "0" for /proc/sys/vm/dirty_background_ratio could also meant that
/proc/sys/vm/dirty_background_bytes was set to a non-0 value; cf.,
https://www.kernel.org/doc/Documentation/sysctl/vm.txt

Did you try whether setting /proc/sys/vm/dirty_background_ratio to a value
> 20 would result in /proc/{pid}/numa_maps growing beyond 25% ?

Stefan


----- Original Message -----
> No problem - monetdb is an exciting project!
>
>
> On 6 December 2013 11:37, Stefan Manegold < Stefan.Manegold@cwi.nl > wrote:
>
>
>
>
> ----- Original Message -----
> > w.r.t. the slow hash index creation, I did find an OS setting that improved
> > things dramatically - first hash building query is now ~3 minutes rather
> > than 4.5 hrs.
> >
> > I found that my /proc/sys/vm/dirty_background_ratio setting was 0, and so
> > the
> > OS kicked in and started writing as soon as pages were dirty. This gets in
> > the way of the page fault servicing. I set it to 20, which now means that
> > the OS doesn't start flushing for a while and so mserver gets to max out
> > CPU
> > dirtying up pages during the thash creation. A little bit later, the
> > flusher
> > kicks in and mserver slows down to 0.1 CPU. Flushing completes, mserver
> > picks up to 100%, etc... We make a lot quicker progress.
>
> This is great news. Thank you very much for sharing!
>
> On our Fedora 18 machines, the default dirty_background_ratio appears to be
> 10.
>
> > I'm still confused by why my OS doesn't seem to want to keep all the pages
> > of
> > thash in memory. If I look in /proc/{pid}/numa_maps I can see that during
> > the thash creation, for the thash mem mapped file the 'active' page count
> > (which I believe is the memory resident portion) seems to not want to
> > increase to more than ~0.25 of the total 'mapped' page count. My assumption
> > is that if this proportion was closer to 1.0, then my memory access
> > wouldn't
> > be causing page faults in the first place. I've got lots of RAM spare, and
> > am using interleaved numa settings, so I don't understand what's preventing
> > active growing. Does anyone have an idea, or know how I can diagnose this?
>
> I don't right now, we I hope we can collect all related experiences and
> knowledge
> in this thread and turn it into general advice.
>
> Thanks, again, for your patience and willingness to experiment and share your
> experiences!
>
> Stefan
>
> > -Will
> >
> >
> >
> >
> > On 5 December 2013 19:57, Will Muldrew < will.muldrew@gmail.com > wrote:
> >
> >
> >
> >
> > No index!
> > On 5 Dec 2013 17:05, "Stefan Manegold" < Stefan.Manegold@cwi.nl > wrote:
> >
> >
> > hm, that's with the index in place?
> >
> > If so:
> > good for our optimizer,
> > but unfortunately not so good for you ...
> >
> > If not:
> > - good for you!
> > - impressive for our "execution engine" (scan)
> > (- "embarrassing" for our optimizer ... well, it's a minor detail ...)
> >
> > ;-)
> >
> > Stefan
> >
> > ----- Original Message -----
> > > Sorry!
> > >
> > > sql>select count(*) from t_order2 where orderId between 'AAAAAA' and
> > > 'AAAAAA';
> > > +------+
> > > | L1 |
> > > +======+
> > > | 0 |
> > > +------+
> > > 1 tuple (596.202ms)
> > >
> > >
> > >
> > > On 5 December 2013 16:49, Stefan Manegold < Stefan.Manegold@cwi.nl >
> > > wrote:
> > >
> > >
> > > Hi Will,
> > >
> > > thank for reminding me: the optimizer still sees your query as point
> > > query
> > > and thus prevents parallelization in order to build/use the hash index.
> > >
> > > Maybe you can "fool" the optimizer (and I hope noone else is reading what
> > > I'm
> > > writing now ... ;-)) by "disguising" the point predicate as range
> > > predicate,
> > > say, "where orderId between 'FOO' and 'FOO'" --- in fact, I rather hope
> > > (for
> > > us)
> > > that is will not work ...
> > >
> > > Stefan
> > >
> > > ----- Original Message -----
> > > > Hi Stefan
> > > >
> > > > I've added that hash = FALSE patch, and now my simple query returns
> > > > reliably
> > > > in 6 seconds. It's still a lot slower than the indexed one (obviously),
> > > > and
> > > > 12x slower than the one with the time > '1970-01-01' optimiser nudge
> > > > (which
> > > > causes parallelism), but it'll do for now.
> > > >
> > > > Thanks
> > > >
> > > > -Will
> > > >
> > > >
> > > > On 5 December 2013 16:32, Will Muldrew < will.muldrew@gmail.com >
> > > > wrote:
> > > >
> > > >
> > > >
> > > > Hi Stefan
> > > >
> > > > Here are my version details:
> > > >
> > > > MonetDB-11.15.17
> > > > CentOS release 6.3
> > > > kernel 3.6.2
> > > >
> > > > I think I might proceed with a patched version with the hash disabled.
> > > > That's
> > > > totally fine for me right now, though obviously it'd be nice to be on
> > > > the
> > > > trunk!
> > > >
> > > > W.r.t. a less random access hash generation. This is all very
> > > > hand-wavey
> > > > but
> > > > how about: 1) generate a sequence of (hash,oid) pairs to some file or
> > > > mmapped area. 2). use some well-known "external sort" algo to sort
> > > > these
> > > > pairs by hash (e.g. split, qsort, merge). 3). build your final hash by
> > > > iterating through this sorted datastructure.
> > > >
> > > > Your memory writing would then be sequential and you'd get good page
> > > > hit
> > > > characteristics. You just need to choose a decent sort algo - ideally
> > > > which
> > > > works transparently over memmapped data.
> > > >
> > > > What do you think? I might even have a crack at it myself :)
> > > >
> > > > -Will
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On 5 December 2013 15:54, Stefan Manegold < Stefan.Manegold@cwi.nl >
> > > > wrote:
> > > >
> > > >
> > > > Hi Will,
> > > >
> > > > could you please also share with us which version of MonetDB you are
> > > > using,
> > > > and OS you are running?
> > > >
> > > > Pleass find more comments & questions inlined below.
> > > >
> > > > ----- Original Message -----
> > > > > Hi
> > > > >
> > > > > I've been investigating very slow queries on my 350M row dataset when
> > > > > selecting for equality on a non-unique string id. e.g.
> > > > >
> > > > > select count(*) from t_order where orderId = 'FOO';
> > > >
> > > > Is your orderId column unique, or are there duplicate values?
> > > >
> > > > > This query takes ~4.5 hours to build a thash index. Subsequent
> > > > > repeats
> > > > > hit
> > > > > the cache and are sub milli.
> > > > >
> > > > > select count(*) from t_order where time > '1970-01-01' and orderId =
> > > > > 'FOO'
> > > > >
> > > > > This second one is consistently finished in a few seconds or less,
> > > > > and
> > > > > doesn't create any indexes.
> > > > >
> > > > > The first query is a bit of a show-stopper. It spends a huge about of
> > > > > time
> > > > > in
> > > > > BAThash (gdk_search.c), and incurs a lot of page faults while reading
> > > > > and
> > > > > writing at random across a multi GB mmaped thash file. Additionally
> > > > > these
> > > > > page faults are hampered by the OS trying to write dirty pages out in
> > > > > the
> > > > > background. I've got plenty of RAM overall (130GB), but the 'active'
> > > > > page
> > > > > proportion for the thash file seems to be stuck at about 25% - giving
> > > > > me
> > > > > a
> > > > > 0.75 probability of a miss.
> > > > >
> > > > > * Is there some way to make my OS less keen to evict pages?
> > > > > (swappiness
> > > > > =
> > > > > 0
> > > > > already) I should have plenty of room to have the whole thash file
> > > > > resident.
> > > > > Additionally, allowing it to sit on dirty pages longer would reduce
> > > > > the
> > > > > total write IO.
> > > >
> > > > In fact, if there is sufficient memory in total, there would be no need
> > > > to
> > > > flush
> > > > dirty (non-persistent) pages at all, but we have not yet found a good
> > > > way
> > > > to
> > > > hit
> > > > this to the OS --- madvice() did not have much/any useful efficet for
> > > > us
> > > > on
> > > > Linux.
> > > >
> > > > Maybe some of the issues dicsussed here play a role:
> > > > http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
> > > > ?
> > > >
> > > > While we have not yet encountered the NUMA problem reported there,
> > > > we have encountered the transparent huge page problems, and usually
> > > > disable them as also suggested in the above posting.
> > > >
> > > > > * Is there some way to simply prevent large thash creation? For many
> > > > > applications I'd rather have slower consistent queries, then incur a
> > > > > massively slow query following a restart (not great in a prod
> > > > > environment!)
> > > >
> > > > Not easily right now.
> > > > However, you can either add the "fake" time predicate as you did above,
> > > > which (in this case) triggers MonetDB's optimizer to push the time
> > > > predicate
> > > > (either because it's in timestamp (int) rather then varchar (string),
> > > > or
> > > > because the time column happens to be ordered) below the orderId
> > > > predicate
> > > > and this avoid the hash build,
> > > > or change the following line(s) in gdk/gdk_select.c (in case you built
> > > > MonetDB yourself from source):
> > > >
> > > > hash = b->batPersistence == PERSISTENT &&
> > > > (size_t) ATOMsize(b->ttype) > sizeof(BUN) / 4 &&
> > > > BATcount(b) * (ATOMsize(b->ttype) + 2 * sizeof(BUN)) < GDK_mem_maxsize
> > > > /
> > > > 2;
> > > > ->
> > > > hash = FALSE;
> > > >
> > > > We'll consider whether we can provide a better solution in a future
> > > > release.
> > > >
> > > > > * Is there some way to generate the hash index with a more
> > > > > sympathetic
> > > > > algo
> > > > > that doesn't degrade so steeply. e.g some hash + sort. (I have
> > > > > literally
> > > > > no
> > > > > experience with this!)
> > > >
> > > > We'll also look into this. However, building a hash index inherently
> > > > incurs
> > > > random access, so chances are slim ...
> > > >
> > > > Best,
> > > > Stefan
> > > >
> > > > > -Will Muldrew
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > users-list mailing list
> > > > > users-list@monetdb.org
> > > > > https://www.monetdb.org/mailman/listinfo/users-list
> > > > >
> > > >
> > > > --
> > > > | Stefan.Manegold@CWI.nl | DB Architectures (DA) |
> > > > | www.CWI.nl/~manegold/ | Science Park 123 (L321) |
> > > > | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
> > > >
> > > > _______________________________________________
> > > > users-list mailing list
> > > > users-list@monetdb.org
> > > > https://www.monetdb.org/mailman/listinfo/users-list
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > users-list mailing list
> > > > users-list@monetdb.org
> > > > https://www.monetdb.org/mailman/listinfo/users-list
> > > >
> > >
> > > --
> > > | Stefan.Manegold@CWI.nl | DB Architectures (DA) |
> > > | www.CWI.nl/~manegold/ | Science Park 123 (L321) |
> > > | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
> > >
> > > _______________________________________________
> > > users-list mailing list
> > > users-list@monetdb.org
> > > https://www.monetdb.org/mailman/listinfo/users-list
> > >
> > >
> > > _______________________________________________
> > > users-list mailing list
> > > users-list@monetdb.org
> > > https://www.monetdb.org/mailman/listinfo/users-list
> > >
> >
> > --
> > | Stefan.Manegold@CWI.nl | DB Architectures (DA) |
> > | www.CWI.nl/~manegold/ | Science Park 123 (L321) |
> > | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
> >
> > _______________________________________________
> > users-list mailing list
> > users-list@monetdb.org
> > https://www.monetdb.org/mailman/listinfo/users-list
> >
> >
> > _______________________________________________
> > users-list mailing list
> > users-list@monetdb.org
> > https://www.monetdb.org/mailman/listinfo/users-list
> >
>
> --
> | Stefan.Manegold@CWI.nl | DB Architectures (DA) |
> | www.CWI.nl/~manegold/ | Science Park 123 (L321) |
> | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
>
> _______________________________________________
> users-list mailing list
> users-list@monetdb.org
> https://www.monetdb.org/mailman/listinfo/users-list
>
>
> _______________________________________________
> users-list mailing list
> users-list@monetdb.org
> https://www.monetdb.org/mailman/listinfo/users-list
>

--
| Stefan.Manegold@CWI.nl | DB Architectures   (DA) |
| www.CWI.nl/~manegold/  | Science Park 123 (L321) |
| +31 (0)20 592-4212     | 1098 XG Amsterdam  (NL) |

_______________________________________________
users-list mailing list
users-list@monetdb.org
https://www.monetdb.org/mailman/listinfo/users-list