Hashjoin performance with large vs small tables

newer
Announcement: New Oct2014-SP4...

Roberto Cornacchia

11 May 2015 11 May '15

4:30 p.m.

Hi there, Triggered by a suspiciously slow join, I had a look at BATsubjoin, in particular at how it is decided which algorithm to use. I looked at a couple of cases with a join on strings. This is one (pseudo syntax): l.count ~16M l.T.sorted = 0 l.T.revsorted = 0 r.count = 1 r.T.sorted = 1 r.T.revsorted = 1 In this case, a hash table on r is created (because it's smaller). This join takes *430ms*. I forced swapping l and r, thus built the hash table on the larger bat, and then it takes *0.8ms*. Notice that though r.count = 1 is a bit of an extreme case, it does happen often in practice. And the case with a few tuples would not be very different. The question I wanted to ask is: is it always right to build the hash table on the small table? Perhaps some more heuristics can help? Like: if the smaller is very small, and the larger is not extremely large, then hash on the larger one? Roberto

Attachments:

attachment.html (text/html — 1.3 KB)

Show replies by date

Roberto Cornacchia

11 May 11 May

4:36 p.m.

Correction: This join takes *430ms*.

...

I forced swapping l and r, thus built the hash table on the larger bat, and then it takes *0.8ms*.

It takes 0.8ms the second time. The first time, it needs to create the hash table, and then it takes about 30ms. Still, much better than 430ms. Also, those 430ms are not invested. The second time will still take 430ms. So hashing on a very small bat is never a good investment. On the contrary, hashing on a larger (but not too much) table is a good investment. The next time a similar query comes in, it will be sub-millisecond.

Stefan Manegold

4:45 p.m.

----- On May 11, 2015, at 6:36 PM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:

...

Correction:

This join takes 430ms . I forced swapping l and r, thus built the hash table on the larger bat, and then it takes 0.8ms .

It takes 0.8ms the second time. The first time, it needs to create the hash table, and then it takes about 30ms. Still, much better than 430ms.

Ok, but indeed still the question where does this difference come from?

...

Also, those 430ms are not invested. The second time will still take 430ms. So hashing on a very small bat is never a good investment. On the contrary, hashing on a larger (but not too much) table is a good investment. The next time a similar query comes in, it will be sub-millisecond.

Well, this is a trade-off that in in general hard to judge. If the bigger table / BAT is a base table/BAT, the hash table will (nowadays) be made persistent and *could* be reused --- whether it indeed will be reused, we cannot predict. If the bigger table is a transient intermediate result, re-use is unlikely ... Having said that, is your smaller table a base table or an intermediate result that is (might be) a tiny slice of a large (huge) base table? Then current code might build the hash on the entire parent BAT rather than on the tiny slice ... Also: Which version of MonetDB are we talking about? Stefan -- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |

Roberto Cornacchia

4:49 p.m.

...

...
Also, those 430ms are not invested. The second time will still take 430ms. So hashing on a very small bat is never a good investment. On the contrary, hashing on a larger (but not too much) table is a good investment. The next time a similar query comes in, it will be sub-millisecond.

Well, this is a trade-off that in in general hard to judge. If the bigger table / BAT is a base table/BAT, the hash table will (nowadays) be made persistent and *could* be reused --- whether it indeed will be reused, we cannot predict. If the bigger table is a transient intermediate result, re-use is unlikely ...

That's fair.

...

Having said that, is your smaller table a base table or an intermediate result that is (might be) a tiny slice of a large (huge) base table? Then current code might build the hash on the entire parent BAT rather than on the tiny slice ...

They both are base tables. The tiny table is created and a single insert is done. The large one is also a regular table, with NOT NULL constraint on the join column and the entire table is marked read-only.

...

Also: Which version of MonetDB are we talking about?

Oct2014 SP3

...

Stefan

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

Stefan Manegold

8:58 p.m.

Roberto, just to recap all facts: - MonetDB Oct2014-SP3 - equality join between 2 string-BATs - both BATs are persistent "base" BATs - larger BAT has 16 M BUNs - smaller BAT has 1 BUN - when (forcefully) building the hash on the larger one, and then performing a single probe from the smaller one, the first ("cold") join takes 30 ms (building the hash table), while any next ("hot") one takes 0.8 ms (re-using the pre-built hash table). This suggests ~29.2 ms for building the 16 M hash table and 0.8 ms for a single probe into that hash table. - when building the 1 BUN hash table on the smaller one, and then performing 16 M probes from the larger one, the (first?) ("cold"?) join takes 430 ms? + How long does a subsequent ("hot") join take (re-using the pre-built hash table)? Could you run detailed profiling (e.g., using valgrind/callgrind) to analyze where the time goes in all 4 cases (hash on larger vs. hash on smaller & "cold" vs. "hot")? Could you share your data to reproduce and analyze the problem? Thanks! Stefan ----- On May 11, 2015, at 6:49 PM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:

...

...
Also, those 430ms are not invested. The second time will still take 430ms. So hashing on a very small bat is never a good investment. On the contrary, hashing on a larger (but not too much) table is a good investment. The next time a similar query comes in, it will be sub-millisecond.

Well, this is a trade-off that in in general hard to judge. If the bigger table / BAT is a base table/BAT, the hash table will (nowadays) be made persistent and *could* be reused --- whether it indeed will be reused, we cannot predict. If the bigger table is a transient intermediate result, re-use is unlikely ...

That's fair.

Having said that, is your smaller table a base table or an intermediate result that is (might be) a tiny slice of a large (huge) base table? Then current code might build the hash on the entire parent BAT rather than on the tiny slice ...

They both are base tables. The tiny table is created and a single insert is done. The large one is also a regular table, with NOT NULL constraint on the join column and the entire table is marked read-only.

Also: Which version of MonetDB are we talking about?

Oct2014 SP3

Stefan

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

Roberto Cornacchia

12 May 12 May

8:58 a.m.

On 11 May 2015 at 22:58, Stefan Manegold wrote:

...

Roberto,

just to recap all facts:

- MonetDB Oct2014-SP3

- equality join between 2 string-BATs

- both BATs are persistent "base" BATs

Correct

...

- larger BAT has 16 M BUNs

Apologies, I had misread the count here. It is 1.6M But this doesn't change things. - smaller BAT has 1 BUN

...

- when (forcefully) building the hash on the larger one, and then performing a single probe from the smaller one, the first ("cold") join takes 30 ms (building the hash table), while any next ("hot") one takes 0.8 ms (re-using the pre-built hash table). This suggests ~29.2 ms for building the 16 M hash table and 0.8 ms for a single probe into that hash table.

Correct (except 16M -> 1.6M)

...

- when building the 1 BUN hash table on the smaller one, and then performing 16 M probes from the larger one, the (first?) ("cold"?) join takes 430 ms?

Correct (except 16M -> 1.6M)

...

+ How long does a subsequent ("hot") join take (re-using the pre-built hash table)?

Exactly the same, as expected. The pre-built hash table on the 1-tuple bat can hardly be useful

...

Could you run detailed profiling (e.g., using valgrind/callgrind) to analyze where the time goes in all 4 cases (hash on larger vs. hash on smaller & "cold" vs. "hot")?

Could you share your data to reproduce and analyze the problem?

...

I'm sending data and profiling by email. Thank you. Thanks!

...

Stefan

----- On May 11, 2015, at 6:49 PM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:

...
...
Also, those 430ms are not invested. The second time will still take 430ms. So hashing on a very small bat is never a good investment. On the contrary, hashing on a larger (but not too much) table is a good investment. The next time a similar query comes in, it will be sub-millisecond.

Well, this is a trade-off that in in general hard to judge. If the bigger table / BAT is a base table/BAT, the hash table will (nowadays) be made persistent and *could* be reused --- whether it indeed will be reused, we cannot predict. If the bigger table is a transient intermediate result, re-use is unlikely ...

That's fair.

Having said that, is your smaller table a base table or an intermediate result that is (might be) a tiny slice of a large (huge) base table? Then current code might build the hash on the entire parent BAT rather than on the tiny slice ...

They both are base tables. The tiny table is created and a single insert is done. The large one is also a regular table, with NOT NULL constraint on the join column and the entire table is marked read-only.

Also: Which version of MonetDB are we talking about?

Oct2014 SP3

Stefan

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

Roberto Cornacchia

9 a.m.

One more detail: I ran all tests with sequential_pipe On 12 May 2015 at 10:58, Roberto Cornacchia wrote:

...

On 11 May 2015 at 22:58, Stefan Manegold wrote:

...
Roberto,

just to recap all facts:

- MonetDB Oct2014-SP3

- equality join between 2 string-BATs

- both BATs are persistent "base" BATs

Correct

...
- larger BAT has 16 M BUNs

Apologies, I had misread the count here. It is 1.6M But this doesn't change things.

- smaller BAT has 1 BUN

...
- when (forcefully) building the hash on the larger one, and then performing a single probe from the smaller one, the first ("cold") join takes 30 ms (building the hash table), while any next ("hot") one takes 0.8 ms (re-using the pre-built hash table). This suggests ~29.2 ms for building the 16 M hash table and 0.8 ms for a single probe into that hash table.

Correct (except 16M -> 1.6M)

...
- when building the 1 BUN hash table on the smaller one, and then performing 16 M probes from the larger one, the (first?) ("cold"?) join takes 430 ms?

Correct (except 16M -> 1.6M)

...
+ How long does a subsequent ("hot") join take (re-using the pre-built hash table)?

Exactly the same, as expected. The pre-built hash table on the 1-tuple bat can hardly be useful

...
Could you run detailed profiling (e.g., using valgrind/callgrind) to analyze where the time goes in all 4 cases (hash on larger vs. hash on smaller & "cold" vs. "hot")?

Could you share your data to reproduce and analyze the problem?

...
I'm sending data and profiling by email. Thank you.

Thanks!

...
Stefan

----- On May 11, 2015, at 6:49 PM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:

...
...
Also, those 430ms are not invested. The second time will still take 430ms. So hashing on a very small bat is never a good investment. On the contrary, hashing on a larger (but not too much) table is a good investment. The next time a similar query comes in, it will be sub-millisecond.

Well, this is a trade-off that in in general hard to judge. If the bigger table / BAT is a base table/BAT, the hash table will (nowadays) be made persistent and *could* be reused --- whether it indeed will be reused, we cannot predict. If the bigger table is a transient intermediate result, re-use is unlikely ...

That's fair.

Having said that, is your smaller table a base table or an intermediate result that is (might be) a tiny slice of a large (huge) base table? Then current code might build the hash on the entire parent BAT rather than on the tiny slice ...

They both are base tables. The tiny table is created and a single insert is done. The large one is also a regular table, with NOT NULL constraint on the join column and the entire table is marked read-only.

Also: Which version of MonetDB are we talking about?

Oct2014 SP3

Stefan

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

Stefan Manegold

10:30 a.m.

Roberto, do you run a debug or an optimized build? if the former, try the latter or at least disable assertions. With a debug build (i.e., with assertions enabled) you might meassure the costs for (potentially expensive) assertions, rather than for pure execution ... Stefan ----- On May 12, 2015, at 11:00 AM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:

...

One more detail: I ran all tests with sequential_pipe

On 12 May 2015 at 10:58, Roberto Cornacchia < roberto.cornacchia@gmail.com > wrote:

On 11 May 2015 at 22:58, Stefan Manegold < Stefan.Manegold@cwi.nl > wrote:

Roberto,

just to recap all facts:

- MonetDB Oct2014-SP3

- equality join between 2 string-BATs

- both BATs are persistent "base" BATs

Correct

- larger BAT has 16 M BUNs

Apologies, I had misread the count here. It is 1.6M But this doesn't change things.

- smaller BAT has 1 BUN

- when (forcefully) building the hash on the larger one, and then performing a single probe from the smaller one, the first ("cold") join takes 30 ms (building the hash table), while any next ("hot") one takes 0.8 ms (re-using the pre-built hash table). This suggests ~29.2 ms for building the 16 M hash table and 0.8 ms for a single probe into that hash table.

Correct (except 16M -> 1.6M)

- when building the 1 BUN hash table on the smaller one, and then performing 16 M probes from the larger one, the (first?) ("cold"?) join takes 430 ms?

Correct (except 16M -> 1.6M)

+ How long does a subsequent ("hot") join take (re-using the pre-built hash table)?

Exactly the same, as expected. The pre-built hash table on the 1-tuple bat can hardly be useful

Could you run detailed profiling (e.g., using valgrind/callgrind) to analyze where the time goes in all 4 cases (hash on larger vs. hash on smaller & "cold" vs. "hot")?

Could you share your data to reproduce and analyze the problem?

I'm sending data and profiling by email. Thank you.

Thanks!

Stefan

----- On May 11, 2015, at 6:49 PM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:

...
...
Also, those 430ms are not invested. The second time will still take 430ms. So hashing on a very small bat is never a good investment. On the contrary, hashing on a larger (but not too much) table is a good investment. The next time a similar query comes in, it will be sub-millisecond.

Well, this is a trade-off that in in general hard to judge. If the bigger table / BAT is a base table/BAT, the hash table will (nowadays) be made persistent and *could* be reused --- whether it indeed will be reused, we cannot predict. If the bigger table is a transient intermediate result, re-use is unlikely ...

That's fair.

Having said that, is your smaller table a base table or an intermediate result that is (might be) a tiny slice of a large (huge) base table? Then current code might build the hash on the entire parent BAT rather than on the tiny slice ...

They both are base tables. The tiny table is created and a single insert is done. The large one is also a regular table, with NOT NULL constraint on the join column and the entire table is marked read-only.

Also: Which version of MonetDB are we talking about?

Oct2014 SP3

Stefan

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

Roberto Cornacchia

10:45 a.m.

Yes, debug build :/ However, callgrind shows that on the "bad" run (1.6M probes against 1 tuple), the majority of the time is spent on strHash, which contains no assertion. For now, in attachment the timings I got with the debug build. Roberto On 12 May 2015 at 12:30, Stefan Manegold wrote:

...

Roberto,

do you run a debug or an optimized build?

if the former, try the latter or at least disable assertions.

With a debug build (i.e., with assertions enabled) you might meassure the costs for (potentially expensive) assertions, rather than for pure execution ...

Stefan

----- On May 12, 2015, at 11:00 AM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:

...
One more detail: I ran all tests with sequential_pipe

On 12 May 2015 at 10:58, Roberto Cornacchia < roberto.cornacchia@gmail.com > wrote:

On 11 May 2015 at 22:58, Stefan Manegold < Stefan.Manegold@cwi.nl > wrote:

Roberto,

just to recap all facts:

- MonetDB Oct2014-SP3

- equality join between 2 string-BATs

- both BATs are persistent "base" BATs

Correct

- larger BAT has 16 M BUNs

Apologies, I had misread the count here. It is 1.6M But this doesn't change things.

- smaller BAT has 1 BUN

- when (forcefully) building the hash on the larger one, and then performing a single probe from the smaller one, the first ("cold") join takes 30 ms (building the hash table), while any next ("hot") one takes 0.8 ms (re-using the pre-built hash table). This suggests ~29.2 ms for building the 16 M hash table and 0.8 ms for a single probe into that hash table.

Correct (except 16M -> 1.6M)

- when building the 1 BUN hash table on the smaller one, and then performing 16 M probes from the larger one, the (first?) ("cold"?) join takes 430 ms?

Correct (except 16M -> 1.6M)

+ How long does a subsequent ("hot") join take (re-using the pre-built hash table)?

Exactly the same, as expected. The pre-built hash table on the 1-tuple bat can hardly be useful

Could you run detailed profiling (e.g., using valgrind/callgrind) to analyze where the time goes in all 4 cases (hash on larger vs. hash on smaller & "cold" vs. "hot")?

Could you share your data to reproduce and analyze the problem?

I'm sending data and profiling by email. Thank you.

Thanks!

Stefan

----- On May 11, 2015, at 6:49 PM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:

...
...
Also, those 430ms are not invested. The second time will still take 430ms. So hashing on a very small bat is never a good investment. On the contrary, hashing on a larger (but not too much) table is a good investment. The next time a similar query comes in, it will be sub-millisecond.

Well, this is a trade-off that in in general hard to judge. If the bigger table / BAT is a base table/BAT, the hash table will (nowadays) be made persistent and *could* be reused --- whether it indeed will be reused, we cannot predict. If the bigger table is a transient intermediate result, re-use is unlikely ...

That's fair.

Having said that, is your smaller table a base table or an intermediate result that is (might be) a tiny slice of a large (huge) base table? Then current code might build the hash on the entire parent BAT rather than on the tiny slice ...

They both are base tables. The tiny table is created and a single insert is done. The large one is also a regular table, with NOT NULL constraint on the join column and the entire table is marked read-only.

Also: Which version of MonetDB are we talking about?

Oct2014 SP3

Stefan

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

Stefan Manegold

10:55 a.m.

Thanks! A first *guess*: With the hash table on the smaller and the 1.6 M probes from the larger, the hash is calculated for each of the 1.6 M tuples, "although" there are only 16 distinct values; apparently, the hash values are not stored ("cached") in the string heap, or at least not re-used. With the hash table on the larger and a single probe from the smaller, the string hash appears to be calculated only once per each of the 16 distinct values, (or is already pre-calculated from the string inserts?) and then is re-used during hash table build. Stefan as a side note, the "trickling in" information about detailed data and workload characteristics slowly seem to clearify / sharpen the picture ... ----- On May 12, 2015, at 12:45 PM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:

...

Yes, debug build :/

However, callgrind shows that on the "bad" run (1.6M probes against 1 tuple), the majority of the time is spent on strHash, which contains no assertion.

For now, in attachment the timings I got with the debug build.

Roberto

On 12 May 2015 at 12:30, Stefan Manegold < Stefan.Manegold@cwi.nl > wrote:

Roberto,

do you run a debug or an optimized build?

if the former, try the latter or at least disable assertions.

With a debug build (i.e., with assertions enabled) you might meassure the costs for (potentially expensive) assertions, rather than for pure execution ...

Stefan

----- On May 12, 2015, at 11:00 AM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:

...
One more detail: I ran all tests with sequential_pipe

On 12 May 2015 at 10:58, Roberto Cornacchia < roberto.cornacchia@gmail.com > wrote:

On 11 May 2015 at 22:58, Stefan Manegold < Stefan.Manegold@cwi.nl > wrote:

Roberto,

just to recap all facts:

- MonetDB Oct2014-SP3

- equality join between 2 string-BATs

- both BATs are persistent "base" BATs

Correct

- larger BAT has 16 M BUNs

Apologies, I had misread the count here. It is 1.6M But this doesn't change things.

- smaller BAT has 1 BUN

- when (forcefully) building the hash on the larger one, and then performing a single probe from the smaller one, the first ("cold") join takes 30 ms (building the hash table), while any next ("hot") one takes 0.8 ms (re-using the pre-built hash table). This suggests ~29.2 ms for building the 16 M hash table and 0.8 ms for a single probe into that hash table.

Correct (except 16M -> 1.6M)

- when building the 1 BUN hash table on the smaller one, and then performing 16 M probes from the larger one, the (first?) ("cold"?) join takes 430 ms?

Correct (except 16M -> 1.6M)

+ How long does a subsequent ("hot") join take (re-using the pre-built hash table)?

Exactly the same, as expected. The pre-built hash table on the 1-tuple bat can hardly be useful

Could you run detailed profiling (e.g., using valgrind/callgrind) to analyze where the time goes in all 4 cases (hash on larger vs. hash on smaller & "cold" vs. "hot")?

Could you share your data to reproduce and analyze the problem?

I'm sending data and profiling by email. Thank you.

Thanks!

Stefan

----- On May 11, 2015, at 6:49 PM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:

...
...
Also, those 430ms are not invested. The second time will still take 430ms. So hashing on a very small bat is never a good investment. On the contrary, hashing on a larger (but not too much) table is a good investment. The next time a similar query comes in, it will be sub-millisecond.

Well, this is a trade-off that in in general hard to judge. If the bigger table / BAT is a base table/BAT, the hash table will (nowadays) be made persistent and *could* be reused --- whether it indeed will be reused, we cannot predict. If the bigger table is a transient intermediate result, re-use is unlikely ...

That's fair.

Having said that, is your smaller table a base table or an intermediate result that is (might be) a tiny slice of a large (huge) base table? Then current code might build the hash on the entire parent BAT rather than on the tiny slice ...

They both are base tables. The tiny table is created and a single insert is done. The large one is also a regular table, with NOT NULL constraint on the join column and the entire table is marked read-only.

Also: Which version of MonetDB are we talking about?

Oct2014 SP3

Stefan

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

Roberto Cornacchia

11:02 a.m.

Yes, that makes sense. I also got suspicious about the effect of very few distinct values. That's why I included a second case, with 1.2M tuples on the large table, which are *all distinct*. This case, though to a lesser extent than the first case (12x speedup), benefits as well from hashing on the large table (4x speedup). On 12 May 2015 at 12:55, Stefan Manegold wrote:

...

Thanks!

A first *guess*:

With the hash table on the smaller and the 1.6 M probes from the larger, the hash is calculated for each of the 1.6 M tuples, "although" there are only 16 distinct values; apparently, the hash values are not stored ("cached") in the string heap, or at least not re-used.

With the hash table on the larger and a single probe from the smaller, the string hash appears to be calculated only once per each of the 16 distinct values, (or is already pre-calculated from the string inserts?) and then is re-used during hash table build.

Stefan

as a side note, the "trickling in" information about detailed data and workload characteristics slowly seem to clearify / sharpen the picture ...

----- On May 12, 2015, at 12:45 PM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:

...
Yes, debug build :/

However, callgrind shows that on the "bad" run (1.6M probes against 1 tuple), the majority of the time is spent on strHash, which contains no assertion.

For now, in attachment the timings I got with the debug build.

Roberto

On 12 May 2015 at 12:30, Stefan Manegold < Stefan.Manegold@cwi.nl > wrote:

Roberto,

do you run a debug or an optimized build?

if the former, try the latter or at least disable assertions.

With a debug build (i.e., with assertions enabled) you might meassure the costs for (potentially expensive) assertions, rather than for pure execution ...

Stefan

----- On May 12, 2015, at 11:00 AM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:

...
One more detail: I ran all tests with sequential_pipe

On 12 May 2015 at 10:58, Roberto Cornacchia < roberto.cornacchia@gmail.com > wrote:

On 11 May 2015 at 22:58, Stefan Manegold < Stefan.Manegold@cwi.nl > wrote:

Roberto,

just to recap all facts:

- MonetDB Oct2014-SP3

- equality join between 2 string-BATs

- both BATs are persistent "base" BATs

Correct

- larger BAT has 16 M BUNs

Apologies, I had misread the count here. It is 1.6M But this doesn't change things.

- smaller BAT has 1 BUN

- when (forcefully) building the hash on the larger one, and then performing a single probe from the smaller one, the first ("cold") join takes 30 ms (building the hash table), while any next ("hot") one takes 0.8 ms (re-using the pre-built hash table). This suggests ~29.2 ms for building the 16 M hash table and 0.8 ms for a single probe into that hash table.

Correct (except 16M -> 1.6M)

- when building the 1 BUN hash table on the smaller one, and then performing 16 M probes from the larger one, the (first?) ("cold"?) join takes 430 ms?

Correct (except 16M -> 1.6M)

+ How long does a subsequent ("hot") join take (re-using the pre-built hash table)?

Exactly the same, as expected. The pre-built hash table on the 1-tuple bat can hardly be useful

Could you run detailed profiling (e.g., using valgrind/callgrind) to analyze where the time goes in all 4 cases (hash on larger vs. hash on smaller & "cold" vs. "hot")?

Could you share your data to reproduce and analyze the problem?

I'm sending data and profiling by email. Thank you.

Thanks!

Stefan

----- On May 11, 2015, at 6:49 PM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:

...
...
Also, those 430ms are not invested. The second time will still take 430ms. So hashing on a very small bat is never a good investment. On the contrary, hashing on a larger (but not too much) table is a good investment. The next time a similar query comes in, it will be sub-millisecond.

Well, this is a trade-off that in in general hard to judge. If the bigger table / BAT is a base table/BAT, the hash table will (nowadays) be made persistent and *could* be reused --- whether it indeed will be reused, we cannot predict. If the bigger table is a transient intermediate result, re-use is unlikely ...

That's fair.

Having said that, is your smaller table a base table or an intermediate result that is (might be) a tiny slice of a large (huge) base table? Then current code might build the hash on the entire parent BAT rather than on the tiny slice ...

They both are base tables. The tiny table is created and a single insert is done. The large one is also a regular table, with NOT NULL constraint on the join column and the entire table is marked read-only.

Also: Which version of MonetDB are we talking about?

Oct2014 SP3

Stefan

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

Stefan Manegold

11 May 11 May

4:40 p.m.

Hi Roberto, thanks for reporting! This need detailed profiling to understand what's happening. Why are 16M look-ups to a 1-tuple hash table 500x more expensive than building a 16M hash table (assuming that is it indeed built on the fly, i.e., does not already exist), and then performing a single look-up ... ? Stefan ----- On May 11, 2015, at 6:30 PM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:

...

Hi there,

Triggered by a suspiciously slow join, I had a look at BATsubjoin, in particular at how it is decided which algorithm to use.

I looked at a couple of cases with a join on strings. This is one (pseudo syntax):

l.count ~16M l.T.sorted = 0 l.T.revsorted = 0

r.count = 1 r.T.sorted = 1 r.T.revsorted = 1

In this case, a hash table on r is created (because it's smaller). This join takes 430ms . I forced swapping l and r, thus built the hash table on the larger bat, and then it takes 0.8ms .

Notice that though r.count = 1 is a bit of an extreme case, it does happen often in practice. And the case with a few tuples would not be very different.

The question I wanted to ask is:

is it always right to build the hash table on the small table? Perhaps some more heuristics can help? Like: if the smaller is very small, and the larger is not extremely large, then hash on the larger one?

Roberto

_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

Roberto Cornacchia

4:46 p.m.

Hi Stefan, Actually, according to my second email, 430 should be compared to 30, not to 0.8. So the difference in speed is "only" 15x. Still, your question remains. However, even if the two performed equally, the current case would not invest anything.. On 11 May 2015 at 18:40, Stefan Manegold wrote:

...

Hi Roberto,

thanks for reporting!

This need detailed profiling to understand what's happening. Why are 16M look-ups to a 1-tuple hash table 500x more expensive than building a 16M hash table (assuming that is it indeed built on the fly, i.e., does not already exist), and then performing a single look-up ... ?

Stefan

----- On May 11, 2015, at 6:30 PM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:

...
Hi there,

Triggered by a suspiciously slow join, I had a look at BATsubjoin, in particular at how it is decided which algorithm to use.

I looked at a couple of cases with a join on strings. This is one (pseudo syntax):

l.count ~16M l.T.sorted = 0 l.T.revsorted = 0

r.count = 1 r.T.sorted = 1 r.T.revsorted = 1

In this case, a hash table on r is created (because it's smaller). This join takes 430ms . I forced swapping l and r, thus built the hash table on the larger bat, and then it takes 0.8ms .

Notice that though r.count = 1 is a bit of an extreme case, it does happen often in practice. And the case with a few tuples would not be very different.

The question I wanted to ask is:

is it always right to build the hash table on the small table? Perhaps some more heuristics can help? Like: if the smaller is very small, and the larger is not extremely large, then hash on the larger one?

Roberto

_______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

-- | Stefan.Manegold@CWI.nl | DB Architectures (DA) | | www.CWI.nl/~manegold/ | Science Park 123 (L321) | | +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) | _______________________________________________ developers-list mailing list developers-list@monetdb.org https://www.monetdb.org/mailman/listinfo/developers-list

3589

Age (days ago)

3590

Last active (days ago)

List overview

Download

12 comments

2 participants

participants (2)

Roberto Cornacchia
Stefan Manegold

Hashjoin performance with large vs small tables

tags

participants (2)