tail heap hash found to be invalid on repeated DB loads
As part of, shall we say, helping with the QA of the March-feature-branch-based releases, I'm looking at some of the GDK code involving loading a BBP (and the action thereof). Specifically, I've been looking at the function gdk/gdk_atoms.c:strCleanHash(), which recomputes the opportunistic partial hash table for a tail heap; checks if it's equal to what (seems to be) the hash loaded from disk; and if they don't match - copies the new hash over the old one.

I've added an indication in case the hashes don't match. And - I've noticed this happens repeatedly when trying to load the same database, even after previous plain vanilla unmodified source code runs, with no crashes and orderly shutdown. At least - I'm seeing this happen with 11.29.1, and on a TPC-H SF 1 database I created with the script at https://github.com/eyalroz/tpch-tools

My questions:

1. Is a discrepancy between the old and recomputed hash, when there had not been any problems on the previous run, expected behavior - or is it a bug?
2. If it is expected, why isn't a correct hash written back to the disk? Doesn't the memcmp(newhash, h->base, sizeof(newhash)) command eventually/immediately get translated into a write?
3. If this is not expected behavior - has this been verified not to happen on other machines? Other versions?
4. Could this be a 11.29.3 vs 11.29.1 issue somehow? I haven't updated to the latest RC yet.

Eyal
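(To make the check concrete: as I read it, the logic boils down to roughly the sketch below. The table size and function name are mine, not the actual strCleanHash() variables, and the fprintf is the local "indication" I added; the stock code only does the compare and the copy.)

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    enum { NSLOTS = 1024 };  /* illustrative table size, not GDK's actual one */

    /* 'stored' is the hash table as loaded from disk at the start of the heap;
     * 'recomputed' is the freshly rebuilt one.  The fprintf is my local
     * indication; the stock code only compares and copies. */
    static void check_and_repair(uint64_t *stored, const uint64_t *recomputed)
    {
        if (memcmp(recomputed, stored, NSLOTS * sizeof(*stored)) != 0) {
            fprintf(stderr, "tail heap hash mismatch, overwriting stored table\n");
            memcpy(stored, recomputed, NSLOTS * sizeof(*stored));
        }
    }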
On 29/03/18 00:37, Eyal Rozenberg wrote:
As part of, shall we say, helping with the QA of the March-feature-branch-based releases, I'm looking at some of the GDK code involving loading a BBP (and the action thereof).
Specifically, I've been looking at the function gdk/gdk_atoms.c:strCleanHash(), which recomputes the opportunistic partial hash table for a tail heap; checks if it's equal to what (seems to be) the hash loaded from disk; and if they don't match - copies the new hash over the old one.
I've added an indication in case the hashes don't match. And - I've noticed this happens repeatedly when trying to load the same database, even after previous plain vanilla unmodified source code runs, with no crashes and orderly shutdown. At least - I'm seeing this happen with 11.29.1, and on a TPC-H SF 1 database I created with the script at https://github.com/eyalroz/tpch-tools
My questions:
1. Is a discrepancy between the old and recomputed hash, when there had not been any problems on the previous run, expected behavior - or is it a bug?
It is expected behavior; that's why the code is there.
2. If it is expected, why isn't a correct hash written back to the disk? Doesn't the memcmp(newhash, h->base, sizeof(newhash)) command eventually/immediately get translated into a write?
It doesn't happen all the time, and you don't want to be writing to files when there is no need. We've had complaints about that.
3. If this is not expected behavior - has this been verified not to happen on other machines? Other versions?
4. Could this be a 11.29.3 vs 11.29.1 issue somehow? I haven't updated to the latest RC yet.
No. The problem is that during normal database operations, string heaps can be appended to, but if the change doesn't actually get committed, the appends need to be undone. This can happen e.g. when a transaction is aborted due to server shutdown (crash or regular). The committed data indicates how large the string heap is (BBP.dir), but it may be that there is more data, so that gets truncated. That is fine. But the hash may have been changed as well. If in the aborted run the file had been loaded through GDKmalloc, that change wouldn't have gone to disk, but if it had been memory mapped, the change may well have gone to disk. That is what needs to be repaired.

If a string heap is larger than GDK_ELIMLIMIT (64 KiB), we can't be sure where strings start, and anyway, it becomes expensive to run through the whole heap, so we only rebuild the hash for the first 64 KiB. This means that it is likely the hash entry will be different if the heap is larger since the newer hash values will be different. Note though, that the hash table is completely valid, even if different. The entries just point at different strings, but since it is opportunistic, that's fine.
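To make the mechanism concrete, here is a rough sketch of that repair. It is not the actual strCleanHash() code; the heap layout, hash function and table size are invented for illustration. The point is only that the rebuild walks the strings in the first 64 KiB and overwrites the stored table only when it differs:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define ELIMLIMIT   (64 * 1024)   /* stand-in for GDK_ELIMLIMIT       */
    #define TABLE_SLOTS 1024          /* stand-in for the real table size */

    /* Toy FNV-1a string hash; the real GDK hash function differs. */
    static uint64_t toy_strhash(const char *s)
    {
        uint64_t h = 0xcbf29ce484222325ULL;
        for (; *s; s++)
            h = (h ^ (uint8_t)*s) * 0x100000001b3ULL;
        return h;
    }

    /* Rebuild the opportunistic partial hash over at most the first ELIMLIMIT
     * bytes of a heap whose NUL-terminated strings start right after the table,
     * then overwrite the stored table only if it actually differs, so an
     * unchanged heap causes no write (and no dirtied mmapped page). */
    static void clean_partial_hash(char *heap_base, size_t heap_free)
    {
        uint64_t newtab[TABLE_SLOTS] = { 0 };
        size_t off = sizeof(newtab);                  /* table sits at the heap start */
        size_t limit = heap_free < ELIMLIMIT ? heap_free : ELIMLIMIT;

        while (off < limit) {
            const char *s = heap_base + off;
            newtab[toy_strhash(s) % TABLE_SLOTS] = (uint64_t)off;  /* last string wins the slot */
            off += strlen(s) + 1;                     /* step over the string and its NUL */
        }

        if (memcmp(newtab, heap_base, sizeof(newtab)) != 0)
            memcpy(heap_base, newtab, sizeof(newtab));  /* repair only on mismatch */
    }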
Eyal
-- Sjoerd Mullender
On 03/29/2018 09:47 AM, Sjoerd Mullender wrote:
2. If it is expected, why isn't a correct hash written back to the disk? Doesn't the memcmp(newhash, h->base, sizeof(newhash)) command eventually/immediately get translated into a write?
It doesn't happen all the time
If I have an mmapped column, and then the memcmp() comes out false, then there's a memcpy() from the new hash to h->base. Why would this not always result in a write, assuming MonetDB is stopped normally?
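(For reference, that expectation rests on ordinary shared-mapping semantics. A minimal standalone illustration, with an invented file name and nothing GDK-specific, would be something like:)

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* A store into a MAP_SHARED file mapping dirties the page in the page
     * cache and is written back to the file by the kernel; msync() forces
     * that write-back before unmapping. */
    int main(void)
    {
        int fd = open("theap.bin", O_RDWR);          /* invented file name */
        struct stat st;
        if (fd < 0 || fstat(fd, &st) != 0)
            return 1;

        char *base = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (base == MAP_FAILED)
            return 1;

        memcpy(base, "new-hash-bytes", 14);          /* dirties the first page */
        msync(base, (size_t)st.st_size, MS_SYNC);    /* force the write-back */

        munmap(base, (size_t)st.st_size);
        close(fd);
        return 0;
    }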
and you don't want to be writing to files when there is no need. We've had complaints about that.
You're obviously right about not wanting to write without good reason, but one would expect that in the simple case of data being inserted once and then never modified, the hash on disk would be consistent with the hash recomputed on load.
The problem is that during normal database operations, string heaps can be appended to, but if the change doesn't actually get committed, the appends need to be undone. This can happen e.g. when a transaction is aborted due to server shutdown (crash or regular). The committed data indicates how large the string heap is (BBP.dir), but it may be that there is more data, so that gets truncated. That is fine. But the hash may have been changed as well. If in the aborted run the file had been loaded through GDKmalloc, that change wouldn't have gone to disk, but if it had been memory mapped, the change may well have gone to disk. That is what needs to be repaired. If a string heap is larger than GDK_ELIMLIMIT (64 KiB), we can't be sure where strings start, and anyway, it becomes expensive to run through the whole heap, so we only rebuild the hash for the first 64 KiB. This means that it is likely the hash entry will be different if the heap is larger since the newer hash values will be different.
So it seems the result is that even when all changes have been committed, and even in the simplest case of only executing COPY INTO once with no other writes, we can still get (sometimes? often? always?) a recomputation mismatch.
Note though, that the hash table is completely valid, even if different. The entries just point at different strings, but since it is opportunistic, that's fine.
But if that's the case, why even bother with the memcmp()? If there's no good reason to expect the hash to be identical, it can simply always be recomputed. (Or alternatively, there could be some play with flags and the order in which mmapped pages are flushed to disk, to ensure that either the hash is valid or a dirty flag read from disk is set.)
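(A crude sketch of what I mean by the dirty-flag variant, with a made-up header layout rather than anything GDK actually has: set and sync a flag before touching the mmapped hash, clear it only after the new hash has reached disk, and have the loader trust the on-disk hash only when the flag reads clear.)

    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Made-up, page-aligned heap header: a flag byte followed by the hash.
     * None of this is actual GDK structure. */
    struct heap_hdr {
        uint8_t  hash_dirty;             /* 1 = on-disk hash may be stale     */
        uint8_t  pad[7];
        uint64_t hash[1024];             /* opportunistic partial hash (toy)  */
    };

    /* Ordered update: the flag reaches disk before the hash is touched and is
     * cleared only after the new hash has reached disk.  A loader seeing
     * hash_dirty == 0 could then skip the recomputation entirely. */
    static int update_hash_with_flag(struct heap_hdr *hdr, size_t maplen,
                                     const uint64_t *newhash)
    {
        hdr->hash_dirty = 1;
        if (msync(hdr, maplen, MS_SYNC) != 0)
            return -1;
        memcpy(hdr->hash, newhash, sizeof(hdr->hash));
        if (msync(hdr, maplen, MS_SYNC) != 0)
            return -1;
        hdr->hash_dirty = 0;
        return msync(hdr, maplen, MS_SYNC);
    }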