On 03/29/2018 09:47 AM, Sjoerd Mullender wrote:
2. If it is expected, why isn't a correct hash written back to the disk? Doesn't the memcmp(newhash, h->base, sizeof(newhash)) call eventually/immediately get translated into a write?
It doesn't happen all the time
If I have an mmapped column, and the memcmp() comes out false, then there's a memcpy() from the new hash to h->base. Why would this not always result in a write, assuming MonetDB is stopped normally?
and you don't want to be writing to files when there is no need. We've had complaints about that.
You're obviously right about not wanting to write without good reason, but one would expect that in the simple case of data being inserted once and never modified, the hash on disk would be consistent with the hash recomputed on load.
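Just so we're talking about the same thing, here is roughly the pattern I have in mind (a sketch of my understanding, not the actual GDK code; heap_t, install_hash() and the hash length parameter are made up for illustration):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the heap descriptor (made-up layout). */
typedef struct {
    char  *base;     /* start of the heap; the stored hash sits at the front */
    size_t size;     /* total size of the heap                               */
    bool   mapped;   /* true if backed by mmap rather than GDKmalloc         */
} heap_t;

/* Install a freshly recomputed hash, but only when it differs from what
 * is already stored.  The memcmp() itself never writes anything; it is
 * the memcpy() that dirties the (possibly mmapped) page, and a dirty
 * page is what can end up on disk when pages are flushed or the server
 * shuts down normally. */
static void install_hash(heap_t *h, const unsigned char *newhash, size_t hashlen)
{
    if (memcmp(newhash, h->base, hashlen) != 0)
        memcpy(h->base, newhash, hashlen);
}

So my expectation was that after a clean shutdown the dirtied page, and with it the corrected hash, would have reached the file.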
The problem is that during normal database operations, string heaps can be appended to, but if the change doesn't actually get committed, the appends need to be undone. This can happen e.g. when a transaction is aborted due to server shutdown (crash or regular). The committed data (BBP.dir) indicates how large the string heap is, but there may be more data than that, so the excess gets truncated. That is fine. But the hash may have been changed as well. If in the aborted run the file had been loaded through GDKmalloc, that change wouldn't have gone to disk, but if it had been memory mapped, the change may well have gone to disk. That is what needs to be repaired. If a string heap is larger than GDK_ELIMLIMIT (64 KiB), we can't be sure where strings start, and anyway it becomes expensive to run through the whole heap, so we only rebuild the hash for the first 64 KiB. This means that the hash entry is likely to be different if the heap is larger, since the newer hash values will be different.
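If I'm reading this right, the recovery step works roughly like the following (my own sketch, not MonetDB's code; strheap_t, recover_strheap(), the hash-table layout and the string layout are all simplified or invented):

#include <stddef.h>
#include <string.h>

#define GDK_ELIMLIMIT (1 << 16)   /* the 64 KiB limit mentioned above     */
#define HASH_ENTRIES  1024        /* made-up size of the hash table       */

typedef struct {
    char  *base;   /* hash table followed by NUL-terminated string data   */
    size_t free;   /* current end of the heap                             */
} strheap_t;

/* Made-up string hash, only to keep the sketch self-contained. */
static unsigned strhash(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h;
}

static void recover_strheap(strheap_t *h, size_t committed_size)
{
    unsigned *hash = (unsigned *) h->base;
    size_t start = HASH_ENTRIES * sizeof(*hash);   /* first string offset */
    size_t limit;

    /* 1. Undo uncommitted appends: BBP.dir says how large the committed
     *    heap is; anything beyond that is truncated away.               */
    if (h->free > committed_size)
        h->free = committed_size;

    /* 2. Rebuild the hash, but only over the first 64 KiB: beyond that
     *    we cannot cheaply tell where strings start, and scanning the
     *    whole heap would be expensive.  Entries that an earlier run
     *    wrote for later strings are therefore simply not reproduced.   */
    memset(hash, 0, start);
    limit = h->free < GDK_ELIMLIMIT ? h->free : GDK_ELIMLIMIT;
    for (size_t off = start; off < limit; off += strlen(h->base + off) + 1)
        hash[strhash(h->base + off) % HASH_ENTRIES] = (unsigned) off;
}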
So it seems the result is that even when all changes have been committed, and even in the simplest case of executing COPY INTO once with no other writes, we can still get (sometimes? often? always?) a recomputation mismatch.
Note, though, that the hash table is completely valid, even if different. The entries just point at different strings, but since it is opportunistic, that's fine.
But if that's the case, why even bother with the memcmp()? If there's no good reason to expect the hash to be identical, it could simply always be recomputed. (Or alternatively, there could be some play with flags and the order in which mmapped pages are flushed to disk, to ensure that either the hash is valid or a dirty flag read from disk is set.)
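For what it's worth, my mental model of why a "different but valid" hash is harmless is something like this (again a sketch, not GDK's actual lookup; the table layout and strhash() are made up):

#include <stddef.h>
#include <string.h>

#define HASH_ENTRIES 1024

typedef struct {
    size_t entries[HASH_ENTRIES];  /* offset of *some* string with this hash, 0 = empty */
    char  *strings;                /* NUL-terminated string data; offset 0 is unused    */
    size_t free;                   /* end of the string data                            */
} strtable_t;

/* Made-up string hash, only to keep the sketch self-contained. */
static unsigned strhash(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h;
}

/* Return the offset of s if the table happens to point at it, 0 otherwise.
 * The entry is only a hint: the string it points at is always verified, so
 * an entry left over from a different heap state costs us the shortcut but
 * can never produce a wrong answer. */
static size_t lookup(const strtable_t *t, const char *s)
{
    size_t off = t->entries[strhash(s) % HASH_ENTRIES];

    if (off != 0 && off < t->free && strcmp(t->strings + off, s) == 0)
        return off;
    return 0;   /* miss: the caller can fall back to scanning or just append */
}

If that model is right, then recomputing unconditionally on load would also be correct, which is what makes me wonder what the memcmp() is buying us.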