Crashed Database

20 May 2020

Hi!
I have deployed several MonetDB databases for different customers. The
database is excellent and I haven't had many problems, but I'm starting
to feel that it is becoming "fragile".

Yesterday I had 2 crashes. The first was on a database holding 2.4 TB of
data. MonetDB crashed, and the only information I received was:

 ERR cli_mx[28693]: #main thread:!ERROR: BBPcheckbats: file
/opt/monetdb/cli_mx/cli_mx/bat/52/5236.tail too small (expected 137803304,
actual 137166848)
2020-05-19 11:10:00 MSG cli_mx[28693]: !ERROR: BBPcheckbats: file
/opt/monetdb/cli_mx/cli_mx/bat/52/5236.tail too small (expected 137803304,
actual 137166848)
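
To put a number on the BBPcheckbats message above, here is a minimal sketch that parses it and reports how many bytes the file is short. The regular expression is only my reading of the message layout shown in this log, not an official MonetDB format, and the names `PATTERN` and `shortfall` are made up for illustration.

```python
import re

# Message text copied from merovingian.log above, rejoined onto one line.
LOG = ("!ERROR: BBPcheckbats: file /opt/monetdb/cli_mx/cli_mx/bat/52/5236.tail "
       "too small (expected 137803304, actual 137166848)")

# Assumed layout of the message, based only on the lines in this post.
PATTERN = re.compile(
    r"BBPcheckbats: file (\S+) too small \(expected (\d+), actual (\d+)\)"
)

def shortfall(message):
    """Return (path, missing_bytes) for a BBPcheckbats 'too small' message."""
    m = PATTERN.search(message)
    if m is None:
        return None
    expected = int(m.group(2))
    actual = int(m.group(3))
    return m.group(1), expected - actual

print(shortfall(LOG))
```

In other words, the file on disk is shorter than the size the server expected, which is why the startup check refuses to continue.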

2020-05-19 11:08:11 ERR control[28629]: (local): failed to fork mserver:
database 'cli_mx' has crashed after starting, manual intervention needed,
check monetdbd's logfile (merovingian.log) for details

Obviously I got that information from merovingian.log, so I still don't
know what the required manual intervention is.

We run a backup every night (we rsync the database farm to a backup
server). I don't know why, but although the database appeared to be working
fine when the backup ran, the copy seems to have been damaged already. So
when I received the "crashed" notice and tried to recover from the backup,
the restored copy failed as well. I had to start from zero, loading all the
data again.

It was a very tough job. Once I completed it (at 1 am), I stopped the
database to take a backup, and again the database crashed! It never started
again!

In this case the error was the following:

2020-05-19 23:57:52 MSG merovingian[2379]: sending process 2389 (database
'cli_mx') the TERM signal
2020-05-19 23:57:53 MSG merovingian[2379]: database 'cli_mx' has shut down
2020-05-19 23:57:53 MSG control[2379]: (local): stopped database 'cli_mx'
2020-05-19 23:57:54 MSG merovingian[2379]: database 'cli_mx' (2389) has
exited with exit status 0

So, for the third time, I had to load the data again. I have just
finished. Now, obviously, I am scared to stop and start the database, and
so far I don't have a backup of the farm directory.

Can someone please help me find out why the database crashed?

The only thing I found in /var/log/messages was the following (maybe it
helps to find the issue):

May 19 23:03:49 mx-pve kernel: [4716718.867312] mserver5[19453]: segfault
at 138 ip 00007fad7b66b1e9 sp 00007ffe648e5600 error 6 in
lib_sql.so[7fad7b532000+1bc000]
May 19 23:10:03 mx-monet kernel: [4717092.595561] mserver5[22348]: segfault
at 138 ip 00007f07572d41e9 sp 00007ffd9f5aa950 error 6 in
lib_sql.so[7f075719b000+1bc000]
May 19 23:15:03 mx-monet   kernel: [4717392.674941] mserver5[24044]:
segfault at 138 ip 00007f4e7c0041e9 sp 00007ffeab6e8e90 error 6 in
lib_sql.so[7f4e7becb000+1bc000]
May 19 23:16:57 mx-monet   kernel: [4717506.863046] mserver5[24675]:
segfault at 138 ip 00007f8710d891e9 sp 00007ffc71a0e4a0 error 6 in
lib_sql.so[7f8710c50000+1bc000]
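
One thing that can be read from these kernel lines is where inside lib_sql.so each crash happened: the instruction pointer minus the library's load address. Below is a minimal sketch of that arithmetic. The regular expression assumes the usual Linux x86 segfault message layout seen above, and the names `PATTERN` and `crash_offset` are made up for illustration.

```python
import re

# Kernel segfault messages from /var/log/messages above, rejoined onto one line each.
LINES = [
    "mserver5[19453]: segfault at 138 ip 00007fad7b66b1e9 sp 00007ffe648e5600 error 6 in lib_sql.so[7fad7b532000+1bc000]",
    "mserver5[22348]: segfault at 138 ip 00007f07572d41e9 sp 00007ffd9f5aa950 error 6 in lib_sql.so[7f075719b000+1bc000]",
    "mserver5[24044]: segfault at 138 ip 00007f4e7c0041e9 sp 00007ffeab6e8e90 error 6 in lib_sql.so[7f4e7becb000+1bc000]",
    "mserver5[24675]: segfault at 138 ip 00007f8710d891e9 sp 00007ffc71a0e4a0 error 6 in lib_sql.so[7f8710c50000+1bc000]",
]

# Assumed layout: "segfault at ADDR ip IP sp SP error N in LIB[BASE+SIZE]".
PATTERN = re.compile(
    r"segfault at [0-9a-f]+ ip ([0-9a-f]+) sp [0-9a-f]+ error \d+ in "
    r"([^\[\s]+)\[([0-9a-f]+)\+[0-9a-f]+\]"
)

def crash_offset(line):
    """Return (library, offset of the faulting instruction inside it)."""
    m = PATTERN.search(line)
    if m is None:
        return None
    ip = int(m.group(1), 16)
    base = int(m.group(3), 16)
    return m.group(2), ip - base

# Collect the distinct (library, offset) pairs across all crashes.
offsets = {crash_offset(line) for line in LINES}
print({(lib, hex(off)) for lib, off in offsets})
```

All four crashes resolve to the same offset in lib_sql.so, so they look like the same fault repeated rather than four unrelated problems. The faulting address 0x138 together with error code 6 (a user-mode write to an unmapped page) would be consistent with a write through a near-NULL pointer, though that is only an interpretation. If debug symbols for this build are available, the offset could be passed to a tool such as addr2line to find the source line.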

Do you have any idea what I can do?

The version I'm using is v11.35.19 (Nov2019-SP3) on Linux CentOS 7.

Ariel

Ariel Abadi
