Investigating Segfault
Dear Monet Team,
We're having an issue in which certain data is causing mserver5 to crash.
After this condition is hit, mserver5 crashes at every startup and always
dumps an identical core.
We're running Monet 11.13.7 on 64-bit Ubuntu Linux.
The error is happening on line 1142 of gdk_atoms.c:
if (GDK_STRCMP(v, (str) (next + 1) + extralen) == 0) {
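Judging only from the expression in the faulting line, strPut appears to follow a link (next) into the string heap and compare v with the string stored just after it. If that link is damaged, the computed address can fall outside the heap. Below is a minimal sketch of such a chain walk that validates each link before following it. The layout is hypothetical and simplified: a uint32_t offset link, then a NUL-terminated string, with offset 0 ending the chain. It omits extralen and is not MonetDB's actual heap format. str_heap, chain_lookup and CHAIN_CORRUPT are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified string heap (NOT MonetDB's real layout).
 * Each entry is a uint32_t byte offset of the next entry in the same
 * hash chain (0 ends the chain), followed by a NUL-terminated string. */
typedef struct {
    const char *base;  /* start of the heap */
    size_t size;       /* number of valid bytes */
} str_heap;

#define CHAIN_CORRUPT ((size_t)-1)

/* 1 if a NUL-terminated string starts at off and ends inside the heap. */
static int string_in_bounds(const str_heap *h, size_t off)
{
    return off < h->size && memchr(h->base + off, '\0', h->size - off) != NULL;
}

/* Walks the chain starting at offset first, looking for v.
 * Returns the entry's offset, 0 if v is not in the chain, or
 * CHAIN_CORRUPT if a link points outside the heap or the chain loops. */
static size_t chain_lookup(const str_heap *h, size_t first, const char *v)
{
    size_t off = first;
    size_t steps = 0;

    assert(h != NULL && v != NULL);
    while (off != 0) {
        uint32_t next;
        size_t str_off;

        /* Validate the link before reading it; an unchecked walk would
         * dereference whatever address a damaged link produces. */
        if (off >= h->size || h->size - off < sizeof next || ++steps > h->size)
            return CHAIN_CORRUPT;
        memcpy(&next, h->base + off, sizeof next);
        str_off = off + sizeof next;        /* string follows the link */
        if (!string_in_bounds(h, str_off))
            return CHAIN_CORRUPT;
        if (strcmp(v, h->base + str_off) == 0)
            return off;
        off = next;
    }
    return 0;
}
```

With an intact chain, chain_lookup finds "SAD014H1". If a link is overwritten with an offset past the end of the heap, or with one that makes the chain loop, the walk returns CHAIN_CORRUPT instead of reading outside the buffer.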
Examining the core dump revealed that (next + 1) + extralen is referring to
an out of bounds address. Here's the backtrace:
#0 0x00007faf58414829 in strPut (h=0x1e2d180, dst=0x7fff592cf8f8, v=0x314dac0 "SAD014H1") at gdk_atoms.c:1142
#1 0x00007faf582dc935 in BATappend (b=0x1e2cf90, n=0x32dfdb0, force=1 '\001') at gdk_batop.c:578
#2 0x00007faf584c301e in la_bat_updates (lg=0x2d9b030, la=0x2c3ef48) at gdk_logger.c:429
#3 0x00007faf584c3cf9 in la_apply (lg=0x2d9b030, c=0x2c3ef48) at gdk_logger.c:645
#4 0x00007faf584c3f26 in tr_commit (lg=0x2d9b030, tr=0x2e247d0) at gdk_logger.c:705
#5 0x00007faf584c4533 in logger_readlog (lg=0x2d9b030, filename=0x7fff592d1e80 "/opt/clicksecurity/data/_monetdb/click/sql_logs/sql/log.56") at gdk_logger.c:823
#6 0x00007faf584c482a in logger_readlogs (lg=0x2d9b030, fp=0x2d9b160, filename=0x7fff592d3f90 "/opt/clicksecurity/data/_monetdb/click/sql_logs/sql/log") at gdk_logger.c:896
#7 0x00007faf584c6f3e in logger_new (debug=0, fn=0x7faf500adfa8 "sql", logdir=0x7faf50090a08 "sql_logs", dbname=0x1fa3da0 "click", version=52001, prefuncp=0x7faf500746a1, postfuncp=0x7faf500747ed) at gdk_logger.c:1420
#8 0x00007faf584c704e in logger_create (debug=0, fn=0x7faf500adfa8 "sql", logdir=0x7faf50090a08 "sql_logs", dbname=0x1fa3da0 "click", version=52001, prefuncp=0x7faf500746a1, postfuncp=0x7faf500747ed) at gdk_logger.c:1446
#9 0x00007faf50075b19 in bl_create (logdir=0x7faf50090a08 "sql_logs", dbname=0x1fa3da0 "click", cat_version=52001) at bat_logger.c:249
#10 0x00007faf50060ce4 in store_init (debug=0, store=store_bat, logdir=0x7faf50090a08 "sql_logs", dbname=0x1fa3da0 "click", stk=0) at store.c:1287
#11 0x00007faf4ffe3d3c in mvc_init (dbname=0x1fa3da0 "click", debug=0, store=store_bat, stk=0) at sql_mvc.c:51
#12 0x00007faf4ff66874 in SQLinit () at sql_scenario.c:230
#13 0x00007faf4ff6651f in SQLprelude () at sql_scenario.c:159
#14 0x00007faf58b3085d in malCommandCall (stk=0x2d36e80, pci=0x2ea5520) at mal_interpreter.c:137
#15 0x00007faf58b331b5 in runMALsequence (cntxt=0x7faf5988c020, mb=0x1e04310, startpc=1, stoppc=0, stk=0x2d36e80, env=0x0, pcicaller=0x0) at mal_interpreter.c:710
#16 0x00007faf58b323c1 in runMAL (cntxt=0x7faf5988c020, mb=0x1e04310, startpc=1, mbcaller=0x0, env=0x0, pcicaller=0x0) at mal_interpreter.c:454
#17 0x00007faf58b60a08 in MALengine (c=0x7faf5988c020) at mal_session.c:619
#18 0x00007faf58b5f21f in malBootstrap () at mal_session.c:64
#19 0x00007faf58b1313b in mal_init () at mal.c:244
#20 0x000000000040340e in main (argc=22, av=0x7fff592db568) at mserver5.c:582
I'm digging into it now, but I was hoping that it might ring some bells.
Thanks,
--
Percy Wegmann +1 512 637 8500 ext 148
_______________________________________________
users-list mailing list
users-list@monetdb.org
http://mail.monetdb.org/mailman/listinfo/users-list
Hi Percy,
your segfault does not instantly ring a bell with me, but before you/we dive deeply into this, I am wondering whether you could upgrade from your Oct2012-SP2 to Oct2012-SP3 or even Feb2013, and check whether the crash(es) still occur?
Stefan
-- 
| Stefan.Manegold@CWI.nl | DB Architectures (DA) |
| www.CWI.nl/~manegold/ | Science Park 123 (L321) |
| +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
Hi Stefan,
Thanks for the quick reply! So, I tried 11.13.9 and got the same results.
The interesting thing is that this happens right at startup, which made me
suspect that the data file itself is already corrupted due to an earlier
issue.
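The backtrace places the crash inside log replay at startup: logger_readlogs calls logger_readlog, which calls la_apply and then BATappend and strPut. That fits the suspicion that something written earlier was already damaged. One common defensive technique is to check each log record before applying it, so that a truncated or corrupted record stops replay instead of being applied. Below is a minimal sketch of that idea. The record layout ([length][checksum][payload]), the FNV-1a checksum and the names replay_log, append_record and sum_lengths are all assumptions for illustration. This is not MonetDB's logger format, and it makes no claim about what actually damaged this database.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REC_HDR (2 * sizeof(uint32_t))  /* length + checksum */

/* FNV-1a (32-bit), a simple non-cryptographic checksum. */
static uint32_t fnv1a(const void *data, size_t n)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Writes one hypothetical record [len][checksum][payload] at buf+pos
 * and returns the offset just past it. */
static size_t append_record(unsigned char *buf, size_t pos, const void *payload, uint32_t len)
{
    uint32_t sum = fnv1a(payload, len);
    memcpy(buf + pos, &len, sizeof len);
    memcpy(buf + pos + sizeof len, &sum, sizeof sum);
    memcpy(buf + pos + REC_HDR, payload, len);
    return pos + REC_HDR + len;
}

typedef void (*apply_fn)(const unsigned char *payload, uint32_t len, void *ctx);

/* Example apply callback: adds up payload sizes in *(size_t *)ctx. */
static void sum_lengths(const unsigned char *payload, uint32_t len, void *ctx)
{
    (void)payload;
    *(size_t *)ctx += len;
}

/* Applies records from buf[0..n) in order and stops at the first one that
 * is truncated or fails its checksum. *valid_end receives the size of the
 * intact prefix; the return value is the number of records applied. */
static size_t replay_log(const unsigned char *buf, size_t n, apply_fn apply, void *ctx, size_t *valid_end)
{
    size_t pos = 0, applied = 0;

    assert(apply != NULL && valid_end != NULL);
    while (n - pos >= REC_HDR) {
        uint32_t len, sum;
        memcpy(&len, buf + pos, sizeof len);
        memcpy(&sum, buf + pos + sizeof len, sizeof sum);
        if (len > n - pos - REC_HDR)                 /* payload cut short */
            break;
        if (fnv1a(buf + pos + REC_HDR, len) != sum)  /* payload damaged */
            break;
        apply(buf + pos + REC_HDR, len, ctx);
        applied++;
        pos += REC_HDR + len;
    }
    *valid_end = pos;
    return applied;
}
```

The sketch stops at the first bad record rather than skipping it, so later records are never applied on top of a gap. What a real server should do at that point (truncate the log, refuse to start, or ask an operator) is a policy question the sketch does not settle.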
Going through my logs, I see the following:
2013-02-22 13:59:34 MSG merovingian[1742]: sending process 20151 (database 'click') the TERM signal
2013-02-22 13:59:34 ERR merovingian[1742]: timeout of 0 seconds expired, sending process 20151 (database 'click') the KILL signal
2013-02-22 13:59:34 MSG control[1742]: (local): stopped database 'click'
2013-02-22 13:59:34 MSG merovingian[1742]: database 'click' (20151) was killed by signal SIGKILL
2013-02-22 13:59:34 MSG merovingian[1742]: database 'click' has crashed after start on 2013-02-22 13:59:23, attempting restart, up min/avg/max: 0s/0s/0s, crash average: 1.00 0.10 0.03 (1-0=1)
This was done in response to issuing a "monetdb stop" command. The
exittimeout parameter for the monetdbd daemon is set to 0. I interpreted
the man page to mean that this is an unlimited timeout:
"A time-out value of 0 means no mservers will be shut down, and hence
they will continue to run after monetdbd has shut down."
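The 2013-02-22 log shows the shutdown escalating from SIGTERM to SIGKILL once the 0-second timeout expired. Below is a POSIX C sketch of the general TERM-then-KILL pattern. It is not merovingian's code: stop_child, its 100 ms polling interval and its error handling are assumptions for illustration. In this sketch, as in that log, a timeout of 0 leaves the server no time to exit on its own.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Sends SIGTERM to pid, polls for up to timeout_sec seconds for it to
 * exit, then escalates to SIGKILL. Returns the waitpid() status, or -1
 * on error; *killed reports whether SIGKILL was needed. */
static int stop_child(pid_t pid, unsigned timeout_sec, bool *killed)
{
    const struct timespec tick = { 0, 100L * 1000 * 1000 };  /* 100 ms */
    time_t deadline;
    int status;

    assert(killed != NULL);
    *killed = false;
    if (kill(pid, SIGTERM) != 0)
        return -1;
    deadline = time(NULL) + (time_t)timeout_sec;
    for (;;) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return status;              /* child is gone; no KILL needed */
        if (r == -1 && errno != EINTR)
            return -1;
        if (time(NULL) >= deadline)
            break;                      /* a timeout of 0 expires immediately */
        nanosleep(&tick, NULL);
    }
    *killed = true;
    if (kill(pid, SIGKILL) != 0)
        return -1;
    while (waitpid(pid, &status, 0) == -1)
        if (errno != EINTR)
            return -1;
    return status;
}
```

A child that exits on SIGTERM within timeout_sec is reaped with *killed left false. A child that is still running at the deadline receives SIGKILL, which cannot be caught, so it gets no chance to finish writing or closing its files. A longer timeout gives a server room to shut down cleanly.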
I'm now setting the exittimeout to a high value to see if that does the
trick. I'll let you know how it goes after we finish running our test
scenario.
Thanks!
Percy
On Mon, Feb 25, 2013 at 11:58 AM, Stefan Manegold wrote:
Hi Percy,
your segfault does not instantly ring a bell with me, but before you/we dive deeply into this, I am wondering whether you could upgrade from your Oct2012-SP2 to Oct2012-SP3 or even Feb2013, and check whether the crash(es) still occur?
Stefan
Well, unfortunately that doesn't seem to have resolved the issue. I'm
seeing this in the log before the first crash:
2013-02-25 12:56:22 MSG merovingian[16884]: sending process 17528 (database 'click') the TERM signal
2013-02-25 12:56:22 MSG merovingian[16884]: database 'click' (17528) has exited with exit status 0
2013-02-25 12:56:22 MSG merovingian[16884]: database 'click' has shut down
2013-02-25 12:56:22 MSG control[16884]: (local): stopped database 'click'
So it looks like it should have shut down cleanly, but I'm still getting a
segfault in exactly the same part of the code.
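The two shutdowns ended differently. On 2013-02-22 the server was killed by SIGKILL, while on 2013-02-25 it exited with status 0. On POSIX systems a supervising process tells these cases apart from the waitpid() status. Below is a minimal sketch of that decoding. describe_status and wait_and_describe are hypothetical helpers whose wording only imitates the log lines; they are not taken from monetdbd's source. WCOREDUMP is not part of POSIX, so it is used only where the platform defines it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Names for the signals that come up in this thread; others print as numbers. */
static const char *sig_name(int sig)
{
    switch (sig) {
    case SIGTERM: return "SIGTERM";
    case SIGKILL: return "SIGKILL";
    case SIGSEGV: return "SIGSEGV";
    default:      return NULL;
    }
}

/* Renders a waitpid() status as text resembling the monetdbd log lines. */
static void describe_status(int status, char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);
    if (WIFEXITED(status)) {
        snprintf(out, outlen, "exited with exit status %d", WEXITSTATUS(status));
    } else if (WIFSIGNALED(status)) {
        int sig = WTERMSIG(status);
        const char *name = sig_name(sig);
        const char *core = "";
#ifdef WCOREDUMP
        if (WCOREDUMP(status))        /* not POSIX, but common on Linux/BSD */
            core = " (core dumped)";
#endif
        if (name != NULL)
            snprintf(out, outlen, "was killed by signal %s%s", name, core);
        else
            snprintf(out, outlen, "was killed by signal %d%s", sig, core);
    } else {
        snprintf(out, outlen, "changed state (raw status %d)", status);
    }
}

/* Reaps pid (retrying on EINTR) and describes how it ended.
 * Returns 0, or -1 if waitpid fails. */
static int wait_and_describe(pid_t pid, char *out, size_t outlen)
{
    int status;
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    describe_status(status, out, outlen);
    return 0;
}
```

An exit status of 0 describes only how that one process ended. It says nothing about whether files left behind by an earlier, unclean shutdown are intact.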
Thanks,
Percy
On Mon, Feb 25, 2013 at 12:44 PM, Percy Wegmann wrote:
Hi Stefan,
Thanks for the quick reply! So, I tried 11.13.9 and got the same results.
participants (2)
- Percy Wegmann
- Stefan Manegold