
Hi,

I spent more time analyzing how the original SIGSEGV occurs. I hope somebody could help me push the analysis further.

The SIGSEGV is always happening in a DELETE statement:

    delete from "20789445e300fa1e535f3027d5d63dc9_sessions" where session_start between 1280361600000 and 1280447999999;

and is triggered on line 498 of gdk_setop.mx:

    HASHloop@4(ri, r->H->hash, s2, h) {

The problem seems to be with `r->H->hash`, whose value is 0x0. I have no idea of where r->H->hash should have been set, or how to push the investigation further.

Any help would be greatly appreciated. I have included below a capture of my gdb session, which will provide more information.

Thanks in advance,
- Philippe

--------------------- gdb session --------------------

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fd9acbb7700 (LWP 3270)]
0x00007fd9b974c8e8 in BATins_kdiff (bn=0x24daa0c8, l=0x1dd47858, r=0x37498ea0) at gdk_setop.mx:498
498 HASHloop@4(ri, r->H->hash, s2, h) {
(gdb)
(gdb) bt
#0 0x00007fd9b974c8e8 in BATins_kdiff (bn=0x24daa0c8, l=0x1dd47858, r=0x37498ea0) at gdk_setop.mx:498
#1 0x00007fd9b9760865 in BATkdiff (l=0x1dd47858, r=0x37498ea0) at gdk_setop.mx:827
#2 0x00007fd9ba8be632 in CMDkdiff (result=0x7fd9acbb6838, left=0x1dd47858, right=0x37498ea0) at algebra.mx:1586
#3 0x00007fd9ba8cebce in ALGkdiff (result=0x24ad2ec8, lid=0x24ad2e98, rid=0x24ad2c28) at algebra.mx:3018
#4 0x00007fd9ba1aa5da in DFLOWstep (t=0x21bd2c8, fs=0x7fd9acfb7de0) at mal_interpreter.mx:2058
#5 0x00007fd9ba1afee3 in runDFLOWworker (t=0x21bd2c8) at mal_interpreter.mx:1174
#6 0x00007fd9b6e0c971 in start_thread () from /lib/libpthread.so.0
#7 0x00007fd9b6b6892d in clone () from /lib/libc.so.6
#8 0x0000000000000000 in ?? ()
(gdb) info threads
  6 Thread 0x7fd9acfb9700 (LWP 4846) 0x00007fd9b6e12da0 in sem_wait () from /lib/libpthread.so.0
  5 Thread 0x7fd9ad3bb700 (LWP 3266) 0x00007fd9b6b612c3 in select () from /lib/libc.so.6
  4 Thread 0x7fd9ad1ba700 (LWP 3267) 0x00007fd9b6b612c3 in select () from /lib/libc.so.6
  3 Thread 0x7fd9acdb8700 (LWP 3269) 0x00007fff89fff818 in gettimeofday ()
* 2 Thread 0x7fd9acbb7700 (LWP 3270) 0x00007fd9b974c8e8 in BATins_kdiff (bn=0x24daa0c8, l=0x1dd47858, r=0x37498ea0) at gdk_setop.mx:498
  1 Thread 0x7fd9bb3b1720 (LWP 3263) 0x00007fd9b6b612c3 in select () from /lib/libc.so.6
(gdb) thread 6
[Switching to thread 6 (Thread 0x7fd9acfb9700 (LWP 4846))]#0 0x00007fd9b6e12da0 in sem_wait () from /lib/libpthread.so.0
(gdb) bt
#0 0x00007fd9b6e12da0 in sem_wait () from /lib/libpthread.so.0
#1 0x00007fd9ba1a5b3f in q_dequeue (q=0x1eaca38) at mal_interpreter.mx:960
#2 0x00007fd9ba1b0a6c in DFLOWscheduler (flow=0x25e1c38) at mal_interpreter.mx:1385
#3 0x00007fd9ba1b1c07 in runMALdataflow (cntxt=0x606898, mb=0x275f9a48, startpc=2, stoppc=59, stk=0x24ad2af8, env=0x0, pcicaller=0x409cad8) at mal_interpreter.mx:1583
#4 0x00007fd9bacb47e0 in MALstartDataflow (cntxt=0x606898, mb=0x275f9a48, stk=0x24ad2af8, pci=0x409cad8) at language.mx:268
#5 0x00007fd9ba192333 in runMALsequence (cntxt=0x606898, mb=0x275f9a48, startpc=1, stoppc=0, stk=0x24ad2af8, env=0x0, pcicaller=0x0) at mal_interpreter.mx:2168
#6 0x00007fd9ba1866ec in callMAL (cntxt=0x606898, mb=0x275f9a48, env=0x7fd9acfb8c80, argv=0x7fd9acfb8c40, debug=0 '\000') at mal_interpreter.mx:429
#7 0x00007fd9ad435c9f in SQLexecutePrepared (c=0x606898, be=0x278fefc8, q=0x18c31ce8) at sql_scenario.mx:1490
#8 0x00007fd9ad435f12 in SQLengineIntern (c=0x606898, be=0x278fefc8) at sql_scenario.mx:1543
#9 0x00007fd9ad436441 in SQLengine (c=0x606898) at sql_scenario.mx:1652
#10 0x00007fd9ba1d9114 in runPhase (c=0x606898, phase=4) at mal_scenario.mx:604
#11 0x00007fd9ba1d92eb in runScenarioBody (c=0x606898) at mal_scenario.mx:655
#12 0x00007fd9ba1d94d3 in runScenario (c=0x606898) at mal_scenario.mx:682
#13 0x00007fd9ba1da40d in MSserveClient (dummy=0x606898) at mal_session.mx:486
#14 0x00007fd9b6e0c971 in start_thread () from /lib/libpthread.so.0
#15 0x00007fd9b6b6892d in clone () from /lib/libc.so.6
#16 0x0000000000000000 in ?? ()
(gdb) frame 7
#7 0x00007fd9ad435c9f in SQLexecutePrepared (c=0x606898, be=0x278fefc8, q=0x18c31ce8) at sql_scenario.mx:1490
1490 ret= callMAL(c, mb, &glb, argv, (m->emod & mod_debug?'n':0));
(gdb) print *q
$1 = {next = 0x4ec3398, type = 2, sa = 0x1b6bf468, s = 0x2426b268, params = 0x2427a878, paramlen = 2, stk = 615328504, code = 0x1b77a958, id = 58, key = 5856,
  codestring = 0x20083678 "delete from \"20789445e300fa1e535f3027d5d63dc9_sessions\" where session_start between 1280361600000 and 1280447999999;",
  name = 0x39bc698 "s58_1", count = 18}
(gdb)
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fd9acbb7700 (LWP 3270))]#0 0x00007fd9b974c8e8 in BATins_kdiff (bn=0x24daa0c8, l=0x1dd47858, r=0x37498ea0) at gdk_setop.mx:498
498 HASHloop@4(ri, r->H->hash, s2, h) {
(gdb) l
493 BATloop(l, p1, q1) {
494     h = BUNh@2(li, p1);
495     t = BUNtail(li, p1);
496     ins = TRUE;
497     if (@6) /* check for not-nil (nils don't match anyway) */
498         HASHloop@4(ri, r->H->hash, s2, h) {
499             if (EQUAL@5(t, BUNtail(ri, s2))) {
500                 HIT@1(h, t);
501                 ins = FALSE;
502                 break;
(gdb) p ri
$14 = {b = 0x37498ea0, hvid = 0, tvid = 0}
(gdb) p s2
$15 = 9223372036854775807
(gdb) p h
$16 = (ptr) 0x7fd9acbb3f88
(gdb) p r->H->hash
$17 = (Hash *) 0x0
(gdb) p *r->H
$18 = {id = 0x7fd9b9c48f7f "t", width = 8, type = 7 '\a', shift = 3 '\003', sorted = 0 '\000', varsized = 0, key = 0, dense = 0, nonil = 1, nil = 0, unused = 0, align = 0, nosorted_rev = 0, nokey = {0, 0}, nosorted = 0, nodense = 182, seq = 0,
  heap = {maxsize = 157280, free = 137000, size = 157280, base = 0x15c6b2b8 "", filename = 0x374990d8 "12/40/124015.tail", storage = 0 '\000', copied = 0, hashash = 0, forcemap = 0, newstorage = 0 '\000', dirty = 0 '\000', parentid = 0},
  vheap = 0x0, hash = 0x0, props = 0x0}
(gdb) p *r
$19 = {batCacheid = -43021, H = 0x37498f58, T = 0x37498ec8, P = 0x37498fe8, U = 0x37499000}

Structure of the table:

sql>\d "20789445e300fa1e535f3027d5d63dc9_sessions"
CREATE TABLE "reporting"."20789445e300fa1e535f3027d5d63dc9_sessions" (
    "session_start" BIGINT,
    "session_id" CHAR(51),
    "the_day" CHAR(10),
    "cart" BOOLEAN,
    "purchased" BOOLEAN,
    "merchant_total_dollars" INTEGER,
    "co_total_dollars" INTEGER,
    "baseline_dollars" INTEGER,
    "billing_baseline_dollars" INTEGER,
    "promo_determination" VARCHAR(20),
    "session_enabled" BOOLEAN,
    "co_enabled" BOOLEAN,
    "co_managed" BOOLEAN,
    "promo" VARCHAR(100),
    "sushi" VARCHAR(100),
    "url_referrer" VARCHAR(1024)
);

On Jun 2, 2011, at 2:01 PM, Philippe Hanrigou wrote:
Hi Stefan,
Thanks a lot for the help. I really appreciate it.
On Jun 2, 2011, at 7:23 AM, Stefan Manegold wrote:
2011-06-02 04:53:27 MSG prod_reporting[25449]: !SQLException:SQLinit:Catalogue initialization failed
2011-06-02 04:53:27 MSG prod_reporting[25449]: !ERROR: HEAPextend: failed to extend to 3316460814336 for 11/40/114026theap
                                                                                       ^^^^^^^^^^^^^
This suggests that MonetDB "for some reason" (possibly wrongly) expects some (intermediate) column to grow up to 3 TB in size and hence tries to allocate the respective memory, but fails to do so.
We can try to investigate where this happens, but as much information about your usage of MonetDB (DB schema, data, query workload) as possible would be very helpful for us to be able to locate the origin of the problem.
3 TB seems quite crazy. I wonder how I ended up triggering this with a 4.3G database (as measured by "du -hs" on disk).
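For scale, the following is a minimal sketch of the arithmetic behind that comparison. The byte count is taken from the HEAPextend error message; reading the "4.3G" reported by "du -hs" as GiB is an assumption, and the variable names are only illustrative.

```python
GIB = 2**30
TIB = 2**40

# Size requested in the HEAPextend error message, in bytes.
requested_bytes = 3316460814336

# On-disk database size from "du -hs" (4.3G), assumed here to mean GiB.
db_bytes_on_disk = 4.3 * GIB

requested_tib = requested_bytes / TIB
ratio = requested_bytes / db_bytes_on_disk

print(f"requested heap size: {requested_tib:.2f} TiB")
print(f"roughly {ratio:.0f} times the on-disk database size")
```

A request several hundred times larger than the whole database is consistent with a miscalculated size rather than genuine data growth, although the sketch itself cannot tell where the number comes from.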
The workload is a series of updates to our reference data in a database. I am simulating upserts with a combination of DELETE/COPY INTO: I want to refresh data about some "sessions" for a "merchant". Each merchant has a dedicated table, named "<merchant id>_sessions".
To refresh the session metrics I execute:
-- for each day:
--   for each merchant:
DELETE FROM "<merchant id>_sessions" WHERE session_start BETWEEN <start of day timestamp> AND <end of day timestamp>;
COPY INTO "<merchant id>_sessions" FROM STDIN USING DELIMITERS '\t','\n';
... (up to 8000 session rows for this merchant and day)

This is my poor man's way of simulating upserts, as my email on the topic did not generate many suggestions ;-)
http://sourceforge.net/mailarchive/forum.php?thread_name=BANLkTi%3DdX-1DFka5NRnZUEj%3DVdi3Sz-Kkg%40mail.gmail.com&forum_name=monetdb-users
I am willing to try other ways to accomplish the same thing, as long as they perform well for bulk upserts.
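For anyone trying to replay this workload, here is a minimal sketch of how the per-merchant, per-day refresh statements could be generated. It assumes that session_start holds milliseconds since the Unix epoch and that days are cut at UTC midnight, which matches the bounds in the failing DELETE (1280361600000 to 1280447999999). The helper names day_bounds_ms and refresh_statements are hypothetical; they are not part of MonetDB or of the original scripts.

```python
from datetime import date, datetime, timezone

MS_PER_DAY = 24 * 60 * 60 * 1000


def day_bounds_ms(day: date) -> tuple[int, int]:
    """Return the first and last millisecond of `day` (UTC), both inclusive."""
    midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    start = int(midnight.timestamp()) * 1000
    return start, start + MS_PER_DAY - 1


def refresh_statements(merchant_id: str, day: date) -> list[str]:
    """Build the DELETE and COPY INTO statements for one merchant and one day."""
    # The merchant id becomes part of a table name, so only accept plain
    # alphanumeric ids (an assumption based on the hex-like id in the thread).
    if not merchant_id.isalnum():
        raise ValueError(f"unexpected merchant id: {merchant_id!r}")
    table = f'"{merchant_id}_sessions"'
    start, end = day_bounds_ms(day)
    return [
        f"DELETE FROM {table} WHERE session_start BETWEEN {start} AND {end};",
        f"COPY INTO {table} FROM STDIN USING DELIMITERS '\\t','\\n';",
    ]


if __name__ == "__main__":
    for stmt in refresh_statements("20789445e300fa1e535f3027d5d63dc9", date(2010, 7, 29)):
        print(stmt)
```

The session rows themselves would still have to be streamed to the server after the COPY INTO statement; the sketch only builds the two SQL strings.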
I'm afraid, though, we might need to be able to replay your complete scenario and trigger the same error on our side to be able to locate and fix the problem.
I will try again tomorrow. I will try adding an explicit maximum number of records with COPY 8000 RECORDS INTO ... to see if it makes any difference.
I see from your logs that you are using the latest Apr2011-SP1 release (64-bit on a 64-bit Linux system). Did you experience the problem also with earlier releases of MonetDB?
Yes, indeed. It first happened with Apr2011. I then upgraded to Apr2011-SP1, started with a fresh database, reran the import/refresh from scratch, and was able to reproduce the problem again.
The server crashes with a segmentation fault, and we'd need to know where in the code (and why) this happens. The only(?) way to find out would be to start the server by hand in a debugger, using the same command-line options as monetdbd (merovingian) uses (see your log below), e.g.,