[MonetDB-users] corruption on db restart
I imported some millions of rows into a table:

sql>select part, count(*) from frag group by part order by part;
+------+-----------+
| part |count_node |
+======+===========+
|    1 |  25205085 |
|    2 |  25618423 |
|    3 |  27380780 |
|    4 |  26963069 |
|    5 |  25661196 |
|    6 |  27021186 |
|    7 |  26058891 |
|    8 |  28092822 |
|    9 |  27528646 |
|   10 |  26429092 |
|   11 |  25194504 |
|   12 |  25136210 |
|   13 |  25105453 |
|   14 |  24860071 |
|   16 |  20672348 |
|   17 |  19138139 |
+------+-----------+

Then I stopped the database and wanted to start it back up to see how quickly it would come up. It took a few minutes to come up while it was committing uncommitted transactions. When I started it up, the space usage looked like this:

251721152 ./var/MonetDB5/dbfarm (251gb)
24181152 ./var/MonetDB5/sql_logs (24gb)

After a few minutes it became like this:

249855172 ./var/MonetDB5/dbfarm (249gb)
12 ./var/MonetDB5/sql_logs

However, when it came up, I guess the data became corrupted, because look at what happened to my query results now:

sql>select part, count(*) from frag group by part order by part;
+------+-----------+
| part |count_node |
+======+===========+
|    1 |  25205085 |
|    0 | 361722691 |
|   17 |  19138139 |
|   18 |  21414794 |
+------+-----------+

There are about 427M rows in total, of which 361.7M are corrupted.

There is nothing in the merovingian.log file at all since the stop/start of the database.

Can someone advise what to do with this?
1. If it is possible to recover the data, please let me know.
2. Otherwise, what can I do to prevent this in the future?

This is a huge roadblock.

--
View this message in context: http://www.nabble.com/corruption-on-db-restart-tp16737747p16737747.html
Sent from the monetdb-users mailing list archive at Nabble.com.
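The two result sets above can be compared with a little arithmetic. The following is a minimal sketch (the variable names are illustrative; the counts are copied from the query output in the message) that checks whether the rows now reported under part 0 match the rows of the parts that disappeared from the listing.

```python
# Row counts per part before the restart, copied from the first query output.
before = {
    1: 25205085, 2: 25618423, 3: 27380780, 4: 26963069,
    5: 25661196, 6: 27021186, 7: 26058891, 8: 28092822,
    9: 27528646, 10: 26429092, 11: 25194504, 12: 25136210,
    13: 25105453, 14: 24860071, 16: 20672348, 17: 19138139,
}

# Row counts per part after the restart, copied from the second query output.
after = {1: 25205085, 0: 361722691, 17: 19138139, 18: 21414794}

# Parts that were listed before the restart but are no longer listed.
missing = sorted(p for p in before if p not in after)
missing_rows = sum(before[p] for p in missing)

print("total rows after restart:   ", sum(after.values()))
print("parts missing after restart:", missing)
print("rows in the missing parts:  ", missing_rows)
print("rows reported as part 0:    ", after[0])
print("match:", missing_rows == after[0])
```

With the numbers posted, the two figures agree, which suggests that the rows of parts 2 through 16 were not lost but now read back with part = 0. Part 18 is absent from the first listing, so those rows were presumably loaded after that query was run.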
after typing this up, I did a \q on mclient, and went back in, at that time this message appeared on the merovingian.log file:

TME 2008-04-16 22:50:11
MSG merovingian[20338]: database 'frdata' already running since 2008-04-16 22:10:44, up min/avg/max: 25111/25111/25111, crash average: 0.00 0.00 0.00 (2-1=0)
MSG merovingian[20338]: redirecting client 127.0.0.1:49801 for database 'frdata' to mapi:monetdb://fedora.mydomain.com:50001/

--
View this message in context: http://www.nabble.com/corruption-on-db-restart-tp16737747p16737776.html
Sent from the monetdb-users mailing list archive at Nabble.com.
On Wed, Apr 16, 2008 at 07:51:29PM -0700, mobigital1 wrote:
after typing this up, I did a \q on mclient, and went back in, at that time this message appeared on the merovingian.log file:
TME 2008-04-16 22:50:11 MSG merovingian[20338]: database 'frdata' already running since 2008-04-16 22:10:44, up min/avg/max: 25111/25111/25111, crash average: 0.00 0.00 0.00 (2-1=0) MSG merovingian[20338]: redirecting client 127.0.0.1:49801 for database 'frdata' to mapi:monetdb://fedora.mydomain.com:50001/
This message doesn't explain the database corruption. It seems you hit a bug here. To reproduce this problem, could you somehow describe the data you're trying to insert, which system you're using, how you were running your monet server, etc.? Maybe you could also create a bug report on SourceForge for this?

Niels
------------------------------------------------------------------------- This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javao... _______________________________________________ MonetDB-users mailing list MonetDB-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/monetdb-users
-- Niels Nes, Centre for Mathematics and Computer Science (CWI) Kruislaan 413, 1098 SJ Amsterdam, The Netherlands room C0.02, phone ++31 20 592-4098, fax ++31 20 592-4312 url: http://www.cwi.nl/~niels e-mail: Niels.Nes@cwi.nl
On 17-04-2008 09:04:17 +0200, Niels Nes wrote:
This message doesn't explain the database corruption.
It seems you hit a bug here. To reproduce this problem, could you somehow describe the data you're trying to insert, which system you're using, how you were running your monet server, etc.? Maybe you could also create a bug report on SourceForge for this?
One thing from mobigital1's message is unclear to me:
then I stopped the database, and wanted to start it back up to see how quickly it would come up. it took a few minutes to come up while it was committing uncommitted transactions. When i started it up the space usage looked like this: [snip] There is nothing in the merovingian.log file at all since stop/start of database.
I assume you start your database using merovingian in this case. One thing that I know is that merovingian doesn't wait minutes for a database to come up. Instead it waits for at most 10 seconds. Due to a bug that I fixed yesterday (stable CVS), merovingian would not log anything about the database not starting up within 10 seconds. The intention of merovingian is to terminate a database that takes too long; however, due to the same bug, this didn't actually happen.
Yes, started via merovingian, but I don't think that it tried to kill mserver5. It went like this:

1. I loaded a ton of data and then stopped the server. At this point sql_logs had a leftover 24GB of logs in it (see my previous messages).
2. I started the server via the monetdb start <> command.
3. mserver5 in "top" was showing a small memory footprint and no CPU activity.
4. I tried to connect via mclient.
5. mserver5 started high CPU activity, quickly growing in virtual and resident memory usage. It took maybe 20 to 30 minutes to get a prompt in mclient; until then the client did not show a prompt but simply hung.
6. By the time the prompt appeared in mclient, sql_logs went down to 16k (from 24GB). I took that to mean it had committed some uncommitted data into the database by that time. I suppose the corruption happened during that commit.

NOTE: it didn't make sense to me then, but knowing there is corruption it may mean something: the size of the dbfarm folder went down a couple of gigs just when the commit was complete. I would think it should not shrink at that point, since it's moving data from the log to dbfarm.

--
View this message in context: http://www.nabble.com/corruption-on-db-restart-tp16737747p16743999.html
Sent from the monetdb-users mailing list archive at Nabble.com.
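The disk-usage observations in this message could be repeated during a restart with a small script. This is a minimal sketch, assuming the dbfarm and sql_logs paths quoted in the first message; `dir_size_bytes` is a hypothetical helper and not part of MonetDB. It reports apparent file sizes in bytes, so its numbers will not match du's block counts exactly.

```python
import os


def dir_size_bytes(path):
    """Return the sum of the sizes of all files below path, in bytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                total += os.path.getsize(full)
            except OSError:
                # The file vanished between listing and stat, for example a
                # log file removed while the server replays it; skip it.
                pass
    return total


if __name__ == "__main__":
    # Paths as they appear in the thread; adjust to the local installation.
    for directory in ("./var/MonetDB5/dbfarm", "./var/MonetDB5/sql_logs"):
        print(directory, dir_size_bytes(directory))
```

Running it once before connecting with mclient and again after the prompt appears would show whether dbfarm shrinks while sql_logs is emptied, as described in the NOTE above.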
We would like to debug this. As we expect this to be a problem in the database server (possibly related to your OS filesystem), we should start by running without the monetdb script and merovingian, i.e. start your mserver5 with

mserver5 --dbinit="include sql;" --dbname=X

The last part is added to make sure it will be an empty new database with the name X.

Could you repeat the process of starting, loading, stopping and restarting to check whether it still fails?

Also, could you find out where in the columns the 'zeros' start appearing, i.e. in which rows? Maybe it's related to page boundaries etc.

Are all these tests done on a Fedora Linux system? Could you give (possibly repeat) the exact details of your system?

Niels

PS we could take this off the list and return once we have found an answer.
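One way to answer the question of where the zeros start is to export the affected column in row order and scan it. The following is a minimal sketch, assuming the values are available as a Python list, and assuming 4-byte values and 4096-byte pages; both figures are assumptions, since the thread states neither the column width nor the page size. `first_zero_run` and `page_offset` are hypothetical helpers.

```python
def first_zero_run(values):
    """Return (start, length) of the first run of zeros, or None if there is none."""
    start = None
    for i, v in enumerate(values):
        if v == 0:
            if start is None:
                start = i            # first zero of a new run
        elif start is not None:
            return start, i - start  # run ended at the previous row
    if start is not None:
        return start, len(values) - start  # run extends to the last row
    return None


def page_offset(row, value_bytes=4, page_bytes=4096):
    """Byte offset of a row within its page; 0 means the row starts a page."""
    return (row * value_bytes) % page_bytes


if __name__ == "__main__":
    # Toy data for illustration only, not taken from the thread.
    column = [1] * 1024 + [0] * 2048 + [17] * 10
    run = first_zero_run(column)
    if run is not None:
        start, length = run
        print("zeros start at row", start, "and span", length, "rows")
        print("offset within page:", page_offset(start))
```

If the first zero row repeatedly lands at offset 0 for a plausible page size, that would support the page-boundary suspicion; an arbitrary offset would point elsewhere.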
See my recent post about invalid credentials. Apparently starting the database by hand with these parameters is not the same as starting it via merovingian.

--
View this message in context: http://www.nabble.com/corruption-on-db-restart-tp16737747p16750970.html
Sent from the monetdb-users mailing list archive at Nabble.com.
On 17-04-2008 11:42:33 -0700, mobigital1 wrote:
see my recent post about invalid credentials. apparently starting database by hand with these parameters is not the same as via merovingian.
That is correct. Merovingian adds a level of security by using a vaultfile. The vaultfile it uses is dbfarm/<database>/.vaultfile, and you need to pass it using --set monet_vaultkey=/path/to/file
participants (3)
- Fabian Groffen
- mobigital1
- Niels Nes