[MonetDB-users] Question about OIDs and scaling
I have been reading about MonetDB for a few days now and have played around with it some, and it seems to solve some of the ultra-complex queries that I'd like to do in an amazingly short time. Congrats! One thing that confuses me, though, is some scaling issues.

The data set that I'm looking at is potentially in the billions of traditional rows, with about 50 columns, so from my understanding that'd be 50 different BATs, each with a billion or more items. My concern is that OIDs seem to be restricted to 2^32-1, although from looking at virtual OIDs, it's unclear whether they would be useful in my situation. Any thoughts on stuffing that much data into the system, especially how it might impact dealing with VM issues? I'm more than happy to be told "this is the wrong approach" or the wrong system for this problem, as this is obviously a non-normal case.

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
Chris,

Thanks for your remarks and interest. A quick reaction; others will definitely follow up if my answer is incomplete.

Christopher Petrilli wrote:
> The data set that I'm looking at is potentially in the billions of
> traditional rows, with about 50 columns, so from my understanding
> that'd be 50 different BATs, each with a billion or more items.

Indeed, basically correct.

> My concern is that OIDs seem to be restricted to 2^32-1, although from
> looking at virtual OIDs, it's unclear whether they would be useful in
> my situation.

Depends on your platform. We happily work with 64-bit virtual OIDs on 64-bit platforms. On 32-bit platforms you ultimately have a problem, because you will indeed run out of address space quickly.

> Any thoughts on stuffing that much data into the system, especially
> how it might impact dealing with VM issues?

A naive mapping of huge BATs will ultimately slow your system down to the speed of your IO system, and things get worse if you are not careful about garbage-collecting intermediate results. We are playing with a TPC-H database of up to 100 GB right now, but are actively moving into the next realm. Two approaches are under way: one where we replace the kernel with MonetDB/X100, specially designed for this kind of problem (actually, we have a radio astronomy database in mind, which runs into the terabytes), and one using BAT groups in Monet/Five. The latter you can easily build on top of the currently released system, i.e. partition the BAT using the void ranges and adjust your algorithms to work on the partitions one (or a few) at a time. This is what we do to handle larger multimedia database problems.

Are you planning to use the kernel from SQL or directly from MIL? Tell us a little more about the kind of application you are working on. Perhaps we can give you a few helpful pointers.

> I'm more than happy to be told "this is the wrong approach" or the
> wrong system for this problem, as this is obviously a non-normal case.

We love non-normal cases and unconventional use. (But we also can't promise to solve all problems arising within 24 hours.)

regards, Martin
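To make the void-range partitioning idea above concrete, here is a minimal sketch. It is illustrative Python, not MonetDB's MIL or kernel API: a logical column is split into fixed-size partitions, each remembering the base OID of its dense (void) range, and a scan visits one partition at a time so only a slice of the column needs to be resident.

    from array import array

    PARTITION_SIZE = 10_000_000   # tune so one partition fits comfortably in RAM

    class Partition:
        def __init__(self, base_oid):
            self.base_oid = base_oid   # first OID of this dense (void) range
            self.values = array("q")   # the column payload for this slice

    def partition_column(values, size=PARTITION_SIZE):
        """Split one logical column (any sequence) into void-range partitions."""
        parts = []
        for base in range(0, len(values), size):
            part = Partition(base)
            part.values.extend(values[base:base + size])
            parts.append(part)
        return parts

    def select_range(parts, lo, hi):
        """Yield (oid, value) pairs with lo <= value < hi, visiting one
        partition at a time so the working set stays bounded."""
        for part in parts:
            for i, v in enumerate(part.values):
                if lo <= v < hi:
                    yield part.base_oid + i, v

The partition size is the single knob: it bounds how much of the column any one operation touches, which is exactly the property that keeps the VM pressure manageable on large data sets.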
On Thu, 31 Mar 2005 20:26:08 +0200, Martin Kersten wrote:
> We are playing with a TPC-H database of up to 100 GB right now, but are
> actively moving into the next realm. Two approaches are under way: one
> where we replace the kernel with MonetDB/X100, specially designed for
> this kind of problem (actually, we have a radio astronomy database in
> mind, which runs into the terabytes), and one using BAT groups in
> Monet/Five.
>
> The latter you can easily build on top of the currently released
> system, i.e. partition the BAT using the void ranges and adjust your
> algorithms to work on the partitions one (or a few) at a time. This is
> what we do to handle larger multimedia database problems.
Well, a little background: this is a system for managing events coming out of other systems, everything from web traffic to security information. It presents a couple of unique challenges:

* Huge volumes of data
* A high rate of data insertion (situations can demand 20,000 records/second)

Both of these make a traditional RDBMS a losing proposition, as they are tuned for transactional performance, which is something I don't care about. I care about durability, but data is never updated; it is only inserted and then queried (which is a bit more like an OLAP application, which is where my interest in MonetDB came from).
> Are you planning to use the kernel from SQL or directly from MIL? Tell
> us a little more about the kind of application you are working on.
> Perhaps we can give you a few helpful pointers.
Actually, I was planning to use MIL, since the SQL tends to become logarithmically complex. Also, I'll be using multiple backends to deal with the storage requirements, so breaking the MIL up seems actually a bit easier, from my view.
say "this is the wrong approach" or system for this problem, as this is obviously a non-normal case. We love non-normal cases and unconventional use. (But also can;t promises to solve all problems arising within 24 hours)
Unless I'm paying, I never expect promises :-) This is mostly an exploratory exercise right now, since I know from previous implementations where the system collapses, and I want to shore up that section.

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
Thanks. Challenging application.

Loading MonetDB tables with very short (SQL) queries is not the most optimal road. A (very old) experiment of more than 3 years ago, doing simple MIL inserts into a table over a shared TCP/IP connection without any further optimization on a 4-year-old machine (running Linux), ran at ca. >10K inserts/second. The connection was mostly the bottleneck. A more recent little experiment shows 25K/sec. This can all be significantly improved with a little batching on both sides. The pitfall is to make sure your BATs are large enough to capture the intended updates without the need to resize themselves, because that causes IO.

If you go ahead with MIL production, we can give you some additional advice (that is of course lacking in the documentation ;-))

regards, Martin
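A minimal sketch of the pre-sizing advice, again in illustrative Python rather than MonetDB internals: allocate the column's buffer once for the expected peak volume, so batched appends never trigger the resize, and hence the IO, that Martin warns about. The capacity figure is an assumption taken from the 20,000 records/second mentioned earlier in the thread.

    from array import array

    class PresizedColumn:
        """Fixed-capacity column: appends never reallocate the buffer."""
        def __init__(self, capacity):
            self.buf = array("q", bytes(8 * capacity))  # one allocation up front
            self.n = 0                                  # slots currently in use

        def append_batch(self, values):
            batch = array("q", values)
            if self.n + len(batch) > len(self.buf):
                raise MemoryError("capacity exceeded; size for the peak load up front")
            self.buf[self.n:self.n + len(batch)] = batch
            self.n += len(batch)

    # One minute at the 20,000 records/second peak from the thread;
    # scale the horizon up for a real deployment.
    col = PresizedColumn(capacity=20_000 * 60)
    col.append_batch(range(1_000))
    print(col.n)   # -> 1000, with no reallocation along the way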
On Thu, 31 Mar 2005 21:04:50 +0200, Martin Kersten wrote:
> Thanks. Challenging application.
Indeed, but from previous experiments I know that properly solving it reduces some common tasks from days to seconds, which is more than worth the investment.
> Loading MonetDB tables with very short (SQL) queries is not the most
> optimal road. A (very old) experiment of more than 3 years ago, doing
> simple MIL inserts into a table over a shared TCP/IP connection without
> any further optimization on a 4-year-old machine (running Linux), ran
> at ca. >10K inserts/second. The connection was mostly the bottleneck.
Well, batching is more than acceptable. I currently use SQL*Loader in Oracle and the COPY syntax in PostgreSQL to accomplish this, batching anywhere from 100-1000 "records" depending, allowing for roughly 1-second intervals (tunable for latency). This is more than acceptable, and in fact 5 seconds would be OK too.
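The batching pattern Chris describes can be sketched in a few lines; this is generic Python with a stand-in flush() callback, not any particular loader's API. A batch is flushed when it reaches the size cap or when the latency budget (the tunable ~1-second interval) has expired, whichever comes first.

    import time

    class Batcher:
        """Flush when the batch is full, or when the oldest buffered record
        exceeds the latency budget (checked on each add)."""
        def __init__(self, flush, max_records=1000, max_latency=1.0):
            self.flush = flush                # stand-in for the real bulk-load call
            self.max_records = max_records
            self.max_latency = max_latency    # seconds; "tunable for latency"
            self.buf = []
            self.oldest = None                # arrival time of first buffered record

        def add(self, record):
            if self.oldest is None:
                self.oldest = time.monotonic()
            self.buf.append(record)
            if (len(self.buf) >= self.max_records
                    or time.monotonic() - self.oldest >= self.max_latency):
                self.flush(self.buf)
                self.buf, self.oldest = [], None

    batcher = Batcher(flush=lambda rows: print(f"flushing {len(rows)} rows"))
    for i in range(2500):
        batcher.add(("event", i))   # flushes twice at 1000 rows each

Note that a quiet stream only flushes on the next arrival; a production version would add a timer to bound latency even when no records come in.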
> A more recent little experiment shows 25K/sec. This can all be
> significantly improved with a little batching on both sides. The
> pitfall is to make sure your BATs are large enough to capture the
> intended updates without the need to resize themselves, because that
> causes IO.
Well, in this case it seems that pre-sizing would be more than acceptable, as you can do it adaptively based on the estimated load. Obviously I'm still learning MonetDB, so honestly I'm not sure I understood the comment about pointers, and how that applies to splitting across multiple "tables" for partitioning. Could you expand on that?
> If you go ahead with MIL production, we can give you some additional
> advice (that is of course lacking in the documentation ;-))
Always appreciated.

Chris
--
| Christopher Petrilli
| petrilli@gmail.com