[MonetDB-users] Question about OIDs and scaling
I have been reading about MonetDB for a few days now and have played around with it some, and it seems to solve some of the ultra-complex queries that I'd like to do in an amazingly short time. Congrats! One thing that confuses me, though, is some scaling issues.

The data set that I'm looking at is potentially in the billions of traditional rows, with about 50 columns, so from my understanding that'd be 50 different BATs, each with a billion or more items. My concern is that OIDs seem to be restricted to 2^32-1, although from looking at virtual OIDs, it's unclear whether they would be useful in my situation. Any thoughts on stuffing that much data into the system, especially how it might impact dealing with VM issues? I'm more than happy to be told "this is the wrong approach" or the wrong system for this problem, as this is obviously a non-normal case.

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
Chris,

Thanks for your remarks and interest. A quick reaction; others will definitely follow up if my answer is incomplete.

Christopher Petrilli wrote:
> The data set that I'm looking at is potentially in the billions of
> traditional rows, with about 50 columns, so from my understanding
> that'd be 50 different BATs, each with a billion or more items.

Indeed, basically correct.

> My concern is that OIDs seem to be restricted to 2^32-1, although from
> looking at virtual OIDs, it's unclear whether they would be useful in
> my situation.

Depends on your platform. We happily work with 64-bit virtual OIDs on 64-bit platforms. On 32-bit platforms you ultimately have a problem, because you will indeed run out of address space quickly.

> Any thoughts on stuffing that much data into the system, especially
> how it might impact dealing with VM issues?

A naive mapping of huge BATs will ultimately slow your system down to the speed of your IO system, and things get worse if you are not careful about garbage-collecting intermediate results. We are playing with a TPC-H database of up to 100 GB right now, but are actively moving into the next realm. Two approaches are under way: one where we replace the kernel with MonetDB/X100, specially designed for this kind of problem (actually, we have a radio astronomy database in mind, which runs into the terabytes), and one using BAT groups in Monet/Five. The latter you can easily build on top of the currently released system, i.e. partition the BAT using the void ranges and adjust your algorithms to work on the partitions one (or a few) at a time. This is what we do to handle larger multimedia database problems.

Are you planning to use the kernel from SQL or directly from MIL? Tell us a little more about the kind of application you are working on. Perhaps we can give you a few helpful pointers.

> I'm more than happy to be told "this is the wrong approach" or the
> wrong system for this problem, as this is obviously a non-normal case.

We love non-normal cases and unconventional use. (But we also can't promise to solve all problems arising within 24 hours.)

regards, Martin
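To make the void-range partitioning idea above concrete, here is a minimal sketch. It is illustrative Python, not MonetDB's MIL or kernel API: a logical column is split into fixed-size partitions, each remembering the base OID of its dense (void) range, and a scan visits one partition at a time so only a slice of the column needs to be resident.

    from array import array

    PARTITION_SIZE = 10_000_000   # tune so one partition fits comfortably in RAM

    class Partition:
        def __init__(self, base_oid):
            self.base_oid = base_oid   # first OID of this dense (void) range
            self.values = array("q")   # the column payload for this slice

    def partition_column(values, size=PARTITION_SIZE):
        """Split one logical column (any sequence) into void-range partitions."""
        parts = []
        for base in range(0, len(values), size):
            part = Partition(base)
            part.values.extend(values[base:base + size])
            parts.append(part)
        return parts

    def select_range(parts, lo, hi):
        """Yield (oid, value) pairs with lo <= value < hi, visiting one
        partition at a time so the working set stays bounded."""
        for part in parts:
            for i, v in enumerate(part.values):
                if lo <= v < hi:
                    yield part.base_oid + i, v

The partition size is the single knob: it bounds how much of the column any one operation touches, which is exactly the property that keeps the VM pressure manageable on large data sets.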
On Thu, 31 Mar 2005 20:26:08 +0200, Martin Kersten wrote:
> We are playing with a TPC-H database of up to 100 GB right now, but are
> actively moving into the next realm. Two approaches are under way: one
> where we replace the kernel with MonetDB/X100, specially designed for
> this kind of problem (actually, we have a radio astronomy database in
> mind, which runs into the terabytes), and one using BAT groups in
> Monet/Five.
>
> The latter you can easily build on top of the currently released
> system, i.e. partition the BAT using the void ranges and adjust your
> algorithms to work on the partitions one (or a few) at a time. This is
> what we do to handle larger multimedia database problems.
Well, a little background: this is a system for managing events coming out of other systems, everything from web traffic to security information. It presents a couple of unique challenges:

* Huge volumes of data
* A high rate of data insertion (situations can demand 20,000 records/second)

Both of these make a traditional RDBMS a losing proposition, as they are tuned for transactional performance, which is something I don't care about. I care about durability, but data is never updated; it is only inserted and then queried (which is a bit more like an OLAP application, which is where my interest in MonetDB came from).
> Are you planning to use the kernel from SQL or directly from MIL? Tell
> us a little more about the kind of application you are working on.
> Perhaps we can give you a few helpful pointers.
Actually, I was planning to use MIL, since the SQL tends to become logarithmically complex. Also, I'll be using multiple backends to deal with the storage requirements, so breaking the MIL up seems actually a bit easier, from my view.
say "this is the wrong approach" or system for this problem, as this is obviously a non-normal case. We love non-normal cases and unconventional use. (But also can;t promises to solve all problems arising within 24 hours)
Unless I'm paying, I never expect promises :-) This is mostly an exploratory exercise right now, since I know from previous implementations where the system collapses, and I want to shore up that section.

Chris
--
| Christopher Petrilli
| petrilli@gmail.com
Thanks. Challenging application.

Loading MonetDB tables with very short (SQL) queries is not the most optimal road. A (very old) experiment of more than 3 years ago, doing simple MIL inserts into a table over a shared TCP/IP connection without any further optimization on a 4-year-old machine (running Linux), ran at ca. >10K inserts/second. The connection was mostly the bottleneck. A more recent little experiment shows 25K/sec. This can all be significantly improved with a little batching on both sides. The pitfall is to make sure your BATs are large enough to capture the intended updates without the need to resize themselves, because that causes IO.

If you go ahead with MIL production, we can give you some additional advice (that is of course lacking in the documentation ;-))

regards, Martin
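A minimal sketch of the pre-sizing advice, again in illustrative Python rather than MonetDB internals: allocate the column's buffer once for the expected peak volume, so batched appends never trigger the resize, and hence the IO, that Martin warns about. The capacity figure is an assumption taken from the 20,000 records/second mentioned earlier in the thread.

    from array import array

    class PresizedColumn:
        """Fixed-capacity column: appends never reallocate the buffer."""
        def __init__(self, capacity):
            self.buf = array("q", bytes(8 * capacity))  # one allocation up front
            self.n = 0                                  # slots currently in use

        def append_batch(self, values):
            batch = array("q", values)
            if self.n + len(batch) > len(self.buf):
                raise MemoryError("capacity exceeded; size for the peak load up front")
            self.buf[self.n:self.n + len(batch)] = batch
            self.n += len(batch)

    # One minute at the 20,000 records/second peak from the thread;
    # scale the horizon up for a real deployment.
    col = PresizedColumn(capacity=20_000 * 60)
    col.append_batch(range(1_000))
    print(col.n)   # -> 1000, with no reallocation along the way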
On Thu, 31 Mar 2005 21:04:50 +0200, Martin Kersten wrote:
> Thanks. Challenging application.
Indeed, but from previous experiments I know that properly solving it reduces some common tasks from days to seconds, which is more than worth the investment.
> Loading MonetDB tables with very short (SQL) queries is not the most
> optimal road. A (very old) experiment of more than 3 years ago, doing
> simple MIL inserts into a table over a shared TCP/IP connection without
> any further optimization on a 4-year-old machine (running Linux), ran
> at ca. >10K inserts/second. The connection was mostly the bottleneck.
Well, batching is more than acceptable. I currently use SQL*Loader in Oracle and the COPY syntax in PostgreSQL to accomplish this, batching anywhere from 100-1000 "records" depending, allowing for roughly 1-second intervals (tunable for latency). This is more than acceptable, and in fact 5 seconds would be OK too.
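The batching pattern Chris describes can be sketched in a few lines; this is generic Python with a stand-in flush() callback, not any particular loader's API. A batch is flushed when it reaches the size cap or when the latency budget (the tunable ~1-second interval) has expired, whichever comes first.

    import time

    class Batcher:
        """Flush when the batch is full, or when the oldest buffered record
        exceeds the latency budget (checked on each add)."""
        def __init__(self, flush, max_records=1000, max_latency=1.0):
            self.flush = flush                # stand-in for the real bulk-load call
            self.max_records = max_records
            self.max_latency = max_latency    # seconds; "tunable for latency"
            self.buf = []
            self.oldest = None                # arrival time of first buffered record

        def add(self, record):
            if self.oldest is None:
                self.oldest = time.monotonic()
            self.buf.append(record)
            if (len(self.buf) >= self.max_records
                    or time.monotonic() - self.oldest >= self.max_latency):
                self.flush(self.buf)
                self.buf, self.oldest = [], None

    batcher = Batcher(flush=lambda rows: print(f"flushing {len(rows)} rows"))
    for i in range(2500):
        batcher.add(("event", i))   # flushes twice at 1000 rows each

Note that a quiet stream only flushes on the next arrival; a production version would add a timer to bound latency even when no records come in.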
> A more recent little experiment shows 25K/sec. This can all be
> significantly improved with a little batching on both sides. The
> pitfall is to make sure your BATs are large enough to capture the
> intended updates without the need to resize themselves, because that
> causes IO.
Well, in this case it seems that pre-sizing would be more than acceptable, as you can do it adaptively based on the estimated load. Obviously I'm still learning MonetDB, so honestly I'm not sure I understood the comment about pointers, and how that applies to splitting across multiple "tables" for partitioning. Could you expand on that?
> If you go ahead with MIL production, we can give you some additional
> advice (that is of course lacking in the documentation ;-))
Always appreciated.

Chris
--
| Christopher Petrilli
| petrilli@gmail.com