
Hi,
sorry for taking up your time. Indeed, the data files were corrupt and not sorted, due to a bug in the program that generates them. Now everything works as it should. Thank you very much.
Regards, Sebastian
On 08.08.2013 16:28, Hannes Mühleisen wrote:
Hello Sebastian,
I just tried this on my local installation of the latest release version of MonetDB:
cat test.psv
1|N
2|N
3|N
4|N
5|N
...
then
create table genestuff (siteid int, base string);
copy into genestuff from '.../test.psv';
select "column","sorted" from storage() where "table"='genestuff';
+--------+--------+
| column | sorted |
+========+========+
| siteid | true   |
| base   | true   |
+--------+--------+
Having successfully loaded sorted datasets in the two-digit Gigabyte range into MonetDB last week, I can confirm this is working fine. Please scrutinize your data files a bit more.
Best,
Hannes
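The advice above to scrutinize the data files can be sketched as a small check that runs before COPY INTO. This is only an illustration, not part of MonetDB: the function name `first_unsorted_line` is hypothetical, and it assumes each row starts with an integer key followed by `|`, as in the sample rows in this thread. It treats non-decreasing keys as sorted; the exact criterion MonetDB applies is not stated in this thread.

```python
from typing import Iterable, Optional, Tuple


def first_unsorted_line(
    lines: Iterable[str], sep: str = "|"
) -> Optional[Tuple[int, int, int]]:
    """Return (line_number, previous_key, key) for the first row whose
    leading integer key is smaller than the key before it, or None if
    the keys never decrease. (Hypothetical helper, not a MonetDB API.)"""
    previous = None
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # int() raises ValueError if the key field is not a valid integer,
        # which also surfaces corrupt rows.
        key = int(line.split(sep, 1)[0])
        if previous is not None and key < previous:
            return number, previous, key
        previous = key
    return None
```

Passing an open file object instead of a list reads the file line by line, so memory use stays small even for a very large input.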
On 08/08/2013 04:17 PM, Sebastian Dorok wrote:
;)
I use one COPY INTO:
COPY INTO genomics.grch36 FROM '/path/to/file';
Regards, Sebastian
On 08.08.2013 15:50, Hannes Mühleisen wrote:
Hello Sebastian,
On 08/08/2013 01:05 PM, Sebastian Dorok wrote:
sql>select "column","sorted" from storage() where "table"='grch36';
+------------------+--------+
| column           | sorted |
+==================+========+
| site             | false  |
| base             | false  |
| grch36_site_pkey | true   |
+------------------+--------+
Apparently it is not recognized.
Apparently your file is not sorted :)
Jokes aside, do you load the data in one COPY INTO run or several?
Best,
Hannes
Here is what the first lines of the CSV file look like:
1|N
2|N
3|N
4|N
5|N
6|N
7|N
8|N
9|N
10|N
11|N
12|N
13|N
14|N
...
On 08.08.2013 12:49, Hannes Mühleisen wrote:
On 08/08/2013 12:46 PM, Sebastian Dorok wrote:
I have some slow queries that I want to accelerate by utilizing information about sorting.
My table definition:
create table genomics.grch36 (site bigint, base char(1));
I use COPY INTO to populate the table. The source file is 40GB in size and contains more than 3 billion rows. I can guarantee that the data in the file is ordered by site.
I query the data like this:
sql>select base from genomics.grch36 where site between 10000 and 10010;
+------+
| base |
+======+
| N    |
| T    |
| A    |
| A    |
| C    |
| C    |
| C    |
| T    |
| A    |
| A    |
| C    |
+------+
11 tuples (5m 23s)
I think 5 minutes seems too long for this query. A primary key or an index on 'site' doesn't help, or at least isn't recognized in query execution. I think MonetDB would benefit from knowing that the data is ordered by site. Wouldn't it?
Actually, the COPY INTO should recognize the "sortedness" and mark the column accordingly.
What is the output of running
select "column","sorted" from storage() where "table"='genomics.grch36';
_______________________________________________ users-list mailing list users-list@monetdb.org http://mail.monetdb.org/mailman/listinfo/users-list