Hi,
I'm running some analyses on MonetDB and am getting quite bad results. At first I thought the Operator-at-a-Time model was to blame. I could reproduce the behaviour that predicates on more columns result in linear growth of the execution time, which is a direct consequence of the Operator-at-a-Time model. But now I have realized that the behaviour depends heavily on the disk used in the system. If the database is on my primary disk storage, I get quite good results. But because the primary disk storage in our experimental PC is quite limited, I have been working on a NAS with 11 TB. When I launch the same query on the NAS, it doesn't take 0.4 s, but 25 s! After further testing I could see with iotop that the mserver is constantly writing something to the disks. But why is that so?
I'm running Ubuntu 14.04 64-bit with two Intel Xeon E5-2690 CPUs and 62 GB RAM. I have installed the TPC-H benchmark. Running
select count(*) from lineitem where l_orderkey%2=0;
on my primary disk storage took about 400 ms. On the external disk storage it took about 23 s. The table contains 149996355 tuples, and I created the TPC-H database with a scale factor of 25.
While running this query
SELECT anon_1.sample_id AS anon_1_sample_id, anon_1.variant_id AS anon_1_variant_id, anon_1.qual AS anon_1_qual, anon_1.is_heterozygous AS anon_1_is_heterozygous, anon_1.read_depth AS anon_1_read_depth, anon_1.ref_depth AS anon_1_ref_depth, anon_1.alt_depth AS anon_1_alt_depth, anon_1.strand_bias AS anon_1_strand_bias, anon_1.qual_by_depth AS anon_1_qual_by_depth, anon_1.mapping_qual AS anon_1_mapping_qual, anon_1.haplotype_score AS anon_1_haplotype_score, anon_1.mapping_qual_bias AS anon_1_mapping_qual_bias, anon_1.read_pos_bias AS anon_1_read_pos_bias, annotations.variant_id AS annotations_variant_id, annotations.feature_id AS annotations_feature_id, annotations.ref_codon AS annotations_ref_codon, annotations.alt_codon AS annotations_alt_codon, annotations.ref_acid AS annotations_ref_acid, annotations.alt_acid AS annotations_alt_acid, annotations.type AS annotations_type, annotations.region AS annotations_region, annotations.splice_dist AS annotations_splice_dist FROM (SELECT calls.sample_id AS sample_id, calls.variant_id AS variant_id, calls.qual AS qual, calls.is_heterozygous AS is_heterozygous, calls.read_depth AS read_depth, calls.ref_depth AS ref_depth, calls.alt_depth AS alt_depth, calls.strand_bias AS strand_bias, calls.qual_by_depth AS qual_by_depth, calls.mapping_qual AS mapping_qual, calls.haplotype_score AS haplotype_score, calls.mapping_qual_bias AS mapping_qual_bias, calls.read_pos_bias AS read_pos_bias FROM calls JOIN samples ON samples.id = calls.sample_id JOIN patients ON patients.id = samples.patient_id JOIN diseases ON diseases.id = samples.disease_id LEFT OUTER JOIN (SELECT calls.sample_id AS sample_id, calls.variant_id AS variant_id, calls.qual AS qual, calls.is_heterozygous AS is_heterozygous, calls.read_depth AS read_depth, calls.ref_depth AS ref_depth, calls.alt_depth AS alt_depth, calls.strand_bias AS strand_bias, calls.qual_by_depth AS qual_by_depth, calls.mapping_qual AS mapping_qual, calls.haplotype_score AS haplotype_score, 
calls.mapping_qual_bias AS mapping_qual_bias, calls.read_pos_bias AS read_pos_bias FROM calls WHERE calls.qual > 0 AND calls.sample_id IN (227, 230, 233, 234, 237, 190, 195, 198, 199, 203, 270, 273, 276, 189, 343, 366, 367, 368)) AS anon_2 ON anon_2.variant_id = calls.variant_id LEFT OUTER JOIN known ON known.variant_id = calls.variant_id AND known.source_id IN (1, 2, 20, 19, 46) WHERE samples.accession IN ('17041R5', '20195R', '21984R', '23273R', '23390R', '13264R', '18337R', '18533R', '19811R', '20039R', '21776R', '21809R', '22927R', '17294R', '17071R', '21016R', 'Greif1R', '18337T', '18533T', '19811T', '20039T', '21809T', '22927T', '20195T', '21984T', '23273T', '23390T', '17041T', '17294T', '17071T', '21016T', '13264T', '21776T', 'Greif1T1S') AND calls.qual >= 50 AND anon_2.variant_id IS NULL AND (known.variant_id IS NULL OR known.clinical AND (NOT known.clinical_significance = 2) OR known.precious OR known.locus_specific_db)) AS anon_1 JOIN variants ON variants.id = anon_1.variant_id JOIN annotations ON variants.id = annotations.variant_id JOIN features ON features.id = annotations.feature_id JOIN transcripts ON transcripts.id = features.transcript_id JOIN genes ON genes.id = transcripts.gene_id WHERE (variants.is_transition IS NULL OR variants.is_transversion IS NULL OR (variants.is_transition OR variants.is_transversion) AND (anon_1.strand_bias IS NULL OR anon_1.strand_bias <= 60.0) AND (anon_1.qual_by_depth IS NULL OR anon_1.qual_by_depth >= 2.0) AND (anon_1.mapping_qual IS NULL OR anon_1.mapping_qual >= 40.0) AND (anon_1.haplotype_score IS NULL OR anon_1.haplotype_score <= 13.0) AND (anon_1.mapping_qual_bias IS NULL OR anon_1.mapping_qual_bias >= -12.5) AND (anon_1.read_pos_bias IS NULL OR anon_1.read_pos_bias >= -8.0) OR NOT (variants.is_transition OR variants.is_transversion) AND (anon_1.strand_bias IS NULL OR anon_1.strand_bias <= 200.0) AND (anon_1.qual_by_depth IS NULL OR anon_1.qual_by_depth >= 2.0) AND (anon_1.read_pos_bias IS NULL OR 
anon_1.read_pos_bias >= -20.0)) AND (abs(annotations.splice_dist) <= 10 OR annotations.region = 2 AND (NOT annotations.type = 2) OR annotations.region = 3 OR annotations.region = 1);
I could see that the mserver is constantly writing to the disks.
So why is MonetDB constantly writing to the disks, even when the query is as simple as the TPC-H query above? The data should fit into memory without problems. Any ideas or suggestions?
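As a rough back-of-the-envelope check of the "fits into memory" claim for the TPC-H query, here is a minimal sketch. It assumes l_orderkey is stored as a 4-byte or 8-byte integer (I have not verified the actual column type), and the helper name column_size_gib is only for illustration. It estimates the raw size of a single column and says nothing about the intermediates MonetDB materializes.

```python
# Rough estimate of the raw size of one lineitem column.
# ROWS and RAM_GIB are taken from the figures in this post;
# the value widths (4 or 8 bytes) are assumptions.
ROWS = 149_996_355
RAM_GIB = 62


def column_size_gib(rows, bytes_per_value):
    """Return the raw size in GiB of a fixed-width column."""
    return rows * bytes_per_value / 2**30


if __name__ == "__main__":
    for width in (4, 8):
        size = column_size_gib(ROWS, width)
        print(f"{width}-byte values: {size:.2f} GiB "
              f"({size / RAM_GIB:.1%} of {RAM_GIB} GiB RAM)")
```

Even with 8-byte values, the column and one intermediate of the same size should be a small fraction of the available RAM.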
Kind Regards,
Martin