a few performance improvement questions
hi, i've been running some pretty big queries on a few tables (total of 900 million records) on a 64GB windows machine and i'm curious if anyone might provide advice about whether any of these strategies are computationally expensive? i am maxing out my available RAM in my queries a lot, and at the point it hits that ceiling, the computation starts dragging. here's my pseudo-sql that displays the kinds of commands i've been running. thanks!

============

(a) i am conducting a left join where not all records match. the non-matching records need to be zeroes.

in general, is this the least computationally demanding way to get an average amount per person, including the zeroes?

SELECT AVG( sum_amount ) AS avg_amt_per_person
FROM
( SELECT
      person_id ,
      SUM( CASE WHEN amount IS NULL THEN 0 ELSE amount END ) AS sum_amount
  FROM
      ( ( SELECT person_id FROM person_table ) AS a
        LEFT JOIN
        ( SELECT person_id , amount FROM amount_table ) AS b
        ON a.person_id = b.person_id ) AS temp_table
  GROUP BY person_id ) ;

============

(b) i am on windows (so no strptime) and i have dates in a mm/dd/yy format, but some january thru september dates are m/dd/yy (i.e. the month can be either one or two digits). i am just calculating the month as part of my query on-the-fly. is it foolish of me not to add a new, permanent column to the data set? i don't have a good sense of whether this segment slows down my query by a lot. it is one of my join/merge variables.

CAST( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/' THEN SUBSTRING( some_date , 1 , 1 ) ELSE SUBSTRING( some_date , 1 , 2 ) END AS DOUBLE ) AS some_month

============

(c) i have three tables with the same structure that get LEFT JOINed onto my main table. so right now i am doing something like --

main_table LEFT JOIN second_table LEFT JOIN third_table LEFT JOIN fourth_table

is it smarter for me to do this instead? --

CREATE TABLE stacked_table FROM second_table UNION ALL third_table UNION ALL fourth_table

main_table LEFT JOIN stacked_table
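(On (a): the NULL-to-zero step can also be written with COALESCE, which is standard SQL and does the same thing as the CASE expression. A minimal sketch, joining the two base tables directly and reusing the placeholder names from the post above; whether it is actually cheaper depends on the engine, it mainly drops the two wrapping subselects.)

SELECT AVG( sum_amount ) AS avg_amt_per_person
FROM ( SELECT a.person_id ,
              SUM( COALESCE( b.amount , 0 ) ) AS sum_amount
       FROM person_table AS a
       LEFT JOIN amount_table AS b
         ON a.person_id = b.person_id
       GROUP BY a.person_id ) AS temp_table ;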
here are the CASE / CAST / SUBSTRING performance costs on a data set with about 200 million records, on a windows system with 64gb of ram. this might help others as well.
# no conversion commands
system.time( dbGetQuery( db , "SELECT COUNT(*) , some_date AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.03    0.00    7.94

# SUBSTRING only
system.time( dbGetQuery( db , "SELECT COUNT(*) , SUBSTRING( some_date , 1 , 1 ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.00    0.00   16.39

# CAST only
system.time( dbGetQuery( db , "SELECT COUNT(*) , CAST( some_date AS VARCHAR(255) ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.02    0.00   35.58

# CASE only
system.time( dbGetQuery( db , "SELECT COUNT(*) , ( CASE WHEN ( some_date IS NULL ) THEN 'hello' ELSE some_date END ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.02    0.00    8.73

# CAST + SUBSTRING
system.time( dbGetQuery( db , "SELECT COUNT(*) , CAST( SUBSTRING( some_date , 1 , 1 ) AS DOUBLE ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
    0.0     0.0    34.5

# CASE + SUBSTRING
system.time( dbGetQuery( db , "SELECT COUNT(*) , ( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/' THEN SUBSTRING( some_date , 1 , 1 ) ELSE SUBSTRING( some_date , 1 , 2 ) END ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.00    0.00   51.49

# CASE + CAST + SUBSTRING
system.time( dbGetQuery( db , "SELECT COUNT(*) , CAST( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/' THEN SUBSTRING( some_date , 1 , 1 ) ELSE SUBSTRING( some_date , 1 , 2 ) END AS DOUBLE ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.00    0.00   73.03
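(Going by these timings, deriving some_month on the fly adds roughly 65 seconds to an otherwise ~8-second aggregation, so paying that cost once by materializing the month as a permanent column is one option. A minimal sketch, assuming the table really is named some_table and that a one-byte TINYINT is enough to hold a month number; neither assumption comes from this thread.)

-- materialize the derived month once, then GROUP BY / JOIN on the plain column
ALTER TABLE some_table ADD COLUMN some_month TINYINT;

UPDATE some_table
   SET some_month =
       CAST( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/'
                  THEN SUBSTRING( some_date , 1 , 1 )
                  ELSE SUBSTRING( some_date , 1 , 2 )
             END AS TINYINT ) ;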
On Wed, May 27, 2015 at 5:52 AM, Anthony Damico wrote:
hi, i've been running some pretty big queries on a few tables (total of 900 million records) on a 64GB windows machine and i'm curious if anyone might provide advice about whether any of these strategies are computationally expensive? i am maxing out my available RAM in my queries a lot, and at the point it hits that ceiling, the computation starts dragging. here's my pseudo-sql that displays the kinds of commands i've been running. thanks!
============
(a) i am conducting a left join where not all records match. the non-matching records need to be zeroes
in general, is this the least computationally demanding way to get an average amount per person, including the zeroes?
SELECT AVG( sum_amount ) AS avg_amt_per_person
FROM
( SELECT
      person_id ,
      SUM( CASE WHEN amount IS NULL THEN 0 ELSE amount END ) AS sum_amount
  FROM
      ( ( SELECT person_id FROM person_table ) AS a
        LEFT JOIN
        ( SELECT person_id , amount FROM amount_table ) AS b
        ON a.person_id = b.person_id ) AS temp_table
  GROUP BY person_id ) ;
============
(b) i am on windows (so no strptime) and i have dates in either a mm/dd/yy format but some january thru september are m/dd/yy (i.e. the month can either be one or two digits). i am just calculating the month as part of my query on-the-fly. is it foolish of me not to add a new, permanent column to the data set? i don't have a good sense of whether this segment slows down my query by a lot. it is one of my join/merge variables.
CAST( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/' THEN SUBSTRING( some_date , 1 , 1 ) ELSE SUBSTRING( some_date , 1 , 2 ) END AS DOUBLE ) AS some_month
============
(c) i have three tables with the same structure that get LEFT JOINed onto my main table. so right now i am doing something like --
main_table LEFT JOIN second_table LEFT JOIN third_table LEFT JOIN fourth_table
is it smarter for me to do this instead? --
CREATE TABLE stacked_table FROM second_table UNION ALL third_table UNION ALL fourth_table
main_table LEFT JOIN stacked_table
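(For reference, the stacking idea in (c) written out as concrete SQL might look like the sketch below. The WITH DATA clause, the source_table tag and the person_id/amount column list are all assumptions, since the thread only gives pseudo-sql. Note also that one stacked LEFT JOIN returns the three tables' rows stacked vertically rather than as three separate sets of columns, so it only substitutes for the chained joins when that shape is acceptable.)

-- stack the three look-alike tables once, tagging each row with its source (tag values are made up)
CREATE TABLE stacked_table AS
    SELECT person_id , amount , 2 AS source_table FROM second_table
    UNION ALL
    SELECT person_id , amount , 3 AS source_table FROM third_table
    UNION ALL
    SELECT person_id , amount , 4 AS source_table FROM fourth_table
WITH DATA ;

-- then a single left join instead of three
SELECT *
  FROM main_table
  LEFT JOIN stacked_table
    ON main_table.person_id = stacked_table.person_id ;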
On 27-05-15 11:52, Anthony Damico wrote:
hi, i've been running some pretty big queries on a few tables (total of 900 million records) on a 64GB windows machine and i'm curious if anyone might provide advice about whether any of these strategies are computationally expensive? i am maxing out my available RAM in my queries a lot, and at the point it hits that ceiling, the computation starts dragging. here's my pseudo-sql that displays the kinds of commands i've been running. thanks!
============
(a) i am conducting a left join where not all records match. the non-matching records need to be zeroes
in general, is this the least computationally demanding way to get an average amount per person, including the zeroes?
SELECT AVG( sum_amount ) AS avg_amt_per_person
FROM
( SELECT
      person_id ,
      SUM( CASE WHEN amount IS NULL THEN 0 ELSE amount END ) AS sum_amount
  FROM
      ( ( SELECT person_id FROM person_table ) AS a
        LEFT JOIN
        ( SELECT person_id , amount FROM amount_table ) AS b
        ON a.person_id = b.person_id ) AS temp_table
  GROUP BY person_id ) ;
Assuming person_id in person_table has unique values (so is a key) and all three columns (person_table.person_id, amount_table.person_id, amount_table.amount) are defined as not null, you may use the following variations:

SELECT AVG(sum_amount) AS avg_amt_per_person
FROM ( SELECT person_id,
              (CASE WHEN (SELECT SUM(amount) FROM amount_table WHERE person_table.person_id = amount_table.person_id) IS NULL
                    THEN 0
                    ELSE (SELECT SUM(amount) FROM amount_table WHERE person_table.person_id = amount_table.person_id) END) as sum_amount
       FROM person_table ) T;

or

SELECT AVG(sum_amount) AS avg_amt_per_person
FROM ( SELECT person_table.person_id,
              (CASE WHEN SUM(amount_table.amount) IS NULL THEN 0 ELSE SUM(amount_table.amount) END) as sum_amount
       FROM person_table
       LEFT JOIN amount_table on (person_table.person_id = amount_table.person_id)
       GROUP BY person_table.person_id ) T;

or

SELECT AVG(sum_amount) AS avg_amt_per_person
FROM ( SELECT person_id, SUM(amount) as sum_amount
       FROM amount_table
       GROUP BY person_id
       UNION ALL
       SELECT person_id, 0 as sum_amount
       FROM person_table
       WHERE person_id NOT IN (SELECT person_id FROM amount_table) ) T;

or

CREATE VIEW person_view AS
SELECT person_id,
       (SELECT SUM(amount) FROM amount_table WHERE amount IS NOT NULL AND person_table.person_id = amount_table.person_id) as sum_amount
FROM person_table;

SELECT AVG(sum_amount) AS avg_amt_per_person
FROM ( SELECT person_id,
              (CASE WHEN sum_amount IS NULL THEN 0 ELSE sum_amount END) as sum_amount
       FROM person_view ) T;

or

ALTER TABLE person_table ADD COLUMN sum_amount double not null default 0;
UPDATE person_table
   SET sum_amount = (SELECT SUM(amount) FROM amount_table WHERE amount IS NOT NULL AND person_table.person_id = amount_table.person_id)
 WHERE person_id IN (SELECT person_id FROM amount_table);

SELECT AVG(sum_amount) AS avg_amt_per_person FROM person_table;

This last one (adding a column to person_table and populating it) only works correctly when the source data (in this case the amount_table data) is static while the queries run, or is re-updated before a set of queries is run. It is an approach often used in data warehouses, where the source data is refreshed at known times (e.g. during the night) and the derivable redundant data can be re-derived after the nightly refresh.
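(One way to pick among these variations without guessing is to time each candidate directly, or to prefix it with MonetDB's TRACE (per-step timings) or EXPLAIN (plan only) statement modifiers. A sketch, not from the thread, applying TRACE to the LEFT JOIN variation, with COALESCE standing in for the equivalent CASE expression.)

TRACE
SELECT AVG( sum_amount ) AS avg_amt_per_person
FROM ( SELECT person_table.person_id ,
              COALESCE( SUM( amount_table.amount ) , 0 ) AS sum_amount
       FROM person_table
       LEFT JOIN amount_table ON ( person_table.person_id = amount_table.person_id )
       GROUP BY person_table.person_id ) T ;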
============
(b) i am on windows (so no strptime) and i have dates in either a mm/dd/yy format but some january thru september are m/dd/yy (i.e. the month can either be one or two digits). i am just calculating the month as part of my query on-the-fly. is it foolish of me not to add a new, permanent column to the data set? i don't have a good sense of whether this segment slows down my query by a lot. it is one of my join/merge variables.
CAST( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/' THEN SUBSTRING( some_date , 1 , 1 ) ELSE SUBSTRING( some_date , 1 , 2 ) END AS DOUBLE ) AS some_month
As the month number is expected to be an integer, I would cast it to SMALLINT (or TINYINT) instead of DOUBLE. Instead of replicating this expression in every query where you need the some_month value, create a view which extends the table with this some_month column, so the extraction expression lives in one place, and then use the view in the queries that need some_month. When the performance of the view is not good enough and the data is reasonably static, consider adding additional column(s) to the table, populated using a one-time UPDATE statement as presented above with the ALTER TABLE and UPDATE. With triggers (on insert and on update) you could automate the update of this derivable column data when new records are added or specific columns are updated.
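(A minimal sketch of the view approach described above, with the month cast to SMALLINT. The view name some_table_ext is made up, and some_table / some_date are the placeholder names carried over from the benchmark post, not objects defined anywhere in this thread.)

CREATE VIEW some_table_ext AS
SELECT t.* ,
       CAST( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/'
                  THEN SUBSTRING( some_date , 1 , 1 )
                  ELSE SUBSTRING( some_date , 1 , 2 )
             END AS SMALLINT ) AS some_month
  FROM some_table AS t ;

-- queries can then filter, join or group on some_table_ext.some_month
-- instead of repeating the extraction expression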
============
(c) i have three tables with the same structure that get LEFT JOINed onto my main table. so right now i am doing something like --
main_table LEFT JOIN second_table LEFT JOIN third_table LEFT JOIN fourth_table
is it smarter for me to do this instead? --
CREATE TABLE stacked_table FROM second_table UNION ALL third_table UNION ALL fourth_table
main_table LEFT JOIN stacked_table
wow, martin, your response to (b) is *exactly* what i was looking for. thank you!!

for (a), i'm overwhelmed! ;) i could certainly speed test all of your alternatives, but i'm curious - in general - why certain strategies might be better/worse than others? when improving performance, do you have a sense of which runs faster? looks like the options are: nested CASE statements, JOINs, UNION ALL+WHERE, VIEWs, UPDATE+WHERE. are some of these options better at conserving RAM - and therefore avoiding ultra-slow disk paging?

it'd be nice to know about any "best practices" but it'd be fair of you to tell me that i just need to try for myself and see. thanks again for your time. -anthony

On Thu, May 28, 2015 at 12:36 PM, martin van dinther <martin.van.dinther@monetdbsolutions.com> wrote:
On 27-05-15 11:52, Anthony Damico wrote:
hi, i've been running some pretty big queries on a few tables (total of 900 million records) on a 64GB windows machine and i'm curious if anyone might provide advice about whether any of these strategies are computationally expensive? i am maxing out my available RAM in my queries a lot, and at the point it hits that ceiling, the computation starts dragging. here's my pseudo-sql that displays the kinds of commands i've been running. thanks!
============
(a) i am conducting a left join where not all records match. the non-matching records need to be zeroes
in general, is this the least computationally demanding way to get an average amount per person, including the zeroes?
SELECT AVG( sum_amount ) AS avg_amt_per_person
FROM
( SELECT
      person_id ,
      SUM( CASE WHEN amount IS NULL THEN 0 ELSE amount END ) AS sum_amount
  FROM
      ( ( SELECT person_id FROM person_table ) AS a
        LEFT JOIN
        ( SELECT person_id , amount FROM amount_table ) AS b
        ON a.person_id = b.person_id ) AS temp_table
  GROUP BY person_id ) ;
Assuming person_id in person_table has unique values (so is a key) and all three columns (person_table.person_id, amount_table.person_id, amount_table.amount) are defined as not null, you may use the following variations:
SELECT AVG(sum_amount) AS avg_amt_per_person FROM ( SELECT person_id, (CASE WHEN (SELECT SUM(amount) FROM amount_table WHERE person_table.person_id = amount_table.person_id) IS NULL THEN 0 ELSE (SELECT SUM(amount) FROM amount_table WHERE person_table.person_id = amount_table.person_id) END) as sum_amount FROM person_table ) T;
or
SELECT AVG(sum_amount) AS avg_amt_per_person FROM ( SELECT person_table.person_id, (CASE WHEN SUM(amount_table.amount) IS NULL THEN 0 ELSE SUM(amount_table.amount) END)as sum_amount FROM person_table LEFT JOIN amount_table on (person_table.person_id = amount_table.person_id) GROUP BY person_table.person_id ) T;
or
SELECT AVG(sum_amount) AS avg_amt_per_person FROM ( SELECT person_id, SUM(amount) as sum_amount FROM amount_table GROUP BY person_id UNION ALL SELECT person_id, 0 as sum_amount FROM person_table WHERE person_id NOT IN (SELECT person_id FROM amount_table) ) T;
or
CREATE VIEW person_view AS SELECT person_id, (SELECT SUM(amount) FROM amount_table WHERE amount IS NOT NULL AND person_table.person_id = amount_table.person_id) as sum_amount FROM person_table;
SELECT AVG(sum_amount) AS avg_amt_per_person FROM ( SELECT person_id, (CASE WHEN sum_amount IS NULL THEN 0 ELSE sum_amount END) as sum_amount FROM person_view ) T;
or
ALTER TABLE person_table ADD COLUMN sum_amount double not null default 0; UPDATE person_table SET sum_amount = (SELECT SUM(amount) FROM amount_table WHERE amount IS NOT NULL AND person_table.person_id = amount_table.person_id) WHERE person_id IN (SELECT person_id FROM amount_table);
SELECT AVG(sum_amount) AS avg_amt_per_person FROM person_table;
This last one (adding a column to person_table and populating it) only works correctly when the source data (in this case the amount_table data) is static while the queries run, or is re-updated before a set of queries is run. It is an approach often used in data warehouses, where the source data is refreshed at known times (e.g. during the night) and the derivable redundant data can be re-derived after the nightly refresh.
============
(b) i am on windows (so no strptime) and i have dates in either a mm/dd/yy format but some january thru september are m/dd/yy (i.e. the month can either be one or two digits). i am just calculating the month as part of my query on-the-fly. is it foolish of me not to add a new, permanent column to the data set? i don't have a good sense of whether this segment slows down my query by a lot. it is one of my join/merge variables.
CAST( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/' THEN SUBSTRING( some_date , 1 , 1 ) ELSE SUBSTRING( some_date , 1 , 2 ) END AS DOUBLE ) AS some_month
As the month number is expected to be an integer, I would cast it to SMALLINT (or TINYINT) instead of DOUBLE. Instead of replicating this expression in every query where you need the some_month value, create a view which extends the table with this some_month column, so the extraction expression lives in one place, and then use the view in the queries that need some_month. When the performance of the view is not good enough and the data is reasonably static, consider adding additional column(s) to the table, populated using a one-time UPDATE statement as presented above with the ALTER TABLE and UPDATE. With triggers (on insert and on update) you could automate the update of this derivable column data when new records are added or specific columns are updated.
============
(c) i have three tables with the same structure that get LEFT JOINed onto my main table. so right now i am doing something like --
main_table LEFT JOIN second_table LEFT JOIN third_table LEFT JOIN fourth_table
is it smarter for me to do this instead? --
CREATE TABLE stacked_table FROM second_table UNION ALL third_table UNION ALL fourth_table
main_table LEFT JOIN stacked_table