Here are the CASE / CAST / SUBSTRING performance costs on a data set of about 200 million records, on a Windows system with 64 GB of RAM. This might help others as well.
# no conversion commands
system.time( dbGetQuery( db , "SELECT COUNT(*) , some_date AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.03    0.00    7.94

# SUBSTRING only
system.time( dbGetQuery( db , "SELECT COUNT(*) , SUBSTRING( some_date , 1 , 1 ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.00    0.00   16.39

# CAST only
system.time( dbGetQuery( db , "SELECT COUNT(*) , CAST( some_date AS VARCHAR(255) ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.02    0.00   35.58

# CASE only
system.time( dbGetQuery( db , "SELECT COUNT(*) , ( CASE WHEN ( some_date IS NULL ) THEN 'hello' ELSE some_date END ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.02    0.00    8.73

# CAST + SUBSTRING
system.time( dbGetQuery( db , "SELECT COUNT(*) , CAST( SUBSTRING( some_date , 1 , 1 ) AS DOUBLE ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
    0.0     0.0    34.5

# CASE + SUBSTRING
system.time( dbGetQuery( db , "SELECT COUNT(*) , ( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/' THEN SUBSTRING( some_date , 1 , 1 ) ELSE SUBSTRING( some_date , 1 , 2 ) END ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.00    0.00   51.49

# CASE + CAST + SUBSTRING
system.time( dbGetQuery( db , "SELECT COUNT(*) , CAST( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/' THEN SUBSTRING( some_date , 1 , 1 ) ELSE SUBSTRING( some_date , 1 , 2 ) END AS DOUBLE ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.00    0.00   73.03
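The same comparison pattern can be reproduced on any SQL backend. Here is a minimal Python/sqlite3 sketch of the idea (table and column names are invented, SQLite spells the function SUBSTR rather than SUBSTRING, and absolute timings on a toy in-memory table will of course look nothing like the numbers above):

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE some_table ( some_date TEXT )")
# 100,000 rows split evenly across two dates
cur.executemany("INSERT INTO some_table VALUES (?)",
                [("1/01/15",), ("2/01/15",)] * 50_000)

def timed(sql):
    """Run a query and return (elapsed seconds, result rows)."""
    t0 = time.perf_counter()
    rows = cur.execute(sql).fetchall()
    return time.perf_counter() - t0, rows

# no conversion: group on the raw column
plain_t, plain = timed(
    "SELECT COUNT(*), some_date FROM some_table GROUP BY some_date")

# CAST + SUBSTR applied per row before grouping
cast_t, cast = timed(
    "SELECT COUNT(*), CAST( SUBSTR( some_date , 1 , 1 ) AS REAL ) AS m "
    "FROM some_table GROUP BY m")

print(f"plain {plain_t:.4f}s vs cast+substr {cast_t:.4f}s")
```

Both queries return the same group counts; the per-row string work is what the elapsed-time differences above are measuring.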
On Wed, May 27, 2015 at 5:52 AM, Anthony Damico wrote:
Hi, I've been running some pretty big queries on a few tables (900 million records in total) on a 64 GB Windows machine, and I'm curious whether anyone can advise on which of these strategies are computationally expensive. My queries max out the available RAM a lot, and once they hit that ceiling the computation starts dragging. Here is pseudo-SQL showing the kinds of commands I've been running. Thanks!
============
(a) I am conducting a LEFT JOIN where not all records match; the non-matching records need to be zeroes.
In general, is this the least computationally demanding way to get an average amount per person, including the zeroes?
SELECT AVG( sum_amount ) AS avg_amt_per_person
FROM
  ( SELECT person_id ,
           SUM( CASE WHEN amount IS NULL THEN 0 ELSE amount END ) AS sum_amount
    FROM ( ( SELECT person_id FROM person_table ) AS a
           LEFT JOIN
           ( SELECT person_id , amount FROM amount_table ) AS b
           ON a.person_id = b.person_id ) AS temp_table
    GROUP BY person_id ) ;
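For what it's worth, most engines let you collapse the NULL-to-zero CASE into COALESCE, since a SUM over a group that is entirely NULL is itself NULL. A minimal sketch against an in-memory SQLite database (all table and column names here are made up to mirror the pseudo-SQL above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE person_table ( person_id INTEGER )")
cur.execute("CREATE TABLE amount_table ( person_id INTEGER , amount REAL )")

# three people; person 3 has no amounts and must count as zero
cur.executemany("INSERT INTO person_table VALUES (?)", [(1,), (2,), (3,)])
cur.executemany("INSERT INTO amount_table VALUES (?, ?)",
                [(1, 10.0), (1, 20.0), (2, 30.0)])

# COALESCE( SUM(amount), 0 ) turns person 3's all-NULL group into 0
avg = cur.execute("""
    SELECT AVG( sum_amount ) AS avg_amt_per_person
    FROM ( SELECT a.person_id ,
                  COALESCE( SUM( b.amount ) , 0 ) AS sum_amount
           FROM person_table AS a
           LEFT JOIN amount_table AS b
             ON a.person_id = b.person_id
           GROUP BY a.person_id )
""").fetchone()[0]
print(avg)  # (30 + 30 + 0) / 3 = 20.0
```

Whether COALESCE is actually cheaper than the CASE expression depends on the engine; the main point is that joining the bare tables directly, without the inner `SELECT person_id FROM ...` subqueries, gives the planner less to re-materialize.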
============
(b) I am on Windows (so no strptime), and I have dates in mm/dd/yy format, except that some January through September dates are m/dd/yy (i.e. the month can be either one or two digits). I am just calculating the month on the fly as part of my query. Is it foolish of me not to add a new, permanent column to the data set? I don't have a good sense of how much this segment slows down my query. It is one of my join/merge variables.
CAST( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/'
           THEN SUBSTRING( some_date , 1 , 1 )
           ELSE SUBSTRING( some_date , 1 , 2 ) END AS DOUBLE ) AS some_month
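The expression can be sanity-checked on a small sample, and the "permanent column" alternative is a one-time ALTER + UPDATE. A sketch using Python/sqlite3 (SQLite spells the function SUBSTR and uses REAL instead of DOUBLE; the sample dates are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# SQLite's spelling of the one-or-two-digit month extraction above
expr = """
    CAST( CASE WHEN SUBSTR( some_date , 2 , 1 ) = '/'
               THEN SUBSTR( some_date , 1 , 1 )
               ELSE SUBSTR( some_date , 1 , 2 ) END AS REAL )
"""

cur.execute("CREATE TABLE some_table ( some_date TEXT )")
cur.executemany("INSERT INTO some_table VALUES (?)",
                [("1/05/15",), ("12/31/14",), ("9/09/13",)])

# on-the-fly version: the string work runs on every query
months = [row[0] for row in
          cur.execute(f"SELECT {expr} AS some_month FROM some_table")]

# permanent-column version: pay the parse cost exactly once,
# then join/group on the stored column
cur.execute("ALTER TABLE some_table ADD COLUMN some_month REAL")
cur.execute(f"UPDATE some_table SET some_month = {expr}")
stored = [row[0] for row in
          cur.execute("SELECT some_month FROM some_table")]
print(months, stored)
```

Given the timings at the top of this message (CASE + CAST + SUBSTRING roughly 9x the plain query), materializing the column once, especially for a join key, seems likely to pay for itself quickly.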
============
(c) I have three tables with the same structure that get LEFT JOINed onto my main table, so right now I am doing something like:
main_table LEFT JOIN second_table LEFT JOIN third_table LEFT JOIN fourth_table
Is it smarter for me to do this instead?
CREATE TABLE stacked_table AS
  SELECT * FROM second_table
  UNION ALL SELECT * FROM third_table
  UNION ALL SELECT * FROM fourth_table ;
main_table LEFT JOIN stacked_table
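The stacking step itself is straightforward; a minimal Python/sqlite3 sketch with invented table names and a single made-up value column (whether one join against the stacked table beats three separate joins will depend on the engine's planner and on whether the three tables hold disjoint person_ids):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# three same-structured tables, as in the question
for name in ("second_table", "third_table", "fourth_table"):
    cur.execute(f"CREATE TABLE {name} ( person_id INTEGER , x REAL )")
cur.execute("INSERT INTO second_table VALUES (1, 1.0)")
cur.execute("INSERT INTO third_table  VALUES (2, 2.0)")
cur.execute("INSERT INTO fourth_table VALUES (3, 3.0)")

# one permanent stacked copy replaces three separate join targets
cur.execute("""
    CREATE TABLE stacked_table AS
      SELECT * FROM second_table
      UNION ALL SELECT * FROM third_table
      UNION ALL SELECT * FROM fourth_table
""")

n = cur.execute("SELECT COUNT(*) FROM stacked_table").fetchone()[0]
print(n)  # 3 rows: one from each source table
```

Note that UNION ALL keeps duplicates (unlike UNION), which is what you want here, and that the single `main_table LEFT JOIN stacked_table` only reproduces the three-join result if each person_id appears in at most one of the source tables.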