Here are the CASE / CAST / SUBSTRING performance costs on a data set of about 200 million records, on a Windows system with 64 GB of RAM. This might help others as well.
# no conversion commands
system.time( dbGetQuery( db , "SELECT COUNT(*) , some_date AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.03    0.00    7.94

# SUBSTRING only
system.time( dbGetQuery( db , "SELECT COUNT(*) , SUBSTRING( some_date , 1 , 1 ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.00    0.00   16.39

# CAST only
system.time( dbGetQuery( db , "SELECT COUNT(*) , CAST( some_date AS VARCHAR(255) ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.02    0.00   35.58

# CASE only
system.time( dbGetQuery( db , "SELECT COUNT(*) , ( CASE WHEN ( some_date IS NULL ) THEN 'hello' ELSE some_date END ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.02    0.00    8.73

# CAST + SUBSTRING
system.time( dbGetQuery( db , "SELECT COUNT(*) , CAST( SUBSTRING( some_date , 1 , 1 ) AS DOUBLE ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
    0.0     0.0    34.5

# CASE + SUBSTRING
system.time( dbGetQuery( db , "SELECT COUNT(*) , ( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/' THEN SUBSTRING( some_date , 1 , 1 ) ELSE SUBSTRING( some_date , 1 , 2 ) END ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.00    0.00   51.49

# CASE + CAST + SUBSTRING
system.time( dbGetQuery( db , "SELECT COUNT(*) , CAST( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/' THEN SUBSTRING( some_date , 1 , 1 ) ELSE SUBSTRING( some_date , 1 , 2 ) END AS DOUBLE ) AS some_month FROM some_table GROUP BY some_month" ) )
   user  system elapsed
   0.00    0.00   73.03
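The same comparison pattern can be reproduced on any SQL backend. Here is a minimal Python/sqlite3 sketch of the idea (table and column names are invented, SQLite spells the function SUBSTR rather than SUBSTRING, and absolute timings on a toy in-memory table will of course look nothing like the numbers above):

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE some_table ( some_date TEXT )")
# 100,000 rows split evenly across two dates
cur.executemany("INSERT INTO some_table VALUES (?)",
                [("1/01/15",), ("2/01/15",)] * 50_000)

def timed(sql):
    """Run a query and return (elapsed seconds, result rows)."""
    t0 = time.perf_counter()
    rows = cur.execute(sql).fetchall()
    return time.perf_counter() - t0, rows

# no conversion: group on the raw column
plain_t, plain = timed(
    "SELECT COUNT(*), some_date FROM some_table GROUP BY some_date")

# CAST + SUBSTR applied per row before grouping
cast_t, cast = timed(
    "SELECT COUNT(*), CAST( SUBSTR( some_date , 1 , 1 ) AS REAL ) AS m "
    "FROM some_table GROUP BY m")

print(f"plain {plain_t:.4f}s vs cast+substr {cast_t:.4f}s")
```

Both queries return the same group counts; the per-row string work is what the elapsed-time differences above are measuring.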
On Wed, May 27, 2015 at 5:52 AM, Anthony Damico wrote:
Hi, I've been running some pretty big queries on a few tables (900 million records in total) on a 64 GB Windows machine, and I'm curious whether anyone can advise on which of these strategies are computationally expensive. My queries max out the available RAM a lot, and once they hit that ceiling the computation starts dragging. Here is pseudo-SQL showing the kinds of commands I've been running. Thanks!
============
(a) I am conducting a LEFT JOIN where not all records match; the non-matching records need to be zeroes.
In general, is this the least computationally demanding way to get an average amount per person, including the zeroes?
SELECT AVG( sum_amount ) AS avg_amt_per_person
FROM
  ( SELECT person_id ,
           SUM( CASE WHEN amount IS NULL THEN 0 ELSE amount END ) AS sum_amount
    FROM ( ( SELECT person_id FROM person_table ) AS a
           LEFT JOIN
           ( SELECT person_id , amount FROM amount_table ) AS b
           ON a.person_id = b.person_id ) AS temp_table
    GROUP BY person_id ) ;
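For what it's worth, most engines let you collapse the NULL-to-zero CASE into COALESCE, since a SUM over a group that is entirely NULL is itself NULL. A minimal sketch against an in-memory SQLite database (all table and column names here are made up to mirror the pseudo-SQL above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE person_table ( person_id INTEGER )")
cur.execute("CREATE TABLE amount_table ( person_id INTEGER , amount REAL )")

# three people; person 3 has no amounts and must count as zero
cur.executemany("INSERT INTO person_table VALUES (?)", [(1,), (2,), (3,)])
cur.executemany("INSERT INTO amount_table VALUES (?, ?)",
                [(1, 10.0), (1, 20.0), (2, 30.0)])

# COALESCE( SUM(amount), 0 ) turns person 3's all-NULL group into 0
avg = cur.execute("""
    SELECT AVG( sum_amount ) AS avg_amt_per_person
    FROM ( SELECT a.person_id ,
                  COALESCE( SUM( b.amount ) , 0 ) AS sum_amount
           FROM person_table AS a
           LEFT JOIN amount_table AS b
             ON a.person_id = b.person_id
           GROUP BY a.person_id )
""").fetchone()[0]
print(avg)  # (30 + 30 + 0) / 3 = 20.0
```

Whether COALESCE is actually cheaper than the CASE expression depends on the engine; the main point is that joining the bare tables directly, without the inner `SELECT person_id FROM ...` subqueries, gives the planner less to re-materialize.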
============
(b) I am on Windows (so no strptime), and I have dates in mm/dd/yy format, except that some January through September dates are m/dd/yy (i.e. the month can be either one or two digits). I am just calculating the month on the fly as part of my query. Is it foolish of me not to add a new, permanent column to the data set? I don't have a good sense of how much this segment slows down my query. It is one of my join/merge variables.
CAST( CASE WHEN SUBSTRING( some_date , 2 , 1 ) = '/'
           THEN SUBSTRING( some_date , 1 , 1 )
           ELSE SUBSTRING( some_date , 1 , 2 ) END AS DOUBLE ) AS some_month
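The expression can be sanity-checked on a small sample, and the "permanent column" alternative is a one-time ALTER + UPDATE. A sketch using Python/sqlite3 (SQLite spells the function SUBSTR and uses REAL instead of DOUBLE; the sample dates are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# SQLite's spelling of the one-or-two-digit month extraction above
expr = """
    CAST( CASE WHEN SUBSTR( some_date , 2 , 1 ) = '/'
               THEN SUBSTR( some_date , 1 , 1 )
               ELSE SUBSTR( some_date , 1 , 2 ) END AS REAL )
"""

cur.execute("CREATE TABLE some_table ( some_date TEXT )")
cur.executemany("INSERT INTO some_table VALUES (?)",
                [("1/05/15",), ("12/31/14",), ("9/09/13",)])

# on-the-fly version: the string work runs on every query
months = [row[0] for row in
          cur.execute(f"SELECT {expr} AS some_month FROM some_table")]

# permanent-column version: pay the parse cost exactly once,
# then join/group on the stored column
cur.execute("ALTER TABLE some_table ADD COLUMN some_month REAL")
cur.execute(f"UPDATE some_table SET some_month = {expr}")
stored = [row[0] for row in
          cur.execute("SELECT some_month FROM some_table")]
print(months, stored)
```

Given the timings at the top of this message (CASE + CAST + SUBSTRING roughly 9x the plain query), materializing the column once, especially for a join key, seems likely to pay for itself quickly.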
============
(c) I have three tables with the same structure that get LEFT JOINed onto my main table, so right now I am doing something like:
main_table LEFT JOIN second_table LEFT JOIN third_table LEFT JOIN fourth_table
Is it smarter for me to do this instead?
CREATE TABLE stacked_table AS
  SELECT * FROM second_table
  UNION ALL SELECT * FROM third_table
  UNION ALL SELECT * FROM fourth_table ;
main_table LEFT JOIN stacked_table
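The stacking step itself is straightforward; a minimal Python/sqlite3 sketch with invented table names and a single made-up value column (whether one join against the stacked table beats three separate joins will depend on the engine's planner and on whether the three tables hold disjoint person_ids):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# three same-structured tables, as in the question
for name in ("second_table", "third_table", "fourth_table"):
    cur.execute(f"CREATE TABLE {name} ( person_id INTEGER , x REAL )")
cur.execute("INSERT INTO second_table VALUES (1, 1.0)")
cur.execute("INSERT INTO third_table  VALUES (2, 2.0)")
cur.execute("INSERT INTO fourth_table VALUES (3, 3.0)")

# one permanent stacked copy replaces three separate join targets
cur.execute("""
    CREATE TABLE stacked_table AS
      SELECT * FROM second_table
      UNION ALL SELECT * FROM third_table
      UNION ALL SELECT * FROM fourth_table
""")

n = cur.execute("SELECT COUNT(*) FROM stacked_table").fetchone()[0]
print(n)  # 3 rows: one from each source table
```

Note that UNION ALL keeps duplicates (unlike UNION), which is what you want here, and that the single `main_table LEFT JOIN stacked_table` only reproduces the three-join result if each person_id appears in at most one of the source tables.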