should the SAMPLE command produce random results?
# here's a reproducible example using R code to repeat the sampling 1000 times. in both SAMPLE examples below, the database pulls the 2 less than 200 times out of 1000. shouldn't it be close to 500 out of 1000? this seems not random (misleading to users?) sorry if i'm misunderstanding something.. thank you!! # start in an empty directory somewhere # setwd( "C:/My Directory/MonetDB" ) # # # # # # # # # START OF SETUP - no editing required library(MonetDB.R) batfile <- monetdb.server.setup( database.directory = paste0( getwd() , "/MonetDB" ) , monetdb.program.path = ifelse( .Platform$OS.type == "windows" , "C:/Program Files/MonetDB/MonetDB5" , "" ) , dbname = "test" , dbport = 50000 ) pid <- monetdb.server.start( batfile ) db <- dbConnect( MonetDB.R() , "monetdb://localhost:50000/test" , wait = TRUE ) # # # END OF SETUP dbGetQuery( db , "SELECT 1 AS col UNION ALL SELECT 2 AS col" ) out <- NULL for ( i in 1:1000 ){ out <- c( out , dbGetQuery( db , "SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 0.5" ) ) } # not random table( unlist( out ) ) # 1 2 # 880 120 out <- NULL for ( i in 1:1000 ){ out <- c( out , dbGetQuery( db , "SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 1" ) ) } # ALSO not random table( unlist( out ) ) # 1 2 # 856 144
reported here, thanks! i am pretty sure SAMPLE is not sampling randomly
(at least in these two cases).
https://www.monetdb.org/bugzilla/show_bug.cgi?id=3730
On Mon, May 25, 2015 at 2:04 AM, Anthony Damico
# here's a reproducible example using R code to repeat the sampling 1000 times. in both SAMPLE examples below, the database pulls the 2 less than 200 times out of 1000. shouldn't it be close to 500 out of 1000? this seems not random (misleading to users?) sorry if i'm misunderstanding something.. thank you!!
# start in an empty directory somewhere # setwd( "C:/My Directory/MonetDB" )
# # # # # # # # # START OF SETUP - no editing required
library(MonetDB.R)
batfile <- monetdb.server.setup( database.directory = paste0( getwd() , "/MonetDB" ) , monetdb.program.path = ifelse( .Platform$OS.type == "windows" , "C:/Program Files/MonetDB/MonetDB5" , "" ) , dbname = "test" , dbport = 50000 )
pid <- monetdb.server.start( batfile )
db <- dbConnect( MonetDB.R() , "monetdb://localhost:50000/test" , wait = TRUE )
# # # END OF SETUP
dbGetQuery( db , "SELECT 1 AS col UNION ALL SELECT 2 AS col" )
out <- NULL for ( i in 1:1000 ){ out <- c( out , dbGetQuery( db , "SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 0.5" ) ) }
# not random table( unlist( out ) ) # 1 2 # 880 120
out <- NULL for ( i in 1:1000 ){ out <- c( out , dbGetQuery( db , "SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 1" ) ) }
# ALSO not random table( unlist( out ) ) # 1 2 # 856 144
Stupid question: column 1 and 2 have the same size, right?
I am not familiar with the current sample algorithm, but it might favor the
beginning of a column. At least the old code did, and that was a trade-off
to achieve only forward jumps on disk when sampling instead of random
access.
On Thu, May 28, 2015 at 10:00 AM, Anthony Damico
reported here, thanks! i am pretty sure SAMPLE is not sampling randomly (at least in these two cases). https://www.monetdb.org/bugzilla/show_bug.cgi?id=3730
On Mon, May 25, 2015 at 2:04 AM, Anthony Damico
wrote: # here's a reproducible example using R code to repeat the sampling 1000 times. in both SAMPLE examples below, the database pulls the 2 less than 200 times out of 1000. shouldn't it be close to 500 out of 1000? this seems not random (misleading to users?) sorry if i'm misunderstanding something.. thank you!!
# start in an empty directory somewhere # setwd( "C:/My Directory/MonetDB" )
# # # # # # # # # START OF SETUP - no editing required
library(MonetDB.R)
batfile <- monetdb.server.setup( database.directory = paste0( getwd() , "/MonetDB" ) , monetdb.program.path = ifelse( .Platform$OS.type == "windows" , "C:/Program Files/MonetDB/MonetDB5" , "" ) , dbname = "test" , dbport = 50000 )
pid <- monetdb.server.start( batfile )
db <- dbConnect( MonetDB.R() , "monetdb://localhost:50000/test" , wait = TRUE )
# # # END OF SETUP
dbGetQuery( db , "SELECT 1 AS col UNION ALL SELECT 2 AS col" )
out <- NULL for ( i in 1:1000 ){ out <- c( out , dbGetQuery( db , "SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 0.5" ) ) }
# not random table( unlist( out ) ) # 1 2 # 880 120
out <- NULL for ( i in 1:1000 ){ out <- c( out , dbGetQuery( db , "SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 1" ) ) }
# ALSO not random table( unlist( out ) ) # 1 2 # 856 144
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- DISCLAIMER -- I have been diagnosed with* mixed and other personality disorders* (World Health Organization, Chapter V, Mental and behavioural disorders, F61 http://apps.who.int/classifications/apps/icd/icd10online2003/fr-icd.htm?gf60...). The contents of this email might be a direct consequence of my mental condition. Therefore, you are *not allowed* to argue with me, and you are *not allowed* to tell me that I am being unreasonable or wrong or that I complain too much, the same way that you will not mock any other sufferer by constantly reminding him of his condition.
hi lefteris, the table it's sampling from is only one column and two rows.
if the algorithm is so biased that it samples 80%+ of its records from the
top-half of a smallish table, it should display some sort of warning, no?
the docs say this is a uniform sample, but i don't think that's true? :(
SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE
1 ;
https://www.monetdb.org/Documentation/Cookbooks/SQLrecipes/Sampling
On Thu, May 28, 2015 at 4:17 AM, Lefteris
Stupid question: column 1 and 2 have the same size, right?
I am not familiar with the current sample algorithm, but it might favor the beginning of a column. At least the old code did, and that was a trade-off to achieve only forward jumps on disk when sampling instead of random access.
On Thu, May 28, 2015 at 10:00 AM, Anthony Damico
wrote: reported here, thanks! i am pretty sure SAMPLE is not sampling randomly (at least in these two cases). https://www.monetdb.org/bugzilla/show_bug.cgi?id=3730
On Mon, May 25, 2015 at 2:04 AM, Anthony Damico
wrote: # here's a reproducible example using R code to repeat the sampling 1000 times. in both SAMPLE examples below, the database pulls the 2 less than 200 times out of 1000. shouldn't it be close to 500 out of 1000? this seems not random (misleading to users?) sorry if i'm misunderstanding something.. thank you!!
# start in an empty directory somewhere # setwd( "C:/My Directory/MonetDB" )
# # # # # # # # # START OF SETUP - no editing required
library(MonetDB.R)
batfile <- monetdb.server.setup( database.directory = paste0( getwd() , "/MonetDB" ) , monetdb.program.path = ifelse( .Platform$OS.type == "windows" , "C:/Program Files/MonetDB/MonetDB5" , "" ) , dbname = "test" , dbport = 50000 )
pid <- monetdb.server.start( batfile )
db <- dbConnect( MonetDB.R() , "monetdb://localhost:50000/test" , wait = TRUE )
# # # END OF SETUP
dbGetQuery( db , "SELECT 1 AS col UNION ALL SELECT 2 AS col" )
out <- NULL for ( i in 1:1000 ){ out <- c( out , dbGetQuery( db , "SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 0.5" ) ) }
# not random table( unlist( out ) ) # 1 2 # 880 120
out <- NULL for ( i in 1:1000 ){ out <- c( out , dbGetQuery( db , "SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 1" ) ) }
# ALSO not random table( unlist( out ) ) # 1 2 # 856 144
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- DISCLAIMER -- I have been diagnosed with* mixed and other personality disorders* (World Health Organization, Chapter V, Mental and behavioural disorders, F61 http://apps.who.int/classifications/apps/icd/icd10online2003/fr-icd.htm?gf60...). The contents of this email might be a direct consequence of my mental condition. Therefore, you are *not allowed* to argue with me, and you are *not allowed* to tell me that I am being unreasonable or wrong or that I complain too much, the same way that you will not mock any other sufferer by constantly reminding him of his condition.
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
Hi Anthony,
sorry I did not read the query before, I thought you were sampling from 2
columns and counting. Anyway, the current algorithm is not biased to select
from the top half. It is actually as uniform as it can be according to our
tests. I run your experiment and I am not getting the bias you are
reporting. It might be a problem of the windows rand function (I am
assuming you are using windows), because on my linux machine this is not
reproducible.
We will perform a few more tests to find the cause of this.
On Thu, May 28, 2015 at 10:28 AM, Anthony Damico
hi lefteris, the table it's sampling from is only one column and two rows. if the algorithm is so biased that it samples 80%+ of its records from the top-half of a smallish table, it should display some sort of warning, no? the docs say this is a uniform sample, but i don't think that's true? :(
SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 1 ;
https://www.monetdb.org/Documentation/Cookbooks/SQLrecipes/Sampling
On Thu, May 28, 2015 at 4:17 AM, Lefteris
wrote: Stupid question: column 1 and 2 have the same size, right?
I am not familiar with the current sample algorithm, but it might favor the beginning of a column. At least the old code did, and that was a trade-off to achieve only forward jumps on disk when sampling instead of random access.
On Thu, May 28, 2015 at 10:00 AM, Anthony Damico
wrote: reported here, thanks! i am pretty sure SAMPLE is not sampling randomly (at least in these two cases). https://www.monetdb.org/bugzilla/show_bug.cgi?id=3730
On Mon, May 25, 2015 at 2:04 AM, Anthony Damico
wrote: # here's a reproducible example using R code to repeat the sampling 1000 times. in both SAMPLE examples below, the database pulls the 2 less than 200 times out of 1000. shouldn't it be close to 500 out of 1000? this seems not random (misleading to users?) sorry if i'm misunderstanding something.. thank you!!
# start in an empty directory somewhere # setwd( "C:/My Directory/MonetDB" )
# # # # # # # # # START OF SETUP - no editing required
library(MonetDB.R)
batfile <- monetdb.server.setup( database.directory = paste0( getwd() , "/MonetDB" ) , monetdb.program.path = ifelse( .Platform$OS.type == "windows" , "C:/Program Files/MonetDB/MonetDB5" , "" ) , dbname = "test" , dbport = 50000 )
pid <- monetdb.server.start( batfile )
db <- dbConnect( MonetDB.R() , "monetdb://localhost:50000/test" , wait = TRUE )
# # # END OF SETUP
dbGetQuery( db , "SELECT 1 AS col UNION ALL SELECT 2 AS col" )
out <- NULL for ( i in 1:1000 ){ out <- c( out , dbGetQuery( db , "SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 0.5" ) ) }
# not random table( unlist( out ) ) # 1 2 # 880 120
out <- NULL for ( i in 1:1000 ){ out <- c( out , dbGetQuery( db , "SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 1" ) ) }
# ALSO not random table( unlist( out ) ) # 1 2 # 856 144
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- DISCLAIMER -- I have been diagnosed with* mixed and other personality disorders* (World Health Organization, Chapter V, Mental and behavioural disorders, F61 http://apps.who.int/classifications/apps/icd/icd10online2003/fr-icd.htm?gf60...). The contents of this email might be a direct consequence of my mental condition. Therefore, you are *not allowed* to argue with me, and you are *not allowed* to tell me that I am being unreasonable or wrong or that I complain too much, the same way that you will not mock any other sufferer by constantly reminding him of his condition.
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- DISCLAIMER -- I have been diagnosed with* mixed and other personality disorders* (World Health Organization, Chapter V, Mental and behavioural disorders, F61 http://apps.who.int/classifications/apps/icd/icd10online2003/fr-icd.htm?gf60...). The contents of this email might be a direct consequence of my mental condition. Therefore, you are *not allowed* to argue with me, and you are *not allowed* to tell me that I am being unreasonable or wrong or that I complain too much, the same way that you will not mock any other sufferer by constantly reminding him of his condition.
yep, i'm on windows using
https://www.monetdb.org/downloads/testing/Windows/Oct2014-SP4/
and reported in bugzilla here
https://www.monetdb.org/bugzilla/show_bug.cgi?id=3730
thanks!!
On Mon, Jun 1, 2015 at 8:58 AM, Lefteris
Hi Anthony,
sorry I did not read the query before, I thought you were sampling from 2 columns and counting. Anyway, the current algorithm is not biased to select from the top half. It is actually as uniform as it can be according to our tests. I run your experiment and I am not getting the bias you are reporting. It might be a problem of the windows rand function (I am assuming you are using windows), because on my linux machine this is not reproducible.
We will perform a few more tests to find the cause of this.
On Thu, May 28, 2015 at 10:28 AM, Anthony Damico
wrote: hi lefteris, the table it's sampling from is only one column and two rows. if the algorithm is so biased that it samples 80%+ of its records from the top-half of a smallish table, it should display some sort of warning, no? the docs say this is a uniform sample, but i don't think that's true? :(
SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 1 ;
https://www.monetdb.org/Documentation/Cookbooks/SQLrecipes/Sampling
On Thu, May 28, 2015 at 4:17 AM, Lefteris
wrote: Stupid question: column 1 and 2 have the same size, right?
I am not familiar with the current sample algorithm, but it might favor the beginning of a column. At least the old code did, and that was a trade-off to achieve only forward jumps on disk when sampling instead of random access.
On Thu, May 28, 2015 at 10:00 AM, Anthony Damico
wrote: reported here, thanks! i am pretty sure SAMPLE is not sampling randomly (at least in these two cases). https://www.monetdb.org/bugzilla/show_bug.cgi?id=3730
On Mon, May 25, 2015 at 2:04 AM, Anthony Damico
wrote: # here's a reproducible example using R code to repeat the sampling 1000 times. in both SAMPLE examples below, the database pulls the 2 less than 200 times out of 1000. shouldn't it be close to 500 out of 1000? this seems not random (misleading to users?) sorry if i'm misunderstanding something.. thank you!!
# start in an empty directory somewhere # setwd( "C:/My Directory/MonetDB" )
# # # # # # # # # START OF SETUP - no editing required
library(MonetDB.R)
batfile <- monetdb.server.setup( database.directory = paste0( getwd() , "/MonetDB" ) , monetdb.program.path = ifelse( .Platform$OS.type == "windows" , "C:/Program Files/MonetDB/MonetDB5" , "" ) , dbname = "test" , dbport = 50000 )
pid <- monetdb.server.start( batfile )
db <- dbConnect( MonetDB.R() , "monetdb://localhost:50000/test" , wait = TRUE )
# # # END OF SETUP
dbGetQuery( db , "SELECT 1 AS col UNION ALL SELECT 2 AS col" )
out <- NULL for ( i in 1:1000 ){ out <- c( out , dbGetQuery( db , "SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 0.5" ) ) }
# not random table( unlist( out ) ) # 1 2 # 880 120
out <- NULL for ( i in 1:1000 ){ out <- c( out , dbGetQuery( db , "SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 1" ) ) }
# ALSO not random table( unlist( out ) ) # 1 2 # 856 144
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- DISCLAIMER -- I have been diagnosed with* mixed and other personality disorders* (World Health Organization, Chapter V, Mental and behavioural disorders, F61 http://apps.who.int/classifications/apps/icd/icd10online2003/fr-icd.htm?gf60...). The contents of this email might be a direct consequence of my mental condition. Therefore, you are *not allowed* to argue with me, and you are *not allowed* to tell me that I am being unreasonable or wrong or that I complain too much, the same way that you will not mock any other sufferer by constantly reminding him of his condition.
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- DISCLAIMER -- I have been diagnosed with* mixed and other personality disorders* (World Health Organization, Chapter V, Mental and behavioural disorders, F61 http://apps.who.int/classifications/apps/icd/icd10online2003/fr-icd.htm?gf60...). The contents of this email might be a direct consequence of my mental condition. Therefore, you are *not allowed* to argue with me, and you are *not allowed* to tell me that I am being unreasonable or wrong or that I complain too much, the same way that you will not mock any other sufferer by constantly reminding him of his condition.
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
participants (2)
-
Anthony Damico
-
Lefteris