Hi Anthony,
sorry I did not read the query before, I thought you were sampling from 2
columns and counting. Anyway, the current algorithm is not biased to select
from the top half. It is actually as uniform as it can be according to our
tests. I run your experiment and I am not getting the bias you are
reporting. It might be a problem of the windows rand function (I am
assuming you are using windows), because on my linux machine this is not
reproducible.
We will perform a few more tests to find the cause of this.
On Thu, May 28, 2015 at 10:28 AM, Anthony Damico
hi lefteris, the table it's sampling from is only one column and two rows. if the algorithm is so biased that it samples 80%+ of its records from the top-half of a smallish table, it should display some sort of warning, no? the docs say this is a uniform sample, but i don't think that's true? :(
SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 1 ;
https://www.monetdb.org/Documentation/Cookbooks/SQLrecipes/Sampling
On Thu, May 28, 2015 at 4:17 AM, Lefteris
wrote: Stupid question: column 1 and 2 have the same size, right?
I am not familiar with the current sample algorithm, but it might favor the beginning of a column. At least the old code did, and that was a trade-off to achieve only forward jumps on disk when sampling instead of random access.
On Thu, May 28, 2015 at 10:00 AM, Anthony Damico
wrote: reported here, thanks! i am pretty sure SAMPLE is not sampling randomly (at least in these two cases). https://www.monetdb.org/bugzilla/show_bug.cgi?id=3730
On Mon, May 25, 2015 at 2:04 AM, Anthony Damico
wrote: # here's a reproducible example using R code to repeat the sampling 1000 times. in both SAMPLE examples below, the database pulls the 2 less than 200 times out of 1000. shouldn't it be close to 500 out of 1000? this seems not random (misleading to users?) sorry if i'm misunderstanding something.. thank you!!
# start in an empty directory somewhere # setwd( "C:/My Directory/MonetDB" )
# # # # # # # # # START OF SETUP - no editing required
library(MonetDB.R)
batfile <- monetdb.server.setup( database.directory = paste0( getwd() , "/MonetDB" ) , monetdb.program.path = ifelse( .Platform$OS.type == "windows" , "C:/Program Files/MonetDB/MonetDB5" , "" ) , dbname = "test" , dbport = 50000 )
pid <- monetdb.server.start( batfile )
db <- dbConnect( MonetDB.R() , "monetdb://localhost:50000/test" , wait = TRUE )
# # # END OF SETUP
dbGetQuery( db , "SELECT 1 AS col UNION ALL SELECT 2 AS col" )
out <- NULL for ( i in 1:1000 ){ out <- c( out , dbGetQuery( db , "SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 0.5" ) ) }
# not random table( unlist( out ) ) # 1 2 # 880 120
out <- NULL for ( i in 1:1000 ){ out <- c( out , dbGetQuery( db , "SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 1" ) ) }
# ALSO not random table( unlist( out ) ) # 1 2 # 856 144
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- DISCLAIMER -- I have been diagnosed with* mixed and other personality disorders* (World Health Organization, Chapter V, Mental and behavioural disorders, F61 http://apps.who.int/classifications/apps/icd/icd10online2003/fr-icd.htm?gf60...). The contents of this email might be a direct consequence of my mental condition. Therefore, you are *not allowed* to argue with me, and you are *not allowed* to tell me that I am being unreasonable or wrong or that I complain too much, the same way that you will not mock any other sufferer by constantly reminding him of his condition.
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- DISCLAIMER -- I have been diagnosed with* mixed and other personality disorders* (World Health Organization, Chapter V, Mental and behavioural disorders, F61 http://apps.who.int/classifications/apps/icd/icd10online2003/fr-icd.htm?gf60...). The contents of this email might be a direct consequence of my mental condition. Therefore, you are *not allowed* to argue with me, and you are *not allowed* to tell me that I am being unreasonable or wrong or that I complain too much, the same way that you will not mock any other sufferer by constantly reminding him of his condition.