hi lefteris, the table it's sampling from is only one column and two rows.  if the algorithm is so biased that it samples 80%+ of its records from the top-half of a smallish table, it should display some sort of warning, no?  the docs say this is a uniform sample, but i don't think that's true?  :(


SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 1 ;

https://www.monetdb.org/Documentation/Cookbooks/SQLrecipes/Sampling











On Thu, May 28, 2015 at 4:17 AM, Lefteris <lsidir@gmail.com> wrote:
Stupid question: column 1 and 2 have the same size, right?

I am not familiar with the current sample algorithm, but it might favor the beginning of a column. At least the old code did, and that was a trade-off to achieve only forward jumps on disk when sampling instead of random access.
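
For comparison, a forward-only scan does not have to favor the start of a column. The sketch below uses selection sampling (Knuth's "Algorithm S"), which visits rows strictly in order and still gives every row the same chance. This is an illustration only and not MonetDB's actual code: the function name selection_sample and its arguments N (rows in the column) and n (rows wanted) are made up for this example.

```r
# selection sampling: one forward pass, each row kept with probability
# ( rows still needed ) / ( rows still unseen )
selection_sample <- function( N , n ){
    chosen <- integer( 0 )
    for ( t in seq_len( N ) ){
        remaining_rows <- N - t + 1
        still_needed <- n - length( chosen )
        # keep row t with probability still_needed / remaining_rows
        if ( runif( 1 ) * remaining_rows < still_needed ) chosen <- c( chosen , t )
    }
    chosen
}

# same shape as the test case in this thread: pick 1 row out of 2, many times
set.seed( 1 )
draws <- replicate( 10000 , selection_sample( N = 2 , n = 1 ) )
table( draws )
```

With this scheme the two rows should each come up in roughly half of the draws, so a forward-only access pattern by itself would not explain an 80%+ skew toward the first row.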

On Thu, May 28, 2015 at 10:00 AM, Anthony Damico <ajdamico@gmail.com> wrote:
reported here, thanks!  i am pretty sure SAMPLE is not sampling randomly (at least in these two cases).  https://www.monetdb.org/bugzilla/show_bug.cgi?id=3730

On Mon, May 25, 2015 at 2:04 AM, Anthony Damico <ajdamico@gmail.com> wrote:
# here's a reproducible example using R code to repeat the sampling 1000 times.  in both SAMPLE examples below, the database pulls the 2 fewer than 200 times out of 1000.  shouldn't it be close to 500 out of 1000?  this does not look random (misleading to users?)  sorry if i'm misunderstanding something..  thank you!!



# start in an empty directory somewhere
# setwd( "C:/My Directory/MonetDB" )



# # # # # # # # # START OF SETUP - no editing required


library(MonetDB.R)

batfile <-
    monetdb.server.setup(
        database.directory = paste0( getwd() , "/MonetDB" ) ,
        monetdb.program.path =
            ifelse(
                .Platform$OS.type == "windows" ,
                "C:/Program Files/MonetDB/MonetDB5" ,
                ""
            ) ,
        dbname = "test" ,
        dbport = 50000
    )

pid <- monetdb.server.start( batfile )

db <- dbConnect( MonetDB.R() , "monetdb://localhost:50000/test" , wait = TRUE )

# # # END OF SETUP



dbGetQuery( db , "SELECT 1 AS col UNION ALL SELECT 2 AS col" )


out <- NULL
for ( i in 1:1000 ){
    out <- c( out , dbGetQuery( db , "SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 0.5" ) )
}

# not random
table( unlist( out ) )
#   1   2
# 880 120




out <- NULL
for ( i in 1:1000 ){
    out <- c( out , dbGetQuery( db , "SELECT * FROM ( SELECT 1 AS col UNION ALL SELECT 2 AS col ) AS temp SAMPLE 1" ) )
}

# ALSO not random
table( unlist( out ) )
#   1   2
# 856 144
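
To put a number on "not random", the counts from the two runs above (the value 2 drawn 120 and 144 times out of 1000) can be checked against the 50/50 split that a uniform sample of one row out of two would give. This is a minimal sketch using base R's binom.test; the variable names are chosen for this example only, and the counts are copied by hand from the tables above.

```r
# how many of the 1000 draws returned the value 2 in each run
draws_of_two <- c( fraction_sample = 120 , one_row_sample = 144 )
n_draws <- 1000

# exact two-sided binomial test of H0: P( row 2 is drawn ) = 0.5
p_values <-
    sapply(
        draws_of_two ,
        function( k ) binom.test( k , n_draws , p = 0.5 )$p.value
    )

# under uniform sampling the count of 2s has mean 500
# and standard deviation sqrt( 1000 * 0.5 * 0.5 ), roughly 15.8
z_scores <- ( draws_of_two - n_draws * 0.5 ) / sqrt( n_draws * 0.25 )

print( p_values )
print( z_scores )
```

Both counts sit more than 20 standard deviations below 500, so sampling noise alone is a very unlikely explanation for the skew.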





_______________________________________________
users-list mailing list
users-list@monetdb.org
https://www.monetdb.org/mailman/listinfo/users-list





