Re: Monetdb-R integration
Hi Imad,
I just noticed this, grouped aggregates using R need to consider the aggr_group implicit parameter. This vector denotes which group the entries in the input belong to. In your case, something like this should work:
CREATE AGGREGATE "ttest2samples"(arg1 double, arg2 varchar(250)) RETURNS double LANGUAGE R {
unlist(lapply(split(data.frame(arg1, arg2, stringsAsFactors=F), aggr_group), function(x) {t.test(arg1 ~ arg2, data=x)$p.value}))
};
This first splits the input according to the grouping variables, then calculates the p-value for every group separately, and then returns a single result value for each group.
select "2-Age","ttest2samplesStatistic"("9-Score", "4-Gender") as "stat", "ttest2samples"("9-Score", "4-Gender") as "pvalue" from "Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" group by "2-Age”;
then results in
+---------+--------------------------+--------------------------+
| 2-Age | stat | pvalue |
+=========+==========================+==========================+
| [21-30] | -1.3224483820692825 | 0.18902257300165592 |
| [10-20] | 0.15797419230807905 | 0.8781391200296775 |
| > 30 | -2.636786922876647 | 0.011182421832548708 |
+---------+--------------------------+--------------------------+
Sorry for the delay,
Hannes
On 09 Jul 2015, at 09:52, imad hajj chahine
Hi Hannes,
Any update on the issue?
Thank you.
On Mon, Jun 22, 2015 at 4:18 PM, imad hajj chahine
wrote: ok thanks, if you want you can also test with this function (a one sample test that takes the second argument as simple double value): CREATE AGGREGATE "ttest1sample"(arg1 double, arg2 double) RETURNS double LANGUAGE R { t.test(arg1, arg2)$p.value };
select "2-Age", "ttest1sample"("9-Score",6.08963) from "Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" where "2-Age" is not null group by "2-Age"
On Mon, Jun 22, 2015 at 4:12 PM, Hannes Mühleisen
wrote: ok, reproduced issue. let me see why this only returns one row. ----- Original Message ----- From: "imad hajj chahine"
To: "Hannes Mühleisen" Sent: Monday, June 22, 2015 3:10:55 PM Subject: Re: Monetdb-R integration Sorry Hannes, you need to do a trim on "4-Gender" column as its the one used in the test. Sorry again its my fault.
On Mon, Jun 22, 2015 at 3:42 PM, Hannes Mühleisen
wrote: Did a trim() on the column but same error. There are three distinct values in that column. I get
select distinct "2-Age" from "Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea”; +---------+ | 2-Age | +=========+ | [21-30] | | [10-20] | | > 30 | +————+
----- Original Message ----- From: "imad hajj chahine"
To: "Hannes Mühleisen" Sent: Monday, June 22, 2015 2:36:43 PM Subject: Re: Monetdb-R integration Hi Hannes,
Please let me know if you need more information from my side.
Regards
On Fri, Jun 19, 2015 at 11:38 AM, imad hajj chahine < imad.hajj.chahine@gmail.com> wrote:
Hi Hannes,
This error is due that the "2-age" column has more than 2 distinct values, there is a space after some values. can you run a trim on the "2-Age" and then try it again.
As for the vector thing, I didnt get the point, shouldnt the db engine split the data into vectors and feed the aggregate function which return one data point?
Thanks
On Fri, Jun 19, 2015 at 11:12 AM, Hannes Mühleisen < Hannes.Muehleisen@cwi.nl> wrote:
I am running attached script, which I pieced together from your emails. This does not run, it produces the following error:
Error running R expression. Error message: Error in t.test.formula(arg1 ~ arg2) : grouping factor must have exactly 2 levels Calls: as.data.frame -> <Anonymous> -> t.test -> t.test.formula
In general, for aggregations, the R code needs to return a vector with a single entry for each group.
Best,
Hannes
On 19 Jun 2015, at 09:44, imad hajj chahine < imad.hajj.chahine@gmail.com> wrote:
Thanks, if you need more information from my side let me know.
Also can you answer me on the following when i use 2 functions one to retrieve Pvalue and one for the stat value: Am I doing the correct way to retrieve the stat and p-value, is the ttest get executed twice in this case? is there a more optimal way where i can retrieve both information.
On Fri, Jun 19, 2015 at 10:39 AM, Hannes Mühleisen < Hannes.Muehleisen@cwi.nl> wrote: Looking into this at the moment.
On 16 Jun 2015, at 09:56, Dimitar Nedev < dimitar.nedev@monetdbsolutions.com> wrote:
Hannes, do you know if the STD_DEV is the issue there or if the queries should actually yield different results?
Dimitar
> On 2015-Jun-15, at 13:28 , imad hajj chahine < imad.hajj.chahine@gmail.com> wrote: > > Hi, > > Shouldn't we have the same result from running the first query and the second query? > after all its an aggregate function and should return the test for each group. > > I checked that monetdb had problem with running group by with statistics functions like STD_DEV, are these 2 issues related? > > Thank you. > > > > > On Mon, Jun 15, 2015 at 1:52 PM, Dimitar Nedev < dimitar.nedev@monetdbsolutions.com> wrote: > Hi Imad, > > Here is the output I got from running the queries. Note, this might not be entirely representative since I did clean-up some data before loading it. > > sql>select "2-Age", "tone"."ttest2"("9-Score",6.08963) from "tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" where "2-Age" is not null group by "2-Age"; > +---------+--------------------------+ > | 2-Age | L1 | > +=========+==========================+ > | [10-20] | 0.89311103869734754 | > +---------+--------------------------+ > 3 tuples (3.218ms) > > sql>select '[10-20]' as "2-Age","tone"."ttest2"("9-Score",6.08963) from "tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" where "2-Age" = '[10-20]' > more>union all > more>select '[21-30]' as "2-Age","tone"."ttest2"("9-Score",6.08963) from "tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" where "2-Age" = '[21-30]' > more>union all > more>select '[> 30]' as "2-Age","tone"."ttest2"("9-Score",6.08963) from "tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" where "2-Age" = '> 30'; > +---------+--------------------------+ > | 2-Age | L1 | > +=========+==========================+ > | [10-20] | 0.14150742355349472 | > | [21-30] | 0.32872830865003566 | > | [> 30] | 0.37127075363219841 | > +---------+--------------------------+ > 3 tuples (8.825ms) > > > sql>select "tone"."ttest2samplesStatistic"("9-Score", "4-Gender") as "stat", "tone"."ttest2samples"("9-Score", "4-Gender") as "pvalue" from "tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" group by "2-Age"; > +--------------------------+--------------------------+ > | stat | pvalue | > +==========================+==========================+ > | -2.1588506639013736 | 0.032555033970368859 | > +--------------------------+--------------------------+ > 1 tuple (13.287ms) > > > I dropped the aggregate with: > sql>DROP AGGREGATE "tone"."ttest2samplesStatistic"; > operation successful (2.049ms) > > > Best regards, > Dimitar > >> On 2015-Jun-12, at 00:46 , imad hajj chahine < imad.hajj.chahine@gmail.com> wrote: >> >> Thank you Dimitar, >> >> I was using Squirrel to connect from my host machine to the VM. >> >> Now when I connect using SSH-client and run the short group by query it works but the result is not similar to the union queries (PS: i also run it in the VM direclty and had the same result) >> >> >> +---------+--------------------------+ >> | 2-Age | pval | >> +=========+==========================+ >> | [21-30] | 0.99996611809271041 | >> +---------+--------------------------+ >> 3 tuples (2.638ms) >> >> >> What i am trying to achieve is to run a one sample ttest for all the age categories, and it seems that the group by is ignored. >> >> I have also created 2 functions for the two sample ttest: >> >> CREATE AGGREGATE "tone"."ttest2samples"(arg1 double, arg2 varchar(250)) RETURNS double LANGUAGE R { t.test(arg1 ~ arg2)$p.value }; >> CREATE AGGREGATE "tone"."ttest2samplesStatistic"(arg1 double, arg2 varchar(250)) RETURNS double LANGUAGE R { t.test(arg1 ~ arg2)$statistic }; >> >> select "tone"."ttest2samplesStatistic"("9-Score", "4-Gender") as "stat", "tone"."ttest2samples"("9-Score", "4-Gender") as "pvalue" from "tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" group by "2-Age" >> >> Am I doing the correct way to retrieve the stat and p-value, is the ttest get executed twice in this case? is there a more optimal way where i can retrieve both information. Please note that I also have the same grouping issue as before. >> >> Do you have the same behavior, or I am missing something in the creation of these functions. >> >> A final question, how do u drop a R function? i am trying to execute the following: >> drop all function "tone"."ttest2samplesStat" >> >> >> Thank you. >> >> >> >> On Thu, Jun 11, 2015 at 11:28 PM, Dimitar Nedev < dimitar.nedev@monetdbsolutions.com> wrote: >> Hi Imad, >> >> Hannes asked me to look at your issue with the MonetDB-R VM. >> >> I loaded the data and ran the queries you mentioned, but I did not encounter any issues. This holds for both the short and the union queries. I am also not familiar with the issue you report, nor its cause. At this time I think it is local to your instance. If you can export your db, or re-load the data, I would recommend you to create a new instance with a fresh image and try again. If you import the image as a new VM instance, you can keep your old one around as well. >> >> Please note that during the test: >> - I ran as the admin (monetdb) user, as the one created default on the VM image cannot create new schemas. >> - I have to clean the data a bit up to load it, removing dozen lines or so. >> - Renamed ttest1 to ttest2 to match the queries. >> >> Best regards, >> Dimitar >> >>>>>>>> Begin forwarded message: >>>>>>>> >>>>>>>> From: imad hajj chahine
>>>>>>>> Subject: Re: Monetdb-R integration >>>>>>>> Date: 10 Jun 2015 10:11:48 CEST >>>>>>>> To: Hannes Mühleisen >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> I am using the virtual box provided from https://www.monetdbsolutions.com/solutions/cloud/vm: >>>>>>>> OS:linux version 3.10.0-123.9.3.el7.x86_64 ( builder@kbuilder.dev.centos.org) >>>>>>>> Monetdb: V1.7: (Oct2014-SP1) >>>>>>>> R: 3.1.2 (2014-10-31) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> create table "tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" ("Consumer Number" INTEGER,"1-Disloyalty" varchar(250),"2-Age" varchar(250),"3-Amount transferred" INTEGER,"4-Gender" varchar(250),"5-Citiz enship" varchar(250),"6- Profession" varchar(250),"7-City" varchar(250),"8-1 NC Billboard" varchar(250),"8-2 NC Friends" varchar(250),"8-3 NC Magazine" varchar(250),"8-4 NC Others" varchar(250),"9-Score" decimal(18,3),"10- Type of operation" varchar(250),"11-1 Welcoming" INTEGER,"11-2 Waiting time" INTEGER,"11-3 Time to finish request" INTEGER,"11-4 Fees" INTEGER,"11-5 Courtesy of employees" INTEGER,"11-6 Branch layout" INTEGER,"11-7 Attention to special requests" INTEGER,"12- Total time for service" decimal(18,3),"13- Distance from POS" decimal(18,3),"14- Weight" INTEGER) >>>>>>>> >>>>>>>> COPY offset 2 INTO "tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" FROM USING DELIMITERS ',','\n','"';
>>>>>>>> >>>>>>>> Thanks >>>>>>>> >>>>>>>> On Wed, Jun 10, 2015 at 10:50 AM, Hannes Mühleisen < Hannes.Muehleisen@cwi.nl> wrote: >>>>>>>> Hello Imad, >>>>>>>> >>>>>>>> which version of MonetDB and R are you using? Which operating system? >>>>>>>> >>>>>>>> Can you please also include the commands you used to load your CSV into MonetDB/ create the table? >>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> Hannes >>>>>>>> >>>>>>>>> On 10 Jun 2015, at 00:25, imad hajj chahine < imad.hajj.chahine@gmail.com> wrote: >>>>>>>>> >>>>>>>>> Hi Hannes, >>>>>>>>> >>>>>>>>> I am testing Monetdb-R integration to do some statistical test, i followed the tutorial you posted on youtube and i have created
'/home/App_Data/TestingDataSimple_2003_Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea.csv' the
following function:
>>>>>>>>> >>>>>>>>> CREATE AGGREGATE "tone"."ttest1"(arg1 double, arg2 double) RETURNS double LANGUAGE R { t.test(arg1, arg2)$p.value }; >>>>>>>>> >>>>>>>>> this function will basically do a one sample t-test. >>>>>>>>> >>>>>>>>> I have loaded the sample attached file into a demo database and trying to run the following query: >>>>>>>>> >>>>>>>>> select "2-Age", "tone"."ttest2"("9-Score",6.08963) from "tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" where "2-Age" is not null group by "2-Age" >>>>>>>>> >>>>>>>>> The following error happened >>>>>>>>> Error: protocol violation: unexpected line in data block: . >>>>>>>>> SQLState: M0M10 >>>>>>>>> ErrorCode: 0 >>>>>>>>> >>>>>>>>> alternatively i managed to have the intended result by running this query: >>>>>>>>> >>>>>>>>> select '[10-20]' as "2-Age","tone"."ttest2"("9-Score",6.08963) from "tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" where "2-Age" = '[10-20]' >>>>>>>>> union all >>>>>>>>> select '[21-30]' as "2-Age","tone"."ttest2"("9-Score",6.08963) from "tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" where "2-Age" = '[21-30]' >>>>>>>>> union all >>>>>>>>> select '[> 30]' as "2-Age","tone"."ttest2"("9-Score",6.08963) from "tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" where "2-Age" = '> 30' >>>>>>>>> >>>>>>>>> >>>>>>>>> any help or guidance is appreciated. >>>>>>>>> Thank you. >>>>>>>>> >>>>>>>>>
>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > >
participants (1)
-
Hannes Mühleisen