Re: Monetdb-R integration

23 Jul 2015

      Hi Imad,

I just noticed this, grouped aggregates using R need to consider the aggr_group implicit parameter. This vector denotes which group the entries in the input belong to. In your case, something like this should work:

CREATE AGGREGATE "ttest2samples"(arg1 double, arg2 varchar(250)) RETURNS double LANGUAGE R { 
unlist(lapply(split(data.frame(arg1, arg2, stringsAsFactors=F), aggr_group), function(x) {t.test(arg1 ~ arg2, data=x)$p.value}))
};

This first splits the input according to the grouping variables, then calculates the p-value for every group separately, and then returns a single result value for each group.

select "2-Age","ttest2samplesStatistic"("9-Score", "4-Gender") as "stat", "ttest2samples"("9-Score", "4-Gender") as "pvalue" from "Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea"  group by  "2-Age”;

then results in

+---------+--------------------------+--------------------------+
| 2-Age   | stat                     | pvalue                   |
+=========+==========================+==========================+
| [21-30] |      -1.3224483820692825 |      0.18902257300165592 |
| [10-20] |      0.15797419230807905 |       0.8781391200296775 |
| > 30    |       -2.636786922876647 |     0.011182421832548708 |
+---------+--------------------------+--------------------------+

Sorry for the delay,

Hannes

On 09 Jul 2015, at 09:52, imad hajj chahine  wrote:
...
Hi Hannes,
Any update on the issue?
Thank you.
On Mon, Jun 22, 2015 at 4:18 PM, imad hajj chahine  wrote:
ok thanks, if you want you can also test with this function (a one sample test that takes the second argument as simple double value):
CREATE AGGREGATE "ttest1sample"(arg1 double, arg2 double) RETURNS double LANGUAGE R { t.test(arg1, arg2)$p.value };
select "2-Age", "ttest1sample"("9-Score",6.08963) from "Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea"  where "2-Age" is not null group by  "2-Age"
On Mon, Jun 22, 2015 at 4:12 PM, Hannes Mühleisen  wrote:
ok, reproduced issue. let me see why this only returns one row.
----- Original Message -----
From: "imad hajj chahine" 
To: "Hannes Mühleisen" 
Sent: Monday, June 22, 2015 3:10:55 PM
Subject: Re: Monetdb-R integration
Sorry Hannes, you need to do a trim on "4-Gender" column as its the one
used in the test.
Sorry again its my fault.
On Mon, Jun 22, 2015 at 3:42 PM, Hannes Mühleisen 
wrote:
...
Did a trim() on the column but same error. There are three distinct values
in that column.
I get
select distinct "2-Age" from
"Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea”;
+---------+
| 2-Age   |
+=========+
| [21-30] |
| [10-20] |
| > 30    |
+————+
----- Original Message -----
From: "imad hajj chahine" 
To: "Hannes Mühleisen" 
Sent: Monday, June 22, 2015 2:36:43 PM
Subject: Re: Monetdb-R integration
Hi Hannes,
Please let me know if you need more information from my side.
Regards
On Fri, Jun 19, 2015 at 11:38 AM, imad hajj chahine <
imad.hajj.chahine@gmail.com> wrote:
...
Hi Hannes,
This error is due that the "2-age" column has more than 2 distinct
values,
there is a space after some values. can you run a trim on the "2-Age" and
then try it again.
As for the vector thing, I didnt get the point, shouldnt the db engine
split the data into vectors and feed the aggregate function which return
one data point?
Thanks
On Fri, Jun 19, 2015 at 11:12 AM, Hannes Mühleisen <
Hannes.Muehleisen@cwi.nl> wrote:
...
I am running attached script, which I pieced together from your emails.
This does not run, it produces the following error:
Error running R expression. Error message: Error in t.test.formula(arg1
~
arg2) :
  grouping factor must have exactly 2 levels
Calls: as.data.frame -> <Anonymous> -> t.test -> t.test.formula
In general, for aggregations, the R code needs to return a vector with a
single entry for each group.
Best,
Hannes
...
On 19 Jun 2015, at 09:44, imad hajj chahine <
imad.hajj.chahine@gmail.com> wrote:
Thanks, if you need more information from my side let me know.
Also can you answer me on the following when i use 2 functions one to
retrieve Pvalue and one for the stat value:
Am I doing the correct way to retrieve the stat and p-value, is the
ttest get executed twice in this case? is there a more optimal way
where i
can retrieve both information.
On Fri, Jun 19, 2015 at 10:39 AM, Hannes Mühleisen <
Hannes.Muehleisen@cwi.nl> wrote:
Looking into this at the moment.
...
On 16 Jun 2015, at 09:56, Dimitar Nedev <
dimitar.nedev@monetdbsolutions.com> wrote:
Hannes, do you know if the STD_DEV is the issue there or if the
queries should actually yield different results?
Dimitar
> On 2015-Jun-15, at 13:28 , imad hajj chahine <
imad.hajj.chahine@gmail.com> wrote:
>
> Hi,
>
> Shouldn't we have the same result from running the first query and
the second query?
> after all its an aggregate function and should return the test for
each group.
>
> I checked that monetdb had problem with running group by with
statistics functions like STD_DEV, are these 2 issues related?
>
> Thank you.
>
>
>
>
> On Mon, Jun 15, 2015 at 1:52 PM, Dimitar Nedev <
dimitar.nedev@monetdbsolutions.com> wrote:
> Hi Imad,
>
> Here is the output I got from running the queries. Note, this might
not be entirely representative since I did clean-up some data before
loading it.
>
> sql>select "2-Age", "tone"."ttest2"("9-Score",6.08963) from
"tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea"  where
"2-Age" is not null group by  "2-Age";
> +---------+--------------------------+
> | 2-Age   | L1                       |
> +=========+==========================+
> | [10-20] |      0.89311103869734754 |
> +---------+--------------------------+
> 3 tuples (3.218ms)
>
> sql>select '[10-20]' as "2-Age","tone"."ttest2"("9-Score",6.08963)
from "tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea"
where
"2-Age" = '[10-20]'
> more>union all
> more>select '[21-30]' as "2-Age","tone"."ttest2"("9-Score",6.08963)
from "tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea"
where
"2-Age" = '[21-30]'
> more>union all
> more>select '[> 30]' as "2-Age","tone"."ttest2"("9-Score",6.08963)
from "tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea"
where
"2-Age" = '> 30';
> +---------+--------------------------+
> | 2-Age   | L1                       |
> +=========+==========================+
> | [10-20] |      0.14150742355349472 |
> | [21-30] |      0.32872830865003566 |
> | [> 30]  |      0.37127075363219841 |
> +---------+--------------------------+
> 3 tuples (8.825ms)
>
>
> sql>select "tone"."ttest2samplesStatistic"("9-Score", "4-Gender")
as
"stat", "tone"."ttest2samples"("9-Score", "4-Gender") as "pvalue" from
"tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea"  group
by
"2-Age";
> +--------------------------+--------------------------+
> | stat                     | pvalue                   |
> +==========================+==========================+
> |      -2.1588506639013736 |     0.032555033970368859 |
> +--------------------------+--------------------------+
> 1 tuple (13.287ms)
>
>
> I dropped the aggregate with:
> sql>DROP AGGREGATE "tone"."ttest2samplesStatistic";
> operation successful (2.049ms)
>
>
> Best regards,
> Dimitar
>
>> On 2015-Jun-12, at 00:46 , imad hajj chahine <
imad.hajj.chahine@gmail.com> wrote:
>>
>> Thank you Dimitar,
>>
>> I was using Squirrel to connect from my host machine to the VM.
>>
>> Now when I connect using SSH-client and run the short group by
query it works but the result is not similar to the union queries (PS: i
also run it in the VM direclty and had the same result)
>>
>>
>> +---------+--------------------------+
>> | 2-Age   | pval                     |
>> +=========+==========================+
>> | [21-30] |      0.99996611809271041 |
>> +---------+--------------------------+
>> 3 tuples (2.638ms)
>>
>>
>> What i am trying to achieve is to run a one sample ttest for all
the age categories, and it seems that the group by is ignored.
>>
>> I have also created 2 functions for the two sample ttest:
>>
>> CREATE AGGREGATE "tone"."ttest2samples"(arg1 double, arg2
varchar(250)) RETURNS double LANGUAGE R { t.test(arg1 ~ arg2)$p.value };
>> CREATE AGGREGATE "tone"."ttest2samplesStatistic"(arg1 double, arg2
varchar(250)) RETURNS double LANGUAGE R { t.test(arg1 ~ arg2)$statistic
};
>>
>> select "tone"."ttest2samplesStatistic"("9-Score", "4-Gender") as
"stat", "tone"."ttest2samples"("9-Score", "4-Gender") as "pvalue" from
"tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea"  group
by
"2-Age"
>>
>> Am I doing the correct way to retrieve the stat and p-value, is
the
ttest get executed twice in this case? is there a more optimal way
where i
can retrieve both information. Please note that I also have the same
grouping issue as before.
>>
>> Do you have the same behavior, or I am missing something in the
creation of these functions.
>>
>> A final question, how do u drop a R function? i am trying to
execute the following:
>> drop all function "tone"."ttest2samplesStat"
>>
>>
>> Thank you.
>>
>>
>>
>> On Thu, Jun 11, 2015 at 11:28 PM, Dimitar Nedev <
dimitar.nedev@monetdbsolutions.com> wrote:
>> Hi Imad,
>>
>> Hannes asked me to look at your issue with the MonetDB-R VM.
>>
>> I loaded the data and ran the queries you mentioned, but I did not
encounter any issues. This holds for both the short and the union
queries.
I am also not familiar with the issue you report, nor its cause. At this
time I think it is local to your instance. If you can export your db, or
re-load the data, I would recommend you to create a new instance with a
fresh image and try again. If you import the image as a new VM instance,
you can keep your old one around as well.
>>
>> Please note that during the test:
>> - I ran as the admin (monetdb) user, as the one created default on
the VM image cannot create new schemas.
>> - I have to clean the data a bit up to load it, removing dozen
lines or so.
>> - Renamed ttest1 to ttest2 to match the queries.
>>
>> Best regards,
>> Dimitar
>>
>>>>>>>> Begin forwarded message:
>>>>>>>>
>>>>>>>> From: imad hajj chahine 
>>>>>>>> Subject: Re: Monetdb-R integration
>>>>>>>> Date: 10 Jun 2015 10:11:48 CEST
>>>>>>>> To: Hannes Mühleisen 
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am using the virtual box provided from
https://www.monetdbsolutions.com/solutions/cloud/vm:
>>>>>>>> OS:linux version 3.10.0-123.9.3.el7.x86_64 (
builder@kbuilder.dev.centos.org)
>>>>>>>> Monetdb: V1.7: (Oct2014-SP1)
>>>>>>>> R: 3.1.2 (2014-10-31)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> create table
"tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea"
("Consumer
Number" INTEGER,"1-Disloyalty" varchar(250),"2-Age"
varchar(250),"3-Amount
transferred" INTEGER,"4-Gender" varchar(250),"5-Citiz enship"
varchar(250),"6- Profession" varchar(250),"7-City" varchar(250),"8-1 NC
Billboard" varchar(250),"8-2 NC Friends" varchar(250),"8-3 NC Magazine"
varchar(250),"8-4 NC Others" varchar(250),"9-Score" decimal(18,3),"10-
Type
of operation" varchar(250),"11-1 Welcoming" INTEGER,"11-2 Waiting time"
INTEGER,"11-3 Time to finish request" INTEGER,"11-4 Fees" INTEGER,"11-5
Courtesy of employees" INTEGER,"11-6 Branch layout" INTEGER,"11-7
Attention
to special requests" INTEGER,"12- Total time for service"
decimal(18,3),"13- Distance from POS" decimal(18,3),"14- Weight"
INTEGER)
>>>>>>>>
>>>>>>>> COPY offset 2 INTO
"tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" FROM
...
USING DELIMITERS ',','\n','"';
...
...
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Wed, Jun 10, 2015 at 10:50 AM, Hannes Mühleisen <
Hannes.Muehleisen@cwi.nl> wrote:
>>>>>>>> Hello Imad,
>>>>>>>>
>>>>>>>> which version of MonetDB and R are you using? Which
operating
system?
>>>>>>>>
>>>>>>>> Can you please also include the commands you used to load
your CSV into MonetDB/ create the table?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Hannes
>>>>>>>>
>>>>>>>>> On 10 Jun 2015, at 00:25, imad hajj chahine <
imad.hajj.chahine@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Hannes,
>>>>>>>>>
>>>>>>>>> I am testing Monetdb-R integration to do some statistical
test, i followed the tutorial you posted on youtube and i have created
'/home/App_Data/TestingDataSimple_2003_Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea.csv'
the
...
following function:
...
...
>>>>>>>>>
>>>>>>>>> CREATE AGGREGATE "tone"."ttest1"(arg1 double, arg2 double)
RETURNS double LANGUAGE R { t.test(arg1, arg2)$p.value };
>>>>>>>>>
>>>>>>>>> this function will basically do a one sample t-test.
>>>>>>>>>
>>>>>>>>> I have loaded the sample attached file into a demo database
and trying to run the following query:
>>>>>>>>>
>>>>>>>>> select "2-Age", "tone"."ttest2"("9-Score",6.08963) from
"tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea"  where
"2-Age" is not null group by  "2-Age"
>>>>>>>>>
>>>>>>>>> The following error happened
>>>>>>>>> Error: protocol violation: unexpected line in data block: .
>>>>>>>>> SQLState:  M0M10
>>>>>>>>> ErrorCode: 0
>>>>>>>>>
>>>>>>>>> alternatively i managed to have the intended result by
running this query:
>>>>>>>>>
>>>>>>>>> select '[10-20]' as
"2-Age","tone"."ttest2"("9-Score",6.08963) from
"tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" where
"2-Age" = '[10-20]'
>>>>>>>>> union all
>>>>>>>>> select '[21-30]' as
"2-Age","tone"."ttest2"("9-Score",6.08963) from
"tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" where
"2-Age" = '[21-30]'
>>>>>>>>> union all
>>>>>>>>> select '[> 30]' as
"2-Age","tone"."ttest2"("9-Score",6.08963) from
"tone"."Marketing_Loyalty_4700298d-9862-40b3-9028-b0f15dab9dea" where
"2-Age" = '> 30'
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> any help or guidance is appreciated.
>>>>>>>>> Thank you.
>>>>>>>>>
>>>>>>>>>

...
...
...
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>

Hannes Mühleisen

tags

participants (1)