Re: Distributed Join and Replica Tables

9 May 2017


      The activity table on master is a merge table of remote tables activity_w0
and activity_w1 on the workers, which is essentially a view.  The query
"works" in the sense that it returns the correct results.  The issue is
that the query plan pulls the entire activity_wX tables back to the master
node and filters rows there rather than filtering at the worker nodes and
just doing a merge at the master.

-K

On Mon, May 8, 2017 at 5:59 PM, Brian Hood 
wrote:
...
Hi Kelly,
Maybe you should create a view on the worker and reference it from the
master i am not sure if this works as i have never tried it as of yet.
Let me know of your findings.
Regards.
Brian Hood
On Mon, May 8, 2017 at 5:09 PM, Kelly Burkhart 
wrote:
...
Can someone please provide more concrete information on this?  Is it
known to work?
Say I have a master and two worker nodes.
I have a usr table on master and remote versions of this table on each
worker:
sql>select * from usr;
+------+-------+
| id   | name  |
+======+=======+
|    1 | kelly |
|    2 | fred  |
|    3 | bob   |
|    4 | igor  |
+------+-------+
4 tuples (3.239ms)
I have an activity table that is sharded across the two workers, each
remoted to master and a merge table (called activity) combining them.  On
worker 0:
sql>select * from activity_w0;
+--------+------+
| usr_id | what |
+========+======+
|      1 | act1 |
|      3 | act3 |
|      1 | act5 |
|      3 | act8 |
+--------+------+
4 tuples (2.331ms)
On worker 1:
sql>select * from activity_w1;
+--------+------+
| usr_id | what |
+========+======+
|      2 | act2 |
|      4 | act4 |
|      2 | act6 |
|      4 | act7 |
+--------+------+
4 tuples (3.899ms)
So worker 0 has only activity for odd usr ids, worker 1 has only activity
for even worker ids.  If I perform this query on master:
sql>select
more>  what
more>from
more>  activity a
more>  join usr u on (a.usr_id = u.id)
more>where
more>  u.name = 'kelly';
+------+
| what |
+======+
| act1 |
| act5 |
+------+
2 tuples (10.822ms)
I assume that master is pulling back the entire contents of both
activity_w0 and activity_w1 because it doesn't know that there is a remote
usr table on each worker.  It is not clear to me how to set up a replica
table on the workers to allow the filtering to be done there rather than at
the master.
Is this possible or am I chasing a red herring?
-K
On Mon, May 8, 2017 at 6:31 AM, Ying Zhang  wrote:
...
...
On 5 May 2017, at 23:07, Kelly Burkhart 
wrote:
Can someone describe how exactly replica tables are used to push joins
to worker nodes?
Assume a master node and two worker nodes.  I assume my reference
table would go on my master node, and on my worker nodes I would have my
large data table and replicas of the reference tables from the master node.
Hai Kelly,
This would be a strategy you can start with.  I’d expect that some joins
can be pushed to be executed (solely) locally on the worker nodes.
But how exactly replica tables are used, and joins are handled largely
depends on your query.
Best,
Jennie
...
Is that the general strategy?
Thanks,
-K
_______________________________________________
users-list mailing list
users-list@monetdb.org
https://www.monetdb.org/mailman/listinfo/users-list
_______________________________________________
users-list mailing list
users-list@monetdb.org
https://www.monetdb.org/mailman/listinfo/users-list
_______________________________________________
users-list mailing list
users-list@monetdb.org
https://www.monetdb.org/mailman/listinfo/users-list
_______________________________________________
users-list mailing list
users-list@monetdb.org
https://www.monetdb.org/mailman/listinfo/users-list