Re: [MonetDB-users] mal index

21 Apr 2010

      Hello Miquel,

first of all, I agree with you that MonetDB is the proper choice for your
problem ;-)

As a first quick answer --- I hope to come back with more once I find more
time to dive into it --- here is one general hint of using MonetDB the most
efficient way (in addition to the comments that Lefteris already gave).

Your below code for calculating the "Maxium out degree" basically
re-implements in MAL (and interpreted language) what does already exist as
((hopefully higly optimized) MAL primitive that is implemented in C (and
hence compiled code: a histogram.

In other words, to get the most out of MonetDB, one should not see and use
MAL as an imperative programming language to iteratively process data items
one at a time, but rather as a declarative query language to process large
amount of data in one go (you may also think of it as something like
"SIMD").

Here's a small example regarding your "Maxium out degree" code.

My attached code compares the execution time (and results) of your "Maxium
out degree" and a simple "aggr.histrgram()" call in MAL that does exactly
the same.
(I use our "microbenchmark" module to conveniently create a synthetic input
BAT of 1000000 tuples with 10 different randomly distributed edgeHead values.)

Here are the result for two runs of each approach:

time for 'Maxium out degree':   1053135 usec
[ "result of 'Maxium out degree'" ]
#-----------------#
# h	t	  # name
# lng	int	  # type
#-----------------#
[ 9,	  99597	  ]
[ 7,	  100023  ]
[ 4,	  99919	  ]
[ 2,	  100936  ]
[ 6,	  99970	  ]
[ 8,	  99796	  ]
[ 0,	  100056  ]
[ 1,	  100046  ]
[ 5,	  99942	  ]
[ 3,	  99715	  ]
time for 'aggr.histogram()':       9864 usec
[ "result of 'aggr.histogram()'" ]
#-----------------#
# h	t	  # name
# lng	int	  # type
#-----------------#
[ 9,	  99597	  ]
[ 7,	  100023  ]
[ 4,	  99919	  ]
[ 2,	  100936  ]
[ 6,	  99970	  ]
[ 8,	  99796	  ]
[ 0,	  100056  ]
[ 1,	  100046  ]
[ 5,	  99942	  ]
[ 3,	  99715	  ]
time for 'Maxium out degree':   1035288 usec
[ "result of 'Maxium out degree'" ]
#-----------------#
# h	t	  # name
# lng	int	  # type
#-----------------#
[ 9,	  99597	  ]
[ 7,	  100023  ]
[ 4,	  99919	  ]
[ 2,	  100936  ]
[ 6,	  99970	  ]
[ 8,	  99796	  ]
[ 0,	  100056  ]
[ 1,	  100046  ]
[ 5,	  99942	  ]
[ 3,	  99715	  ]
time for 'aggr.histogram()':       9978 usec
[ "result of 'aggr.histogram()'" ]
#-----------------#
# h	t	  # name
# lng	int	  # type
#-----------------#
[ 9,	  99597	  ]
[ 7,	  100023  ]
[ 4,	  99919	  ]
[ 2,	  100936  ]
[ 6,	  99970	  ]
[ 8,	  99796	  ]
[ 0,	  100056  ]
[ 1,	  100046  ]
[ 5,	  99942	  ]
[ 3,	  99715	  ]

Note in particular that aggr.histogram() is more than 100(!) times faster
than your code (for the reasons mentioned above).

I expect that similar changes to the remainder of your code will have
likewise positive effects.
Please feel free to play yourself --- I know that getting familiar with
"thinking the MonetDB way" takes some time.
As mentioned above, I hope to come back with more sometime later (within the
next days).

Regards,
Stefan

On Wed, Apr 21, 2010 at 09:15:52AM +0200, Miguel Ángel Águila Lorente wrote:
...
Hello,
I'm working in a research problem and we need to create a graph
database. We think that MonetDB can be useful to represent this graph
since the use of bats may make the graph traversal efficient.
Nevertheless, SQL is limited for the type of operations that we need to
do, and that is why we have decided to resort to MAL.
Our database is like this:
- Nodes: 
        a) node ident
        b) name
- Edges:
        a) edge ident
        b) head node ident
        c) tail node ident
One of the operations is to find the node with maximum out degree and,
after that, to do a BFS using this node as the source.
I have stored all the information in bats:
batNode1=[oid,node ident]
batNode2=[oid,name]
batEdge=[oid, edge ident]
batEdgeHead=[oid, head node ident]
batEdgeTail=[oid,tail node ident]
For getting the node with the maximum out degree, I access batEdgeHead
and store the number of occurrences of every node id in a batAux.
batAux=[node ident, times]
After that I only select the node id with the largest number of
occurrences.
Now, we are ready to do the BFS. In every iteration, I have to save all
the tail nodes for exploring them in the next generation. This is the
pseudocode explanation:
Put the node id to NodesToVisit
        While there are NodesToVisit
                currentNode=NodesToVisit;
                edgeToVisit=get edges idents from batEdgeHead from
currentNode ident
                While there are edgesToVisit
                        currentEdge=edgesToVisit
                        nodeFinal=get tail node ident from batEdgeTail
from currentEdge ident
                        If notExist nodeFinal in NodesToVisit
                                put nodeFinal in NodesToVisit
                        endIf
                endWhile 
        endWhile
The main problem happens when I perform a select from batEdgeHead from
the ident node. batEdgeHead might become very large and accessing it
might be very expensive in terms of time. Is there any kind of indexing?
I tried with crackers and AVL index but, if I am not wrong, this is only
useful if I had to do the same selection more than once.
The number of the nodes is 325.000 and the number of edges 925.000. At
the end of this e-mail you will find the code.
May any one help in improving the performance of this code?
Best regards,
Miquel Àngel Àguila
Bat equivalence:
batEdge, batNode1 --> b1:= bat.new(:oid,:lng); 
batNode2 --> b2:= bat.new(:oid,:str); 
batEdgeHead --> rh:= bat.new(:oid,:lng);
batEdgeTail --> rt:= bat.new(:oid,:lng);
batType --> b4:= bat.new(:oid, int) (Type: 1. nodeType1, 2. nodeType2 ,
3. edgeType1, 4. edgeType2, 5. edgeType3)
Loaded the bats, here is the code:
Maxium out degree:-----------------------------------------------------
edart:=algebra.select(b4,3);
nodOut:=algebra.semijoin(rh,edart);
res:=bat.new(:lng,:int);
valor:=0;
barrier (mloop16:lng ,h16:oid ,tl16:lng ) := bat.newIterator(nodOut);
        bite1:=algebra.exist(res,tl16);
        barrier ifpart:= bite1;
                valor:=algebra.find(res, tl16);
                valor:=calc.+(valor,1);
                res:=bat.replace(res,tl16,valor);
        exit ifpart;
        barrier elsepart:= calc.!=(bite1,true);
                bat.insert(res,tl16,1);
        exit elsepart;
    redo (mloop16:lng ,h16:oid ,tl16:lng ) :=
bat.hasMoreElements(nodOut);
exit mloop16:lng ;
maxOutDegree:=aggr.max(res);
node:=algebra.select(res,maxOutDegree);
io.print(node);
nodart:=algebra.select(b4,1);
nodart2:=algebra.semijoin(b1,nodart);
ident2:=algebra.join(nodart2,node);
name2:=algebra.semijoin(b3,ident2);
secondtime:=alarm.time();
#Traversals------------------------------------------------------------
node2:=algebra.semijoin(b1,name2);
aux:=0:lng;
idoid:=0@0;
barrier (mloop17:lng ,h17:oid ,tl17:lng) := bat.newIterator(node2);
        aux:=tl17;
        idoid:=h17;
exit mloop17:lng ;
nod:=bat.new(:lng,:lng);
dif:=bat.new(:lng,:lng);
vis:=bat.new(:lng,:lng);
ar:=bat.new(:oid,:oid);
tr:=bat.new(:oid,:oid);
bat.insert(nod,aux,aux);
dif:=algebra.difference(nod,vis);
outer6:= aggr.count(dif);
rhtaux:=bat.new(:oid,:int);
rhaux:=bat.new(:oid,:lng);
rtaux:=bat.new(:oid,:lng);
rhtaux:=algebra.select(b4,3);
rtaux:=algebra.semijoin(rt,rhtaux);
rhaux:=algebra.semijoin(rh,rhtaux);
barrier outer:=true;
        barrier (mloop18:lng ,h18:lng ,tl18:lng ) :=
bat.newIterator(dif);
                # This is the problem, I need to do a lot of selects
                rhaux2:=algebra.select(rhaux,tl18);
                rtaux2:=algebra.semijoin(rtaux,rhaux2);
                barrier (mloop19:lng ,h19:oid,tl19:lng ) :=
bat.newIterator(rtaux2);
                        bite1:=algebra.exist(nod,tl19);
                        barrier ifpart:= bite1;
                                bat.insert(tr,h19,h19);
                        exit ifpart;
                        barrier elsepart:= calc.!=(bite1,true);
                                bat.insert(ar,h19,h19);
                                bat.insert(nod,tl19,tl19);
                        exit elsepart;
                    redo (mloop19:lng ,h19:oid ,tl19:lng ) :=
bat.hasMoreElements(rtaux2);
                exit mloop19:lng ;
                bat.insert(vis,h18,tl18);
            redo (mloop18:lng ,h18:lng ,tl18:lng ) :=
bat.hasMoreElements(dif);
        exit mloop18:lng ;
        dif:=algebra.difference(nod,vis);
        outer6:= aggr.count(dif);
        outer:=calc.!=(outer6,0);
        redo outer:=calc.!=(outer6,0);
exit outer;
------------------------------------------------------------------------------
_______________________________________________
MonetDB-users mailing list
MonetDB-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/monetdb-users
-- 
| Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl |
| CWI,  P.O.Box 94079 | http://www.cwi.nl/~manegold/  |
| 1090 GB Amsterdam   | Tel.: +31 (20) 592-4212       |
| The Netherlands     | Fax : +31 (20) 592-4199       |