Hello Miquel, first of all, I agree with you that MonetDB is the proper choice for your problem ;-) As a first quick answer --- I hope to come back with more once I find more time to dive into it --- here is one general hint of using MonetDB the most efficient way (in addition to the comments that Lefteris already gave). Your below code for calculating the "Maxium out degree" basically re-implements in MAL (and interpreted language) what does already exist as ((hopefully higly optimized) MAL primitive that is implemented in C (and hence compiled code: a histogram. In other words, to get the most out of MonetDB, one should not see and use MAL as an imperative programming language to iteratively process data items one at a time, but rather as a declarative query language to process large amount of data in one go (you may also think of it as something like "SIMD"). Here's a small example regarding your "Maxium out degree" code. My attached code compares the execution time (and results) of your "Maxium out degree" and a simple "aggr.histrgram()" call in MAL that does exactly the same. (I use our "microbenchmark" module to conveniently create a synthetic input BAT of 1000000 tuples with 10 different randomly distributed edgeHead values.) Here are the result for two runs of each approach: time for 'Maxium out degree': 1053135 usec [ "result of 'Maxium out degree'" ] #-----------------# # h t # name # lng int # type #-----------------# [ 9, 99597 ] [ 7, 100023 ] [ 4, 99919 ] [ 2, 100936 ] [ 6, 99970 ] [ 8, 99796 ] [ 0, 100056 ] [ 1, 100046 ] [ 5, 99942 ] [ 3, 99715 ] time for 'aggr.histogram()': 9864 usec [ "result of 'aggr.histogram()'" ] #-----------------# # h t # name # lng int # type #-----------------# [ 9, 99597 ] [ 7, 100023 ] [ 4, 99919 ] [ 2, 100936 ] [ 6, 99970 ] [ 8, 99796 ] [ 0, 100056 ] [ 1, 100046 ] [ 5, 99942 ] [ 3, 99715 ] time for 'Maxium out degree': 1035288 usec [ "result of 'Maxium out degree'" ] #-----------------# # h t # name # lng int # type #-----------------# [ 9, 99597 ] [ 7, 100023 ] [ 4, 99919 ] [ 2, 100936 ] [ 6, 99970 ] [ 8, 99796 ] [ 0, 100056 ] [ 1, 100046 ] [ 5, 99942 ] [ 3, 99715 ] time for 'aggr.histogram()': 9978 usec [ "result of 'aggr.histogram()'" ] #-----------------# # h t # name # lng int # type #-----------------# [ 9, 99597 ] [ 7, 100023 ] [ 4, 99919 ] [ 2, 100936 ] [ 6, 99970 ] [ 8, 99796 ] [ 0, 100056 ] [ 1, 100046 ] [ 5, 99942 ] [ 3, 99715 ] Note in particular that aggr.histogram() is more than 100(!) times faster than your code (for the reasons mentioned above). I expect that similar changes to the remainder of your code will have likewise positive effects. Please feel free to play yourself --- I know that getting familiar with "thinking the MonetDB way" takes some time. As mentioned above, I hope to come back with more sometime later (within the next days). Regards, Stefan On Wed, Apr 21, 2010 at 09:15:52AM +0200, Miguel Ángel Águila Lorente wrote:
Hello,
I'm working in a research problem and we need to create a graph database. We think that MonetDB can be useful to represent this graph since the use of bats may make the graph traversal efficient. Nevertheless, SQL is limited for the type of operations that we need to do, and that is why we have decided to resort to MAL.
Our database is like this:
- Nodes: a) node ident b) name - Edges: a) edge ident b) head node ident c) tail node ident
One of the operations is to find the node with maximum out degree and, after that, to do a BFS using this node as the source.
I have stored all the information in bats:
batNode1=[oid,node ident] batNode2=[oid,name] batEdge=[oid, edge ident] batEdgeHead=[oid, head node ident] batEdgeTail=[oid,tail node ident]
For getting the node with the maximum out degree, I access batEdgeHead and store the number of occurrences of every node id in a batAux.
batAux=[node ident, times]
After that I only select the node id with the largest number of occurrences.
Now, we are ready to do the BFS. In every iteration, I have to save all the tail nodes for exploring them in the next generation. This is the pseudocode explanation:
Put the node id to NodesToVisit While there are NodesToVisit currentNode=NodesToVisit; edgeToVisit=get edges idents from batEdgeHead from currentNode ident While there are edgesToVisit currentEdge=edgesToVisit nodeFinal=get tail node ident from batEdgeTail from currentEdge ident If notExist nodeFinal in NodesToVisit put nodeFinal in NodesToVisit endIf endWhile endWhile
The main problem happens when I perform a select from batEdgeHead from the ident node. batEdgeHead might become very large and accessing it might be very expensive in terms of time. Is there any kind of indexing? I tried with crackers and AVL index but, if I am not wrong, this is only useful if I had to do the same selection more than once. The number of the nodes is 325.000 and the number of edges 925.000. At the end of this e-mail you will find the code.
May any one help in improving the performance of this code?
Best regards,
Miquel Àngel Àguila
Bat equivalence: batEdge, batNode1 --> b1:= bat.new(:oid,:lng); batNode2 --> b2:= bat.new(:oid,:str); batEdgeHead --> rh:= bat.new(:oid,:lng); batEdgeTail --> rt:= bat.new(:oid,:lng); batType --> b4:= bat.new(:oid, int) (Type: 1. nodeType1, 2. nodeType2 , 3. edgeType1, 4. edgeType2, 5. edgeType3)
Loaded the bats, here is the code:
Maxium out degree:----------------------------------------------------- edart:=algebra.select(b4,3); nodOut:=algebra.semijoin(rh,edart); res:=bat.new(:lng,:int); valor:=0; barrier (mloop16:lng ,h16:oid ,tl16:lng ) := bat.newIterator(nodOut); bite1:=algebra.exist(res,tl16); barrier ifpart:= bite1; valor:=algebra.find(res, tl16); valor:=calc.+(valor,1); res:=bat.replace(res,tl16,valor); exit ifpart; barrier elsepart:= calc.!=(bite1,true); bat.insert(res,tl16,1); exit elsepart; redo (mloop16:lng ,h16:oid ,tl16:lng ) := bat.hasMoreElements(nodOut); exit mloop16:lng ;
maxOutDegree:=aggr.max(res); node:=algebra.select(res,maxOutDegree); io.print(node); nodart:=algebra.select(b4,1); nodart2:=algebra.semijoin(b1,nodart); ident2:=algebra.join(nodart2,node); name2:=algebra.semijoin(b3,ident2); secondtime:=alarm.time(); #Traversals------------------------------------------------------------ node2:=algebra.semijoin(b1,name2); aux:=0:lng; idoid:=0@0; barrier (mloop17:lng ,h17:oid ,tl17:lng) := bat.newIterator(node2); aux:=tl17; idoid:=h17; exit mloop17:lng ; nod:=bat.new(:lng,:lng); dif:=bat.new(:lng,:lng); vis:=bat.new(:lng,:lng); ar:=bat.new(:oid,:oid); tr:=bat.new(:oid,:oid); bat.insert(nod,aux,aux); dif:=algebra.difference(nod,vis); outer6:= aggr.count(dif); rhtaux:=bat.new(:oid,:int); rhaux:=bat.new(:oid,:lng); rtaux:=bat.new(:oid,:lng); rhtaux:=algebra.select(b4,3); rtaux:=algebra.semijoin(rt,rhtaux); rhaux:=algebra.semijoin(rh,rhtaux); barrier outer:=true; barrier (mloop18:lng ,h18:lng ,tl18:lng ) := bat.newIterator(dif); # This is the problem, I need to do a lot of selects rhaux2:=algebra.select(rhaux,tl18); rtaux2:=algebra.semijoin(rtaux,rhaux2); barrier (mloop19:lng ,h19:oid,tl19:lng ) := bat.newIterator(rtaux2); bite1:=algebra.exist(nod,tl19); barrier ifpart:= bite1; bat.insert(tr,h19,h19); exit ifpart; barrier elsepart:= calc.!=(bite1,true); bat.insert(ar,h19,h19); bat.insert(nod,tl19,tl19); exit elsepart; redo (mloop19:lng ,h19:oid ,tl19:lng ) := bat.hasMoreElements(rtaux2); exit mloop19:lng ; bat.insert(vis,h18,tl18); redo (mloop18:lng ,h18:lng ,tl18:lng ) := bat.hasMoreElements(dif); exit mloop18:lng ; dif:=algebra.difference(nod,vis); outer6:= aggr.count(dif); outer:=calc.!=(outer6,0); redo outer:=calc.!=(outer6,0); exit outer;
------------------------------------------------------------------------------ _______________________________________________ MonetDB-users mailing list MonetDB-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/monetdb-users
-- | Dr. Stefan Manegold | mailto:Stefan.Manegold@cwi.nl | | CWI, P.O.Box 94079 | http://www.cwi.nl/~manegold/ | | 1090 GB Amsterdam | Tel.: +31 (20) 592-4212 | | The Netherlands | Fax : +31 (20) 592-4199 |