[MonetDB-users] [PF/Tijah] How to do a full text search?
Dear all,

We are trying to do a full text search in a database, but we could use some help getting started. A small excerpt from our database is shown below:

//===
xquery>doc("ep")
<?xml version="1.0" encoding="utf-8"?>
<eprintsdata xmlns="http://eprints.org/ep2/data">
  <record>
    <field name="eprintid">629</field>
    <field name="keywords">Model-based on-the-fly Testing, Timed Automata, Real-Time Testing, TorX, Tools.</field>
    <field name="research_projects">STRESS: Systematic Testing of Real-time Software Systems</field>
  </record>
  <record>
    <field name="eprintid">642</field>
    <field name="keywords">Compositional modelling, supervisory control</field>
    <field name="research_projects">HYBRIDGE: Distributed Control and Stochastic Analysis of Hybrid Systems Supporting Safety Critical Real-Time Systems Design</field>
  </record>
</eprintsdata>
//===

The PF/Tijah Getting Started page (http://dbappl.cs.utwente.nl/pftijah/Documentation/GettingStarted) mentions that we can do a full text query as follows:

let $c := collection("MyCollection")
for $res in tijah:query($c, "//html[about(., ir db)]")
return $res/head/title

I believe this is probably outdated, as tijah:query does not work. To suit our needs, I changed the query to the following form:

//===
let $c := collection("ep")
for $res in pf:tijah-query($c,"//field[@name='research_projects'][about(., Distributed)]")
return $res
//===

But this results in the following error:

"ERROR = !ERROR: interpret: unknown variable 'collName'"

That is why I incorporated the collection name in the query, like this:

//===
for $res in pf:tijah-query(ep,"//field[@name='research_projects'][about(., Distributed)]")
return $res
//===

However, this does not work either, as I get the following error:

"ERROR = !illegal reference to context node: at (1,29-1,30): ``.'' is unbound"

I have tried numerous other variations, but none of them work. Can somebody please help us with an explanation of how to do a full text search within a certain node, or point us to a more recent getting started page (if one already exists)?

Kind regards,
Sander Bockting
Hej Sander,

Indeed, the syntax you now find on our documentation page is for a newer version than the one you are using. To work with that syntax, you would need to check out the development branch from the pathfinder CVS repository on sourceforge.net.

The release version of PF/Tijah did not have its own namespace for search functions. So you were right to replace tijah: by pf:tijah- in all search function calls.

If you want to perform full-text queries, you first have to create a full-text index. Unfortunately, with the release version you are using, this can only be done at the MIL level, not within XQuery. (I will attach a MIL script for creating an index as an example.)

Furthermore, if you want a certain ft-index (earlier also called a collection) to be used for the full-text query, you need to name it in a <TijahOptions collection="name"/> node and pass this node as the first parameter to your query:

let $opt := <TijahOptions collection="name"/>
return pf:tijah-query($opt, (), "nexi query string")

And one last remark: you cannot perform attribute selections in the NEXI query string. NEXI does not work with attributes at all.

Best,
Henning
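Applied to the excerpt from the original message, the two points above (pass a TijahOptions node as the first argument, and keep attribute tests out of the NEXI string) might be combined as in the following sketch. It assumes that a full-text index named "ep" has already been created at the MIL level, and that pf:tijah-query returns the matching element nodes, so that the attribute test can be done in plain XQuery afterwards. Neither assumption is confirmed in this thread.

```xquery
(: Sketch only: assumes an existing ft-index named "ep" (hypothetical name). :)
let $opt := <TijahOptions collection="ep"/>
(: NEXI part: text condition only, no attribute selection. :)
for $res in pf:tijah-query($opt, (), "//field[about(., Distributed)]")
(: XQuery part: the attribute filter that NEXI cannot express. :)
where $res/@name = "research_projects"
return $res
```

Moving the attribute test into the where clause keeps the NEXI string free of attributes, at the cost of filtering the ranked results after retrieval rather than inside the index.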
module(pftijah);
module(ascii_io);

var coll_param := new(str,str);
coll_param.insert("stemmer","snowball-english");
#coll_param.insert("tagFilter","HEADLINE,TEXT");
coll_param.insert("pf_collection","col");
#coll_param.insert("tokenizer","fast");
#coll_param.insert("fragmentSize","1000");
tj_init_collection("col",coll_param);

var corpus_path := "/local/rodeh/aquaint/xml_aggr/";
var names := new(void, str).seqbase(0@0);
names.append("corpus");
var seps := new(void, str).seqbase(0@0);
seps.append("\n");
var types := new(void, str).seqbase(0@0);
types.append("str");
var filename := corpus_path + "corpus.lst";
filename.print();
var tmp := load(names, seps, types, filename, -1);
var corpus := tmp.find("corpus").seqbase(1@0);
var url_corpus := [+](const corpus_path, corpus);
var docs := url_corpus.reverse().join(corpus);
tj_add2collection("col",docs,true);
Dear Henning,
If you want to perform full-text queries, you would first have to create a full-text index. Unfortunately, with the release version you are using, this can only be done on MIL level, not within XQuery. (I will attach a mil script for creating an index as an example)
Thank you for your reply. Unfortunately, I still do not know how to create an index. For instance, I do not know how to run the script. I tried it as shown below. Is this the right way to invoke a MIL script?

===
[sander@localhost ~]$ Mserver --dbinit="module(pathfinder);" < index_coll.mil
# Monet Database Server V4.16.0
# Copyright (c) 1993-2007, CWI. All rights reserved.
# Compiled for i686-redhat-linux-gnu/32bit with 32bit OIDs; dynamically linked.
# Visit http://monetdb.cwi.nl/ for further information.
MonetDB>MonetDB>MonetDB>MonetDB>MonetDB>MonetDB>MonetDB>MonetDB>MonetDB>MonetDB>!ERROR: # tj_init_global() unkonwn parameter [pf_collection].
MonetDB>MonetDB>MonetDB>MonetDB>MonetDB>MonetDB>MonetDB>MonetDB>MonetDB>MonetDB>[ "/local/rodeh/aquaint/xml_aggr/corpus.lst" ]
MonetDB>!ERROR: could not open file /local/rodeh/aquaint/xml_aggr/corpus.lst
!ERROR: ascii_io_load: operation failed.
MonetDB>!ERROR: interpret: no matching MIL operator to 'find(void, str)'.
!MAYBE YOU MEAN:
! find(BAT[any::1,any::2], any::1) : any::2
! find(any::1, oid) : any::1
!ERROR: interpret_params: seqbase(param 1): evaluation error.
MonetDB>MonetDB>!ERROR: interpret: no matching MIL operator to 'reverse(str)'.
!MAYBE YOU MEAN:
! reverse(BAT[any::1,any::2]) : BAT[any::2,any::1]
!ERROR: interpret_params: join(param 1): evaluation error.
MonetDB>!ERROR: interpret: no matching MIL operator to 'tj_add2collection(str, void, bit)'.
!MAYBE YOU MEAN:
! tj_add2collection(str, BAT[str,str], bit) : void
! tj_add2collection(str, str, str, bit) : void
===

As you can see, this results in a number of errors. I have not altered your script, as I do not know what all the parameters and variables mean, and I do not know where I can find their meaning. Obviously, some things need to be changed for us to be able to use the script, but I do not know what. What is a .lst file and what does it look like? And is the first error (tj_init_global() unkonwn parameter [pf_collection].) to be expected?

Can you, or someone else of course, provide us with a script that has some more comments explaining what everything means, how things should be changed, and what the external files should look like? Or point us to a location where we can find that information? Or is it more advisable to use the latest branch of MonetDB, as these kinds of problems simply do not occur there?

Kind regards,
Sander Bockting
Participants (2): Henning Rode, Sander Bockting