
Version 2 (modified by antonak, 13 years ago)

--

Lucene Index based Search Manual

Provides GCC developers with a search over (1) your database contents (you choose which fields to search) and (2) ontology terms that OntoCAT retrieves for a specific keyword from the EBI Ontology Lookup Service (OLS) and the NCBO BioPortal.

Biobank search on DB/OntoCAT contents, based on Lucene indexing and featuring query expansion using ontologies

How to set up, configure & run

(Run a molgenis project: molgenis4phenotype or your own)

  1. Download & install molgenis (http://www.molgenis.org/wiki/MolgenisOnWindows)
  1. Download the latest version of molgenis4phenotype from http://www.molgenis.org/svn/molgenis_projects/molgenis4phenotype
  1. Set the build path (add the Lucene/OntoCAT/ols/owlapi/wsdl jars)

Make sure the following jars exist in your WEB-INF/lib directory:

  • lucene-core-3.0.2.jar
  • lucene-demos-3.0.2.jar
  • lucene-highlighter-3.0.1.jar
  • lucene-memory-3.0.2.jar
  • ols-client.jar
  • ontoCAT_v0.9.4-SNAPSHOT.jar
  • opencsv-1.8.jar
  • owlapi-bin.jar
  • owlapi-src.jar
  • wsdl4j-1.6.2.jar
  • xpp3_min-1.1.4c.jar

Next, select Project Properties -> Java Build Path -> Libraries -> Add JARs and select the jars listed above.

The libraries may already be present in the preconfigured Web App Libraries; if so, the current step (3) can be skipped.

  1. Select one of the two preconfigured sets of files (AnimalDB or lifelines), or create your own. The project is currently set up for AnimalDB. The respective configuration includes the following (use the corresponding files as a template if you create your own):
  1. animaldb.molgenis.properties:

db_user = molgenis

db_password = molgenis

  1. Create the database animaldb_pheno (or your own) as follows:

mysql> create database animaldb_pheno;

mysql> grant all privileges on animaldb_pheno.* to molgenis@localhost identified by 'molgenis';

mysql> flush privileges;

c. Plugin files already exist in molgenis4phenotype. For your own project:

Add the following menu to molgenis_ui.xml:

{{{
<!-- Lucene biobank search plugin -->
<menu name="submenu" position="left" label="Indexing...">
  <plugin name="DBIndex" label="DB Index and Search" type="plugins.LuceneIndex.DBIndexPlugin" />
  <plugin name="GenericWizard" type="plugin.genericwizard.GenericWizard" label="Excel upload"/>
  <plugin name="OntoCatIndexPlugin2" label="Index OntoCAT" type="plugins.LuceneIndex.OntoCatIndexPlugin2" />
</menu>
}}}

  1. Run AnimalDBGenerate.java.
  1. Run AnimalDBUpdateDatabase.java.
  1. The files are now generated. Replace the generated files with the plugin's files. Also add LuceneIndexConfiguration.properties to the directory where all .properties files live (/molgenis4phenotypeWorkspace/molgenis4phenotype).
  1. Adjust the following in the LuceneIndexConfiguration.properties configuration file:

Set the number of DB fields in which a search makes sense (mainly description fields), and fill in the names of those fields.

Also choose whether to use ontologies in query expansion [1] by setting useOntologiesInQueryExpansion to true.

[1] If you use ontologies for query expansion, follow the instructions in section B.
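As an illustration of the settings described above, the following sketch parses a configuration of this shape with java.util.Properties. Only useOntologiesInQueryExpansion is named in this manual; the keys numberOfFields and field.N are hypothetical placeholders, so check the shipped LuceneIndexConfiguration.properties for the real names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class IndexConfigSketch {
    // Hypothetical file contents; only useOntologiesInQueryExpansion comes from the manual.
    static final String SAMPLE =
        "numberOfFields = 2\n" +
        "field.1 = description\n" +
        "field.2 = name\n" +
        "useOntologiesInQueryExpansion = true\n";

    // Parse properties text (same syntax as a .properties file on disk).
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Collect the field names listed under the hypothetical field.1 .. field.N keys.
    static List<String> searchableFields(Properties props) {
        int n = Integer.parseInt(props.getProperty("numberOfFields", "0").trim());
        List<String> fields = new ArrayList<String>();
        for (int i = 1; i <= n; i++) {
            String name = props.getProperty("field." + i);
            if (name != null) {
                fields.add(name.trim());
            }
        }
        return fields;
    }

    // Query expansion is off unless the flag is exactly "true" (case-insensitive).
    static boolean useOntologies(Properties props) {
        return Boolean.parseBoolean(
            props.getProperty("useOntologiesInQueryExpansion", "false").trim());
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        System.out.println(searchableFields(props) + " expand=" + useOntologies(props));
    }
}
```

With a parser like this one, a value written with quotation marks would not be recognised as true, so the flag is best written without quotes.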

Building & searching an index on database contents

  1. Fill in data (if you use the Animal DB project). Run on the server and select “System Tasks -> Fill in Database” from the menu. You can also load the latest animal db files from the same page.
  1. The DB is now ready to be searched. From the main menu, select “Indexing -> DB Index & Search”.

The index is created in the folder defined by the variable LUCENE_INDEX_DIRECTORY. If the directory does not exist, it is created. If the directory already has contents, the index is NOT created, so be sure to empty it before building a new index.
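The directory rule above can be sketched as a small pre-flight check. This is an illustration only, not the plugin's actual code: the class and method names are hypothetical, and the demo works on a throwaway temporary directory rather than on LUCENE_INDEX_DIRECTORY.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class IndexDirectoryCheck {
    /** Returns true if a new index may be built in dir; creates dir if it is missing. */
    static boolean prepareIndexDirectory(Path dir) {
        try {
            if (Files.notExists(dir)) {
                Files.createDirectories(dir);   // missing directory: create it
                return true;
            }
            if (!Files.isDirectory(dir)) {
                return false;                   // path exists but is a plain file
            }
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                return !entries.iterator().hasNext();   // only an empty directory is accepted
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs the check twice on a temp directory: once while missing, once with contents. */
    static boolean[] demo() {
        try {
            Path dir = Files.createTempDirectory("lucene-sketch").resolve("index");
            boolean first = prepareIndexDirectory(dir);       // missing, so it is created
            Files.createFile(dir.resolve("leftover.txt"));    // simulate an old index
            boolean second = prepareIndexDirectory(dir);      // non-empty, so it is refused
            return new boolean[] { first, second };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("missing dir accepted: " + result[0]
            + ", non-empty dir accepted: " + result[1]);
    }
}
```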

Once the index has been created, you can search by entering a term (or a sentence) in the search box.

If useOntologiesInQueryExpansion is set to true, the query is expanded with terms retrieved from the downloaded ontologies that live in LUCENE_INDEX_DIRECTORY.
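The manual does not describe how the plugin expands a query, so the following is only a minimal sketch of the general idea: each query word is OR-ed with related terms looked up in a map. The class name, the synonym map and its contents are assumptions made for illustration; in the real plugin the related terms come from the downloaded ontologies.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class QueryExpansionSketch {
    /**
     * Builds an OR query from the original words plus any related terms.
     * Multi-word terms are quoted so a query parser treats them as phrases.
     */
    static String expand(String query, Map<String, List<String>> relatedTerms) {
        Set<String> terms = new LinkedHashSet<String>();   // keeps order, drops duplicates
        for (String word : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            terms.add(word);
            List<String> related = relatedTerms.get(word);
            if (related != null) {
                terms.addAll(related);
            }
        }
        StringBuilder expanded = new StringBuilder();
        for (String term : terms) {
            if (expanded.length() > 0) {
                expanded.append(" OR ");
            }
            expanded.append(term.contains(" ") ? "\"" + term + "\"" : term);
        }
        return expanded.toString();
    }

    public static void main(String[] args) {
        // Hypothetical ontology lookup result for one word.
        Map<String, List<String>> related =
            java.util.Collections.singletonMap("tumor", java.util.Arrays.asList("neoplasm", "tumour"));
        System.out.println(expand("Tumor size", related));
    }
}
```

The resulting string uses the OR operator and quoted phrases of the standard Lucene query syntax, so it could be handed to a query parser in place of the user's original input.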

Building & searching an index on Ontocat contents

http://sourceforge.net/projects/ontocat/

In order to build an index based on Ontocat contents some adjustments must be made:

  1. Set the VM arguments for OntoCatIndexPlugin2.java to -Xms1024M -Xmx1024M

(Select project -> Run As -> Run Configurations -> Arguments, then add -Xms1024M -Xmx1024M under VM arguments)

  1. Enter the term you want to retrieve from the online ontology resources and press “Build Ontocat Index”.
  1. The index is created in the folder defined by the variable LUCENE_ONTOINDEX_DIRECTORY. If the directory does not exist, it is created. If the directory already has contents, the index is NOT created, so be sure to empty it before building a new index.
  1. Creating the index may take a while, depending on the response time of the ontology resources that OntoCAT talks to (the EBI Ontology service). Once it is complete, you can search by entering a term (or a sentence) in the search box.
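To confirm that the VM arguments from step 1 actually reached the JVM, a quick check such as the following can be run with the same run configuration. The class name is hypothetical, and the reported figure may be somewhat below 1024 MB depending on the JVM and garbage collector in use.

```java
final class HeapCheck {
    /** Maximum heap the JVM will try to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMb() + " MB");
    }
}
```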

B. How to run query expansion enabled search (using ontologies)

1) Download the ontologies from http://bioportal.bioontology.org/

You should download

  • (http://rest.bioontology.org/bioportal/ontologies/download/44307?applicationid=4ea81d74-8960-4525-810b-fa1baab576ff)
  • Human Disease (http://rest.bioontology.org/bioportal/ontologies/download/44309?applicationid=4ea81d74-8960-4525-810b-fa1baab576ff)
  • NCI Thesaurus (http://rest.bioontology.org/bioportal/ontologies/download/42838?applicationid=4ea81d74-8960-4525-810b-fa1baab576ff)

MeSH can be taken from biobank_search\WebContent\WEB-INF

2) Change the directory names:

  • In DBIndexPlugin: LUCENE_INDEX_DIRECTORY, INDEX_CONFIGURATION
  • In OntoCatIndexPlugin2: LUCENE_ONTOINDEX_DIRECTORY, ONTOLOGIES_DIRECTORY

C. Lucene scoring

 

Lucene scoring uses a combination of the [http://en.wikipedia.org/wiki/Vector_Space_Model Vector Space Model (VSM) of Information Retrieval] and the Boolean model to determine how relevant a given Document is to a User's query.

In general, the idea behind the VSM is that the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. Lucene uses the Boolean model to first narrow down the documents that need to be scored, based on the boolean logic in the Query specification.

Lucene also adds some capabilities and refinements onto this model to support boolean and fuzzy searching, but it essentially remains a VSM based system at the heart. For some valuable references on VSM and IR in general refer to the [http://wiki.apache.org/lucene-java/InformationRetrieval Lucene Wiki IR references].
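The term-weighting idea described above is commonly written as a tf-idf weight. The sketch below uses the textbook form tf * ln(N / df); it is not Lucene's exact formula, since Lucene's Similarity applies its own tf and idf functions plus additional normalization and boost factors. The class and parameter names are illustrative.

```java
final class TfIdfSketch {
    /**
     * Textbook tf-idf weight of one term in one document.
     * termFreqInDoc: occurrences of the term in the document
     * docsContainingTerm: number of documents in the collection containing the term
     * totalDocs: number of documents in the collection
     */
    static double tfIdf(int termFreqInDoc, int docsContainingTerm, int totalDocs) {
        if (termFreqInDoc == 0 || docsContainingTerm == 0) {
            return 0.0;   // term absent from the document or from the collection
        }
        return termFreqInDoc * Math.log((double) totalDocs / docsContainingTerm);
    }

    public static void main(String[] args) {
        // Same in-document frequency, but the rarer term gets the higher weight.
        System.out.println("rare term:   " + tfIdf(3, 10, 1000));
        System.out.println("common term: " + tfIdf(3, 500, 1000));
    }
}
```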

 

(see more in Appendix B)

The score for a document given a query is the cosine of the angle formed between the query vector and the document vector. The explain() method can be used to show exactly how the score was calculated for a given query and a given document. The result of explain() (explanation.toString()) is presented to the user.
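The cosine measure mentioned above can be illustrated without Lucene. The sketch below treats the query and the document as sparse maps from term to weight and computes the cosine of the angle between them; the class name and the sample weights are made up for illustration, and Lucene's practical scoring function adds further factors on top of this idea.

```java
import java.util.HashMap;
import java.util.Map;

final class CosineSketch {
    /** Cosine of the angle between two sparse term-weight vectors (0 if either is empty). */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            double weight = entry.getValue();
            normA += weight * weight;
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += weight * other;   // only shared terms contribute to the dot product
            }
        }
        for (double weight : b.values()) {
            normB += weight * weight;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> query = new HashMap<String, Double>();
        query.put("mouse", 1.0);
        query.put("tumor", 1.0);

        Map<String, Double> document = new HashMap<String, Double>();
        document.put("mouse", 2.0);
        document.put("cage", 1.0);

        System.out.println("similarity = " + cosine(query, document));
    }
}
```

A value near 1 means the document's term weights point in nearly the same direction as the query's; a value of 0 means they share no terms.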