Changes between Version 1 and Version 2 of LuceneIndexBasedSearchManual


Timestamp:
2011-02-14T15:28:37+01:00 (13 years ago)
Author:
antonak
Comment:

--

  • LuceneIndexBasedSearchManual

    v1 v2  
    1 = '''Lucene Index based Search Manual ''' =
     1= '''Lucene Index based Search Manual ''' =
    22Provide GCC developers with a search on (1) your database contents (you can pick which fields to search) and (2) ontology terms retrieved through OntoCAT for a specific keyword from the EBI Ontology Lookup Service (OLS) and the NCBO BioPortal.
    33
     4= Biobank Search on DB/OntoCAT based on Lucene Indexing, featuring Query Expansion using Ontologies =
     5== How to set up , configure & run ==
     6== (Run molgenis project   - molgenis4phenotype or in your own) ==
     7 1. Download & install molgenis (http://www.molgenis.org/wiki/MolgenisOnWindows)
    48
    5 = Biobank Search on DB/OntoCAT based on Lucene Indexing, featuring Query Expansion using Ontologies =
     9 2. Download latest version of molgenis4phenotype from http://www.molgenis.org/svn/molgenis_projects/molgenis4phenotype
    610
    7 ==  How to set up , configure & run ==
    8 
    9 == (Run molgenis project   - molgenis4phenotype or in your own) ==
    10  
    11 
    12 1.     Download & install molgenis (http://www.molgenis.org/wiki/MolgenisOnWindows)
    13 
    14 2.     Download latest version of molgenis4phenotype from http://www.molgenis.org/svn/molgenis_projects/molgenis4phenotype
    15 
    16 3.     Set Build path (add Lucene/Ontocat/ols/owlapi/wsdl/ jars)
     11 3. Set Build path (add Lucene/Ontocat/ols/owlapi/wsdl/ jars)
    1712
    1813Make sure the following jars exist in your WEB-INF/lib directory:
    1914
    20 §  lucene-core-3.0.2.jar
    21 
    22 §  lucene-demos-3.0.2.jar
    23 
    24 §  lucene-highlighter-3.0.1.jar
    25 
    26 §  lucene-memory-3.0.2.jar
    27 
    28 §  ols-client.jar
    29 
    30 §  ontoCAT_v0.9.4-SNAPSHOT.jar
    31 
    32 §  opencsv-1.8.jar
    33 
    34 §  owlapi-bin.jar
    35 
    36 §  owlapi-src.jar
    37 
    38 §  wsdl4j-1.6.2.jar
    39 
    40 §  xpp3_min-1.1.4c.jar
     15 * lucene-core-3.0.2.jar
     16 * lucene-demos-3.0.2.jar
     17 * lucene-highlighter-3.0.1.jar
     18 * lucene-memory-3.0.2.jar
     19 * ols-client.jar
     20 * ontoCAT_v0.9.4-SNAPSHOT.jar
     21 * opencsv-1.8.jar
     22 * owlapi-bin.jar
     23 * owlapi-src.jar
     24 * wsdl4j-1.6.2.jar
     25 * xpp3_min-1.1.4c.jar
    4126
    4227Next, select Project Properties → Java Build Path → Libraries → Add JARs and select the jars listed above.
    4328
    44 '''''__The libraries may also exist in the preconfigured Web App Libraries , if so , the current step (3) can be skipped .   __'''''
     29  ''''' __ The libraries may also exist in the preconfigured Web App Libraries , if so , the current step (3) can be skipped .   __ '''''
    4530
    46 4.     Select one of the two preconfigured sets of files (AnimalDB or LifeLines), or create your own. The project is currently set up for AnimalDB. The respective configuration includes the following (consult the corresponding files if you want to create your own):
     31 4. Select one of the two preconfigured sets of files (AnimalDB or LifeLines), or create your own. The project is currently set up for AnimalDB. The respective configuration includes the following (consult the corresponding files if you want to create your own):
     32 4.
    4733
    48 a.     animaldb.molgenis.properties :
     34 a. animaldb.molgenis.properties :
    4935
    5036''db_user = molgenis''
     
    5238''db_password = molgenis''
    5339
    54 b.    Create db : animaldb_pheno or yours as follows 
     40 a. Create db : animaldb_pheno or yours as follows
    5541
    5642''mysql> create database animaldb_pheno;''
     
    5844''mysql> grant all privileges on animaldb_pheno.* to molgenis@localhost identified by 'molgenis'; flush privileges;''
    5945
    60 '' ''
    61 
    62 ''c.          Plugin files already exist in molgenis4phenotype. For your own project:''
     46''c.          Plugin files already exist in molgenis4phenotype. For your own project:''
    6347
    6448''Add in molgenis_ui.xml the menu:''
    6549
    66 ''<!-- Lucene biobank search plugin  -->''
     50<!-- Lucene biobank search plugin --> {{{
    6751
    68 '' <menu name="submenu" position="left" label="Indexing...">''
     52  <menu name="submenu" position="left" label="Indexing...">
    6953
    70 ''<plugin name="DBIndex" label="DB Index and Search" type="plugins.!LuceneIndex.DBIndexPlugin" />''
     54  <plugin name="DBIndex" label="DB Index and Search" type="plugins.LuceneIndex.DBIndexPlugin" />
    7155
    72 ''     <plugin name="!GenericWizard" type="plugin.genericwizard.!GenericWizard" label="Excel upload"/>''
     56  <plugin name="GenericWizard" type="plugin.genericwizard.GenericWizard" label="Excel upload"/> <plugin name="OntoCatIndexPlugin2" label="Index OntoCAT" type="plugins.LuceneIndex.OntoCatIndexPlugin2" />
    7357
    74 ''     <plugin name="!OntoCatIndexPlugin2" label="Index OntoCAT"  type="plugins.!LuceneIndex.!OntoCatIndexPlugin2" />''
     58</menu> }}}
    7559
    76 ''</menu>''
     60 a. AnimalDBGenerate.java : run
    7761
    78 '' ''
     62 a. AnimalDBUpdateDatabase.java: run
    7963
    80 d.    AnimalDBGenerate.java : run
     64 a. The files are now generated. Replace the generated files with the plugin's files. Also add ''__!LuceneIndexConfiguration.properties __''to the directory where all .properties files live (/molgenis4phenotypeWorkspace/molgenis4phenotype)
    8165
    82 e.     AnimalDBUpdateDatabase.java: run
    83 
    84 f.      The files are now generated. Replace the generated files with the plugin's files. Also add ''__!LuceneIndexConfiguration.properties __''to the directory where all .properties files live (/molgenis4phenotypeWorkspace/molgenis4phenotype)
    85 
    86 g.     Adjust in ''__!LuceneIndexConfiguration.properties configuration__'' file:
     66 a. Adjust in ''__!LuceneIndexConfiguration.properties configuration__'' file:
    8767
    8868Set the number of DB fields in which a search makes sense (mainly description fields), and fill in the names of those fields.
     
    9070Also choose whether to use ontologies in query expansion [http://www.molgenis.org/#_ftn1 "[1]"] by setting'' __useOntologiesInQueryExpansion = "true"__''
    9171
    92 [[BR]]
    93 ----
    9472[http://www.molgenis.org/#_ftnref "[1]"] If you use ontologies for query expansion, follow the instructions in section B.
    9573
     74== Building & searching an index on database contents ==
     75 a. Fill in data (if the AnimalDB project is used). Run the project on the server and select “System Tasks → Fill in Database” from the menu. You can also load the latest AnimalDB files from the same page.
    9676
    97 == Building & searching an index on database contents ==
    98  
    99 
    100 a.     Fill in data (if the AnimalDB project is used). Run the project on the server and select “System Tasks → Fill in Database” from the menu. You can also load the latest AnimalDB files from the same page.
    101 
    102 b.     The database is now ready to be searched. From the main menu, select “Indexing → DB Index & Search”.
     77 a. The database is now ready to be searched. From the main menu, select “Indexing → DB Index & Search”.
    10378
    10479The index is created in the folder defined by the variable LUCENE_INDEX_DIRECTORY. If the directory does not exist, it is created. If the directory already has contents, the index is NOT created. __(Be sure to remove the contents before building a new index.)__
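The directory rule described above (create the folder if missing, refuse to build when it already has contents) can be sketched as follows. This is only an illustration using the standard java.io.File API; the class and method names are hypothetical and are not taken from DBIndexPlugin.

```java
import java.io.File;

// Hypothetical helper, not part of the plugin source.
final class IndexDirectoryCheck {

    /**
     * Returns true if an index may be built in dir:
     * a missing directory is created, an existing one must be empty.
     */
    static boolean prepareIndexDirectory(File dir) {
        if (!dir.exists()) {
            return dir.mkdirs();          // create the directory on first use
        }
        String[] entries = dir.list();    // null when dir is not a directory
        return entries != null && entries.length == 0;
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"),
                "lucene-index-demo-" + System.nanoTime());
        // First call: the directory does not exist yet and is created.
        System.out.println(prepareIndexDirectory(dir));
        // Simulate leftover index files from an earlier run.
        new File(dir, "old-segment").mkdir();
        // Second call: the directory has contents, so no index would be built.
        System.out.println(prepareIndexDirectory(dir));
    }
}
```

A caller would run such a check before opening the Lucene index writer and ask the user to clear the folder when it returns false.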
    10580
    106  
    107 
    10881After the creation of the index is complete you can search by entering a term (or a sentence) in the search box.
    10982
    110  
    111 
    11283If useOntologiesInQueryExpansion is selected (true), the query is expanded with terms retrieved from the downloaded ontologies that live in LUCENE_INDEX_DIRECTORY
    113 
    11484
    11585== Building & searching an index on Ontocat contents ==
     
    11888In order to build an index based on Ontocat contents some adjustments must be made:
    11989
    120 1.      Set the VM arguments for !OntoCatIndexPlugin2.java to “-Xms1024M -Xmx1024M”
     90 1. Set the VM arguments for !OntoCatIndexPlugin2.java to “-Xms1024M -Xmx1024M”
    12191
    12292(Select project → Run As → Run Configurations → Arguments, then add -Xms1024M -Xmx1024M under VM arguments)
    12393
    124 2.     Enter the term to retrieve from the online ontology resources and press “Build Ontocat Index”.
     94 2. Enter the term to retrieve from the online ontology resources and press “Build Ontocat Index”.
    12595
    126  
     96 3. The index is created in the folder defined by the variable LUCENE_ONTOINDEX_DIRECTORY. If the directory does not exist, it is created. If the directory already has contents, the index is NOT created. __(Be sure to remove the contents before building a new index.)__
    12797
    128 3.     The index is created in the folder defined by the variable LUCENE_ONTOINDEX_DIRECTORY. If the directory does not exist, it is created. If the directory already has contents, the index is NOT created. __(Be sure to remove the contents before building a new index.)__
    129 
    130  
    131 
    132 4.     Index creation may take a while, depending on the response time of the ontology resources that OntoCAT talks to (the EBI Ontology service). Once it is complete, you can search by entering a term (or a sentence) in the search box.
    133 
    134  
    135 
    136  
    137 
     98 4. Index creation may take a while, depending on the response time of the ontology resources that OntoCAT talks to (the EBI Ontology service). Once it is complete, you can search by entering a term (or a sentence) in the search box.
    13899
    139100== B. How to run query expansion enabled search (using ontologies) ==
    140 1)            Download the ontologies from !http://bioportal.bioontology.org/
     1011)            Download the ontologies from !http://bioportal.bioontology.org/
    141102
    142103You should download
    143104
    144 -            (!http://rest.bioontology.org/bioportal/ontologies/download/44307?applicationid=4ea81d74-8960-4525-810b-fa1baab576ff)
     105 * (!http://rest.bioontology.org/bioportal/ontologies/download/44307?applicationid=4ea81d74-8960-4525-810b-fa1baab576ff)
    145106
    146 -            Human Disease (!http://rest.bioontology.org/bioportal/ontologies/download/44309?applicationid=4ea81d74-8960-4525-810b-fa1baab576ff)
     107 * Human Disease (!http://rest.bioontology.org/bioportal/ontologies/download/44309?applicationid=4ea81d74-8960-4525-810b-fa1baab576ff)
    147108
    148 -            NCI Thesaurus (!http://rest.bioontology.org/bioportal/ontologies/download/42838?applicationid=4ea81d74-8960-4525-810b-fa1baab576ff)
     109 * NCI Thesaurus (!http://rest.bioontology.org/bioportal/ontologies/download/42838?applicationid=4ea81d74-8960-4525-810b-fa1baab576ff)
    149110
    150111MeSH can be taken from biobank_search\!WebContent\WEB-INF
    151112
    152 2)            Change the directory names:
     1132)            Change the directory names:
    153114
    154 -            In DBIndexPlugin: LUCENE_INDEX_DIRECTORY, INDEX_CONFIGURATION
     115 * In DBIndexPlugin: LUCENE_INDEX_DIRECTORY, INDEX_CONFIGURATION
    155116
    156 -            In !OntoCatIndexPlugin2: LUCENE_ONTOINDEX_DIRECTORY, ONTOLOGIES_DIRECTORY
    157 
    158  
    159 
    160  
    161 
    162 
     117 * In !OntoCatIndexPlugin2: LUCENE_ONTOINDEX_DIRECTORY, ONTOLOGIES_DIRECTORY
     118 * [[BR]]
    163119== C. Lucene scoring ==
    164120 
     
    183139
    184140 
    185 
    186 
    187 = Appendix =
    188 
    189 = A. Information Retrieval with Query Expansion =
    190  
    191 
    192 General ideas:
    193 
    194 1) Query expansion adds terms related to the initial query terms to the query. The added terms need not all be contained in a document (it would be difficult to find a document in a database containing every term of ["exercise-induced asthma", "chronic obstructive asthma with acute exacerbation", "exercise-induced asthma (disorder)", "bronchial hypersensitivity", "chronic obstructive asthma", "chronic obstructive asthma with status asthmaticus", "bronchial hyperreactivity", "exercise induced asthma", "cough variant asthma", "intrinsic asthma", "status asthmaticus", "allergic asthma"]), which is why they should be appended with the “OR” operator and can be assigned a lower weight. Such query expansion therefore usually changes the document ranking, and consequently the order of retrieved documents in the output, rather than significantly changing the number of documents retrieved.
    195 
    196  
    197 
    198 2) What terms to add? The added terms should be very close to the query term, which is why in information retrieval synonyms and children (terms related to the query term by an IS_A relationship) are usually added as expansion terms. For example, if a user enters a broad query such as lung disease, query expansion will add documents concerning narrower terms, such as pneumonia (a child node).
    199 
    200  
    201 
    202 3) It is very important to have good ontologies at hand. Otherwise the expansion terms may turn out to be very inaccurate. This is the problem with nonscientific terminology: it’s practically impossible to construct an accurate ontology, due to the vagueness of words of natural languages. Synonymy is very approximate here and it’s difficult to determine where exactly in the ontology tree the term is to be put. Scientific terminology is much better in this respect, because it is much more exact. Of course there is still some inaccuracy, but query expansion can be efficient.
    203 
    204  
    205 
    206 4) Even if query expansion itself doesn’t improve the search, the query can be made more precise: if some of the terms are found in the ontologies, they are put in quotation marks, thus avoiding wrong results.
    207 
    208 For example, if the user enters the query cystic lung disease without quotation marks, documents that merely contain the word disease will also be retrieved:
    209 
    210  
    211 
    212 (1) New diagnosis of heart disease since last study visit
    213 
    214 (2) The score on the Unified Parkinson's Disease Rating Scale
    215 
    216  
    217 
    218 During query expansion cystic lung disease will be found in ontologies and the query will become: asthma “cystic lung disease” OR (…expansion terms…).  This query won’t find those two irrelevant documents, because of the quotation marks.
    219 
    220  
    221 
    222 Ontologies
    223 
    224 What ontologies to use?
    225 
    226 It should be decided by the user in accordance with his query. He should be given the list of ontologies to choose:
    227 
    228 I would propose to choose among the following ontologies:
    229 
    230  
    231 
    232 •            Human Phenotype Ontology
    233 
    234  
    235 
    236 •            Human disease
    237 
    238  
    239 
    240 •            NCI Thesaurus
    241 
    242  
    243 
    244 •            Medical Subject Headings
    245 
    246  
    247 
    248 •            International Classification of Diseases (!http://bioportal.bioontology.org/visualize/35686)
    249 
    250 -Synonyms are graphical variants used in special cases
    251 
    252  
    253 
    254 •            Online Mendelian Inheritance in Man (!http://bioportal.bioontology.org/visualize/40398)
    255 
    256 (The relation "manifestation of" may be useful, though too broad)
    257 
    258 - Practically no synonyms
    259 
    260  
    261 
    262 How to search the ontologies?
    263 
    264 The search is performed by OntoCAT.
    265 
    266 In this project I tried different ways of accessing the ontologies:
    267 
    268 (1)            Directly on !BioPortal
    269 
    270 (2)            Downloading the ontologies on local computer
    271 
    272 (3)            Indexing them and searching in the index files
    273 
    274 The third variant turned out to be significantly faster, so it is used in the project.
    275 
    276  
    277 
    278 What is done?
    279 
    280 The user can first index the database and the ontologies, and then search for relevant database entries.
    281 
    282 The user enters a query in the text box and chooses whether to expand it. With “Search with query expansion” selected, the query is expanded with synonyms and children from the indexed ontologies (Human disease, Human Phenotype ontology? and NCI Thesaurus). Lucene then performs the search in the indexed database.
    283 
    284  
    285 
    286 
    287 = B. Lucene Indexing: scoring =
    288 
    289 === Fields and Documents ===
    290  
    291 
    292 In Lucene, the objects we are scoring are Documents. A Document is a collection of Fields. Each Field has semantics about how it is created and stored (i.e. tokenized, untokenized, raw data, compressed, etc.). It is important to note that Lucene scoring works on Fields and then combines the results to return Documents. This is important because two Documents with the exact same content, but one having the content in two Fields and the other in one Field, will return different scores for the same query due to length normalization (assuming the !DefaultSimilarity on the Fields).
    293 
    294  
    295 
    296 
    297 === Score Boosting ===
    298  
    299 
    300 Lucene allows influencing search results by "boosting" in more than one level:
    301 
    302  
    303 
    304 ·      Document level boosting - while indexing - by calling document.setBoost() before a document is added to the index.
    305 
    306 ·      Document's Field level boosting - while indexing - by calling field.setBoost() before adding a field to the document (and before adding the document to the index).
    307 
    308 ·      Query level boosting - during search, by setting a boost on a query clause, calling Query.setBoost().
    309 
    310  
    311 
    312 Indexing time boosts are preprocessed for storage efficiency and written to the directory (when writing the document) in a single byte (!) as follows: For each field of a document, all boosts of that field (i.e. all boosts under the same field name in that doc) are multiplied. The result is multiplied by the boost of the document, and also multiplied by a "field length norm" value that represents the length of that field in that doc (so shorter fields are automatically boosted up). The result is encoded as a single byte (with some precision loss of course) and stored in the directory. The similarity object in effect at indexing computes the length-norm of the field.
    313 
    314  
    315 
    316 This composition of 1-byte representation of norms (that is, indexing time multiplication of field boosts & doc boost & field-length-norm) is nicely described in Fieldable.setBoost().
    317 
    318  
    319 
    320 Encoding and decoding of the resulting float norm in a single byte are done by the static methods of the class Similarity: encodeNorm() and decodeNorm(). Due to loss of precision, it is not guaranteed that decode(encode(x)) = x, e.g. decode(encode(0.89)) = 0.75. At scoring (search) time, this norm is brought into the score of document as norm(t, d), as shown by the formula in Similarity.
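The index-time norm described in the paragraphs above can be written out as a formula. This is a sketch of the product as the text states it, before the single-byte encoding; the notation (boost, lengthNorm, numTerms) is chosen here for illustration, and the length-norm expression assumes DefaultSimilarity.

```latex
\mathrm{norm}(t, d) \;=\; \mathrm{boost}(d) \,\cdot\, \mathrm{lengthNorm}(f)
  \,\cdot \prod_{\substack{f' \in d \\ \mathrm{name}(f') = f}} \mathrm{boost}(f'),
\qquad
\mathrm{lengthNorm}(f) \;=\; \frac{1}{\sqrt{\mathrm{numTerms}(f)}}
```

For example, a field with 4 terms, a single field boost of 2 and a document boost of 1.5 gives 1.5 × (1/√4) × 2 = 1.5 before encoding.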
    321 
    322  
    323 
    324 
    325 =    =
    326  
    327 
    328 
    329 = C. Documentation =
    330  
    331 
    332 public class DBIndexPlugin
    333 
    334 the plugin to index and search the database (with or without query expansion):
    335 
    336 @param LUCENE_INDEX_DIRECTORY – empty directory to put index files in
    337 
    338  
    339 
    340 public void buildIndexAllTables(Database db) –makes the index
    341 
    342 public void SearchAllDBTablesIndex(Database db) –searches the index (in “description” field)
    343 
    344 public void !ExpandQuery(Database db) – expands the query by calling expand(!OntologiesForExpansion) from !OntocatQueryExpansion_lucene
    345 
    346  
    347 
    348 public class !OntocatQueryExpansion_lucene
    349 
    350  
    351 
    352 public List<String> parseQuery(String query) – parses the query: ignores punctuation, splits on spaces and Boolean operators, and reads phrases in quotation marks as a single unit. Calls public List<String> chunk (List<String> words)
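A minimal sketch of this kind of parsing, using only the Java standard library. It is not the code of !OntocatQueryExpansion_lucene: the class name is hypothetical, and two choices are assumptions made for the example, namely that Boolean operators are kept as ordinary tokens and that only leading and trailing punctuation is stripped (so hyphenated words survive).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class, for illustration only.
final class QueryParserSketch {

    // Either a double-quoted phrase (group 1) or a run of non-space characters (group 2).
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    static List<String> parseQuery(String query) {
        List<String> units = new ArrayList<>();
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            String unit;
            if (m.group(1) != null) {
                unit = m.group(1).trim();   // quoted phrase stays one unit
            } else {
                // Assumption: strip punctuation only at the word boundaries.
                unit = m.group(2).replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            }
            if (!unit.isEmpty()) {
                units.add(unit);
            }
        }
        return units;
    }

    public static void main(String[] args) {
        // Boolean operators such as AND pass through as plain tokens here.
        System.out.println(parseQuery("asthma, AND \"cystic lung disease\""));
    }
}
```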
    353 
    354  
    355 
    356 public List<String> chunk (List<String> words) – chunks the query (List<String> words) into all possible n-grams (combinations of subsequent query words) (n ranges from 1 to words.size())
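The chunking step can be sketched as below: every run of consecutive words of length 1 up to words.size() becomes one n-gram. The class name is hypothetical and the ordering of the result (shortest n-grams first) is an assumption; the original method may order them differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical class, for illustration only.
final class QueryChunkerSketch {

    /** Returns all contiguous n-grams of words, for n = 1 .. words.size(). */
    static List<String> chunk(List<String> words) {
        List<String> ngrams = new ArrayList<>();
        for (int n = 1; n <= words.size(); n++) {
            for (int start = 0; start + n <= words.size(); start++) {
                // Join the n consecutive words beginning at index start.
                ngrams.add(String.join(" ", words.subList(start, start + n)));
            }
        }
        return ngrams;
    }

    public static void main(String[] args) {
        for (String ngram : chunk(Arrays.asList("cystic", "lung", "disease"))) {
            System.out.println(ngram);
        }
    }
}
```

A query of k words yields k(k+1)/2 n-grams, so the three-word query above produces six candidates to look up in the ontology index.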
    357 
    358  
    359 
    360 public void expand(List<String> ontologiesToUse) – finds expansion terms in ontologiesToUse. Every n-gram of the chunked query is looked up in the ontologies; if it is found, its expansion terms are added to the initial query list
    361 
    362  
    363 
    364 public String output(List<String> parsed) – constructs a new query of the initial query list, adding expansion terms with lower weight, using the same Boolean operators and quotes (if any) as in user query.
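One way such a query string could be assembled is sketched below, using the Lucene query syntax for phrases ("...") and boosts (^). The signature differs from the output(List<String> parsed) method described above; the class name, the parameters and the weight of 0.5 are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical class, for illustration only.
final class ExpandedQuerySketch {

    /**
     * Keeps the user query as it is and appends each expansion term as a
     * quoted phrase joined with OR and boosted by the given (lower) weight.
     */
    static String output(String userQuery, List<String> expansionTerms, float weight) {
        StringBuilder sb = new StringBuilder("(").append(userQuery).append(")");
        for (String term : expansionTerms) {
            // Drop embedded quotes so the phrase syntax stays valid.
            String clean = term.replace("\"", "").trim();
            if (clean.isEmpty()) {
                continue;
            }
            sb.append(" OR \"").append(clean).append("\"^").append(weight);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(output("asthma",
                Arrays.asList("allergic asthma", "status asthmaticus"), 0.5f));
    }
}
```

Because the expansion terms are joined with OR, documents matching only the original query are still returned; the lower boost mainly affects ranking.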
    365 
    366  
    367 
    368 public class !OntoCatIndexPlugin2
    369 
    370 the plugin that indexes and searches the ontologies
    371 
    372 @param LUCENE_ONTOINDEX_DIRECTORY - empty directory to put index files in
    373 
    374 @param ONTOLOGIES_DIRECTORY – the directory, where the ontologies are stored
    375 
    376 @param ontologyNamesMap – the list of ontologies and the correspondence between ontology names and file names containing them
    377 
    378  
    379 
    380 public String !SearchIndexOntocat(String query, List<String> ontologyLabels) – searches the query in the ontologies with names ontologyLabels. Returns a string “!term:expansion term1; expansion term2;… expansion termN;”
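A caller needs to split the returned string back into individual expansion terms. The sketch below assumes the format is exactly as described above (the term, a colon, then expansion terms separated by semicolons); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class, for illustration only.
final class ExpansionResultSketch {

    /** Extracts the expansion terms from "term:expansion1; expansion2; ...". */
    static List<String> expansionTerms(String result) {
        List<String> terms = new ArrayList<>();
        int colon = result.indexOf(':');
        if (colon < 0) {
            return terms;                 // no "term:" prefix, nothing to expand
        }
        for (String part : result.substring(colon + 1).split(";")) {
            String term = part.trim();
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(
                expansionTerms("asthma:allergic asthma; status asthmaticus;"));
    }
}
```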
    381 
    382  
    383 
    384 public void buildIndexOntocat() -  builds the ontology index. Pairs (!term:expansion) are stored for each term of each ontology
    385 
    386  
    387 
    388  
    389 
    390  
    391 
    392  
    393 
    394  
    395 
    396