Changes between Version 1 and Version 2 of LuceneIndexBasedSearchManual
- Timestamp:
- 2011-02-14T15:28:37+01:00 (14 years ago)
LuceneIndexBasedSearchManual
= '''Lucene Index based Search Manual''' =
Provide GCC developers with a search on (1) your database contents (you can pick which fields you want to search) and (2) ontology terms that OntoCAT retrieves for a specific keyword from the EBI Ontology Lookup Service (OLS) and the NCBO BioPortal.

= Biobank Search on Db/Ontocat based on Lucene Indexing featuring Query Expansion using Ontologies =
== How to set up, configure & run ==
== (Run a molgenis project: molgenis4phenotype or your own) ==
 1. Download & install molgenis (http://www.molgenis.org/wiki/MolgenisOnWindows).
 2. Download the latest version of molgenis4phenotype from http://www.molgenis.org/svn/molgenis_projects/molgenis4phenotype
 3. Set the build path (add the Lucene/Ontocat/ols/owlapi/wsdl jars).

Make sure the following jars exist in your WEB-INF/lib directory:

 * lucene-core-3.0.2.jar
 * lucene-demos-3.0.2.jar
 * lucene-highlighter-3.0.1.jar
 * lucene-memory-3.0.2.jar
 * ols-client.jar
 * ontoCAT_v0.9.4-SNAPSHOT.jar
 * opencsv-1.8.jar
 * owlapi-bin.jar
 * owlapi-src.jar
 * wsdl4j-1.6.2.jar
 * xpp3_min-1.1.4c.jar

Next select project properties → Java build path → Libraries → Add Jars → select the jars mentioned above.

'''''__The libraries may also exist in the preconfigured Web App Libraries; if so, the current step (3) can be skipped.__'''''

 4. Select one of the preconfigured sets of files (AnimalDB, lifelines), or create your own. The project is currently adjusted for AnimalDB. The respective configuration includes the following (consider the corresponding files if you want to create your own):

   a. animaldb.molgenis.properties:

 ''db_user = molgenis''

 …

 ''db_password = molgenis''

   b. Create the db: animaldb_pheno or your own, as follows

 ''mysql> create database animaldb_pheno;''

 …

 ''mysql> grant all privileges on animaldb_pheno.* to molgenis@localhost identified by 'molgenis'; flush privileges;''

   c. Plugin files already exist in molgenis4phenotype. For your own project, add the menu in molgenis_ui.xml:

{{{
<!-- Lucene biobank search plugin -->
<menu name="submenu" position="left" label="Indexing...">
  <plugin name="DBIndex" label="DB Index and Search" type="plugins.LuceneIndex.DBIndexPlugin" />
  <plugin name="GenericWizard" type="plugin.genericwizard.GenericWizard" label="Excel upload"/>
  <plugin name="OntoCatIndexPlugin2" label="Index OntoCAT" type="plugins.LuceneIndex.OntoCatIndexPlugin2" />
</menu>
}}}

   d. AnimalDBGenerate.java: run

   e. AnimalDBUpdateDatabase.java: run

   f. Files are now created. Replace the created files with the plugin's files.
 Also add ''__!LuceneIndexConfiguration.properties__'' to the directory where all .properties files live (/molgenis4phenotypeWorkspace/molgenis4phenotype).

   g. Adjust the following in the ''__!LuceneIndexConfiguration.properties__'' configuration file:

The number of DB fields in which the search "makes sense" (mainly description fields). Also fill in the names of the fields.

…

Also select whether you want to use ontologies in query expansion [http://www.molgenis.org/#_ftn1 "[1]"] by setting ''__useOntologiesInQueryExpansion = "true"__''

[http://www.molgenis.org/#_ftnref "[1]"] If you use ontologies for query expansion, you need to follow the instructions in section B.

== Building & searching an index on database contents ==
 a. Fill in data (if the Animal Db project is used). Run on the server and select "System Tasks → Fill in Database" from the menu. You can also load the latest animal db files from the same page.

 b. The Db is now ready to be searched. From the main menu, select "Indexing → DB Index & Search".

The index is created in the folder predefined in the variable LUCENE_INDEX_DIRECTORY. If the directory does not exist, it is created. If the directory has contents, the index is NOT created. __(Be sure to remove the contents before building a new index.)__

After the creation of the index is complete you can search by entering a term (or a sentence) in the search box.
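The directory rule described above (create the folder if it is missing, build nothing if it already has contents) can be sketched in plain Java as follows. This is only an illustration under assumptions: the class name `IndexDirectoryCheck` and the method `prepareIndexDirectory` are hypothetical and are not taken from DBIndexPlugin, and the sketch does not call Lucene at all.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the plugin: it only mirrors the rule
// "create the directory if missing, refuse to index into a non-empty one".
public class IndexDirectoryCheck {

    /**
     * Returns true if an index may be built in dir.
     * A missing directory is created; a non-empty directory is refused.
     */
    public static boolean prepareIndexDirectory(Path dir) throws IOException {
        if (Files.notExists(dir)) {
            Files.createDirectories(dir);
            return true;
        }
        if (!Files.isDirectory(dir)) {
            return false; // the path exists but is a regular file
        }
        // An existing directory is usable only when it has no entries.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            return !entries.iterator().hasNext();
        }
    }

    public static void main(String[] args) throws IOException {
        // "lucene-index" is a placeholder for the value of LUCENE_INDEX_DIRECTORY.
        Path dir = Paths.get(args.length > 0 ? args[0] : "lucene-index");
        if (prepareIndexDirectory(dir)) {
            System.out.println("Directory is empty, the index can be built: " + dir);
        } else {
            System.out.println("Directory has contents, remove them first: " + dir);
        }
    }
}
```

Running the check before pressing the index button makes the "index is NOT created" case visible up front instead of leaving a stale index in place.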
If useOntologiesInQueryExpansion is selected (true), the query is expanded with terms retrieved from the downloaded ontologies that live in LUCENE_INDEX_DIRECTORY.

== Building & searching an index on Ontocat contents ==

…

In order to build an index based on Ontocat contents some adjustments must be made:

 1. Set the VM arguments for !OntoCatIndexPlugin2.java to "-Xms1024M -Xmx1024M"

(Select project → Run As → Run configurations → Arguments → add -Xms1024M -Xmx1024M in VM arguments)

 2. Enter the term you want to retrieve from the online ontology resources and press "Build Ontocat Index".

 3. The index is created in the folder predefined in the variable LUCENE_ONTOINDEX_DIRECTORY. If the directory does not exist, it is created. If the directory has contents, the index is NOT created. __(Be sure to remove the contents before building a new index.)__

 4. After the creation of the index is complete (this may take a while, depending on the response of the ontology resources that Ontocat queries, i.e. the EBI Ontology service), you can search by entering a term (or a sentence) in the search box.

== B. How to run query expansion enabled search (using ontologies) ==
1) Download the ontologies from !http://bioportal.bioontology.org/

You should download:

 * (!http://rest.bioontology.org/bioportal/ontologies/download/44307?applicationid=4ea81d74-8960-4525-810b-fa1baab576ff)
 * Human Disease (!http://rest.bioontology.org/bioportal/ontologies/download/44309?applicationid=4ea81d74-8960-4525-810b-fa1baab576ff)
 * NCI Thesaurus (!http://rest.bioontology.org/bioportal/ontologies/download/42838?applicationid=4ea81d74-8960-4525-810b-fa1baab576ff)

MeSH can be taken from biobank_search\!WebContent\WEB-INF

2) Change the directory names:

 * In DBIndexPlugin: LUCENE_INDEX_DIRECTORY, INDEX_CONFIGURATION
 * In !OntoCatIndexPlugin2: LUCENE_ONTOINDEX_DIRECTORY, ONTOLOGIES_DIRECTORY

== C. Lucene scoring ==

…

= Appendix =

= A. Information Retrieval with Query Expansion =

General ideas:

1) Query expansion adds additional terms related to the initial query terms to the query.
They need not all be contained in a document (it would be difficult to find a document in a database containing every term of ["exercise-induced asthma", "chronic obstructive asthma with acute exacerbation", "exercise-induced asthma (disorder)", "bronchial hypersensitivity", "chronic obstructive asthma", "chronic obstructive asthma with status asthmaticus", "bronchial hyperreactivity", "exercise induced asthma", "cough variant asthma", "intrinsic asthma", "status asthmaticus", "allergic asthma"]), which is why they should be joined with the "OR" operator and can be assigned a lower weight. Thus such query expansion usually changes the document ranking, and consequently the order of retrieved documents in the output, rather than significantly changing the number of documents retrieved.

2) What terms to add? The added terms should be very close to the query term, which is why in information retrieval synonyms and children (terms related to the query term by an IS_A relationship) are usually added as expansion terms. For example, if a user enters a broad query, such as lung disease, query expansion will add documents concerning narrower terms, such as pneumonia (a child node).

3) It is very important to have good ontologies at hand. Otherwise the expansion terms may turn out to be very inaccurate. This is the problem with nonscientific terminology: it is practically impossible to construct an accurate ontology, due to the vagueness of words in natural languages. Synonymy is very approximate here and it is difficult to determine where exactly in the ontology tree a term should be put. Scientific terminology is much better in this respect, because it is much more exact.
Of course there is still some inaccuracy, but query expansion can be efficient.

4) Even if query expansion itself does not improve the search, the query can be made more precise: if some of the terms are found in the ontologies, they are put in quotation marks, thus avoiding wrong results.

For example, if the user does not put quotation marks around the query cystic lung disease, then documents that contain only the word disease will also be retrieved:

(1) New diagnosis of heart disease since last study visit

(2) The score on the Unified Parkinson's Disease Rating Scale

During query expansion cystic lung disease will be found in the ontologies and the query will become: "cystic lung disease" OR (…expansion terms…). This query will not find those two irrelevant documents, because of the quotation marks.

Ontologies

What ontologies to use?

The user should decide this according to the query, choosing from a list of ontologies. I would propose choosing among the following ontologies:

 * Human Phenotype Ontology
 * Human disease
 * NCI Thesaurus
 * Medical Subject Headings
 * International Classification of Diseases (!http://bioportal.bioontology.org/visualize/35686)
   - Synonyms are graphical variants used in special cases
 * Online Mendelian Inheritance in Man (!http://bioportal.bioontology.org/visualize/40398)
   (The relation "manifestation of" may be useful, though too broad)
   - Practically no synonyms

How to search the ontologies?

The search is performed by OntoCAT.

In this project I tried different ways of accessing the ontologies:

(1) Directly on !BioPortal

(2) Downloading the ontologies to the local computer

(3) Indexing them and searching in the index files

The third variant turned out to be significantly faster, so it is used in the project.

What is done?

The user can first index the database and the ontologies and then search for relevant database entries.

The user enters a query in the textbox and can choose whether to expand the query or not. If "Search with query expansion" is chosen, the query is expanded with synonyms and children from the indexed ontologies (Human disease, Human Phenotype ontology? and NCI Thesaurus). Then Lucene performs the search in the indexed database.

= B. Lucene Indexing: scoring =

=== Fields and Documents ===

In Lucene, the objects we are scoring are Documents. A Document is a collection of Fields. Each Field has semantics about how it is created and stored (i.e. tokenized, untokenized, raw data, compressed, etc.). It is important to note that Lucene scoring works on Fields and then combines the results to return Documents. This is important because two Documents with the exact same content, but one having the content in two Fields and the other in one Field, will return different scores for the same query due to length normalization (assuming the !DefaultSimilarity on the Fields).

=== Score Boosting ===

Lucene allows influencing search results by "boosting" at more than one level:

 * Document level boosting, while indexing, by calling document.setBoost() before a document is added to the index.
 * Document's Field level boosting, while indexing, by calling field.setBoost() before adding a field to the document (and before adding the document to the index).
 * Query level boosting, during search, by setting a boost on a query clause, calling Query.setBoost().

Indexing time boosts are preprocessed for storage efficiency and written to the directory (when writing the document) in a single byte (!) as follows: For each field of a document, all boosts of that field (i.e. all boosts under the same field name in that doc) are multiplied. The result is multiplied by the boost of the document, and also multiplied by a "field length norm" value that represents the length of that field in that doc (so shorter fields are automatically boosted up). The result is encoded as a single byte (with some precision loss of course) and stored in the directory. The similarity object in effect at indexing computes the length-norm of the field.

This composition of the 1-byte representation of norms (that is, indexing time multiplication of field boosts & doc boost & field-length-norm) is nicely described in Fieldable.setBoost().

Encoding and decoding of the resulting float norm in a single byte are done by the static methods of the class Similarity: encodeNorm() and decodeNorm(). Due to loss of precision, it is not guaranteed that decode(encode(x)) = x, e.g. decode(encode(0.89)) = 0.75. At scoring (search) time, this norm is brought into the score of the document as norm(t, d), as shown by the formula in Similarity.

= C. Documentation =

public class DBIndexPlugin

The plugin that indexes and searches the database (with or without query expansion):

@param LUCENE_INDEX_DIRECTORY: empty directory to put the index files in

public void buildIndexAllTables(Database db): builds the index

public void SearchAllDBTablesIndex(Database db): searches the index (in the "description" field)

public void !ExpandQuery(Database db): expands the query by calling expand(!OntologiesForExpansion) from !OntocatQueryExpansion_lucene

public class !OntocatQueryExpansion_lucene

public List<String> parseQuery(String query): parses the query by ignoring punctuation, splitting the query on ' ' and Boolean operators, and reading phrases in quotation marks as a single unit.
Calls public List<String> chunk(List<String> words).

public List<String> chunk(List<String> words): chunks the query (List<String> words) into all possible n-grams (combinations of subsequent query words), where n ranges from 1 to words.size().

public void expand(List<String> ontologiesToUse): finds expansion terms in ontologiesToUse. Every n-gram of the chunked query is searched in the ontologies; if it is found, its expansion terms are added to the initial query list.

public String output(List<String> parsed): constructs a new query from the initial query list, adding the expansion terms with a lower weight and using the same Boolean operators and quotes (if any) as in the user query.

public class !OntoCatIndexPlugin2

The plugin that indexes and searches the ontologies.

@param LUCENE_ONTOINDEX_DIRECTORY: empty directory to put the index files in

@param ONTOLOGIES_DIRECTORY: the directory where the ontologies are stored

@param ontologyNamesMap: the list of ontologies and the correspondence between ontology names and the names of the files containing them

public String !SearchIndexOntocat(String query, List<String> ontologyLabels): searches the query in the ontologies with the names ontologyLabels. Returns a string "!term:expansion term1; expansion term2;… expansion termN;"

public void buildIndexOntocat(): builds the ontology index. Pairs (!term:expansion) are stored for each term of each ontology.
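The chunking step and the construction of the expanded query documented above can be illustrated with a short sketch. It is not the plugin's code: the class name `QueryExpansionSketch`, the method `buildQuery`, and the 0.5 boost value are assumptions made for the example. Only the n-gram behaviour of `chunk` follows the description given here, and the `^` boost and `OR` operator are standard Lucene query parser syntax.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names other than chunk() are hypothetical.
public class QueryExpansionSketch {

    /** Returns all contiguous n-grams of the query words, for n = 1 .. words.size(). */
    public static List<String> chunk(List<String> words) {
        List<String> ngrams = new ArrayList<>();
        for (int n = 1; n <= words.size(); n++) {
            for (int start = 0; start + n <= words.size(); start++) {
                ngrams.add(String.join(" ", words.subList(start, start + n)));
            }
        }
        return ngrams;
    }

    /**
     * Builds a Lucene query string: the matched phrase in quotation marks,
     * followed by each expansion term joined with OR at a lower boost.
     */
    public static String buildQuery(String phrase, List<String> expansionTerms, float boost) {
        StringBuilder query = new StringBuilder("\"").append(phrase).append("\"");
        for (String term : expansionTerms) {
            query.append(" OR \"").append(term).append("\"^").append(boost);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // Six n-grams: three single words, two word pairs, one full phrase.
        System.out.println(chunk(Arrays.asList("cystic", "lung", "disease")));
        // The expansion term here is made up for the example, not ontology output.
        System.out.println(buildQuery("lung disease", Arrays.asList("pneumonia"), 0.5f));
    }
}
```

Quoting the matched phrase is what keeps documents that merely contain "disease" out of the results, while the lower boost lets expansion terms reorder the ranking without dominating it.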