Changes between Version 2 and Version 3 of LuceneIndexBasedSearchManual


Timestamp:
2011-02-14T15:29:11+01:00 (13 years ago)
Author:
antonak
Comment:

--

  • LuceneIndexBasedSearchManual

    v2 v3  
    2727Next select project properties → Java Build Path → Libraries → Add Jars → select the jars mentioned above.
    2828
    29   ''''' __ The libraries may also exist in the preconfigured Web App Libraries , if so , the current step (3) can be skipped .   __ '''''
     29  '''''  __  The libraries may also exist in the preconfigured Web App Libraries , if so , the current step (3) can be skipped .   __ '''''
    3030
    3131 4. Select two of preconfigured existing sets of files, AnimalDB , lifelines, or create your own. Currently adjusted for animalDB. Respective configuration includes (consider the corresponding files if you want to create your own) :
     
    5454<plugin name="DBIndex" label="DB Index and Search" type="plugins.[wiki:LuceneIndex].DBIndexPlugin" />
    5555
    56   <plugin name="GenericWizard " type="plugin.genericwizard.[wiki:GenericWizard] " label="Excel upload"/> <plugin name="OntoCatIndexPlugin 2" label="Index OntoCAT"  type="plugins.[wiki:LuceneIndex] .[wiki:OntoCatIndexPlugin] 2" />
     56  <plugin name="GenericWizard  " type="plugin.genericwizard.[wiki:GenericWizard]  " label="Excel upload"/> <plugin name="OntoCatIndexPlugin  2" label="Index OntoCAT"  type="plugins.[wiki:LuceneIndex]  .[wiki:OntoCatIndexPlugin] 2" />
    5757
    5858</menu>'' }}}''
     
    117117 * In !OntoCatIndexPlugin2: LUCENE_ONTOINDEX_DIRECTORY, ONTOLOGIES_DIRECTORY
    118118 * [[BR]]
     119
    119120== C. Lucene scoring ==
    120  
    121 
    122 Lucene scoring uses a combination of the [http://en.wikipedia.org/wiki/Vector_Space_Model Vector Space Model (VSM)
    123 of Information Retrieval] and the [http://en.wikipedia.org/wiki/Standard_Boolean_model Boolean model] to determine how relevant a given Document is to a User's query.
    124 
    125  In general, the idea behind the VSM is the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. It uses the Boolean model to first narrow down the documents that need to be scored based on the use of boolean logic in the Query specification.
    126 
    127 Lucene also adds some capabilities and refinements onto this model to support boolean and fuzzy searching, but it essentially remains a VSM based system at the heart. For some valuable references on VSM and IR in general refer to the [http://wiki.apache.org/lucene-java/InformationRetrieval Lucene Wiki IR
    128 references].
    129 
    130  
     121Lucene scoring uses a combination of the [http://en.wikipedia.org/wiki/Vector_Space_Model Vector Space Model (VSM) of Information Retrieval] and the [http://en.wikipedia.org/wiki/Standard_Boolean_model Boolean model] to determine how relevant a given Document is to a User's query.
     122
     123  In general, the idea behind the VSM is the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. It uses the Boolean model to first narrow down the documents that need to be scored based on the use of boolean logic in the Query specification.
     124
     125Lucene also adds some capabilities and refinements onto this model to support boolean and fuzzy searching, but it essentially remains a VSM based system at the heart. For some valuable references on VSM and IR in general refer to the [http://wiki.apache.org/lucene-java/InformationRetrieval Lucene Wiki IR references].
    131126
    132127(see more in Appendix B)
    133128
    134 The score for a document given a query is the cosine of the angle formed between the query vector and the document vector. The explain() method can be used to show exactly how the score is calculated for a given query and a given document, so the result of explain() (explanation.toString()) is presented to the user.
    135 
    136  
    137 
    138  
    139 
    140  
     129The score for a document given a query is the cosine of the angle formed between the query vector and the document vector. The explain() method can be used to show exactly how the score is calculated for a given query and a given document, so the result of explain() (explanation.toString()) is presented to the user.
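
As a rough illustration of the cosine measure mentioned above, the following sketch compares plain term-frequency vectors. It is not Lucene's actual scoring formula (which also applies idf, boosts and norms); the class and method names are hypothetical and chosen only for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: raw term-frequency vectors, not Lucene's real scoring.
final class CosineSketch {
    // Counts how often each whitespace-separated, lower-cased token occurs.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine of the angle between two sparse term vectors; 0.0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    // Euclidean length of a term vector.
    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int x : v.values()) {
            sum += (double) x * x;
        }
        return Math.sqrt(sum);
    }
}
```

With this sketch, the query "lung disease" has cosine 0.5 with the text "heart disease" (one shared term out of two) and 0.0 with a text that shares no term.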
     130
     131
     132= Appendix =
     133
     134= A. Information Retrieval with Query Expansion =
     135 
     136
     137General ideas:
     138
     1391) Query expansion adds to the query additional terms related to the initial query terms. These terms need not all occur in a document: it would be difficult to find a document in a database containing every term of ["exercise-induced asthma", "chronic obstructive asthma with acute exacerbation", "exercise-induced asthma (disorder)", "bronchial hypersensitivity", "chronic obstructive asthma", "chronic obstructive asthma with status asthmaticus", "bronchial hyperreactivity", "exercise induced asthma", "cough variant asthma", "intrinsic asthma", "status asthmaticus", "allergic asthma"]. That is why they should be appended with the “OR” operator and can be assigned a lower weight. Such query expansion therefore usually changes the document ranking, and consequently the order of retrieved documents in the output, rather than significantly changing the number of documents retrieved.
     140
     141 
     142
     1432) What terms to add? The added terms should be very close in meaning to the query term, which is why synonyms and children (terms related to the query term by an IS_A relationship) are usually added as expansion terms in information retrieval. For example, if a user enters a broad query such as lung disease, query expansion will add documents concerning narrower terms such as pneumonia (a child node).
     144
     145 
     146
     1473) It is very important to have good ontologies at hand. Otherwise the expansion terms may turn out to be very inaccurate. This is the problem with nonscientific terminology: it’s practically impossible to construct an accurate ontology, due to the vagueness of words of natural languages. Synonymy is very approximate here and it’s difficult to determine where exactly in the ontology tree the term is to be put. Scientific terminology is much better in this respect, because it is much more exact. Of course there is still some inaccuracy, but query expansion can be efficient.
     148
     149 
     150
     1514) Even if query expansion itself doesn’t improve the search, the query can be made more precise: if some of the terms are found in the ontologies, they are put in quotation marks, thus avoiding wrong results.
     152
     153For example, if the user does not put quotation marks around the query cystic lung disease, then documents that merely contain the word disease will be retrieved:
     154
     155 
     156
     157(1) New diagnosis of heart disease since last study visit
     158
     159(2) The score on the Unified Parkinson's Disease Rating Scale
     160
     161 
     162
     163During query expansion cystic lung disease will be found in ontologies and the query will become: asthma “cystic lung disease” OR (…expansion terms…).  This query won’t find those two irrelevant documents, because of the quotation marks.
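
The rewriting described above can be sketched as follows: the phrase found in the ontology is quoted, and each expansion term is appended as an optional OR clause with a lower boost, using the "^" boost operator of the Lucene query syntax. The class name, the boost value and the expansion term in the usage note are assumptions for illustration, not taken from the project code.

```java
import java.util.List;

// Hypothetical helper: builds an expanded query string in Lucene query syntax.
final class ExpandedQuerySketch {
    // Quotes the ontology phrase and ORs in each expansion term with a lower boost.
    static String build(String phrase, List<String> expansionTerms, float expansionBoost) {
        StringBuilder sb = new StringBuilder();
        sb.append('"').append(phrase).append('"');
        for (String term : expansionTerms) {
            sb.append(" OR \"").append(term).append("\"^").append(expansionBoost);
        }
        return sb.toString();
    }
}
```

For instance, calling build with the phrase cystic lung disease, a single made-up expansion term pulmonary cyst and a boost of 0.5f yields "cystic lung disease" OR "pulmonary cyst"^0.5.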
     164
     165 
     166
     167Ontologies
     168
     169What ontologies to use?
     170
     171It should be decided by the user in accordance with his query. He should be given a list of ontologies to choose from:
     172
     173I would propose to choose among the following ontologies:
     174
     175 
     176
     177•            Human Phenotype Ontology
     178
     179 
     180
     181•            Human disease
     182
     183 
     184
     185•            NCI Thesaurus
     186
     187 
     188
     189•            Medical Subject Headings
     190
     191 
     192
     193•            International Classification of Diseases (!http://bioportal.bioontology.org/visualize/35686)
     194
     195-Synonyms are graphical variants used in special cases
     196
     197 
     198
     199•            Online Mendelian Inheritance in Man (!http://bioportal.bioontology.org/visualize/40398)
     200
     201(The relation "manifestation of" may be useful, though too broad)
     202
     203- Practically no synonyms
     204
     205 
     206
     207How to search the ontologies?
     208
     209The search is performed by OntoCAT.
     210
     211In this project I tried different ways of accessing the ontologies:
     212
     213(1)            Directly on !BioPortal
     214
     215(2)            Downloading the ontologies on local computer
     216
     217(3)            Indexing them and searching in the index files
     218
     219The third variant turned out to be significantly faster, so it is used in the project.
     220
     221 
     222
     223What is done?
     224
     225The User first can index his database and ontologies and then search for relevant database entries.
     226
     227The User enters his query in the textbox and can choose whether to expand the query or not. If “Search with query expansion” is chosen, the query is expanded with synonyms and children from the indexed ontologies (Human disease, Human Phenotype ontology? and NCI Thesaurus). Then Lucene performs the search in the indexed database.
     228
     229 
     230
     231
     232= B. Lucene Indexing: scoring =
     233
     234=== Fields and Documents ===
     235 
     236
     237In Lucene, the objects we are scoring are Documents. A Document is a collection of Fields. Each Field has semantics about how it is created and stored (i.e. tokenized, untokenized, raw data, compressed, etc.). It is important to note that Lucene scoring works on Fields and then combines the results to return Documents. This is important because two Documents with the exact same content, but one having the content in two Fields and the other in one Field, will return different scores for the same query due to length normalization (assuming the !DefaultSimilarity on the Fields).
     238
     239 
     240
     241
     242=== Score Boosting ===
     243 
     244
     245Lucene allows influencing search results by "boosting" in more than one level:
     246
     247 
     248
     249·      Document level boosting - while indexing - by calling document.setBoost() before a document is added to the index.
     250
     251·      Document's Field level boosting - while indexing - by calling field.setBoost() before adding a field to the document (and before adding the document to the index).
     252
     253·      Query level boosting - during search, by setting a boost on a query clause, calling Query.setBoost().
     254
     255 
     256
     257Indexing time boosts are preprocessed for storage efficiency and written to the directory (when writing the document) in a single byte (!) as follows: For each field of a document, all boosts of that field (i.e. all boosts under the same field name in that doc) are multiplied. The result is multiplied by the boost of the document, and also multiplied by a "field length norm" value that represents the length of that field in that doc (so shorter fields are automatically boosted up). The result is encoded as a single byte (with some precision loss of course) and stored in the directory. The similarity object in effect at indexing computes the length-norm of the field.
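
The multiplication described above can be sketched in plain Java as follows. The length norm here assumes the !DefaultSimilarity behaviour of 1/sqrt(number of terms in the field); the single-byte encoding step is deliberately left out, and the class and method names are hypothetical.

```java
// Illustrative sketch of the index-time norm before it is encoded into one byte.
final class FieldNormSketch {
    // Length norm, assuming DefaultSimilarity: shorter fields get a larger value.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Product of the document boost, all boosts of the field, and the length norm.
    static float norm(float docBoost, float[] fieldBoosts, int numTerms) {
        float product = docBoost;
        for (float boost : fieldBoosts) {
            product *= boost;
        }
        return product * lengthNorm(numTerms);
    }
}
```

Under these assumptions, a field of 4 terms with a field boost of 2 and a document boost of 1 gets a norm of 1.0, while an unboosted field of 16 terms gets 0.25, which shows how shorter fields are boosted up.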
     258
     259 
     260
     261This composition of 1-byte representation of norms (that is, indexing time multiplication of field boosts & doc boost & field-length-norm) is nicely described in Fieldable.setBoost().
     262
     263 
     264
     265Encoding and decoding of the resulting float norm in a single byte are done by the static methods of the class Similarity: encodeNorm() and decodeNorm(). Due to loss of precision, it is not guaranteed that decode(encode(x)) = x, e.g. decode(encode(0.89)) = 0.75. At scoring (search) time, this norm is brought into the score of the document as norm(t, d), as shown by the formula in Similarity.
     266
     267 
     268
     269
     271 
     272
     273
     274= C. Documentation =
     275 
     276
     277public class DBIndexPlugin
     278
     279the plugin to index and search the database (with or without query expansion):
     280
     281@param LUCENE_INDEX_DIRECTORY – empty directory to put index files in
     282
     283 
     284
     285public void buildIndexAllTables(Database db) – makes the index
     286
     287public void SearchAllDBTablesIndex(Database db) – searches the index (in the “description” field)
     288
     289public void !ExpandQuery(Database db) – expands the query by calling expand(!OntologiesForExpansion) from !OntocatQueryExpansion_lucene
     290
     291 
     292
     293public class !OntocatQueryExpansion_lucene
     294
     295 
     296
     297public List<String> parseQuery(String query) –parses the query by ignoring the punctuation, splitting the query by ‘ ‘, Boolean operators, reading phrases in quotation marks as a single unit. Calls public List<String> chunk (List<String> words)
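
A minimal sketch of such a parsing step is given below. It is not the plugin's actual implementation: it assumes that Boolean operators are simply kept as ordinary tokens and that only double quotes delimit phrases. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative tokenizer: quoted phrases stay whole, punctuation is dropped from plain words.
final class QueryParseSketch {
    // Matches either a double-quoted phrase or a run of non-space characters.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    static List<String> parse(String query) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            String token;
            if (m.group(1) != null) {
                token = m.group(1).trim();                        // quoted phrase kept as one unit
            } else {
                token = m.group(2).replaceAll("\\p{Punct}", "");  // strip punctuation from a word
            }
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

For the input asthma, "cystic lung disease" OR cough this sketch returns the four tokens asthma, cystic lung disease, OR and cough.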
     298
     299 
     300
     301public List<String> chunk (List<String> words) – chunks the query (List<String> words) into all possible n-grams (combinations of subsequent query words) (n ranges from 1 to words.size())
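
The chunking into n-grams can be sketched as below: every contiguous run of query words, from single words up to the whole query, is produced. This is an illustration under the description above, not the plugin's code, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative n-gram chunker for a list of query words.
final class NGramSketch {
    // Returns every contiguous word sequence of length 1..words.size(), joined by single spaces.
    static List<String> chunk(List<String> words) {
        List<String> ngrams = new ArrayList<>();
        for (int start = 0; start < words.size(); start++) {
            StringBuilder sb = new StringBuilder();
            for (int end = start; end < words.size(); end++) {
                if (end > start) {
                    sb.append(' ');
                }
                sb.append(words.get(end));
                ngrams.add(sb.toString());   // n-gram covering words[start..end]
            }
        }
        return ngrams;
    }
}
```

For the words cystic, lung, disease this gives six n-grams: cystic, cystic lung, cystic lung disease, lung, lung disease, disease.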
     302
     303 
     304
     305public void expand(List<String> ontologiesToUse) – finds expansion terms in ontologiesToUse. Every n-gram of the chunked query is searched for in the ontologies; if it is found, its expansion terms are added to the initial query list
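
The lookup step can be sketched as follows, with the indexed ontologies replaced by a plain in-memory map from a term to its expansion terms. That map, the lower-casing of keys and the class name are assumptions made for this example only; the real plugin searches a Lucene index.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative expansion lookup against a hypothetical in-memory ontology map.
final class ExpandSketch {
    // Collects the expansion terms of every n-gram that is present in the map.
    static List<String> expand(List<String> ngrams, Map<String, List<String>> ontology) {
        Set<String> expansions = new LinkedHashSet<>();   // keeps insertion order, drops duplicates
        for (String ngram : ngrams) {
            List<String> found = ontology.get(ngram.toLowerCase());
            if (found != null) {
                expansions.addAll(found);
            }
        }
        return new ArrayList<>(expansions);
    }
}
```

With a map that lists pneumonia as a child of lung disease, the n-grams cystic, lung disease and disease would yield the single expansion term pneumonia.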
     306
     307 
     308
     309public String output(List<String> parsed) – constructs a new query of the initial query list, adding expansion terms with lower weight, using the same Boolean operators and quotes (if any) as in user query.
     310
     311 
     312
     313public class !OntoCatIndexPlugin2
     314
     315the plugin that indexes and searches the ontologies
     316
     317@param LUCENE_ONTOINDEX_DIRECTORY - empty directory to put index files in
     318
     319@param ONTOLOGIES_DIRECTORY – the directory, where the ontologies are stored
     320
     321@param ontologyNamesMap – the list of ontologies and the correspondence between ontology names and file names containing them
     322
     323 
     324
     325public String !SearchIndexOntocat(String query, List<String> ontologyLabels) – searches the query in the ontologies with names ontologyLabels. Returns a string “!term:expansion term1; expansion term2;… expansion termN;”
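
A caller could split the returned string back into individual expansion terms roughly as follows. This sketch assumes the format is exactly one term, a colon, and then expansion terms each terminated by a semicolon; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative parser for a result string of the form "term:expansion1; expansion2;".
final class ExpansionResultSketch {
    // Returns the expansion terms after the first colon, trimmed, without empty entries.
    static List<String> expansions(String result) {
        List<String> terms = new ArrayList<>();
        int colon = result.indexOf(':');
        if (colon < 0) {
            return terms;                      // no colon: nothing to expand
        }
        for (String part : result.substring(colon + 1).split(";")) {
            String term = part.trim();
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }
}
```

For the string asthma:bronchial hyperreactivity; allergic asthma; this returns the two terms bronchial hyperreactivity and allergic asthma.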
     326
     327 
     328
     329public void buildIndexOntocat() -  builds the ontology index. Pairs (!term:expansion) are stored for each term of each ontology