wiki:DespoinaLog/2010/05/03

Build Lucene index (in a Java class)

Command line: export a DB table to CSV and call Lucene from the command line

Call the class from a plugin in MOLGENIS

  • either call Lucene through a command-line call,
  • or use the Lucene Java API: http://lucene.apache.org/java/2_0_0/api/overview-summary.html
  • /hvp_pilot/handwritten/java/plugins/callLucene3.java:
    • In the main method we create an instance of the IndexBuilder object and call its buildIndex() method. In buildIndex() we create a File object which represents our index directory. Then we create a StandardAnalyzer object. StandardAnalyzer extends the Analyzer class. It splits the text into words, converts them to lower case and removes some common words (stop words) such as "is" and "this". You may use any other analyzer or create your own.
    • The IndexWriter object is instantiated by passing four parameters to the constructor. The first parameter is a Directory object. Since we are using a file-based index, we use an FSDirectory object, which extends the abstract class Directory. The second parameter is the analyzer object which we already created. The third parameter specifies whether we should create a new index or update an existing one. In our case, we are creating the index. The fourth parameter specifies the maximum number of tokens which should be indexed per field. In our case, we are using the default, which is 10,000. If a field value is of type text and consists of more than 10,000 words excluding stop words, the rest will be ignored. You can also set the number of tokens to unlimited.
    • While iterating over the rows, we create a Document object for each one. Field objects are created for all columns by providing a column name/value pair. We create another field called fulltext by combining the values of the name, age and designation fields. This is the default field in which we search. We pass four parameters to the Field constructor. The first parameter is the name of the field. We use the column name as the field name. The second parameter is the field value, which is the column value. The third parameter specifies whether we should store the field value in the index. If we store the field value, we can retrieve it later in its original form from the index rather than querying the database again. This is OK for small fields but not recommended for large text fields.
    • The fourth parameter is about indexing. We are not indexing the employeeid field. This means the field is not searchable. Since the id field is a unique integer value representing an employee record, there is no point in searching it. The name and designation fields are indexed and tokenized. This means the field value is split into tokens and indexed. The ANALYZED constant represents this. The age field is indexed but not tokenized. Since it is an integer value, there is no point in splitting it. The fulltext field is not stored but is indexed and tokenized. Then we add all fields to the Document object, and the Document object is added to the IndexWriter object.
    • Finally, we optimize the index and close all resources, such as the IndexWriter and Connection objects.
    • OK, now our index is ready.
    • Next step: create a new plugin to test it and call the SearchIndex class, or use an existing one.
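
To make the analyzer behaviour described above concrete, here is a minimal sketch in plain Java. It is not Lucene's StandardAnalyzer, only a toy stand-in: the class name ToyAnalyzer, the stop-word list and the maxTokens parameter are assumptions made for illustration. It splits text into words, lower-cases them, drops stop words, and ignores everything after a token limit (the role the 10,000-token default plays in the notes).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy stand-in for an analyzer; NOT Lucene's StandardAnalyzer. */
class ToyAnalyzer {
    // Illustrative stop-word list (an assumption; Lucene ships its own list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "this"));

    /** Split into words, lower-case, drop stop words, keep at most maxTokens tokens. */
    static List<String> analyze(String text, int maxTokens) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (tokens.size() >= maxTokens) {
                break; // tokens beyond the limit are ignored
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // skip empty fragments and stop words
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("This is the Senior Developer", 10000)); // [senior, developer]
        System.out.println(analyze("This is the Senior Developer", 1));     // [senior]
    }
}
```

With a limit of 1 only the first non-stop-word survives, which mirrors how a long text field would be cut off once the token limit is reached.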

DONE

  1. Get a reference to the index directory (a File object).
  2. Initialize the driver class and connect to the DB (the hvp pilot table with data).
  3. Create a Statement object.
  4. Retrieve the fields geneName, chromosomeLocation and geneDescription from the DB, create Field objects and add them to the Document.
  5. Add the Document to the writer.
  6. Optimize the index.
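
The steps above can be sketched without Lucene or a database as follows. This is a toy in-memory index, not the code in callLucene3.java and not Lucene's IndexWriter: the class ToyGeneIndex, its methods, and the choice of which fields are stored or tokenized are assumptions for illustration, and the sample rows stand in for the JDBC ResultSet. The toy has no equivalent of the optimize step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy in-memory index for illustration; NOT Lucene's IndexWriter. */
class ToyGeneIndex {
    // "field:term" -> ids of the documents containing that term in that field
    final Map<String, TreeSet<Integer>> postings = new TreeMap<>();
    // per document: field values kept in their original form ("stored" fields)
    final List<Map<String, String>> stored = new ArrayList<>();

    /** Start a new document and return its id. */
    int newDocument() {
        stored.add(new LinkedHashMap<>());
        return stored.size() - 1;
    }

    /** Add one field; store and tokenize mirror the store/index options in the notes. */
    void addField(int docId, String name, String value, boolean store, boolean tokenize) {
        if (store) {
            stored.get(docId).put(name, value);
        }
        String[] terms = tokenize
                ? value.toLowerCase(Locale.ROOT).split("\\W+")
                : new String[] {value}; // untokenized: the whole value is one term
        for (String term : terms) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(name + ":" + term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Build the index from rows of {geneName, chromosomeLocation, geneDescription}. */
    static ToyGeneIndex build(String[][] rows) {
        ToyGeneIndex index = new ToyGeneIndex();
        for (String[] row : rows) {
            int doc = index.newDocument();
            index.addField(doc, "geneName", row[0], true, true);
            index.addField(doc, "chromosomeLocation", row[1], true, false);
            index.addField(doc, "geneDescription", row[2], true, true);
        }
        return index;
    }

    public static void main(String[] args) {
        // Illustrative sample rows standing in for the rows read from the DB.
        String[][] rows = {
            {"BRCA1", "17q21.31", "DNA repair associated"},
            {"TP53", "17p13.1", "tumor protein p53"},
        };
        ToyGeneIndex index = build(rows);
        System.out.println(index.postings.get("geneDescription:repair"));     // [0]
        System.out.println(index.postings.get("chromosomeLocation:17p13.1")); // [1]
        System.out.println(index.stored.get(1).get("geneName"));              // TP53
    }
}
```

Keeping chromosomeLocation untokenized means a value such as 17p13.1 is only found by an exact match, which is the same trade-off the notes describe for the age field.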

Next TODO:

Create a new plugin in hvp_pilot to search the index created by callLucene3. The plugin should contain an input where the user enters the search text.

  • LuceneSearchPlugin (new plugin to search the index). The idea here is to add an HTML input (from the user) to search the Lucene index created from the database.
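
As a rough idea of what the search side has to do with the user's text, here is a small plain-Java sketch. It is neither Lucene's search API nor a MOLGENIS plugin: ToySearch, the hand-built postings map and the AND semantics (every term must match) are assumptions for illustration. The point is that the user input is normalised the same way as the indexed text before lookup.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical search helper; NOT Lucene's search API or a MOLGENIS plugin class. */
class ToySearch {
    /** Return ids of the documents that contain every term of the user's text. */
    static Set<Integer> search(Map<String, Set<Integer>> postings, String userInput) {
        Set<Integer> hits = null;
        for (String term : userInput.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (hits == null) {
                hits = new TreeSet<>(docs); // first term: start from its documents
            } else {
                hits.retainAll(docs);       // later terms: keep only common documents
            }
        }
        return hits == null ? new TreeSet<>() : hits;
    }

    public static void main(String[] args) {
        // Hand-built postings (term -> document ids) standing in for the real index.
        Map<String, Set<Integer>> postings = new HashMap<>();
        postings.put("tumor", new TreeSet<>(Arrays.asList(1, 2)));
        postings.put("protein", new TreeSet<>(Arrays.asList(1)));
        System.out.println(search(postings, "Tumor protein")); // [1]
        System.out.println(search(postings, "tumor kinase"));  // []
    }
}
```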

Other:

Eclipse --> New Install: add "http://download.jboss.org/jbosstools/updates/stable/galileo/" and continue

Last modified on 2010-10-01T23:19:13+02:00