wiki:DespoinaLog/2010/04/29

Next steps, notes and links on OpenData, nano-publications, and building a Lucene index on the csv file produced by the Molgenis csv export.

Next steps:

  1. Build an index (using Lucene) in Molgenis on a single database table (the Gene table from SyndromeBook).
    • We have two options:
    1. Use command-line Lucene. This needs files as input, so we can export csv (Molgenis csv export, Joeri). The problem is that, for the search to be efficient, we will have to export every single column or row to its own file (see how that can be done): we need a separate file for each valid piece of information in the table. So the options are one file per column or one file per row. Explore that.
      1. There is an experimental database (Joeri) that we could also try. The main idea is to generalize this, later build an index on more db tables, and potentially build a generator. The later goal is to place a Google-style search over patient data on top of Molgenis.
      2. There are several sets in Molgenis (database, model, system), and we can use this Lucene-based indexing engine to create an index that the search box uses to look in these DBs. If we generalize that to ALL DBs in Molgenis, we have search inside patient data.
    2. Use a Lucene Java call inside a Molgenis plugin for the table in (1).
  2. Search not through ontocat but through Lucene in specific DBs. A starting point is SyndromeBook's DB table: gene.
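
The per-row option from step 1 can be sketched as follows. The Lucene demo IndexFiles turns each file it finds into one document, so writing one small text file per table row lets a search hit point back to a single record. This is a minimal sketch in Python, assuming the csv export starts with a header line; the function name `split_csv_per_row`, the file name `gene.csv`, the "column: value" layout and the sample rows are illustrative assumptions, not part of Molgenis or Lucene.

```python
import csv
import os
import tempfile


def split_csv_per_row(csv_path, out_dir):
    """Write one text file per data row of csv_path into out_dir.

    Each file holds "column: value" lines, so a file-based indexer that
    treats every file as one document sees one document per row.
    Returns the list of files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)  # first line is taken as the header
        for number, row in enumerate(reader, start=1):
            target = os.path.join(out_dir, "row_%05d.txt" % number)
            with open(target, "w", encoding="utf-8") as out:
                for column, value in row.items():
                    out.write("%s: %s\n" % (column, value))
            written.append(target)
    return written


# Small demo with made-up data (hypothetical, not real SyndromeBook content).
demo_dir = tempfile.mkdtemp()
demo_csv = os.path.join(demo_dir, "gene.csv")
with open(demo_csv, "w", newline="", encoding="utf-8") as handle:
    handle.write("name,description\n"
                 "GENE_A,example glycoprotein\n"
                 "GENE_B,example kinase\n")

row_files = split_csv_per_row(demo_csv, os.path.join(demo_dir, "_syndrome_book_data_"))
print(len(row_files), "files written")
```

A per-column split would work the same way with the loops swapped; per-row seems the more natural unit if a search result should identify one gene record.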

http://www.ebi.ac.uk/ebisearch/search.ebi?db=allebi&requestFrom=searchBox&query=brca&FormsButton3=Go

Other notes:

  • Peregrine is running in Concept wiki, so data production in triples is feasible. --> nano-publications (some steps are missing, but we get the idea) --> rdf
  • Hypothesis data / real data / evidence --> experiments --> STATEMENTS --> triples --> semantic web
    • Molgenis producing triples?? (experimental DB, Joeri) Future plans.
  • About the servlet version of search on top of ontocat: if you use a servlet, you are not REST (architecture).
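
The "STATEMENTS --> triples" step in the notes above can be pictured with a small sketch that turns one table row into N-Triples lines, one statement per non-empty cell. This is only an illustration in Python: the base URI `http://example.org/molgenis/`, the function names and the sample row are hypothetical, and nothing here reflects how Molgenis or the experimental database would actually produce RDF.

```python
from urllib.parse import quote

BASE = "http://example.org/molgenis/"  # hypothetical namespace, for illustration only


def escape_literal(text):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def row_to_triples(table, row_id, row):
    """Return one N-Triples line per non-empty column of a row."""
    subject = "<%s%s/%s>" % (BASE, quote(table), quote(str(row_id)))
    lines = []
    for column, value in row.items():
        if value is None or value == "":
            continue  # an empty cell makes no statement
        predicate = "<%s%s#%s>" % (BASE, quote(table), quote(column))
        lines.append('%s %s "%s" .' % (subject, predicate, escape_literal(str(value))))
    return lines


# Made-up row, standing in for one record of a gene table.
triples = row_to_triples("gene", 1, {"name": "GENE_A", "description": "example glycoprotein"})
for line in triples:
    print(line)
```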

Ontocat is returning keywords; how about links, or more specific studies about the specific term?

An aspect of the so-called "core model":

  • "Our core model addresses some key requirements that stem from existing publication practices and the need to aggregate information from distributed sources. Similar to standard scientific publications, nano-publications need to be citable, attributable, and reviewable. Furthermore, they need to be easily curated. Nano-publications must be easily aggregated and identified across the Web. Finally, they need to be extensible to cater for new forms of both metadata and description."
    • "aggregate information from distributed sources": this is really important. The key is not to create more and more resources out of the existing ones, but to provide an efficient and accurate serving/presentation of the existing valid ones. Use standards that can actually point/refer to the core of your actual search, so that the information is distributed in an organized and consistent manner.

http://www2005.org/cdrom/docs/p613.pdf

http://4store.org/

http://tagora.ecs.soton.ac.uk/eswc2009/

http://wiki.dbpedia.org/Downloads351

  • "Numerous scientists have pointed out the irony that right at the historical moment when we have the technologies to permit worldwide availability and distributed process of scientific data, broadening collaboration and accelerating the pace and depth of discovery…..we are busy locking up that data and preventing the use of correspondingly advanced technologies on knowledge"
  • http://sciencecommons.org/

STEP 1: Create a Lucene index (command line) using the Molgenis csv export for database hvp_pilot.

  1. New test csv class: call csv export (/Users/despoina/Documents/workspace/hvp_pilot/handwritten/java/plugins/test_csv.java)
  2. Done. The file is in the Molgenis csv export directory: /private/var/folders/to/toww8wCyG3a88-qsfLyIV++++TI/-Tmp-/
    1. The csv export in Molgenis does not export each valid unit of information (such as a column) separately, just the database as a whole.
  3. At the command prompt, cd to the Lucene directory and, after copying your csv file into a directory there (here _syndrome_book_data_), try:
    1. $ java org.apache.lucene.demo.IndexFiles _syndrome_book_data_/
  • Now you can search your index by typing:
  1. $ java org.apache.lucene.demo.SearchFiles
    1. Example search: glycoprotein
    2. OK, there it is, in a single file ...
    3. Customizing Lucene: ... more output ... Searches http://lucene.apache.org/java/3_0_1/queryparsersyntax.html#Fuzzy Searches
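
The fuzzy syntax on the linked page (a trailing `~` on a term, optionally followed by a required similarity between 0 and 1, e.g. `glycoprotein~0.8`) matches terms by edit distance. The sketch below, in Python, illustrates the idea only: it scores terms with a plain Levenshtein distance normalised by the shorter term's length, which approximates but is not Lucene's implementation. The 0.5 default mirrors the default stated on that page; the function names and the term list are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def similarity(a, b):
    """Score in (-inf, 1]; 1.0 means identical terms (assumed formula)."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return 1.0 - edit_distance(a, b) / shortest


def fuzzy_match(query, terms, minimum=0.5):
    """Keep the terms whose similarity to the query exceeds the minimum."""
    return [term for term in terms if similarity(query, term) > minimum]


# A misspelled query still finds the intended term.
print(fuzzy_match("glycoprotien", ["glycoprotein", "glycogen", "protein"]))
```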

Last modified on 2010-10-01T23:19:13+02:00