= MOLGENIS progress update Jan - Jun 2010 =

[[TOC()]]

== Highlights ==
 * A dedicated MOLGENIS programmer, Robert, funded by [http://www.nbic.nl NBIC] since Feb 2010
 * MOLGENIS for eXtensible Genotype And Phenotype (XGAP) published ([http://www.ncbi.nlm.nih.gov/pubmed/20214801 Swertz et al, Genome Biology])
 * MOLGENIS used for noricdb.org ([http://www.ncbi.nlm.nih.gov/pubmed/20664631 Leu et al, Eur J Hum Genetics]) published
 * MOLGENIS used for multiple GEN2PHEN data model pilots (EBI, FIMM, U Leic, U Groningen/FWN, shared programmers), paper in draft
 * MOLGENIS used for a locus specific database (UMCG), paper in draft
 * MOLGENIS under development for [HGVBaseG2P ] data management (U Leicester, dedicated programmer)
 * MOLGENIS oral presentations at [BOSC], [HVP], [ISMB], [NBIC] conferences
 * MOLGENIS uptake: animaldb, eu-panacea/xgap, lifelines/xgap, eu-sysgenet/xgap ([http://www.ncbi.nlm.nih.gov/pubmed/20627861 Zouberakis, Database (Oxford)], [http://www.ncbi.nlm.nih.gov/pubmed/20205870 Gruenberger et al, BMC research notes])
 * Extensive documentation and support infrastructure now online

== Progress ==
The complete list of progress is available at http://www.molgenis.org/timeline
 * Batch upload by name: Users can now batch upload 'by name', so they no longer have to worry about internal id numbers when using cross references. For example, an import file can have a column 'Sample_name', which automatically resolves the link between your data and the sample with that name. Status: released.
 * Batch upload wizard: The user can now choose to 'ignore duplicates' or 'update existing'. In essence, this means they can upload dirtier data and let MOLGENIS take care of the cleaning.
 * Compact view: The user can now specify
in meta model. Status: released.
 * Enable multi-column lookup lists: In larger systems the organisation of data is often nested. For example, Samples are named within Investigations. To keep things clear, people want sample names to be unique, but only within an investigation. For the user, this means they must see both Investigation_name and Sample_name to uniquely identify samples. Composite labels of this kind, xref_labels="field1,field2", are now possible. Status: released.
 * Improved model validation: MOLGENIS now does extensive checking of the model. This has almost eradicated generator errors, because the modeler is now kept from making erroneous models, for example by validating cross references in the model. Status: released.
 * Improved the decorator framework: Decorators enable MOLGENIS designers to change the behavior of the database on add, update, remove and find. Additional logic can now be added before or after these actions. Moreover, this now also works with inheritance: if, for example, somebody designs a 'Versionable' interface that keeps track of record versions, then all subclasses of this Versionable also have this feature. Status: released.
 * Created Excel and zip based file imports: Instead of using a directory of CSV files, users can now upload an Excel file. Each sheet whose name matches an entity in your MOLGENIS model will be tested for import, and the columns matching entity fields will be reported. Based on this report the user can choose to import. Alternatively, users can upload a zip file with csv/tab files. Status: released.
 * Added automated testing suite: Each MOLGENIS now autogenerates an extensive test suite that is subsequently run using permutations of values based on the current data model. Both CSV import/export and database add/update/find/remove are extensively tested. This has greatly improved the quality of each MOLGENIS.
After each import these tests are now automatically run on the http://gbic.target.rug.nl:8080/hudson/ server. Status: released.
 * Running: authorization and authentication: MOLGENIS users can now include a MolgenisAuth plugin that allows users to register and log in using name or openid. Users can be organized in groups, and groups can have read/write access control on the level of forms and entities. Finally, a plugin extension point has been added to enable more sophisticated access control rules, for example for row level security. As planned, we will add standardized implementations of this extension point, for example for row level security, in the next 6 months. Status: under development; in beta testing with known partners. We invite anybody interested to contact us as a beta tester.
 * Running: MOLGENIS compute integration: MOLGENIS users can now add jobs to a job manager to be submitted to a PBS compatible cluster. The typical use case is to run R scripts, which then use the MOLGENIS R/API to read raw data and write back results. A simple meta model has been added to design input/output parameters so that the scripts can be parameterized via the MOLGENIS user interface. This work has been piloted in the XGAP system. A parser was also made to enable tool model exchange with Galaxy servers; this is not yet fully functional, however, and will be continued as planned in the coming 6 months. Status: under development.
 * (Sponsored by NBIC/Biobanking platform) Running: Index and ontology enhanced search: Together with the NBIC/biobank programmers we invested in search (driven by biobank use cases). We have piloted a Lucene indexing method to enable 'Google'-like searches on whole MOLGENIS instances. Next, we devoted effort to the development of OntoCAT (ontocat.org), the open source toolbox that enables simple and uniform access to diverse ontological sources.
Currently we are incorporating this tool to enable semantic query expansion, which uses ontological relationships to rewrite a user's query so that more relevant information is found. This project will be further developed in the next 6 months so that it can be publicly released. Status: under development.
 * Running: large data matrix storage: Large GWAS and QTL studies result in data of incredible sizes, for example 165k individuals × 1M SNP markers. We have found that this cannot work on MySQL when each data element is stored in the database separately. To overcome this problem without losing the power to integrate with MOLGENIS, we have been working on a software module 'MatrixInterface' that allows alternative backend implementations for such large data. A big advantage is that the data remains connected to the rest of MOLGENIS, which enables constraint checking, and that user interface efforts to navigate this data are shared. Next to pilots on Oracle, this includes a binary and a text based format, which have been released. The next step is to also support other back-ends like map/ped, bam, trityper, hdf5, hadoop, DAS/ensembl, biomart and so on. Status: under development.
 * Many bugfixes: nesting of submenus, dealing with null fields, dealing with null query rules, date related issues, automatic defaults for mrefs, many small issues corrected following an automatic code quality check using PMD, and extensive work on documentation. Status: released.

NB: Compared to the roadmap made with NBIC we are a little ahead of schedule (we already started with semantics) thanks to support from GEN2PHEN, EBI and NBIC/biobanking.

== Bottlenecks ==
 * MOLGENIS 1st international workshop or mini-conference: Diverse groups have asked for a MOLGENIS hackathon, workshop or course. Would NBIC or others be willing to sponsor and co-organize such an event?
 * MOLGENIS coordination NBIC: We are slightly disappointed that MOLGENIS dissemination is not pushed within NBIC platforms.
This is surprising given the international uptake. Would it be an idea to add MOLGENIS to the course rotation, analogous to other tools like Galaxy? Or to make it part of BRS project requests, which would also make more use of our local strengths? Also, the scale of MOLGENIS sponsoring is rather modest compared to the support for other initiatives.

== Scientific output ==
 * '''Papers'''
   * XGAP: a uniform and extensible data model and software platform for genotype and phenotype experiments. Swertz et al. Genome Biol. 2010;11(3):R27.
   * Towards the integration of mouse databases - definition and implementation of solutions to two use-cases in mouse functional genomics. Gruenberger M et al. BMC Res Notes. 2010 Jan 22;3(1):16.
 * '''Presentations'''
   * XGAP - eXtensible software platform for high throughput Genotypes And Phenotypes. Invited oral presentation at EU-SYSGENET cost meeting, Braunschweig, April 8, 2010 (part of Sysgenet publication)
   * Towards flexible data infrastructures for genotype and phenotypes: models, generators, formats & tools. Selected for oral presentation at 3rd Human Variome Project meeting, Paris, May 13, 2010
   * Chair of the BioAssist study capturing workshop, June 10, Utrecht, 2010.
   * User friendly cluster computing for QTL analysis on XGAP. Danny Arends et al. Poster presentation at NBIC Conference – 2010, Lunteren, March 29
   * Towards a MOLGENIS based Platform for Proteomics. Poster at NPC-2010, Utrecht, February 16 and NBIC Conference – 2010, Lunteren, March 29
   * Towards a MOLGENIS based data analysis framework for proteomics. Oral presentation at NBIC Conference – 2010, Lunteren, March 29
 * '''Future publications'''
   * MOLGENIS: rapid prototyping of biosoftware at the push of a button. Morris Swertz et al. Accepted for Technology Track and poster presentation at ISMB2010
   * MOLGENIS: rapid prototyping of biosoftware at the push of a button. Morris Swertz et al.
Accepted for oral presentation at BOSC2010
   * Towards a federated microarray gene expression repository using MOLGENIS and MAGE-TAB. Alexandros Kanterakis et al. Accepted for oral presentation at BOSC2010.
   * Towards a federated microarray gene expression repository using MOLGENIS and MAGE-TAB. Alexandros Kanterakis et al. Accepted for poster presentation at ISMB2010
   * Towards a MOLGENIS based computational framework. H. Byelas, M. Swertz. The 19th Euromicro International Conference on Parallel, Distributed and Network-Based Computing, Ayia Napa, Cyprus, February 9-11, 2011 (submitted paper)
   * SYSGENET paper
   * GMOD invited presentation

== Collaborations ==
 * '''National'''
   * We continue our intense collaborations with NBIC (biobanking, brs, molgenis) and NPC (proteomics), as in the previous period
   * We now collaborate with the LifeLines project, a biobank following 165k individuals for 30 years. MOLGENIS is being piloted for the researchers' data access platform
   * We are participating in the BBMRI-NL project, the local biobank infrastructure initiative. MOLGENIS will be an indispensable tool for data management.
 * '''International'''
   * We continued the collaboration with the European Bioinformatics Institute Hinxton