Version 48 (modified by Morris Swertz, 14 years ago)

Stories for the research portal release of Jan. 2012

These are the user stories derived from the main funders of this activity: LifeLines and PanaceaProject.

As an admin we want to deploy new molgenis VMs with cluster access

Scrum: ticket:1068

Acceptance criteria:

We will have one standard VM that Ger can clone at any time for studyX
There is a procedure to ask target-beheer to create a new VM for a particular user and group
VMs are pre-configured with all necessary software, see VMRequirements
VMs will have a gpfs2 mount at /target/gpfs2/lifelines/studyX
Tools like PLINK and PLINKSeq will not live in the VMs but in a shared folder on gpfs, in /target/gpfs2/lifelines/tools
(LifeLines only) the firewall is closed to outside traffic for the user

Status:

Joeri makes an appointment with Ger and Morris on Wednesday morning to discuss [done]
Requirements are listed on VMRequirements

SOP:

A LifeLines official submits a request to target-beheer for a VM plus the associated users
target-beheer makes a copy of the standard 'molgenis' VM and launches it on GPFS
target-beheer adds the specific group, users and firewalls to the VM and the cluster
target-beheer hands the VM to the LifeLines official, who fills it with data via an automated script
The LifeLines administrator runs a test to check that Internet access is indeed closed and that everything works properly

FAQ

Will Molgenis be running inside the VM or inside the WOM?

Inside VM. Just as always.

As the LL team we want to connect each WOM to the proper VM

Scrum: ticket:1053

Acceptance criteria:

There is a WOM available to the MOLGENIS developers.
When a browser is opened in the WOM, it can access molgenis@VM at CIT (e.g. application32)
When ssh is used in the WOM, it can reach the VM (e.g. application32)
When ssh is used in the WOM, it can reach cluster.gcc.rug.nl (future: cluster.lifelines.target.nl)
When using the browser, no other websites can be reached
When using ssh, no other servers can be reached
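These criteria could be verified from inside the WOM with a small script. The following is only a sketch: the function names are hypothetical, and the ports to probe (for example 22 for ssh, 80 for the browser) are assumptions that the page does not state. The probe is injectable so the logic can be tried without a network.

```python
import socket
from typing import Callable, Dict, Iterable, Tuple

Endpoint = Tuple[str, int]  # (host, port); the port numbers are assumptions


def tcp_reachable(endpoint: Endpoint, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name lookups.
        return False


def check_isolation(
    must_reach: Iterable[Endpoint],
    must_not_reach: Iterable[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_reachable,
) -> Dict[Endpoint, str]:
    """Map every endpoint that violates the criteria to a short reason."""
    failures: Dict[Endpoint, str] = {}
    for endpoint in must_reach:
        if not probe(endpoint):
            failures[endpoint] = "should be reachable but is not"
    for endpoint in must_not_reach:
        if probe(endpoint):
            failures[endpoint] = "should be blocked but is reachable"
    return failures
```

Run inside the WOM with application32 and cluster.gcc.rug.nl in `must_reach` and a few arbitrary outside hosts in `must_not_reach`, an empty result would mean the criteria hold for the endpoints that were probed. It cannot prove that every other server is blocked.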

Status:

unknown!

FAQ:

WOM is a Windows 7 based virtual desktop hosted at UMCG. VM is a virtual Linux box living at CIT. There are several VPNs between UMCG and CIT.

How to link all of this together? Is someone working on this?

As the LL team we want to provide a Molgenis Research Portal for each study

Scrum: ticket:949

Acceptance criteria:

When a VM is rolled out, a Molgenis-based LL Research Portal is provided within it
Based on the study request, this Portal will be filled with content (How do we know which content to load? Where does this happen?)
Targets, protocols and measurements are retrieved from some location and put into the Portal. They are restricted based on the study request.
The Portal instance contains only the pheno data that was requested for the study

Status:

In the Molgenis Research Portal, we want to have a Phenotype Matrix Viewer

Scrum: ticket:1050

Original Requirements on Modules/Matrix

Acceptance criteria:

One viewer!
Has options to export to Excel and SPSS.
Has column filters that can be stacked.
Has row and column header filters.
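As an illustration of what "stacked" column filters could mean, the sketch below combines filters with a logical AND, so that each added filter narrows the previous result. This interpretation, the function names and the example columns are assumptions; the sketch does not reflect the existing MatrixViewer or SliceableMatrix code.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Filter = Callable[[Row], bool]


def column_filter(column: str, predicate: Callable[[object], bool]) -> Filter:
    """Build a filter that tests the value in one column of a row."""
    return lambda row: predicate(row[column])


def apply_stacked(rows: List[Row], filters: List[Filter]) -> List[Row]:
    """Keep the rows that pass every filter on the stack (logical AND)."""
    return [row for row in rows if all(f(row) for f in filters)]


# Hypothetical usage with made-up columns "age" and "sex".
rows = [
    {"id": "a", "age": 34, "sex": "F"},
    {"id": "b", "age": 61, "sex": "F"},
    {"id": "c", "age": 58, "sex": "M"},
]
stack = [
    column_filter("age", lambda v: v > 50),
    column_filter("sex", lambda v: v == "F"),
]
selected = apply_stacked(rows, stack)
```

Removing the last filter from `stack` and re-applying it widens the selection again, which is the behaviour a user would expect from a filter stack.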

Status:

Unclear. We have one MatrixViewer but seem to have at least two back-ends now: the original SliceableMatrix and the one from Joris that supports multiple values in separate rows.
Suggestion: coming Monday, discuss what applications and modules there are, what added value they provide with regard to xQTL, and if they do, find a way to integrate them.

As a user, I want to select a phenotype and a list of individuals in the Molgenis Research Portal and then run a GWAS on the LL geno data

Scrum: ticket:1052

Acceptance criteria:

Status per sub-story:

We must have the imputed Third Release geno data on gpfs storage

Harm-Jan is currently imputing; this will take another two weeks.
After that we can upload these data (in TriTyper format?) to gpfs.

Harm-Jan:

I am currently imputing the genotype data of LifeLines release 3 on the millipede cluster. I will try to have this finished within the agreed two weeks. Once this is all done, I will have a file that states the imputation quality for each SNP (using the r2 quality score reported by BEAGLE), per batch of 300 samples. To impute the data quickly, I split the total dataset into batches of about 300. I will make sure there is a link table that states which sample is in which batch, and I will also provide the mean imputation score over all batches. In addition, it is worthwhile to present the minor allele frequency (MAF) and the Hardy-Weinberg p-value (HWEP) for each SNP. The HWEP indicates whether the genotype distribution observed for a SNP matches the distribution expected from its allele frequencies. We have seen before that a low HWEP often goes together with low imputation quality (i.e. imputation errors make the actual distribution deviate from the expected one). Moreover, these values are easy to compute with the software I gave Joeri earlier. The MAF can also be informative, because low-frequency SNPs (MAF < 0.01) are imputed poorly in the current setting, as the reference dataset contains only 90 samples.
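To make the MAF and HWEP concrete, here is a minimal sketch that computes both from the genotype counts of one SNP. It uses a chi-square goodness-of-fit test with one degree of freedom. That choice is an assumption: the software Harm-Jan refers to is not shown here and may use a different (for example exact) test. The function name is hypothetical.

```python
import math
from typing import Tuple


def maf_and_hwe(n_aa: int, n_ab: int, n_bb: int) -> Tuple[float, float]:
    """Return (MAF, HWE p-value) from the genotype counts AA, AB, BB."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped samples")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    maf = min(p, q)
    if maf == 0.0:
        return 0.0, 1.0  # monomorphic SNP: nothing to test
    # Genotype counts expected under Hardy-Weinberg equilibrium.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return maf, p_value
```

For counts of 25, 50 and 25 the observed and expected distributions coincide, so the p-value is 1.0. For 50, 0 and 50 (no heterozygotes at all) the p-value is vanishingly small, which is the kind of SNP that would be flagged.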

Morris:

Two quick questions:

(1) What is the exact imputation procedure (or does it not differ from what Alex does)? We must be able to tell researchers exactly what they get.

(2) This is something that LifeLines should soon simply be able to do itself (i.e. Alex's pipeline already runs on compute, so if it is identical we have a 'go').

@Joeri: it would be nice if all the info that Harm-Jan mentions could also be displayed. You could treat HWE and MAF as features, each SNP as a target, and then have values for each combination.
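The suggestion to treat HWE and MAF as features and each SNP as a target can be sketched as a flattening step that yields one value record per combination. The record name, the function name and the SNP identifiers used in any example are illustrative assumptions, not the pheno model's actual classes.

```python
from typing import Dict, Iterator, NamedTuple


class ObservedValue(NamedTuple):
    target: str   # the SNP
    feature: str  # e.g. "MAF" or "HWE"
    value: float


def to_observed_values(
    snp_stats: Dict[str, Dict[str, float]]
) -> Iterator[ObservedValue]:
    """Flatten {snp: {feature: value}} into one record per combination."""
    for snp, stats in snp_stats.items():
        for feature, value in stats.items():
            yield ObservedValue(snp, feature, value)
```

Stored this way, a new per-SNP statistic (such as the mean imputation score) becomes one more feature rather than a new column.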

We need to be able to link geno to pheno data

Proposal by Jan-Lucas:

Starting points:

Marcel's spreadsheet contains LL Patient IDs and Marcel pseudonyms (linked).
LL Patient IDs do not go from LRA to Target Stage.
Target Stage contains the LL source pseudonym.
For a study, the LL source pseudonym is replaced by a study pseudonym.

The proposal itself:

Marcel's spreadsheet is imported into LRA or, if that is not possible, into a separate database.
When a dataset for a study is created in UMCG Publish, Marcel's spreadsheet is pseudonymised in the same way as the LRA data: from Patient ID to source pseudonym to study pseudonym. This yields a list of study pseudonyms and Marcel pseudonyms.
The list is included in the data export/import to CIT Publish.
On CIT Publish there will be a view that translates Marcel pseudonym to study pseudonym per study. The view can be relational, but can also deliver XML.
On CIT Publish there will be a database procedure for emptying the table of pseudonyms.
Once the LRA data is on CIT Publish, the view is read; based on this, the procedure "replace pseudonyms" (from Gert-Jan's PPTX) can be run.
After the procedure "replace pseudonyms", any file with pseudonyms that was created is deleted. (Preferably the procedure keeps the list in memory, but if it is in a file, that file must be deleted.)
After the procedure "replace pseudonyms", the table of pseudonyms is emptied for that study by calling the database procedure.
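The chain of pseudonyms in this proposal can be sketched with plain dictionaries. Everything below is an assumption made for illustration: the function names, the in-memory tables, and in particular the idea that the geno sample IDs are Marcel pseudonyms. The real procedure "replace pseudonyms" is a database procedure that is not shown on this page.

```python
from typing import Dict, List


def build_translation(
    marcel_by_patient: Dict[str, str],
    source_by_patient: Dict[str, str],
    study_by_source: Dict[str, str],
) -> Dict[str, str]:
    """Chain Patient ID -> source pseudonym -> study pseudonym.

    Returns {Marcel pseudonym: study pseudonym}; Patient IDs are not kept.
    Patients without a study pseudonym (not in the study) are skipped.
    """
    translation: Dict[str, str] = {}
    for patient_id, marcel in marcel_by_patient.items():
        source = source_by_patient.get(patient_id)
        study = study_by_source.get(source) if source is not None else None
        if study is not None:
            translation[marcel] = study
    return translation


def replace_pseudonyms(
    sample_ids: List[str], translation: Dict[str, str]
) -> List[str]:
    """Replace Marcel pseudonyms by study pseudonyms, then empty the table."""
    try:
        return [translation[sample_id] for sample_id in sample_ids]
    finally:
        # Mirrors "empty the table of pseudonyms" once the replacement ran.
        translation.clear()
```

Keeping the translation in memory and clearing it in a `finally` clause corresponds to the stated preference that no file with pseudonyms is left behind, even when the replacement fails halfway.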

We must be able to initiate a GWAS run from the Research Portal

Proposed flow by Morris:

precondition:

The research portal has access to a geno file (directly in PLINK format + binary format) that contains the same individual pseudonyms as the pheno database. Or would it be better to use only the xQTL binary file here?
This geno data is therefore prepared in advance, per research portal, with the correct pseudonyms (= SOP genodata). The portal does not need to consult the pseudonymisation itself.
The VM runs directly on top of the cluster and has access to GPFS via that cluster. Each research portal therefore has a folder such as /gpfs/target/lifelines/study1/rawdata/study1.bed

logic:

Once the user has selected the phenotype, the program, given the list of individuals, goes through the entire bed (?) file and (1) drops the rows of individuals that are not in the view and (2) sets the pheno column to the selected phenotype.
The implementation depends on how long this process takes. If it is 'done while you wait', it can simply be a plugin. Otherwise it must go through MOLGENIS compute as Joeri describes. Output: /gpfs/target/lifelines/study1/results/myselection1.bed
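The two steps of this logic can be sketched without touching the binary genotype file. In a PLINK binary fileset the sample rows and the phenotype column live in the .fam file (FID IID PAT MAT SEX PHENO), not in the .bed itself, so this sketch swaps in a different technique from the one proposed above: it produces a keep list and a phenotype list in the formats accepted by PLINK's `--keep` and `--pheno` options. The function name and the way the Portal passes its selection are assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple


def select_for_gwas(
    fam_lines: Iterable[str],
    selected_ids: Set[str],
    phenotype: Dict[str, float],
) -> Tuple[List[str], List[str]]:
    """Return (keep_lines, pheno_lines) for the selected individuals.

    fam_lines: rows of a PLINK .fam file (FID IID PAT MAT SEX PHENO).
    selected_ids: individual pseudonyms (IID) visible in the Portal view.
    phenotype: IID -> value of the phenotype chosen by the user.
    """
    keep_lines: List[str] = []
    pheno_lines: List[str] = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        fid, iid = fields[0], fields[1]
        if iid not in selected_ids:
            continue  # (1) drop individuals that are not in the view
        keep_lines.append(f"{fid} {iid}")
        # (2) attach the selected phenotype; -9 is PLINK's missing code
        value = phenotype.get(iid, -9)
        pheno_lines.append(f"{fid} {iid} {value}")
    return keep_lines, pheno_lines
```

Both lists are small text files, so building them should be 'done while you wait'. The expensive part is the PLINK run that consumes them, which is the part that would go through MOLGENIS compute.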

Acceptance criteria:

List of individuals and selected phenotype are passed from the Portal

As an LL data manager, I want to have a Catalog of the LL phenotype data

Scrum: ticket:1051

Acceptance criteria:

Status:

The functional design / mock-up of the Catalog is on Catalogue.
We still need a technical design of the Catalog.
Despoina and Chao need LL metadata to fill their first version of the catalog. Joris will provide these data; however, they are still incomplete and will probably change.

As an LL data manager, I want to load data from the publish layer into EAV (pheno model)

Scrum: ticket:1067

Acceptance criteria:

Status:

To-do's:

Add logging (so we can see what is going on if it ever crashes in the production environment)
- Add Thread Monitor
How to handle/load/implement descriptive tables like LAB_BEPALING? This table is actually a big list of Measurements with a lot of extra fields.
- options:
  - Create a new type that extends Measurement and holds the additional fields
  - Merge data into the label of Category
How to handle/load/implement the table that describes which foreign keys are used between the tables?
- The matrix viewer should know this info as well to build correct queries
Refactor the lifelines packages (they are a little bit messy): remove old code that is no longer used and place the rest in descriptive packages
Remove JPA dependencies
- Many-to-many relations in JPA do not work properly with Labels, for example ov.setTarget_Name("x"). In JDBCMapper this is solved, but we do not know where and how we could best do this for JPA. This set-by-label should also be covered by the generated tests
- Remove/change the org.molgenis.JpaDatabase interface to prevent this
- trick to prevent a compilation problem in Hudson, should be changed!
- this.em = ((JpaDatabase)db).getEntityManager().getEntityManagerFactory().createEntityManager();
- JPA reverse relations cause Hudson to give a compile error. They should be added to molgenis for non-JPA entities, and implemented as @deprecated or as throwing UnsupportedOperationException.
Update CSV readers to be multi-threaded?
In the (production) environment it is not a bad idea to put the Java executable in the Oracle VM that is part of the database.
Last but not least, test whether the data is loaded correctly (test from Anco).
We should make sure that the data is always loaded in the right format (this means that it always ends up the right way in the database).
