Changes between Version 1 and Version 2 of SopConvertLifeLinesGenoData


Timestamp:
2012-04-04T06:13:15+02:00 (13 years ago)
Author:
Morris Swertz
Comment:

--

  • SopConvertLifeLinesGenoData

    v1 v2  
    2323Complete genotype data is in: /target/gpfs2/lifelines_rp/releases/LL3/
    2424
     25Per study, a text file is available that contains the mapping and selection of identifiers:
     26
     27== Procedure ==
     28
     29
     30
     31 * Data resides on /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedTriTyper (accessible from all our new VMs)
     32 * Convertor from TriTyper to PLINK resides on /target/gpfs2/lifelines_rp/releases/LL3
     33 * Correct Java version resides on /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/
     34 * STEP 1: make the subset_molgenis<n>.txt file:
     35   * In every MOLGENIS<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
     36   * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
     37   * Export this view (tab separated, no enclosures, no headers) to molgenis<n>.txt and scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3
     38   * Run the following command there: {{{ ./formatsubsetfile.sh molgenis<n>.txt }}}
     39   * Your file is now available as subset_molgenis<n>.txt and looks like:[[BR]]LL_WGA0001   STUDYPSEUDO1   0[[BR]]LL_WGA0002   STUDYPSEUDO2   0[[BR]]LL_WGA0003   STUDYPSEUDO3   0[[BR]]...
     40    * So: Geno individual IDs - TAB - Study pseudonyms - TAB - Phenotypes (can be all 0s, as the TFAM will be generated later by the user)
     41    * Items are TAB-separated and the file does not end with a newline
     42 * STEP 2: run the convertor
     43  * Usage: {{{ /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study<n> subset_molgenis<n>.txt }}}
     44 * STEP 3: copy to correct location
     45  * {{{cp study<n>.tped ../../lifelines0<n>}}}
     46  * May take some time!
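
The format rules from STEP 1 (three TAB-separated fields per line, no newline at the end of the file) can be checked before the converter is run. The following is a minimal sketch, not part of the original procedure: the function name `check_subset_file` is hypothetical, and it only assumes a POSIX shell with `awk` and `tail`.

```shell
# Hypothetical helper (not part of the original SOP): checks a subset file
# against the format described in STEP 1.
check_subset_file() {
  file="$1"
  # Every line must have exactly three TAB-separated fields.
  awk -F '\t' '
    NF != 3 { printf "line %d: expected 3 fields, got %d\n", NR, NF; bad = 1 }
    END { exit (bad ? 1 : 0) }
  ' "$file" >&2 || return 1
  # The file must not end with a newline: command substitution strips a
  # trailing newline, so an empty result means the last byte was a newline.
  if [ -s "$file" ] && [ -z "$(tail -c 1 "$file")" ]; then
    echo "file ends with a newline" >&2
    return 1
  fi
  return 0
}
```

Calling the function with the generated subset file as its only argument returns 0 when the file matches the description and a non-zero status (with a message on stderr) otherwise.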
     47
     48== Further Genodata ==
     49
     50The commands above generate a single large file for the study in question. Researchers would like some further manipulation of this file:
     51 * Supply the large genodata in binary format, using command:
     52     {{{ plink --tfile <data> --make-bed --out <data> }}}
     53     This should generate .bed, .bim and .fam files.
     54 * Supply the data also in separate files per chromosome. This can be done with the commands:
     55     {{{  plink --tfile <data> --make-bed --chr 1 --out <data_chr1> }}}
     56     {{{  plink --tfile <data> --make-bed --chr 2 --out <data_chr2> }}}
     57     {{{  ... }}}
     58   (a script file should take care of this series of commands)
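
A script for the per-chromosome series could look like the sketch below. It is an illustration only: the function name `print_plink_per_chr` is hypothetical, the default range 1..22 (autosomes) is an assumption about which chromosomes are wanted, and the data prefix is assumed to contain no whitespace. The function prints the commands instead of running them, so the list can be inspected first.

```shell
# Prints one plink command per chromosome for the given --tfile prefix.
# Assumption: chromosomes 1..22 by default; pass other bounds as the
# second and third argument if the data differs.
print_plink_per_chr() {
  data="$1"
  first="${2:-1}"
  last="${3:-22}"
  chr="$first"
  while [ "$chr" -le "$last" ]; do
    echo "plink --tfile $data --make-bed --chr $chr --out ${data}_chr$chr"
    chr=$((chr + 1))
  done
}
```

For example, `print_plink_per_chr study1` (with `study1` as a placeholder prefix) lists 22 commands; piping that output to `sh` executes them one after another.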
     59
     60Besides the genodata, dosage information is also desired, both for the total (per-study) dataset and for that dataset split per chromosome.
     61 * Joeri has a tool for this. NB: it is slow; reimplementing it in a compiled language might be worthwhile.
     62
     63[[Image(http://i.imgur.com/nLT2e.png)]]
     64
     65A schematic overview of the two export paths described above.