| 25 | For each study, a text file is available that contains the mapping and selection of identifiers: |
| 26 | |
| 27 | == Procedure == |
| 28 | |
| 29 | |
| 30 | |
| 31 | * Data resides on /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedTriTyper (accessible from all our new VMs) |
| 32 | * The converter from TriTyper to PLINK resides in /target/gpfs2/lifelines_rp/releases/LL3 |
| 33 | * Correct Java version resides on /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/ |
| 34 | * STEP 1: make the subset_molgenis<n>.txt file: |
| 35 | * In every MOLGENIS<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view |
| 36 | * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers) |
| 37 | * Export this view (tab separated, no enclosures, no headers) to molgenis<n>.txt and copy it with scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3 |
| 38 | * Run the following command there: {{{ ./formatsubsetfile.sh molgenis<n>.txt }}} |
| 39 | * Your file is now available as subset_molgenis<n>.txt and looks like:[[BR]]LL_WGA0001 STUDYPSEUDO1 0[[BR]]LL_WGA0002 STUDYPSEUDO2 0[[BR]]LL_WGA0003 STUDYPSEUDO3 0[[BR]]... |
| 40 | * That is: genotype individual IDs, TAB, study pseudonyms, TAB, phenotypes (these can all be 0, because the TFAM file will be generated later by the user) |
| 41 | * Items are TAB-separated and the file does not end with a newline |
| 42 | * STEP 2: run the converter |
| 43 | * Usage: {{{ /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study<n> subset_molgenis<n>.txt }}} |
| 44 | * STEP 3: copy to correct location |
| 45 | * {{{cp study<n>.tped ../../lifelines0<n>}}} |
| 46 | * May take some time! |
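Before running the converter in STEP 2, it may help to check that the subset file matches the layout described in STEP 1. The following is a minimal sketch, not part of the existing tooling: the function name `check_subset` is hypothetical, and it only tests the two properties stated above (three TAB-separated fields per line, no newline at the end of the file).

```shell
#!/bin/sh
# check_subset FILE (hypothetical helper, not part of formatsubsetfile.sh):
# succeeds only if every line has exactly three TAB-separated fields
# and the file does not end with a newline.
check_subset() {
    file="$1"

    # Any line without exactly 3 TAB-separated fields marks the file as bad.
    awk -F '\t' 'NF != 3 { bad = 1 } END { exit bad + 0 }' "$file" || return 1

    # Command substitution strips a trailing newline, so the result is
    # non-empty only when the last byte of the file is not a newline.
    [ -n "$(tail -c 1 "$file")" ]
}

# Usage: sh check_subset.sh subset_molgenis<n>.txt
if [ "$#" -gt 0 ]; then
    if check_subset "$1"; then
        echo "OK: $1"
    else
        echo "Unexpected format: $1" >&2
        exit 1
    fi
fi
```

An empty file is also reported as having an unexpected format, which seems reasonable for a subset that should list at least one individual.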
| 47 | |
| 48 | == Further Genodata == |
| 49 | |
| 50 | The commands above generate a single large file for the study in question. Researchers would like some further processing of this file: |
| 51 | * Supply the large genodata in binary format, using the command: |
| 52 | {{{ plink --tfile <data> --make-bed --out <data> }}} |
| 53 | This should generate .bed, .bim and .fam files. |
| 54 | * Supply the data also in separate files per chromosome. This can be done with the commands: |
| 55 | {{{ plink --tfile <data> --make-bed --chr 1 --out <data_chr1> }}} |
| 56 | {{{ plink --tfile <data> --make-bed --chr 2 --out <data_chr2> }}} |
| 57 | {{{ ... }}} |
| 58 | (a script file should take care of this series of commands) |
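The per-chromosome series could be generated with a small loop. This is a sketch under assumptions: the function name `per_chr_commands` and the default prefix `study1` are illustrative, and the range 1-22 covers only the autosomes, so adjust it if other chromosomes are present in the data. The script prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# per_chr_commands PREFIX (hypothetical helper): print one plink command
# per chromosome for the TPED/TFAM pair named PREFIX.
# Assumption: autosomes 1-22 only.
per_chr_commands() {
    data="$1"
    chr=1
    while [ "$chr" -le 22 ]; do
        echo "plink --tfile $data --make-bed --chr $chr --out ${data}_chr$chr"
        chr=$((chr + 1))
    done
}

# Print the commands for the given prefix (default: study1).
per_chr_commands "${1:-study1}"
```

To execute the commands rather than just list them, pipe the output to a shell, for example `sh per_chr.sh study1 | sh` (the script file name is an assumption).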
| 59 | |
| 60 | Besides the genodata, dosage information is also desired, both for the total (per-study) dataset and for that dataset split per chromosome. |
| 61 | * Joeri has a tool for this. NB: it is slow, so reimplementing it in a compiled language might be worthwhile. |
| 62 | |
| 63 | [[Image(http://i.imgur.com/nLT2e.png)]] |
| 64 | |
| 65 | Figure: a schematic overview of the two export paths described above. |