8 | | Specifications: |
9 | | * Geno data is released to researcher 'per study' (i.e. an approved research request). |
10 | | * Per study a subset of the individuals is selected |
11 | | * The individual identifiers are 're-pseunomized' from 'WGA' to 'study' identifiers |
12 | | * Data is reformatted in various PLINK formats |
| 7 | == Required inputs == |
| 8 | |
| 9 | === Variables per study: === |
| 10 | * '''studyId''' - the id of study. E.g. 'OV039' |
| 11 | * '''studyDir''' - the folder where all converted data should go. E.g. '../home/lifelines_OV039' |
| 12 | * '''mappingFile''' - describing what WGA ids are included and what their study ids are. E.g.: |
| 13 | |
| 14 | {{{ |
| 15 | 1 LL_WGA0001 1 12345 |
| 16 | 1 LL_WGA0002 1 09876 |
| 17 | 1 LL_WGA0003 1 64542 |
| 18 | ... |
| 19 | }}} |
| 20 | |
| 21 | * So: Geno family ID's - TAB - Geno individual ID's - TAB - Study family psuedonyms TAB Study pseudonyms |
| 22 | * Items are TAB-separated and it doesn't end with a newline |
| 23 | |
| 24 | === Constants over all studies: === |
| 25 | * Beagle imputed dosage files per chromosome (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/*.dose |
| 26 | * Beagle imputed genotype files per chromosome (ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/*.map and *.ped |
| 27 | * Beagle batch and quality files: /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/BeagleBatches.txt and BeagleQualityScores.txt |
16 | | The IDS should be filtered (e.g. 5000) and recoded (psuedoids) for one study. |
| 31 | Result of this procedure is that there will be a folder readable to a lifelines_OV039 user containing: |
| 32 | * [studyId]_Chr[x].PED/MAP/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered) |
| 33 | * [studyId]_Chr[x].BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered) |
| 34 | * [studyId]_Chr[x].DOSE imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered) |
| 35 | * [studyId]_imputation_batch.txt listing imputation batches |
| 36 | * [studyId]_imputation_batch_quality.txt listing imputation quality per SNP x batch |
18 | | User expects files in PLINK format: |
19 | | * PED/MAP/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered) |
20 | | * BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered) |
21 | | * MAP/PED imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered) |
22 | | * batch.txt listing imputation batches |
23 | | * impution_quality.txt listing imputation quality per SNP x batch |
| 38 | NB: All files should be prefixed with {{{studyId}}}. |
26 | | == Required inputs == |
27 | | |
28 | | The following are input for the conversion procedure: |
29 | | * Beagle imputed genotype files (fam/ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap |
30 | | * Beagle imputed dosage files (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage |
31 | | * '''per study''' mapping file for study to filter and re-pseudonomize identifiers |
32 | | |
33 | | Example mapping file: |
34 | | {{{ |
35 | | 1 LL_WGA0001 1 STUDYPSEUDO1 |
36 | | 1 LL_WGA0002 1 STUDYPSEUDO2 |
37 | | 1 LL_WGA0003 1 STUDYPSEUDO3 |
38 | | ... |
39 | | }}} |
40 | | |
41 | | * So: Geno family ID's - TAB - Geno individual ID's - TAB - Study family psuedonyms TAB Study pseudonyms |
42 | | * Items are TAB-separated and it doesn't end with a newline |
48 | | * In every STUDY<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view |
49 | | * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers) |
| 48 | This is done in the generic layer: |
| 49 | * In every [StudyID] schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view |
| 50 | * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("WGA" IDs, the LL_WGA numbers) |
65 | | filter individuals (repeat per chr) |
| 59 | === Step 2: generate conversion jobs === |
| 60 | |
| 61 | The conversion has been fully automated (see below for details). Therefore we generate all the jobs needed to convert. |
| 62 | These jobs are produced in the 'studyDir/scripts'. |
| 63 | |
| 64 | Command: |
| 65 | {{{ |
| 66 | sh ../LL3/scripts/generateGenoJobs.sh \ |
| 67 | --studyId OV039 \ |
| 68 | --outputDir ../lifelines_OV039 \ |
| 69 | --mappingFile ../mappingFile_OV039.txt |
| 70 | }}} |
| 71 | |
| 72 | == Step 3: submit jobs == |
| 73 | |
| 74 | change directory to the scripts directory, inspect and submit: |
| 75 | |
| 76 | change directory: |
| 77 | {{{ |
| 78 | cd ../lifelines_OV039/scripts |
| 79 | }}} |
| 80 | |
| 81 | list scripts (we expect jobs for each format and each chromosome): |
| 82 | {{{ |
| 83 | ls -lah |
| 84 | }}} |
| 85 | |
| 86 | submit to cluster |
| 87 | {{{ |
| 88 | sh submit_jobs.sh |
| 89 | }}} |
| 90 | |
| 91 | == Step 4: monitor progress, QC results == |
| 92 | |
| 93 | monitor progress using 'qstat' |
| 94 | |
| 95 | TBD how to QC. |
| 96 | * must check that now WGA id is in the data |
| 97 | * must check that all ids where in the set |
| 98 | |
| 99 | == Step 5: release == |
| 100 | {{{ |
| 101 | cd ../lifelines_OV039/ |
| 102 | |
| 103 | #give user permission to see the data |
| 104 | chown lifelines_OV039:lifelines * |
| 105 | }}} |
| 106 | |
| 107 | == Implementation details == |
| 108 | |
| 109 | The 'generateGenoJobs.sh' script implements the following steps: |
| 110 | |
| 111 | === Convert MAP/PED and generate BIM/BED/FAM=== |
| 112 | |
74 | | plink --file testdata_chr1 --keep subset.txt --recode --out temp_chr1 |
| 126 | /target/gpfs2/lifelines_rp/tools/plink-1.08/plink108 \ |
| 127 | --file /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/output.$i \ |
| 128 | --update-ids $updateIdsFile \ |
| 129 | --out $studyDir/${studyId}_chr$i \ |
| 130 | --keep $keepFile \ |
| 131 | --recode |
| 132 | |
| 133 | #step 3: for i in {1..22} |
| 134 | #convert to bim/fam/bed |
| 135 | plink \ |
| 136 | --file $studyDir/${studyId}_chr$i \ |
| 137 | --make-bed |
| 138 | |
| 139 | #remove temp |
| 140 | rm temp_chr1 |
| 141 | |
| 142 | === Convert dosage format === |
| 143 | |
| 144 | As PLINK cannot updateIds on dosage files we created it ourselves. The command: |
| 145 | |
| 146 | |
| 147 | {{{ |
| 148 | #step1: for i in {1..22} |
| 149 | #--subsetFile is the mappingFile |
| 150 | #--doseFile is the imputed dosages |
| 151 | #--outFile is where the converted file should go |
| 152 | |
| 153 | python /target/gpfs2/lifelines_rp/releases/LL3/scripts/convertDose.py \ |
| 154 | --subsetFile $mappingFile \ |
| 155 | --doseFile /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/ImputedGenotypeDosageFormatPLINK-Chr$i.dose \ |
| 156 | --outFile ${studyDir}/${studyId}_chr$i.dose |
| 157 | |
77 | | update individuals ids (repeat per chr) |
78 | | {{{#!sh |
79 | | #--file [file] is input file |
80 | | #--keep [file] tells plink what individuals to update |
81 | | #(from txt file with OLD fam + ind id + NEW fam id + ind id) |
82 | | #--recode tells plink to write results (otherwise no results!!! argh!) |
83 | | # result: updatedids.map/ped |
84 | | |
85 | | plink --file temp_chr1 --update-ids subset.txt --recode --out study2_chr1 |
86 | | }}} |
87 | | |
88 | | |
89 | | #step 3: |
90 | | #convert to bed (repeat per chr) |
91 | | plink --file study2_chr1 --make-bed |
92 | | === Step 4: convert into dosage format === |
93 | | |
94 | | TODO! ask Joeri? |