| 8 | | Specifications: |
| 9 | | * Geno data is released to researcher 'per study' (i.e. an approved research request). |
| 10 | | * Per study a subset of the individuals is selected |
| 11 | | * The individual identifiers are 're-pseunomized' from 'WGA' to 'study' identifiers |
| 12 | | * Data is reformatted in various PLINK formats |
| | 7 | == Required inputs == |
| | 8 | |
| | 9 | === Variables per study: === |
| | 10 | * '''studyId''' - the id of study. E.g. 'OV039' |
| | 11 | * '''studyDir''' - the folder where all converted data should go. E.g. '../home/lifelines_OV039' |
| | 12 | * '''mappingFile''' - describing what WGA ids are included and what their study ids are. E.g.: |
| | 13 | |
| | 14 | {{{ |
| | 15 | 1 LL_WGA0001 1 12345 |
| | 16 | 1 LL_WGA0002 1 09876 |
| | 17 | 1 LL_WGA0003 1 64542 |
| | 18 | ... |
| | 19 | }}} |
| | 20 | |
| | 21 | * So: Geno family ID's - TAB - Geno individual ID's - TAB - Study family psuedonyms TAB Study pseudonyms |
| | 22 | * Items are TAB-separated and it doesn't end with a newline |
| | 23 | |
| | 24 | === Constants over all studies: === |
| | 25 | * Beagle imputed dosage files per chromosome (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/*.dose |
| | 26 | * Beagle imputed genotype files per chromosome (ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/*.map and *.ped |
| | 27 | * Beagle batch and quality files: /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/BeagleBatches.txt and BeagleQualityScores.txt |
| 16 | | The IDS should be filtered (e.g. 5000) and recoded (psuedoids) for one study. |
| | 31 | Result of this procedure is that there will be a folder readable to a lifelines_OV039 user containing: |
| | 32 | * [studyId]_Chr[x].PED/MAP/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered) |
| | 33 | * [studyId]_Chr[x].BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered) |
| | 34 | * [studyId]_Chr[x].DOSE imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered) |
| | 35 | * [studyId]_imputation_batch.txt listing imputation batches |
| | 36 | * [studyId]_imputation_batch_quality.txt listing imputation quality per SNP x batch |
| 18 | | User expects files in PLINK format: |
| 19 | | * PED/MAP/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered) |
| 20 | | * BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered) |
| 21 | | * MAP/PED imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered) |
| 22 | | * batch.txt listing imputation batches |
| 23 | | * impution_quality.txt listing imputation quality per SNP x batch |
| | 38 | NB: All files should be prefixed with {{{studyId}}}. |
| 26 | | == Required inputs == |
| 27 | | |
| 28 | | The following are input for the conversion procedure: |
| 29 | | * Beagle imputed genotype files (fam/ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap |
| 30 | | * Beagle imputed dosage files (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage |
| 31 | | * '''per study''' mapping file for study to filter and re-pseudonomize identifiers |
| 32 | | |
| 33 | | Example mapping file: |
| 34 | | {{{ |
| 35 | | 1 LL_WGA0001 1 STUDYPSEUDO1 |
| 36 | | 1 LL_WGA0002 1 STUDYPSEUDO2 |
| 37 | | 1 LL_WGA0003 1 STUDYPSEUDO3 |
| 38 | | ... |
| 39 | | }}} |
| 40 | | |
| 41 | | * So: Geno family ID's - TAB - Geno individual ID's - TAB - Study family psuedonyms TAB Study pseudonyms |
| 42 | | * Items are TAB-separated and it doesn't end with a newline |
| 48 | | * In every STUDY<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view |
| 49 | | * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers) |
| | 48 | This is done in the generic layer: |
| | 49 | * In every [StudyID] schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view |
| | 50 | * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("WGA" IDs, the LL_WGA numbers) |
| 65 | | filter individuals (repeat per chr) |
| | 59 | === Step 2: generate conversion jobs === |
| | 60 | |
| | 61 | The conversion has been fully automated (see below for details). Therefore we generate all the jobs needed to convert. |
| | 62 | These jobs are produced in the 'studyDir/scripts'. |
| | 63 | |
| | 64 | Command: |
| | 65 | {{{ |
| | 66 | sh ../LL3/scripts/generateGenoJobs.sh \ |
| | 67 | --studyId OV039 \ |
| | 68 | --outputDir ../lifelines_OV039 \ |
| | 69 | --mappingFile ../mappingFile_OV039.txt |
| | 70 | }}} |
| | 71 | |
| | 72 | == Step 3: submit jobs == |
| | 73 | |
| | 74 | change directory to the scripts directory, inspect and submit: |
| | 75 | |
| | 76 | change directory: |
| | 77 | {{{ |
| | 78 | cd ../lifelines_OV039/scripts |
| | 79 | }}} |
| | 80 | |
| | 81 | list scripts (we expect jobs for each format and each chromosome): |
| | 82 | {{{ |
| | 83 | ls -lah |
| | 84 | }}} |
| | 85 | |
| | 86 | submit to cluster |
| | 87 | {{{ |
| | 88 | sh submit_jobs.sh |
| | 89 | }}} |
| | 90 | |
| | 91 | == Step 4: monitor progress, QC results == |
| | 92 | |
| | 93 | monitor progress using 'qstat' |
| | 94 | |
| | 95 | TBD how to QC. |
| | 96 | * must check that now WGA id is in the data |
| | 97 | * must check that all ids where in the set |
| | 98 | |
| | 99 | == Step 5: release == |
| | 100 | {{{ |
| | 101 | cd ../lifelines_OV039/ |
| | 102 | |
| | 103 | #give user permission to see the data |
| | 104 | chown lifelines_OV039:lifelines * |
| | 105 | }}} |
| | 106 | |
| | 107 | == Implementation details == |
| | 108 | |
| | 109 | The 'generateGenoJobs.sh' script implements the following steps: |
| | 110 | |
| | 111 | === Convert MAP/PED and generate BIM/BED/FAM=== |
| | 112 | |
| 74 | | plink --file testdata_chr1 --keep subset.txt --recode --out temp_chr1 |
| | 126 | /target/gpfs2/lifelines_rp/tools/plink-1.08/plink108 \ |
| | 127 | --file /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/output.$i \ |
| | 128 | --update-ids $updateIdsFile \ |
| | 129 | --out $studyDir/${studyId}_chr$i \ |
| | 130 | --keep $keepFile \ |
| | 131 | --recode |
| | 132 | |
| | 133 | #step 3: for i in {1..22} |
| | 134 | #convert to bim/fam/bed |
| | 135 | plink \ |
| | 136 | --file $studyDir/${studyId}_chr$i \ |
| | 137 | --make-bed |
| | 138 | |
| | 139 | #remove temp |
| | 140 | rm temp_chr1 |
| | 141 | |
| | 142 | === Convert dosage format === |
| | 143 | |
| | 144 | As PLINK cannot updateIds on dosage files we created it ourselves. The command: |
| | 145 | |
| | 146 | |
| | 147 | {{{ |
| | 148 | #step1: for i in {1..22} |
| | 149 | #--subsetFile is the mappingFile |
| | 150 | #--doseFile is the imputed dosages |
| | 151 | #--outFile is where the converted file should go |
| | 152 | |
| | 153 | python /target/gpfs2/lifelines_rp/releases/LL3/scripts/convertDose.py \ |
| | 154 | --subsetFile $mappingFile \ |
| | 155 | --doseFile /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/ImputedGenotypeDosageFormatPLINK-Chr$i.dose \ |
| | 156 | --outFile ${studyDir}/${studyId}_chr$i.dose |
| | 157 | |
| 77 | | update individuals ids (repeat per chr) |
| 78 | | {{{#!sh |
| 79 | | #--file [file] is input file |
| 80 | | #--keep [file] tells plink what individuals to update |
| 81 | | #(from txt file with OLD fam + ind id + NEW fam id + ind id) |
| 82 | | #--recode tells plink to write results (otherwise no results!!! argh!) |
| 83 | | # result: updatedids.map/ped |
| 84 | | |
| 85 | | plink --file temp_chr1 --update-ids subset.txt --recode --out study2_chr1 |
| 86 | | }}} |
| 87 | | |
| 88 | | |
| 89 | | #step 3: |
| 90 | | #convert to bed (repeat per chr) |
| 91 | | plink --file study2_chr1 --make-bed |
| 92 | | === Step 4: convert into dosage format === |
| 93 | | |
| 94 | | TODO! ask Joeri? |