Changes between Version 30 and Version 31 of SopConvertLifeLinesGenoData

Timestamp: 2012-04-21T12:12:25+02:00
Author: Morris Swertz
Comment: --

SopConvertLifeLinesGenoData

-                      v30
+                      v31
 [[TOC()]]
+How to pseudonymize geno data per study
+How to pseudonymize and reformat imputed genotype (GWAS) data per study.
 This SOP applies to LL3.
+Specifications:
+* Geno data is released to researcher 'per study' (i.e. an approved research request).
+* Per study, a subset of the individuals is selected.
+* The individual identifiers are 're-pseudonymized' from 'WGA' to 'study' identifiers.
+* Data is reformatted in various PLINK formats
+== Required inputs ==
+=== Variables per study: ===
+* '''studyId''' - the id of study. E.g. 'OV039'
+* '''studyDir''' - the folder where all converted data should go. E.g. '../home/lifelines_OV039'
+* '''mappingFile''' - describing what WGA ids are included and what their study ids are. E.g.:
+{{{
+   LL_WGA0001   1   12345
+   LL_WGA0002   1   09876
+   LL_WGA0003   1   64542
+...
+}}}
+  * So: Geno family IDs - TAB - Geno individual IDs - TAB - Study family pseudonyms - TAB - Study pseudonyms
+  * Items are TAB-separated and the file does not end with a newline
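The format rules above can be checked before any jobs are generated. The following is a minimal sketch, not part of the SOP tooling: the function name `check_mapping_text` and the default of four fields per line (taken from the column description above) are assumptions, and the field count should be adjusted if the actual mapping files differ.

```python
def check_mapping_text(text, expected_fields=4):
    """Return a list of problems found in mapping-file content (empty list = OK).

    Hypothetical helper: checks only the rules stated in the SOP
    (TAB-separated items, no newline at the end of the file).
    """
    problems = []
    lines = text.split("\n")
    if text.endswith("\n"):
        problems.append("file ends with a newline")
        lines = lines[:-1]  # drop the empty string after the final newline
    for number, line in enumerate(lines, start=1):
        fields = line.split("\t")
        if len(fields) != expected_fields:
            problems.append(
                f"line {number}: expected {expected_fields} TAB-separated "
                f"fields, found {len(fields)}"
            )
        elif any(f == "" or f != f.strip() for f in fields):
            problems.append(f"line {number}: empty or space-padded field")
    return problems
```

Typical use would be `check_mapping_text(open(mappingFile).read())`, stopping the procedure if the returned list is not empty.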
+=== Constants over all studies: ===
+* Beagle imputed dosage files per chromosome (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/*.dose
+* Beagle imputed genotype files per chromosome (ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/*.map and *.ped
+* Beagle batch and quality files: /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/BeagleBatches.txt and BeagleQualityScores.txt
 == Expected outputs ==
+The IDs should be filtered (e.g. to 5000 individuals) and recoded (pseudo IDs) for one study.
+Result of this procedure is that there will be a folder readable to a lifelines_OV039 user containing:
+* [studyId]_Chr[x].PED/MAP/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
+* [studyId]_Chr[x].BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
+* [studyId]_Chr[x].DOSE imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered)
+* [studyId]_imputation_batch.txt listing imputation batches
+* [studyId]_imputation_batch_quality.txt listing imputation quality per SNP x batch
+User expects files in PLINK format:
+* PED/MAP/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
+* BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
+* MAP/PED imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered)
+* batch.txt listing imputation batches
+* imputation_quality.txt listing imputation quality per SNP x batch
+NB: All files should be prefixed with {{{studyId}}}.
 > TODO monomorphic filtering
-== Required inputs ==
-The following are input for the conversion procedure:
-* Beagle imputed genotype files (fam/ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap
-* Beagle imputed dosage files (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage
-* '''per study''' mapping file for study to filter and re-pseudonomize identifiers
-Example mapping file:
-{{{
-   LL_WGA0001   1   STUDYPSEUDO1
-   LL_WGA0002   1   STUDYPSEUDO2
-   LL_WGA0003   1   STUDYPSEUDO3
-...
-}}}
- * So: Geno family ID's - TAB - Geno individual ID's - TAB - Study family psuedonyms TAB Study pseudonyms
- * Items are TAB-separated and it doesn't end with a newline
 == Procedure ==
+=== Step 0: request a study user ===
 === Step 1: create subset_study<n>.txt file for study<n> ===
+ * In every STUDY<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
+ * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
+This is done in the generic layer:
+ * In every [StudyID] schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
+ * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("WGA" IDs, the LL_WGA numbers)
  * Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
  * scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3
-=== Step 2: convert into study<n>.tped format ===
-cd to directory:
-{{{#!sh
-cd /target/gpfs2/lifelines_rp/releases/LL3
-}}}
 reformat mapping file '''WHY IS THIS?''':
 …
 }}}
+filter individuals (repeat per chr)
+=== Step 2: generate conversion jobs ===
+The conversion has been fully automated (see below for details). Therefore we generate all the jobs needed to convert.
+These jobs are produced in the 'studyDir/scripts'.
+Command:
+{{{
+sh ../LL3/scripts/generateGenoJobs.sh \
+--studyId OV039 \
+--outputDir ../lifelines_OV039 \
+--mappingFile ../mappingFile_OV039.txt
+}}}
+=== Step 3: submit jobs ===
+change directory to the scripts directory, inspect and submit:
+change directory:
+{{{
+cd ../lifelines_OV039/scripts
+}}}
+list scripts (we expect jobs for each format and each chromosome):
+{{{
+ls -lah
+}}}
+submit to cluster
+{{{
+sh submit_jobs.sh
+}}}
+=== Step 4: monitor progress, QC results ===
+Monitor progress using 'qstat'.
+TBD how to QC.
+* must check that no WGA id is in the data
+* must check that all ids were in the set
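Since the QC method is still to be decided, the two checks listed above could be sketched as follows for the PLINK .fam output (whitespace-separated, family ID in column 1, individual ID in column 2). The function name `qc_fam_ids` and the `LL_WGA` prefix test are assumptions for illustration; the prefix is taken from the example IDs in the mapping file.

```python
def qc_fam_ids(fam_lines, expected_study_ids, wga_prefix="LL_WGA"):
    """Check the lines of a PLINK .fam file against the expected study IDs.

    Hypothetical QC helper. Returns a dict with:
      leaked_wga_ids - (family, individual) pairs that still carry a WGA id
      missing        - expected study ids that are absent from the file
      unexpected     - individual ids in the file that were not expected
    """
    found = set()
    leaked = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        family_id, individual_id = fields[0], fields[1]
        if family_id.startswith(wga_prefix) or individual_id.startswith(wga_prefix):
            leaked.append((family_id, individual_id))
        found.add(individual_id)
    expected = set(expected_study_ids)
    return {
        "leaked_wga_ids": leaked,
        "missing": sorted(expected - found),
        "unexpected": sorted(found - expected),
    }
```

A release would only go ahead when all three lists are empty for every chromosome.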
+=== Step 5: release ===
+{{{
+cd ../lifelines_OV039/
+#give user permission to see the data
+chown lifelines_OV039:lifelines *
+}}}
+== Implementation details ==
+The 'generateGenoJobs.sh' script implements the following steps:
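As an illustration of the first of these steps, deriving PLINK's 'keepFile' and 'updateIdsFile' from the mappingFile could look like the sketch below. It is not the actual script: the function name and the `keep_uses_new_ids` flag are hypothetical. PLINK's --keep reads family ID and individual ID per line, and --update-ids reads old family ID, old individual ID, new family ID, new individual ID. Whether --keep is evaluated before or after --update-ids depends on the PLINK version, so the flag exists to switch between old and new IDs in the keep file; check the documentation of the version in use.

```python
def build_plink_id_files(mapping_rows, keep_uses_new_ids=False):
    """Build the contents of the --keep and --update-ids files.

    mapping_rows: iterable of (old_fid, old_iid, new_fid, new_iid) tuples,
    i.e. the four columns described for the mappingFile.
    Returns (keep_text, update_ids_text). Hypothetical helper.
    """
    keep_lines = []
    update_lines = []
    for old_fid, old_iid, new_fid, new_iid in mapping_rows:
        if keep_uses_new_ids:
            keep_lines.append(f"{new_fid}\t{new_iid}")
        else:
            keep_lines.append(f"{old_fid}\t{old_iid}")
        update_lines.append(f"{old_fid}\t{old_iid}\t{new_fid}\t{new_iid}")
    keep_text = "".join(line + "\n" for line in keep_lines)
    update_ids_text = "".join(line + "\n" for line in update_lines)
    return keep_text, update_ids_text
```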
+=== Convert MAP/PED and generate BIM/BED/FAM ===
 {{{#!sh
+#--file [file] is input file (expects .map and .ped)
+#--keep [file] tells plink what individuals to keep (from txt file with fam + ind id)
+#step1
+#generate 'updateIdsFile' and 'keepFile' files in plink format from the mappingFile
+#step2: for i in {1..22}
+#--file [file] is input file in .map and .ped
+#--keep [file] tells plink what individuals to keep (from mappingFile.txt file with fam + ind id)
 #--recode tells plink to write results (otherwise no results!!! argh!)
 #--out defines output prefix (here: filtered.*)
 …
 #result: filtered.ped/map
+plink --file testdata_chr1 --keep subset.txt --recode --out temp_chr1
+/target/gpfs2/lifelines_rp/tools/plink-1.08/plink108 \
+--file /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/output.$i \
+--update-ids $updateIdsFile \
+--out $studyDir/${studyId}_chr$i \
+--keep $keepFile \
+--recode
+#step 3:  for i in {1..22}
+#convert to bim/fam/bed
+plink \
+--file $studyDir/${studyId}_chr$i \
+--make-bed
+#remove temp
+rm temp_chr1
+=== Convert dosage format ===
+As PLINK cannot update IDs in dosage files, we wrote our own converter. The command:
+{{{
+#step1: for i in {1..22}
+#--subsetFile is the mappingFile
+#--doseFile is the imputed dosages
+#--outFile is where the converted file should go
+python /target/gpfs2/lifelines_rp/releases/LL3/scripts/convertDose.py \
+--subsetFile $mappingFile \
+--doseFile /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/ImputedGenotypeDosageFormatPLINK-Chr$i.dose \
+--outFile ${studyDir}/${studyId}_chr$i.dose
 }}}
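The internals of convertDose.py are not shown on this page. Purely as an illustration of the kind of filtering and renaming such a converter has to do, here is a minimal sketch under a loudly stated assumption: the dose file is whitespace-separated, with a header row `SNP A1 A2` followed by one column per individual (named by WGA id) and one dosage value per individual on each SNP row. The real file layout and the real script may differ; `convert_dose_lines` and `id_map` are hypothetical names.

```python
def convert_dose_lines(dose_lines, id_map, fixed_columns=3):
    """Yield dose-file lines restricted to, and renamed for, one study.

    ASSUMED layout (not confirmed by the SOP): header 'SNP A1 A2 <id> <id> ...',
    then one row per SNP with one dosage column per individual.
    id_map maps WGA ids to study pseudonyms; individuals not in id_map are dropped.
    """
    lines = iter(dose_lines)
    header_line = next(lines, None)
    if header_line is None:
        return  # empty input, nothing to convert
    header = header_line.split()
    # Column positions of the individuals selected for this study.
    keep = [i for i, name in enumerate(header)
            if i >= fixed_columns and name in id_map]
    yield " ".join(header[:fixed_columns] + [id_map[header[i]] for i in keep])
    for line in lines:
        fields = line.split()
        yield " ".join(fields[:fixed_columns] + [fields[i] for i in keep])
```

Processing the file line by line keeps memory use flat, which matters for per-chromosome dosage files covering many individuals.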
-update individuals ids (repeat per chr)
-{{{#!sh
-#--file [file] is input file
-#--keep [file] tells plink what individuals to update
-#(from txt file with OLD fam + ind id + NEW fam id + ind id)
-#--recode tells plink to write results (otherwise no results!!! argh!)
-# result: updatedids.map/ped
-plink --file temp_chr1 --update-ids subset.txt --recode --out study2_chr1
-}}}
-#step 3:
-#convert to bed (repeat per chr)
-plink --file study2_chr1 --make-bed
-=== Step 4: convert into dosage format ===
-TODO! ask Joeri?
 === Step 5: copy all study*<n> files to the lifelines0<n> folder ===