Changes between Version 30 and Version 31 of SopConvertLifeLinesGenoData


Ignore:
Timestamp:
2012-04-21T12:12:25+02:00 (13 years ago)
Author:
Morris Swertz
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SopConvertLifeLinesGenoData

    v30 v31  
    22[[TOC()]]
    33
    4 How to pseudonomize geno data per study
    5 
     4How to pseudonomize and reformat imputed genotype (GWAS) data per study.
    65This SOP applies to LL3.
    76
    8 Specifications:
    9 * Geno data is released to researcher 'per study' (i.e. an approved research request).
    10 * Per study a subset of the individuals is selected
    11 * The individual identifiers are 're-pseunomized' from 'WGA' to 'study' identifiers
    12 * Data is reformatted in various PLINK formats
     7== Required inputs ==
     8
     9=== Variables per study: ===
     10* '''studyId''' - the id of study. E.g. 'OV039'
     11* '''studyDir''' - the folder where all converted data should go. E.g. '../home/lifelines_OV039'
     12* '''mappingFile''' - describing what WGA ids are included and what their study ids are. E.g.:
     13
     14{{{
     151   LL_WGA0001   1   12345
     161   LL_WGA0002   1   09876
     171   LL_WGA0003   1   64542
     18...
     19}}}
     20
     21  * So: Geno family ID's - TAB - Geno individual ID's - TAB - Study family psuedonyms TAB Study pseudonyms
     22  * Items are TAB-separated and it doesn't end with a newline
     23
     24=== Constants over all studies: ===
     25* Beagle imputed dosage files per chromosome (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/*.dose
     26* Beagle imputed genotype files per chromosome (ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/*.map and *.ped
     27* Beagle batch and quality files: /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/BeagleBatches.txt and BeagleQualityScores.txt
    1328
    1429== Expected outputs ==
    1530
    16 The IDS should be filtered (e.g. 5000) and recoded (psuedoids) for one study.
     31Result of this procedure is that there will be a folder readable to a lifelines_OV039 user containing:
     32* [studyId]_Chr[x].PED/MAP/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
     33* [studyId]_Chr[x].BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
     34* [studyId]_Chr[x].DOSE imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered)
     35* [studyId]_imputation_batch.txt listing imputation batches
     36* [studyId]_imputation_batch_quality.txt listing imputation quality per SNP x batch
    1737
    18 User expects files in PLINK format:
    19 * PED/MAP/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
    20 * BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
    21 * MAP/PED imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered)
    22 * batch.txt listing imputation batches
    23 * impution_quality.txt listing imputation quality per SNP x batch
     38NB: All files should be prefixed with {{{studyId}}}.
    2439
    2540> TODO monomorphic filtering
    26 == Required inputs ==
    27 
    28 The following are input for the conversion procedure:
    29 * Beagle imputed genotype files (fam/ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap
    30 * Beagle imputed dosage files (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage
    31 * '''per study''' mapping file for study to filter and re-pseudonomize identifiers
    32 
    33 Example mapping file:
    34 {{{
    35 1   LL_WGA0001   1   STUDYPSEUDO1
    36 1   LL_WGA0002   1   STUDYPSEUDO2
    37 1   LL_WGA0003   1   STUDYPSEUDO3
    38 ...
    39 }}}
    40 
    41  * So: Geno family ID's - TAB - Geno individual ID's - TAB - Study family psuedonyms TAB Study pseudonyms
    42  * Items are TAB-separated and it doesn't end with a newline
    4341
    4442== Procedure ==
    4543
     44=== Step 0: request a study user ===
     45
    4646=== Step 1: create subset_study<n>.txt file for study<n> ===
    4747
    48  * In every STUDY<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
    49  * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
     48This is done in the generic layer:
     49 * In every [StudyID] schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
     50 * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("WGA" IDs, the LL_WGA numbers)
    5051 * Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
    5152 * scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3
    52 
    53 === Step 2: convert into study<n>.tped format ===
    54 
    55 cd to directory:
    56 {{{#!sh
    57 cd /target/gpfs2/lifelines_rp/releases/LL3
    58 }}}
    5953
    6054reformat mapping file '''WHY IS THIS?''':
     
    6357}}}
    6458
    65 filter individuals (repeat per chr)
     59=== Step 2: generate conversion jobs ===
     60
     61The conversion has been fully automated (see below for details). Therefore we generate all the jobs needed to convert.
     62These jobs are produced in the 'studyDir/scripts'.
     63
     64Command:
     65{{{
     66sh ../LL3/scripts/generateGenoJobs.sh \
     67--studyId OV039 \
     68--outputDir ../lifelines_OV039 \
     69--mappingFile ../mappingFile_OV039.txt
     70}}}
     71 
     72== Step 3: submit jobs ==
     73
     74change directory to the scripts directory, inspect and submit:
     75
     76change directory:
     77{{{
     78cd ../lifelines_OV039/scripts
     79}}}
     80
     81list scripts (we expect jobs for each format and each chromosome):
     82{{{
     83ls -lah
     84}}}
     85
     86submit to cluster
     87{{{
     88sh submit_jobs.sh
     89}}}
     90
     91== Step 4: monitor progress,  QC results ==
     92
     93monitor progress using 'qstat'
     94
     95TBD how to QC.
     96* must check that now WGA id is in the data
     97* must check that all ids where in the set
     98
     99== Step 5: release ==
     100{{{
     101cd ../lifelines_OV039/
     102
     103#give user permission to see the data
     104chown lifelines_OV039:lifelines *
     105}}}
     106
     107== Implementation details ==
     108
     109The 'generateGenoJobs.sh' script implements the following steps:
     110
     111=== Convert MAP/PED and generate BIM/BED/FAM===
     112
    66113{{{#!sh
    67 #--file [file] is input file (expects .map and .ped)
    68 #--keep [file] tells plink what individuals to keep (from txt file with fam + ind id)
     114#step1
     115#generate 'updateIdsFile' and 'keepFile' files in plink format from the mappingFile
     116
     117
     118#step2: for i in {1..22}
     119#--file [file] is input file in .map and .ped
     120#--keep [file] tells plink what individuals to keep (from mappingFile.txt file with fam + ind id)
    69121#--recode tells plink to write results (otherwise no results!!! argh!)
    70122#--out defines output prefix (here: filtered.*)
     
    72124#result: filtered.ped/map'
    73125
    74 plink --file testdata_chr1 --keep subset.txt --recode --out temp_chr1
     126/target/gpfs2/lifelines_rp/tools/plink-1.08/plink108 \
     127--file /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/output.$i \
     128--update-ids $updateIdsFile \
     129--out $studyDir/${studyId}_chr$i \
     130--keep $keepFile \
     131--recode
     132
     133#step 3:  for i in {1..22}
     134#convert to bim/fam/bed
     135plink \
     136--file $studyDir/${studyId}_chr$i \
     137--make-bed
     138
     139#remove temp
     140rm temp_chr1
     141
     142=== Convert dosage format ===
     143
     144As PLINK cannot updateIds on dosage files we created it ourselves. The command:
     145
     146
     147{{{
     148#step1: for i in {1..22}
     149#--subsetFile is the mappingFile
     150#--doseFile is the imputed dosages
     151#--outFile is where the converted file should go
     152
     153python /target/gpfs2/lifelines_rp/releases/LL3/scripts/convertDose.py \
     154--subsetFile $mappingFile \
     155--doseFile /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/ImputedGenotypeDosageFormatPLINK-Chr$i.dose \
     156--outFile ${studyDir}/${studyId}_chr$i.dose
     157
    75158}}}
    76159
    77 update individuals ids (repeat per chr)
    78 {{{#!sh
    79 #--file [file] is input file
    80 #--keep [file] tells plink what individuals to update
    81 #(from txt file with OLD fam + ind id + NEW fam id + ind id)
    82 #--recode tells plink to write results (otherwise no results!!! argh!)
    83 # result: updatedids.map/ped
    84 
    85 plink --file temp_chr1 --update-ids subset.txt --recode --out study2_chr1
    86 }}}
    87 
    88 
    89 #step 3:
    90 #convert to bed (repeat per chr)
    91 plink --file study2_chr1 --make-bed
    92 === Step 4: convert into dosage format ===
    93 
    94 TODO! ask Joeri?
    95160
    96161=== Step 5: copy all study*<n> files to the lifelines0<n> folder ===