wiki:SopConvertLifeLinesGenoData

Version 40 (modified by Morris Swertz, 12 years ago) (diff)

--

SOP for converting LifeLines Geno Data

How to pseudonomize and reformat imputed genotype (GWAS) data per study. This SOP applies to LL3.

Variables per study:

  • studyId - the id of study. E.g. 'OV039'
  • studyDir - the folder where all converted data should go. E.g. '../home/lifelines_OV039'
  • mappingFile - describing what WGA ids are included and what their study ids are. E.g.:
1   LL_WGA0001   1   12345
1   LL_WGA0002   1   09876
1   LL_WGA0003   1   64542
...
  • So: Geno family ID's - TAB - Geno individual ID's - TAB - Study family psuedonyms TAB Study pseudonyms
  • Items are TAB-separated and it doesn't end with a newline

Constants over all studies (for LL3):

  • Beagle imputed dosage files per chromosome (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/*.dose
  • Beagle imputed genotype files per chromosome (ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/*.map and *.ped
  • Beagle batch and quality files: /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/BeagleBatches.txt and BeagleQualityScores?.txt

Expected outputs

Result of this procedure is that there will be a folder readable to a lifelines_OV039 user containing:

  • [studyId]_Chr.PED/MAP/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
  • [studyId]_Chr.BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
  • [studyId]_Chr.DOSE imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered)
  • [studyId]_imputation_batch.txt listing imputation batches
  • [studyId]_imputation_batch_quality.txt listing imputation quality per SNP x batch

NB: All files should be prefixed with studyId.

TODO monomorphic filtering

Procedure

Step 0: request a study user

Step 1: create subset_study<n>.txt file for study<n>

This is done in the generic layer:

  • In every [StudyID] schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
  • In this view, PA_IDs (LL IDs) are related to GNO_IDs ("WGA" IDs, the LL_WGA numbers)
  • Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
  • scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3

reformat mapping file WHY IS THIS?:

./formatsubsetfile.sh study<n>.txt

Step 2: generate conversion jobs

The conversion has been fully automated (see below for details). First we generate all the jobs needed to convert.

Command:

sh ../LL3/scripts/generateGenoJobs.sh \
--studyId OV039 \
--studyDir ../lifelines_OV039 \
--mappingFile ../mappingFile_OV039.txt

Step 3: submit jobs

change directory to the 'studyDir/scripts' directory, inspect jobs and submit:

change directory:

cd ../lifelines_OV039/scripts

list scripts (we expect jobs for each format and each chromosome):

ls -lah

submit to cluster

sh submit_jobs.sh

Step 4: monitor progress, QC results

monitor progress using 'qstat'

TBD how to QC.

  • must check that now WGA id is in the data
  • must check that all ids where in the set

Step 5: release

cd ../lifelines_OV039/

#give user permission to see the data
chown lifelines_OV039:lifelines *

Implementation details

The 'generateGenoJobs.sh' script implements the following steps:

Convert MAP/PED and generate BIM/BED/FAM

#step1
#generate 'updateIdsFile' and 'keepFile' files in plink format from the mappingFile


#step2: for i in {1..22}
#--file [file] is input file in .map and .ped
#--keep [file] tells plink what individuals to keep (from mappingFile.txt file with fam + ind id)
#--recode tells plink to write results (otherwise no results!!! argh!)
#--out defines output prefix (here: filtered.*)
#--update-ids [file] tells prefix to update ids
#result: filtered.ped/map'

/target/gpfs2/lifelines_rp/tools/plink-1.08/plink108 \
--file /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/output.$i \
--update-ids $updateIdsFile \
--out $studyDir/${studyId}_chr$i \
--keep $keepFile \
--recode

#step 3:  for i in {1..22}
#convert to bim/fam/bed
plink \
--file $studyDir/${studyId}_chr$i \
--make-bed

#step 4: for i in {1..22}
#remove temp
rm temp_chr$i

Convert dosage format

As PLINK cannot updateIds on dosage files we created it ourselves. The command:

#step1: for i in {1..22}
#--subsetFile is the mappingFile
#--doseFile is the imputed dosages
#--outFile is where the converted file should go

python /target/gpfs2/lifelines_rp/releases/LL3/scripts/convertDose.py \
--subsetFile $mappingFile \
--doseFile /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/ImputedGenotypeDosageFormatPLINK-Chr$i.dose \
--outFile ${studyDir}/${studyId}_chr$i.dose

Step 5: copy all study*<n> files to the lifelines0<n> folder

cp study<n>* ../../lifelines0<n>

  • May take some time!

Maintaining the source code of the tools

To work with the sourcecode:

  1. Checkout code from svn: http://www.molgenis.org/svn/standalone_tools/
  2. Find compiled jars at http://www.molgenis.org/svn/standalone_tools/jars/
  3. Read manuals for use: http://www.molgenis.org/svn/standalone_tools/manuals/

Overview

http://i.imgur.com/nLT2e.png

A schematic overview of the export procedures described above.