| Version 62 (modified by , 13 years ago) (diff) |
|---|
SOP for converting LifeLines Geno Data
Table of Contents
How to pseudonomize and reformat imputed genotype (GWAS) data per study. This SOP applies to LL3.
Variables per study:
- studyId - the id of study. E.g. 'OV039'
- studyDir - the folder where all converted data should go. E.g. '../home/lifelines_OV039'
- mappingFile - describing what WGA ids are included and what their study ids are. E.g.:
1 LL_WGA0001 1 12345 1 LL_WGA0002 1 09876 1 LL_WGA0003 1 64542 ...
- So: Geno family ID's - TAB - Geno individual ID's - TAB - Study family psuedonyms TAB Study pseudonyms
- Items are TAB-separated and it doesn't end with a newline
WARNING: NOW THE MAPPING FILE DOES NOT YET CONTAIN FAMILY PSEUDONYMS!
Constants over all studies (for LL3):
- Beagle imputed dosage files per chromosome (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/*.dose
- Beagle imputed genotype files per chromosome (ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/*.map and *.ped
- Beagle batch and quality files: /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/BeagleBatches.txt and BeagleQualityScores?.txt
- PCA analysis of the genotypes; used for QC (filtering of individuals genetically too different): /target/gpfs2/lifelines_rp/releases/LL3/UnimputedPedMap
Expected outputs
Result of this procedure is that there will be a folder readable to a lifelines_OV039 user containing:
- [studyId]_Chr.PED/MAP/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
- [studyId]_Chr.BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
- [studyId]_Chr.DOSE imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered)
- [studyId]_imputation_batch.txt listing imputation batches
- [studyId]_imputation_batch_quality.txt listing imputation quality per SNP x batch
NB: All files should be prefixed with studyId.
TODO monomorphic filtering
Procedure
Step 0: request a study user
Contact "Target beheer" (admins, e.g. Ger) and ask for an account.
E.g. username = lifelines_OV093 group = none home directory = /target/gpfs2/lifelines-genome/home/lifelines_OV093
Step 1: create subset_study<n>.txt file for study<n>
This is done in the generic layer:
- In every [StudyID] schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
- In this view, PA_IDs (LL IDs) are related to GNO_IDs ("WGA" IDs, the LL_WGA numbers)
- Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
- scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3
Step 2: generate conversion jobs
The conversion has been fully automated (see below for details). First we generate all the jobs needed to convert.
Command:
sh /target/gpfs2/lifelines_rp/releases/LL3/scripts/generateGenoJobs.sh \ --studyId OV039 \ --studyDir /full/path/to/lifelines_OV039 \ --mappingFile /full/path/to/mappingFile_OV039.txt
NB use full path names
Step 3: submit jobs
change directory to the 'studyDir/scripts' directory, inspect jobs and submit:
change directory:
cd ../lifelines_OV039/scripts
list scripts (we expect jobs for 'plink' and 'dose' and each chromosome):
ls -lah
submit to cluster
sh submit_jobs.sh
monitor job completions (on an empty cluster this takes >2hr)
qstat -u <user>
Step 4: QC results
Steps
- check jobs/*.finished files for all jobs (should all exist)
- check error logs (should not be there)
- run the test
sh jobs/run_test.sh
Step 5: release
cd ../lifelines_OV039/ #remove temp files (!) rm temp* #copy the quality scores cp /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputeDosage/BeagleImputedQualityScores.txt OV039_imputation_batch_qualities.txt #give user permission to see the data chown lifelines_OV039:lifelines *
Implementation details
The 'generateGenoJobs.sh' script implements the following steps:
Convert MAP/PED and generate BIM/BED/FAM
#step1
#generate 'updateIdsFile' and 'keepFile' files in plink format from the mappingFile
#step2: for i in {1..22}
#--file [file] is input file in .map and .ped
#--keep [file] tells plink what individuals to keep (from mappingFile.txt file with fam + ind id)
#--recode tells plink to write results (otherwise no results!!! argh!)
#--out defines output prefix (here: filtered.*)
#--update-ids [file] tells prefix to update ids
#result: filtered.ped/map'
/target/gpfs2/lifelines_rp/tools/plink-1.08/plink108 \
--file /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/output.$i \
--update-ids $updateIdsFile \
--out $studyDir/${studyId}_chr$i \
--keep $keepFile \
--recode
#step 3: for i in {1..22}
#convert to bim/fam/bed
plink \
--file $studyDir/${studyId}_chr$i \
--make-bed
#step 4:
#remove temp
rm temp*
Convert dosage format
As PLINK cannot updateIds on dosage files we created it ourselves. The command:
#step1: for i in {1..22}
#--subsetFile is the mappingFile
#--doseFile is the imputed dosages
#--outFile is where the converted file should go
python /target/gpfs2/lifelines_rp/releases/LL3/scripts/convertDose.py \
--subsetFile $mappingFile \
--doseFile /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/ImputedGenotypeDosageFormatPLINK-Chr$i.dose \
--outFile ${studyDir}/${studyId}_chr$i.dose
Deprecated: Maintaining the source code of the tools
To work with the sourcecode:
- Checkout code from svn: http://www.molgenis.org/svn/standalone_tools/
- Find compiled jars at http://www.molgenis.org/svn/standalone_tools/jars/
- Read manuals for use: http://www.molgenis.org/svn/standalone_tools/manuals/
Additional scripts
Define the source folders (where data has been converted) and final destination (folder where data is served). Sometimes the name are different.
For example
sources=("lifelines_OV0xy" "lifelines_OV0xy")
targets=("lifelines_OV0ab" "lifelines_OV0cd")
Script to check if all folders are there
total=${#sources[*]}
for (( i=0; i<=$(( $total -1 )); i++ )) do
if [ -d /target/gpfs2/lifelines-genome/home/${targets[i]}/ ] && [ -d /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ ]; then
echo ${targets[i]}
else
echo "error: "${sources[i]}
fi
Batch copy and setup of the data
total=${#sources[*]}
for (( i=0; i<=$(( $total -1 )); i++ )) do
if [ -d /target/gpfs2/lifelines-genome/home/${targets[i]}/ ] && [ -d /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ ]; then
echo ${targets[i]}
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.bim
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.ped
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.map
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.bed
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.dose
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.log
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.fam
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*_imputation*
rsync -av --progress /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ \
/target/gpfs2/lifelines-genome/home/${targets[i]}/
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/temp*
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/imputed/test*
rm /target/gpfs2/lifelines-genome/home/${targets[i]}/jobs*
chown -R $user /target/gpfs2/lifelines-genome/home/${targets[i]}/*
chmod -R a-rw /target/gpfs2/lifelines-genome/home/${targets[i]}/*
chmod -R u+r /target/gpfs2/lifelines-genome/home/${targets[i]}/*
else
echo "error: "${sources[i]}
fi
done
Count files to see if indeed all completed
for (( i=0; i<=$(( $total -1 )); i++ )) do
if [ -d /target/gpfs2/lifelines-genome/home/${targets[i]}/ ] && [ -d /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ ]; then
echo "${sources[i]} -> ${targets[i]}"
echo "raw"
ls -lah /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/raw | wc -l
ls -lah /target/gpfs2/lifelines-genome/home/${targets[i]}/raw | wc -l
echo "imputed"
ls -lah /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/imputed | wc -l
ls -lah /target/gpfs2/lifelines-genome/home/${targets[i]}/imputed | wc -l
fi
done