= SOP for converting LifeLines Geno Data for one study =

[[TOC()]]

== Background ==

Researchers request access to LifeLines. If access is granted, they get a 'study'. In such a study data set it must be impossible to link any identifier back to the rest of LifeLines. Therefore, all 'LifeLines participant identifiers' must be replaced with 'study sample identifiers'. This SOP describes how to do that for genotypic data.

== Variables per study ==

 * '''studyId''' - the id of the study. E.g. 'OV039'
 * '''studyDir''' - the folder where all converted data should go. E.g. '../home/lifelines_OV039'
 * '''mappingFile''' - describes the mapping of 'LifeLines GWAS IDs' to 'study sample IDs'. E.g.:
{{{
LL_WGA0001	12345
LL_WGA0002	09876
LL_WGA0003	64542
...
}}}
 * So: LifeLines GWAS IDs - TAB - study sample IDs
 * Items are TAB-separated and the file does not end with a newline

== Constants over all studies (for LL3) ==

 * Beagle imputed dosage files per chromosome (dose): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/*.dose
 * Beagle imputed genotype files per chromosome (ped/map): /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/*.map and *.ped
 * Beagle batch and quality files: /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/BeagleBatches.txt and BeagleQualityScores.txt
 * PCA analysis of the genotypes; used for QC (filtering of individuals that are genetically too different): /target/gpfs2/lifelines_rp/releases/LL3/UnimputedPedMap

== Expected outputs ==

The result of this procedure is a folder, readable to a lifelines_OV039 user, containing:
 * [studyId]_Chr[x].PED/MAP/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
 * [studyId]_Chr[x].BIM/BED/FAM imputed genotype files (split per chromosome, with missing value phenotype, monomorphic filtered)
 * [studyId]_Chr[x].DOSE imputed dosage files (split per chromosome, with missing value phenotype, monomorphic filtered)
 * [studyId]_imputation_batch.txt
   listing imputation batches
 * [studyId]_imputation_batch_quality.txt listing imputation quality per SNP x batch

NB: All files should be prefixed with {{{studyId}}}.

> TODO monomorphic filtering

== Procedure ==

=== Step 0: request a study user ===

Contact "Target beheer" (the admins, e.g. Ger) and ask for an account. E.g.:
{{{
username = lifelines_OV093
group = none
home directory = /target/gpfs2/lifelines-genome/home/lifelines_OV093
}}}

=== Step 1: GCC gets a mapping file ===

This is done by a member of TCC:
 * In every [StudyID] schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
 * In this view, PA_IDs (LL IDs) are related to GNO_IDs ("WGA" IDs, the LL_WGA numbers)
 * Export this view (tab-separated, no enclosures, no headers) to subset_study.txt
 * scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3

The result is the mappingFile as described above.

=== Step 2: generate conversion jobs ===

The conversion has been fully automated (see below for details). First we generate all the jobs needed to convert.

(1) Create the conversion script. We edit a file, e.g. convert_OV039.sh

Command:
{{{
sh /target/gpfs2/lifelines_rp/releases/LL3/scripts/generateGenoJobs.sh \
	--studyId OV039 \
	--studyDir /full/path/to/lifelines_OV039 \
	--mappingFile /full/path/to/mappingFile_OV039.txt
}}}
NB: use full path names.

(2) Run the script:
{{{
sh convert_OV039.sh
}}}

=== Step 3: submit jobs ===

Change directory to the 'studyDir/jobs' directory, inspect the jobs and submit them.

Change directory:
{{{
cd ../lifelines_OV039/jobs
}}}
List scripts (we expect jobs for 'plink' and 'dose' for each chromosome):
{{{
ls -lah
}}}
Submit to the cluster:
{{{
sh submit_all.sh
}}}
Monitor job completion (on an empty cluster this takes >2 hrs):
{{{
qstat -u $USER
}}}

=== Step 4: QC results ===

Steps:
 * go into the study folder, e.g.
   lifelines_OV10_0026
 * go into the jobs folder
 * check the *.finished files for all jobs (they should all exist)
 * check the error logs (they should be empty):
{{{
more *.err
}}}
 * run the test:
{{{
sh jobs/run_test.sh
}}}
 * check the *.missing file in the root folder:
{{{
cd ..
more *.missing
}}}
If you see values here, please check with the research office (Salome Scholtens).

=== Step 5: release ===

Clean up:
{{{
cd ../lifelines_OV039/
#remove temp files (!)
rm temp*
#remove *.missing files (!)
rm *.missing
#give read access to the 'lifelines' group
chgrp -R lifelines *
}}}
Change to the user for the study, e.g. lifelines_OV039:
{{{
#copy data to the home directory (assumes you are in home now)
cp -r /target/gpfs2/lifelines_rp/releases/LL3/lifelines_OV039 .
#reduce read access to user only
chmod -R go-r *
}}}

== Implementation details ==

The 'generateGenoJobs.sh' script implements the following steps:

=== Convert MAP/PED and generate BIM/BED/FAM ===

{{{#!sh
#step 1:
#generate 'updateIdsFile' and 'keepFile' files in plink format from the mappingFile

#step 2:
#--file [file] is the input file in .map and .ped format
#--keep [file] tells plink which individuals to keep (from the mappingFile, with fam + ind id)
#--recode tells plink to write results (otherwise no results!!! argh!)
#--out defines the output prefix (here: filtered.*)
#--update-ids [file] tells plink to update the ids
#result: filtered.ped/map
for i in {1..22}
do
  /target/gpfs2/lifelines_rp/tools/plink-1.08/plink108 \
    --file /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedPedAndMap/output.$i \
    --update-ids $updateIdsFile \
    --out $studyDir/${studyId}_chr$i \
    --keep $keepFile \
    --recode
done

#step 3: convert to bim/bed/fam
for i in {1..22}
do
  plink \
    --file $studyDir/${studyId}_chr$i \
    --make-bed \
    --out $studyDir/${studyId}_chr$i
done

#step 4:
#remove temp files
rm temp*
}}}

=== Convert dosage format ===

As PLINK cannot update IDs on dosage files, we created a tool ourselves.
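For clarity, the ID-mapping step on dosage data can be sketched as follows. This is a minimal hypothetical illustration, '''not''' the actual convertDose.py; it assumes the .dose file starts with a header line "SNP A1 A2" followed by one column (and one dosage value per row) per individual, and that the mapping is the two-column TAB-separated format described under 'Variables per study':

```python
# Hypothetical sketch of the dosage ID-mapping step (NOT the actual convertDose.py).
# Assumptions: header line "SNP A1 A2 <id1> <id2> ...", one dosage column per
# individual; mapping lines are "LL_WGA_id<TAB>study_sample_id".

def convert_dose(mapping_lines, dose_lines):
    """Keep only individuals present in the mapping and rename their IDs."""
    mapping = dict(line.split("\t") for line in mapping_lines)

    header = dose_lines[0].split()
    # first three columns are SNP, A1, A2; the remaining columns are individual IDs
    keep = list(range(3)) + [
        i for i, col in enumerate(header) if i >= 3 and col in mapping
    ]
    out = [" ".join(mapping.get(header[i], header[i]) for i in keep)]
    for line in dose_lines[1:]:
        fields = line.split()
        out.append(" ".join(fields[i] for i in keep))
    return out
```

Individuals not present in the mappingFile are dropped, which is what makes the released data unlinkable to the rest of LifeLines.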
The command:
{{{
#step 1:
#--subsetFile is the mappingFile
#--doseFile is the imputed dosages
#--outFile is where the converted file should go
for i in {1..22}
do
  python /target/gpfs2/lifelines_rp/releases/LL3/scripts/convertDose.py \
    --subsetFile $mappingFile \
    --doseFile /target/gpfs2/lifelines_rp/releases/LL3/BeagleImputedDosage/ImputedGenotypeDosageFormatPLINK-Chr$i.dose \
    --outFile ${studyDir}/${studyId}_chr$i.dose
done
}}}

== Deprecated: Maintaining the source code of the tools ==

To work with the source code:
 1. Check out the code from svn: http://www.molgenis.org/svn/standalone_tools/
 2. Find the compiled jars at http://www.molgenis.org/svn/standalone_tools/jars/
 3. Read the manuals for use: http://www.molgenis.org/svn/standalone_tools/manuals/

= Additional scripts =

Define the source folders (where data has been converted) and the final destinations (folders from which data is served). Sometimes the names are different. For example:
{{{
sources=("lifelines_OV0xy" "lifelines_OV0xy")
targets=("lifelines_OV0ab" "lifelines_OV0cd")
}}}

Script to check whether all folders are there:
{{{
total=${#sources[*]}
for (( i=0; i<=$(( $total -1 )); i++ ))
do
  if [ -d /target/gpfs2/lifelines-genome/home/${targets[i]}/ ] && [ -d /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ ]; then
    echo ${targets[i]}
  else
    echo "error: "${sources[i]}
  fi
done
}}}

Batch copy and setup of the data:
{{{
total=${#sources[*]}
for (( i=0; i<=$(( $total -1 )); i++ ))
do
  if [ -d /target/gpfs2/lifelines-genome/home/${targets[i]}/ ] && [ -d /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ ]; then
    echo ${targets[i]}
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.bim
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.ped
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.map
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.bed
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.dose
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.log
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*.fam
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/*_imputation*
    rsync -av --progress /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ \
      /target/gpfs2/lifelines-genome/home/${targets[i]}/
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/temp*
    rm /target/gpfs2/lifelines-genome/home/${targets[i]}/imputed/test*
    rm -r /target/gpfs2/lifelines-genome/home/${targets[i]}/jobs*
    chown -R $user /target/gpfs2/lifelines-genome/home/${targets[i]}/*
    chmod -R a-rw /target/gpfs2/lifelines-genome/home/${targets[i]}/*
    chmod -R u+r /target/gpfs2/lifelines-genome/home/${targets[i]}/*
  else
    echo "error: "${sources[i]}
  fi
done
}}}

Count files to check whether everything indeed completed:
{{{
for (( i=0; i<=$(( $total -1 )); i++ ))
do
  if [ -d /target/gpfs2/lifelines-genome/home/${targets[i]}/ ] && [ -d /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/ ]; then
    echo "${sources[i]} -> ${targets[i]}"
    echo "raw"
    ls -lah /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/raw | wc -l
    ls -lah /target/gpfs2/lifelines-genome/home/${targets[i]}/raw | wc -l
    echo "imputed"
    ls -lah /target/gpfs2/lifelines_rp/releases/LL3/${sources[i]}/imputed | wc -l
    ls -lah /target/gpfs2/lifelines-genome/home/${targets[i]}/imputed | wc -l
  fi
done
}}}
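A malformed mappingFile (the wrong separator, duplicates, a trailing newline) can silently break the conversion, so a small sanity check before Step 2 can save a rerun. This is a minimal hypothetical helper, not part of the SOP tooling; the expected format (two TAB-separated columns, no header, no trailing newline) is taken from 'Variables per study' above:

```python
# Minimal sanity check for a mappingFile (hypothetical helper, not part of the
# SOP tooling). Expected format: one "LL_WGA_id<TAB>study_sample_id" pair per
# line, TAB-separated, no header, no trailing newline.

def check_mapping(text):
    """Return a list of problems found in the mappingFile content; empty means OK."""
    problems = []
    if text.endswith("\n"):
        problems.append("file ends with a newline")
    seen_gwas, seen_study = set(), set()
    for n, line in enumerate(text.rstrip("\n").split("\n"), start=1):
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append("line %d: expected 2 TAB-separated fields" % n)
            continue
        gwas_id, study_id = fields
        if not gwas_id.startswith("LL_WGA"):
            problems.append("line %d: unexpected GWAS id %r" % (n, gwas_id))
        if gwas_id in seen_gwas or study_id in seen_study:
            problems.append("line %d: duplicate id" % n)
        seen_gwas.add(gwas_id)
        seen_study.add(study_id)
    return problems
```

Run it on the exported subset_study.txt before generating jobs; an empty result means the file matches the format this SOP expects.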