Version 21 (modified by Morris Swertz, 13 years ago)

SOP for converting LifeLines Geno Data

This SOP applies to LL3.

TODO: make molgenis 'compute' pipeline for this :-)

Specifications:

  • Geno data is released to researchers 'per study' (i.e. per approved research request).
  • Per study, a subset of the individuals is selected.
  • The individual identifiers are 're-pseudonymized' from 'marcel identifiers' to 'study identifiers' (so data cannot be matched between studies).
  • Data is reformatted in various PLINK formats
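
The subset and re-pseudonymization idea can be illustrated with a small shell sketch. This is not the converter used in this SOP (Step 2 uses TriToPlinkLifeLines.jar for that); it only shows the principle on a PLINK TFAM/FAM-style file. The function name `repseudonymize` is hypothetical, and the sketch assumes the genotype ID sits in the second column (the individual ID column) and that the mapping file is not empty. Family IDs in the first column are left untouched here.

```shell
# repseudonymize MAPPING FAMFILE: hypothetical helper, for illustration only.
# Prints the rows of FAMFILE whose individual ID (column 2) occurs in
# column 1 of MAPPING, with that ID replaced by the study pseudonym
# (column 2 of MAPPING). Rows of unselected individuals are dropped.
repseudonymize() {
    awk '
        # First file (mapping): remember pseudonym per genotype ID.
        NR == FNR { pseudo[$1] = $2; next }
        # Second file: keep only selected individuals and rename them.
        $2 in pseudo { $2 = pseudo[$2]; print }
    ' "$1" "$2"
}
```

Called as `repseudonymize subset.txt input.tfam > output.tfam`, where both file names are placeholders.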

Expected outputs

User expects files in PLINK format:

  • TPED/TFAM genotype files (chosen for internal use as easier to produce)
  • BIM/BED/FAM genotype files (with missing value phenotype, monomorphic filtered)
  • The same, but split per chromosome
  • MAP/PED dosage files (with missing value phenotype, monomorphic filtered)
  • The same, but split per chromosome

Required inputs

The following are input for the conversion procedure:

  • TriTyper imputed data files: /target/gpfs2/lifelines_rp/releases/LL3/
  • mapping file for the study, used to select individuals and re-pseudonymize their identifiers

Example mapping file:

LL_WGA0001   STUDYPSEUDO1   0
LL_WGA0002   STUDYPSEUDO2   0
LL_WGA0003   STUDYPSEUDO3   0
...
  • So: Geno individual IDs - TAB - Study pseudonyms - TAB - Phenotypes (can be all 0s, as the TFAM will be generated later by the user)
  • Items are TAB-separated and the file does not end with a newline
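
The format rules above can be checked before starting the conversion. The following is a minimal sketch, not part of the SOP tooling: the function name `check_mapping` is hypothetical, and the uniqueness check on the first two columns is an assumption (the SOP does not state it, but duplicate IDs or pseudonyms would make the mapping ambiguous).

```shell
# check_mapping FILE: hypothetical helper, not part of the SOP tooling.
# Returns 0 if FILE looks like a valid mapping file, 1 otherwise.
check_mapping() {
    file="$1"

    # Every line must have exactly three non-empty TAB-separated fields.
    awk -F '\t' 'NF != 3 || $1 == "" || $2 == "" || $3 == "" { bad = 1 }
                 END { exit bad + 0 }' "$file" || {
        echo "ERROR: expected 3 TAB-separated fields on every line" >&2
        return 1
    }

    # Assumption: geno IDs (column 1) and pseudonyms (column 2) are unique.
    for col in 1 2; do
        dup=$(cut -f "$col" "$file" | sort | uniq -d)
        if [ -n "$dup" ]; then
            echo "ERROR: duplicate value(s) in column $col: $dup" >&2
            return 1
        fi
    done

    # The file must not end with a newline: command substitution strips a
    # trailing newline, so an empty result means the last byte was one.
    if [ -s "$file" ] && [ -z "$(tail -c 1 "$file")" ]; then
        echo "ERROR: file ends with a newline" >&2
        return 1
    fi
    return 0
}
```

Called as `check_mapping mapping.txt && echo OK`, where the file name is a placeholder.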

Procedure

Step 1: create subset_study<n>.txt file for study<n>

  • In every MOLGENIS<n> schema for a study that has geno data, there is a VW_DICT_GENO_PSEUDONYMS view
  • In this view, PA_IDs (LL IDs) are related to GNO_IDs ("Marcel" IDs, the LL_WGA numbers)
  • Export this view (tab separated, no enclosures, no headers) to subset_study<n>.txt
  • scp to cluster.gcc.rug.nl:/target/gpfs2/lifelines_rp/releases/LL3

Step 2: convert into study<n>.tped format

Estimated runtime: 4 hours (4 GB / 2 CPUs @ cluster.gcc.rug.nl)

cd to directory:

cd /target/gpfs2/lifelines_rp/releases/LL3

reformat mapping file:

./formatsubsetfile.sh study<n>.txt

run converter on TriTyper data and mapping file:

/target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/java -jar TriToPlinkLifeLines.jar P BeagleImputedTriTyper/ study<n> subset_study<n>.txt

Note:

  • The converter from TriTyper to PLINK resides in /target/gpfs2/lifelines_rp/releases/LL3
  • The correct Java version resides in /target/gpfs2/lifelines_rp/tools/jdk1.6.0_22/bin/

Step 3: convert into binary plink format

Convert .tped into study<n>.bed, .bim, and .fam files:

plink --tfile study<n> --make-bed --out study<n> 

Split study<n>.bed, .bim, .fam per chromosome:

this script is untested, awaiting account

#create variable holding study name (replace <n> with the study number)
study="study<n>"

#get all chromosomes out of .bim file
chrs=`awk '{print $1}' ${study}.bim | sort -nur`
echo "Chromosome in Map File: ${chrs}" | tr "\n" " "
echo ""

#use to split/convert
for chr in $chrs; do
        echo "Processing chromosome $chr"
        plink --bfile $study --chr $chr --make-bed --out ${study}${chr}
done

NB: If this takes long, we should run the per-chromosome conversions as cluster jobs!
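
One possible way to prepare such cluster jobs is to write one small script per chromosome and submit each separately. The sketch below is untested on the cluster and is not part of the SOP tooling. The function name `make_chr_jobs`, the script names and the `_chr` output prefix are all assumptions; the prefix was chosen so that, for example, study 1 with chromosome 2 does not read as "study12". It only generates the scripts and does not run plink or submit anything.

```shell
# make_chr_jobs STUDY: hypothetical helper, not part of the SOP tooling.
# Reads STUDY.bim and writes one script per chromosome, named
# STUDY_chr<chr>.sh, each containing a single plink call.
# Prints the name of every script it writes.
make_chr_jobs() {
    study="$1"
    # Column 1 of a .bim file holds the chromosome code; keep each
    # code once, in order of first appearance.
    for chr in $(awk '!seen[$1]++ { print $1 }' "${study}.bim"); do
        job="${study}_chr${chr}.sh"
        {
            echo '#!/bin/sh'
            echo "plink --bfile ${study} --chr ${chr} --make-bed --out ${study}_chr${chr}"
        } > "$job"
        chmod +x "$job"
        echo "$job"
    done
}
```

Each generated script can then be handed to the scheduler's submit command; this SOP does not say which scheduler cluster.gcc.rug.nl uses.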

Step 4: convert into dosage format

MISSING! ask Joeri?

Step 5: copy all study<n> files to the lifelines0<n> folder

cp study<n>* ../../lifelines0<n>

  • May take some time!

Overview

http://i.imgur.com/nLT2e.png

A schematic overview of the export procedures described above.