Version 1 (modified by Wil Bruins, 14 years ago)

--

CLCbio demo 27 Jan 2011

On Thursday 27 Jan, Pieter demonstrated the protocol he uses to analyse raw sequencing data (alignment, SNP calling and annotation). He used Joke's MMIHS sample.

MMIHS (megacystis microcolon intestinal hypoperistalsis syndrome) is a severe bowel syndrome that leads to the death of patients at a very young age. This patient was a foetus aborted before 24 weeks of pregnancy (the syndrome was detected at the 20-week ultrasound), from consanguineous parents (closely related, cousins).

  1. Where would we expect causal variants?
    • Because the parents are consanguineous, we expect the causal variant to be homozygous. Plotting the B allele frequency from a SNP array reveals stretches in which all SNPs are homozygous. These stretches are the regions of interest.
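
The search for homozygous stretches described in step 1 can be sketched in a few lines. This is only an illustration, not the method used in the demo: the function name `homozygous_runs`, the tolerance `tol` and the minimum run length `min_snps` are assumptions chosen for the example, and the input is assumed to be the SNPs of one chromosome sorted by position.

```python
def homozygous_runs(positions, bafs, tol=0.1, min_snps=50):
    """Return (start, end, n_snps) for runs of consecutive homozygous SNPs.

    A SNP counts as homozygous when its B allele frequency is within
    `tol` of 0 or 1 (both thresholds are illustrative assumptions).
    """
    runs = []
    run_start = None   # position of the first SNP in the current run
    run_len = 0        # number of SNPs in the current run
    last_pos = None    # position of the last homozygous SNP seen
    for pos, baf in zip(positions, bafs):
        is_hom = baf <= tol or baf >= 1.0 - tol
        if is_hom:
            if run_start is None:
                run_start = pos
                run_len = 0
            run_len += 1
            last_pos = pos
        else:
            # A heterozygous SNP (BAF near 0.5) ends the current run.
            if run_start is not None and run_len >= min_snps:
                runs.append((run_start, last_pos, run_len))
            run_start = None
    if run_start is not None and run_len >= min_snps:
        runs.append((run_start, last_pos, run_len))
    return runs
```

Real array data is noisy, so a practical version would probably tolerate an occasional heterozygous call inside a run; that is left out here to keep the sketch short.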
  2. Sequencing.
    • in this particular dataset (110114 lane 5) the base qualities look odd at the 3' end; the FastQC report should be checked
    • read trimming using CLC:
      • quality limit 0.03, which is almost equivalent to Q20 (the CLC manual explains this)
      • trim first base
      • trim stretches of 2 or more ambiguous bases (Ns)
      • CLC discards reads that are shorter than 50 bases after this trimming
      • taken together, this seems similar to using BWA's -q 20
      • these trimming settings should also be used in the control sample (see below)
    • mapping:
      • can be relatively strict because of the trimming
      • mapping is done per chromosome, so keep in mind that a read may map here (for example to a pseudogene) when it should have been mapped to another chromosome altogether
      • the settings mismatch cost 2, insertion cost 3, deletion cost 3 are similar to BWA allowing 5 mismatches
      • length fraction 0.9 with similarity 0.95 means that at least 90% of a read's length must align to the reference sequence with at least 95% similarity
    • control sample: should be unrelated, meaning not family and not the same or a similar disease (although an HNPCC control is fine for MMIHS), but should be on the same run and prepared with the same capture method
    • terminology: the library is a tube with DNA ready for sequencing: 1 sample or pool, fragmented, size selected (gel), adapter ligated
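
Two of the settings in step 2 can be made concrete with a small sketch. The first part converts between Phred scores and error probabilities: Q20 corresponds to an error probability of 0.01, while 0.03 corresponds to roughly Q15 per base. The trimming function follows the running-sum procedure as I understand it from the CLC manual (sum `limit - p_error` per base, reset to zero when negative, keep the region from the last zero up to the highest sum); because the limit enters a running sum rather than acting as a per-base cutoff, the comparison with Q20 is approximate. The second part checks the length fraction and similarity thresholds. All function names are hypothetical, and the N-stretch trimming, first-base trimming and minimum read length are not modelled.

```python
import math


def phred_to_error(q):
    """Phred quality score -> probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Error probability -> Phred quality score."""
    return -10 * math.log10(p)


def trim_by_quality(quals, limit=0.03):
    """Return the (start, end) slice of the read to keep.

    Assumed procedure (my reading of the CLC manual, not verified
    against the software): keep the region that starts after the last
    zero of the running sum and ends at its highest value.
    Returns (0, 0) when the sum never rises above zero.
    """
    running = 0.0
    best = 0.0
    zero_at = 0      # index just after the last base where the sum was zero
    keep = (0, 0)
    for i, q in enumerate(quals):
        running += limit - phred_to_error(q)
        if running <= 0:
            running = 0.0
            zero_at = i + 1
        elif running > best:
            best = running
            keep = (zero_at, i + 1)
    return keep


def passes_mapping_filter(read_len, aligned_len, matches,
                          length_fraction=0.9, similarity=0.95):
    """True if enough of the read aligns and the aligned part is similar enough."""
    if aligned_len == 0:
        return False
    long_enough = aligned_len >= length_fraction * read_len
    similar_enough = matches >= similarity * aligned_len
    return long_enough and similar_enough
```

For a read with qualities `[2, 2, 30, 30, 30, 30, 2, 2]`, the trimming sketch keeps only the four high-quality bases in the middle.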
  3. SNP calling.
    • CLC can detect indels of up to about 5 bases (DIP detection)
    • quality settings: window length 11, max gaps/mismatches 2, min average quality of surrounding bases 15, min quality of central base 20
    • significance settings: min coverage 10 and minimum allele frequency 60%. This is low for the expected homozygous variants (which should differ from the reference in 100% of the reads); for the control, 50% can be used.
    • ploidy: the maximum number of expected variations is 2 (diploid)
    • CLC uses an annotated reference: the SNPs will have gene names when applicable
    • how reliable are SNPs found only on forward-mapped or only on reverse-mapped reads? Pieter ignores this for now
    • delete SNPs that are present in both sample and control (Excel)
    • delete SNPs that lack a gene name (whole exome)
    • highlight SNPs in the previously detected homozygous regions
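
The filtering in step 3 was done in CLC and Excel; the same logic can be sketched in code for illustration. The record type `Snp`, the function `filter_snps` and the way homozygous regions are passed in as `(chrom, start, end)` tuples are all assumptions made for this example, not part of the CLC export format. Only the thresholds (coverage 10, frequency 60%) come from the notes above.

```python
from typing import NamedTuple, Optional


class Snp(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    coverage: int
    freq: float            # variant allele frequency in percent
    gene: Optional[str]    # None when the position has no gene annotation


def filter_snps(sample, control, hom_regions, min_cov=10, min_freq=60.0):
    """Return a list of (snp, in_homozygous_region) for candidate SNPs."""
    # A SNP is treated as shared when chromosome, position and variant
    # base are all identical in the control.
    control_keys = {(s.chrom, s.pos, s.alt) for s in control}
    kept = []
    for s in sample:
        if s.coverage < min_cov or s.freq < min_freq:
            continue                      # fails the significance settings
        if (s.chrom, s.pos, s.alt) in control_keys:
            continue                      # also present in the control
        if not s.gene:
            continue                      # no gene name
        in_region = any(
            chrom == s.chrom and start <= s.pos <= end
            for chrom, start, end in hom_regions
        )
        kept.append((s, in_region))       # flag instead of dropping
    return kept
```

SNPs outside the homozygous regions are flagged rather than removed, which mirrors the "highlight" step in the list.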
  4. SeattleSeq
    • the input should be in the format chr<tab>pos<tab>ref base<tab>cons base
    • remove known variants and other variants you're not interested in
  5. Return to CLC to check the remaining SNPs
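
The tab-separated input format mentioned in step 4 (chromosome, position, reference base, consensus base) can be produced with a one-line helper. This is a minimal sketch: the function name `to_seattleseq_lines` is hypothetical, the example values are made up, and whether the chromosome should be written as `1` or `chr1` is not specified in these notes.

```python
def to_seattleseq_lines(snps):
    """Format (chrom, pos, ref_base, cons_base) tuples as tab-separated lines."""
    return [
        "\t".join((chrom, str(pos), ref, cons))
        for chrom, pos, ref, cons in snps
    ]


# Made-up example values, for illustration only.
for line in to_seattleseq_lines([("1", 12345, "A", "G"), ("7", 67890, "C", "T")]):
    print(line)
```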