wiki:VCFAggregateScriptManual

Version 6 (modified by jvelde, 10 years ago)

--

Running the VCF aggregation and de-sampling procedure

1. Install VCFtools and Tabix

Download Tabix from http://sourceforge.net/projects/samtools/files/tabix/ and VCFtools from http://vcftools.sourceforge.net/ and install both. Tabix needs to be compiled first (run make in its source directory).

2. Add the location of VCFtools and Tabix to your PATH

For example, for the current shell session:

export PATH=/Volumes/Users/Software/vcftools_0.1.10/bin/:/Volumes/Users/Software/tabix-0.2.6/:${PATH}

For a more permanent setup, add the same line to a shell startup file such as ~/.bashrc.
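One way to make the setting permanent is sketched below. The function name add_to_path_permanently is hypothetical, and the sketch assumes a Bourne-style shell that reads the given startup file; it appends the export line only if an identical line is not already present.

```shell
# Append a PATH export line to a shell startup file unless it is already there.
# $1: startup file (for example ~/.bashrc)
# $2: colon-separated directories to put in front of the existing PATH
# (hypothetical helper, not part of VCFtools or Tabix)
add_to_path_permanently() {
    rc_file=$1
    dirs=$2
    line="export PATH=${dirs}:\${PATH}"
    touch "$rc_file"
    # -x: match the whole line, -F: treat the line as a fixed string
    if ! grep -qxF -- "$line" "$rc_file"; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}
```

With the example directories from this step, the call would be: add_to_path_permanently ~/.bashrc /Volumes/Users/Software/vcftools_0.1.10/bin/:/Volumes/Users/Software/tabix-0.2.6/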

3. Download and install Perl if you don't have it

Go to: https://www.perl.org/
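Once steps 1 to 3 are done, a quick check that the commands used later in this manual can be found on the PATH may save some debugging. The helper below is a minimal sketch; the name missing_tools is hypothetical.

```shell
# Print the name of every given tool that cannot be found on the current PATH.
# Prints one name per line, or nothing if all tools are available.
# (hypothetical helper, not part of VCFtools or Tabix)
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Example call with the commands used in this manual:
# missing_tools vcf-merge vcf-sort vcf-annotate bgzip tabix perl
```

An empty result means everything was found; any name that is printed still needs to be installed or added to the PATH.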

4. Download the script and put it in a folder of your choosing

GitHub link: https://github.com/molgenis/ngs-utils/blob/master/scripts/vcf-fill-gtc.pl

Raw version for wget: https://raw.githubusercontent.com/molgenis/ngs-utils/master/scripts/vcf-fill-gtc.pl

Aggregation procedure

1. Merge sample VCFs into one batch VCF

vcf-merge CAR_*/*.vcf.sorted.filtered.gz | bgzip -c > merged.vcf.gz
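Before merging, it can help to confirm which files the pattern CAR_*/*.vcf.sorted.filtered.gz actually matches, since a batch directory with no matches would give vcf-merge nothing useful to work on. The following is a sketch only; list_merge_inputs is a hypothetical helper name.

```shell
# Print the per-sample files that the merge pattern matches under a batch
# directory, and return a non-zero status if there are none.
# $1: batch directory (defaults to the current directory)
# (hypothetical helper, not part of VCFtools)
list_merge_inputs() {
    batch_dir=${1:-.}
    found=0
    for f in "$batch_dir"/CAR_*/*.vcf.sorted.filtered.gz; do
        [ -e "$f" ] || continue   # an unmatched pattern stays literal; skip it
        echo "$f"
        found=$((found + 1))
    done
    if [ "$found" -eq 0 ]; then
        echo "No *.vcf.sorted.filtered.gz files found under $batch_dir" >&2
        return 1
    fi
}
```

Piping the output through wc -l gives the number of samples that will go into the merge, which can be compared with the expected batch size.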

2. Create a summary VCF per batch

vcf-fill-gtc.pl -vcfi merged.vcf.gz -vcfo stripped.vcf -ss -fv PASS -si -ll INFO > stripped.vcf.log

The option -ss is crucial here: it removes all sample details.

Afterwards, be sure to inspect the log file for warnings!

more stripped.vcf.log
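For a long log, filtering may be quicker than paging through it. The sketch below assumes that problem lines contain the text "WARN" or "ERROR" (in any case); the exact log format of vcf-fill-gtc.pl is not documented here, so the pattern may need adjusting. The name show_log_problems is hypothetical.

```shell
# Print log lines that mention a warning or an error (case-insensitive).
# Assumption: such lines contain "WARN" or "ERROR"; adjust the pattern
# if the log uses different wording.
# Returns non-zero if no such line is found (standard grep behaviour).
show_log_problems() {
    grep -i -E 'warn|error' "$1"
}

# Example:
# show_log_problems stripped.vcf.log || echo "No warnings or errors found"
```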

Man page:

#
# Create a summary VCF per batch:
#  -ss       : remove sample details!
#  -fv PASS  : keep only high quality variant calls that pass all filters applied in NextGene.
#              Just to be sure: variants should already have been filtered on PASS only in a previous step,
#              so this should be redundant here...
#  -si       : remove all INFO subfields except for INFO:AN and INFO:AC.
#              INFO:AN and INFO:AC were automatically updated by vcf-merge,
#              but the others were not and may contain erroneous annotation
#              that cause vcf-validator to complain the created VCF is not valid.
#
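Because -ss is the option that de-samples the data, it may be worth verifying the result. The sketch below assumes that "removing sample details" means the output no longer has per-sample genotype columns, i.e. nothing after the FORMAT column in the #CHROM header line of an uncompressed VCF. The name count_sample_columns is hypothetical.

```shell
# Count the per-sample columns in an uncompressed VCF file.
# A VCF header line has 8 fixed columns (#CHROM ... INFO), optionally
# followed by FORMAT and then one column per sample.
# $1: path to the VCF file
# (hypothetical helper; assumes tab-separated columns as in the VCF format)
count_sample_columns() {
    awk -F'\t' '/^#CHROM/ { n = NF - 9; if (n < 0) n = 0; print n; exit }' "$1"
}

# Example:
# count_sample_columns stripped.vcf
```

Under the stated assumption, a result of 0 for stripped.vcf indicates that no sample columns are left.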

Troubleshooting

Q: My VCF files are not in a completely valid format!

A: There are some built-in options to help with this. For example:

Prepare sample VCFs for one batch; e.g. CAR_Batch1_106Samples

cd /Volumes/CardioKitVCFs/OriginalVCFs/CAR_Batch1_106Samples

Fix missing '>' at the end of contig meta-data lines.

perl -pi -e 's/(contig=<ID=[^>\n]+)$/$1>/' CAR_*/*.vcf

Sort, filter on 'PASS', compress with bgzip and index with tabix (vcftools will not work on uncompressed, unindexed VCF files).

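# Sort each VCF, hard-filter it (-H removes records that fail a filter),
# then compress the result with bgzip and index it with tabix.
for item in CAR_*/*.vcf; do
    echo "Processing $item..."
    vcf-sort "$item" | vcf-annotate -H > "$item.sorted.filtered"
    bgzip "$item.sorted.filtered"
    tabix -p vcf "$item.sorted.filtered.gz"
done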
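After the loop has finished, each input should have a compressed file and a tabix index next to it. The sketch below lists inputs for which either is missing or empty, assuming the naming used above (.sorted.filtered.gz plus the .tbi index that tabix writes). The name report_unfinished is hypothetical.

```shell
# Print every CAR_*/*.vcf under a batch directory that lacks a non-empty
# .sorted.filtered.gz file or a non-empty .tbi index next to it.
# $1: batch directory (defaults to the current directory)
# (hypothetical helper, not part of VCFtools or Tabix)
report_unfinished() {
    batch_dir=${1:-.}
    for vcf in "$batch_dir"/CAR_*/*.vcf; do
        [ -e "$vcf" ] || continue   # an unmatched pattern stays literal; skip it
        out="$vcf.sorted.filtered.gz"
        if [ ! -s "$out" ] || [ ! -s "$out.tbi" ]; then
            echo "$vcf"
        fi
    done
}
```

No output means every sample VCF in the batch was sorted, filtered, compressed and indexed.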