
Running the VCF aggregation and de-sampling procedure

1. Install VCFtools and Tabix

Download and install Tabix and VCFtools. Tabix must be compiled first: run make in its source directory.

2. Add location of VCFtools and Tabix to your path

For example (adjust the paths and version numbers to match your own installation):

export PATH=/Applications/vcftools_0.1.10/bin/:/Applications/tabix-0.2.6/:$PATH

For a more permanent setup, add the same line to your .bashrc or .profile file.

Also add the VCFtools Perl library to the Perl 5 library path (PERL5LIB), again using the location of your own installation:

export PERL5LIB=/Applications/vcftools_0.1.11/perl/:$PERL5LIB
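
To confirm that the PATH export took effect, a quick check along the following lines may help. This is only a sketch: the function name check_tools is made up for this example, and the list of commands to pass in is assumed from the steps further down this page.

```shell
# check_tools is a hypothetical helper, not part of VCFtools or Tabix.
# It prints one line per command name and returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call would be check_tools vcf-sort vcf-annotate vcf-merge bgzip tabix. Any line starting with MISSING suggests the PATH entry for that tool is wrong.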

3. Download and install Perl if you don't have it

Go to: https://www.perl.org/

4. Download the script and put it in a folder of your choosing

GitHub link: https://github.com/molgenis/ngs-utils/blob/master/scripts/vcf-fill-gtc.pl

Raw version for wget:

wget https://raw.githubusercontent.com/molgenis/ngs-utils/master/scripts/vcf-fill-gtc.pl --no-check-certificate

Aggregation procedure

1. Sort, filter, bgzip and index

VCFtools will not work on uncompressed, unindexed VCF files, so each file must be sorted, filtered on 'PASS', compressed with bgzip, and indexed with tabix.

for item in mydirectory/*.vcf
do
    echo "Processing $item..."
    vcf-sort "$item" | vcf-annotate -H > "$item.sorted.filtered"
    bgzip "$item.sorted.filtered"
    tabix -p vcf "$item.sorted.filtered.gz"
done
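
For readers unsure what filtering on 'PASS' does to a file, the sketch below mimics the effect with awk on a VCF read from standard input. It assumes the behaviour described in the vcf-annotate help text for -H (drop records whose FILTER column is neither PASS nor "."). The function name keep_pass is invented for this illustration, and it is not meant as a replacement for vcf-annotate.

```shell
# keep_pass is a hypothetical stand-in used only to illustrate the filter step.
# Header lines (starting with '#') are always kept; data lines are kept only
# when column 7 (FILTER) is PASS or '.'.
keep_pass() {
    awk -F '\t' '/^#/ || $7 == "PASS" || $7 == "."'
}
```

Running keep_pass < sample.vcf | grep -v '^#' | wc -l on one input file (sample.vcf is a placeholder name) gives a rough idea of how many records should survive the real filter.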

2. Merge sample VCFs into one batch VCF

vcf-merge mydirectory/*.vcf.sorted.filtered.gz | bgzip -c > merged.vcf.gz
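
If vcf-merge complains about a missing index, a check like the one below can list the compressed files that have no .tbi file next to them. It is a minimal sketch that assumes tabix wrote each index as <file>.gz.tbi in the same directory; missing_indexes is a hypothetical name.

```shell
# missing_indexes is a hypothetical helper: it prints every *.gz file in the
# given directory that has no matching tabix index (<file>.tbi) next to it.
missing_indexes() {
    for gz in "$1"/*.gz; do
        [ -e "$gz" ] || continue          # directory contains no .gz files
        [ -e "$gz.tbi" ] || echo "$gz"
    done
}
```

Calling missing_indexes mydirectory should print nothing when every file from the previous step was indexed.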

3. Create a summary VCF per batch

perl vcf-fill-gtc.pl -vcfi merged.vcf.gz -vcfo stripped.vcf -ss -fv PASS -si -ll INFO > stripped.vcf.log

The option -ss is crucial here: it removes all sample details.
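
One way to double-check that no sample columns remain is to count the columns on the #CHROM header line of the output. The sketch below assumes the standard VCF layout (eight fixed columns, then FORMAT, then one column per sample); whether the script also drops the FORMAT column is not stated here, so the helper treats both cases as zero samples. The name count_samples is hypothetical.

```shell
# count_samples is a hypothetical helper: it reads an uncompressed VCF and
# prints how many sample columns follow the fixed columns on the #CHROM line.
# Columns 1-8 are fixed and column 9 is FORMAT, so samples start at column 10.
count_samples() {
    awk -F '\t' '/^#CHROM/ { n = NF - 9; if (n < 0) n = 0; print n; exit }' "$1"
}
```

Under these assumptions, count_samples stripped.vcf is expected to report 0 for a de-sampled file.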

Afterwards, be sure to inspect the log file for warnings!

more stripped.vcf.log
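
For long logs, filtering on the log level may be quicker than paging through everything. The sketch below assumes that each log line contains its Log4perl level name (WARN, ERROR or FATAL) as plain text; the exact log layout of the script is not documented on this page, so treat the pattern as a starting point. The name log_issues is hypothetical.

```shell
# log_issues is a hypothetical helper: it prints log lines that mention the
# Log4perl levels WARN, ERROR or FATAL, prefixed with their line numbers.
# grep exits with status 1 when nothing matches, i.e. when the log looks clean.
log_issues() {
    grep -n -E 'WARN|ERROR|FATAL' "$1"
}
```

For example, log_issues stripped.vcf.log prints only the lines that need attention.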

Troubleshooting

Q: My VCF files are not in a completely valid format!

A: There are some built-in options to help with this. For example, the following fixes a bug in old NextGene versions:

Fix missing '>' at the end of contig meta-data lines.

perl -pi -e 's/(contig=<ID=[^>\n]+)$/$1>/' mydirectory/*.vcf
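
Because perl -pi edits the files in place, it may be worth counting the affected lines first. The sketch below reuses the same pattern with grep; count_unclosed_contigs is a hypothetical name, and the check assumes Unix line endings.

```shell
# count_unclosed_contigs is a hypothetical helper: it prints the number of
# contig header lines in a file that lack the closing '>', using the same
# pattern as the perl substitution.
count_unclosed_contigs() {
    grep -c -E 'contig=<ID=[^>]+$' "$1"
}
```

A result of 0 for a file means the substitution would leave that file unchanged.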

Q: I get this error:

Can't locate Log/Log4perl.pm in @INC

A: Install Log4perl:

cpan Log::Log4perl

Q: What are the script options?

A: Man page:

#
# Create a summary VCF per batch:
#  -ss       : remove sample details!
#  -fv PASS  : keep only high quality variant calls that pass all filters applied in NextGene.
#              Just to be sure: variants should already have been filtered on PASS only in a previous step,
#              so this should be redundant here...
#  -si       : remove all INFO subfields except for INFO:AN and INFO:AC.
#              INFO:AN and INFO:AC were automatically updated by vcf-merge,
#              but the others were not and may contain erroneous annotation
#              that cause vcf-validator to complain the created VCF is not valid.
#
Last modified on 2014-09-17T14:47:30+02:00