Data preparation

This section describes how to enrich the VCF file produced by a standard variant caller with useful annotation, and how to load the data into a relational database so that Varapp can search and filter variants quickly.

The VCF must first be decomposed and normalized with vt, then annotated with VEP, and finally loaded into Gemini, which transforms it into a relational database of annotated variants.
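As a rough illustration of the order of these steps, the sketch below builds the four command lines in Python. The file names, the function names and the shortened VEP call are assumptions made for this example; the vt and Gemini options shown are the commonly documented ones and may differ from those used in the actual pipeline.

```python
import subprocess
from typing import List


def pipeline_commands(vcf: str, ref_fasta: str, ped: str, db: str) -> List[List[str]]:
    """Return the commands for decompose -> normalize -> VEP -> Gemini load.

    Intermediate file names are hypothetical and derived from the input name.
    """
    stem = vcf[:-4] if vcf.endswith(".vcf") else vcf
    decomposed = stem + ".decomposed.vcf"
    normalized = stem + ".normalized.vcf"
    annotated = stem + ".vep.vcf"
    return [
        # Split multi-allelic sites into one record per alternate allele.
        ["vt", "decompose", "-s", "-o", decomposed, vcf],
        # Left-align and trim indels against the reference genome.
        ["vt", "normalize", "-r", ref_fasta, "-o", normalized, decomposed],
        # Shortened VEP call; the full option list is in the Annotation section.
        ["perl", "variant_effect_predictor.pl", "-i", normalized,
         "--cache", "--vcf", "-o", annotated],
        # Load the annotated VCF and the pedigree into a Gemini (SQLite) database.
        ["gemini", "load", "-v", annotated, "-t", "VEP", "-p", ped, db],
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run each step in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling run_pipeline(pipeline_commands("batch1.vcf", "ref.fa", "batch1.ped", "batch1.db")) would execute the steps one after the other, provided vt, VEP and Gemini are installed and on the PATH.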

The complete pipeline we use can be found here. When a new VCF arrives, run it through the pipeline, then move the result where Varapp reads it (see next section).

Input format

The input data is a multi-sample VCF, such as one generated by GATK, and a pedigree (PED file) describing the familial relationships, i.e. a 6-column tab-delimited text file with a .ped extension:

Family_ID  Individual_ID  Paternal_ID  Maternal_ID  Sex  Phenotype
Fam1       A                                        1    1
Fam1       B                                        2    1
Fam1       C              A            B            1    2
Fam1       D              A            B            2    2

Sex: 1=male, 2=female. Phenotype: 1=unaffected, 2=affected. Paternal_ID and Maternal_ID are left blank if unknown; all other fields are mandatory. The phenotype can be changed freely from the web interface.
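The rules above can be checked before running the pipeline. The following is a minimal sketch of a PED validator, not the check that Varapp or Gemini performs; the names PedRow and parse_ped are hypothetical, and skipping lines that start with "#" or with "Family_ID" is an assumption about how a header line might appear in the file.

```python
from typing import Iterable, List, NamedTuple


class PedRow(NamedTuple):
    family_id: str
    individual_id: str
    paternal_id: str   # empty string if unknown
    maternal_id: str   # empty string if unknown
    sex: int           # 1=male, 2=female
    phenotype: int     # 1=unaffected, 2=affected


def parse_ped(lines: Iterable[str]) -> List[PedRow]:
    """Parse tab-delimited PED lines and raise ValueError on invalid rows."""
    rows = []
    for n, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        # Assumption: blank lines and header/comment lines are ignored.
        if not line.strip() or line.startswith("#") or line.startswith("Family_ID"):
            continue
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) != 6:
            raise ValueError(f"line {n}: expected 6 columns, got {len(fields)}")
        fam, ind, pat, mat, sex, pheno = fields
        if not fam or not ind:
            raise ValueError(f"line {n}: family and individual IDs are mandatory")
        if sex not in ("1", "2"):
            raise ValueError(f"line {n}: sex must be 1 or 2, got {sex!r}")
        if pheno not in ("1", "2"):
            raise ValueError(f"line {n}: phenotype must be 1 or 2, got {pheno!r}")
        rows.append(PedRow(fam, ind, pat, mat, int(sex), int(pheno)))
    return rows
```

For example, parse_ped(open("family.ped")) returns one PedRow per individual, with empty strings for unknown parents.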

If not provided, the PED file will be generated automatically, considering every sample as coming from a different family (e.g. a random cohort).

Typically, a multi-sample VCF contains one or a few families, or up to a few hundred individuals from a cohort. For performance reasons, it is advisable to split very large datasets when possible. Varapp makes it easy to switch from one group to another later on.

Annotation

We need to add information about each variant to the VCF, such as its frequency in the population, impact and pathogenicity scores. For that, we use Ensembl’s VEP and Gemini.

Varapp does not modify or add to their output. It only keeps track of which program and database versions were used to produce a dataset, so that a result obtained with an older version can always be reproduced. Note: We use Gemini for annotation only; Varapp has its own query API for filtering.

Until a later release, Varapp is sensitive to the annotation it finds in the VCF (i.e. it will ignore supplementary info but complain if some is missing). Here is the command that we run to annotate with VEP:

perl ${VEP_PATH}/variant_effect_predictor.pl -i <vcf> \
    --cache \
    --dir ${VEP_PATH}/.vepcache \
    --fasta $ref \
    --sift b \
    --polyphen b \
    --symbol \
    --numbers \
    --biotype \
    --total_length \
    --canonical --ccds \
    --vcf \
    --hgvs \
    --gene_phenotype \
    --uniprot \
    --force_overwrite \
    --domains --regulatory \
    --protein --tsl \
    --variant_class \
    --port 3337 \
    -o ${VEP_OUT}

And this is the custom annotation we add from the VCF INFO field into Gemini:

gemini annotate -f <vcf> \
    -a extract \
    -c AF,BaseQRankSum,FS,MQRankSum,ReadPosRankSum,SOR \
    -t float,float,float,float,float,float \
    -e AF,BaseQRankSum,FS,MQRankSum,ReadPosRankSum,SOR \
    -o mean,mean,mean,mean,mean,mean \
    <gemini_db>

The annotation with VEP can take a few hours. However, it has to be done only once.
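Since Varapp complains about missing annotation, it can help to look at which VEP fields actually ended up in the annotated VCF before loading it. VEP records them in the Description of the CSQ INFO header line, after "Format:". The sketch below extracts that list; the function name csq_fields is hypothetical, and the exact set of fields Varapp expects is not stated here, so the comparison is left to the reader.

```python
import re
from typing import Iterable, List

# Matches the CSQ header line written by VEP and captures the
# pipe-separated field list that follows "Format: ".
_CSQ_RE = re.compile(r'^##INFO=<ID=CSQ,.*Format: ([^">]+)')


def csq_fields(vcf_lines: Iterable[str]) -> List[str]:
    """Return the VEP CSQ field names declared in a VCF header, or []."""
    for line in vcf_lines:
        if not line.startswith("#"):
            break  # header is over, no CSQ declaration found
        match = _CSQ_RE.match(line)
        if match:
            return match.group(1).strip().split("|")
    return []
```

For an uncompressed file, csq_fields(open("annotated.vcf")) returns names such as Consequence, SYMBOL or SIFT, depending on the options passed to VEP.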

Variants databases

For each batch of samples, the previous step produces an SQLite database.

Varapp can query as many of these variants databases as the user needs. The user interface lets you switch the working database in one click. For performance reasons, we don’t recommend generating databases containing more than 500K variants.
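To check a database against this recommendation, one can count the rows of the variants table that Gemini creates. This is a minimal sketch using Python's standard sqlite3 module; the function names and the constant are introduced for this example only.

```python
import sqlite3

# Upper bound recommended above (500K variants per database).
MAX_RECOMMENDED_VARIANTS = 500_000


def count_variants(conn: sqlite3.Connection) -> int:
    """Return the number of rows in the Gemini 'variants' table."""
    (n,) = conn.execute("SELECT COUNT(*) FROM variants").fetchone()
    return n


def exceeds_recommended(n_variants: int) -> bool:
    """True if the database is larger than the recommended size."""
    return n_variants > MAX_RECOMMENDED_VARIANTS
```

A typical use would be to open the file with sqlite3.connect("batch1.db"), pass the connection to count_variants, and consider splitting the batch if exceeds_recommended returns True.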

It is time to install the app. First follow the Installation instructions.

Manually

At this point the app needs to be told where to look for Gemini databases. This is done by setting GEMINI_DB_PATH to the location of your files in the config file settings.py of the backend.

When you then copy new Gemini databases into that directory, the application detects them and adds them to the list of available databases in the Admin panel, so that the administrator can assign them to users.
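As an illustration, the setting and a quick check of the directory contents might look like the sketch below. The path is a placeholder, the function sqlite_files is hypothetical, and recognizing databases by the standard SQLite file header is an assumption made for this example, not necessarily how Varapp detects them.

```python
import os
from typing import List

# In settings.py of the backend; the path is a placeholder.
GEMINI_DB_PATH = "/path/to/gemini/databases"

# Every SQLite 3 file starts with this 16-byte header.
SQLITE_MAGIC = b"SQLite format 3\x00"


def sqlite_files(directory: str) -> List[str]:
    """List the files in 'directory' that start with the SQLite header."""
    found = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as fh:
            if fh.read(16) == SQLITE_MAGIC:
                found.append(name)
    return found
```

Running sqlite_files(GEMINI_DB_PATH) after copying new files gives a quick way to confirm that they are readable SQLite databases before looking for them in the Admin panel.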

Alternatively, you can fill the variants_db table of “users_db” yourself, in the same way as the demo data already present, or use it to add metadata. Note, however, that the “name” field must be unique.

Automated pipeline

(Under construction)

Start using the app

Once the data is ready, there is no need to look at those files anymore. Log in to Varapp and start using the graphical interface.

The first time it sees new databases, Varapp will take some time to fill its cache, especially if the databases are big. For now you can keep track of this progress in the Apache logs, but we are working on making it more visible.

Give a user access to a database

If you are logged in as a user with role ‘admin’ or ‘superuser’, you can access the Admin panel from the link at the top right of the screen. In the left column, select the user. Then in the right column, use the dropdown button to select a database to assign to that user. You can deny access by using the “Remove” button in front of a database name.