## ── Attaching packages ─────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
## ✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

ExOutBench is a tool used to benchmark the performance expression outlier calling methods in prioritizing genes whose expression is associated with rare genetic variation. For each function in the package the input is:

  • A file with rare variant information. Which individual-gene pairs have a rare variant? This is provided, or pre-compiled beforehand.
  • A file with the expresion outlier status of the same individual-gene pair. This is the thing you’re trying to test. The file with variant information (called outlier.calls) is formatted like this:
SampleName GeneID chr start end consdetail
IndvA ENSG00000186092 chr1 64904 64905 upstream
IndvA ENSG00000230021 chr1 631495 631496 upstream
IndvA ENSG00000230021 chr1 666203 666204 intron,non_coding
IndvA ENSG00000230021 chr1 666398 666399 intron,non_coding
IndvB ENSG00000225972 chr1 631495 631496 downstream
IndvB ENSG00000225630 chr1 631495 631496 downstream
IndvB ENSG00000237973 chr1 631495 631496 non_coding_exon
IndvB ENSG00000229344 chr1 631495 631496 upstream

Every line is an individual-gene pair and a rare variant associated with that gene. There can be multiple rare variants associated with one individual-gene pair, but if an individual-gene pair does not carry any rare variants within the specified window, it will be excluded from the table.

I have compiled this file for European individuals in the GTEx v8 cohort. It lists all rare variants (MAF < 1% in GnomAD and in GTEx) within 10 kb upstream of the gene and within the gene body. The file will have to be shared internally, since it contains personal genotype data.

rare.variants <- read_tsv(
  "/gpfs/commons/groups/lappalainen_lab/jeinson/data/ExOutBench_data/all_rare_variants_SNPs_10kb_genebody_w_consdetail_no_NA.tsv", 
  progress = F)
## Parsed with column specification:
## cols(
##   SampleName = col_character(),
##   GeneID = col_character(),
##   chr = col_character(),
##   start = col_double(),
##   end = col_double(),
##   consdetail = col_character()
## )

The test also takes in expression outlier calls from any method you might want to benchmark. The formatted is as such, where every row is an individual-gene pair with a corresponding “outlier score”:

GeneID SampleName outlier.score
ENSG00000000971 GTEX-1117F 0.9098267
ENSG00000001561 GTEX-1117F 0.9605523
ENSG00000001617 GTEX-1117F 0.9692674
ENSG00000002549 GTEX-1117F 0.9269707
ENSG00000002822 GTEX-1117F 0.8792655
ENSG00000002834 GTEX-1117F 0.9964071
tiss.outlier.scores <-
  read_csv("/gpfs/commons/groups/lappalainen_lab/jeinson/projects/aneva-dot/analysis_v8/ANEVA-DOT-pipeline/Filtering/combined.ad.scores.in.MSCLSK.tsv") %>%
  rename("GeneID" = "X1") %>%
  gather(SampleName, DOT.score, -GeneID) %>%
  filter(complete.cases(.)) %>%
  rename("outlier.score" = "DOT.score")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X1 = col_character()
## )
## See spec(...) for full column specifications.

!! The column names in the outlier call file and the rare variant call file must match! The package joins them together then performs enrichment analyses.

Right now, this package is able to calculate enrichment by

  • Outlier score threshold: This indicates how likely it is to observe a rare variant given outlier status, increasingly stringent outlier score thresholds.
  • Variant annotation: This shows the relative risk of a certain type of rare variant causing an expression outlier to occur, under a fixed threshold.

In the future, I will add support for showing enrichment of rare variants based on their distance to outlier genes, and enrichment of rare variants given their population frequency.

By default, each function removes all individual-gene pairs from genes that never occur as outliers. This functionality can be turned off by setting limit.to.genes.w.outliers to FALSE. The functions produce a data frame, and plots the results automatically by default. See the package documentation for more options and descriptions.

Run the enrichment by significance pipeline!

enrichment.by.significance.output <- 
  enrichment_by_significance(outlier.calls = tiss.outlier.scores, rare.variants = rare.variants, draw.plot = T)
## Only considering genes with at least one outlier at 0.05 
## Checking to only include individuals that have genotype data!!
## Joining, by = c("GeneID", "SampleName")
## Calculating enrichment scores

enrichment.by.significance.output
##      ratio  lower.q  upper.q n.outliers.w.var n.outliers
## 1 1.214516 1.184871 1.244902             1880       2673
## 2 1.292598 1.253777 1.332621             1043       1393
## 3 1.353310 1.307331 1.400906              697        889
## 4 1.400594 1.348621 1.454569              508        626
## 5 1.434031 1.377525 1.492854              403        485
## 6 1.478486 1.416925 1.542723              305        356
##   n.non.outliers.w.var n.non.outliers   sig
## 1               364503         629427 5e-02
## 2               365340         630707 1e-02
## 3               365686         631211 1e-03
## 4               365875         631474 1e-04
## 5               365980         631615 1e-05
## 6               366078         631744 1e-07

Run the enrichment by annotation pipeline!

enrichment.by.annotation.out <- 
  enrichment_by_annotation(outlier.calls = tiss.outlier.scores, rare.variants = rare.variants, draw.plot = T)
## Warning in if (!(c("SampleName", "GeneID", "chr", "start", "consdetail")
## %in% : the condition has length > 1 and only the first element will be used
## Only considering genes with at least one outlier at 0.05 
## Checking to only include individuals that have genotype data!!
## Combining outlier calls with rare variant information
## Joining, by = c("GeneID", "SampleName")
## Calculating enrichment scores

enrichment.by.annotation.out
##         ratio     lower.q    upper.q n.outliers.w.var n.outliers
## 1    1.075880   1.0350329   1.118339             1312       2673
## 2    1.608611   0.9335467   2.771826               13       2673
## 3    1.973116   1.7418131   2.235136              228       2673
## 4    9.078017   7.5648357  10.893878              115       2673
## 5    1.299335   1.1782949   1.432810              351       2673
## 6    1.050469   0.9430467   1.170127              295       2673
## 7    3.424950   2.7594900   4.250887               81       2673
## 8    1.181876   0.8868263   1.575089               46       2673
## 9    2.250708   1.9618846   2.582050              191       2673
## 10   1.967354   1.6505084   2.345024              120       2673
## 11   2.994491   1.6541396   5.420930               11       2673
## 12  29.969656  13.6612328  65.746650                7       2673
## 13 140.227203 110.5887923 177.808873              106       2673
## 14  97.636336  65.5794661 145.363400               34       2673
## 15  27.703044   6.4035126 119.849630                2       2673
## 16  21.406897   2.7646457 165.755509                1       2673
## 17   0.000000   0.0000000        NaN                0       2673
##    n.non.outliers.w.var n.non.outliers            anno
## 1                287155         629427          intron
## 2                  1903         629427             NMD
## 3                 27210         629427     3_prime_UTR
## 4                  2983         629427          splice
## 5                 63611         629427        upstream
## 6                 66128         629427      downstream
## 7                  5569         629427     5_prime_UTR
## 8                  9165         629427      non_coding
## 9                 19983         629427        missense
## 10                14363         629427      synonymous
## 11                  865         629427 non_coding_exon
## 12                   55         629427 splice_acceptor
## 13                  178         629427     stop_gained
## 14                   82         629427    splice_donor
## 15                   17         629427      start_lost
## 16                   11         629427       stop_lost
## 17                    6         629427   stop_retained