Photo by Bill Majoros. Used with permission.

NEWS

[8/23/2018] Experiment design web tool released - A web tool is now available for power and sample size estimation: http://67.159.92.22:8080/
[6/16/2018] First version of BIRD released - The first version of BIRD has been released on GitHub at https://github.com/bmajoros/BIRD.

Contents

Description
Download
Installation
Running
Output Format
Examples
Web Tool
Contact

Description

BIRD (Bayesian Inference of Regulatory Differences) is a software suite for identifying regulatory variants in data from STARR-seq and other massively parallel reporter assays (MPRAs). BIRD uses a Bayesian hierarchical model to integrate prior information with read count data. Using MCMC (Markov Chain Monte Carlo), BIRD efficiently performs posterior inference to estimate effect sizes of regulatory variants. Because BIRD's inference is based on posterior distributions, it is able to provide both point estimates and credible intervals for effect sizes. BIRD has been found to be substantially more accurate than other tests based on the beta-binomial distribution and Fisher's exact test.

The purpose of BIRD is to estimate the effect size of regulatory variants, where the effect size, theta, is defined as:

Thus, an effect size of 1 means that the variant has no effect on transcription. Effect sizes less than 1 mean that the alternate allele reduces transcription, while effects greater than 1 mean that the alternate allele increases transcription. Effect sizes tend to range between 0.5 and 2.0.

BIRD explicitly models the unknown allele frequency p in the DNA and q in the RNA, as well as the individual frequencies q_i in each biological replicate. BIRD computes an effect size (theta) from these latent variables p and q, rather than simply dividing read counts. The common method of dividing read counts results in unstable estimates when read counts are low. BIRD's use of an informative prior on effect sizes results in more stable estimates when read counts are low. In addition, BIRD's explicit modeling of replicate structure improves its estimate of allele frequencies in RNA, thus providing more accurate estimates of regulatory effect size. Finally, because BIRD reconstructs the full posterior distribution of theta (the effect size), it can assess how confident its estimate of theta is. This confidence is reported to the user as a 95% credible interval, as well as a posterior probability, P_reg, that the variant has a regulatory effect.

BIRD is written in open-source STAN, C++, and python, and can be compiled on most UNIX systems. BIRD is was developed in the Center for Statistical Genetics and Genomics at Duke University.

Download

BIRD can be downloaded from GitHub:

https://github.com/bmajoros/BIRD

BIRD is designed to run on Linux. Installation instructions are provided below.

Installation

Prerequisites for Installing BIRD

The following are required to install and run BIRD directly on your system:

BIRD has been tested on Linux. It may or may not work on other UNIX systems.
Installation must be on a case-sensitive filesystem.
CmdStan must be installed. This is the command-line interface to the STAN statistical programming language.
g++/gcc version 6.2 or higher is required.
Python version 3.5 or higher is required.
GSL, the Gnu Scientific Library, must be installed.

Installing and Compiling BIRD Source Code

First, install CmdStan, and set the environment variable $STAN to the directory where CmdStan has been installed. In the bash shell, this can be done using a command of the form:

export STAN=/full/absolute/path/to/cmdstan

Next, download BIRD, copy its files into the CmdStan directory, and compile BIRD in the CmdStan directory, using these commands:

git clone --recursive \

  https://github.com/bmajoros/BIRD.git
cd BIRD

make all

cp swift *.{stan,py} $STAN

cd $STAN

make BIRD

Next, please set these additional environment variables:

$TMPDIR : absolute path to a location where temporary files can be made (this might already be set by your system administrator)
$BIRD : absolute path to the BIRD directory created above by the git clone command
$PYTHONPATH : absolute path to the $BIRD/python subdirectory
All environment variables documented in the GSL distribution. These include $GSLDIR and possibly others.

If you use the tcsh or csh shells, this can be done using the setenv command; if your shell is bash, you'll need to use the set and export commands. It is recommended to put these commands in your .cshrc or .bashrc file so they run every time you log in. Note that your system administrators may have already set $TMPDIR for you.

Running BIRD

Prerequisites

Before the BIRD model can be run, you must create a file containing the read counts for each genetic variant. The format of this required file is described below.

To obtain read counts, you will first need to align your sequencing reads using an aligner such as Bowtie. Once your reads are aligned, you can use the samtools mpileup command to get read counts for each genetic variant. You can then put those read counts into an input file for BIRD. The format of the input file is described next.

Input Format

The input file for BIRD must contain one line per genetic variant. Each line consists of tab-separated columns.

The first two columns are:

ID = identifier of the variant. For example, rs847362.
D = number of DNA replicates. This should be an integer. If you only sequenced your DNA once, set this to 1.

The next set of columns consists of pairs of reference and alternate read counts in the DNA. The number of such pairs is given by D. Thus, if you did 3 DNA replicates, you will have six columns after the D column:

Read count for reference allele in DNA replicate 1
Read count for alternate allele in DNA replicate 1
Read count for reference allele in DNA replicate 2
Read count for alternate allele in DNA replicate 2
Read count for reference allele in DNA replicate 3
Read count for alternate allele in DNA replicate 3

If you have more than 3 replicates, continue alternating reference and alternate allele counts as above, separating all columns by tabs.

Next is column R, which specifies the number of RNA replicates.

Finally, this is followed by R pairs of reference and alternate read counts in the RNA, just as for the DNA above.

Thus, an example line for 1 DNA replicate and 3 RNA replicates would be as follows (NOTE: do not include the header line in your input file):

ID        D  ref 
alt  R  ref  alt ref  alt ref  alt

variant1  1  101  99   3 
110  89  114  84  109  82

The example-inputs.txt file in the $BIRD directory provides additional examples.

Running the Model

The model must be run in the $STAN directory. The following command will run the model on a set of variants:

bird.py BIRD lambda input.txt stan.txt samples first-last > out.txt

The parameters are:

BIRD = the model to use; set this to "BIRD" (no quotes).
lambda = the minimum effect size to include in posterior calculations. We recommend setting this to 1.25, which will calibrate posterior probabilities for detecting effects larger than 1.25 or less than 1/1.25. The minimum value for this parameter is 1. This parameter does not affect the estimated effect size or credible interval.
input.txt = the genetic variants to test, one line per variant, as described above.
stan.txt = the name of a file that BIRD can use to store outputs from the STAN engine.
samples = the number of MCMC samples to draw. We recommend 1000. If speed is crucial, you can set this to 500, or use the swift program instead (see below).
first-last = the range of variants to process, in zero-based coordinates. If your input file contains 50 variants, set this to 0-49 to process all 50 variants. You can use this parameter to split the variants across multiple jobs, for parallelization. For example, to run 5 jobs, you could use the ranges 0-9, 10-19, 20-29, 30-39, and 40-49. Be sure to specify different output files and different temp files for the different jobs. BIRD is not automatically parallelized; you will need to run the separate jobs manually.
out.txt = the output file where predictions will be written.

An optional parameter is -t thetas.txt, which will store the sampled values of theta (effect size) into the given file, with all the samples for a single variant on a single line separated by tabs. If you process multiple variants, the samples for different variants will be on different lines, in the same order as in the input file.

Note that the BIRD model relies on MCMC sampling, which can be slow. To run on millions of variants with high allele frequencies (>0.075), we recommend using the Swift model instead of BIRD. Swift is approximately 5000 times faster than BIRD, while being only slightly less accurate for variants with high allele frequencies; for allele frequencies less than 0.075, BIRD tends to be more accurate.

Output Format

The output consists of 5 columns:

Variant ID = identifier of the variant tested.
Effect size (theta) = posterior median of effect size under the model. This effect size can be interpreted as the transcriptional rate of the alternate allele divided by the transcriptional rate of the reference allele. An effect of 0.5 means that the alternate allele halves the transcriptional rate. Effect sizes tend to be between 0.5 and 2 for real regulatory variants. An effect size close to 1 means no effect.
Lower CI = lower bound of 95% credible interval.
Upper CI = upper bound of 95% credible interval.
P_reg = posterior probability that the variant has a regulatory effect. This probability reflects the lambda value specified in the BIRD command. If you specified 1.25 (as recommended), P_reg will be the posterior probability that the effect size is greater than 1.25 or less than 1/1.25.

Parallelizing BIRD on Your System

Because BIRD processes each variant individually, parallelization on different cluster computing systems is simple, via batch processing of different sets of variants. By splitting the full variant set into N subsets (input files) and submitting N batch jobs to process those files on N different compute nodes, processing can in theory achieve an N-fold speedup. Be sure to specify different outputs files and different temp files for the different jobs.

Running the Swift Model

BIRD can be slow when run on extremely large numbers of variants, due to its use of MCMC sampling. A faster model that is typically only slightly less accurate on common variants (maf>0.075) is included in the BIRD package. It is called swift, and has the same input and output file formats as BIRD. Swift can be run via the following command:

swift input.txt 100 lambda first-last 1000

Where:

input.txt is the input file, formatted the same as for BIRD.
lambda (minimum effect size in posterior calculations) is defined the same as for BIRD: we recommend using 1.25.
100 is a concentration parameter for the prior that shrinks q toward p; we have found 100 to work well on multiple data sets.
first-last is the range of variant indices to process, as in BIRD. Indices are zero-based and inclusive.
1000 is the number of samples to draw. We have found 1000 samples to work well.

If you want to save the individual MCMC samples to reconstruct the full posterior distribution for theta, you can use the -s option with a filename. The samples for theta will be written into the specified file. All samples of theta for a single variant will be on a single line, separated by tabs. Samples for different variants will be on separate lines.

Swift has been found to be approximately 5000 times faster than BIRD, making it practical for use on the largest of data sets. Note that Swift tends to be less accurate than BIRD on rare/uncommon variants (allele frequencies less than 0.075).

Example Inputs and Outputs

In the $BIRD directory is a file containing 1000 genetic variants simulated to have an effect size of 0.5, meaning that they reduce transcriptional rate by half. BIRD can be run on this file using the following command in the STAN directory:

bird.py BIRD 1.25 $BIRD/example-inputs.txt mcmc.txt 1000 0-499

The input file contains 1000 lines, each line consisting of a single simulated variant. For example, the first line is:

variant1    1   
885    115    10   
40    0    47   
6    61    6   
75    5    88   
5    101    6   
112    8    124   
9    140    7   
149    11

The first field is the variant identifier ("variant1"), followed by the number of DNA replicates, which is 1 here. After that come the DNA read counts for the reference allele (885) and the alternate allele (115). This is followed by the number of RNA replicates (10). After that are pairs of fields giving the reference and alternate read counts in the RNA. For the first RNA replicate there were 40 reference and 0 alternate reads. For the second RNA replicate, there were 47 reference and 6 alternate reads, etc...

Because BIRD uses a random sampling approach to inference, its outputs will differ somewhat between runs. The outputs should be similar to the lines shown below, but with slight differences due to sampling:

variant   effect  CI_left  CI_right  P_reg
variant1  0.5513  0.3806   0.8016    0.974

variant2  0.5868  0.4513   0.8492    0.963

variant3  0.5985  0.4441   0.8284    0.963

      ...etc...

Since these variants were simulated to have an effect size of 0.5 (a halving of transcriptional rate), we expect the estimated effect to be on the order of 0.5, as they are here. However, because the model's prior shrinks estimates toward 1 (i.e., no effect), estimates are moderated toward 1 when the model is not highly confident of the effect. Thus, for some variants in the example set, the estimate will be higher than 0.5, due to shrinkage under the prior.

The P_reg values (fifth column) indicate how confident the model is that there is a regulatory effect. If you specified a lambda value of 1.25 when running BIRD (as recommended), P_reg will be the posterior probability that the effect size is greater than 1.25 or less than 1/1.25.

To see that P_reg values are indicative of whether a variant has a regulatory effect, you can run BIRD on both positive examples in example-inputs.txt, and on a set of neutral variants included in the file example-nulls.txt:

bird.py BIRD 1.25 $BIRD/example-inputs.txt out-pos.tmp 1000 0-499 > bird.pos

bird.py BIRD 1.25 $BIRD/example-nulls.txt out-neg.tmp 1000 0-499 > bird.neg

Extracting the P_reg values (fifth column of the output files) and plotting their distributions shows that for the regulatory variants, P_reg is typically close to 1 (red distribution), whereas for neutral variants with no effect, P_reg is typically close to 0 (blue distribution), as expected:

Plots like this can be generated in the R statistical computing environment, using commands such as the following:

pos <- read.table("bird.pos")

neg <- read.table("bird.neg")

dpos <- density(pos$V5)

dneg <- density(neg$V5)

plot(dpos,type="n")

lines(dpos,col="red")

lines(dneg,col="blue")

Utilizing the full Posterior Distribution

While BIRD summarizes the posterior distribution of theta (effect size) using the posterior median and the 95% credible interval, advanced users may wish to reconstruct the full posterior distribution of theta and extract other summary statistics.

If you used the -t thetas.txt option when running BIRD, then the samples of theta will be saved in the file thetas.txt. The samples for each variant will be separated by tabs, with samples for different variants on different lines (in the same order as the input file).

You can load this data into the R statistical computing environment. To plot the distribution of theta values for the i^th variant, you can use these R commands:

data <- read.table("thetas.txt")
hist(as.numeric(data[i,]))

For example, the histogram below was constructed from the samples for one of the variants in the example-inputs.txt file. As you can see, the peak of the distribution is close to the true value of 0.5:

However, this distribution has a large variance, indicating that the model is not certain that theta is precisely equal to 0.5. The 95% credible interval reported in the BIRD output (CI_left and CI_right) gives the symmetric interval containing 95% of the MCMC samples, and indicates the amount of spread in the distribution. Using the full distribution as illustrated above, you can extract other credible intervals as desired, or compute arbitrary tail probabilities.

For example, an estimate of P(theta<=0.5) for the first variant in the input file can be obtained in R using these commands:

data <- read.table("thetas.txt")
sum(data[1,]<=0.5)/ncol(data)

which in this case gives 0.264. Thus, the model assigns a degree of belief of 26.4% that the variant reduces transcription by factor of 2 or more.

Web Tool for Experimental Design

BIRD can be used to estimate required read depths and replicates to achieve a desired sensitivity and false discovery rate (FDR). The web tool for computing these sample sizes is at:

http://67.159.92.22:8080/

Contact

Please direct all bug reports and feature requests to Bill Majoros.

BIRD was developed by Bill Majoros at Duke University in collaboration with Andrew Allen and Tim Reddy. The web tool was developed by Alex Barrera and Young-Sook Kim at Duke University.

contact: bmajoros@alumni.duke.edu

Eagle photo by Bill Majoros. Used with permission.