Description
BIRD (Bayesian
Inference of Regulatory Differences)
is a software suite for identifying regulatory variants in data from
STARR-seq and other massively parallel reporter
assays (MPRAs). BIRD uses a Bayesian hierarchical model to
integrate prior information with read count data. Using MCMC (Markov Chain Monte Carlo),
BIRD efficiently performs posterior inference to estimate effect sizes
of regulatory variants. Because BIRD's inference is based on
posterior distributions, it is able to provide both point estimates and
credible intervals for effect sizes. BIRD has been found to be
substantially more accurate than other tests based on the beta-binomial
distribution and Fisher's exact test.
The purpose of BIRD is to estimate the effect size of regulatory variants, where the effect size, theta, is defined as:
Thus, an effect size of 1 means that the variant has no effect on
transcription. Effect sizes less than 1 mean that the alternate
allele reduces transcription, while effects greater than 1 mean that
the alternate allele increases transcription. Effect sizes tend
to range between 0.5 and 2.0.
BIRD explicitly models the unknown allele frequency p in the DNA and q in the RNA, as well as the individual frequencies qi in each biological replicate. BIRD computes an effect size (theta) from these latent variables p and q,
rather than simply dividing read counts. The common method of
dividing read counts results in unstable estimates when read counts are
low. BIRD's use of an informative prior on effect sizes results
in more stable estimates when read counts are low. In addition,
BIRD's explicit modeling of replicate structure improves its estimate
of allele frequencies in RNA, thus providing more accurate estimates of
regulatory effect size. Finally, because BIRD reconstructs the
full posterior distribution of theta (the effect size), it can assess
how confident its estimate of theta is. This confidence is
reported to the user as a 95% credible interval, as well as a posterior
probability, Preg, that the variant has a regulatory effect.
BIRD is written in open-source STAN, C++, and python, and can be compiled
on most UNIX systems. BIRD is was developed
in the Center for Statistical
Genetics and Genomics at Duke University.
Download
BIRD can be
downloaded from GitHub:
BIRD is designed to
run on Linux.
Installation instructions are provided below.
Installation
Prerequisites for Installing BIRD
The following are required to
install and run BIRD directly on your system:
- BIRD has been tested on Linux. It may or may not work
on other UNIX systems.
- Installation must be on a case-sensitive filesystem.
- CmdStan
must be installed. This is the command-line interface to the STAN
statistical programming language.
- g++/gcc version 6.2 or higher is required.
- Python
version 3.5 or higher is required.
- GSL, the Gnu Scientific Library, must be installed.
Installing and Compiling BIRD Source Code
First, install CmdStan, and set the environment variable $STAN
to the directory where CmdStan has been installed. In
the bash shell, this can be done using a command of the form:
export STAN=/full/absolute/path/to/cmdstan
Next, download BIRD, copy its files into the CmdStan directory, and
compile BIRD in the CmdStan directory, using these commands:
git clone --recursive \
https://github.com/bmajoros/BIRD.git cd BIRD
make all
cp swift *.{stan,py} $STAN
cd $STAN
make BIRD
Next, please set these additional environment variables:
$TMPDIR : absolute path to a
location where temporary files can be made (this might already be set
by your system administrator)
$BIRD : absolute path to the BIRD
directory created above by the git clone command
- $PYTHONPATH
: absolute path to the $BIRD/python
subdirectory
- All environment variables documented in the GSL distribution. These include $GSLDIR and possibly others.
If you use the tcsh or csh shells, this can be done
using the setenv command;
if your shell is bash,
you'll need to use the set
and export
commands. It is recommended to put these commands in your .cshrc or .bashrc file so they run every
time you log in. Note that your system administrators may have
already set $TMPDIR for you.
Running BIRD
Prerequisites
Before the BIRD model
can be run, you must create a file containing the read counts for each
genetic variant. The format of this required file is described
below.
To obtain read counts, you will first need to align your sequencing reads
using an aligner such as Bowtie.
Once your reads are aligned, you can use the samtools mpileup
command to get read counts for each genetic variant. You can then
put those read counts into an input file for BIRD. The format of
the input file is described next.
Input Format
The input file for BIRD must contain one line per
genetic variant. Each line consists of tab-separated columns.
The first two columns are:
- ID = identifier of the variant. For example, rs847362.
- D =
number of DNA replicates. This should be an integer. If you
only sequenced your DNA once, set this to 1.
The next set of columns consists of pairs of reference
and alternate read counts in the DNA. The number of such pairs is
given by D. Thus, if
you did 3 DNA replicates, you will have six columns after the D column:
- Read count for reference allele in DNA replicate 1
- Read count for alternate allele in DNA replicate 1
- Read count for reference allele in DNA replicate 2
- Read count for alternate allele in DNA replicate 2
- Read count for reference allele in DNA replicate 3
- Read count for alternate allele in DNA replicate 3
If you have more than
3 replicates, continue alternating reference and alternate allele
counts as above, separating all columns by tabs.
Next is column R, which
specifies the number of RNA replicates.
Finally, this is followed by R
pairs of reference and alternate read counts in the RNA, just as for
the DNA above.
Thus, an example line for 1 DNA replicate and 3 RNA replicates would be as follows (NOTE: do not include the header line in your input file):
ID D ref
alt R ref alt ref alt ref alt
variant1 1 101 99 3
110 89 114 84 109 82
The example-inputs.txt file in the $BIRD directory provides additional examples.
Running the Model
The model must be run in the $STAN directory. The following command will run the model on a set of variants:
bird.py BIRD lambda input.txt stan.txt samples first-last > out.txt
The parameters are:
BIRD = the model to use; set this to "BIRD" (no quotes).
- lambda = the minimum effect size to include in posterior calculations. We recommend setting this to 1.25, which will calibrate posterior probabilities for detecting effects larger than 1.25 or less than 1/1.25.
The minimum value for this parameter is 1. This parameter does
not affect the estimated effect size or credible interval.
input.txt = the genetic variants to test, one line per variant, as described above.
stan.txt = the name of a file that BIRD can use to store outputs from the STAN engine.
samples = the number of MCMC samples to draw. We recommend 1000. If speed is crucial, you can set this to 500, or use the swift program instead (see below).
first-last = the range of variants
to process, in zero-based coordinates. If your input file
contains 50 variants, set this to 0-49 to process all 50
variants. You can use this parameter to split the variants across
multiple jobs, for parallelization. For example, to run 5 jobs,
you could use the ranges 0-9, 10-19, 20-29, 30-39, and 40-49. Be
sure to specify different output files and different temp files for the
different jobs. BIRD is not automatically parallelized; you will
need to run the separate jobs manually.
out.txt = the output file where predictions will be written.
An optional parameter is -t thetas.txt,
which will store the sampled values of theta (effect size) into the
given file, with all the samples for a single variant on a single line
separated by tabs. If you process multiple variants, the samples
for different variants will be on different lines, in the same order as
in the input file.
Note that the BIRD model relies on MCMC sampling, which can be
slow. To run on millions of variants with high allele frequencies (>0.075), we recommend using the Swift
model instead of BIRD. Swift is approximately 5000 times faster
than BIRD, while being only slightly less accurate for variants with
high allele frequencies; for allele frequencies less than 0.075, BIRD
tends to be more accurate.
Output Format
The output consists of 5 columns:
- Variant ID = identifier of the variant tested.
- Effect size (theta)
= posterior median of effect size under the model. This effect
size can be interpreted as the transcriptional rate of the alternate
allele divided by the transcriptional rate of the reference
allele. An effect of 0.5 means that the alternate allele halves
the transcriptional rate. Effect sizes tend to be between 0.5 and
2 for real regulatory variants. An effect size close to 1 means no effect.
- Lower CI = lower bound of 95% credible interval.
- Upper CI = upper bound of 95% credible interval.
- Preg = posterior probability that the variant has a regulatory effect. This probability reflects the lambda value specified in the BIRD command. If you specified 1.25 (as recommended), Preg will be the posterior probability that the effect size is greater than 1.25 or less than 1/1.25.
Parallelizing BIRD on Your System
Because BIRD processes
each variant individually, parallelization on different cluster computing
systems is simple, via batch processing of different sets of variants.
By splitting the full variant set into N subsets (input files) and
submitting N batch jobs to
process those files on N
different compute nodes, processing can in theory achieve an N-fold speedup. Be sure to specify different outputs files and different temp files for the different jobs.
Running the Swift Model
BIRD
can be slow when run on extremely large numbers of variants, due to its
use of MCMC sampling. A faster model that is typically only slightly
less accurate on common variants (maf>0.075) is included in the BIRD package. It is called swift, and has the same input and output file formats as BIRD. Swift can be run via the following command:
swift input.txt 100 lambda first-last 1000
Where:
- input.txt is the input file, formatted the same as for BIRD.
- lambda (minimum effect size in posterior calculations) is defined the same as for BIRD: we recommend using 1.25.
- 100 is a concentration parameter for the prior that shrinks q toward p; we have found 100 to work well on multiple data sets.
- first-last is the range of variant indices to process, as in BIRD. Indices are zero-based and inclusive.
- 1000 is the number of samples to draw. We have found 1000 samples to work well.
If you want to save the individual MCMC samples to reconstruct the full posterior distribution for theta, you can use the -s
option with a filename. The samples for theta will be written into the
specified file. All samples of theta for a single variant will be on a
single line, separated by tabs. Samples for different variants will be
on separate lines.
Swift has been found to be approximately 5000 times faster than BIRD,
making it practical for use on the largest of data sets. Note
that Swift tends to be less accurate than BIRD on rare/uncommon
variants (allele frequencies less than 0.075).
Example Inputs and Outputs
In the $BIRD
directory is a file containing 1000 genetic variants simulated to have an
effect size of 0.5, meaning that they reduce transcriptional rate by
half. BIRD can be run on this file using the following command in the STAN directory:
bird.py BIRD 1.25 $BIRD/example-inputs.txt mcmc.txt 1000 0-499
The input file contains 1000 lines, each line consisting of a single simulated variant. For example, the first line is:
variant1 1
885 115 10
40 0 47
6 61 6
75 5 88
5 101 6
112 8 124
9 140 7
149 11
The
first field is the variant identifier ("variant1"), followed by the
number of DNA replicates, which is 1 here. After that come the
DNA read counts for the reference allele (885) and the alternate allele
(115). This is followed by the number of RNA replicates
(10). After that are pairs of fields giving the reference and
alternate read counts in the RNA. For the first RNA replicate
there were 40 reference and 0 alternate reads. For the second RNA
replicate, there were 47 reference and 6 alternate reads, etc...
Because BIRD uses a random sampling approach to inference, its outputs
will differ somewhat between runs. The outputs should be similar
to the lines shown below, but with slight differences due to sampling:
variant effect CI_left CI_right Preg variant1 0.5513 0.3806 0.8016 0.974
variant2 0.5868 0.4513 0.8492 0.963
variant3 0.5985 0.4441 0.8284 0.963
...etc...
Since
these variants
were simulated to have an effect size of 0.5 (a halving of
transcriptional rate), we expect the estimated effect to be on the
order of 0.5, as they are here. However, because the model's
prior shrinks estimates toward 1 (i.e., no effect), estimates are
moderated toward 1 when the model is not highly confident of the
effect. Thus, for some variants in the example set, the estimate
will be higher than 0.5, due to shrinkage under the prior.
The Preg values (fifth column) indicate how confident the model is that there is a regulatory effect. If you specified a lambda value of 1.25 when running BIRD (as recommended), Preg will be the posterior probability that the effect size is greater than 1.25 or less than 1/1.25.
To see that Preg values are indicative of whether a variant has a regulatory effect, you can run BIRD on both positive examples in example-inputs.txt, and on a set of neutral variants included in the file example-nulls.txt:
bird.py BIRD 1.25 $BIRD/example-inputs.txt out-pos.tmp 1000 0-499 > bird.pos
bird.py BIRD 1.25 $BIRD/example-nulls.txt out-neg.tmp 1000 0-499 > bird.neg
Extracting the Preg values (fifth column of the output files) and plotting their distributions shows that for the regulatory variants, Preg is typically close to 1 (red distribution), whereas for neutral variants with no effect, Preg is typically close to 0 (blue distribution), as expected:
Plots like this can be generated in the R statistical computing environment, using commands such as the following:
pos <- read.table("bird.pos")
neg <- read.table("bird.neg")
dpos <- density(pos$V5)
dneg <- density(neg$V5)
plot(dpos,type="n")
lines(dpos,col="red")
lines(dneg,col="blue")
Utilizing the full Posterior Distribution
While BIRD summarizes the posterior distribution of
theta (effect size) using the posterior median and the 95% credible
interval, advanced users may wish to reconstruct the full posterior
distribution of theta and extract other summary statistics.
If you used the -t thetas.txt option when running BIRD, then the samples of theta will be saved in the file thetas.txt.
The samples for each variant will be separated by tabs, with samples
for different variants on different lines (in the same order as the
input file).
You can load this data into the R statistical computing environment. To plot the distribution of theta values for the ith variant, you can use these R commands:
data <- read.table("thetas.txt")
hist(as.numeric(data[i,]))
For example, the histogram below was constructed from the samples for one of the variants in the example-inputs.txt file. As you can see, the peak of the distribution is close to the true value of 0.5:
However, this
distribution has a large variance, indicating that the model is not
certain that theta is precisely equal to 0.5. The 95% credible
interval reported in the BIRD output (CI_left and CI_right)
gives the symmetric interval containing 95% of the MCMC samples, and
indicates the amount of spread in the distribution. Using the
full distribution as illustrated above, you can extract other credible
intervals as desired, or compute arbitrary tail probabilities.
For example, an estimate of P(theta<=0.5) for the first variant in the input file can be obtained in R using these commands:
data <- read.table("thetas.txt")
sum(data[1,]<=0.5)/ncol(data)
which in this case gives 0.264. Thus, the model assigns a degree
of belief of 26.4% that the variant reduces transcription by factor of
2 or more.
Web Tool for Experimental Design
BIRD
can be used
to estimate required read depths and replicates to achieve a desired
sensitivity and false discovery rate (FDR). The web tool for
computing
these sample sizes is at:
Contact
Please direct all bug
reports and feature requests to Bill Majoros.
BIRD was developed by Bill
Majoros
at Duke University
in collaboration with Andrew Allen and Tim Reddy. The web tool
was developed by Alex Barrera and Young-Sook Kim at Duke University.
|