Installing and Running ACE+ using Docker



This page lists the complete sequence of commands for running ACE+ via Docker. The session covers training ACE+ models, building a personalized genome, and running the trained models on that personalized genome. This series of commands has been confirmed to run without error in Docker version 17.03.1-ce, build c6d412e.

As a prerequisite, you must have Docker installed, access to a UNIX terminal window, and the ability to download remote files via wget. Once the Docker daemon is running, the following commands should work on any system.
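
Before starting, it can help to confirm that the tools used on the host side are on your PATH. The following is a minimal sketch and is not part of the ACE+ distribution; the helper name have_tool is hypothetical, and gunzip is included only because later commands call it directly.

```shell
# Hypothetical helper (not part of ACE+): succeeds if a program is on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report each host-side tool that the commands on this page rely on.
for tool in docker wget gunzip; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "MISSING: $tool"
    fi
done
```

This only checks that the programs exist; it does not check that the Docker daemon is actually running.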

# This command downloads the 3GB ACE+ image for Docker:
docker pull genezilla/aceplus

# This command makes a convenient environment variable, $DOCKER, which will simplify the following commands.  Please change "/home/bmajoros" to your working directory:
export DOCKER="docker run -w /root -v /home/bmajoros:/root -it genezilla/aceplus"
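
As an alternative to editing the path by hand, the variable can be built from the current directory. This is a sketch under the assumption that you run it from the directory you want mounted; the name WORKDIR is introduced here for illustration only.

```shell
# Assumption: the current directory is the working directory to mount as /root.
WORKDIR="$(pwd)"

# Same docker options as above, with the host path filled in automatically.
export DOCKER="docker run -w /root -v $WORKDIR:/root -it genezilla/aceplus"

# Show the resulting command prefix for inspection.
echo "$DOCKER"
```

Because the -v option mounts the working directory at /root inside the container, files downloaded there with wget are visible to the tools run through $DOCKER. Since $DOCKER is expanded unquoted in the commands on this page, a working directory whose path contains spaces would likely not work with this approach.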

# This command downloads the human genome in 2bit format:
wget http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

# This command extracts chromosome lengths from the 2bit genome file:
$DOCKER twoBitInfo hg19.2bit hg19.lengths

# This command downloads a set of annotations for chromosome 22:
wget www.geneprediction.org/ACE/chr22.gff

# This command downloads a VCF file of genetic variants for chr22:
wget www.geneprediction.org/ACE/chr22.vcf.gz

# These commands recompress the VCF file using bgzip and run tabix to index it for efficient extraction of variants within genes:
gunzip chr22.vcf.gz
$DOCKER bgzip chr22.vcf
$DOCKER tabix chr22.vcf.gz

# These commands download a set of training genes:
wget www.geneprediction.org/ACE/iso0-100.fasta.gz
gunzip iso0-100.fasta.gz
wget www.geneprediction.org/ACE/iso0-100.gff.gz
gunzip iso0-100.gff.gz

# This command downloads a substitution matrix for alignment:
wget https://raw.githubusercontent.com/bmajoros/ACE/master/pam10

# This command downloads a list of individuals for which we wish to construct personal genomes:
wget www.geneprediction.org/ACE/gender.txt

# This command downloads a sample configuration file:
wget https://raw.githubusercontent.com/bmajoros/ACE/master/aceplus.config
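
At this point all input files should be present in the working directory. The following sketch checks for them before the longer steps begin. The helper name check_inputs is hypothetical, and the file list is simply the set of names produced by the download and extraction commands above.

```shell
# Hypothetical helper: print each listed file that is absent or empty,
# and return non-zero if any such file was found.
check_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f"
            status=1
        fi
    done
    return $status
}

# Files named in the steps above.
check_inputs hg19.2bit hg19.lengths chr22.gff chr22.vcf.gz \
    iso0-100.fasta iso0-100.gff pam10 gender.txt aceplus.config \
    || echo "Re-run the steps above for the files listed."
```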

# These commands create a personalized genome for individual HG00096:
mkdir genomes
$DOCKER make-individual-genomes.pl aceplus.config chr22.gff genomes

# This command extracts positive training features:
$DOCKER get-examples.pl iso0-100.gff iso0-100.fasta  GT,GC,AT  AG,AC  ATG  TGA,TAA,TAG

# This command extracts negative training features:
$DOCKER get-negative-examples.pl

# These commands create model files for start and stop codons:
$DOCKER train-signal.pl WMM start-codons.fasta non-start-codons.fasta start-codons-12bp 1 ATG 6 3 12 0.95 0 30 0 0
$DOCKER train-signal.pl WMM start-codons.fasta non-start-codons.fasta start-codons-3bp 1 ATG 0 3 3 0.95 0 30 0 0
$DOCKER train-signal.pl WMM stop-codons.fasta non-stop-codons.fasta stop-codons-3bp 1 ATG 0 3 3 0.95 0 30 0 0

# These commands train donor and acceptor splice-site models:
$DOCKER logreg-splice.py donors.fasta non-donors.fasta 6 12 0.5 GT,GC,AT GT > donors.model
$DOCKER logreg-splice.py acceptors.fasta non-acceptors.fasta 20 2 0.5 AG,AC AG  > acceptors.model

# This command extracts training features for the exon definition model:
$DOCKER get-hexamer-counts.py internal-exons.fasta introns.fasta > hex-counts.txt

# These commands subset the training data to make training faster. This is for illustration only: in practice you should use at least 10,000 exons and 10,000 introns, in which case the logistic regression will take around 12 hours to run.
head -n 1000 hex-counts.txt > tmp.1
tail -n 1000 hex-counts.txt > tmp.2
mv hex-counts.txt hex-counts.txt.bak
cat tmp.1 tmp.2 > hex-counts.txt
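
Note that head and tail select overlapping lines if the file has fewer than 2000 lines. The sketch below repeats the same subsetting on a toy file in a temporary directory so the effect can be inspected safely. The toy data and the names demo_dir and subset.txt are made up for illustration and are not ACE+ output.

```shell
# Toy stand-in for hex-counts.txt: 5000 numbered lines in a temporary directory.
demo_dir=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 5000; i++) print i }' > "$demo_dir/hex-counts.txt"

# Guard: with fewer than 2000 lines, the head and tail selections would overlap.
n=$(wc -l < "$demo_dir/hex-counts.txt" | tr -d ' ')
if [ "$n" -lt 2000 ]; then
    echo "warning: only $n lines; head and tail would overlap"
fi

# Same subsetting steps as above, applied to the toy file.
head -n 1000 "$demo_dir/hex-counts.txt" > "$demo_dir/tmp.1"
tail -n 1000 "$demo_dir/hex-counts.txt" > "$demo_dir/tmp.2"
cat "$demo_dir/tmp.1" "$demo_dir/tmp.2" > "$demo_dir/subset.txt"

# Count the lines in the resulting subset.
total=$(wc -l < "$demo_dir/subset.txt" | tr -d ' ')
echo "subset lines: $total"
```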

# This command trains the exon definition hexamer weights:
$DOCKER logistic-regression.R hex-counts.txt 0.5 betas.txt

# These commands install the hexamer weights into exon and intron model files:
$DOCKER install-logistic-features-in-sensor.py betas.txt EXON exon.model
$DOCKER install-logistic-features-in-sensor.py betas.txt INTRON intron.model

# These commands download an intergenic model that can be used for any organism (it does not affect ACE+ predictions, but is still needed) and rename it:
wget https://raw.githubusercontent.com/bmajoros/ACE/master/intergenic.model
mv intergenic.model intergenic0-43.binmod

# This command runs ACE+, writing its output to aceplus.essex:
$DOCKER aceplus.pl aceplus.config genomes/ref-1.fasta genomes/HG00096-1.fasta genomes/local.gff 0 aceplus.essex

The output format for ACE+, and utilities for converting and parsing ACE+ output, are described on the original ACE web page.  ACE+ is described on a separate page.




Getting Help

If you encounter difficulties, please email bmajoros@duke.edu.









Hummingbird photo by Bill Majoros.  Used with permission.