This page gives a complete command list for running ACE+ via Docker. It walks through a full session: training the ACE+ models, building a personalized genome, and running the trained models on that personalized genome. This series of commands has been confirmed to run without error in Docker version 17.03.1-ce, build c6d412e.

The output format for ACE+, and utilities for converting and parsing ACE+ output, are described on the original ACE web page. ACE+ itself is described on a separate page.

As a prerequisite, you must have Docker installed, access to a UNIX terminal window, and the ability to download remote files via wget. Once the Docker daemon is running, the following commands should work on any system.

# This command downloads the 3GB ACE+ image for Docker:
docker pull genezilla/aceplus

# This command makes a convenient environment variable, $DOCKER, which will simplify the following commands. Please change "/home/bmajoros" to your working directory:
export DOCKER="docker run -w /root -v /home/bmajoros:/root -it genezilla/aceplus"

# This command downloads the human genome in 2bit format:
wget http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

# This command extracts chromosome lengths from the 2bit genome file:
$DOCKER twoBitInfo hg19.2bit hg19.lengths

# This command downloads a set of annotations for chromosome 22:
wget www.geneprediction.org/ACE/chr22.gff

# This command downloads a VCF file of genetic variants for chr22:
wget www.geneprediction.org/ACE/chr22.vcf.gz

# These commands re-compress the VCF file using bgzip and run tabix to index it for efficient extraction of variants within genes:
gunzip chr22.vcf.gz
$DOCKER bgzip chr22.vcf
$DOCKER tabix chr22.vcf.gz

# These commands download a set of training genes:
wget www.geneprediction.org/ACE/iso0-100.fasta.gz
gunzip iso0-100.fasta.gz
wget www.geneprediction.org/ACE/iso0-100.gff.gz
gunzip iso0-100.gff.gz

# This command downloads a substitution matrix for alignment:
wget https://raw.githubusercontent.com/bmajoros/ACE/master/pam10

# This command downloads a list of individuals for whom we wish to construct personal genomes:
wget www.geneprediction.org/ACE/gender.txt

# This command downloads a sample configuration file:
wget https://raw.githubusercontent.com/bmajoros/ACE/master/aceplus.config

# These commands create a personalized genome for individual HG00096:
mkdir genomes
$DOCKER make-individual-genomes.pl aceplus.config chr22.gff genomes

# This command extracts positive training features:
$DOCKER get-examples.pl iso0-100.gff iso0-100.fasta GT,GC,AT AG,AC ATG TGA,TAA,TAG

# This command extracts negative training features:
$DOCKER get-negative-examples.pl

# These commands create model files for start and stop codons:
$DOCKER train-signal.pl WMM start-codons.fasta non-start-codons.fasta start-codons-12bp 1 ATG 6 3 12 0.95 0 30 0 0
$DOCKER train-signal.pl WMM start-codons.fasta non-start-codons.fasta start-codons-3bp 1 ATG 0 3 3 0.95 0 30 0 0
$DOCKER train-signal.pl WMM stop-codons.fasta non-stop-codons.fasta stop-codons-3bp 1 ATG 0 3 3 0.95 0 30 0 0

# These commands train donor and acceptor splice-site models:
$DOCKER logreg-splice.py donors.fasta non-donors.fasta 6 12 0.5 GT,GC,AT GT > donors.model
$DOCKER logreg-splice.py acceptors.fasta non-acceptors.fasta 20 2 0.5 AG,AC AG > acceptors.model

# This command extracts training features for the exon definition model:
$DOCKER get-hexamer-counts.py internal-exons.fasta introns.fasta > hex-counts.txt

# These commands subset the training data to make training faster. This is for illustration only: in practice you should use at least 10,000 exons and 10,000 introns, in which case the logistic regression will take around 12 hours to run.
head -n 1000 hex-counts.txt > tmp.1
tail -n 1000 hex-counts.txt > tmp.2
mv hex-counts.txt hex-counts.txt.bak
cat tmp.1 tmp.2 > hex-counts.txt

# This command trains the exon definition hexamer weights:
$DOCKER logistic-regression.R hex-counts.txt 0.5 betas.txt

# These commands install the hexamer weights into exon and intron model files:
$DOCKER install-logistic-features-in-sensor.py betas.txt EXON exon.model
$DOCKER install-logistic-features-in-sensor.py betas.txt INTRON intron.model

# This command downloads an intergenic model that can be used for any organism (it does not affect ACE+ predictions, but is still needed):
wget https://raw.githubusercontent.com/bmajoros/ACE/master/intergenic.model
mv intergenic.model intergenic0-43.binmod

# This command runs ACE+, writing its output to aceplus.essex:
$DOCKER aceplus.pl aceplus.config genomes/ref-1.fasta genomes/HG00096-1.fasta genomes/local.gff 0 aceplus.essex
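
Because the session above consists of many steps that each depend on files produced by earlier ones, it can help to confirm that the expected files exist before moving on or after the final run. The following is a minimal sketch of such a check, not part of ACE+ itself: check_files is a hypothetical helper name, and the sketch assumes it is run from the working directory that was mounted into the container. The file names are those used in the commands above, plus chr22.vcf.gz.tbi, the index that tabix writes next to the VCF file.

```shell
#!/bin/sh
# Hypothetical helper (not part of ACE+): report which of the given files
# are missing or empty. Returns 0 only if every file exists and is non-empty.
check_files() {
    status=0
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "ok $f"
        else
            echo "MISSING $f"
            status=1
        fi
    done
    return "$status"
}

# Downloads, intermediate files, trained models, and the final ACE+ output.
# Assumption: the current directory is the one mounted at /root in $DOCKER.
check_files hg19.2bit hg19.lengths chr22.gff chr22.vcf.gz chr22.vcf.gz.tbi \
    iso0-100.fasta iso0-100.gff pam10 gender.txt aceplus.config \
    donors.model acceptors.model hex-counts.txt betas.txt \
    exon.model intron.model intergenic0-43.binmod aceplus.essex \
    || echo "Some files are missing or empty; rerun the corresponding step."
```

A file reported as MISSING usually points to the step that should have created it; an empty donors.model, for instance, would suggest that the logreg-splice.py command failed while the shell redirection still created the file.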