NEWS
|
DescriptionACE (Assessing Changes to Exons) is a software suite designed to detect changes in gene structures in the genomes of individuals or strains, and to interpret those changes as to possible loss of function. It has been applied to thousands of human genomes and to numerous plant samples. It was developed in the Reddy lab at Duke University in collaboration with the Yandell lab at the University of Utah and Michael Campbell at Cold Spring Harbor Laboratory. ACE begins by constructing an explicit genome sequence from an indexed
VCF file and a reference genome. ACE utilizes the VCF file to
rapidly infer an alignment of the reference sequence to the individual
genome and uses the alignment to compute a coordinate transformation to
be used to map annotations from the reference to the individual
genome.
ACE analyzes all projected splice sites for changes to either the consensus sequence or to flanking bases that influence splice site strength. ACE reports all splice sites that are disrupted or substantially weakened in the individual genome, and enumerates possible alternate splice patterns that may result from loss of the reference splice site, including those arising through exon skipping, intron retention, or cryptic-site activation (exon skipping and intron retention can be individually enabled or disabled via a configuration file). Noncoding genes are included in this analysis. For protein-coding genes, ACE additionally analyzes the reading frame in each isoform and reports any change that may result in protein truncation or mRNA decay. Unlike comparative gene finders that model homologous genes in different species and assume conserved reading frames, ACE is designed to compare genomes of the same species, and is specifically designed to identify genes that may have become inactivated or changed in function via recent sequence modifications.
When processing VCF files, ACE can apply single-nucleotide variants,
insertions, deletions, and short copy-number variants. ACE
left-normalizes variants and resolves overlapping variants in the same
haplotype that are compatible. ACE provides multiple filtering
levels for sequences containing questionable variants (e.g., any
variant overlapping an inconsistent variant phased to the same
haplotype). Because ACE uses indexed VCF files and indexed genome
files, it can extract individual genes efficiently, enabling gene-level
parallelization for processing large samples. ACE applies
all variants within a gene that are present on a haplotype
simultaneously, so that the predicted changes to splicing patterns
and/or reading frames reflect the combined effect of any and all
variants that may interact (for example, when one variant alters a
reading frame and one or more downstream variants correct it).
ACE can detect the loss of start codons resulting from either splicing
changes or directly from sequence variants, and applies a scanning
model to identify likely alternate start codons; a retrainable scoring
matrix can be used to ensure that candidate start codons have
appropriate flanking bases (e.g., a strong Kozak sequence).
ACE
can detect both premature stop codons and variants that result in an
absence of a stop codon anywhere in the reading frame, either of which
may lead to mRNA decay and partial or complete loss of function.
ACE is highly configurable and includes options for use on plant
genomes, such as prediction of intron retention. ACE produces
structured output in multiple formats, including XML, and provides
parsers and command-line utilities for convenient extraction of
specific information and summary statistics.
Example
Inputs and Outputs
|
IMPORTANT UPDATE: Please add the full path to the ACE/perl
directory created above to your $PATH environment
variable, so that ACE scripts can call programs in the ACE/perl
directory. |
$TMPDIR
: absolute path to a
location where temporary files can be made (this might already be set
by your system administrator)$ACE
: absolute path to the ACE
directory created above$PERLLIB
: absolute path to the ACE/perl
directory created by the git clone command$GSLDIR
: absolute path to the Gnu
Scientific Library$TMPDIR
for you.docker pull
genezilla/ace
Dockerfile
for ACE and build it using the docker build
command:wget https://raw.githubusercontent.com\
/bmajoros/ACE/master/Dockerfile
mkdir ace
mv Dockerfile ace
docker build -t genezilla/ace ace
docker pull
or built manually using docker build
), you can execute
any ACE command by first prefixing it with: docker run
-w /root -v
X:/root
-it
genezilla/ace
, where X is the absolute path to the current directory where you're working. For example, to build personal genomes
from
a VCF file, the command would be:docker
run -w /root -v DIR:/root -it genezilla/ace make-individual-genomes.pl
transcript_id
and
a gene_id
. Coding segments must be labeled as CDS
,
and untranslated regions/exons (in coding or noncoding genes) should
simply be labeled exon
. An incomplete list of
some sources of annotations are
provided below.sample.config
, which you can
customize:donor-consensus
= GT,AT,GC
acceptor-consensus = AG,AC
start-codons
= ATG
stop-codons
= TGA,TAG,TAA
max-splice-shift = 70
min-exon-length = 30
min-intron-length = 30
NMD-distance
= 50
allow-exon-skipping = true
allow-cryptic-sites = true
allow-intron-retention = false
start-codon-model =
start-codons-12bp.model
short-start-codon-model = start-codons-3bp.model
stop-codon-model =
stop-codons-3bp.model
donor-model
= donors57-100.model
acceptor-model =
acceptors57-100.model
gap-open-penalty = 5
gap-extend-penalty = 10
subst-matrix
= pam10
bandwidth
= 50
ploidy
= 2
margin-around-gene = 1000
min-orf-length
= 150
twoBitToFa
= /usr/local/bin/twoBitToFa
tabix =
/usr/local/bin/tabix
genome =
/data/reddylab/Reference_Data/hg19.2bit
chr-lengths = /data/reddylab/Reference_Data/hg19.genome
gender =
/home/bmajoros/1000G/vcf/gender.txt
individuals =
/home/bmajoros/1000G/assembly/fasta/Geuvadis.txt
vcf =
/home/bmajoros/1000G/vcf
vcf-lacks-chr = true
max-splice-shift
is the maximum distance from a disrupted splice site that ACE will
search
for a cryptic site to use in place of the disrupted site; for humans, a
distance of 70 to 150 is recommended (i.e., 70 to favor specificity,
150 to favor sensitivity).NMD-distance
refers to the nucleotide distance (on the
spliced transcript) between a stop codon and the
most 3' exon
junction; stop codons further upstream than this distance
may trigger NMD. This option allows the default of 50 bp to be
overridden.*.model
files are included in the ACE directory for use on human genomes; for
other organisms these must be retrained,
as described later. The short-start-codon-model
captures only the three positions containing the start codon (usually ATG), for use in contexts where
flanking positions may fall outside the annotated transcript.gap-open-penalty
,
gap-extend-penalty
, and bandwidth
are parameters for aligning proteins (to detect large changes to
protein sequence).pam10
is an amino-acid substitution matrix included in the ACE
directory. Please include the full path to this file.twoBitToFa
and tabix
are the full paths to where these programs are
installed on your system.genome
is the full path to the 2bit genome file on your system.vcf
is
the path to the directory where bgzipped VCF files are located (along
with *.tbi
index files).ploidy
values of 1 or 2 are currently supported, and values >2 are
currently in beta testing for the next release.individuals
is the full path to a file listing the identifiers of the individuals
for which ACE is to perform analyses. Note
that all individuals in the VCF file must also be listed in this
file. In order to run ACE on a subset of individuals, your VCF
file must include only columns for those individuals.gender
is required only for organisms having
sex chromosomes: if present, it lists one individual per line:
HG00096 male
HG00097 female
HG00099 female
HG00100 female
HG00101 male
... ...
chr-lengths
is the path to a file
giving the length of each chromosome, as illustrated below:chr1
249250621
chr2 243199373
chr3 198022430
... ...
$ACE
set to the proper path in order for this script to work:make-individual-genomes.pl <ace.config>
<genes.gtf> <out-dir>
<ace.config>
parameter is the path to the
configuration file described above. The GFF2 file <genes.gtf>
specifies the coordinates of the genes to be included in the
constructed genome. Recall that in the configuration file the individuals
element points to a file listing the identifiers of the individuals for
which genomes are to be built.chr1
ensembl transcript 1002
17286 .
-
.
transcript_id=ENST0000037825
1;gene_id=ENSG00000130764;
chr1
ensembl CDS 16644 17258
0
-
0
transcript_id=ENST0000037825
1;gene_id=ENSG00000130764;
chr1
ensembl CDS 7631 8092
0
-
0
transcript_id=ENST0000037825
1;gene_id=ENSG00000130764;
chr1
ensembl CDS 5869 5985
0
-
0
transcript_id=ENST0000037825
1;gene_id=ENSG00000130764;
chr1
ensembl CDS 4778 4893
0
-
0
transcript_id=ENST0000037825
1;gene_id=ENSG00000130764;
chr1
ensembl CDS 3443 3545
0
-
2
transcript_id=ENST0000037825
1;gene_id=ENSG00000130764;
chr1
ensembl CDS 2294 2383
0
-
0
transcript_id=ENST00000378251;gene_id=ENSG00000130764;
chr1
ensembl CDS 1870
2118 0
-
0
transcript_id=ENST00000378251;gene_id=ENSG00000130764;
chr1
ensembl five_prime_UTR 17259 17286
0
-
0
transcript_id=ENST00000378251;gene_id=ENSG00000130764;
chr1
ensembl three_prime_UTR 1002 1869
0
-
1
transcript_id=ENST00000378251;gene_id=ENSG00000130764;
transcript_id "ENST0000037825
1";
gene_id "ENSG00000130764
"CDS
elements, whereas
noncoding genes should simply have exon
elements.
Remember that coordinates in GFF are 1-based and the end coordinate is not inclusive.normalize-gff.pl
(in the $ACE
directory) on your GTF file to ensure that coding and noncoding genes
are properly annotated as such.CHROM
line and variant line from a 1000 Genomes Project VCF file are shown below:#CHROM
POS ID REF ALT QUAL FILTER
INFO FORMAT HG00099
chr1 10177
rs367896724 A AC 100 PASS
VT=INDEL GT
0|1
PASS
in the 7th column and GT
in the 9th column, and ./.
genotypes ("uncalled variants") are not permitted. Extra fields
in the genotype column are also not permitted. If your VCF
contains a null call for every position in the genome, these should be
removed prior to running ACE. Note that the substrate (chr1
in this case) should have a prefix of chr
and this should
be consistent across all VCF and GFF files.<out-dir>
.
Each file will be a multi-FASTA file containing the gene sequences for
each individual; a ref.fasta
file will contain the
reference sequences for comparison. When ploidy is greater than
1, a FASTA file will be generated for each haplotype; for example, for
a diploid individual with identifier HG00096, there will be two output
files: HG00096-1.fasta
and HG00096-2.fasta
.
A file called local.gff
is also generated, which provides gene structure annotations with exon
coordinates converted to local offsets for the corresponding reference
sequences./key=value
elements on the defline. These
include the coordinates of the gene on the reference chromosome,
a CIGAR
string giving the alignment of the reference sequence to the
individualized sequence, the variants that were applied, and the
numbers of warnings and errors generated by application of those
variants: >ENSG00000272636.1_1
/coord=chr17:4809-32427:- /margin=1000
/cigar=2501M1I2300M1D3390M2D5052M /variants=
rs2223138:chr17:156:156:A:G, rs2894723:chr17:3104:3104:C:T,
rs2396789:chr17:3737:3737:T:C, rs2894722:chr17:3758:3758:A:G,
rs8081895:chr17:4551:4551:A:G, rs36068254:chr17:5773:5773:A:G,
rs34663111:chr17:6933:6933:A:G /warnings=0 /errors=0
make-individual-genomes.pl
: ace.pl <config-file> <reference.multi-fasta>
<individual.multi-fasta> <reference.gtf>
<max-vcf-errors>
<out.essex>
<config-file>
is either a
configuration file like the one shown above, or a directory containing
a set of configuration files, one per isochore
(G+C content interval). In the latter case, the configuration
files must be named ace.xx-yy.config
; for example, ace.43-51.config
for sequences having an average G+C content between 43% and 51%.<reference.multi-fasta>
is the
reference FASTA file generated by make-individual-genomes.pl
.<individual.multi-fasta>
is
the FASTA file containing an individualized genome sequence for a
sample.<reference.gtf>
provides
reference annotations to be mapped; in most cases this will be the local.gff
file generated by make-individual-genomes.pl
.<max-vcf-errors>
is the
maximum number of VCF errors allowed before triggering a too-many-vcf-errors
flag in the output<out.essex>
is the output
file, which is a structured file that can be queried using provided
scripts, or converted to XML or GFF for import into other programs.essex-to-xml.pl
,
converts the output to XML so that external XML-processing software can
be used instead of the provided utilities, if desired.Source | URL | Genomes and Annotations Available |
GENCODE | https://www.gencodegenes.org/ | Human and mouse |
NCBI Genomes | ftp://ftp.ncbi.nlm.nih.gov/genomes/ | Multiple species |
RefSeq | ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/ | Multiple species (curated subset of NCBI Genomes) |
ENSEMBL | http://uswest.ensembl.org/info/data/ftp/index.html | Multiple species |
Joint Genome Institute | http://genome.jgi.doe.gov/ | Multiple species |
PlantGDB | http://www.plantgdb.org/ | Multiple plant species |
FlyBase | http://flybase.org/ | Drosophila species |
WormBase | http://www.wormbase.org/ | Nematode species |
Hymenoptera Genome Database | http://hymenopteragenome.org/ | Bee, wasp, and ant species |
make-individual-genomes.pl
and ace.pl
can be run on a GTF file containing only a subset of the genes to be
analyzed. By splitting the full gene set into N subsets (GTF files) and
submitting N batch jobs to
process those GTF files on N
different compute nodes, processing can ideally approach an N-fold speedup. Note,
however, that processing long genome sequences is I/O intensive, and
that an N-fold speedup may
not be fully achieved if your I/O subsystem has limited bandwidth to
compute nodes.get-examples.pl
script in the $ACE
directory can be used to extract a set of example splice sites, start
codons, and stop codons to use for training. The script requires
a set of annotated genes (in GTF format) on a given substrate (FASTA file):get-examples.pl
<*.gtf> <*.fasta> <donor-consensuses>
<acceptor-consensuses> <start-codon-consensuses>
<stop-codon-consensuses>
$ACE/get-examples.pl genes.gff genome.fasta GT,GC,AT AG,AC
ATG TAA,TGA,TAG
donors.fasta
containing example donor splice
sites, a file acceptors.fasta
containing example acceptor
splice sites, and files start-codons.fasta
and stop-codons.fasta
containing example start and stop codons. These examples can be
used to train a signal sensor for each type of signal (splice site, start/stop
codon), using the train-signal.pl
script:train-signal.pl <model-type> <pos.fasta>
<neg.fasta>
<filestem> <% train> <signal-type>
<consensus-offset>
<consensus-length> <context-window-length>
<min-sens>
<order> <min-sample-size>
<boosting-iterations>
<boosting-percentile>
train-signal.pl
WMM donors.fasta non-donors.fasta donor 1 GT
4 2 13 0.98 0 0 0 0
donor.model
. It will use all of the
examples in donors.fasta
to train the model (rather than withholding some examples for
independent testing of the trained model), because the fifth parameter
(1) specifies that 100% of the examples should be used for
training. The consensus-offset
, consensus-length
,
and context-window-length
parameters are illustrated by
the figure below:train-signal.pl
WMM acceptors.fasta non-acceptors.fasta
acceptor 1 AG 5 2 12 0.98 0 0 0 0
train-signal.pl
WMM start-codons.fasta non-start-codons.fasta
start-codon 1 ATG 6 3 11 0.98 0 0 0 0
train-signal.pl
WMM stop-codons0-100.fasta
non-stop-codons0-100.fasta stop-codons0-100 1 TAG 0 3 3
0.98 0 0 0 0
min-orf-length
parameter is used to
detect ORFs in genes annotated as noncoding, as a means of detecting
possible mis-annotated genes. For human genes, 150nt provides a
good cutoff for separating true positives from false positives.
For nonhuman organisms, you can use the script random-orfs.pl
in the ACE/perl
directory to sample a set of random ORFs from a chromosome FASTA file and compile their lengths:$ACE/perl/random-orfs.pl chr1.fasta TGA,TAA,TAG 1000 > random-orf-lengths.txt
TGA,TAA,TAG
is the list of functional stop codons
for your organism. Similarly, you can compile a list of real ORF
lengths using the following commands:$ACE/perl/get-transcripts.pl genes-on-chr1.gff chr1.fasta transcripts.fasta
$ACE/perl/fasta-seq-lengths.pl transcripts.fasta > orf-lengths.txt
genes-on-chr1.gff
is a GTF (GFF2) file containing
gene annotations that include CDS elements for protein-coding
genes. You can now use a tool such as R to plot the distributions
of orf-lengths.txt
and random-orf-lengths.txt
and then choose a length cutoff that separates these two distributions:d1<-read.table("orf-lengths.txt")
d2<-read.table("random-orf-lengts.txt")
h1<-density(d1$V1)
h2<-density(d2$V1)
plot(h1)
lines(h2)
report
element for each transcript. Essex is hierarchically organized,
so that under the report
element are additional elements:(report
(substrate ENSG00000249915_1)
(transcript-ID ENST00000505221)
(gene-ID ENSG00000249915)
...
substrate
identifier is extracted from the identifier on the defline of the
reference FASTA file.
When building the individualized
sequence, ACE places information on the defline indicating the number
of vcf-warnings
(ambiguous variants that were resolved)
and vcf-errors
(ambiguous variants that were not
resolved): (vcf-warnings 0)
(vcf-errors 0)
alignment
element describes the alignment from the
reference sequence to the
individualized sequence, using
a CIGAR string that includes I
elements for insertions, D
elements for deletions, and M
elements for matches/mismatches: (alignment 17692M4I11041M2D2837M2D11837M)
reference-transcript
element: (reference-transcript
(ID ENST00000505221)
(gene ENSG00000249915)
(type protein-coding)
...
mapped-transcript
element: (mapped-transcript
(ID ENST00000505221)
(gene ENSG00000249915)
(type protein-coding)
(substrate ENSG00000249915_1)
...
status
element indicates the
result of mapping: (status mapped)
mapped
" indicates that the mapping was perfect
(all splice sites remain valid, the start codon remains valid, and the
stop codon remains valid). When running ACE with the -q
(quiet) option, transcripts that map perfectly are not reported.
For protein-coding genes, when the protein changes, ACE reports
information about the change via additional elements under the status
element: (status
mapped
(protein-differs
(percent-match 16.56
477/2881 ref=2881 alt=770))
(frameshift
(nt-with-phase-mismatch 6424)
(percent-phase-mismatch 74.3%)))
mapped
indicates that all splice sites
remain valid, but the protein-differs
indicates that the
protein is not identical to the reference protein. The percent-match
(number of exactly matching amino acids, divided by the length of the
reference protein) is very small due to a frameshift affecting 6424
nucleotides in the original coding segment; 74.3% is the proportion of
the reference coding segment that is out of phase in the mapped gene
structure.premature-stop
element
indicates that the stop codon is earlier in the mapped sequence than in
the reference sequence: (status
mapped
(premature-stop NMD
(EJC-distance 6019)
NMD
) (the other possible outcome of a
premature stop codon is protein-truncation
). NMD is
generally assumed to occur if a premature stop codon occurs at least 50
nt upstream of the last exon junction, where distance is measured on
the spliced transcript. ACE reports the observed EJC-distance
to enable further downstream filtering, if desired. (status
mapped
(new-upstream-start-codon
(new-start
ggaagc_ATG_gct -9.89305 cutoff: -14.0648)
(old-start
cagctg_ATG_gac -13.6501 cutoff: -14.0648)
(reference
bad-consensus)
(ORF-length 1173 =>
1230)
(transcript...
transcript
element. Scans for upstream start codons also occur for predicted
novel splice forms upon disruption of a splice site (see below).mapped-transcript
.
Upstream start codons can denote the beginning of a uORF, or upstream open reading frame.
Many human genes have uORFs in addition to their primary annotated
ORF, and while these uORFs typically have stop codons early in the
transcript, these do not always induce degradation via NMD. For
that reason, when a uORF is detected and NMD is predicted according to
the 50 nt rule for downstream ORFs, ACE will report this as hypothetical-NMD
to denote that it is much less confident that NMD will actually occur.
(status
splicing-changes
(broken-donor
10891 ccagca_GC_gtctgccctg -27.2988
ccagca_GG_gtctgccctg
-31.4097 threshold: -21.591)
(alternate-structures
(transcript
(structure-change exon-skipping)
(ID
ENST00000531190)
(gene ENSG00000110455)
(type noncoding)
...
cryptic-site
element:fate
element indicates the predicted outcome of splicing the transcript in
that way: (fate noncoding)
intron-retention
is enabled in the configuration
file, ACE will generate an alternative transcript with an intron
retention event. Otherwise, ACE will indicate that transcription is
likely aborted due to an unresolvable splicing outcome from the
disrupted splice site: (status
no-transcript
(broken-donor 2716 GA))
(weakened-donor
37043 cctggg_GT_aaggggccat -19.5229
cctggg_AT_aaggggccat -23.887
threshold: -21.591)
no-start-codon
condition: (status
mapped
no-start-codon)
(status
mapped
nonstop-decay)
min-orf-length
parameter in the configuration file. If an ORF is found, ACE also
scans the reference transcript to see if the ORF existed in the
reference. If the ORF did not exist in the reference, a flag of ref-no-start-codon
is reported. If the ORF existed in the
reference but was shorter than min-orf-length
, and if the
new ORF is at least as long as min-orf-length
and is at
least twice as long as the ORF in the reference sequence, then the new
ORF is marked via the flag ref-ORF-too-short
. If
the ORF is long enough in both the reference and alternate sequences,
ACE reports possible-misannotation
,
as the reference may be incorrectly annotated as noncoding. Note
that these ORFs are required to begin with a valid start codon, but
they may lack a valid stop codon; this is to allow for the possibility
that the annotation is incomplete at the 3' end.variants
element. These are then grouped according to the context of the
variant, so that variants within a coding segment (CDS), within UTR,
within a splice site, or near a splice site (within 50 nt) are reported
together. Other than variants near a splice site, intronic
variants are not currently reported in the Essex output, though they
are present on the defline of the generated FASTA file. For each
variant, the variant identifier, original substrate (chromosome),
reference location, mapped location, reference allele, and alternate
(applied) allele are given:(CDS-variants rs29674:chr5:23887:23887:A:C ...
(UTR-variants rs4956990:chr5:43761:43762:A:G ...
(near-splice-variants rs10794716:chr10:47730:47717:T:C ...
(splice-site-variants rs10853954:chr19:10969:10937:T:C ...
git@github.com:bmajoros/perl.git
)
are provided for parsing the structured Essex output.
Alternatively, you may wish to parse the XML output using a
publicly-available XML parser; the XML and Essex outputs contain the
same information. Converting to XML is described below (using the
essex-to-xml.pl
script
in the $ACE
directory).UPDATE: Essex parsers are now available in Perl, C++, and Python. The C++ parser is included in the main ACE repository. The Python parser is in beta testing, and can be downloaded here: https://github.com/bmajoros/python. |
use
EssexParser;
use EssexACE;
use Transcript;
my $parser=new EssexParser($infile);
while(1) {
my $root=$parser->nextElem();
last unless $root;
my $ace=new EssexACE($root);
my
$transcriptID=$ace->getTranscriptID();
my
$status=$ace->getStatusString();
if($status eq "mapped") {
my
$transcript=$ace->getMappedTranscript();
print
$transcript->toGff();
}
elsif($status eq
"splicing-changes") {
my
$transcripts=$ace->getAltTranscripts();
my $n=@$transcripts;
for(my $i=0 ; $i<$n
; ++$i) {
my
$transcript=$transcripts->[$i];
my
$id=$transcript->getTranscriptId();
$id="ALT$i\_$id";
$transcript->setTranscriptId($id);
print
$transcript->toGff();
}
}
}
EssexParser
package parses an Essex file and returns
a series of tree data structures representing successive elements in
the file:package
EssexParser;
#
$parser=new EssexParser($filename);
# $parser->close();
#
$tree=$parser->nextElem(); # returns root of the tree
#
$forest=$parser->parseAll(); # returns an array of trees
EssexParser
are
composed of EssexNode
objects, which can be traversed or
queried in various ways:package
EssexNode;
# $tag=$node->getTag();
# $n=$node->numElements();
# $elem=$node->getIthElem($i);
# $elem=$node->findChild($tag);
# $array=$node->findChildren($tag);
# $array=$node->findDescendents($tag);#
always returns an array
# $elem=$node->findDescendent($tag); #
returns node or undef
#
$bool=$node->hasDescendentOrDatum($tagOrDatum);
#
$count=$node->countDescendentOrDatum($tagOrDatum);
# $n=$node->countDescendents($tag);
# $bool=$node->hasDescendent($tag);
#
$string=$node->getAttribute($attributeTag);
# $array=$node->getElements();
# $bool=EssexNode::isaNode($datum);
# $bool=$node->hasCompositeChildren();
# $node->print($filehandle);
# $node->printXML($filehandle);
#
$array=$node->pathQuery("report/reference-transcript/type");
EssexParser
provides generic parsing
capabilities for any Essex file (whether produced by ACE or some other
application), the EssexACE
module provides a more
detailed representation and querying facilities for Essex files
produced by ACE:package
EssexACE;
#
$aceReport=new EssexACE($essexReportElem);
#
$aceReport->changeMinPercentMatch($x); # example: 75.3
# $substrate=$aceReport->getSubstrate();
#
$transcriptID=$aceReport->getTranscriptID();
# $geneID=$aceReport->getGeneID();
#
$vcfWarnings=$aceReport->getNumVcfWarnings();
#
$vcfErrors=$aceReport->getNumVcfErrors();
# $cigar=$aceReport->getCigar();
#
$transcript=$aceReport->getRefTranscript();
#
$transcript=$aceReport->getMappedTranscript();
#
$statusString=$aceReport->getStatusString();
# status =
mapped/splicing-changes/no-transcript/bad-annotation
#
$bool=$aceReport->hasBrokenSpliceSite();
#
$array=$report->getBrokenSpliceSites(); [pos,type=GT/AG]
#
$array=$aceReport->getAltTranscripts();
# $bool=$aceReport->proteinDiffers();
#
$percent=$aceReport->getProteinMatch(); # example: 98.57 (whole
number)
# $bool=$aceReport->frameshift();
#
$percent=$aceReport->frameshiftPercentMismatch();
# example: 83 (whole num)
#
$nucs=$aceReport->frameshiftNucMismatch();
# $bool=$aceReport->mappedPTC();
# premature stop codon
when status="mapped"
# $bool=$aceReport->mappedNMD(50); #
only valid when status="mapped"
# $bool=$aceReport->mappedNoStart(); #
assumes status="mappped"
# $bool=$aceReport->mappedNonstop(); #
no stop codon; status="mapped"
# $bool=$aceReport->refIsCoding();
# $bool=$aceReport->mappedIsCoding();
# $bool=$aceReport->lossOfCoding(); #
ref is coding, alt is noncoding
#
$bool=$aceReport->allAltStructuresLOF();
#
status=splicing-changes
#
$bool=$aceReport->allExonSkippingLOF(); # status=splicing-changes
#
$bool=$aceReport->allExonSkippingNMD(); # status=splicing-changes
reference-transcript
, mapped-transcript
,
or transcript
element in an Essex file can be converted into a Transcript
object that
can then be queried in various ways specific to transcripts:package
Transcript;
# $transcript=new Transcript($essexNode);
# $rawExons=$transcript->getRawExons();
#
$success=$transcript->loadExonSequences(\$chromosome);
#
$seq=$transcript->loadTranscriptSeq(\$chromosome);
# $bool=$transcript->equals($other);
#
$bool=$transcript->overlaps($otherTranscript);
#
$bool=$transcript->overlaps($begin,$end);
# $len=$transcript->getLength(); # sum
of exon sizes
# $len=$transcript->getExtent(); #
end-begin
#
($begin,$end)=$transcript->getCDSbeginEnd();
# $n=$transcript->numExons();
# $exon=$transcript->getIthExon($i);
# $n=$transcript->numUTR();
# $utr=$transcript->getIthUTR($i);
# $transcript->getSubstrate();
# $transcript->getSource();
# $gff=$transcript->toGff();
# $id=$transcript->getID();
# $id=$transcript->getTranscriptId();
# $id=$transcript->getGeneId();
# $begin=$transcript->getBegin();
# $end=$transcript->getEnd();
#
$strand=$transcript->getStrand();
#
if($transcript->isWellFormed(\$sequence));
# $transcript->getScore();
# my
$nextExon=$transcript->getSuccessor($thisExon);
# $transcript->shiftCoords($delta);
#
$transcript->reverseComplement($seqLen);
#
$transcript->setStopCodons({TGA=>1,TAA=>1,TAG=>1});
# $g=$transcript->getGene();
#
$genomicCoord=$transcript->mapToGenome($transcriptCoord);
#
$transcriptCoord=$transcript->mapToTranscript($genomicCoord);
#
$exon=$transcript->getExonContaining($genomicCoord);
# $array=$transcript->getSpliceSites();
#
$array=$transcript->parseExtraFields(); # array of [key,value] pairs
#
$hash=$transcript->hashExtraFields(\@keyValuePairs);
Transcript
objects are composed of Exon
objects, shown below. Note that both CDS (coding segment) and UTR
exons are represented, separately, as Exon
objects.
The Transcript::getRawExons()
function coalesces these
together into exons regardless of coding status.package
Exon;
# $exon->containsCoordinate($x) :
boolean
# $length=$exon->getLength();
# $exon->reverseComplement($seqLen);
#
$strand=$exon->getStrand(); # "+" or "-"
# $bool=$exon->overlaps($otherExon)
# $sequence=$exon->getSequence();
# $transcript=$exon->getTranscript();
# $gff=$exon->toGff();
# $begin=$exon->getBegin();
# $end=$exon->getEnd();
# $frame=$exon->getFrame();
# $type=$exon->getType();
# $score=$exon->getScore();
# $exon->shiftCoords($delta);
# $substrate=$exon->getSubstrate();
$ACE
directory.essex-to-xml.pl
<in.essex>
$ACE/essex-to-xml.pl 1.essex > 1.xml
essex-to-tsv.pl
<in.essex>
$ACE/essex-to-tsv.pl 1.essex > 1.tsv
GeneID
TranscriptID RefSubstrate
AltSubstrate VcfWarnings
VcfErrors Begin
End Strand Type
Mapped SplicingChanges
BrokenSpliceSite
ProteinDiffers Frameshift
PrematureStop NMD
StartCodonChange
NoStartCodon NoTranscript
NonstopDecay
ENSG00000272636.1
ENST00000343572.7 chr17
ENSG00000272636.1_1
0
0 1000
26618 -
protein-coding 1
0
0
0
0
0
0
0
0
0 0
ENSG00000272636.1
ENST00000609727.1 chr17
ENSG00000272636.1_1
0
0 8521
26152 -
protein-coding 0
0
0
0
0
1
0
0
0
0 0
...etc...
essex-to-gff.pl <in.essex>
<out.gff>
$ACE/essex-to-gff.pl 1.essex
out.gff
essex-dump-status.pl
<in.essex>
$ACE/essex-dump-status.pl 1.essex
ENSG00000272636.1_1
0/0 ENST00000343572.7 mapped
ENSG00000272636.1_1 0/0
ENST00000609727.1 bad-annotation bad-start premature-stop
ENSG00000174837.10_1 0/0
ENST00000450315.3 splicing-changes broken-donor
ENSG00000184731.5_1
0/0 ENST00000327669.4 mapped protein-differs
...etc...
essex-filter-errors.pl
<in.essex>
STDOUT
any report that does not
contain a bad-annotation
flag or a too-many-vcf-errors
flag. Bad annotations include reference annotations having
nonconsensus splice sites not listed in the configuration file, lacking
a start or stop codon, or being predicted to suffer NMD (nonsense-mediated decay). essex-get-gene.pl
<in.essex> <geneID>
$ACE/essex-get-gene.pl
1.essex ENSG00000227232.4
essex-get-transcript.pl
<in.essex> <transcriptID>
$ACE/essex-get-transcript.pl
1.essex
ENST00000538476.1
essex-get-loss-of-function.pl
<in.essex> <min-percent-match>
$ACE/essex-get-loss-of-function.pl
1.essex 50
ENSG00000060237
ENST00000530271 NMD frameshift
ENSG00000145506 ENST00000382730 NMD
sequence-variant
ENSG00000096996 ENST00000600835 NMD
broken-acceptor
ENSG00000032389 ENST00000443925
protein-differs sequence-variant
ENSG00000154035 ENST00000468196
protein-differs frameshift
ENSG00000145494 ENST00000469176 nonstop-decay
sequence-variant
ENSG00000171987 ENST00000307616 nonstop-decay
frameshift
essex-summarize-status.pl <in.essex>
$ACE/essex-summarize-status.pl
1.essex
no-transcript 3
broken-acceptor 1
broken-donor 2
splicing-changes
51
alternate-structures 51
broken-acceptor 26
broken-donor 25
mapped 12671
nonstop-decay 13
noncoding 3383
start-codon-change 38
protein-differs 9170
premature-stop 173
new-upstream-start-codon 46
noncoding-to-coding 3362
weakened-donor 68
weakened-acceptor 103
frameshift 159
essex-tabulate-changes.pl
<in.essex>
$ACE/essex-tabulate-changes.pl
1.essex
GENES_PROTEIN_CHANGE
4060
GENES_CHANGE_START_CODON
23
CODING_TRANSCRIPTS_NO_START_CODON
0
GENES_FRAMESHIFT
107 14667 3073227 166
GENES_NONSTOP_DECAY
11
GENES_PTC
143
MAPPED_GENES_NMD
95
NONCODING_GENES_SPLICING_CHANGE 0
CODING_GENES_SPLICING_CHANGE
36
GENES_SPLICING_CHANGE_NMD
14
BROKEN_DONORS 27
BROKEN_ACCEPTORS
27
CRYPTIC_SITES
182 1068 51
EXON_SKIPPING
43 43 51
ALT_TRANSCRIPTS 3542
3870 3459
WEAKENED_DONORS 68
WEAKENED_ACCEPTORS
103
GENES_NO_TRANSCRIPT_BROKEN_SS 3
GENES_NO_TRANSCRIPT_WEAK_SS
0
GENES_FRAMESHIFT
,
the columns are the number of genes having a frameshift, the total
frameshift length, the total length squared (for computing variances),
and the total number of frameshifts. For CRYPTIC_SITES
,
the columns are the number of cryptic sites predicted, the same value
squared, and the number of disrupted splice sites. For EXON_SKIPPING
,
the columns are the number of exon skipping events predicted, the same
value squared, and the number of disrupted splice sites. For ALT_TRANSCRIPTS
,
the columns are the total number of alternate splice forms predicted,
the same value squared, and the number of genes having any alternate
splice forms predicted.essex-query.pl <in.essex>
<tag>
essex-query.pl
in.essex weakened-donor
(weakened-donor 8548 aac_GT_aagttttgag
-15.362
aac_GT_taagttttga -19.0666)
(weakened-donor 1056 tttatg_GT_aagtggatat
-18.6239
tttatg_GT_gtccagtcgc -22.6956)
(weakened-donor 6130 ttctgg_GT_aggagtctgc
-20.6655
ttctgg_GT_aggagtcttt -21.6978)
essex-path-query.pl
<in.essex> <query>
essex-query.pl
except that the query is a path query,
in which the path to an element is specified. This allows more
specific queries that indicate elements having a given path from the
root. Paths are specified similarly to UNIX file paths, except
that rather than directory names, a path query contains Essex nodes in
a nested relation (parent-child).$ACE/essex-path-query.pl 1.essex
report/status/alternate-structures/transcript
transcript
elements that are nested within an alternate-structures
element (meaning that they are predicted gene structures resulting from
a splicing-disruption variant).filter-gencode.pl <in.gff>
$ACE/filter-gencode.pl infile.gff > outfile.gff
normalize-gff.pl
<infile.gff>
$ACE/normalize-gff.pl infile.gff >
outfile.gff
subset-vcf-by-sample
<in.vcf> <sampleID> <out.vcf>
$ACE/subset-vcf-by-sample
chr1.vcf.gz HG00099 out.vcf
sort-vcf.pl
<in.vcf.gz> <out.vcf.gz> <temp-dir>
sort
command.