Methods for Computational Gene Prediction
DATA SETS
Genomic sequences and annotations
These data sets consist of FASTA files containing entire genomic contigs or chromosomes, and GFF files containing the coordinates of coding exons within those contigs. To use these data you will have to extract the training features (exons, introns, splice sites, etc.) yourself. For pre-extracted data, scroll down to the next table. (NOTE: files hosted at TIGR may disappear in the near future...)
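The extraction step described above can be sketched in a few lines of Python. This is a minimal sketch, not the procedure used to build the pre-extracted files. It assumes the standard nine-column, tab-separated GFF layout with 1-based inclusive coordinates, and it assumes the feature type in column 3 is called `exon`; the actual label in these files may differ, so check a file before relying on it. The function names `read_fasta`, `reverse_complement`, and `extract_features` are illustrative only.

```python
from typing import Dict, Iterable, List, Tuple

# Translation table for complementing DNA; other characters pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {sequence_id: sequence}.

    The sequence id is taken to be the first whitespace-delimited token
    of the header line (an assumption about how the GFF refers to contigs).
    """
    parts: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            parts[current] = []
        elif current is not None:
            parts[current].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_features(
    fasta: Dict[str, str],
    gff_lines: Iterable[str],
    feature_type: str = "exon",  # assumed label; adjust to match the GFF file
) -> List[Tuple[str, int, int, str, str]]:
    """Return (seqid, start, end, strand, sequence) for each matching GFF record."""
    features = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8 or cols[2] != feature_type:
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        if seqid not in fasta:
            continue  # record refers to a contig that was not loaded
        # GFF coordinates are 1-based and inclusive; Python slices are 0-based, half-open.
        subseq = fasta[seqid][start - 1:end]
        if strand == "-":
            subseq = reverse_complement(subseq)
        features.append((seqid, start, end, strand, subseq))
    return features
```

With real files, pass open file handles (for example `read_fasta(open(path))`) in place of the in-memory lists. Introns and splice-site windows can then be derived from the gaps between consecutive exons of the same gene, which requires grouping records by whatever gene or transcript identifier the GFF attribute column provides.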
SEE ALSO: Gene Prediction Data Consortium hosted at bioinformatics.org.
NOTE: the synthetic ("toy genome") data from Chapter 5 is available on another page.
Sequence features
These data sets consist of FASTA files containing pre-extracted training features, such as coding exons and splice sites. Because the features have been pre-extracted, choices such as window size are fixed. To extract custom features, use the data from the table above.
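One common use of fixed-size signal windows such as these is to estimate a weight matrix. The following is a minimal sketch, assuming every window has the same length, a uniform background of 0.25 per base, and a pseudocount of 1 to avoid taking the log of zero; none of these choices come from the data sets themselves, and the names `build_weight_matrix` and `score_window` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

ALPHABET = "ACGT"


def build_weight_matrix(
    windows: Sequence[str],
    pseudocount: float = 1.0,   # assumed smoothing; must be > 0
    background: float = 0.25,   # assumed uniform background frequency
) -> List[Dict[str, float]]:
    """Estimate a log-odds weight matrix from equal-length windows.

    Returns one dict per window position, mapping base -> log-odds score.
    """
    if not windows:
        raise ValueError("need at least one window")
    width = len(windows[0])
    if any(len(w) != width for w in windows):
        raise ValueError("all windows must have the same length")

    matrix = []
    for pos in range(width):
        counts = {base: pseudocount for base in ALPHABET}
        for window in windows:
            base = window[pos].upper()
            if base in counts:      # ignore N and other ambiguity codes
                counts[base] += 1
        total = sum(counts.values())
        matrix.append(
            {base: math.log((counts[base] / total) / background) for base in ALPHABET}
        )
    return matrix


def score_window(matrix: List[Dict[str, float]], window: str) -> float:
    """Sum the per-position log-odds; unknown bases contribute 0."""
    if len(window) != len(matrix):
        raise ValueError("window length does not match matrix width")
    return sum(col.get(base.upper(), 0.0) for col, base in zip(matrix, window))
```

A window that resembles the training examples receives a higher score than one that does not, so the score can serve as a signal feature for a downstream classifier.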
Classification data sets
These data sets contain vectors of numerical features computed from sequences of both coding and non-coding ORFs. These can be used to train a classifier for distinguishing coding from non-coding sequence. Both training and test data are provided, so that classifier accuracy can be quantified on the hold-out set. The features are: (1) log probability of the ORF length, (2) weight-matrix score of the 5' signal of the ORF, (3) weight-matrix score of the 3' signal of the ORF, (4) log-likelihood ratio computed from hexamer frequencies in the ORF. More details are given in the README.pdf file included in each tarball. (See also section 10.17 of the book.)
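The train-then-evaluate workflow described above can be sketched with a small logistic regression written in plain Python. This is an illustration under stated assumptions, not the classifier used in the book: logistic regression is simply a convenient stand-in, the file format is not assumed (consult README.pdf and load the vectors into lists yourself), and labels are taken to be 1 for coding and 0 for non-coding. The names `train_logistic`, `predict`, and `accuracy` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.1,      # assumed learning rate
    epochs: int = 500,    # assumed number of full passes
) -> Tuple[List[float], float]:
    """Fit weights and bias by full-batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return 1 (coding) if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


def accuracy(
    w: Sequence[float], b: float, X: Sequence[Sequence[float]], y: Sequence[int]
) -> float:
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for xi, yi in zip(X, y) if predict(w, b, xi) == yi)
    return correct / len(X)
```

Fit on the training vectors only, then call `accuracy` on the test vectors to obtain the hold-out estimate. Because the four features are on different scales (log probabilities versus weight-matrix scores), standardizing each feature using training-set statistics before fitting is likely to help gradient descent converge.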