Categories and formats of genomics data¶
This document contains very, very brief overviews of the types of data encountered in genomics, and of some common file formats. It is divided into the following sections:
Types of data¶
Generally speaking, genomics data comes in four categories:
- Sequence
The nucleotide sequence of a chromosome, contig, transcript, or a set of these. These are typically maintained by public databases, such as UCSC, Ensembl, and RefSeq. The genome sequence for a given organism frequently is available in several editions, called builds or assemblies.
- Annotations
Descriptions of features – e.g. genes, transcripts, SNPs, start codons – that appear in genomes or transcripts. Annotations typically include coordinates (chromosome name, chromosome positions, and a chromosome strand), as well as properties (gene name, function, GO terms, et c) of a given feature.
Annotations are maintained by the same public databases that maintain sequence information, because the coordinates in each annotation are specific to the genome build upon which it is based. In other words, annotations and sequences must be matched! Pay particular attention to this when downloading annotations as supplemental data from a journal article.
- Quantitative data
Any kind of numerical value associated with a chromosomal position. For example, the degree of phylogenetic conservation between a set of organisms, at each position in the genome. Or, the strength of transcription factor binding to a chromosomal position in a ChIP-seq dataset.
Because quantitative data associates values with chromosomal coordinates, it can be considered an annotation of sorts. It is therefore important again to make sure that the coordinates in your data file match the genome build used by your feature annotation and/or read alignments.
- Read alignments
A record matching a short sequence of DNA to a region of identical or similar sequence in a genome. In a high-throughput sequencing experiment, alignment of short reads identifies the genomic coordinates from which each read presumably derived.
Read alignments can be produced by running sequencing data through alignment programs, such as Bowtie, Tophat, or BWA.
Read alignments can be converted to quantitative data by applying a mapping rule, that uses various properties of the read to assign genomic position(s) at which the read should be counted. For example, one could map reads to their 5’ ends, or to sites within the read where nucleotides mismatch the reference genome. For an in-depth discussion, see Read mapping functions.
Formats of data¶
One of the design goals of plastid
is to insulate users from the esoterica
of the various file formats used in genomics. But, two points are relevant:
It is important for users to recognize the file types names in order to identify the files they have or need to download.
Some file formats are indexed and others are not. Indexed files are memory-efficient, because computer programs don’t need to read the entire file to find the data of interest; instead, they can read the index and just fetch the desired portion of the data.
However, indexed files are frequently compressed, which can make reading them slower to parse. For small genomes that don’t use much memory in the first place (e.g. yeast, E. coli), the meager memory savings aren’t worth this speed cost. The exception is for short read alignments, where indexed BAM files are universally recommended.
Below is a table of commonly used file formats. At present, plastid
handles
all of these either natively or via Pysam (BAM files), Biopython (FASTA),
or twobitreader (2bit).
In addition, BED, GTF2, GFF3, and PSL files can be indexed via tabix.
plastid
supports (via pysam) reading of tabix-compressed files too.
Why are there so many formats?¶
There are a number of answers to this:
Genomics is a young science, and for a long time there was no consensus on how best to store data. This dialogue is, in fact, still ongoing.
It became apparent that file formats that work well with small genomes become very onerous for mammalian-sized genomes. This is why, for example, the 2bit, BigBed, and BigWig formats were created.
The various file formats have their own strengths and weaknesses. These are detailed in Which annotation format should I use?
Which annotation format should I use?¶
When choosing a feature annotation format, consider the following questions:
Will the annotation contain features that are not transcripts?
Will multiple types of features be stored in the same file?
Does rich attribute information need to be saved in the file?
Are features discontinuous?
Is the computing environment limited for processing power or memory and/or is the feature annotation very large?
BED, Extended BED, & BigBed¶
BED-family files contain a single record per line. And, in contrast to GTF2 or GFF3 files, single records – like transcripts – can be discontinuous. This makes BED files computationally cheap to parse, because each line is a complete record. In contrast, in GTF2 and GFF3 files, discontinuous features like transcripts need to be assembled from multiple continuous records (e.g. records describing individual exons).
BED files contain columns that describe only the following attributes:
feature name
feature coordinates (feature can be discontinuous, like a multi-exon transcript)
feature coding region start & stop
a score for the feature
a color for rendering the feature in a genome browser
Note that there is no attribute for feature type: typically all records in a BED file are of the same type (e.g. every record is a transcript or an alignment or a ChIP binding site, et c).
BigBed and Extended BED formats can include additional attributes in additional columns, but every entry in each column must be the same type of attribute (e.g. a “gene id” column can only contain gene IDs).
GTF2 & GFF3¶
Unlike BED-family files, GTF2 and GFF3 files are hierarchical: features have parents and children, which themselves are other features. Continuous features are represented on a single line. Discontinuous features – like transcripts – are represented on multiple lines – for example, one line per exon, one line per intron, and one line per continous portion of a coding region. These sub-features are linked together via parent-child attributes (‘Parent’ in for GFF3; ‘gene_id’ and ‘transcript_id’ in GTF2), which associate them with the discontinuous feature they represent.
This has several important implications:
Sub-features in GTF2 & GFF3 can have their own attributes, which differ from the attributes of their parent features.
In order to reconstruct a discontinuous feature like a transcript, GTF2 & GFF3 parsers need to collect all of the required subfeatures. However, parsers only know when they have collected all of the required features if they receive information indicating this is so. This information could be:
In a GFF3 file, the special line ‘###’:
### # the line above is not a comment, but a GFF3 instruction! # this line and the line above it are comments.
which indicates all features in memory may be assembled.
The end of the annotation file
In all cases, a GTF2 or GFF3 parser has to hold a potentially large set of subfeatures in memory until it it receives some signal that all related subfeatures have been collected. This costs memory, time, and disk space, and can become unwieldy for large genomes.
However, a major advantage of GTF2 and GFF3 files is that they contain a column (column 9) for arbitrary key-value pairs of attributes (such as GO terms, descriptive paragraphs, IDs that cross-reference different databases). This allows different features to have different types of attributes.
The primary difference between GTF2 and GFF3 formats is that, formally, GTF2 files only describe transcripts and their parts, according to a defined schema. The complete list of valid record types in GTF2 is:
CDS
start_codon
stop_codon
5UTR
3UTR
inter
inter_CNS
intron_CNS
exon
GFF3 files can describe any type of feature with any schema of parent-child hierarchy. This makes GFF3 the most flexible format. The cost of this flexibility is that, without knowing the parent-child schema, GFF3 parsers don’t know which chidl subfeatures to assemble into complex parent features.
Similarly, given an assembled feature in Python (represented as a
SegmentChain
), in the absence of a schema there is ambiguity surrounding
what types the parent SegmentChain
and each of its children (GenomicSegments
)
should be rendered as in GFF3 output. Due to this ambiguity, attempts to call the
as_gff3()
method on a multi-segment
SegmentChain
will raise an AttributeError
.
Instead, users may export the individual features from which the
multi-segment SegmentChain
was constructed, setting ‘ID’, ‘Parent’,
and ‘type’ attributes in each child feature’s attr dict:
# a multi-segment chain
>>> my_alignment
<SegmentChain segments=2 bounds=chrI:212353-214802(+) name=some_alignment>
>>> my_alignment.attr
{'ID': 'some_alignment', 'type': 'alignment'}
>>> list(my_alignment)
[<GenomicSegment chrI:212353-212900 strand='+'>,
<GenomicSegment chrI:214313-214802 strand='+'>]
>>> my_alignment.as_gff3()
# AttributeError!
# make a single, continuous feature with the endpoints of `my_alignment`
# 'ID' attribute should match 'ID' of my_alignment
>>> alignment_span = SegmentChain(my_alignment.spanning_segment,ID="some_alignment",type="alignment")
# then make a subfeature for each segment `my_alignment`,
# 'Parent' attribute should match the 'ID' attribute of `alignment_span`
>>> block1 = SegmentChain(my_alignment[0],Parent=my_alignment.get_name(),type="aligned_block")
>>> block2 = SegmentChain(my_alignment[1],Parent=my_alignment.get_name(),type="aligned_block")
# write to file
>>> features = [alignment_span,block1,block2]
>>> with open("some_file.gff","w") as gff_out:
>>> for feature in features:
>>> gff_out.write(feature.as_gff3())
In contrast, multi-segment Transcripts
can be unambiguously exported to GFF3;
they are rendered using the ontology from
Sequence Ontology (SO) v2.53.
In summary¶
The table below summarizes the discussion above:
Format |
Features that are not transcripts or parts of transcripts |
Multiple feature types |
Feature attributes |
Memory use |
Yes |
No |
No |
Low |
|
Yes |
If specified in extra column |
1 per extra column |
Low |
|
Yes |
If specified in extra column |
1 per extra column |
Low (and indexed) |
|
No |
Yes |
Unlimited |
High for discontinuous features |
|
Yes |
Yes |
Unlimited |
High for discontinuous features |
Getting the most out of your time & data¶
Starting a new type of analysis is rarely straightfoward. But, it is possible to save some time by following several practices:
Make sure your annotation matches your genome build. e.g. do not use the mm9 mouse genome annotation with the mm10 sequence assembly. Do not mix Ensembl’s human genome build GRCh38 and UCSC’s similar-but-still-different hg38.
If using a large genome (e.g. Drosophila or larger), consider using non-hierarchical (e.g. BED) and possibly indexed (e.g. BigBed, BigWig ) formats instead of non-indexed formats.
See also¶
Creating custom BED, BigBed, and GTF2 annotation files for examples on how to make your own annotation files
Coordinate systems used in genomics for information on genomic coordinates
SegmentChain
andTranscript
for details on these classesThe UCSC file format FAQ for details on file formats and further discussion of their capabilities, advantages, and disadvantages
The GFF3 specification for details on GFF3 files
Sequence Ontology (SO) v2.53, for a description of a common GFF3 feature ontology
SO releases, for the current SO consortium release.