Glossary of terms

0-indexed
0-indexed coordinates

In 0-indexed coordinate systems, the first position or coordinate is labeled 0. 0-indexed coordinates are typical in Python, where all slicing and indexing of lists, strings, and all other sliceable objects occurs in 0-indexed and half-open coordinate representation.

In contrast, see 1-indexed coordinates. For a detailed discussion with examples, see Coordinate systems used in genomics.

1-indexed
1-indexed coordinates

In 1-indexed coordinate systems, the first position or coordinate is labeled 1. In contrast, see 0-indexed coordinates. For a detailed discussion with examples, see Coordinate systems used in genomics.
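As a small illustration of the two conventions, the following Python sketch converts a position between them. The helper names `one_to_zero` and `zero_to_one` are hypothetical and are not part of plastid.

```python
seq = "ATGCCC"

# Python is 0-indexed: the first nucleotide lives at index 0.
first_nucleotide = seq[0]


def one_to_zero(position):
    """Convert a 1-indexed position to its 0-indexed equivalent."""
    return position - 1


def zero_to_one(position):
    """Convert a 0-indexed position to its 1-indexed equivalent."""
    return position + 1


# Position 1 in a 1-indexed system names the same nucleotide as index 0.
same_nucleotide = seq[one_to_zero(1)]
```

Conversion is a shift by one in either direction, which is why mixing the two systems typically produces off-by-one errors.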

alignment
read alignments

A record matching a short sequence of DNA or RNA to a region of identical or similar sequence in a genome. In a high-throughput sequencing experiment, alignment of short reads identifies the genomic coordinates from which each read presumably derived.

These are produced by running sequencing data through alignment programs, such as Bowtie, Tophat, or BWA. The most common format for short read alignments is BAM.

annotation

A file that describes locations and properties of features (e.g. genes, mRNAs, SNPs, start codons) in a genome. Annotation files come in various formats, such as BED, BigBed, GTF2, GFF3, and PSL, among others. In a high-throughput sequencing experiment, it is essential to make sure that the coordinates in the annotation correspond to the genome build used to generate the alignments.

count file

A file that assigns quantitative data – for example, read alignment counts, or conservation scores – to genomic coordinates. Strictly speaking, these include bedGraph or wiggle files, but plastid can also treat alignment files in bowtie or BAM format as count files if a mapping rule is applied.

counts

Colloquially, the number of read alignments overlapping a region of interest, or mapped to a nucleotide.
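A minimal sketch of counting alignments that overlap a region of interest. It assumes each alignment is a plain `(start, end)` tuple in 0-indexed, half-open coordinates on a single chromosome and strand; the function name `count_overlapping` is hypothetical, not a plastid function.

```python
def count_overlapping(alignments, region_start, region_end):
    """Count alignments that overlap the half-open region [region_start, region_end)."""
    # Two half-open intervals overlap when each starts before the other ends.
    return sum(
        1
        for start, end in alignments
        if start < region_end and end > region_start
    )


alignments = [(0, 30), (25, 55), (60, 90)]
n_counts = count_overlapping(alignments, 28, 50)
```

Here the first two alignments overlap the region and the third does not.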

crossmap

A mask file that annotates regions of the genome that give rise to multimapping reads under given alignment criteria. Crossmaps may be made using the crossmap script.

deep sequencing
high-throughput sequencing

A group of experimental techniques that output millions of reads, each a short string of DNA sequence.

DMS-seq

An RNA structure probing technique using high-throughput sequencing. See [RZW+14] for details.

Extended BED
BED X+Y

Extended BED files contain 3-12 columns of BED-formatted data (x), plus additional (y) tab-delimited columns of arbitrary data. The ENCODE project has created several such formats (for a complete list, see the UCSC file format FAQ).

plastid supports reading BED X+Y formats via the extra_columns keyword, which can be passed to BED_Reader or to the from_bed() method of SegmentChain and Transcript. It also supports writing BED 12+Y formats via the same keyword passed to the as_bed() method.
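To make the X+Y layout concrete, here is a minimal pure-Python sketch that splits one line into its BED columns and its extra columns. It does not use plastid's readers; the function `split_bed_x_plus_y` and the example line are hypothetical.

```python
def split_bed_x_plus_y(line, num_bed_columns):
    """Split a tab-delimited BED X+Y line into (bed_fields, extra_fields)."""
    if not 3 <= num_bed_columns <= 12:
        raise ValueError("BED data occupy between 3 and 12 columns")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < num_bed_columns:
        raise ValueError("line has fewer columns than expected")
    return fields[:num_bed_columns], fields[num_bed_columns:]


# A made-up BED 6+2 line: six BED columns, then two extra columns.
line = "chrI\t100\t200\tgeneA\t0\t+\t0.75\tsome note\n"
bed_fields, extra_fields = split_bed_x_plus_y(line, 6)
```

All values are left as strings here; a real reader would also convert the coordinates and scores to numbers.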

factory function

A function that produces functions.
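A generic illustration in Python, using a closure. The names `make_offset_mapper` and `shift_by_15` are made up for this example.

```python
def make_offset_mapper(offset):
    """Factory: return a new function that adds `offset` to a position."""

    def mapper(position):
        # `offset` is remembered from the enclosing call.
        return position + offset

    return mapper


# Each call to the factory produces an independent, configured function.
shift_by_15 = make_offset_mapper(15)
shifted = shift_by_15(100)
```

The factory is called once with its configuration, and the function it returns can then be reused wherever a plain function is expected.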

FDR
false discovery rate

The false discovery rate is defined as the fraction of positive results that are false positives ([BH95]):

\[FDR = \frac{FP}{FP + TP}\]

For example, at a 5% false discovery rate, a set of 20 positive results would contain approximately 1 false positive.
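The formula above can be written as a short Python function. The function name is hypothetical; the numbers reproduce the example of 20 positive results with 1 false positive.

```python
def false_discovery_rate(false_positives, true_positives):
    """Return FP / (FP + TP), the fraction of positive results that are false."""
    total_positives = false_positives + true_positives
    if total_positives == 0:
        raise ValueError("FDR is undefined when there are no positive results")
    return false_positives / total_positives


# 20 positive results, of which 1 is a false positive.
fdr = false_discovery_rate(false_positives=1, true_positives=19)
```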

feature

A region of the genome with interesting or specific properties, such as a gene, an mRNA, an exon, a centromere, etc.

footprint
ribosome-protected footprint

A fragment of mRNA protected from nuclease digestion by a ribosome during ribosome profiling or other molecular biology assays.

fully-closed
end-inclusive

In fully-closed coordinate systems, the end coordinate of a feature is defined as the last position included in the feature. So, in this representation, the end coordinate of a 3-nucleotide feature that starts at position 3 would be 5.

In contrast, see half-open coordinates. For a detailed discussion, with examples, see Coordinate systems used in genomics.

genome assembly
genome build

A specific edition of a genome sequence for a given organism. These are updated over time as sequence data is added and/or corrected. When an assembly is updated, frequently the lengths of the chromosomes or contigs change as sequences are corrected.

genome browser

Software used for visualizing genomic sequence, feature annotations, read alignments, and other quantitative data (e.g. nucleotide-wise sequence conservation). Popular genome browsers include IGV and the UCSC genome browser.

half-open

In half-open coordinate systems, the end coordinate of a feature is defined as the first position NOT included in the feature. So, in this representation, the end coordinate of a 3-nucleotide feature that starts at position 3 would be 6.

Half-open coordinates are typical in Python, where all slicing and indexing of lists, strings, or other sliceable objects use 0-indexed and half-open coordinate representation.

In contrast, see fully-closed coordinates. For a detailed discussion with examples, see Coordinate systems used in genomics.
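The 3-nucleotide feature starting at position 3 can be worked through in Python. The sequence and variable names below are made up for illustration.

```python
seq = "AAACCCGGG"

start = 3
length = 3

# Half-open: the end is the first position NOT included in the feature.
half_open_end = start + length

# Fully-closed: the end is the last position included in the feature.
fully_closed_end = start + length - 1

# Python slices take half-open coordinates directly.
feature = seq[start:half_open_end]

# In half-open coordinates, length is simply end minus start.
feature_length = half_open_end - start
```

Converting a half-open end coordinate to a fully-closed one means subtracting 1, and the reverse means adding 1.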

indexed file format

A file that indexes its own data, enabling readers to selectively load only the portions of data that are needed. This substantially saves memory. Indexed data formats include BAM, BigWig, BigBed and tabix-compressed GTF2, GFF3, and BED files. See Formats of data for further discussion.

k-mer

A sequence k nucleotides long.
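As an illustration, the following sketch lists every k-mer in a sequence using a sliding window. The function name `kmers` is hypothetical.

```python
def kmers(sequence, k):
    """Return all overlapping k-mers of `sequence`, in order of position."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


three_mers = kmers("ATGCA", 3)
```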

mapping rule
mapping function

A function that describes how a read alignment is mapped to the genome for positional analyses. Reads typically are mapped to their fiveprime or threeprime ends, with an offset of 0 or more nucleotides that can optionally depend on the read length.

For example, ribosome-protected mRNA fragments are frequently mapped to their P-sites by using a 15-nucleotide offset from the threeprime end of the fragment.

See Read mapping functions for an in-depth discussion, with examples.
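A minimal sketch of fiveprime and threeprime mapping with a fixed offset. This is not plastid's implementation: the functions `map_fiveprime` and `map_threeprime` are hypothetical, and each read is assumed to be a single ungapped block given as 0-indexed, half-open `start` and `end` coordinates plus a strand.

```python
def map_fiveprime(start, end, strand, offset=0):
    """Return the position `offset` nt inward from the read's 5' end."""
    if strand == "+":
        return start + offset
    if strand == "-":
        # On the minus strand the 5' end is the highest coordinate.
        return end - 1 - offset
    raise ValueError(f"unknown strand: {strand!r}")


def map_threeprime(start, end, strand, offset=0):
    """Return the position `offset` nt inward from the read's 3' end."""
    if strand == "+":
        # `end` is half-open, so the 3' nucleotide sits at end - 1.
        return end - 1 - offset
    if strand == "-":
        return start + offset
    raise ValueError(f"unknown strand: {strand!r}")


# A 30-nt footprint on the plus strand, mapped 15 nt from its 3' end.
p_site = map_threeprime(100, 130, "+", offset=15)
```

An offset that depends on read length could be handled by looking up `offset` in a dictionary keyed on `end - start` before calling either function.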

mask file
mask annotation file

An annotation file that identifies regions of the genome to exclude from analysis, such as repetitive regions.

See Excluding (masking) regions of the genome for information on creating and using mask files.

maximal spanning window

The largest possible window over which a group of regions (for example, transcripts) share corresponding genomic positions.

For example, if a gene has a single start codon, the maximal spanning window surrounding that start codon can be made by growing a window along the transcripts in the 5’ and 3’ directions, starting at the start codon, and stopping in each direction as soon as the next coordinate no longer corresponds to the same genomic position in all transcripts:

[Figure: Maximal spanning window surrounding a start codon over a family of transcripts.]

Maximal spanning windows are often used in metagene analyses.
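The window-growing procedure described above can be sketched in Python. This sketch assumes each transcript is given as a list of its genomic positions in 5' to 3' order (introns already removed) and that the landmark, such as the first nucleotide of the start codon, occurs in every transcript. The function name is hypothetical and this is not plastid's implementation.

```python
def maximal_spanning_window(transcripts, landmark):
    """Return the genomic positions shared around `landmark` by all transcripts."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    # Index of the landmark within each transcript (ValueError if absent).
    idx = [t.index(landmark) for t in transcripts]

    def shared(step):
        """True if every transcript has the same position `step` from the landmark."""
        positions = set()
        for t, i in zip(transcripts, idx):
            j = i + step
            if j < 0 or j >= len(t):
                return False
            positions.add(t[j])
        return len(positions) == 1

    upstream = 0
    while shared(-(upstream + 1)):
        upstream += 1
    downstream = 0
    while shared(downstream + 1):
        downstream += 1

    first, i0 = transcripts[0], idx[0]
    return first[i0 - upstream:i0 + downstream + 1]


# Two made-up isoforms that diverge after genomic position 14.
isoform_a = [10, 11, 12, 13, 14, 20, 21, 22]
isoform_b = [12, 13, 14, 30, 31]
window = maximal_spanning_window([isoform_a, isoform_b], landmark=13)
```

In this example the window stops upstream where the shorter isoform begins, and downstream where the two isoforms splice to different exons.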

metagene
metagene average

An average of quantitative data over one or more genomic regions (often genes or transcripts) aligned at some internal feature. For example, a metagene profile could be built around:

  • the average of ribosome density surrounding the start codons of all transcripts in a ribosome profiling dataset

  • an average phylogenetic conservation score surrounding the 5’ splice site of the first introns of all transcripts

See Performing metagene analyses and/or the module documentation for the metagene script for more explanation.
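In its simplest form, a metagene average is a position-wise mean over equal-length vectors that have already been aligned at the landmark. The sketch below assumes exactly that input; the function name is hypothetical, and any per-region normalization that a practical analysis might apply is omitted.

```python
def metagene_average(profiles):
    """Return the position-wise mean of equal-length, pre-aligned profiles."""
    if not profiles:
        raise ValueError("need at least one profile")
    if len({len(p) for p in profiles}) != 1:
        raise ValueError("all profiles must have the same length")
    n = len(profiles)
    # zip(*profiles) yields one column (one aligned position) at a time.
    return [sum(column) / n for column in zip(*profiles)]


# Made-up counts for two transcripts, aligned at the same landmark.
average = metagene_average([[0, 2, 4], [2, 4, 8]])
```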

multimap
multimapping

A read that aligns equally well (or nearly-equally well) to multiple regions in a genome or transcriptome is said to be multimapping in that genome or transcriptome.

Multimapping reads arise from repeated sequence, for example from duplicated genes, transposons, telomeres, tandem repeats, or segmental duplications within genes.
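A toy illustration of how repeated sequence produces multimapping: the sketch below finds every k-mer that occurs at more than one position in a sequence. It considers only exact matches on one strand, which is a simplification of real alignment criteria, and the function name is hypothetical.

```python
from collections import Counter


def multimapping_kmers(genome, k):
    """Return the set of k-mers that occur at more than one position in `genome`."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return {kmer for kmer, n in counts.items() if n > 1}


# "ACG" occurs at positions 0 and 6 of this made-up sequence.
repeated = multimapping_kmers("ACGTTTACGA", 3)
```

A read of length k whose sequence is in the returned set could not be assigned to a unique position in this toy genome.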

P-site offset

Distance from the 5’ or 3’ end of a ribosome-protected footprint to the P-site of the ribosome that generated the footprint.

[Figure: Ribosome, footprint, and P-site offset. After [IGNW09].]

Because the P-site is the site where peptidyl elongation occurs, read alignments from ribosome profiling are frequently mapped to their P-sites, so that translation may be visualized along a transcript.

P-site offsets may be estimated from ribosome profiling data using the psite script. For a detailed discussion, see Determine P-site offsets for ribosome profiling data.

paired-end sequencing

A high-throughput sequencing technique in which 50-100 nucleotides of each end of a ~300 nucleotide sequence are read, and reported as a pair.

ribosome profiling

A high-throughput sequencing technique that captures the positions of all ribosomes on all RNAs at a snapshot in time. See [IGNW09] for more details.

roi
region of interest

A region of the genome or of a transcript that contains an interesting feature.

RPKM

Reads per kilobase per million reads in a dataset. This is a unit of sequencing density that is normalized by sequencing depth (in millions of reads) and by the length of the region of interest (in kb).
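The definition translates directly into a short calculation. The function name and the example numbers below are made up for illustration.

```python
def rpkm(read_count, region_length_nt, total_reads):
    """Reads per kilobase of region per million reads in the dataset."""
    if region_length_nt <= 0 or total_reads <= 0:
        raise ValueError("region length and total reads must be positive")
    kilobases = region_length_nt / 1e3
    millions_of_reads = total_reads / 1e6
    return read_count / kilobases / millions_of_reads


# 500 reads over a 2 kb region, in a dataset of 10 million reads.
density = rpkm(read_count=500, region_length_nt=2000, total_reads=10_000_000)
```

Dividing by both length and depth is what allows densities to be compared between regions of different sizes and between datasets of different sizes.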

single-end sequencing

A high-throughput sequencing technique that generates short reads of approximately 50-100 nt in length.

start codon peak

Large peaks of ribosome-protected footprint visible over initiator codons in ribosome profiling data. These arise because the kinetics of translation initiation are slow compared to the kinetics of elongation, causing a build-up over the initiator codon.

stop codon peak

Large peaks of ribosome-protected footprint visible over stop codons in some ribosome profiling datasets. These arise because the kinetics of translation termination are slow compared to the kinetics of elongation, causing a build-up over termination codons. These peaks are frequently absent from datasets if tissues are pre-treated with elongation inhibitors (e.g. cycloheximide) before lysis and sample prep.

sub-codon phasing
triplet periodicity

A feature of ribosome profiling data. Because ribosomes step three nucleotides in each cycle of translation elongation, in many ribosome profiling datasets a triplet periodicity is observable in the distribution of ribosome-protected footprints, in which 70-90% of the reads on a codon fall within the first of the three codon positions. This allows deduction of translation reading frames, if the reading frame is not known a priori. See [IGNW09] for more details.
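A minimal sketch of measuring phasing over one coding region. It assumes the reads have already been mapped to single positions (for example, their P-sites) in transcript coordinates, and that `cds_start` is the first nucleotide of the start codon. The function name and the data are hypothetical.

```python
def subcodon_phasing(mapped_positions, cds_start):
    """Return the fraction of reads at each of the three codon positions."""
    counts = [0, 0, 0]
    for position in mapped_positions:
        # Codon position 0, 1, or 2, relative to the reading frame.
        counts[(position - cds_start) % 3] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no reads to phase")
    return [count / total for count in counts]


# Made-up data: 8 of 10 reads fall on the first position of a codon.
positions = [100, 103, 106, 109, 112, 115, 118, 121, 101, 102]
phasing = subcodon_phasing(positions, cds_start=100)
```

If the reading frame were unknown, the same calculation could be repeated for each of the three candidate frames, and the frame with the strongest enrichment at position 0 would be the most likely one.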

translation efficiency

An mRNA’s translation efficiency measures how much protein is made from that individual transcript. Translation efficiency for an mRNA is therefore proportional to its translation initiation rate.