Tour¶

This document contains a brief overview of plastid in the following sections:

Command-line scripts
Classes & data structures
- GenomicSegment
- SegmentChain & Transcript
- GenomeArray & its subclasses
- GenomeHash, BigBedGenomeHash, and TabixGenomeHash
See also

Detailed documentation of classes & functions may be found in the module documentation.

Command-line scripts ¶

plastid includes a handful of scripts for common high-throughput sequencing and ribosome profiling analyses. These include, among others:

Measuring read counts or density density in regions of interest (for example, to obtain gene expression values)

Determining P-site offsets for ribosome profiling experiments

Performing metagene analyses

Creating browser tracks from BAM or bowtie files, after applying Read mapping functions to extract the biology of interst from the alignments (e.g. P-sites in ribosome profiling data)

For a complete list, see Command-line scripts.

Classes & data structures ¶

plastid defines a number of classes to facilitate sequencing analyses:

Class

Purpose

GenomicSegment, SegmentChain & Transcript

Represent genomic features (e.g. mRNAs, genes, SNPs, stop codons) as Python objects

GenomeArray & its subclasses

Map quantitative values or read alignments to genomic coordinates.

GenomeHash & its subclasses

Index genomic features by genomic coordinates, for quick lookup of features that overlap or cover a region.

In the examples below, we’ll be using a small Demo dataset covering the human cytomegalovirus (hCMV) genome ([SGWM+12]).

`GenomicSegment`¶

GenomicSegments are the fundamental building block of genomic features. They are defined by:

a chromosome name

a start coordinate

an end coordinate

a strand:

‘+’ for forward-strand features

‘-’ for reverse-strand features

‘.’ for unstranded features

On their own, GenomicSegments are not very interesting. However, they can be used to build SegmentChains, which are interesting.

`SegmentChain` & `Transcript`¶

SegmentChain & its subclass Transcript model genomic features. They are constructed from zero or more GenomicSegments, and therefore can represent even discontinuous genomic features, such as transcripts or gapped alignments, in addition to continuous features (e.g. single exons).

SegmentChain and its subclasses provide methods for:

converting coordinates between the genome and the spliced space of the SegmentChain

fetching genomic sequence, read alignments, or count data over the SegmentChain, in its own 5’ to 3’ direction, automatically accounting for splicing of the segments and, for reverse-strand features, reverse-complementing the sequence

slicing or fetching sub-regions of a SegmentChain

testing for equality, inequality, overlap, containment, or coverage of other SegmentChain or GenomicSegment objects

exporting to BED, GTF2, or GFF3 formats, for use with other software packages or within a genome browser

SegmentChains and Transcripts can be constructed manually from zero or more GenomicSegments and any optional keywords, which will be stored in the SegmentChain’s attr dictionary:

>>> from plastid import GenomicSegment, SegmentChain, Transcript
>>> exon1 = GenomicSegment("chrI",129237,130487,"+")
>>> exon2 = GenomicSegment("chrI",130531,130572,"+")
>>> SegmentChain(exon1,exon2,ID="YAL013W",alias="DEP1")
<SegmentChain segments=2 bounds=chrI:129237-130572(+) name=YAL013W>

>>> dep1 = Transcript(exon1,exon2,ID="YAL013W",alias="DEP1",cds_genome_start=129270,cds_genome_end=130484)
>>> dep1
<Transcript segments=2 bounds=chrI:129237-130572(+) name=YAL013W>

>>> dep1.attr
{'ID': 'YAL013W',
 'alias': 'DEP1',
 'cds_genome_end': 130484,
 'cds_genome_start': 129270,
 'type': 'mRNA'}

More often, SegmentChains and Transcripts are loaded from annotation files (see plastid.readers):

>>> from plastid import BED_Reader

>>> # get an iterator over transcripts in file
>>> reader = BED_Reader("merlin_orfs.bed",return_type=Transcript)

>>> # do something with transcripts. here we just look at their names & attribute dictionaries
>>> for transcript in reader:
>>>     print(transcript.get_name() + ":\t" + str(transcript.attr))
ORFL1W_(RL1):        {'cds_genome_end': 2299, 'color': '#000000', 'score': 0.0, 'cds_genome_start': 1366, 'type': 'mRNA', 'ID': 'ORFL1W_(RL1)'}
ORFL2C:      {'cds_genome_end': 2723, 'color': '#000000', 'score': 0.0, 'cds_genome_start': 2501, 'type': 'mRNA', 'ID': 'ORFL2C'}
ORFL3C:      {'cds_genome_end': 3015, 'color': '#000000', 'score': 0.0, 'cds_genome_start': 2934, 'type': 'mRNA', 'ID': 'ORFL3C'}
[rest of output omitted]

SegmentChains and Transcripts can convert coordinates between the transcript and the genome:

>>> # load transcripts into a dictionary keyed on transcript ID
>>> transcript_dict = { X.get_name() : X for X in BED_Reader("merlin_orfs.bed",return_type=Transcript) }

>>> # we'll use the two-exon, minus-strand gene ORFL83C as an example
>>> demo_tx = transcript_dict["ORFL83C_(UL29)"]
>>> demo_tx
<Transcript segments=2 bounds=merlin:35004-37402(-) name=ORFL83C_(UL29)>

>>> # get genomic coordinate of 1124th nucleotide from 5' end of ORFL83C
>>> # right before the splice junction
>>> demo_tx.get_genomic_coordinate(1124)
('merlin', 36277, '-')

>>> # get genomic coordinate of 1125th nucleotide from 5' end of ORFL83C
>>> # right after the splice junction
>>> demo_tx.get_genomic_coordinate(1125)
('merlin', 36130, '-')

>>> # and the inverse operation also works
>>> demo_tx.get_segmentchain_coordinate("merlin",36130,"-")
1126

SegmentChains can fetch vectors of data covering each position in the chain from the 5’ to 3’ end (relative to the chain) from GenomeArrays (themselves explained below). For example, to count how many 5’ ends of sequencing reads appear at each position in a chain:

>>> from plastid import BAMGenomeArray, FivePrimeMapFactory

>>> # load read alignments, and map them to 5' ends
>>> alignments = BAMGenomeArray("SRR609197_riboprofile_5hr_rep1.bam")
>>> alignments.set_mapping(FivePrimeMapFactory())

>>> # fetch the number of 5' ends of alignments at positions 300-320
>>> demo_tx.get_counts(alignments)[320:340]
array([  23.,    3.,   17.,   67.,   22.,    5.,   15.,   14.,   99.,
         26.,   13.,   27.,  112.,   34.,    1.,   13.,    0.,    4.,
          2.,   11.])

It is also possible to fetch sub-sections of a Transcripts or SegmentChains as new SegmentChains:

>>> # take first 200 nucleotides of  mRNA
>>> subchain = demo_tx.get_subchain(0,200)
>>> subchain
<SegmentChain intervals=1 bounds=merlin:37202-37402(-) name=ORFL83C_(UL29)_subchain>

Transcript includes several convenience methods to fetch 5’ UTRs, coding regions, and 3’UTRs from coding transcripts:

>>> demo_tx.get_utr5()
<SegmentChain intervals=1 bounds=merlin:37353-37402(-) name=ORFL83C_(UL29)_5UTR>

>>> demo_cds = demo_tx.get_cds()
>>> demo_cds
<Transcript intervals=2 bounds=merlin:35104-37353(-) name=ORFL83C_(UL29)_CDS>

SegmentChain and its subclasses can also fetch their sequences from dictionaries of strings or Bio.SeqRecord.SeqRecord objects. These sequences will automatically be spliced and reverse-complemented, as necessary:

>>> from Bio import SeqIO
>>> genome = SeqIO.to_dict(SeqIO.parse(open("merlin_NC006273-2.fa"),"fasta"))
>>> demo_tx.get_cds().get_sequence(genome)
'ATGTCCGGCCGTCGCAAGGGCTGCTCGGCGGCCACGGCGTCTTCCTCCTCGTCGTCGCCGCCGTCGCGCCTTCCTCTGCCTGGGCACGCGCGTCGGCCGCGTCGCAAACGCTGCTTGGTACCCGAGG...'
# rest of output omitted

SegmentChains and Transcripts can do a lot more. For complete documentation see SegmentChain and Transcript in plastid.genomics.roitools.

`GenomeArray` & its subclasses¶

GenomeArrays are dictionary-like objects that map quantitative data, counts, or read alignments, to genomic positions. Data can be imported from count files (Wiggle, bedGraph) or alignment files (bowtie or BAM formats). For very large genomes a sparse implementation is provided by SparseGenomeArray. A BAMGenomeArray is provided for read alignments in BAM format.

GenomeArrays can be indexed by GenomicSegments or SegmentChains. Doing so returns a vector of counts at each position in the GenomicSegment or SegmentChain, with 5’ to 3’ coordinates relative to the chain (i.e. for reverse-strand features, position 0 of the vector corresponds to segment.end):

>>> # genomic segment
>>> seg = GenomicSegment("merlin",1500,1600,"+")
>>> alignments[seg]
array([ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,
    0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
    4.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
    0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
    0.,  0.,  0.,  0.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,
    0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,
    1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,
    0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

>>> # segment chain
>>> alignments[demo_cds][:100]
array([ 24.,   4.,   0.,   1.,   6.,   1.,   0.,   1.,  16.,   2.,   1.,
     1.,   2.,  13.,  17.,  13.,  13.,   2.,   3.,  23.,  10.,  39.,
    22.,  23.,  31.,  34.,  11.,  20.,  15.,   2.,   8.,  10.,   4.,
    11.,   9.,   5.,   5.,   4.,  13.,   5.,   2.,   0.,   2.,   4.,
     0.,   7.,  48.,  10.,  14.,   2.,   2.,   4.,   3.,   8.,   9.,
     0.,   9.,   8.,   8.,   9.,  10.,   9.,  14.,   3.,   9.,  33.,
     3.,   6.,  38.,   7.,   1.,  14.,   3.,  32.,  55.,  11.,   1.,
     4.,   1.,   9.,   9.,   1.,   3.,   2.,   0.,   6.,  17.,  21.,
     1.,  32.,   6.,   3.,  11.,   3.,   2.,   7.,  10.,   0.,  36.,
     4.])

>>> # has same effects as calling the 'get_counts()' method
>>> (demo_cds.get_counts(alignments) == alignments[demo_cds]).all()
True

When importing read alignments, users can specify a mapping rule to determine the genomic position(s) at which each alignment should be counted. plastid already includes mapping functions to map read alignments:

to their fiveprime or threeprime ends, with or without offsets from that end (e.g. for P-site mapping for ribosome profiling data)

fractionally over their entire lengths (for visualizing the full extent of transcripts in RNA-seq data)

fractionally to all positions covered by a central portion of the read alignment, after excluding a user-defined number of positions on each send of the read (as in ribosome profiling data from E. coli [OBS+11] or D. melanogaster [DFB+13])

mapping rules for BAMGenomeArrays can be changed at runtime:

>>> from plastid import FivePrimeMapFactory, ThreePrimeMapFactory

>>> alignments.set_mapping(FivePrimeMapFactory())
>>> demo_tx.get_cds().get_counts(alignments)[:50]
array([ 24.,   4.,   0.,   1.,   6.,   1.,   0.,   1.,  16.,   2.,   1.,
         1.,   2.,  13.,  17.,  13.,  13.,   2.,   3.,  23.,  10.,  39.,
        22.,  23.,  31.,  34.,  11.,  20.,  15.,   2.,   8.,  10.,   4.,
        11.,   9.,   5.,   5.,   4.,  13.,   5.,   2.,   0.,   2.,   4.,
         0.,   7.,  48.,  10.,  14.,   2.])

>>> # change to mapping with 15 nucleotide offset from 5' end
>>> alignments.set_mapping(FivePrimeMapFactory(offset=15))
>>> demo_tx.get_cds().get_counts(alignments)[:50]
array([  3.,  26.,   8.,  17.,   4.,   4.,   9.,  27.,   7.,   3.,  17.,
        10.,  18.,  20.,   1.,  24.,   4.,   0.,   1.,   6.,   1.,   0.,
         1.,  16.,   2.,   1.,   1.,   2.,  13.,  17.,  13.,  13.,   2.,
         3.,  23.,  10.,  39.,  22.,  23.,  31.,  34.,  11.,  20.,  15.,
         2.,   8.,  10.,   4.,  11.,   9.])

>>> # change to mapping from 3' end, with no offset
>>> alignments.set_mapping(ThreePrimeMapFactory())
>>> demo_tx.get_cds().get_counts(alignments)[:50]
array([  5.,   6.,  14.,  17.,  24.,   5.,  14.,  19.,   4.,   5.,  11.,
         6.,   4.,   4.,   0.,   2.,   0.,   0.,   1.,   6.,  14.,  26.,
        25.,   4.,  23.,   7.,   8.,  24.,  11.,  11.,  22.,   9.,  14.,
         2.,   0.,   1.,   5.,   9.,   7.,   1.,   6.,   3.,   1.,   4.,
         5.,  15.,  15.,   6.,  17.,   8.])

GenomeArrays and subclasses can be exported to wiggle or bedGraph files for use in a genome browser:

>>> # export minus strand as a bedgraph file
>>> with open("alignments_rc.wig","w") as fout:
>>>     alignments.to_bedgraph(fout,"my_trackname","-")

wiggle or bedGraph files can be also imported into a GenomeArray using the add_from_wiggle() method:

>>> from plastid import GenomeArray
>>> new_data = GenomeArray()
>>> new_data.add_from_wiggle(open("alignments_rc.wig"),"-")

>>> demo_tx.get_cds().get_counts(new_data)[:50]
array([  5.,   6.,  14.,  17.,  24.,   5.,  14.,  19.,   4.,   5.,  11.,
         6.,   4.,   4.,   0.,   0.,   0.,   0.,   0.,   6.,  14.,  26.,
        25.,   4.,  23.,   7.,   8.,  24.,  11.,  11.,  22.,   9.,  14.,
         2.,   0.,   0.,   5.,   9.,   7.,   1.,   6.,   3.,   1.,   4.,
         5.,  15.,  15.,   6.,  17.,   8.])

For further information, see:

The module documentation for genome_array

In-depth discussion of mapping rules

`GenomeHash`, `BigBedGenomeHash`, and `TabixGenomeHash`¶

Often one needs to know whether any features overlap a specific region in the genome, for example, to find transcripts that overlap one another.

But, it would be inefficient to scan an entire file to find overlapping features, or to test whether two features overlap if we already know from their genomic coordinates that they cannot.

GenomeHash and its subclasses avoid this problem by indexing features by location. A GenomeHash may be created from a list or dictionary of features (e.g. SegmentChains or Transcripts) in memory, or directly loaded from a genome annotation (in BED, GTF2, GFF3, or PSL format):

>>> from plastid import GenomeHash
>>> my_hash = GenomeHash(transcript_dict)

Having made a GenomeHash, we can ask what is where in the genome. For example, to find all features between bases 10000-20000 on the plus strand of chromosome chrI:

>>> roi = GenomicSegment("merlin",10000,20000,"+")
>>> my_hash[roi]
[<Transcript segments=1 bounds=merlin:14307-14957(+) name=ORFL35W_(UL5)>,
 <Transcript segments=1 bounds=merlin:16522-17040(+) name=ORFL40W_(UL8)>,
 <Transcript segments=1 bounds=merlin:15814-16632(+) name=ORFL37W_(UL7)>,
 <Transcript segments=1 bounds=merlin:19793-21178(+) name=ORFL46W.iORF2>,
 <Transcript segments=1 bounds=merlin:12684-12929(+) name=ORFL25W>,
 <Transcript segments=1 bounds=merlin:13185-13406(+) name=ORFL30W>,
 <Transcript segments=1 bounds=merlin:19559-21178(+) name=ORFL46W>,
 <Transcript segments=1 bounds=merlin:9799-11193(+) name=ORFL23W_(RL12)>,
 <Transcript segments=1 bounds=merlin:13561-13779(+) name=ORFL33W>,
 <Transcript segments=1 bounds=merlin:12872-13192(+) name=ORFL26W>,
 <Transcript segments=1 bounds=merlin:18591-19559(+) name=ORFL45W_(UL11)>,
 <Transcript segments=1 bounds=merlin:19607-21178(+) name=ORFL46W.iORF1_(UL13)>,
 <Transcript segments=1 bounds=merlin:19053-19559(+) name=ORFL45W.iORF1>,
 <Transcript segments=1 bounds=merlin:18467-18685(+) name=ORFL44W>,
 <Transcript segments=1 bounds=merlin:17867-18142(+) name=ORFL42W>,
 <Transcript segments=1 bounds=merlin:14914-15906(+) name=ORFL36W_(UL6)>,
 <Transcript segments=1 bounds=merlin:13770-14369(+) name=ORFL34W_(UL4)>,
 <Transcript segments=1 bounds=merlin:11138-12169(+) name=ORFL24W_(RL13)>,
 <Transcript segments=1 bounds=merlin:14565-14957(+) name=ORFL35W.iORF1>]

Or on both strands:

>>> my_hash.get_overlapping_features(roi,stranded=False)
[<Transcript segments=1 bounds=merlin:19793-21178(+) name=ORFL46W.iORF2>,
 <Transcript segments=1 bounds=merlin:17609-17857(-) name=ORFL43C.iORF1>,
 <Transcript segments=1 bounds=merlin:19607-21178(+) name=ORFL46W.iORF1_(UL13)>,
 <Transcript segments=1 bounds=merlin:17609-17968(-) name=ORFL43C>,
 <Transcript segments=2 bounds=merlin:7811-13052(-) name=ORFL27C>,
 <Transcript segments=1 bounds=merlin:15643-15915(-) name=ORFL38C>,
 <Transcript segments=1 bounds=merlin:13561-13779(+) name=ORFL33W>,
 <Transcript segments=1 bounds=merlin:19053-19559(+) name=ORFL45W.iORF1>,
 <Transcript segments=1 bounds=merlin:17867-18142(+) name=ORFL42W>,
 <Transcript segments=1 bounds=merlin:14914-15906(+) name=ORFL36W_(UL6)>,
 <Transcript segments=1 bounds=merlin:14307-14957(+) name=ORFL35W_(UL5)>,
 <Transcript segments=1 bounds=merlin:16522-17040(+) name=ORFL40W_(UL8)>,
 <Transcript segments=1 bounds=merlin:15698-15937(-) name=ORFL39C>,
 <Transcript segments=1 bounds=merlin:17061-17294(-) name=ORFL41C>,
 <Transcript segments=1 bounds=merlin:13169-13402(-) name=ORFL31C.iORF1>,
 <Transcript segments=1 bounds=merlin:13185-13406(+) name=ORFL30W>,
 <Transcript segments=1 bounds=merlin:19559-21178(+) name=ORFL46W>,
 <Transcript segments=1 bounds=merlin:12872-13192(+) name=ORFL26W>,
 <Transcript segments=1 bounds=merlin:12957-13226(-) name=ORFL29C>,
 <Transcript segments=1 bounds=merlin:13770-14369(+) name=ORFL34W_(UL4)>,
 <Transcript segments=1 bounds=merlin:12910-13140(-) name=ORFL28C.iORF1>,
 <Transcript segments=1 bounds=merlin:14565-14957(+) name=ORFL35W.iORF1>,
 <Transcript segments=1 bounds=merlin:15814-16632(+) name=ORFL37W_(UL7)>,
 <Transcript segments=1 bounds=merlin:12684-12929(+) name=ORFL25W>,
 <Transcript segments=1 bounds=merlin:9799-11193(+) name=ORFL23W_(RL12)>,
 <Transcript segments=1 bounds=merlin:13110-13439(-) name=ORFL31C_(UL2)>,
 <Transcript segments=1 bounds=merlin:18591-19559(+) name=ORFL45W_(UL11)>,
 <Transcript segments=1 bounds=merlin:11138-12169(+) name=ORFL24W_(RL13)>,
 <Transcript segments=1 bounds=merlin:12899-13159(-) name=ORFL28C>,
 <Transcript segments=1 bounds=merlin:18467-18685(+) name=ORFL44W>]

Does anything interesting overlap ORFL83C_(UL29)?

>>> my_hash[demo_tx]
[<Transcript segments=1 bounds=merlin:37077-37449(-) name=ORFL84C>,
 <Transcript segments=1 bounds=merlin:33081-35058(-) name=ORFL79C_(UL27)>,
 <Transcript segments=1 bounds=merlin:37382-37898(-) name=ORFL85C_(UL30)>,
 <Transcript segments=2 bounds=merlin:35004-37403(-) name=ORFL83C_(UL29)>]

For more information, see the module documentation for genome_hash.

Tour¶

Command-line scripts¶

Classes & data structures¶

GenomicSegment¶

SegmentChain & Transcript¶

GenomeArray & its subclasses¶

GenomeHash, BigBedGenomeHash, and TabixGenomeHash¶

See also¶