plastid.genomics.map_factories module

Mapping functions used by BAMGenomeArray

Summary

Numerous sequencing assays encode interesting biology in various properties of read alignments. For example:

  • Ribosome profiling encodes the positions of ribosomal P-sites as a function of both read length and alignment coordinates
  • Bisulfite sequencing encodes methylation status as C-to-T transitions within read alignments
  • DMS-Seq and Pseudouridine profiling encode unstructured regions of RNA or sites of pseudouridine modification, respectively, in the 5’ termini fo their read aligments

Extended summary

plastid uses configurable mapping functions to decode the biology of interest from read alignments.

This design enables plastid’s tools to operate on sequencing data from virtually any NGS assay, provided the appropriate mapping function and parameters. Mapping rules are described in the following section:

This module contains Cython implementations of mapping functions for BAM files. At present, the following are included:

Fiveprime end mapping
Each read alignment is mapped to its 5’ end, or at a fixed offset (in nucleotides) from its 5’ end
Variable fiveprime end mapping

Each read alignment is mapped at a fixed distance from its 5’ end, where the distance is determined by the length of the read alignment.

This is used for ribosome profiling of yeast ([IGNW09]) and mammalian cells ([ILW11]).

Stratified variable fiveprime end mapping

This multidimensional mapping rule behaves as variable fiveprime end mapping, and additionally returns arrays that are stratified by read length.

For each region of interest, a 2D array is returned, in which each row represents a given read length, and each column a nucleotide position in the region of interest. In other words, summing over the columns produces the same array that would be given by variable fiveprime end mapping.

Threeprime end mapping
Each read alignment is mapped to its 3’ end, or at a fixed offset (in nucleotides) from its 3’ end.
Entire or Center-weighted mapping

Zero or more positions are trimmed from each end of the read alignment, and the remaining N positions in the alignment are incremented by 1/N read counts (so that each read is still counted once, when integrated over its mapped length).

This is also used for ribosome profiling of E. coli and D. melanogaster, and RNA-seq.

Examples

Mapping functions are rarely called directly. Instead, they are invoked via plastid.genomics.genome_array.BAMGenomeArray.set_mapping():

# open BAM file
>>> ga = BAMGenomeArray(["some_file.bam"])

# set mapping 15 bases from 3' end of read
>>> ga.set_mapping(ThreePrimeMapFactory(offset=15))

# set mapping to 5' end of read
>>> ga.set_mapping(FivePrimeMapFactory())

# cut 3 nt off each end of the read, and map fractionally over the rest
>>> ga.set_mapping(CenterMapFactory(nibble=12))

# access reads in some region
>>> ga[GenomicSegment("chrI",10000,100000,"+")]
# output is numpy array

Other mapping functions are similarly invoked. See their docstrings for further details.

Implementation

Mapping functions must be callables, and can therefore be implemented as functions or classes. In order to give them configurable parameters, one could create function factories or use functools.partial() to supply the extra parameters. Here, we created classes that produce callable instances that mimic functions. We did this because early on we ran into trouble with limitations of pickle and multiprocessing, which can only capture functions and classes defined in the top-level scope. This limitation is no longer relevant to our purposes but is preserved here in history.

Generally, the callable must accept two arguments: a list of read alignments (as pysam.AlignedSegment) and a GenomicSegment representing a query interval. The callable must then return two values: a list of read alignments, and a numpy.ndarray of values.

Generally, all reads are mapped within the callable. Only those that contribute data to coordinates contained within the query interval are returned.

The numpy.ndarray of values may be multidimensional. The only stipulation is that the last dimension represent the positions in the query interval. For a 2D array, the positions would thus be columns. See StratifiedVariableFivePrimeMapFactory below for an example.

See also

Module documentation for genome_array
For a detailed discussion of mapping functions.
class plastid.genomics.map_factories.CenterMapFactory

Bases: object

Read mapping tool for BAMGenomeArray.set_mapping(). nibble positions is removed from each side of each read alignment, and the N remaining positions are each apportioned 1/N of the read count, so that the entire read is counted once.

Parameters:
nibble : int >= 0, optional

Number of bases to remove from each side of read (Default: 0)

Attributes:
nibble

Number of positions to trim from each side of read alignment before assigning genomic positions.

Methods

__call__ Returns reads covering a region, and a count vector mapping reads to specific positions in the region.
nibble

Number of positions to trim from each side of read alignment before assigning genomic positions.

class plastid.genomics.map_factories.FivePrimeMapFactory

Bases: object

Fiveprime mapping factory for BAMGenomeArray.set_mapping(). Reads are mapped at offset nucleotides from the fiveprime end of their alignment.

Parameters:
offset : int >= 0, optional

Offset from 5’ end of read, in direction of 3’ end, at which reads should be counted (Default: 0)

Attributes:
offset

Distance from 5’ end of read at which to assign reads

Methods

__call__ Returns reads covering a region, and a count vector mapping reads to specific positions in the region, mapping reads at self.offset from the fiveprime end of each read.
offset

Distance from 5’ end of read at which to assign reads

class plastid.genomics.map_factories.SizeFilterFactory

Bases: object

Create a read-length filter can be applied at runtime to a BAMGenomeArray using :BAMGenomeArray.add_filter()

Parameters:
min : float, optional

Minimum read length to pass filter, inclusive (Default: 1)

max : float or numpy.inf, optional

Maximum read length to pass filter, inclusive. If set to -1, then there is no maximum length filter. (Default: -1, no filter)

Methods

__call__(…) <==> x(…)
class plastid.genomics.map_factories.StratifiedVariableFivePrimeMapFactory

Bases: plastid.genomics.map_factories.VariableFivePrimeMapFactory

Fiveprime-variable mapping for BAMGenomeArray.set_mapping(), stratified by read length.

Reads are mapped into a 2D array of counts at each position in a region of interest (column), stratified by read length (row). Mapping is performed by applying a user-specified offset from the fiveprime end of each alignment. A unique offset may be supplied for each read length.

Reads outside of pre-specified minimum and maximum lengths are ignored.

Parameters:
offset_dict : dict or None

Dictionary mapping read lengths to offsets that should be applied to reads of that length during mapping. A special key, ‘default’ may be supplied to provide a default value for lengths not specifically enumerated in offset_dict. If None, all reads are mapped to their 5’ ends.

min_length : int

Minimum length read to map

max_length : int

Maximum length read to map, inclusive

Attributes:
row_keys : numpy.ndarray

numpy array of read lengths corresponding to each row of mapped data.

shape : list of integers

list containing size of each axis of output, excluding the final axis,

Methods

__call__ Map reads covering seg into a 2D count array, in which reads at each nucleotide position (column) are stratified by read length (row), and, optionally, offset from their 5’ ends.
from_file(fn_or_fh) Create a VariableFivePrimeMapFactory from a text file.
from_file(fn_or_fh)

Create a VariableFivePrimeMapFactory from a text file.

The text file should be formatted as by the psite script: a tab-delimited two-column text file in which the first column is a read length (or the special word ‘default’), and the second column an offset from the 5’ end of the read corresponding to that read length. For example:

25      12
26      12
27      13
28      13
29      14
30      14
default 14
Parameters:
fn_or_fh : str or open file-like

If str, it should be a path to the text file. If file-like, it should be an open file-like object pointing to the data that would be in the text file.

row_keys

numpy array of read lengths corresponding to each row of mapped data.

shape

list containing size of each axis of output, excluding the final axis, which is specified the length of incoming GenomicSegments.

class plastid.genomics.map_factories.ThreePrimeMapFactory

Bases: object

Threeprime mapping factory for BAMGenomeArray.set_mapping(). Reads are mapped offset nucleotides from the threeprime end of their alignments.

Parameters:
offset : int >= 0, optional

Offset from 3’ end of read, in direction of 5’ end, at which reads should be counted (Default: 0)

Attributes:
offset

Distance from 3’ end of read at which to assign reads

Methods

__call__ Returns reads covering a region, and a count vector mapping reads to specific positions in the region, mapping reads at self.offset from the threeprime end of each read.
offset

Distance from 3’ end of read at which to assign reads

class plastid.genomics.map_factories.VariableFivePrimeMapFactory

Bases: object

Fiveprime-variable mapping for BAMGenomeArray.set_mapping(). Reads are mapped at a user-specified offsets from the fiveprime end of each alignment. The offset for a read of a given length is supplied in offset_dict[readlen]

Parameters:
offset_dict : dict

Dictionary mapping read lengths to offsets that should be applied to reads of that length during mapping. A special key, ‘default’ may be supplied to provide a default value for lengths not specifically enumerated in offset_dict

Methods

__call__ Returns reads covering a region, and a count vector mapping reads to specific positions in the region, mapping reads at possibly varying offsets from the 5’ end of each read as determined by self.offset_dict
from_file(fn_or_fh) Create a VariableFivePrimeMapFactory from a text file.
from_file(fn_or_fh)

Create a VariableFivePrimeMapFactory from a text file.

The text file should be formatted as by the psite script: a tab-delimited two-column text file in which the first column is a read length (or the special word ‘default’), and the second column an offset from the 5’ end of the read corresponding to that read length. For example:

25      12
26      12
27      13
28      13
29      14
30      14
default 14
Parameters:
fn_or_fh : str or open file-like

If str, it should be a path to the text file. If file-like, it should be an open file-like object pointing to the data that would be in the text file.