plastid.genomics.map_factories module¶
Mapping functions used by BAMGenomeArray
Summary¶
Numerous sequencing assays encode interesting biology in various properties of read alignments. For example:
Ribosome profiling encodes the positions of ribosomal P-sites as a function of both read length and alignment coordinates
Bisulfite sequencing encodes methylation status as C-to-T transitions within read alignments
DMS-Seq and Pseudouridine profiling encode unstructured regions of RNA or sites of pseudouridine modification, respectively, in the 5’ termini of their read aligments
Extended summary¶
plastid uses configurable mapping functions to decode the biology of interest from read alignments.
This design enables plastid’s tools to operate on sequencing data from virtually any NGS assay, provided the appropriate mapping function and parameters. Mapping rules are described in the following section:
This module contains Cython implementations of mapping functions for BAM files. At present, the following are included:
- Fiveprime end mapping
Each read alignment is mapped to its 5’ end, or at a fixed offset (in nucleotides) from its 5’ end
- Variable fiveprime end mapping
Each read alignment is mapped at a fixed distance from its 5’ end, where the distance is determined by the length of the read alignment.
This is used for ribosome profiling of yeast ([IGNW09]) and mammalian cells ([ILW11]).
- Stratified variable fiveprime end mapping
This multidimensional mapping rule behaves as variable fiveprime end mapping, and additionally returns arrays that are stratified by read length.
For each region of interest, a 2D array is returned, in which each row represents a given read length, and each column a nucleotide position in the region of interest. In other words, summing over the columns produces the same array that would be given by variable fiveprime end mapping.
- Threeprime end mapping
Each read alignment is mapped to its 3’ end, or at a fixed offset (in nucleotides) from its 3’ end.
- Entire or Center-weighted mapping
Zero or more positions are trimmed from each end of the read alignment, and the remaining N positions in the alignment are incremented by 1/N read counts (so that each read is still counted once, when integrated over its mapped length).
This is also used for ribosome profiling of E. coli and D. melanogaster, and RNA-seq.
Examples¶
Mapping functions are rarely called directly. Instead, they are invoked via
plastid.genomics.genome_array.BAMGenomeArray.set_mapping()
:
# open BAM file >>> ga = BAMGenomeArray(["some_file.bam"]) # set mapping 15 bases from 3' end of read >>> ga.set_mapping(ThreePrimeMapFactory(offset=15)) # set mapping to 5' end of read >>> ga.set_mapping(FivePrimeMapFactory()) # cut 3 nt off each end of the read, and map fractionally over the rest >>> ga.set_mapping(CenterMapFactory(nibble=12)) # access reads in some region >>> ga[GenomicSegment("chrI", 10000, 100000, "+")] # output is numpy array
Other mapping functions are similarly invoked. See their docstrings for further details.
Implementation¶
Mapping functions must be callables, and can therefore be implemented as functions
or classes. In order to give them configurable parameters, one could create
function factories or use functools.partial()
to supply the extra
parameters. Here, we created classes that produce callable instances that mimic
functions. We did this because early on we ran into trouble with limitations of
pickle
and multiprocessing
, which can only capture functions and
classes defined in the top-level scope. This limitation is no longer relevant
to our purposes but is preserved here in history.
Generally, the callable must accept two arguments: a list of read alignments
(as pysam.AlignedSegment
) and a GenomicSegment
representing a query
interval. The callable must then return two values: a list of read alignments,
and a numpy.ndarray
of values.
Generally, all reads are mapped within the callable. Only those that contribute data to coordinates contained within the query interval are returned.
The numpy.ndarray
of values may be multidimensional. The only
stipulation is that the last dimension represent the positions in the query
interval. For a 2D array, the positions would thus be columns. See
StratifiedVariableFivePrimeMapFactory
below for an example.
See also¶
- Module documentation for
genome_array
For a detailed discussion of mapping functions.
- class plastid.genomics.map_factories.CenterMapFactory(nibble=0)¶
Bases:
object
Read mapping tool for
BAMGenomeArray.set_mapping()
. nibble positions is removed from each side of each read alignment, and the N remaining positions are each apportioned 1/N of the read count, so that the entire read is counted once.- Parameters
- nibbleint >= 0, optional
Number of bases to remove from each side of read (Default: 0)
- Attributes
nibble
Number of positions to trim from each side of read alignment before assigning genomic positions.
Methods
__call__
Returns reads covering a region, and a count vector mapping reads to specific positions in the region.
- nibble¶
Number of positions to trim from each side of read alignment before assigning genomic positions.
- class plastid.genomics.map_factories.FivePrimeMapFactory(offset=0)¶
Bases:
object
Fiveprime mapping factory for
BAMGenomeArray.set_mapping()
. Reads are mapped at offset nucleotides from the fiveprime end of their alignment.- Parameters
- offsetint >= 0, optional
Offset from 5’ end of read, in direction of 3’ end, at which reads should be counted (Default: 0)
- Attributes
offset
Distance from 5’ end of read at which to assign reads
Methods
__call__
Returns reads covering a region, and a count vector mapping reads to specific positions in the region, mapping reads at self.offset from the fiveprime end of each read.
- offset¶
Distance from 5’ end of read at which to assign reads
- class plastid.genomics.map_factories.SizeFilterFactory(min=1, max=- 1)¶
Bases:
object
Create a read-length filter can be applied at runtime to a
BAMGenomeArray
using :BAMGenomeArray.add_filter()
.- Parameters
- minfloat, optional
Minimum read length to pass filter, inclusive (Default: 1)
- maxfloat or numpy.inf, optional
Maximum read length to pass filter, inclusive. If set to -1, then there is no maximum length filter. (Default: -1, no filter)
Methods
__call__
(*args, **kwargs)Call self as a function.
- class plastid.genomics.map_factories.StratifiedVariableFivePrimeMapFactory(offset_dict, min=25, max=35)¶
Bases:
plastid.genomics.map_factories.VariableFivePrimeMapFactory
Fiveprime-variable mapping for
BAMGenomeArray.set_mapping()
, stratified by read length.Reads are mapped into a 2D array of counts at each position in a region of interest (column), stratified by read length (row). Mapping is performed by applying a user-specified offset from the fiveprime end of each alignment. A unique offset may be supplied for each read length.
Reads outside of pre-specified minimum and maximum lengths are ignored.
- Parameters
- offset_dictdict or None
Dictionary mapping read lengths to offsets that should be applied to reads of that length during mapping. A special key, ‘default’ may be supplied to provide a default value for lengths not specifically enumerated in offset_dict. If None, all reads are mapped to their 5’ ends.
- min_lengthint
Minimum length read to map
- max_lengthint
Maximum length read to map, inclusive
- Attributes
Methods
__call__
Map reads covering seg into a 2D count array, in which reads at each nucleotide position (column) are stratified by read length (row), and, optionally, offset from their 5' ends.
from_file
(fn_or_fh)Create a
VariableFivePrimeMapFactory
from a text file.- from_file(fn_or_fh)¶
Create a
VariableFivePrimeMapFactory
from a text file.The text file should be formatted as by the
psite
script: a tab-delimited two-column text file in which the first column is a read length (or the special word ‘default’), and the second column an offset from the 5’ end of the read corresponding to that read length. For example:25 12 26 12 27 13 28 13 29 14 30 14 default 14
- Parameters
- fn_or_fhstr or open file-like
If str, it should be a path to the text file. If file-like, it should be an open file-like object pointing to the data that would be in the text file.
- row_keys¶
numpy array of read lengths corresponding to each row of mapped data.
- shape¶
list containing size of each axis of output, excluding the final axis, which is specified the length of incoming GenomicSegments.
- class plastid.genomics.map_factories.ThreePrimeMapFactory(offset=0)¶
Bases:
object
Threeprime mapping factory for
BAMGenomeArray.set_mapping()
. Reads are mapped offset nucleotides from the threeprime end of their alignments.- Parameters
- offsetint >= 0, optional
Offset from 3’ end of read, in direction of 5’ end, at which reads should be counted (Default: 0)
- Attributes
offset
Distance from 3’ end of read at which to assign reads
Methods
__call__
Returns reads covering a region, and a count vector mapping reads to specific positions in the region, mapping reads at self.offset from the threeprime end of each read.
- offset¶
Distance from 3’ end of read at which to assign reads
- class plastid.genomics.map_factories.VariableFivePrimeMapFactory(offset_dict)¶
Bases:
object
Fiveprime-variable mapping for
BAMGenomeArray.set_mapping()
. Reads are mapped at a user-specified offsets from the fiveprime end of each alignment. The offset for a read of a given length is supplied in offset_dict[readlen]- Parameters
- offset_dictdict
Dictionary mapping read lengths to offsets that should be applied to reads of that length during mapping. A special key, ‘default’ may be supplied to provide a default value for lengths not specifically enumerated in offset_dict
Methods
__call__
Returns reads covering a region, and a count vector mapping reads to specific positions in the region, mapping reads at possibly varying offsets from the 5' end of each read as determined by self.offset_dict
from_file
(fn_or_fh)Create a
VariableFivePrimeMapFactory
from a text file.- from_file(fn_or_fh)¶
Create a
VariableFivePrimeMapFactory
from a text file.The text file should be formatted as by the
psite
script: a tab-delimited two-column text file in which the first column is a read length (or the special word ‘default’), and the second column an offset from the 5’ end of the read corresponding to that read length. For example:25 12 26 12 27 13 28 13 29 14 30 14 default 14
- Parameters
- fn_or_fhstr or open file-like
If str, it should be a path to the text file. If file-like, it should be an open file-like object pointing to the data that would be in the text file.