plastid.genomics.genome_array module¶

GenomeArrays map quantitative data or read alignments to genomic positions.

Summary
Module contents
Extended summary & examples

Summary ¶

GenomeArrays are typically accessed using SegmentChains.

The get_counts() method of SegmentChain and Transcript fetches arrays of data from GenomeArrays. The result is a numpy array in which each position corresponds to a position in the SegmentChain, moving from 5’ to 3’, in the spliced SegmentChain or Transcript. For a Transcript called my_transcript:

>>> # open a BAM file, map reads 15 nucleotides from 3' end
>>> genome_array = BAMGenomeArray("some_file.bam")
>>> genome_array.set_mapping(ThreePrimeMapFactory(offset=15))

>>> my_data = my_transcript.get_counts(genome_array)
>>> my_data
    array([ 95, 133,  87, 491, 100,  73, 308, 217, 447, 204, 245, 264, 395,
            76, 387, 319,  43, 171, 101, 152,  58, 167,  53, 397, 228,  44,
           402, 285, 480,  55,   9,  51, 358, 307, 144, 182, 293, 455, 403,
           375, 443, 442, 196, 133, 238, 268,  70,  91, 172, 490, 346, 294,
           117, 424, 180,  14, 400, 414, 257, 127, 344, 307, 435, 152, 458,
           423,  60, 173, 362, 277, 393, 297, 377, 357,  68, 143, 467,   7,
            86, 450, 230,  62, 278,  65, 346, 103, 185,   7, 350, 162, 461,
           274, 205, 185, 209, 275, 190, 434,  10, 241])

Alternatively, SegmentChains and GenomicSegments can be used directly as dictionary keys:

>>> my_data = genome_array[my_transcript]
>>> my_data
    array([ 95, 133,  87, 491, 100,  73, 308, 217, 447, 204, 245, 264, 395,
            76, 387, 319,  43, 171, 101, 152,  58, 167,  53, 397, 228,  44,
           402, 285, 480,  55,   9,  51, 358, 307, 144, 182, 293, 455, 403,
           375, 443, 442, 196, 133, 238, 268,  70,  91, 172, 490, 346, 294,
           117, 424, 180,  14, 400, 414, 257, 127, 344, 307, 435, 152, 458,
           423,  60, 173, 362, 277, 393, 297, 377, 357,  68, 143, 467,   7,
            86, 450, 230,  62, 278,  65, 346, 103, 185,   7, 350, 162, 461,
           274, 205, 185, 209, 275, 190, 434,  10, 241])

Module contents ¶

Several implementations are provided. These are tailored to data stored in specific formats, or for specific purposes. For example, mutable GenomeArrays allow values to be changed in-place, e.g. via mathematical operations or manual setting.

Implementation	Format of data	Mutable	Use of mapping functions
`BAMGenomeArray`	BAM file(s)	No	Inline; may be changed without reloading data
`BigWigGenomeArray`	BigWig file(s)	No	n/a
`GenomeArray`	bowtie, wiggle, bedGraph, or data in memory	Yes	On import from bowtie files only. Data must be reimported to change mapping
`SparseGenomeArray`	bowtie, wiggle, bedGraph, or data in memory	Yes	On import from bowtie files only. Data must be reimported to change mapping

Extended summary & examples ¶

Mapping functions ¶

GenomeArrays use Mapping functions to extract the biology of interest from read alignments, and to vectorize these data over genomic positions.

Depending on the data type and the purpose of the experiment, different functions may be appropriate. The following mapping rules are provided, although users are encouraged to provide their own if needed. These include:

Mapping strategy	Description	Sample use
Fiveprime	Each read alignment is mapped to its 5’ end, or at a fixed distance from its 5’ end.	RNA-seq, ribosome profiling, DMS-seq
Variable fiveprime	Each read alignment is mapped at a fixed distance from its 5’ end, where the distance is determined by the length of the alignment	Ribosome profiling
Stratified variable fiveprime	Like variable fiveprime mapping, except reads are stratified by length into separate rows of an array	Ribosome profiling
Threeprime	Each read alignment is mapped to its 5’ end, or at a fixed distance from its 5’ end.	RNA-seq, ribosome profiling
Center/entire	Zero or more positions are trimmed from each end of the read alignment, and the remaining N positions are incremented by 1/N	Ribosome profiling with micrococcal nuclease, RNA-seq

Mapping functions are rarely called directly. Instead, they are invoked via BAMGenomeArray.set_mapping():

# open BAM file
>>> ga = BAMGenomeArray("some_file.bam")

# set mapping 15 bases from 3' end of read
>>> ga.set_mapping(ThreePrimeMapFactory(offset=15))

# set mapping to 5' end of read
>>> ga.set_mapping(FivePrimeMapFactory())

# cut 3 nt off each end of the read, and map fractionally over the rest
>>> ga.set_mapping(CenterMapFactory(nibble=12))

# access reads in some region
>>> ga[GenomicSegment("chrI",10000,100000,"+")]
# output is numpy array

For further details see map_factories and Read mapping functions

Export to browser tracks ¶

GenomeArrays can export data to wiggle or bedGraph formats. To export forward-strand data to some_file.wig:

>>> genome_array.to_bedgraph(open("some_file.wig","w"),"my_track","+")

Normalization ¶

GenomeArrays also offer utilities for summing and normalization:

>>> genome_array.sum() # get sum
71292045

# set sum to an arbitrary value- can be useful for normalizing between
# multiple alignment files
>>> genome_array.set_sum(61241234324)
>>> genome_array.sum()
61241234324

# reset sum to data in file
>>> genome_array.reset_sum()
>>> genome_array.sum()
71292045

# with set_normalize(), all arrays will be normalized to counts per million
# relative to the current sum
>>> genome_array.set_normalize(True)
>>> my_transcript.get_counts(genome_array)
array([ 1.33254699,  1.86556579,  1.22033251,  6.88716392,  1.40268104,
        1.02395716,  4.32025761,  3.04381786,  6.26998426,  2.86146933,
        3.43656855,  3.70307795,  5.54059012,  1.06603759,  5.42837563,
        4.47455253,  0.60315285,  2.39858458,  1.41670785,  2.13207518,
        0.813555  ,  2.34247734,  0.74342095,  5.56864374,  3.19811278,
        0.61717966,  5.63877779,  3.99764097,  6.732869  ,  0.77147457,
        0.12624129,  0.71536733,  5.02159813,  4.3062308 ,  2.0198607 ,
        2.5528795 ,  4.10985545,  6.38219874,  5.6528046 ,  5.26005391,
        6.21387702,  6.19985021,  2.74925484,  1.86556579,  3.33838088,
        3.75918519,  0.98187673,  1.27643975,  2.41261139,  6.87313711,
        4.85327641,  4.12388226,  1.64113682,  5.94736762,  2.52482588,
        0.19637535,  5.61072417,  5.80709952,  3.60489028,  1.78140492,
        4.82522279,  4.3062308 ,  6.10166253,  2.13207518,  6.42427917,
        5.93334081,  0.84160863,  2.4266382 ,  5.07770537,  3.88542649,
        5.5125365 ,  4.1659627 ,  5.28810753,  5.00757132,  0.95382311,
        2.00583389,  6.55052047,  0.09818767,  1.2063057 ,  6.31206469,
        3.2261664 ,  0.86966225,  3.8994533 ,  0.91174268,  4.85327641,
        1.44476147,  2.59495993,  0.09818767,  4.90938365,  2.27234329,
        6.46635961,  3.84334606,  2.87549614,  2.59495993,  2.93160338,
        3.85737287,  2.66509398,  6.08763572,  0.1402681 ,  3.38046131])

# similarly, with set_normalize() on, browser tracks we'll be normalized 
>>> genome_array.to_bedgraph(open("some_file_normalized.wig","w"),"my_normalized_track","+")

class plastid.genomics.genome_array.BAMGenomeArray(*bamfiles, mapping=CenterMapFactory())[source]¶

Bases: plastid.genomics.genome_array.AbstractGenomeArray

A GenomeArray for read alignments in BAM files.

When a user requests read alignments or counts at a set of genomic positions, BAMGenomeArray uses a mapping rule to determine which reads from the BAM file should be returned and how these reads should be mapped to positions in a vector. Reads are not mapped until requested, and mapping rules are changeable at runtime via BAMGenomeArray.set_mapping()

Parameters

bamfileOne or more filenames or open pysam.AlignmentFile: Filename(s) of BAM files, or open pysam.AlignmentFile to include in array. Note: all BAM files must be sorted and indexed by samtools.
mappingfunc: mapping function that determines how each read alignment is mapped to a count at a genomic position. Examples include mapping reads to their fiveprime ends, mapping reads to their threeprime ends, or mapping somewhere in the middle. Factories to produce such functions are provided. See references below. (Default: CenterMapFactory())

Notes

The alignment data in BAMGenomeArray objects is immutable. If you need to change the data in-place, the BAMGenomeArray can be converted to a GenomeArray or SparseGenomeArray via to_genome_array()

Attributes

map_fnfunction: mapping function used to convert read alignments to counts
bamfileslist: BAM files or read alignments, as a list of pysam.AlignmentFile

Methods

`add_filter`(name, func)	Apply a function to filter reads retrieved from regions before mapping and counting
`chroms`()	Returns a list of chromosomes
`get`(roi[, roi_order])	Retrieve array of counts from a region of interest, following the mapping rule set by `set_mapping()`.
`get_mapping`()	Return the docstring of the current mapping function
`get_reads`(roi)	Returns reads covering a `GenomicSegment`.
`get_reads_and_counts`(roi[, roi_order])	Return read alignments covering a `GenomicSegment`, and a count vector mapping reads to each positions in the `GenomicSegment`, following the rule specified by `set_mapping()`.
`lengths`()	Returns a dictionary mapping chromosome names to lengths.
`remove_filter`(name)	Remove a generic filter
`reset_sum`()	Reset the sum to the total number of mapped reads in the `BAMGenomeArray`
`set_mapping`(mapping_function)	Change the mapping rule
`set_normalize`([value])	Toggle normalization of reported values to reads per million mapped in the dataset
`set_sum`(val)	Set sum used for normalization to an arbitrary value (e.g.
`strands`()	Return a tuple of strands in the GenomeArray
`sum`()	Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
`to_bedgraph`(fh, trackname, strand[, ...])	Write the contents of the `BAMGenomeArray` to a bedGraph file under the mapping rule set by `set_mapping()`.
`to_genome_array`([array_type])	Converts `BAMGenomeArray` to a `GenomeArray` or `SparseGenomeArray` under the mapping rule set by `set_mapping()`
`to_variable_step`(fh, trackname, strand[, ...])	Write the contents of the `BAMGenomeArray` to a variableStep Wiggle file under the mapping rule set by `set_mapping()`.

add_filter(name, func)[source]¶

Apply a function to filter reads retrieved from regions before mapping and counting

Parameters

namestr: A name for the filter. If not unique, will overwrite previous filter
funcfunc: Filter function. Function must take a pysam.AlignedSegment as a single parameter, and return True if that read should be included in output, or False otherwise

See also

plastid.genomics.map_factories.SizeFilterFactory: generate filter functions that gate read alignments on size

Notes

In Python, lambda functions do NOT have their own scope! We strongly recomend defining filter functions using the def syntax to avoid namespace collisions.

chroms()[source]¶

Returns a list of chromosomes

Returns

list: sorted chromosome names as strings

get(roi, roi_order=True)[source]¶

Retrieve array of counts from a region of interest, following the mapping rule set by set_mapping(). Values in the vector are ordered 5’ to 3’ relative to roi rather than the genome (i.e. are reversed for reverse-strand features).

Parameters

roiGenomicSegment or SegmentChain: Region of interest in genome
roi_orderbool, optional: If True (default) return vector of values 5’ to 3’ relative to vector rather than genome.

Returns

numpy.ndarray: vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi

Raises

ValueError: if bamfiles not sorted or not indexed

See also

plastid.genomics.roitools.SegmentChain.get_counts: Fetch a spliced vector of data covering a SegmentChain

get_mapping()[source]¶: Return the docstring of the current mapping function

get_reads(roi)[source]¶

Returns reads covering a GenomicSegment. Reads are strand-matched to roi by default, and are included or excluded depending upon the rule specified by set_mapping(). To obtain unstranded reads, set the value of roi.strand to ‘.’

Parameters

roiGenomicSegment: Region of interest

Returns

list: List of reads (as pysam.AlignedSegment) covering region of interest, according to the rule set in set_mapping()

Raises

ValueError: if bamfiles are not sorted or not indexed

get_reads_and_counts(roi, roi_order=True)[source]¶

Return read alignments covering a GenomicSegment, and a count vector mapping reads to each positions in the GenomicSegment, following the rule specified by set_mapping(). Reads are strand-matched to roi by default. To obtain unstranded reads, set the value of roi.strand to ‘.’

Parameters

roiGenomicSegment: Region of interest

Returns

list: List of reads (as pysam.AlignedSegment) covering region of interest
numpy.ndarray: Counts at each position of roi, under whatever mapping rule was set by set_mapping()

Raises

ValueError: if bamfiles not sorted or not indexed

lengths()[source]¶

Returns a dictionary mapping chromosome names to lengths.

Returns

dict: mapping chromosome names to lengths

remove_filter(name)[source]¶

Remove a generic filter

Parameters

namestr: A name for the filter

Returns

func: the removed filter function

reset_sum()[source]¶

Reset the sum to the total number of mapped reads in the BAMGenomeArray

Notes

Filters are not applied in this summation. See BAMGenomeArray.set_sum() to manually set a sum, which e.g. could correspond to a total number of filtered reads.

set_mapping(mapping_function)[source]¶

Change the mapping rule

Parameters

mappingfunc: Function that determines how each read alignment is mapped to a count at a position in the BAMGenomeArray. Examples include mapping reads to their fiveprime or threeprime ends, with or without offsets. Factories to produce such functions are provided. See references below.

See also

plastid.genomics.map_factories.FivePrimeMapFactory: map reads to 5’ ends, with or without applying an offset
plastid.genomics.map_factories.VariableFivePrimeMapFactory: map reads to 5’ ends, choosing an offset determined by read length
plastid.genomics.map_factories.ThreePrimeMapFactory: map reads to 3’ ends, with or without applying an offset
plastid.genomics.map_factories.CenterMapFactory: map each read fractionally to every position in the read, optionally trimming positions from the ends first

set_normalize(value=True)¶

Toggle normalization of reported values to reads per million mapped in the dataset

Parameters

valuebool: If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.

set_sum(val)¶

Set sum used for normalization to an arbitrary value (e.g. from another dataset)

Parameters

valint or float: a number

strands()¶

Return a tuple of strands in the GenomeArray

Returns

tuple: Chromosome strands as strings

sum()¶

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

Returns

int or float

Notes

The true (i.e. unnormalized) sum is always reported, even if set_normalize() is set to True

to_bedgraph(fh, trackname, strand, window_size=100000, printer=None, **kwargs)[source]¶

Write the contents of the BAMGenomeArray to a bedGraph file under the mapping rule set by set_mapping().

See the bedGraph spec for format details.

Parameters

fhfile-like: Filehandle to write to
tracknamestr: Name of browser track
strandstr: Strand of BAMGenomeArray to export. ‘+’, ‘-’, or ‘.’
window_sizeint: Size of chromosome/contig to process at a time. Larger values are faster but less memory-efficient
printerfile-like, optional: Something implementing a write() method for output
**kwargs: Any other key-value pairs to include in track definition line

to_genome_array(array_type=None)[source]¶

Converts BAMGenomeArray to a GenomeArray or SparseGenomeArray under the mapping rule set by set_mapping()

Parameters

array_typeclass: Type of GenomeArray to return. GenomeArray or SparseGenomeArray. (Default: GenomeArray)

Returns

GenomeArray or SparseGenomeArray

to_variable_step(fh, trackname, strand, window_size=100000, printer=None, **kwargs)[source]¶

Write the contents of the BAMGenomeArray to a variableStep Wiggle file under the mapping rule set by set_mapping().

See the Wiggle spec for format details.

Parameters

fhfile-like: Filehandle to write to
tracknamestr: Name of browser track
strandstr: Strand of BAMGenomeArray to export. ‘+’, ‘-’, or ‘.’
window_sizeint: Size of chromosome/contig to process at a time. Larger values are faster but less memory-efficient
printerfile-like, optional: Something implementing a write() method for output
**kwargs: Any other key-value pairs to include in track definition line

class plastid.genomics.genome_array.BigWigGenomeArray(maxmem=0)[source]¶

Bases: plastid.genomics.genome_array.AbstractGenomeArray

High-performance GenomeArray for BigWig files.

Parameters

maxmemfloat: Maximum desired memory footprint for C objects, in megabytes. May be temporarily exceeded if large queries are requested. (Default: 0, No maximum)

Methods

`add_from_bigwig`(filename, strand)	Import additional data from a BigWig file
`chroms`()	Return a list of chromosomes in the GenomeArray
`get`(roi[, roi_order])	Retrieve array of counts from a region of interest.
`lengths`()	Return a dictionary mapping chromosome names to lengths.
`reset_sum`()	Reset sum to total of data in the `BigWigGenomeArray`
`set_normalize`([value])	Toggle normalization of reported values to reads per million mapped in the dataset
`set_sum`(val)	Set sum used for normalization to an arbitrary value (e.g.
`strands`()	Return a tuple of strands in the GenomeArray
`sum`()	Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
`to_genome_array`()	Converts `BigWigGenomeArray` to a `GenomeArray`

add_from_bigwig(filename, strand)[source]¶

Import additional data from a BigWig file

Parameters

filenamestr: Path to BigWig file
strandstr: Strand to which data should be added. ‘+’, ‘-’, or ‘.’

chroms()[source]¶

Return a list of chromosomes in the GenomeArray

Returns

list: Chromosome names as strings

get(roi, roi_order=True)[source]¶

Retrieve array of counts from a region of interest.

Parameters

roiGenomicSegment or SegmentChain: Region of interest in genome
roi_orderbool, optional: If True (default) return vector of values 5’ to 3’ relative to vector rather than genome.

Returns

numpy.ndarray: vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi

See also

plastid.genomics.roitools.SegmentChain.get_counts: Fetch a spliced vector of data covering a SegmentChain

lengths()[source]¶

Return a dictionary mapping chromosome names to lengths. When two strands report different lengths for a chromosome, the max length is taken.

Returns

dict: Dictionary mapping chromosome names to chromosome lengths

reset_sum()[source]¶: Reset sum to total of data in the BigWigGenomeArray

set_normalize(value=True)¶

Toggle normalization of reported values to reads per million mapped in the dataset

Parameters

valuebool: If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.

set_sum(val)¶

Set sum used for normalization to an arbitrary value (e.g. from another dataset)

Parameters

valint or float: a number

strands()[source]¶

Return a tuple of strands in the GenomeArray

Returns

tuple: Chromosome strands as strings

sum()¶

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

Returns

int or float

Notes

The true (i.e. unnormalized) sum is always reported, even if set_normalize() is set to True

to_genome_array()[source]¶

Converts BigWigGenomeArray to a GenomeArray

Returns

GenomeArray

class plastid.genomics.genome_array.GenomeArray(chr_lengths=None, strands=None, min_chr_size=10000000)[source]¶

Bases: plastid.genomics.genome_array.MutableAbstractGenomeArray

Array-like data structure that maps numerical values (e.g. read counts or conservation data, et c) to nucleotide positions in a genome.

Data in the array may be manipulated manually and/or imported from Wiggle or BedGraph files, as well as bowtie alignments.

GenomeArray supports basic mathematical operations elementwise, where elements are nucleotide positions.

Parameters

chr_lengthsdict or None, optional: Dictionary mapping chromosome names to lengths. Suppyling this in advance yields considerable speed improvements, as memory can be pre-allocated for each chromosomal position. If not provided, a minimum chromosome size will be guessed, and chromosomes re-sized as needed.
min_chr_sizeint: If chr_lengths is not supplied, min_chr_size is the default first guess of a chromosome size. If the genome has large chromosomes, it is much much better, speed-wise, to provide chr_lengths than to provide a guess here that is too small.
strandssequence: Sequence of strand names for the GenomeArray. (Default: (‘+’,’-‘))

Methods

`add_from_bowtie`(fh, mapfunc[, min_length, ...])	Import alignment data in native bowtie format to the current GenomeArray
`add_from_wiggle`(fh, strand)	Import data from a Wiggle or bedGraph file to current GenomeArray
`apply_operation`(other, func[, mode])	Applies a binary operator to a copy of self and to other elementwise.
`chroms`()	Return a list of chromosomes in the GenomeArray
`get`(roi[, roi_order])	Retrieve array of counts from a region of interest (roi) with values in vector ordered 5' to 3' relative to roi rather than genome (i.e.
`iterchroms`()	Return an iterator of each chromosome strand array
`keys`()	Return names of chromosomes in the GenomeArray
`lengths`()	Return a dictionary mapping chromosome names to lengths.
`like`(other)	Return a `GenomeArray` of same dimension as the input array
`nonzero`()	Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair.
`plot`([chroms, strands])	Create a plot of coverage along tracks or chromosomes
`reset_sum`()	Reset the sum of the `GenomeArray` to the sum of all positions in the array
`set_normalize`([value])	Toggle normalization of reported values to reads per million mapped in the dataset
`set_sum`(val)	Set sum used for normalization to an arbitrary value (e.g.
`strands`()	Return a tuple of strands in the GenomeArray
`sum`()	Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
`to_bedgraph`(fh, trackname, strand[, printer])	Write the contents of the GenomeArray to a bedGraph file
`to_variable_step`(fh, trackname, strand[, ...])	Export the contents of the GenomeArray to a variable step Wiggle file.

add_from_bowtie(fh, mapfunc, min_length=25, max_length=inf, **trans_args)[source]¶

Import alignment data in native bowtie format to the current GenomeArray

Parameters

fhfile-like: filehandle pointing to bowtie file
mapfuncfunc: a mapping function that transforms alignment coordinates into a sequence of (GenomicSegment,value) pairs
min_lengthint, optional: minimum length for read to be counted (Default: 25)
max_lengthint or numpy.inf, optional: maximum length for read to be counted (Default: infinity)
trans_argsdict, optional: Keyword arguments to pass to transformation function

See also

five_prime_map: map reads to 5’ ends, with or without applying an offset
variable_five_prime_map: map reads to 5’ ends, choosing an offset determined by read length
three_prime_map: map reads to 3’ ends, with or without applying an offset
center_map: map each read fractionally to every position in the read, optionally trimming positions from the ends first

add_from_wiggle(fh, strand)[source]¶

Import data from a Wiggle or bedGraph file to current GenomeArray

Parameters

fhfile-like: filehandle pointing to wiggle file
strandstr: Strand to which data should be added. ‘+’, ‘-’, or ‘.’

apply_operation(other, func, mode='same')[source]¶

Applies a binary operator to a copy of self and to other elementwise. other may be a scalar quantity or another GenomeArray. In both cases, a new GenomeArray is returned, and self is left unmodified. If set_normalize() is set to True, it is disabled during the operation.

Parameters

otherfloat, int, or MutableAbstractGenomeArray

Second argument to func

funcfunc

Function to perform. This must take two arguments. a numpy.ndarray (chromosome-strand) from the GenomeArray will be supplied as the first argument, and the corresponding chromosome-strand from other as the second. This operation will be applied over all chromosomes and strands.

If mode is set to all, and a chromosome or strand is not present in one of self or other, zero will be supplied as the missing argument to func. func must handle this gracefully.

modestr, choice of ‘same’, ‘all’, or ‘truncate’

Only relevant if other is a MutableAbstractGenomeArray

If ‘same’ each set of corresponding chromosomes must have the same dimensions in both parent GenomeArray s. In addition, both GenomeArray s must have the same chromosomes, and the same strands.

If ‘all’, corresponding chromosomes in the parent GenomeArray s will be resized to the dimensions of the larger chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; all chromosomes and strands will be included in output.

If ‘truncate’,corresponding chromosomes in the parent GenomeArray s will be resized to the dimensions of the smaller chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; chromosomes or strands not common to both parents will be ignored.

Returns

GenomeArray: new GenomeArray after the operation is applied

chroms()¶

Return a list of chromosomes in the GenomeArray

Returns

list: Chromosome names as strings

get(roi, roi_order=True)[source]¶

Retrieve array of counts from a region of interest (roi) with values in vector ordered 5’ to 3’ relative to roi rather than genome (i.e. are reversed for reverse-strand features).

Parameters

roiGenomicSegment or SegmentChain: Region of interest in genome
roi_orderbool, optional: If True (default) return vector of values 5’ to 3’ relative to vector rather than genome.

Returns

numpy.ndarray: vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi

See also

plastid.genomics.roitools.SegmentChain.get_counts: Fetch a spliced vector of data covering a SegmentChain

iterchroms()[source]¶

Return an iterator of each chromosome strand array

Yields

numpy.ndarray: Positionwise counts on a chromosome strand

keys()[source]¶: Return names of chromosomes in the GenomeArray

lengths()¶

Return a dictionary mapping chromosome names to lengths. When two strands report different lengths for a chromosome, the max length is taken.

Returns

dict: Dictionary mapping chromosome names to chromosome lengths

static like(other)[source]¶

Return a GenomeArray of same dimension as the input array

Parameters

otherGenomeArray

Returns

GenomeArray: empty GenomeArray of same size as other

nonzero()[source]¶

Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair. Results are returned as a hierarchical dictionary mapping chromosome names, to a dictionary of strands, which in turn map to to arrays of non-zero indices on that chromosome-strand.

Returns

dict: dict[chrom][strand] = numpy.ndarray of indices

plot(chroms=None, strands=None, **plot_kw)[source]¶

Create a plot of coverage along tracks or chromosomes

Parameters

chromslist: Chromosomes to plot (default: all)
strandslist: Strands to plot (default: all)
plot_kw: A dictionary of keywords to pass to matplotlib

Returns

matplotlib.figure.Figure

reset_sum()[source]¶: Reset the sum of the GenomeArray to the sum of all positions in the array

set_normalize(value=True)¶

Toggle normalization of reported values to reads per million mapped in the dataset

Parameters

valuebool: If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.

set_sum(val)¶

Set sum used for normalization to an arbitrary value (e.g. from another dataset)

Parameters

valint or float: a number

strands()¶

Return a tuple of strands in the GenomeArray

Returns

tuple: Chromosome strands as strings

sum()¶

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

Returns

int or float

Notes

The true (i.e. unnormalized) sum is always reported, even if set_normalize() is set to True

to_bedgraph(fh, trackname, strand, printer=None, **kwargs)[source]¶

Write the contents of the GenomeArray to a bedGraph file

See the bedGraph spec for format details

Parameters

fhfile-like: Open filehandle to which data will be written
tracknamestr: Name of browser track
strandstr: Strand to export. ‘+’, ‘-’, or ‘.’
printerfile-like, optional: Something implementing a write() method for output
**kwargs: Any other key-value pairs to include in track definition line

to_variable_step(fh, trackname, strand, printer=None, **kwargs)[source]¶

Export the contents of the GenomeArray to a variable step Wiggle file. For sparse data, bedGraph can be more efficient format.

See the Wiggle spec for format details

Parameters

fhfile-like: Open filehandle to which data will be written
tracknamestr: Name of browser track
strandstr: Strand to export. ‘+’, ‘-’, or ‘.’
printerfile-like, optional: Something implementing a write() method for output
**kwargs: Any other key-value pairs to include in track definition line

class plastid.genomics.genome_array.MutableAbstractGenomeArray(chr_lengths=None, strands=None)[source]¶

Bases: plastid.genomics.genome_array.AbstractGenomeArray

Abstract base class for GenomeArray-like objects whose values can be changed

Methods

`chroms`()	Return a list of chromosomes in the GenomeArray
`lengths`()	Return a dictionary mapping chromosome names to lengths.
`reset_sum`()	Reset sum to total mapped reads in the GenomeArray
`set_normalize`([value])	Toggle normalization of reported values to reads per million mapped in the dataset
`set_sum`(val)	Set sum used for normalization to an arbitrary value (e.g.
`strands`()	Return a tuple of strands in the GenomeArray
`sum`()	Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

chroms()¶

Return a list of chromosomes in the GenomeArray

Returns

list: Chromosome names as strings

lengths()¶

Return a dictionary mapping chromosome names to lengths. When two strands report different lengths for a chromosome, the max length is taken.

Returns

dict: Dictionary mapping chromosome names to chromosome lengths

abstract reset_sum()¶: Reset sum to total mapped reads in the GenomeArray

set_normalize(value=True)¶

Toggle normalization of reported values to reads per million mapped in the dataset

Parameters

valuebool: If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.

set_sum(val)¶

Set sum used for normalization to an arbitrary value (e.g. from another dataset)

Parameters

valint or float: a number

strands()¶

Return a tuple of strands in the GenomeArray

Returns

tuple: Chromosome strands as strings

sum()¶

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

Returns

int or float

Notes

The true (i.e. unnormalized) sum is always reported, even if set_normalize() is set to True

class plastid.genomics.genome_array.SparseGenomeArray(chr_lengths=None, strands=None, min_chr_size=10000000)[source]¶

Bases: plastid.genomics.genome_array.GenomeArray

A memory-efficient sublcass of GenomeArray using sparse internal representation. Note, savings in memory may come at a cost in performance when repeatedly getting/setting values, compared to a GenomeArray.

Parameters

chr_lengthsdict, optional: Dictionary mapping chromosome names to lengths. Suppyling this parameter yields considerable speed improvements, as memory can be pre-allocated for each chromosomal position. If not provided, a minimum chromosome size will be guessed, and chromosomes re-sized as needed (Default: {} )
min_chr_sizeint,optional: If chr_lengths is not supplied, min_chr_size is the default first guess of a chromosome size. If your genome has large chromosomes, it is much much better, speed-wise, to provide a chr_lengths dict than to provide a guess here that is too small. (Default: %s)
strandssequence: Sequence of strand names for the GenomeArray. (Default: (‘+’,’-‘))

Methods

`add_from_bowtie`(fh, mapfunc[, min_length, ...])	Import alignment data in native bowtie format to the current GenomeArray
`add_from_wiggle`(fh, strand)	Import data from a Wiggle or bedGraph file to current GenomeArray
`apply_operation`(other, func[, mode])	Apply a binary operator to a copy of self and to other elementwise.
`chroms`()	Return a list of chromosomes in the GenomeArray
`get`(roi[, roi_order])	Retrieve array of counts from a region of interest (roi) with values in vector ordered 5' to 3' relative to roi rather than genome (i.e.
`iterchroms`()	Return an iterator of each chromosome strand array
`keys`()	Return names of chromosomes in the GenomeArray
`lengths`()	Return a dictionary mapping chromosome names to lengths.
`like`(other)	Return a `SparseGenomeArray` of same dimension as the input array
`nonzero`()	Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair.
`plot`([chroms, strands])	Create a plot of coverage along tracks or chromosomes
`reset_sum`()	Reset the sum of the `GenomeArray` to the sum of all positions in the array
`set_normalize`([value])	Toggle normalization of reported values to reads per million mapped in the dataset
`set_sum`(val)	Set sum used for normalization to an arbitrary value (e.g.
`strands`()	Return a tuple of strands in the GenomeArray
`sum`()	Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
`to_bedgraph`(fh, trackname, strand[, printer])	Write the contents of the GenomeArray to a bedGraph file
`to_variable_step`(fh, trackname, strand[, ...])	Export the contents of the GenomeArray to a variable step Wiggle file.

add_from_bowtie(fh, mapfunc, min_length=25, max_length=inf, **trans_args)¶

Import alignment data in native bowtie format to the current GenomeArray

Parameters

fhfile-like: filehandle pointing to bowtie file
mapfuncfunc: a mapping function that transforms alignment coordinates into a sequence of (GenomicSegment,value) pairs
min_lengthint, optional: minimum length for read to be counted (Default: 25)
max_lengthint or numpy.inf, optional: maximum length for read to be counted (Default: infinity)
trans_argsdict, optional: Keyword arguments to pass to transformation function

See also

five_prime_map: map reads to 5’ ends, with or without applying an offset
variable_five_prime_map: map reads to 5’ ends, choosing an offset determined by read length
three_prime_map: map reads to 3’ ends, with or without applying an offset
center_map: map each read fractionally to every position in the read, optionally trimming positions from the ends first

add_from_wiggle(fh, strand)¶

Import data from a Wiggle or bedGraph file to current GenomeArray

Parameters

fhfile-like: filehandle pointing to wiggle file
strandstr: Strand to which data should be added. ‘+’, ‘-’, or ‘.’

apply_operation(other, func, mode=None)[source]¶

Apply a binary operator to a copy of self and to other elementwise. other may be a scalar quantity or another GenomeArray. In both cases, a new SparseGenomeArray is returned, and self is left unmodified. If set_normalize() is set to True, it is disabled during the operation.

Parameters

otherfloat, int, or MutableAbstractGenomeArray

Second argument to func

funcfunc

Function to perform. This must take two arguments. a numpy.ndarray (chromosome-strand) from the SparseGenomeArray will be supplied as the first argument, and the corresponding chromosome-strand from other as the second. This operation will be applied over all chromosomes and strands.

If mode is set to all, and a chromosome or strand is not present in one of self or other, zero will be supplied as the missing argument to func. func must handle this gracefully.

modestr, choice of ‘same’, ‘all’, or ‘truncate’

Only relevant if other is a MutableAbstractGenomeArray

If ‘same’ each set of corresponding chromosomes must have the same dimensions in both parent SparseGenomeArrays. In addition, both SparseGenomeArrays must have the same chromosomes, and the same strands.

If ‘all’, corresponding chromosomes in the parent SparseGenomeArrays will be resized to the dimensions of the larger chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; all chromosomes and strands will be included in output.

If ‘truncate’,corresponding chromosomes in the parent SparseGenomeArrays will be resized to the dimensions of the smaller chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; chromosomes or strands not common to both parents will be ignored.

Returns

SparseGenomeArray: new SparseGenomeArray after the operation is applied

chroms()¶

Return a list of chromosomes in the GenomeArray

Returns

list: Chromosome names as strings

get(roi, roi_order=True)[source]¶

Retrieve array of counts from a region of interest (roi) with values in vector ordered 5’ to 3’ relative to roi rather than genome (i.e. are reversed for reverse-strand features).

Parameters

roiGenomicSegment or SegmentChain: Region of interest in genome
roi_orderbool, optional: If True (default) return vector of values 5’ to 3’ relative to roi rather than genome.

Returns

numpy.ndarray: vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi

See also

SegmentChain.get_counts: Fetch a spliced vector of data covering a SegmentChain

iterchroms()¶

Return an iterator of each chromosome strand array

Yields

numpy.ndarray: Positionwise counts on a chromosome strand

keys()¶: Return names of chromosomes in the GenomeArray

lengths()[source]¶

Return a dictionary mapping chromosome names to lengths. In the case where two strands report different lengths for a chromosome, the max length is taken.

Returns

dict: mapping chromosome names to lengths

static like(other)[source]¶

Return a SparseGenomeArray of same dimension as the input array

Parameters

otherGenomeArray or SparseGenomeArray

Returns

SparseGenomeArray: of same size as other

nonzero()[source]¶

Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair. Results are returned as a hierarchical dictionary mapping chromosome names, to a dictionary of strands, which in turn map to to arrays of non-zero indices on that chromosome-strand.

Returns

dict: dict[chrom][strand] = numpy.ndarray of indices

plot(chroms=None, strands=None, **plot_kw)¶

Create a plot of coverage along tracks or chromosomes

Parameters

chromslist: Chromosomes to plot (default: all)
strandslist: Strands to plot (default: all)
plot_kw: A dictionary of keywords to pass to matplotlib

Returns

matplotlib.figure.Figure

reset_sum()¶: Reset the sum of the GenomeArray to the sum of all positions in the array

set_normalize(value=True)¶

Toggle normalization of reported values to reads per million mapped in the dataset

Parameters

valuebool: If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.

set_sum(val)¶

Set sum used for normalization to an arbitrary value (e.g. from another dataset)

Parameters

valint or float: a number

strands()¶

Return a tuple of strands in the GenomeArray

Returns

tuple: Chromosome strands as strings

sum()¶

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

Returns

int or float

Notes

The true (i.e. unnormalized) sum is always reported, even if set_normalize() is set to True

to_bedgraph(fh, trackname, strand, printer=None, **kwargs)¶

Write the contents of the GenomeArray to a bedGraph file

See the bedGraph spec for format details

Parameters

fhfile-like: Open filehandle to which data will be written
tracknamestr: Name of browser track
strandstr: Strand to export. ‘+’, ‘-’, or ‘.’
printerfile-like, optional: Something implementing a write() method for output
**kwargs: Any other key-value pairs to include in track definition line

to_variable_step(fh, trackname, strand, printer=None, **kwargs)¶

Export the contents of the GenomeArray to a variable step Wiggle file. For sparse data, bedGraph can be more efficient format.

See the Wiggle spec for format details

Parameters

fhfile-like: Open filehandle to which data will be written
tracknamestr: Name of browser track
strandstr: Strand to export. ‘+’, ‘-’, or ‘.’
printerfile-like, optional: Something implementing a write() method for output
**kwargs: Any other key-value pairs to include in track definition line

plastid.genomics.genome_array.center_map(feature, **kwargs)[source]¶

Center-mapping function used as an argument to GenomeArray.add_from_bowtie(). A user-specified number of bases is optionally removed from each side of each read alignment, and the N remaining bases are each apportioned 1/N of the read count, so that the entire read is counted once.

Parameters

featureSegmentChain: Ungapped genomic alignment
kwargs[‘value’]float, optional: Total value to divide over aligning positions (Default: 1.0)
kwargs[‘nibble’]int, optional: Positions to remove from each end before mapping (Default: 0)
kwargs[‘offset’]int, optional: Mapping offset, if any, from 5’ end of read (Default: 0)

Returns

list: tuples of (GenomicSegment,float value over segment)

plastid.genomics.genome_array.five_prime_map(feature, **kwargs)[source]¶

Fiveprime mapping function used as an argument to GenomeArray.add_from_bowtie(). Reads are mapped at a user-specified offset from the fiveprime end of the alignment.

Parameters

featureSegmentChain: Ungapped genomic alignment
kwargs[‘value’]float or int, optional: Value to apportion (Default: 1)
kwargs[‘offset’]int, optional: Mapping offset, if any, from 5’ toward 3’ end of read

Returns

list: tuples of (GenomicSegment,float value over segment)

plastid.genomics.genome_array.three_prime_map(feature, **kwargs)[source]¶

Threeprime mapping function used as an argument to GenomeArray.add_from_bowtie(). Reads are mapped at a user-specified offset from the threeprime end of the alignment.

Parameters

featureSegmentChain: Ungapped genomic alignment
kwargs[‘value’]float or int, optional: Value to apportion (Default: 1)
kwargs[‘offset’]int, optional: Mapping offset, if any, from 3’ toward 5’ end of read.

Returns

list: tuples of (GenomicSegment,float value over segment)

plastid.genomics.genome_array.variable_five_prime_map(feature, **kwargs)[source]¶

Fiveprime variable mapping function used as an argument to GenomeArray.add_from_bowtie(). Reads are mapped at a user-specified offset from the fiveprime end of the alignment. The offset for a read of a given length is supplied in kwargs[offset][readlen]

Parameters

featureSegmentChain: Ungapped genomic alignment
kwargs[‘offset’]dict, required: Dictionary mapping read lengths to offsets
kwargs[‘value’]float or int, optional: Value to apportion (Default: 1)

Returns

list: tuples of (GenomicSegment,float value over segment)

plastid.genomics.genome_array module¶

Summary¶

Module contents¶

Extended summary & examples¶

Mapping functions¶

Export to browser tracks¶

Normalization¶

Summary ¶

Module contents ¶

Extended summary & examples ¶

Mapping functions ¶

Export to browser tracks ¶

Normalization ¶