plastid.genomics.genome_array module

GenomeArrays map quantitative data or read alignments to genomic positions.

Summary

GenomeArrays are typically accessed using SegmentChains.

The get_counts() method of SegmentChain and Transcript fetches arrays of data from GenomeArrays. The result is a numpy array in which each position corresponds to a position in the SegmentChain, moving from 5’ to 3’, in the spliced SegmentChain or Transcript. For a Transcript called my_transcript:

>>> # open a BAM file, map reads 15 nucleotides from 3' end
>>> genome_array = BAMGenomeArray("some_file.bam")
>>> genome_array.set_mapping(ThreePrimeMapFactory(offset=15))

>>> my_data = my_transcript.get_counts(genome_array)
>>> my_data
    array([ 95, 133,  87, 491, 100,  73, 308, 217, 447, 204, 245, 264, 395,
            76, 387, 319,  43, 171, 101, 152,  58, 167,  53, 397, 228,  44,
           402, 285, 480,  55,   9,  51, 358, 307, 144, 182, 293, 455, 403,
           375, 443, 442, 196, 133, 238, 268,  70,  91, 172, 490, 346, 294,
           117, 424, 180,  14, 400, 414, 257, 127, 344, 307, 435, 152, 458,
           423,  60, 173, 362, 277, 393, 297, 377, 357,  68, 143, 467,   7,
            86, 450, 230,  62, 278,  65, 346, 103, 185,   7, 350, 162, 461,
           274, 205, 185, 209, 275, 190, 434,  10, 241])

Alternatively, SegmentChains and GenomicSegments can be used directly as dictionary keys:

>>> my_data = genome_array[my_transcript]
>>> my_data
    array([ 95, 133,  87, 491, 100,  73, 308, 217, 447, 204, 245, 264, 395,
            76, 387, 319,  43, 171, 101, 152,  58, 167,  53, 397, 228,  44,
           402, 285, 480,  55,   9,  51, 358, 307, 144, 182, 293, 455, 403,
           375, 443, 442, 196, 133, 238, 268,  70,  91, 172, 490, 346, 294,
           117, 424, 180,  14, 400, 414, 257, 127, 344, 307, 435, 152, 458,
           423,  60, 173, 362, 277, 393, 297, 377, 357,  68, 143, 467,   7,
            86, 450, 230,  62, 278,  65, 346, 103, 185,   7, 350, 162, 461,
           274, 205, 185, 209, 275, 190, 434,  10, 241])

Module contents

Several implementations are provided. These are tailored to data stored in specific formats, or for specific purposes. For example, mutable GenomeArrays allow values to be changed in-place, e.g. via mathematical operations or manual setting.

Implementation Format of data Mutable Use of mapping functions
BAMGenomeArray BAM file(s) No Inline; may be changed without reloading data
BigWigGenomeArray BigWig file(s) No n/a
GenomeArray bowtie, wiggle, bedGraph, or data in memory Yes On import from bowtie files only. Data must be reimported to change mapping
SparseGenomeArray bowtie, wiggle, bedGraph, or data in memory Yes On import from bowtie files only. Data must be reimported to change mapping

Extended summary & examples

Mapping functions

GenomeArrays use Mapping functions to extract the biology of interest from read alignments, and to vectorize these data over genomic positions.

Depending on the data type and the purpose of the experiment, different functions may be appropriate. The following mapping rules are provided, although users are encouraged to provide their own if needed. These include:

Mapping strategy Description Sample use
Fiveprime Each read alignment is mapped to its 5’ end, or at a fixed distance from its 5’ end. RNA-seq, ribosome profiling, DMS-seq
Variable fiveprime Each read alignment is mapped at a fixed distance from its 5’ end, where the distance is determined by the length of the alignment Ribosome profiling
Stratified variable fiveprime Like variable fiveprime mapping, except reads are stratified by length into separate rows of an array Ribosome profiling
Threeprime Each read alignment is mapped to its 5’ end, or at a fixed distance from its 5’ end. RNA-seq, ribosome profiling
Center/entire Zero or more positions are trimmed from each end of the read alignment, and the remaining N positions are incremented by 1/N Ribosome profiling with micrococcal nuclease, RNA-seq

Mapping functions are rarely called directly. Instead, they are invoked via BAMGenomeArray.set_mapping():

# open BAM file
>>> ga = BAMGenomeArray("some_file.bam")

# set mapping 15 bases from 3' end of read
>>> ga.set_mapping(ThreePrimeMapFactory(offset=15))

# set mapping to 5' end of read
>>> ga.set_mapping(FivePrimeMapFactory())

# cut 3 nt off each end of the read, and map fractionally over the rest
>>> ga.set_mapping(CenterMapFactory(nibble=12))

# access reads in some region
>>> ga[GenomicSegment("chrI",10000,100000,"+")]
# output is numpy array

For further details see map_factories and Read mapping functions

Export to browser tracks

GenomeArrays can export data to wiggle or bedGraph formats. To export forward-strand data to some_file.wig:

>>> genome_array.to_bedgraph(open("some_file.wig","w"),"my_track","+")

Normalization

GenomeArrays also offer utilities for summing and normalization:

>>> genome_array.sum() # get sum
71292045

# set sum to an arbitrary value- can be useful for normalizing between
# multiple alignment files
>>> genome_array.set_sum(61241234324)
>>> genome_array.sum()
61241234324

# reset sum to data in file
>>> genome_array.reset_sum()
>>> genome_array.sum()
71292045

# with set_normalize(), all arrays will be normalized to counts per million
# relative to the current sum
>>> genome_array.set_normalize(True)
>>> my_transcript.get_counts(genome_array)
array([ 1.33254699,  1.86556579,  1.22033251,  6.88716392,  1.40268104,
        1.02395716,  4.32025761,  3.04381786,  6.26998426,  2.86146933,
        3.43656855,  3.70307795,  5.54059012,  1.06603759,  5.42837563,
        4.47455253,  0.60315285,  2.39858458,  1.41670785,  2.13207518,
        0.813555  ,  2.34247734,  0.74342095,  5.56864374,  3.19811278,
        0.61717966,  5.63877779,  3.99764097,  6.732869  ,  0.77147457,
        0.12624129,  0.71536733,  5.02159813,  4.3062308 ,  2.0198607 ,
        2.5528795 ,  4.10985545,  6.38219874,  5.6528046 ,  5.26005391,
        6.21387702,  6.19985021,  2.74925484,  1.86556579,  3.33838088,
        3.75918519,  0.98187673,  1.27643975,  2.41261139,  6.87313711,
        4.85327641,  4.12388226,  1.64113682,  5.94736762,  2.52482588,
        0.19637535,  5.61072417,  5.80709952,  3.60489028,  1.78140492,
        4.82522279,  4.3062308 ,  6.10166253,  2.13207518,  6.42427917,
        5.93334081,  0.84160863,  2.4266382 ,  5.07770537,  3.88542649,
        5.5125365 ,  4.1659627 ,  5.28810753,  5.00757132,  0.95382311,
        2.00583389,  6.55052047,  0.09818767,  1.2063057 ,  6.31206469,
        3.2261664 ,  0.86966225,  3.8994533 ,  0.91174268,  4.85327641,
        1.44476147,  2.59495993,  0.09818767,  4.90938365,  2.27234329,
        6.46635961,  3.84334606,  2.87549614,  2.59495993,  2.93160338,
        3.85737287,  2.66509398,  6.08763572,  0.1402681 ,  3.38046131])

# similarly, with set_normalize() on, browser tracks we'll be normalized 
>>> genome_array.to_bedgraph(open("some_file_normalized.wig","w"),"my_normalized_track","+")
class plastid.genomics.genome_array.BAMGenomeArray(*bamfiles, mapping=CenterMapFactory())[source]

Bases: plastid.genomics.genome_array.AbstractGenomeArray

A GenomeArray for read alignments in BAM files.

When a user requests read alignments or counts at a set of genomic positions, BAMGenomeArray uses a mapping rule to determine which reads from the BAM file should be returned and how these reads should be mapped to positions in a vector. Reads are not mapped until requested, and mapping rules are changeable at runtime via BAMGenomeArray.set_mapping()

Parameters:
bamfile : One or more filenames or open pysam.AlignmentFile

Filename(s) of BAM files, or open pysam.AlignmentFile to include in array. Note: all BAM files must be sorted and indexed by samtools.

mapping : func

mapping function that determines how each read alignment is mapped to a count at a genomic position. Examples include mapping reads to their fiveprime ends, mapping reads to their threeprime ends, or mapping somewhere in the middle. Factories to produce such functions are provided. See references below. (Default: CenterMapFactory())

Notes

The alignment data in BAMGenomeArray objects is immutable. If you need to change the data in-place, the BAMGenomeArray can be converted to a GenomeArray or SparseGenomeArray via to_genome_array()

Attributes:
map_fn : function

mapping function used to convert read alignments to counts

bamfiles : list

BAM files or read alignments, as a list of pysam.AlignmentFile

Methods

add_filter(name, func) Apply a function to filter reads retrieved from regions before mapping and counting
chroms() Returns a list of chromosomes
get(roi[, roi_order]) Retrieve array of counts from a region of interest, following the mapping rule set by set_mapping().
get_mapping() Return the docstring of the current mapping function
get_reads(roi) Returns reads covering a GenomicSegment.
get_reads_and_counts(roi[, roi_order]) Return read alignments covering a GenomicSegment, and a count vector mapping reads to each positions in the GenomicSegment, following the rule specified by set_mapping().
lengths() Returns a dictionary mapping chromosome names to lengths.
remove_filter(name) Remove a generic filter
reset_sum() Reset the sum to the total number of mapped reads in the BAMGenomeArray
set_mapping(mapping_function) Change the mapping rule
set_normalize([value]) Toggle normalization of reported values to reads per million mapped in the dataset
set_sum(val) Set sum used for normalization to an arbitrary value (e.g.
strands() Return a tuple of strands in the GenomeArray
sum() Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
to_bedgraph(fh, trackname, strand[, …]) Write the contents of the BAMGenomeArray to a bedGraph file under the mapping rule set by set_mapping().
to_genome_array([array_type]) Converts BAMGenomeArray to a GenomeArray or SparseGenomeArray under the mapping rule set by set_mapping()
to_variable_step(fh, trackname, strand[, …]) Write the contents of the BAMGenomeArray to a variableStep Wiggle file under the mapping rule set by set_mapping().
add_filter(name, func)[source]

Apply a function to filter reads retrieved from regions before mapping and counting

Parameters:
name : str

A name for the filter. If not unique, will overwrite previous filter

func : func

Filter function. Function must take a pysam.AlignedSegment as a single parameter, and return True if that read should be included in output, or False otherwise

See also

plastid.genomics.map_factories.SizeFilterFactory
generate filter functions that gate read alignments on size

Notes

In Python, lambda functions do NOT have their own scope! We strongly recomend defining filter functions using the def syntax to avoid namespace collisions.

chroms()[source]

Returns a list of chromosomes

Returns:
list

sorted chromosome names as strings

get(roi, roi_order=True)[source]

Retrieve array of counts from a region of interest, following the mapping rule set by set_mapping(). Values in the vector are ordered 5’ to 3’ relative to roi rather than the genome (i.e. are reversed for reverse-strand features).

Parameters:
roi : GenomicSegment or SegmentChain

Region of interest in genome

roi_order : bool, optional

If True (default) return vector of values 5’ to 3’ relative to vector rather than genome.

Returns:
numpy.ndarray

vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi

Raises:
ValueError

if bamfiles not sorted or not indexed

See also

plastid.genomics.roitools.SegmentChain.get_counts
Fetch a spliced vector of data covering a SegmentChain
get_mapping()[source]

Return the docstring of the current mapping function

get_reads(roi)[source]

Returns reads covering a GenomicSegment. Reads are strand-matched to roi by default, and are included or excluded depending upon the rule specified by set_mapping(). To obtain unstranded reads, set the value of roi.strand to ‘.’

Parameters:
roi : GenomicSegment

Region of interest

Returns:
list

List of reads (as pysam.AlignedSegment) covering region of interest, according to the rule set in set_mapping()

Raises:
ValueError

if bamfiles are not sorted or not indexed

get_reads_and_counts(roi, roi_order=True)[source]

Return read alignments covering a GenomicSegment, and a count vector mapping reads to each positions in the GenomicSegment, following the rule specified by set_mapping(). Reads are strand-matched to roi by default. To obtain unstranded reads, set the value of roi.strand to ‘.’

Parameters:
roi : GenomicSegment

Region of interest

Returns:
list

List of reads (as pysam.AlignedSegment) covering region of interest

numpy.ndarray

Counts at each position of roi, under whatever mapping rule was set by set_mapping()

Raises:
ValueError

if bamfiles not sorted or not indexed

lengths()[source]

Returns a dictionary mapping chromosome names to lengths.

Returns:
dict

mapping chromosome names to lengths

remove_filter(name)[source]

Remove a generic filter

Parameters:
name : str

A name for the filter

Returns:
func

the removed filter function

reset_sum()[source]

Reset the sum to the total number of mapped reads in the BAMGenomeArray

Notes

Filters are not applied in this summation. See BAMGenomeArray.set_sum() to manually set a sum, which e.g. could correspond to a total number of filtered reads.

set_mapping(mapping_function)[source]

Change the mapping rule

Parameters:
mapping : func

Function that determines how each read alignment is mapped to a count at a position in the BAMGenomeArray. Examples include mapping reads to their fiveprime or threeprime ends, with or without offsets. Factories to produce such functions are provided. See references below.

See also

plastid.genomics.map_factories.FivePrimeMapFactory
map reads to 5’ ends, with or without applying an offset
plastid.genomics.map_factories.VariableFivePrimeMapFactory
map reads to 5’ ends, choosing an offset determined by read length
plastid.genomics.map_factories.ThreePrimeMapFactory
map reads to 3’ ends, with or without applying an offset
plastid.genomics.map_factories.CenterMapFactory
map each read fractionally to every position in the read, optionally trimming positions from the ends first
set_normalize(value=True)

Toggle normalization of reported values to reads per million mapped in the dataset

Parameters:
value : bool

If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.

set_sum(val)

Set sum used for normalization to an arbitrary value (e.g. from another dataset)

Parameters:
val : int or float

a number

strands()

Return a tuple of strands in the GenomeArray

Returns:
tuple

Chromosome strands as strings

sum()

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

Returns:
int or float

Notes

The true (i.e. unnormalized) sum is always reported, even if set_normalize() is set to True

to_bedgraph(fh, trackname, strand, window_size=100000, printer=None, **kwargs)[source]

Write the contents of the BAMGenomeArray to a bedGraph file under the mapping rule set by set_mapping().

See the bedGraph spec for format details.

Parameters:
fh : file-like

Filehandle to write to

trackname : str

Name of browser track

strand : str

Strand of BAMGenomeArray to export. ‘+’, ‘-‘, or ‘.’

window_size : int

Size of chromosome/contig to process at a time. Larger values are faster but less memory-efficient

printer : file-like, optional

Something implementing a write() method for output

**kwargs

Any other key-value pairs to include in track definition line

to_genome_array(array_type=None)[source]

Converts BAMGenomeArray to a GenomeArray or SparseGenomeArray under the mapping rule set by set_mapping()

Parameters:
array_type : class

Type of GenomeArray to return. GenomeArray or SparseGenomeArray. (Default: GenomeArray)

Returns:
|GenomeArray| or |SparseGenomeArray|
to_variable_step(fh, trackname, strand, window_size=100000, printer=None, **kwargs)[source]

Write the contents of the BAMGenomeArray to a variableStep Wiggle file under the mapping rule set by set_mapping().

See the Wiggle spec for format details.

Parameters:
fh : file-like

Filehandle to write to

trackname : str

Name of browser track

strand : str

Strand of BAMGenomeArray to export. ‘+’, ‘-‘, or ‘.’

window_size : int

Size of chromosome/contig to process at a time. Larger values are faster but less memory-efficient

printer : file-like, optional

Something implementing a write() method for output

**kwargs

Any other key-value pairs to include in track definition line

class plastid.genomics.genome_array.BigWigGenomeArray(maxmem=0)[source]

Bases: plastid.genomics.genome_array.AbstractGenomeArray

High-performance GenomeArray for BigWig files.

Parameters:
maxmem : float

Maximum desired memory footprint for C objects, in megabytes. May be temporarily exceeded if large queries are requested. (Default: 0, No maximum)

Methods

add_from_bigwig(filename, strand) Import additional data from a BigWig file
chroms() Return a list of chromosomes in the GenomeArray
get(roi[, roi_order]) Retrieve array of counts from a region of interest.
lengths() Return a dictionary mapping chromosome names to lengths.
reset_sum() Reset sum to total of data in the BigWigGenomeArray
set_normalize([value]) Toggle normalization of reported values to reads per million mapped in the dataset
set_sum(val) Set sum used for normalization to an arbitrary value (e.g.
strands() Return a tuple of strands in the GenomeArray
sum() Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
to_genome_array() Converts BigWigGenomeArray to a GenomeArray
add_from_bigwig(filename, strand)[source]

Import additional data from a BigWig file

Parameters:
filename : str

Path to BigWig file

strand : str

Strand to which data should be added. ‘+’, ‘-‘, or ‘.’

chroms()[source]

Return a list of chromosomes in the GenomeArray

Returns:
list

Chromosome names as strings

get(roi, roi_order=True)[source]

Retrieve array of counts from a region of interest.

Parameters:
roi : GenomicSegment or SegmentChain

Region of interest in genome

roi_order : bool, optional

If True (default) return vector of values 5’ to 3’ relative to vector rather than genome.

Returns:
numpy.ndarray

vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi

See also

plastid.genomics.roitools.SegmentChain.get_counts
Fetch a spliced vector of data covering a SegmentChain
lengths()[source]

Return a dictionary mapping chromosome names to lengths. When two strands report different lengths for a chromosome, the max length is taken.

Returns:
dict

Dictionary mapping chromosome names to chromosome lengths

reset_sum()[source]

Reset sum to total of data in the BigWigGenomeArray

set_normalize(value=True)

Toggle normalization of reported values to reads per million mapped in the dataset

Parameters:
value : bool

If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.

set_sum(val)

Set sum used for normalization to an arbitrary value (e.g. from another dataset)

Parameters:
val : int or float

a number

strands()[source]

Return a tuple of strands in the GenomeArray

Returns:
tuple

Chromosome strands as strings

sum()

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

Returns:
int or float

Notes

The true (i.e. unnormalized) sum is always reported, even if set_normalize() is set to True

to_genome_array()[source]

Converts BigWigGenomeArray to a GenomeArray

Returns:
|GenomeArray|
class plastid.genomics.genome_array.GenomeArray(chr_lengths=None, strands=None, min_chr_size=10000000)[source]

Bases: plastid.genomics.genome_array.MutableAbstractGenomeArray

Array-like data structure that maps numerical values (e.g. read counts or conservation data, et c) to nucleotide positions in a genome.

Data in the array may be manipulated manually and/or imported from Wiggle or BedGraph files, as well as bowtie alignments.

GenomeArray supports basic mathematical operations elementwise, where elements are nucleotide positions.

Parameters:
chr_lengths : dict or None, optional

Dictionary mapping chromosome names to lengths. Suppyling this in advance yields considerable speed improvements, as memory can be pre-allocated for each chromosomal position. If not provided, a minimum chromosome size will be guessed, and chromosomes re-sized as needed.

min_chr_size : int

If chr_lengths is not supplied, min_chr_size is the default first guess of a chromosome size. If the genome has large chromosomes, it is much much better, speed-wise, to provide chr_lengths than to provide a guess here that is too small.

strands : sequence

Sequence of strand names for the GenomeArray. (Default: (‘+’,’-‘))

Methods

add_from_bowtie(fh, mapfunc[, min_length, …]) Import alignment data in native bowtie format to the current GenomeArray
add_from_wiggle(fh, strand) Import data from a Wiggle or bedGraph file to current GenomeArray
apply_operation(other, func[, mode]) Applies a binary operator to a copy of self and to other elementwise.
chroms() Return a list of chromosomes in the GenomeArray
get(roi[, roi_order]) Retrieve array of counts from a region of interest (roi) with values in vector ordered 5’ to 3’ relative to roi rather than genome (i.e.
iterchroms() Return an iterator of each chromosome strand array
keys() Return names of chromosomes in the GenomeArray
lengths() Return a dictionary mapping chromosome names to lengths.
like(other) Return a GenomeArray of same dimension as the input array
nonzero() Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair.
plot([chroms, strands]) Create a plot of coverage along tracks or chromosomes
reset_sum() Reset the sum of the GenomeArray to the sum of all positions in the array
set_normalize([value]) Toggle normalization of reported values to reads per million mapped in the dataset
set_sum(val) Set sum used for normalization to an arbitrary value (e.g.
strands() Return a tuple of strands in the GenomeArray
sum() Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
to_bedgraph(fh, trackname, strand[, printer]) Write the contents of the GenomeArray to a bedGraph file
to_variable_step(fh, trackname, strand[, …]) Export the contents of the GenomeArray to a variable step Wiggle file.
add_from_bowtie(fh, mapfunc, min_length=25, max_length=inf, **trans_args)[source]

Import alignment data in native bowtie format to the current GenomeArray

Parameters:
fh : file-like

filehandle pointing to bowtie file

mapfunc : func

a mapping function that transforms alignment coordinates into a sequence of (GenomicSegment,value) pairs

min_length : int, optional

minimum length for read to be counted (Default: 25)

max_length : int or numpy.inf, optional

maximum length for read to be counted (Default: infinity)

trans_args : dict, optional

Keyword arguments to pass to transformation function

See also

five_prime_map
map reads to 5’ ends, with or without applying an offset
variable_five_prime_map
map reads to 5’ ends, choosing an offset determined by read length
three_prime_map
map reads to 3’ ends, with or without applying an offset
center_map
map each read fractionally to every position in the read, optionally trimming positions from the ends first
add_from_wiggle(fh, strand)[source]

Import data from a Wiggle or bedGraph file to current GenomeArray

Parameters:
fh : file-like

filehandle pointing to wiggle file

strand : str

Strand to which data should be added. ‘+’, ‘-‘, or ‘.’

apply_operation(other, func, mode='same')[source]

Applies a binary operator to a copy of self and to other elementwise. other may be a scalar quantity or another GenomeArray. In both cases, a new GenomeArray is returned, and self is left unmodified. If set_normalize() is set to True, it is disabled during the operation.

Parameters:
other : float, int, or MutableAbstractGenomeArray

Second argument to func

func : func

Function to perform. This must take two arguments. a numpy.ndarray (chromosome-strand) from the GenomeArray will be supplied as the first argument, and the corresponding chromosome-strand from other as the second. This operation will be applied over all chromosomes and strands.

If mode is set to all, and a chromosome or strand is not present in one of self or other, zero will be supplied as the missing argument to func. func must handle this gracefully.

mode : str, choice of ‘same’, ‘all’, or ‘truncate’

Only relevant if other is a MutableAbstractGenomeArray

If ‘same’ each set of corresponding chromosomes must have the same dimensions in both parent GenomeArray s. In addition, both GenomeArray s must have the same chromosomes, and the same strands.

If ‘all’, corresponding chromosomes in the parent GenomeArray s will be resized to the dimensions of the larger chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; all chromosomes and strands will be included in output.

If ‘truncate’,corresponding chromosomes in the parent GenomeArray s will be resized to the dimensions of the smaller chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; chromosomes or strands not common to both parents will be ignored.

Returns:
|GenomeArray|

new GenomeArray after the operation is applied

chroms()

Return a list of chromosomes in the GenomeArray

Returns:
list

Chromosome names as strings

get(roi, roi_order=True)[source]

Retrieve array of counts from a region of interest (roi) with values in vector ordered 5’ to 3’ relative to roi rather than genome (i.e. are reversed for reverse-strand features).

Parameters:
roi : GenomicSegment or SegmentChain

Region of interest in genome

roi_order : bool, optional

If True (default) return vector of values 5’ to 3’ relative to vector rather than genome.

Returns:
:class:`numpy.ndarray`

vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi

See also

plastid.genomics.roitools.SegmentChain.get_counts
Fetch a spliced vector of data covering a SegmentChain
iterchroms()[source]

Return an iterator of each chromosome strand array

Yields:
numpy.ndarray

Positionwise counts on a chromosome strand

keys()[source]

Return names of chromosomes in the GenomeArray

lengths()

Return a dictionary mapping chromosome names to lengths. When two strands report different lengths for a chromosome, the max length is taken.

Returns:
dict

Dictionary mapping chromosome names to chromosome lengths

static like(other)[source]

Return a GenomeArray of same dimension as the input array

Parameters:
other : GenomeArray
Returns:
GenomeArray

empty GenomeArray of same size as other

nonzero()[source]

Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair. Results are returned as a hierarchical dictionary mapping chromosome names, to a dictionary of strands, which in turn map to to arrays of non-zero indices on that chromosome-strand.

Returns:
dict

dict[chrom][strand] = numpy.ndarray of indices

plot(chroms=None, strands=None, **plot_kw)[source]

Create a plot of coverage along tracks or chromosomes

Parameters:
chroms : list

Chromosomes to plot (default: all)

strands : list

Strands to plot (default: all)

plot_kw

A dictionary of keywords to pass to matplotlib

Returns:
:py:class:`matplotlib.figure.Figure`
reset_sum()[source]

Reset the sum of the GenomeArray to the sum of all positions in the array

set_normalize(value=True)

Toggle normalization of reported values to reads per million mapped in the dataset

Parameters:
value : bool

If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.

set_sum(val)

Set sum used for normalization to an arbitrary value (e.g. from another dataset)

Parameters:
val : int or float

a number

strands()

Return a tuple of strands in the GenomeArray

Returns:
tuple

Chromosome strands as strings

sum()

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

Returns:
int or float

Notes

The true (i.e. unnormalized) sum is always reported, even if set_normalize() is set to True

to_bedgraph(fh, trackname, strand, printer=None, **kwargs)[source]

Write the contents of the GenomeArray to a bedGraph file

See the bedGraph spec for format details

Parameters:
fh : file-like

Open filehandle to which data will be written

trackname : str

Name of browser track

strand : str

Strand to export. ‘+’, ‘-‘, or ‘.’

printer : file-like, optional

Something implementing a write() method for output

**kwargs

Any other key-value pairs to include in track definition line

to_variable_step(fh, trackname, strand, printer=None, **kwargs)[source]

Export the contents of the GenomeArray to a variable step Wiggle file. For sparse data, bedGraph can be more efficient format.

See the Wiggle spec for format details

Parameters:
fh : file-like

Open filehandle to which data will be written

trackname : str

Name of browser track

strand : str

Strand to export. ‘+’, ‘-‘, or ‘.’

printer : file-like, optional

Something implementing a write() method for output

**kwargs

Any other key-value pairs to include in track definition line

class plastid.genomics.genome_array.MutableAbstractGenomeArray(chr_lengths=None, strands=None)[source]

Bases: plastid.genomics.genome_array.AbstractGenomeArray

Abstract base class for GenomeArray-like objects whose values can be changed

Methods

chroms() Return a list of chromosomes in the GenomeArray
lengths() Return a dictionary mapping chromosome names to lengths.
reset_sum() Reset sum to total mapped reads in the GenomeArray
set_normalize([value]) Toggle normalization of reported values to reads per million mapped in the dataset
set_sum(val) Set sum used for normalization to an arbitrary value (e.g.
strands() Return a tuple of strands in the GenomeArray
sum() Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
chroms()

Return a list of chromosomes in the GenomeArray

Returns:
list

Chromosome names as strings

lengths()

Return a dictionary mapping chromosome names to lengths. When two strands report different lengths for a chromosome, the max length is taken.

Returns:
dict

Dictionary mapping chromosome names to chromosome lengths

reset_sum()

Reset sum to total mapped reads in the GenomeArray

set_normalize(value=True)

Toggle normalization of reported values to reads per million mapped in the dataset

Parameters:
value : bool

If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.

set_sum(val)

Set sum used for normalization to an arbitrary value (e.g. from another dataset)

Parameters:
val : int or float

a number

strands()

Return a tuple of strands in the GenomeArray

Returns:
tuple

Chromosome strands as strings

sum()

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

Returns:
int or float

Notes

The true (i.e. unnormalized) sum is always reported, even if set_normalize() is set to True

class plastid.genomics.genome_array.SparseGenomeArray(chr_lengths=None, strands=None, min_chr_size=10000000)[source]

Bases: plastid.genomics.genome_array.GenomeArray

A memory-efficient sublcass of GenomeArray using sparse internal representation. Note, savings in memory may come at a cost in performance when repeatedly getting/setting values, compared to a GenomeArray.

Parameters:
chr_lengths : dict, optional

Dictionary mapping chromosome names to lengths. Suppyling this parameter yields considerable speed improvements, as memory can be pre-allocated for each chromosomal position. If not provided, a minimum chromosome size will be guessed, and chromosomes re-sized as needed (Default: {} )

min_chr_size : int,optional

If chr_lengths is not supplied, min_chr_size is the default first guess of a chromosome size. If your genome has large chromosomes, it is much much better, speed-wise, to provide a chr_lengths dict than to provide a guess here that is too small. (Default: %s)

strands : sequence

Sequence of strand names for the GenomeArray. (Default: (‘+’,’-‘))

Methods

add_from_bowtie(fh, mapfunc[, min_length, …]) Import alignment data in native bowtie format to the current GenomeArray
add_from_wiggle(fh, strand) Import data from a Wiggle or bedGraph file to current GenomeArray
apply_operation(other, func[, mode]) Apply a binary operator to a copy of self and to other elementwise.
chroms() Return a list of chromosomes in the GenomeArray
get(roi[, roi_order]) Retrieve array of counts from a region of interest (roi) with values in vector ordered 5’ to 3’ relative to roi rather than genome (i.e.
iterchroms() Return an iterator of each chromosome strand array
keys() Return names of chromosomes in the GenomeArray
lengths() Return a dictionary mapping chromosome names to lengths.
like(other) Return a SparseGenomeArray of same dimension as the input array
nonzero() Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair.
plot([chroms, strands]) Create a plot of coverage along tracks or chromosomes
reset_sum() Reset the sum of the GenomeArray to the sum of all positions in the array
set_normalize([value]) Toggle normalization of reported values to reads per million mapped in the dataset
set_sum(val) Set sum used for normalization to an arbitrary value (e.g.
strands() Return a tuple of strands in the GenomeArray
sum() Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
to_bedgraph(fh, trackname, strand[, printer]) Write the contents of the GenomeArray to a bedGraph file
to_variable_step(fh, trackname, strand[, …]) Export the contents of the GenomeArray to a variable step Wiggle file.
add_from_bowtie(fh, mapfunc, min_length=25, max_length=inf, **trans_args)

Import alignment data in native bowtie format to the current GenomeArray

Parameters:
fh : file-like

filehandle pointing to bowtie file

mapfunc : func

a mapping function that transforms alignment coordinates into a sequence of (GenomicSegment,value) pairs

min_length : int, optional

minimum length for read to be counted (Default: 25)

max_length : int or numpy.inf, optional

maximum length for read to be counted (Default: infinity)

trans_args : dict, optional

Keyword arguments to pass to transformation function

See also

five_prime_map
map reads to 5’ ends, with or without applying an offset
variable_five_prime_map
map reads to 5’ ends, choosing an offset determined by read length
three_prime_map
map reads to 3’ ends, with or without applying an offset
center_map
map each read fractionally to every position in the read, optionally trimming positions from the ends first
add_from_wiggle(fh, strand)

Import data from a Wiggle or bedGraph file to current GenomeArray

Parameters:
fh : file-like

filehandle pointing to wiggle file

strand : str

Strand to which data should be added. ‘+’, ‘-‘, or ‘.’

apply_operation(other, func, mode=None)[source]

Apply a binary operator to a copy of self and to other elementwise. other may be a scalar quantity or another GenomeArray. In both cases, a new SparseGenomeArray is returned, and self is left unmodified. If set_normalize() is set to True, it is disabled during the operation.

Parameters:
other : float, int, or MutableAbstractGenomeArray

Second argument to func

func : func

Function to perform. This must take two arguments. a numpy.ndarray (chromosome-strand) from the SparseGenomeArray will be supplied as the first argument, and the corresponding chromosome-strand from other as the second. This operation will be applied over all chromosomes and strands.

If mode is set to all, and a chromosome or strand is not present in one of self or other, zero will be supplied as the missing argument to func. func must handle this gracefully.

mode : str, choice of ‘same’, ‘all’, or ‘truncate’

Only relevant if other is a MutableAbstractGenomeArray

If ‘same’ each set of corresponding chromosomes must have the same dimensions in both parent SparseGenomeArrays. In addition, both SparseGenomeArrays must have the same chromosomes, and the same strands.

If ‘all’, corresponding chromosomes in the parent SparseGenomeArrays will be resized to the dimensions of the larger chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; all chromosomes and strands will be included in output.

If ‘truncate’,corresponding chromosomes in the parent SparseGenomeArrays will be resized to the dimensions of the smaller chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; chromosomes or strands not common to both parents will be ignored.

Returns:
|SparseGenomeArray|

new SparseGenomeArray after the operation is applied

chroms()

Return a list of chromosomes in the GenomeArray

Returns:
list

Chromosome names as strings

get(roi, roi_order=True)[source]

Retrieve array of counts from a region of interest (roi) with values in vector ordered 5’ to 3’ relative to roi rather than genome (i.e. are reversed for reverse-strand features).

Parameters:
roi : GenomicSegment or SegmentChain

Region of interest in genome

roi_order : bool, optional

If True (default) return vector of values 5’ to 3’ relative to roi rather than genome.

Returns:
:class:`numpy.ndarray`

vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi

See also

SegmentChain.get_counts
Fetch a spliced vector of data covering a SegmentChain
iterchroms()

Return an iterator of each chromosome strand array

Yields:
numpy.ndarray

Positionwise counts on a chromosome strand

keys()

Return names of chromosomes in the GenomeArray

lengths()[source]

Return a dictionary mapping chromosome names to lengths. In the case where two strands report different lengths for a chromosome, the max length is taken.

Returns:
dict

mapping chromosome names to lengths

static like(other)[source]

Return a SparseGenomeArray of same dimension as the input array

Parameters:
other : GenomeArray or SparseGenomeArray
Returns:
|SparseGenomeArray|

of same size as other

nonzero()[source]

Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair. Results are returned as a hierarchical dictionary mapping chromosome names, to a dictionary of strands, which in turn map to to arrays of non-zero indices on that chromosome-strand.

Returns:
dict

dict[chrom][strand] = numpy.ndarray of indices

plot(chroms=None, strands=None, **plot_kw)

Create a plot of coverage along tracks or chromosomes

Parameters:
chroms : list

Chromosomes to plot (default: all)

strands : list

Strands to plot (default: all)

plot_kw

A dictionary of keywords to pass to matplotlib

Returns:
:py:class:`matplotlib.figure.Figure`
reset_sum()

Reset the sum of the GenomeArray to the sum of all positions in the array

set_normalize(value=True)

Toggle normalization of reported values to reads per million mapped in the dataset

Parameters:
value : bool

If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.

set_sum(val)

Set sum used for normalization to an arbitrary value (e.g. from another dataset)

Parameters:
val : int or float

a number

strands()

Return a tuple of strands in the GenomeArray

Returns:
tuple

Chromosome strands as strings

sum()

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

Returns:
int or float

Notes

The true (i.e. unnormalized) sum is always reported, even if set_normalize() is set to True

to_bedgraph(fh, trackname, strand, printer=None, **kwargs)

Write the contents of the GenomeArray to a bedGraph file

See the bedGraph spec for format details

Parameters:
fh : file-like

Open filehandle to which data will be written

trackname : str

Name of browser track

strand : str

Strand to export. ‘+’, ‘-‘, or ‘.’

printer : file-like, optional

Something implementing a write() method for output

**kwargs

Any other key-value pairs to include in track definition line

to_variable_step(fh, trackname, strand, printer=None, **kwargs)

Export the contents of the GenomeArray to a variable step Wiggle file. For sparse data, bedGraph can be more efficient format.

See the Wiggle spec for format details

Parameters:
fh : file-like

Open filehandle to which data will be written

trackname : str

Name of browser track

strand : str

Strand to export. ‘+’, ‘-‘, or ‘.’

printer : file-like, optional

Something implementing a write() method for output

**kwargs

Any other key-value pairs to include in track definition line

plastid.genomics.genome_array.center_map(feature, **kwargs)[source]

Center-mapping function used as an argument to GenomeArray.add_from_bowtie(). A user-specified number of bases is optionally removed from each side of each read alignment, and the N remaining bases are each apportioned 1/N of the read count, so that the entire read is counted once.

Parameters:
feature : SegmentChain

Ungapped genomic alignment

kwargs[‘value’] : float, optional

Total value to divide over aligning positions (Default: 1.0)

kwargs[‘nibble’] : int, optional

Positions to remove from each end before mapping (Default: 0)

kwargs[‘offset’] : int, optional

Mapping offset, if any, from 5’ end of read (Default: 0)

Returns:
list

tuples of (GenomicSegment,float value over segment)

plastid.genomics.genome_array.five_prime_map(feature, **kwargs)[source]

Fiveprime mapping function used as an argument to GenomeArray.add_from_bowtie(). Reads are mapped at a user-specified offset from the fiveprime end of the alignment.

Parameters:
feature : SegmentChain

Ungapped genomic alignment

kwargs[‘value’] : float or int, optional

Value to apportion (Default: 1)

kwargs[‘offset’] : int, optional

Mapping offset, if any, from 5’ toward 3’ end of read

Returns:
list

tuples of (GenomicSegment,float value over segment)

plastid.genomics.genome_array.three_prime_map(feature, **kwargs)[source]

Threeprime mapping function used as an argument to GenomeArray.add_from_bowtie(). Reads are mapped at a user-specified offset from the threeprime end of the alignment.

Parameters:
feature : SegmentChain

Ungapped genomic alignment

kwargs[‘value’] : float or int, optional

Value to apportion (Default: 1)

kwargs[‘offset’] : int, optional

Mapping offset, if any, from 3’ toward 5’ end of read.

Returns:
list

tuples of (GenomicSegment,float value over segment)

plastid.genomics.genome_array.variable_five_prime_map(feature, **kwargs)[source]

Fiveprime variable mapping function used as an argument to GenomeArray.add_from_bowtie(). Reads are mapped at a user-specified offset from the fiveprime end of the alignment. The offset for a read of a given length is supplied in kwargs[offset][readlen]

Parameters:
feature : SegmentChain

Ungapped genomic alignment

kwargs[‘offset’] : dict, required

Dictionary mapping read lengths to offsets

kwargs[‘value’] : float or int, optional

Value to apportion (Default: 1)

Returns:
list

tuples of (GenomicSegment,float value over segment)