plastid.genomics.genome_array module

GenomeArrays map quantitative data or read alignments to genomic positions.

Summary

GenomeArrays are typically accessed using SegmentChains.

The get_counts() method of SegmentChain and Transcript fetches arrays of data from GenomeArrays. The result is a numpy array in which each position corresponds to a position in the SegmentChain, moving from 5’ to 3’, in the spliced SegmentChain or Transcript. For a Transcript called my_transcript:

>>> # open a BAM file, map reads 15 nucleotides from 3' end
>>> genome_array = BAMGenomeArray("some_file.bam")
>>> genome_array.set_mapping(ThreePrimeMapFactory(offset=15))

>>> my_data = my_transcript.get_counts(genome_array)
>>> my_data
    array([ 95, 133,  87, 491, 100,  73, 308, 217, 447, 204, 245, 264, 395,
            76, 387, 319,  43, 171, 101, 152,  58, 167,  53, 397, 228,  44,
           402, 285, 480,  55,   9,  51, 358, 307, 144, 182, 293, 455, 403,
           375, 443, 442, 196, 133, 238, 268,  70,  91, 172, 490, 346, 294,
           117, 424, 180,  14, 400, 414, 257, 127, 344, 307, 435, 152, 458,
           423,  60, 173, 362, 277, 393, 297, 377, 357,  68, 143, 467,   7,
            86, 450, 230,  62, 278,  65, 346, 103, 185,   7, 350, 162, 461,
           274, 205, 185, 209, 275, 190, 434,  10, 241])

Alternatively, SegmentChains and GenomicSegments can be used directly as dictionary keys:

>>> my_data = genome_array[my_transcript]
>>> my_data
    array([ 95, 133,  87, 491, 100,  73, 308, 217, 447, 204, 245, 264, 395,
            76, 387, 319,  43, 171, 101, 152,  58, 167,  53, 397, 228,  44,
           402, 285, 480,  55,   9,  51, 358, 307, 144, 182, 293, 455, 403,
           375, 443, 442, 196, 133, 238, 268,  70,  91, 172, 490, 346, 294,
           117, 424, 180,  14, 400, 414, 257, 127, 344, 307, 435, 152, 458,
           423,  60, 173, 362, 277, 393, 297, 377, 357,  68, 143, 467,   7,
            86, 450, 230,  62, 278,  65, 346, 103, 185,   7, 350, 162, 461,
           274, 205, 185, 209, 275, 190, 434,  10, 241])

Module contents

Several implementations are provided. These are tailored to data stored in specific formats, or for specific purposes. For example, mutable GenomeArrays allow values to be changed in-place, e.g. via mathematical operations or manual setting.

Implementation

Format of data

Mutable

Use of mapping functions

BAMGenomeArray

BAM file(s)

No

Inline; may be changed without reloading data

BigWigGenomeArray

BigWig file(s)

No

n/a

GenomeArray

bowtie, wiggle, bedGraph, or data in memory

Yes

On import from bowtie files only. Data must be reimported to change mapping

SparseGenomeArray

bowtie, wiggle, bedGraph, or data in memory

Yes

On import from bowtie files only. Data must be reimported to change mapping

Extended summary & examples

Mapping functions

GenomeArrays use Mapping functions to extract the biology of interest from read alignments, and to vectorize these data over genomic positions.

Depending on the data type and the purpose of the experiment, different functions may be appropriate. The following mapping rules are provided, although users are encouraged to provide their own if needed. These include:

Mapping strategy

Description

Sample use

Fiveprime

Each read alignment is mapped to its 5’ end, or at a fixed distance from its 5’ end.

RNA-seq, ribosome profiling, DMS-seq

Variable fiveprime

Each read alignment is mapped at a fixed distance from its 5’ end, where the distance is determined by the length of the alignment

Ribosome profiling

Stratified variable fiveprime

Like variable fiveprime mapping, except reads are stratified by length into separate rows of an array

Ribosome profiling

Threeprime

Each read alignment is mapped to its 5’ end, or at a fixed distance from its 5’ end.

RNA-seq, ribosome profiling

Center/entire

Zero or more positions are trimmed from each end of the read alignment, and the remaining N positions are incremented by 1/N

Ribosome profiling with micrococcal nuclease, RNA-seq

Mapping functions are rarely called directly. Instead, they are invoked via BAMGenomeArray.set_mapping():

# open BAM file
>>> ga = BAMGenomeArray("some_file.bam")

# set mapping 15 bases from 3' end of read
>>> ga.set_mapping(ThreePrimeMapFactory(offset=15))

# set mapping to 5' end of read
>>> ga.set_mapping(FivePrimeMapFactory())

# cut 3 nt off each end of the read, and map fractionally over the rest
>>> ga.set_mapping(CenterMapFactory(nibble=12))

# access reads in some region
>>> ga[GenomicSegment("chrI",10000,100000,"+")]
# output is numpy array

For further details see map_factories and Read mapping functions

Export to browser tracks

GenomeArrays can export data to wiggle or bedGraph formats. To export forward-strand data to some_file.wig:

>>> genome_array.to_bedgraph(open("some_file.wig","w"),"my_track","+")

Normalization

GenomeArrays also offer utilities for summing and normalization:

>>> genome_array.sum() # get sum
71292045

# set sum to an arbitrary value- can be useful for normalizing between
# multiple alignment files
>>> genome_array.set_sum(61241234324)
>>> genome_array.sum()
61241234324

# reset sum to data in file
>>> genome_array.reset_sum()
>>> genome_array.sum()
71292045

# with set_normalize(), all arrays will be normalized to counts per million
# relative to the current sum
>>> genome_array.set_normalize(True)
>>> my_transcript.get_counts(genome_array)
array([ 1.33254699,  1.86556579,  1.22033251,  6.88716392,  1.40268104,
        1.02395716,  4.32025761,  3.04381786,  6.26998426,  2.86146933,
        3.43656855,  3.70307795,  5.54059012,  1.06603759,  5.42837563,
        4.47455253,  0.60315285,  2.39858458,  1.41670785,  2.13207518,
        0.813555  ,  2.34247734,  0.74342095,  5.56864374,  3.19811278,
        0.61717966,  5.63877779,  3.99764097,  6.732869  ,  0.77147457,
        0.12624129,  0.71536733,  5.02159813,  4.3062308 ,  2.0198607 ,
        2.5528795 ,  4.10985545,  6.38219874,  5.6528046 ,  5.26005391,
        6.21387702,  6.19985021,  2.74925484,  1.86556579,  3.33838088,
        3.75918519,  0.98187673,  1.27643975,  2.41261139,  6.87313711,
        4.85327641,  4.12388226,  1.64113682,  5.94736762,  2.52482588,
        0.19637535,  5.61072417,  5.80709952,  3.60489028,  1.78140492,
        4.82522279,  4.3062308 ,  6.10166253,  2.13207518,  6.42427917,
        5.93334081,  0.84160863,  2.4266382 ,  5.07770537,  3.88542649,
        5.5125365 ,  4.1659627 ,  5.28810753,  5.00757132,  0.95382311,
        2.00583389,  6.55052047,  0.09818767,  1.2063057 ,  6.31206469,
        3.2261664 ,  0.86966225,  3.8994533 ,  0.91174268,  4.85327641,
        1.44476147,  2.59495993,  0.09818767,  4.90938365,  2.27234329,
        6.46635961,  3.84334606,  2.87549614,  2.59495993,  2.93160338,
        3.85737287,  2.66509398,  6.08763572,  0.1402681 ,  3.38046131])

# similarly, with set_normalize() on, browser tracks we'll be normalized 
>>> genome_array.to_bedgraph(open("some_file_normalized.wig","w"),"my_normalized_track","+")
class plastid.genomics.genome_array.BAMGenomeArray(*bamfiles, mapping=CenterMapFactory())[source]

Bases: plastid.genomics.genome_array.AbstractGenomeArray

A GenomeArray for read alignments in BAM files.

When a user requests read alignments or counts at a set of genomic positions, BAMGenomeArray uses a mapping rule to determine which reads from the BAM file should be returned and how these reads should be mapped to positions in a vector. Reads are not mapped until requested, and mapping rules are changeable at runtime via BAMGenomeArray.set_mapping()

Parameters
bamfileOne or more filenames or open pysam.AlignmentFile

Filename(s) of BAM files, or open pysam.AlignmentFile to include in array. Note: all BAM files must be sorted and indexed by samtools.

mappingfunc

mapping function that determines how each read alignment is mapped to a count at a genomic position. Examples include mapping reads to their fiveprime ends, mapping reads to their threeprime ends, or mapping somewhere in the middle. Factories to produce such functions are provided. See references below. (Default: CenterMapFactory())

Notes

The alignment data in BAMGenomeArray objects is immutable. If you need to change the data in-place, the BAMGenomeArray can be converted to a GenomeArray or SparseGenomeArray via to_genome_array()

Attributes
map_fnfunction

mapping function used to convert read alignments to counts

bamfileslist

BAM files or read alignments, as a list of pysam.AlignmentFile

Methods

add_filter(name, func)

Apply a function to filter reads retrieved from regions before mapping and counting

chroms()

Returns a list of chromosomes

get(roi[, roi_order])

Retrieve array of counts from a region of interest, following the mapping rule set by set_mapping().

get_mapping()

Return the docstring of the current mapping function

get_reads(roi)

Returns reads covering a GenomicSegment.

get_reads_and_counts(roi[, roi_order])

Return read alignments covering a GenomicSegment, and a count vector mapping reads to each positions in the GenomicSegment, following the rule specified by set_mapping().

lengths()

Returns a dictionary mapping chromosome names to lengths.

remove_filter(name)

Remove a generic filter

reset_sum()

Reset the sum to the total number of mapped reads in the BAMGenomeArray

set_mapping(mapping_function)

Change the mapping rule

set_normalize([value])

Toggle normalization of reported values to reads per million mapped in the dataset

set_sum(val)

Set sum used for normalization to an arbitrary value (e.g.

strands()

Return a tuple of strands in the GenomeArray

sum()

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

to_bedgraph(fh, trackname, strand[, ...])

Write the contents of the BAMGenomeArray to a bedGraph file under the mapping rule set by set_mapping().

to_genome_array([array_type])

Converts BAMGenomeArray to a GenomeArray or SparseGenomeArray under the mapping rule set by set_mapping()

to_variable_step(fh, trackname, strand[, ...])

Write the contents of the BAMGenomeArray to a variableStep Wiggle file under the mapping rule set by set_mapping().

add_filter(name, func)[source]

Apply a function to filter reads retrieved from regions before mapping and counting

Parameters
namestr

A name for the filter. If not unique, will overwrite previous filter

funcfunc

Filter function. Function must take a pysam.AlignedSegment as a single parameter, and return True if that read should be included in output, or False otherwise

See also

plastid.genomics.map_factories.SizeFilterFactory

generate filter functions that gate read alignments on size

Notes

In Python, lambda functions do NOT have their own scope! We strongly recomend defining filter functions using the def syntax to avoid namespace collisions.

chroms()[source]

Returns a list of chromosomes

Returns
list

sorted chromosome names as strings

get(roi, roi_order=True)[source]

Retrieve array of counts from a region of interest, following the mapping rule set by set_mapping(). Values in the vector are ordered 5’ to 3’ relative to roi rather than the genome (i.e. are reversed for reverse-strand features).

Parameters
roiGenomicSegment or SegmentChain

Region of interest in genome

roi_orderbool, optional

If True (default) return vector of values 5’ to 3’ relative to vector rather than genome.

Returns
numpy.ndarray

vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi

Raises
ValueError

if bamfiles not sorted or not indexed

See also

plastid.genomics.roitools.SegmentChain.get_counts

Fetch a spliced vector of data covering a SegmentChain

get_mapping()[source]

Return the docstring of the current mapping function

get_reads(roi)[source]

Returns reads covering a GenomicSegment. Reads are strand-matched to roi by default, and are included or excluded depending upon the rule specified by set_mapping(). To obtain unstranded reads, set the value of roi.strand to ‘.’

Parameters
roiGenomicSegment

Region of interest

Returns
list

List of reads (as pysam.AlignedSegment) covering region of interest, according to the rule set in set_mapping()

Raises
ValueError

if bamfiles are not sorted or not indexed

get_reads_and_counts(roi, roi_order=True)[source]

Return read alignments covering a GenomicSegment, and a count vector mapping reads to each positions in the GenomicSegment, following the rule specified by set_mapping(). Reads are strand-matched to roi by default. To obtain unstranded reads, set the value of roi.strand to ‘.’

Parameters
roiGenomicSegment

Region of interest

Returns
list

List of reads (as pysam.AlignedSegment) covering region of interest

numpy.ndarray

Counts at each position of roi, under whatever mapping rule was set by set_mapping()

Raises
ValueError

if bamfiles not sorted or not indexed

lengths()[source]

Returns a dictionary mapping chromosome names to lengths.

Returns
dict

mapping chromosome names to lengths

remove_filter(name)[source]

Remove a generic filter

Parameters
namestr

A name for the filter

Returns
func

the removed filter function

reset_sum()[source]

Reset the sum to the total number of mapped reads in the BAMGenomeArray

Notes

Filters are not applied in this summation. See BAMGenomeArray.set_sum() to manually set a sum, which e.g. could correspond to a total number of filtered reads.

set_mapping(mapping_function)[source]

Change the mapping rule

Parameters
mappingfunc

Function that determines how each read alignment is mapped to a count at a position in the BAMGenomeArray. Examples include mapping reads to their fiveprime or threeprime ends, with or without offsets. Factories to produce such functions are provided. See references below.

See also

plastid.genomics.map_factories.FivePrimeMapFactory

map reads to 5’ ends, with or without applying an offset

plastid.genomics.map_factories.VariableFivePrimeMapFactory

map reads to 5’ ends, choosing an offset determined by read length

plastid.genomics.map_factories.ThreePrimeMapFactory

map reads to 3’ ends, with or without applying an offset

plastid.genomics.map_factories.CenterMapFactory

map each read fractionally to every position in the read, optionally trimming positions from the ends first

set_normalize(value=True)

Toggle normalization of reported values to reads per million mapped in the dataset

Parameters
valuebool

If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.

set_sum(val)

Set sum used for normalization to an arbitrary value (e.g. from another dataset)

Parameters
valint or float

a number

strands()

Return a tuple of strands in the GenomeArray

Returns
tuple

Chromosome strands as strings

sum()

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

Returns
int or float

Notes

The true (i.e. unnormalized) sum is always reported, even if set_normalize() is set to True

to_bedgraph(fh, trackname, strand, window_size=100000, printer=None, **kwargs)[source]

Write the contents of the BAMGenomeArray to a bedGraph file under the mapping rule set by set_mapping().

See the bedGraph spec for format details.

Parameters
fhfile-like

Filehandle to write to

tracknamestr

Name of browser track

strandstr

Strand of BAMGenomeArray to export. ‘+’, ‘-’, or ‘.’

window_sizeint

Size of chromosome/contig to process at a time. Larger values are faster but less memory-efficient

printerfile-like, optional

Something implementing a write() method for output

**kwargs

Any other key-value pairs to include in track definition line

to_genome_array(array_type=None)[source]

Converts BAMGenomeArray to a GenomeArray or SparseGenomeArray under the mapping rule set by set_mapping()

Parameters
array_typeclass

Type of GenomeArray to return. GenomeArray or SparseGenomeArray. (Default: GenomeArray)

Returns
GenomeArray or SparseGenomeArray
to_variable_step(fh, trackname, strand, window_size=100000, printer=None, **kwargs)[source]

Write the contents of the BAMGenomeArray to a variableStep Wiggle file under the mapping rule set by set_mapping().

See the Wiggle spec for format details.

Parameters
fhfile-like

Filehandle to write to

tracknamestr

Name of browser track

strandstr

Strand of BAMGenomeArray to export. ‘+’, ‘-’, or ‘.’

window_sizeint

Size of chromosome/contig to process at a time. Larger values are faster but less memory-efficient

printerfile-like, optional

Something implementing a write() method for output

**kwargs

Any other key-value pairs to include in track definition line

class plastid.genomics.genome_array.BigWigGenomeArray(maxmem=0)[source]

Bases: plastid.genomics.genome_array.AbstractGenomeArray

High-performance GenomeArray for BigWig files.

Parameters
maxmemfloat

Maximum desired memory footprint for C objects, in megabytes. May be temporarily exceeded if large queries are requested. (Default: 0, No maximum)

Methods

add_from_bigwig(filename, strand)

Import additional data from a BigWig file

chroms()

Return a list of chromosomes in the GenomeArray

get(roi[, roi_order])

Retrieve array of counts from a region of interest.

lengths()

Return a dictionary mapping chromosome names to lengths.

reset_sum()

Reset sum to total of data in the BigWigGenomeArray

set_normalize([value])

Toggle normalization of reported values to reads per million mapped in the dataset

set_sum(val)

Set sum used for normalization to an arbitrary value (e.g.

strands()

Return a tuple of strands in the GenomeArray

sum()

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

to_genome_array()

Converts BigWigGenomeArray to a GenomeArray

add_from_bigwig(filename, strand)[source]

Import additional data from a BigWig file

Parameters
filenamestr

Path to BigWig file

strandstr

Strand to which data should be added. ‘+’, ‘-’, or ‘.’

chroms()[source]

Return a list of chromosomes in the GenomeArray

Returns
list

Chromosome names as strings

get(roi, roi_order=True)[source]

Retrieve array of counts from a region of interest.

Parameters
roiGenomicSegment or SegmentChain

Region of interest in genome

roi_orderbool, optional

If True (default) return vector of values 5’ to 3’ relative to vector rather than genome.

Returns
numpy.ndarray

vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi

See also

plastid.genomics.roitools.SegmentChain.get_counts

Fetch a spliced vector of data covering a SegmentChain

lengths()[source]

Return a dictionary mapping chromosome names to lengths. When two strands report different lengths for a chromosome, the max length is taken.

Returns
dict

Dictionary mapping chromosome names to chromosome lengths

reset_sum()[source]

Reset sum to total of data in the BigWigGenomeArray

set_normalize(value=True)

Toggle normalization of reported values to reads per million mapped in the dataset

Parameters
valuebool

If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.

set_sum(val)

Set sum used for normalization to an arbitrary value (e.g. from another dataset)

Parameters
valint or float

a number

strands()[source]

Return a tuple of strands in the GenomeArray

Returns
tuple

Chromosome strands as strings

sum()

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

Returns
int or float

Notes

The true (i.e. unnormalized) sum is always reported, even if set_normalize() is set to True

to_genome_array()[source]

Converts BigWigGenomeArray to a GenomeArray

Returns
GenomeArray
class plastid.genomics.genome_array.GenomeArray(chr_lengths=None, strands=None, min_chr_size=10000000)[source]

Bases: plastid.genomics.genome_array.MutableAbstractGenomeArray

Array-like data structure that maps numerical values (e.g. read counts or conservation data, et c) to nucleotide positions in a genome.

Data in the array may be manipulated manually and/or imported from Wiggle or BedGraph files, as well as bowtie alignments.

GenomeArray supports basic mathematical operations elementwise, where elements are nucleotide positions.

Parameters
chr_lengthsdict or None, optional

Dictionary mapping chromosome names to lengths. Suppyling this in advance yields considerable speed improvements, as memory can be pre-allocated for each chromosomal position. If not provided, a minimum chromosome size will be guessed, and chromosomes re-sized as needed.

min_chr_sizeint

If chr_lengths is not supplied, min_chr_size is the default first guess of a chromosome size. If the genome has large chromosomes, it is much much better, speed-wise, to provide chr_lengths than to provide a guess here that is too small.

strandssequence

Sequence of strand names for the GenomeArray. (Default: (‘+’,’-‘))

Methods

add_from_bowtie(fh, mapfunc[, min_length, ...])

Import alignment data in native bowtie format to the current GenomeArray

add_from_wiggle(fh, strand)

Import data from a Wiggle or bedGraph file to current GenomeArray

apply_operation(other, func[, mode])

Applies a binary operator to a copy of self and to other elementwise.

chroms()

Return a list of chromosomes in the GenomeArray

get(roi[, roi_order])

Retrieve array of counts from a region of interest (roi) with values in vector ordered 5' to 3' relative to roi rather than genome (i.e.

iterchroms()

Return an iterator of each chromosome strand array

keys()

Return names of chromosomes in the GenomeArray

lengths()

Return a dictionary mapping chromosome names to lengths.

like(other)

Return a GenomeArray of same dimension as the input array

nonzero()

Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair.

plot([chroms, strands])

Create a plot of coverage along tracks or chromosomes

reset_sum()

Reset the sum of the GenomeArray to the sum of all positions in the array

set_normalize([value])

Toggle normalization of reported values to reads per million mapped in the dataset

set_sum(val)

Set sum used for normalization to an arbitrary value (e.g.

strands()

Return a tuple of strands in the GenomeArray

sum()

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

to_bedgraph(fh, trackname, strand[, printer])

Write the contents of the GenomeArray to a bedGraph file

to_variable_step(fh, trackname, strand[, ...])

Export the contents of the GenomeArray to a variable step Wiggle file.

add_from_bowtie(fh, mapfunc, min_length=25, max_length=inf, **trans_args)[source]

Import alignment data in native bowtie format to the current GenomeArray

Parameters
fhfile-like

filehandle pointing to bowtie file

mapfuncfunc

a mapping function that transforms alignment coordinates into a sequence of (GenomicSegment,value) pairs

min_lengthint, optional

minimum length for read to be counted (Default: 25)

max_lengthint or numpy.inf, optional

maximum length for read to be counted (Default: infinity)

trans_argsdict, optional

Keyword arguments to pass to transformation function

See also

five_prime_map

map reads to 5’ ends, with or without applying an offset

variable_five_prime_map

map reads to 5’ ends, choosing an offset determined by read length

three_prime_map

map reads to 3’ ends, with or without applying an offset

center_map

map each read fractionally to every position in the read, optionally trimming positions from the ends first

add_from_wiggle(fh, strand)[source]

Import data from a Wiggle or bedGraph file to current GenomeArray

Parameters
fhfile-like

filehandle pointing to wiggle file

strandstr

Strand to which data should be added. ‘+’, ‘-’, or ‘.’

apply_operation(other, func, mode='same')[source]

Applies a binary operator to a copy of self and to other elementwise. other may be a scalar quantity or another GenomeArray. In both cases, a new GenomeArray is returned, and self is left unmodified. If set_normalize() is set to True, it is disabled during the operation.

Parameters
otherfloat, int, or MutableAbstractGenomeArray

Second argument to func

funcfunc

Function to perform. This must take two arguments. a numpy.ndarray (chromosome-strand) from the GenomeArray will be supplied as the first argument, and the corresponding chromosome-strand from other as the second. This operation will be applied over all chromosomes and strands.

If mode is set to all, and a chromosome or strand is not present in one of self or other, zero will be supplied as the missing argument to func. func must handle this gracefully.

modestr, choice of ‘same’, ‘all’, or ‘truncate’

Only relevant if other is a MutableAbstractGenomeArray

If ‘same’ each set of corresponding chromosomes must have the same dimensions in both parent GenomeArray s. In addition, both GenomeArray s must have the same chromosomes, and the same strands.

If ‘all’, corresponding chromosomes in the parent GenomeArray s will be resized to the dimensions of the larger chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; all chromosomes and strands will be included in output.

If ‘truncate’,corresponding chromosomes in the parent GenomeArray s will be resized to the dimensions of the smaller chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; chromosomes or strands not common to both parents will be ignored.

Returns
GenomeArray

new GenomeArray after the operation is applied

chroms()

Return a list of chromosomes in the GenomeArray

Returns
list

Chromosome names as strings

get(roi, roi_order=True)[source]

Retrieve array of counts from a region of interest (roi) with values in vector ordered 5’ to 3’ relative to roi rather than genome (i.e. are reversed for reverse-strand features).

Parameters
roiGenomicSegment or SegmentChain

Region of interest in genome

roi_orderbool, optional

If True (default) return vector of values 5’ to 3’ relative to vector rather than genome.

Returns
numpy.ndarray

vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi

See also

plastid.genomics.roitools.SegmentChain.get_counts

Fetch a spliced vector of data covering a SegmentChain

iterchroms()[source]

Return an iterator of each chromosome strand array

Yields
numpy.ndarray

Positionwise counts on a chromosome strand

keys()[source]

Return names of chromosomes in the GenomeArray

lengths()

Return a dictionary mapping chromosome names to lengths. When two strands report different lengths for a chromosome, the max length is taken.

Returns
dict

Dictionary mapping chromosome names to chromosome lengths

static like(other)[source]

Return a GenomeArray of same dimension as the input array

Parameters
otherGenomeArray
Returns
GenomeArray

empty GenomeArray of same size as other

nonzero()[source]

Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair. Results are returned as a hierarchical dictionary mapping chromosome names, to a dictionary of strands, which in turn map to to arrays of non-zero indices on that chromosome-strand.

Returns
dict

dict[chrom][strand] = numpy.ndarray of indices

plot(chroms=None, strands=None, **plot_kw)[source]

Create a plot of coverage along tracks or chromosomes

Parameters
chromslist

Chromosomes to plot (default: all)

strandslist

Strands to plot (default: all)

plot_kw

A dictionary of keywords to pass to matplotlib

Returns
matplotlib.figure.Figure
reset_sum()[source]

Reset the sum of the GenomeArray to the sum of all positions in the array

set_normalize(value=True)

Toggle normalization of reported values to reads per million mapped in the dataset

Parameters
valuebool

If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.

set_sum(val)

Set sum used for normalization to an arbitrary value (e.g. from another dataset)

Parameters
valint or float

a number

strands()

Return a tuple of strands in the GenomeArray

Returns
tuple

Chromosome strands as strings

sum()

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

Returns
int or float

Notes

The true (i.e. unnormalized) sum is always reported, even if set_normalize() is set to True

to_bedgraph(fh, trackname, strand, printer=None, **kwargs)[source]

Write the contents of the GenomeArray to a bedGraph file

See the bedGraph spec for format details

Parameters
fhfile-like

Open filehandle to which data will be written

tracknamestr

Name of browser track

strandstr

Strand to export. ‘+’, ‘-’, or ‘.’

printerfile-like, optional

Something implementing a write() method for output

**kwargs

Any other key-value pairs to include in track definition line

to_variable_step(fh, trackname, strand, printer=None, **kwargs)[source]

Export the contents of the GenomeArray to a variable step Wiggle file. For sparse data, bedGraph can be more efficient format.

See the Wiggle spec for format details

Parameters
fhfile-like

Open filehandle to which data will be written

tracknamestr

Name of browser track

strandstr

Strand to export. ‘+’, ‘-’, or ‘.’

printerfile-like, optional

Something implementing a write() method for output

**kwargs

Any other key-value pairs to include in track definition line

class plastid.genomics.genome_array.MutableAbstractGenomeArray(chr_lengths=None, strands=None)[source]

Bases: plastid.genomics.genome_array.AbstractGenomeArray

Abstract base class for GenomeArray-like objects whose values can be changed

Methods

chroms()

Return a list of chromosomes in the GenomeArray

lengths()

Return a dictionary mapping chromosome names to lengths.

reset_sum()

Reset sum to total mapped reads in the GenomeArray

set_normalize([value])

Toggle normalization of reported values to reads per million mapped in the dataset

set_sum(val)

Set sum used for normalization to an arbitrary value (e.g.

strands()

Return a tuple of strands in the GenomeArray

sum()

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

chroms()

Return a list of chromosomes in the GenomeArray

Returns
list

Chromosome names as strings

lengths()

Return a dictionary mapping chromosome names to lengths. When two strands report different lengths for a chromosome, the max length is taken.

Returns
dict

Dictionary mapping chromosome names to chromosome lengths

abstract reset_sum()

Reset sum to total mapped reads in the GenomeArray

set_normalize(value=True)

Toggle normalization of reported values to reads per million mapped in the dataset

Parameters
valuebool

If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.

set_sum(val)

Set sum used for normalization to an arbitrary value (e.g. from another dataset)

Parameters
valint or float

a number

strands()

Return a tuple of strands in the GenomeArray

Returns
tuple

Chromosome strands as strings

sum()

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

Returns
int or float

Notes

The true (i.e. unnormalized) sum is always reported, even if set_normalize() is set to True

class plastid.genomics.genome_array.SparseGenomeArray(chr_lengths=None, strands=None, min_chr_size=10000000)[source]

Bases: plastid.genomics.genome_array.GenomeArray

A memory-efficient sublcass of GenomeArray using sparse internal representation. Note, savings in memory may come at a cost in performance when repeatedly getting/setting values, compared to a GenomeArray.

Parameters
chr_lengthsdict, optional

Dictionary mapping chromosome names to lengths. Suppyling this parameter yields considerable speed improvements, as memory can be pre-allocated for each chromosomal position. If not provided, a minimum chromosome size will be guessed, and chromosomes re-sized as needed (Default: {} )

min_chr_sizeint,optional

If chr_lengths is not supplied, min_chr_size is the default first guess of a chromosome size. If your genome has large chromosomes, it is much much better, speed-wise, to provide a chr_lengths dict than to provide a guess here that is too small. (Default: %s)

strandssequence

Sequence of strand names for the GenomeArray. (Default: (‘+’,’-‘))

Methods

add_from_bowtie(fh, mapfunc[, min_length, ...])

Import alignment data in native bowtie format to the current GenomeArray

add_from_wiggle(fh, strand)

Import data from a Wiggle or bedGraph file to current GenomeArray

apply_operation(other, func[, mode])

Apply a binary operator to a copy of self and to other elementwise.

chroms()

Return a list of chromosomes in the GenomeArray

get(roi[, roi_order])

Retrieve array of counts from a region of interest (roi) with values in vector ordered 5' to 3' relative to roi rather than genome (i.e.

iterchroms()

Return an iterator of each chromosome strand array

keys()

Return names of chromosomes in the GenomeArray

lengths()

Return a dictionary mapping chromosome names to lengths.

like(other)

Return a SparseGenomeArray of same dimension as the input array

nonzero()

Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair.

plot([chroms, strands])

Create a plot of coverage along tracks or chromosomes

reset_sum()

Reset the sum of the GenomeArray to the sum of all positions in the array

set_normalize([value])

Toggle normalization of reported values to reads per million mapped in the dataset

set_sum(val)

Set sum used for normalization to an arbitrary value (e.g.

strands()

Return a tuple of strands in the GenomeArray

sum()

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

to_bedgraph(fh, trackname, strand[, printer])

Write the contents of the GenomeArray to a bedGraph file

to_variable_step(fh, trackname, strand[, ...])

Export the contents of the GenomeArray to a variable step Wiggle file.

add_from_bowtie(fh, mapfunc, min_length=25, max_length=inf, **trans_args)

Import alignment data in native bowtie format to the current GenomeArray

Parameters
fhfile-like

filehandle pointing to bowtie file

mapfuncfunc

a mapping function that transforms alignment coordinates into a sequence of (GenomicSegment,value) pairs

min_lengthint, optional

minimum length for read to be counted (Default: 25)

max_lengthint or numpy.inf, optional

maximum length for read to be counted (Default: infinity)

trans_argsdict, optional

Keyword arguments to pass to transformation function

See also

five_prime_map

map reads to 5’ ends, with or without applying an offset

variable_five_prime_map

map reads to 5’ ends, choosing an offset determined by read length

three_prime_map

map reads to 3’ ends, with or without applying an offset

center_map

map each read fractionally to every position in the read, optionally trimming positions from the ends first

add_from_wiggle(fh, strand)

Import data from a Wiggle or bedGraph file to current GenomeArray

Parameters
fhfile-like

filehandle pointing to wiggle file

strandstr

Strand to which data should be added. ‘+’, ‘-’, or ‘.’

apply_operation(other, func, mode=None)[source]

Apply a binary operator to a copy of self and to other elementwise. other may be a scalar quantity or another GenomeArray. In both cases, a new SparseGenomeArray is returned, and self is left unmodified. If set_normalize() is set to True, it is disabled during the operation.

Parameters
otherfloat, int, or MutableAbstractGenomeArray

Second argument to func

funcfunc

Function to perform. This must take two arguments. a numpy.ndarray (chromosome-strand) from the SparseGenomeArray will be supplied as the first argument, and the corresponding chromosome-strand from other as the second. This operation will be applied over all chromosomes and strands.

If mode is set to all, and a chromosome or strand is not present in one of self or other, zero will be supplied as the missing argument to func. func must handle this gracefully.

modestr, choice of ‘same’, ‘all’, or ‘truncate’

Only relevant if other is a MutableAbstractGenomeArray

If ‘same’ each set of corresponding chromosomes must have the same dimensions in both parent SparseGenomeArrays. In addition, both SparseGenomeArrays must have the same chromosomes, and the same strands.

If ‘all’, corresponding chromosomes in the parent SparseGenomeArrays will be resized to the dimensions of the larger chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; all chromosomes and strands will be included in output.

If ‘truncate’,corresponding chromosomes in the parent SparseGenomeArrays will be resized to the dimensions of the smaller chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; chromosomes or strands not common to both parents will be ignored.

Returns
SparseGenomeArray

new SparseGenomeArray after the operation is applied

chroms()

Return a list of chromosomes in the GenomeArray

Returns
list

Chromosome names as strings

get(roi, roi_order=True)[source]

Retrieve array of counts from a region of interest (roi) with values in vector ordered 5’ to 3’ relative to roi rather than genome (i.e. are reversed for reverse-strand features).

Parameters
roiGenomicSegment or SegmentChain

Region of interest in genome

roi_orderbool, optional

If True (default) return vector of values 5’ to 3’ relative to roi rather than genome.

Returns
numpy.ndarray

vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi

See also

SegmentChain.get_counts

Fetch a spliced vector of data covering a SegmentChain

iterchroms()

Return an iterator of each chromosome strand array

Yields
numpy.ndarray

Positionwise counts on a chromosome strand

keys()

Return names of chromosomes in the GenomeArray

lengths()[source]

Return a dictionary mapping chromosome names to lengths. In the case where two strands report different lengths for a chromosome, the max length is taken.

Returns
dict

mapping chromosome names to lengths

static like(other)[source]

Return a SparseGenomeArray of same dimension as the input array

Parameters
otherGenomeArray or SparseGenomeArray
Returns
SparseGenomeArray

of same size as other

nonzero()[source]

Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair. Results are returned as a hierarchical dictionary mapping chromosome names, to a dictionary of strands, which in turn map to to arrays of non-zero indices on that chromosome-strand.

Returns
dict

dict[chrom][strand] = numpy.ndarray of indices

plot(chroms=None, strands=None, **plot_kw)

Create a plot of coverage along tracks or chromosomes

Parameters
chromslist

Chromosomes to plot (default: all)

strandslist

Strands to plot (default: all)

plot_kw

A dictionary of keywords to pass to matplotlib

Returns
matplotlib.figure.Figure
reset_sum()

Reset the sum of the GenomeArray to the sum of all positions in the array

set_normalize(value=True)

Toggle normalization of reported values to reads per million mapped in the dataset

Parameters
valuebool

If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.

set_sum(val)

Set sum used for normalization to an arbitrary value (e.g. from another dataset)

Parameters
valint or float

a number

strands()

Return a tuple of strands in the GenomeArray

Returns
tuple

Chromosome strands as strings

sum()

Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray

Returns
int or float

Notes

The true (i.e. unnormalized) sum is always reported, even if set_normalize() is set to True

to_bedgraph(fh, trackname, strand, printer=None, **kwargs)

Write the contents of the GenomeArray to a bedGraph file

See the bedGraph spec for format details

Parameters
fhfile-like

Open filehandle to which data will be written

tracknamestr

Name of browser track

strandstr

Strand to export. ‘+’, ‘-’, or ‘.’

printerfile-like, optional

Something implementing a write() method for output

**kwargs

Any other key-value pairs to include in track definition line

to_variable_step(fh, trackname, strand, printer=None, **kwargs)

Export the contents of the GenomeArray to a variable step Wiggle file. For sparse data, bedGraph can be more efficient format.

See the Wiggle spec for format details

Parameters
fhfile-like

Open filehandle to which data will be written

tracknamestr

Name of browser track

strandstr

Strand to export. ‘+’, ‘-’, or ‘.’

printerfile-like, optional

Something implementing a write() method for output

**kwargs

Any other key-value pairs to include in track definition line

plastid.genomics.genome_array.center_map(feature, **kwargs)[source]

Center-mapping function used as an argument to GenomeArray.add_from_bowtie(). A user-specified number of bases is optionally removed from each side of each read alignment, and the N remaining bases are each apportioned 1/N of the read count, so that the entire read is counted once.

Parameters
featureSegmentChain

Ungapped genomic alignment

kwargs[‘value’]float, optional

Total value to divide over aligning positions (Default: 1.0)

kwargs[‘nibble’]int, optional

Positions to remove from each end before mapping (Default: 0)

kwargs[‘offset’]int, optional

Mapping offset, if any, from 5’ end of read (Default: 0)

Returns
list

tuples of (GenomicSegment,float value over segment)

plastid.genomics.genome_array.five_prime_map(feature, **kwargs)[source]

Fiveprime mapping function used as an argument to GenomeArray.add_from_bowtie(). Reads are mapped at a user-specified offset from the fiveprime end of the alignment.

Parameters
featureSegmentChain

Ungapped genomic alignment

kwargs[‘value’]float or int, optional

Value to apportion (Default: 1)

kwargs[‘offset’]int, optional

Mapping offset, if any, from 5’ toward 3’ end of read

Returns
list

tuples of (GenomicSegment,float value over segment)

plastid.genomics.genome_array.three_prime_map(feature, **kwargs)[source]

Threeprime mapping function used as an argument to GenomeArray.add_from_bowtie(). Reads are mapped at a user-specified offset from the threeprime end of the alignment.

Parameters
featureSegmentChain

Ungapped genomic alignment

kwargs[‘value’]float or int, optional

Value to apportion (Default: 1)

kwargs[‘offset’]int, optional

Mapping offset, if any, from 3’ toward 5’ end of read.

Returns
list

tuples of (GenomicSegment,float value over segment)

plastid.genomics.genome_array.variable_five_prime_map(feature, **kwargs)[source]

Fiveprime variable mapping function used as an argument to GenomeArray.add_from_bowtie(). Reads are mapped at a user-specified offset from the fiveprime end of the alignment. The offset for a read of a given length is supplied in kwargs[offset][readlen]

Parameters
featureSegmentChain

Ungapped genomic alignment

kwargs[‘offset’]dict, required

Dictionary mapping read lengths to offsets

kwargs[‘value’]float or int, optional

Value to apportion (Default: 1)

Returns
list

tuples of (GenomicSegment,float value over segment)