plastid.genomics.genome_array module¶
GenomeArrays map quantitative data or read alignments to genomic positions.
Summary¶
GenomeArrays are typically accessed using SegmentChains
.
The get_counts()
method of
SegmentChain
and Transcript
fetches arrays of data from GenomeArrays. The
result is a numpy array
in which each position
corresponds to a position in the SegmentChain
, moving from 5’ to 3’, in the
spliced SegmentChain
or Transcript
. For a Transcript
called
my_transcript:
>>> # open a BAM file, map reads 15 nucleotides from 3' end
>>> genome_array = BAMGenomeArray("some_file.bam")
>>> genome_array.set_mapping(ThreePrimeMapFactory(offset=15))
>>> my_data = my_transcript.get_counts(genome_array)
>>> my_data
array([ 95, 133, 87, 491, 100, 73, 308, 217, 447, 204, 245, 264, 395,
76, 387, 319, 43, 171, 101, 152, 58, 167, 53, 397, 228, 44,
402, 285, 480, 55, 9, 51, 358, 307, 144, 182, 293, 455, 403,
375, 443, 442, 196, 133, 238, 268, 70, 91, 172, 490, 346, 294,
117, 424, 180, 14, 400, 414, 257, 127, 344, 307, 435, 152, 458,
423, 60, 173, 362, 277, 393, 297, 377, 357, 68, 143, 467, 7,
86, 450, 230, 62, 278, 65, 346, 103, 185, 7, 350, 162, 461,
274, 205, 185, 209, 275, 190, 434, 10, 241])
Alternatively, SegmentChains
and GenomicSegments
can be used directly as
dictionary keys:
>>> my_data = genome_array[my_transcript]
>>> my_data
array([ 95, 133, 87, 491, 100, 73, 308, 217, 447, 204, 245, 264, 395,
76, 387, 319, 43, 171, 101, 152, 58, 167, 53, 397, 228, 44,
402, 285, 480, 55, 9, 51, 358, 307, 144, 182, 293, 455, 403,
375, 443, 442, 196, 133, 238, 268, 70, 91, 172, 490, 346, 294,
117, 424, 180, 14, 400, 414, 257, 127, 344, 307, 435, 152, 458,
423, 60, 173, 362, 277, 393, 297, 377, 357, 68, 143, 467, 7,
86, 450, 230, 62, 278, 65, 346, 103, 185, 7, 350, 162, 461,
274, 205, 185, 209, 275, 190, 434, 10, 241])
Module contents¶
Several implementations are provided. These are tailored to data stored in specific formats, or for specific purposes. For example, mutable GenomeArrays allow values to be changed in-place, e.g. via mathematical operations or manual setting.
Implementation |
Format of data |
Mutable |
Use of mapping functions |
BAM file(s) |
No |
Inline; may be changed without reloading data |
|
BigWig file(s) |
No |
n/a |
|
Yes |
On import from bowtie files only. Data must be reimported to change mapping |
||
Yes |
On import from bowtie files only. Data must be reimported to change mapping |
Extended summary & examples¶
Mapping functions¶
GenomeArrays use Mapping functions to extract the biology of interest from read alignments, and to vectorize these data over genomic positions.
Depending on the data type and the purpose of the experiment, different functions may be appropriate. The following mapping rules are provided, although users are encouraged to provide their own if needed. These include:
Mapping strategy |
Description |
Sample use |
Fiveprime |
Each read alignment is mapped to its 5’ end, or at a fixed distance from its 5’ end. |
RNA-seq, ribosome profiling, DMS-seq |
Variable fiveprime |
Each read alignment is mapped at a fixed distance from its 5’ end, where the distance is determined by the length of the alignment |
Ribosome profiling |
Stratified variable fiveprime |
Like variable fiveprime mapping, except reads are stratified by length into separate rows of an array |
Ribosome profiling |
Threeprime |
Each read alignment is mapped to its 5’ end, or at a fixed distance from its 5’ end. |
RNA-seq, ribosome profiling |
Center/entire |
Zero or more positions are trimmed from each end of the read alignment, and the remaining N positions are incremented by 1/N |
Ribosome profiling with micrococcal nuclease, RNA-seq |
Mapping functions are rarely called directly. Instead, they are invoked via
BAMGenomeArray.set_mapping()
:
# open BAM file >>> ga = BAMGenomeArray("some_file.bam") # set mapping 15 bases from 3' end of read >>> ga.set_mapping(ThreePrimeMapFactory(offset=15)) # set mapping to 5' end of read >>> ga.set_mapping(FivePrimeMapFactory()) # cut 3 nt off each end of the read, and map fractionally over the rest >>> ga.set_mapping(CenterMapFactory(nibble=12)) # access reads in some region >>> ga[GenomicSegment("chrI",10000,100000,"+")] # output is numpy array
For further details see map_factories
and
Read mapping functions
Export to browser tracks¶
GenomeArrays can export data to wiggle or bedGraph formats. To export forward-strand data to some_file.wig:
>>> genome_array.to_bedgraph(open("some_file.wig","w"),"my_track","+")
Normalization¶
GenomeArrays also offer utilities for summing and normalization:
>>> genome_array.sum() # get sum
71292045
# set sum to an arbitrary value- can be useful for normalizing between
# multiple alignment files
>>> genome_array.set_sum(61241234324)
>>> genome_array.sum()
61241234324
# reset sum to data in file
>>> genome_array.reset_sum()
>>> genome_array.sum()
71292045
# with set_normalize(), all arrays will be normalized to counts per million
# relative to the current sum
>>> genome_array.set_normalize(True)
>>> my_transcript.get_counts(genome_array)
array([ 1.33254699, 1.86556579, 1.22033251, 6.88716392, 1.40268104,
1.02395716, 4.32025761, 3.04381786, 6.26998426, 2.86146933,
3.43656855, 3.70307795, 5.54059012, 1.06603759, 5.42837563,
4.47455253, 0.60315285, 2.39858458, 1.41670785, 2.13207518,
0.813555 , 2.34247734, 0.74342095, 5.56864374, 3.19811278,
0.61717966, 5.63877779, 3.99764097, 6.732869 , 0.77147457,
0.12624129, 0.71536733, 5.02159813, 4.3062308 , 2.0198607 ,
2.5528795 , 4.10985545, 6.38219874, 5.6528046 , 5.26005391,
6.21387702, 6.19985021, 2.74925484, 1.86556579, 3.33838088,
3.75918519, 0.98187673, 1.27643975, 2.41261139, 6.87313711,
4.85327641, 4.12388226, 1.64113682, 5.94736762, 2.52482588,
0.19637535, 5.61072417, 5.80709952, 3.60489028, 1.78140492,
4.82522279, 4.3062308 , 6.10166253, 2.13207518, 6.42427917,
5.93334081, 0.84160863, 2.4266382 , 5.07770537, 3.88542649,
5.5125365 , 4.1659627 , 5.28810753, 5.00757132, 0.95382311,
2.00583389, 6.55052047, 0.09818767, 1.2063057 , 6.31206469,
3.2261664 , 0.86966225, 3.8994533 , 0.91174268, 4.85327641,
1.44476147, 2.59495993, 0.09818767, 4.90938365, 2.27234329,
6.46635961, 3.84334606, 2.87549614, 2.59495993, 2.93160338,
3.85737287, 2.66509398, 6.08763572, 0.1402681 , 3.38046131])
# similarly, with set_normalize() on, browser tracks we'll be normalized
>>> genome_array.to_bedgraph(open("some_file_normalized.wig","w"),"my_normalized_track","+")
- class plastid.genomics.genome_array.BAMGenomeArray(*bamfiles, mapping=CenterMapFactory())[source]¶
Bases:
plastid.genomics.genome_array.AbstractGenomeArray
A GenomeArray for read alignments in BAM files.
When a user requests read alignments or counts at a set of genomic positions,
BAMGenomeArray
uses a mapping rule to determine which reads from the BAM file should be returned and how these reads should be mapped to positions in a vector. Reads are not mapped until requested, and mapping rules are changeable at runtime viaBAMGenomeArray.set_mapping()
- Parameters
- bamfileOne or more filenames or open
pysam.AlignmentFile
Filename(s) of BAM files, or open
pysam.AlignmentFile
to include in array. Note: all BAM files must be sorted and indexed by samtools.- mappingfunc
mapping function that determines how each read alignment is mapped to a count at a genomic position. Examples include mapping reads to their fiveprime ends, mapping reads to their threeprime ends, or mapping somewhere in the middle. Factories to produce such functions are provided. See references below. (Default:
CenterMapFactory()
)
- bamfileOne or more filenames or open
Notes
The alignment data in
BAMGenomeArray
objects is immutable. If you need to change the data in-place, theBAMGenomeArray
can be converted to aGenomeArray
orSparseGenomeArray
viato_genome_array()
- Attributes
- map_fnfunction
mapping function used to convert read alignments to counts
- bamfileslist
BAM files or read alignments, as a list of
pysam.AlignmentFile
Methods
add_filter
(name, func)Apply a function to filter reads retrieved from regions before mapping and counting
chroms
()Returns a list of chromosomes
get
(roi[, roi_order])Retrieve array of counts from a region of interest, following the mapping rule set by
set_mapping()
.Return the docstring of the current mapping function
get_reads
(roi)Returns reads covering a
GenomicSegment
.get_reads_and_counts
(roi[, roi_order])Return read alignments covering a
GenomicSegment
, and a count vector mapping reads to each positions in theGenomicSegment
, following the rule specified byset_mapping()
.lengths
()Returns a dictionary mapping chromosome names to lengths.
remove_filter
(name)Remove a generic filter
Reset the sum to the total number of mapped reads in the
BAMGenomeArray
set_mapping
(mapping_function)Change the mapping rule
set_normalize
([value])Toggle normalization of reported values to reads per million mapped in the dataset
set_sum
(val)Set sum used for normalization to an arbitrary value (e.g.
strands
()Return a tuple of strands in the GenomeArray
sum
()Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
to_bedgraph
(fh, trackname, strand[, ...])Write the contents of the
BAMGenomeArray
to a bedGraph file under the mapping rule set byset_mapping()
.to_genome_array
([array_type])Converts
BAMGenomeArray
to aGenomeArray
orSparseGenomeArray
under the mapping rule set byset_mapping()
to_variable_step
(fh, trackname, strand[, ...])Write the contents of the
BAMGenomeArray
to a variableStep Wiggle file under the mapping rule set byset_mapping()
.- add_filter(name, func)[source]¶
Apply a function to filter reads retrieved from regions before mapping and counting
- Parameters
- namestr
A name for the filter. If not unique, will overwrite previous filter
- funcfunc
Filter function. Function must take a
pysam.AlignedSegment
as a single parameter, and return True if that read should be included in output, or False otherwise
See also
plastid.genomics.map_factories.SizeFilterFactory
generate filter functions that gate read alignments on size
Notes
In Python, lambda functions do NOT have their own scope! We strongly recomend defining filter functions using the
def
syntax to avoid namespace collisions.
- get(roi, roi_order=True)[source]¶
Retrieve array of counts from a region of interest, following the mapping rule set by
set_mapping()
. Values in the vector are ordered 5’ to 3’ relative to roi rather than the genome (i.e. are reversed for reverse-strand features).- Parameters
- roi
GenomicSegment
orSegmentChain
Region of interest in genome
- roi_orderbool, optional
If True (default) return vector of values 5’ to 3’ relative to vector rather than genome.
- roi
- Returns
- numpy.ndarray
vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi
- Raises
- ValueError
if bamfiles not sorted or not indexed
See also
plastid.genomics.roitools.SegmentChain.get_counts
Fetch a spliced vector of data covering a
SegmentChain
- get_reads(roi)[source]¶
Returns reads covering a
GenomicSegment
. Reads are strand-matched to roi by default, and are included or excluded depending upon the rule specified byset_mapping()
. To obtain unstranded reads, set the value of roi.strand to ‘.’- Parameters
- roi
GenomicSegment
Region of interest
- roi
- Returns
- list
List of reads (as
pysam.AlignedSegment
) covering region of interest, according to the rule set inset_mapping()
- Raises
- ValueError
if bamfiles are not sorted or not indexed
- get_reads_and_counts(roi, roi_order=True)[source]¶
Return read alignments covering a
GenomicSegment
, and a count vector mapping reads to each positions in theGenomicSegment
, following the rule specified byset_mapping()
. Reads are strand-matched to roi by default. To obtain unstranded reads, set the value of roi.strand to ‘.’- Parameters
- roi
GenomicSegment
Region of interest
- roi
- Returns
- list
List of reads (as
pysam.AlignedSegment
) covering region of interest- numpy.ndarray
Counts at each position of roi, under whatever mapping rule was set by
set_mapping()
- Raises
- ValueError
if bamfiles not sorted or not indexed
- lengths()[source]¶
Returns a dictionary mapping chromosome names to lengths.
- Returns
- dict
mapping chromosome names to lengths
- remove_filter(name)[source]¶
Remove a generic filter
- Parameters
- namestr
A name for the filter
- Returns
- func
the removed filter function
- reset_sum()[source]¶
Reset the sum to the total number of mapped reads in the
BAMGenomeArray
Notes
Filters are not applied in this summation. See
BAMGenomeArray.set_sum()
to manually set a sum, which e.g. could correspond to a total number of filtered reads.
- set_mapping(mapping_function)[source]¶
Change the mapping rule
- Parameters
- mappingfunc
Function that determines how each read alignment is mapped to a count at a position in the
BAMGenomeArray
. Examples include mapping reads to their fiveprime or threeprime ends, with or without offsets. Factories to produce such functions are provided. See references below.
See also
plastid.genomics.map_factories.FivePrimeMapFactory
map reads to 5’ ends, with or without applying an offset
plastid.genomics.map_factories.VariableFivePrimeMapFactory
map reads to 5’ ends, choosing an offset determined by read length
plastid.genomics.map_factories.ThreePrimeMapFactory
map reads to 3’ ends, with or without applying an offset
plastid.genomics.map_factories.CenterMapFactory
map each read fractionally to every position in the read, optionally trimming positions from the ends first
- set_normalize(value=True)¶
Toggle normalization of reported values to reads per million mapped in the dataset
- Parameters
- valuebool
If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.
- set_sum(val)¶
Set sum used for normalization to an arbitrary value (e.g. from another dataset)
- Parameters
- valint or float
a number
- strands()¶
Return a tuple of strands in the GenomeArray
- Returns
- tuple
Chromosome strands as strings
- sum()¶
Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
- Returns
- int or float
Notes
The true (i.e. unnormalized) sum is always reported, even if
set_normalize()
is set to True
- to_bedgraph(fh, trackname, strand, window_size=100000, printer=None, **kwargs)[source]¶
Write the contents of the
BAMGenomeArray
to a bedGraph file under the mapping rule set byset_mapping()
.See the bedGraph spec for format details.
- Parameters
- fhfile-like
Filehandle to write to
- tracknamestr
Name of browser track
- strandstr
Strand of
BAMGenomeArray
to export. ‘+’, ‘-’, or ‘.’- window_sizeint
Size of chromosome/contig to process at a time. Larger values are faster but less memory-efficient
- printerfile-like, optional
Something implementing a write() method for output
- **kwargs
Any other key-value pairs to include in track definition line
- to_genome_array(array_type=None)[source]¶
Converts
BAMGenomeArray
to aGenomeArray
orSparseGenomeArray
under the mapping rule set byset_mapping()
- Parameters
- array_typeclass
Type of GenomeArray to return.
GenomeArray
orSparseGenomeArray
. (Default:GenomeArray
)
- Returns
- to_variable_step(fh, trackname, strand, window_size=100000, printer=None, **kwargs)[source]¶
Write the contents of the
BAMGenomeArray
to a variableStep Wiggle file under the mapping rule set byset_mapping()
.See the Wiggle spec for format details.
- Parameters
- fhfile-like
Filehandle to write to
- tracknamestr
Name of browser track
- strandstr
Strand of
BAMGenomeArray
to export. ‘+’, ‘-’, or ‘.’- window_sizeint
Size of chromosome/contig to process at a time. Larger values are faster but less memory-efficient
- printerfile-like, optional
Something implementing a write() method for output
- **kwargs
Any other key-value pairs to include in track definition line
- class plastid.genomics.genome_array.BigWigGenomeArray(maxmem=0)[source]¶
Bases:
plastid.genomics.genome_array.AbstractGenomeArray
High-performance GenomeArray for BigWig files.
- Parameters
- maxmemfloat
Maximum desired memory footprint for C objects, in megabytes. May be temporarily exceeded if large queries are requested. (Default: 0, No maximum)
Methods
add_from_bigwig
(filename, strand)Import additional data from a BigWig file
chroms
()Return a list of chromosomes in the GenomeArray
get
(roi[, roi_order])Retrieve array of counts from a region of interest.
lengths
()Return a dictionary mapping chromosome names to lengths.
Reset sum to total of data in the
BigWigGenomeArray
set_normalize
([value])Toggle normalization of reported values to reads per million mapped in the dataset
set_sum
(val)Set sum used for normalization to an arbitrary value (e.g.
strands
()Return a tuple of strands in the GenomeArray
sum
()Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
Converts
BigWigGenomeArray
to aGenomeArray
- add_from_bigwig(filename, strand)[source]¶
Import additional data from a BigWig file
- Parameters
- filenamestr
Path to BigWig file
- strandstr
Strand to which data should be added. ‘+’, ‘-’, or ‘.’
- chroms()[source]¶
Return a list of chromosomes in the GenomeArray
- Returns
- list
Chromosome names as strings
- get(roi, roi_order=True)[source]¶
Retrieve array of counts from a region of interest.
- Parameters
- roi
GenomicSegment
orSegmentChain
Region of interest in genome
- roi_orderbool, optional
If True (default) return vector of values 5’ to 3’ relative to vector rather than genome.
- roi
- Returns
- numpy.ndarray
vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi
See also
plastid.genomics.roitools.SegmentChain.get_counts
Fetch a spliced vector of data covering a
SegmentChain
- lengths()[source]¶
Return a dictionary mapping chromosome names to lengths. When two strands report different lengths for a chromosome, the max length is taken.
- Returns
- dict
Dictionary mapping chromosome names to chromosome lengths
- reset_sum()[source]¶
Reset sum to total of data in the
BigWigGenomeArray
- set_normalize(value=True)¶
Toggle normalization of reported values to reads per million mapped in the dataset
- Parameters
- valuebool
If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.
- set_sum(val)¶
Set sum used for normalization to an arbitrary value (e.g. from another dataset)
- Parameters
- valint or float
a number
- strands()[source]¶
Return a tuple of strands in the GenomeArray
- Returns
- tuple
Chromosome strands as strings
- sum()¶
Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
- Returns
- int or float
Notes
The true (i.e. unnormalized) sum is always reported, even if
set_normalize()
is set to True
- to_genome_array()[source]¶
Converts
BigWigGenomeArray
to aGenomeArray
- Returns
- class plastid.genomics.genome_array.GenomeArray(chr_lengths=None, strands=None, min_chr_size=10000000)[source]¶
Bases:
plastid.genomics.genome_array.MutableAbstractGenomeArray
Array-like data structure that maps numerical values (e.g. read counts or conservation data, et c) to nucleotide positions in a genome.
Data in the array may be manipulated manually and/or imported from Wiggle or BedGraph files, as well as bowtie alignments.
GenomeArray
supports basic mathematical operations elementwise, where elements are nucleotide positions.- Parameters
- chr_lengthsdict or None, optional
Dictionary mapping chromosome names to lengths. Suppyling this in advance yields considerable speed improvements, as memory can be pre-allocated for each chromosomal position. If not provided, a minimum chromosome size will be guessed, and chromosomes re-sized as needed.
- min_chr_sizeint
If chr_lengths is not supplied, min_chr_size is the default first guess of a chromosome size. If the genome has large chromosomes, it is much much better, speed-wise, to provide chr_lengths than to provide a guess here that is too small.
- strandssequence
Sequence of strand names for the
GenomeArray
. (Default: (‘+’,’-‘))
Methods
add_from_bowtie
(fh, mapfunc[, min_length, ...])Import alignment data in native bowtie format to the current GenomeArray
add_from_wiggle
(fh, strand)Import data from a Wiggle or bedGraph file to current GenomeArray
apply_operation
(other, func[, mode])Applies a binary operator to a copy of self and to other elementwise.
chroms
()Return a list of chromosomes in the GenomeArray
get
(roi[, roi_order])Retrieve array of counts from a region of interest (roi) with values in vector ordered 5' to 3' relative to roi rather than genome (i.e.
Return an iterator of each chromosome strand array
keys
()Return names of chromosomes in the GenomeArray
lengths
()Return a dictionary mapping chromosome names to lengths.
like
(other)Return a
GenomeArray
of same dimension as the input arraynonzero
()Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair.
plot
([chroms, strands])Create a plot of coverage along tracks or chromosomes
Reset the sum of the
GenomeArray
to the sum of all positions in the arrayset_normalize
([value])Toggle normalization of reported values to reads per million mapped in the dataset
set_sum
(val)Set sum used for normalization to an arbitrary value (e.g.
strands
()Return a tuple of strands in the GenomeArray
sum
()Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
to_bedgraph
(fh, trackname, strand[, printer])Write the contents of the GenomeArray to a bedGraph file
to_variable_step
(fh, trackname, strand[, ...])Export the contents of the GenomeArray to a variable step Wiggle file.
- add_from_bowtie(fh, mapfunc, min_length=25, max_length=inf, **trans_args)[source]¶
Import alignment data in native bowtie format to the current GenomeArray
- Parameters
- fhfile-like
filehandle pointing to bowtie file
- mapfuncfunc
a mapping function that transforms alignment coordinates into a sequence of (
GenomicSegment
,value) pairs- min_lengthint, optional
minimum length for read to be counted (Default: 25)
- max_lengthint or numpy.inf, optional
maximum length for read to be counted (Default: infinity)
- trans_argsdict, optional
Keyword arguments to pass to transformation function
See also
five_prime_map
map reads to 5’ ends, with or without applying an offset
variable_five_prime_map
map reads to 5’ ends, choosing an offset determined by read length
three_prime_map
map reads to 3’ ends, with or without applying an offset
center_map
map each read fractionally to every position in the read, optionally trimming positions from the ends first
- add_from_wiggle(fh, strand)[source]¶
Import data from a Wiggle or bedGraph file to current GenomeArray
- Parameters
- fhfile-like
filehandle pointing to wiggle file
- strandstr
Strand to which data should be added. ‘+’, ‘-’, or ‘.’
- apply_operation(other, func, mode='same')[source]¶
Applies a binary operator to a copy of self and to other elementwise. other may be a scalar quantity or another
GenomeArray
. In both cases, a newGenomeArray
is returned, and self is left unmodified. Ifset_normalize()
is set to True, it is disabled during the operation.- Parameters
- otherfloat, int, or
MutableAbstractGenomeArray
Second argument to func
- funcfunc
Function to perform. This must take two arguments. a
numpy.ndarray
(chromosome-strand) from theGenomeArray
will be supplied as the first argument, and the corresponding chromosome-strand from other as the second. This operation will be applied over all chromosomes and strands.If mode is set to all, and a chromosome or strand is not present in one of self or other, zero will be supplied as the missing argument to func. func must handle this gracefully.
- modestr, choice of ‘same’, ‘all’, or ‘truncate’
Only relevant if other is a
MutableAbstractGenomeArray
If ‘same’ each set of corresponding chromosomes must have the same dimensions in both parent
GenomeArray
s. In addition, bothGenomeArray
s must have the same chromosomes, and the same strands.If ‘all’, corresponding chromosomes in the parent
GenomeArray
s will be resized to the dimensions of the larger chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; all chromosomes and strands will be included in output.If ‘truncate’,corresponding chromosomes in the parent
GenomeArray
s will be resized to the dimensions of the smaller chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; chromosomes or strands not common to both parents will be ignored.
- otherfloat, int, or
- Returns
GenomeArray
new
GenomeArray
after the operation is applied
- chroms()¶
Return a list of chromosomes in the GenomeArray
- Returns
- list
Chromosome names as strings
- get(roi, roi_order=True)[source]¶
Retrieve array of counts from a region of interest (roi) with values in vector ordered 5’ to 3’ relative to roi rather than genome (i.e. are reversed for reverse-strand features).
- Parameters
- roi
GenomicSegment
orSegmentChain
Region of interest in genome
- roi_orderbool, optional
If True (default) return vector of values 5’ to 3’ relative to vector rather than genome.
- roi
- Returns
numpy.ndarray
vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi
See also
plastid.genomics.roitools.SegmentChain.get_counts
Fetch a spliced vector of data covering a
SegmentChain
- iterchroms()[source]¶
Return an iterator of each chromosome strand array
- Yields
- numpy.ndarray
Positionwise counts on a chromosome strand
- lengths()¶
Return a dictionary mapping chromosome names to lengths. When two strands report different lengths for a chromosome, the max length is taken.
- Returns
- dict
Dictionary mapping chromosome names to chromosome lengths
- static like(other)[source]¶
Return a
GenomeArray
of same dimension as the input array- Parameters
- other
GenomeArray
- other
- Returns
- GenomeArray
empty
GenomeArray
of same size as other
- nonzero()[source]¶
Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair. Results are returned as a hierarchical dictionary mapping chromosome names, to a dictionary of strands, which in turn map to to arrays of non-zero indices on that chromosome-strand.
- Returns
- dict
dict[chrom][strand] = numpy.ndarray of indices
- plot(chroms=None, strands=None, **plot_kw)[source]¶
Create a plot of coverage along tracks or chromosomes
- Parameters
- chromslist
Chromosomes to plot (default: all)
- strandslist
Strands to plot (default: all)
- plot_kw
A dictionary of keywords to pass to matplotlib
- Returns
- reset_sum()[source]¶
Reset the sum of the
GenomeArray
to the sum of all positions in the array
- set_normalize(value=True)¶
Toggle normalization of reported values to reads per million mapped in the dataset
- Parameters
- valuebool
If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.
- set_sum(val)¶
Set sum used for normalization to an arbitrary value (e.g. from another dataset)
- Parameters
- valint or float
a number
- strands()¶
Return a tuple of strands in the GenomeArray
- Returns
- tuple
Chromosome strands as strings
- sum()¶
Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
- Returns
- int or float
Notes
The true (i.e. unnormalized) sum is always reported, even if
set_normalize()
is set to True
- to_bedgraph(fh, trackname, strand, printer=None, **kwargs)[source]¶
Write the contents of the GenomeArray to a bedGraph file
See the bedGraph spec for format details
- Parameters
- fhfile-like
Open filehandle to which data will be written
- tracknamestr
Name of browser track
- strandstr
Strand to export. ‘+’, ‘-’, or ‘.’
- printerfile-like, optional
Something implementing a write() method for output
- **kwargs
Any other key-value pairs to include in track definition line
- to_variable_step(fh, trackname, strand, printer=None, **kwargs)[source]¶
Export the contents of the GenomeArray to a variable step Wiggle file. For sparse data, bedGraph can be more efficient format.
See the Wiggle spec for format details
- Parameters
- fhfile-like
Open filehandle to which data will be written
- tracknamestr
Name of browser track
- strandstr
Strand to export. ‘+’, ‘-’, or ‘.’
- printerfile-like, optional
Something implementing a write() method for output
- **kwargs
Any other key-value pairs to include in track definition line
- class plastid.genomics.genome_array.MutableAbstractGenomeArray(chr_lengths=None, strands=None)[source]¶
Bases:
plastid.genomics.genome_array.AbstractGenomeArray
Abstract base class for
GenomeArray
-like objects whose values can be changedMethods
chroms
()Return a list of chromosomes in the GenomeArray
lengths
()Return a dictionary mapping chromosome names to lengths.
Reset sum to total mapped reads in the GenomeArray
set_normalize
([value])Toggle normalization of reported values to reads per million mapped in the dataset
set_sum
(val)Set sum used for normalization to an arbitrary value (e.g.
strands
()Return a tuple of strands in the GenomeArray
sum
()Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
- chroms()¶
Return a list of chromosomes in the GenomeArray
- Returns
- list
Chromosome names as strings
- lengths()¶
Return a dictionary mapping chromosome names to lengths. When two strands report different lengths for a chromosome, the max length is taken.
- Returns
- dict
Dictionary mapping chromosome names to chromosome lengths
- abstract reset_sum()¶
Reset sum to total mapped reads in the GenomeArray
- set_normalize(value=True)¶
Toggle normalization of reported values to reads per million mapped in the dataset
- Parameters
- valuebool
If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.
- set_sum(val)¶
Set sum used for normalization to an arbitrary value (e.g. from another dataset)
- Parameters
- valint or float
a number
- strands()¶
Return a tuple of strands in the GenomeArray
- Returns
- tuple
Chromosome strands as strings
- sum()¶
Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
- Returns
- int or float
Notes
The true (i.e. unnormalized) sum is always reported, even if
set_normalize()
is set to True
- class plastid.genomics.genome_array.SparseGenomeArray(chr_lengths=None, strands=None, min_chr_size=10000000)[source]¶
Bases:
plastid.genomics.genome_array.GenomeArray
A memory-efficient sublcass of
GenomeArray
using sparse internal representation. Note, savings in memory may come at a cost in performance when repeatedly getting/setting values, compared to aGenomeArray
.- Parameters
- chr_lengthsdict, optional
Dictionary mapping chromosome names to lengths. Suppyling this parameter yields considerable speed improvements, as memory can be pre-allocated for each chromosomal position. If not provided, a minimum chromosome size will be guessed, and chromosomes re-sized as needed (Default: {} )
- min_chr_sizeint,optional
If chr_lengths is not supplied, min_chr_size is the default first guess of a chromosome size. If your genome has large chromosomes, it is much much better, speed-wise, to provide a chr_lengths dict than to provide a guess here that is too small. (Default: %s)
- strandssequence
Sequence of strand names for the
GenomeArray
. (Default: (‘+’,’-‘))
Methods
add_from_bowtie
(fh, mapfunc[, min_length, ...])Import alignment data in native bowtie format to the current GenomeArray
add_from_wiggle
(fh, strand)Import data from a Wiggle or bedGraph file to current GenomeArray
apply_operation
(other, func[, mode])Apply a binary operator to a copy of self and to other elementwise.
chroms
()Return a list of chromosomes in the GenomeArray
get
(roi[, roi_order])Retrieve array of counts from a region of interest (roi) with values in vector ordered 5' to 3' relative to roi rather than genome (i.e.
Return an iterator of each chromosome strand array
keys
()Return names of chromosomes in the GenomeArray
lengths
()Return a dictionary mapping chromosome names to lengths.
like
(other)Return a
SparseGenomeArray
of same dimension as the input arraynonzero
()Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair.
plot
([chroms, strands])Create a plot of coverage along tracks or chromosomes
Reset the sum of the
GenomeArray
to the sum of all positions in the arrayset_normalize
([value])Toggle normalization of reported values to reads per million mapped in the dataset
set_sum
(val)Set sum used for normalization to an arbitrary value (e.g.
strands
()Return a tuple of strands in the GenomeArray
sum
()Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
to_bedgraph
(fh, trackname, strand[, printer])Write the contents of the GenomeArray to a bedGraph file
to_variable_step
(fh, trackname, strand[, ...])Export the contents of the GenomeArray to a variable step Wiggle file.
- add_from_bowtie(fh, mapfunc, min_length=25, max_length=inf, **trans_args)¶
Import alignment data in native bowtie format to the current GenomeArray
- Parameters
- fhfile-like
filehandle pointing to bowtie file
- mapfuncfunc
a mapping function that transforms alignment coordinates into a sequence of (
GenomicSegment
,value) pairs- min_lengthint, optional
minimum length for read to be counted (Default: 25)
- max_lengthint or numpy.inf, optional
maximum length for read to be counted (Default: infinity)
- trans_argsdict, optional
Keyword arguments to pass to transformation function
See also
five_prime_map
map reads to 5’ ends, with or without applying an offset
variable_five_prime_map
map reads to 5’ ends, choosing an offset determined by read length
three_prime_map
map reads to 3’ ends, with or without applying an offset
center_map
map each read fractionally to every position in the read, optionally trimming positions from the ends first
- add_from_wiggle(fh, strand)¶
Import data from a Wiggle or bedGraph file to current GenomeArray
- Parameters
- fhfile-like
filehandle pointing to wiggle file
- strandstr
Strand to which data should be added. ‘+’, ‘-’, or ‘.’
- apply_operation(other, func, mode=None)[source]¶
Apply a binary operator to a copy of self and to other elementwise. other may be a scalar quantity or another
GenomeArray
. In both cases, a newSparseGenomeArray
is returned, and self is left unmodified. Ifset_normalize()
is set to True, it is disabled during the operation.- Parameters
- otherfloat, int, or
MutableAbstractGenomeArray
Second argument to func
- funcfunc
Function to perform. This must take two arguments. a
numpy.ndarray
(chromosome-strand) from theSparseGenomeArray
will be supplied as the first argument, and the corresponding chromosome-strand from other as the second. This operation will be applied over all chromosomes and strands.If mode is set to all, and a chromosome or strand is not present in one of self or other, zero will be supplied as the missing argument to func. func must handle this gracefully.
- modestr, choice of ‘same’, ‘all’, or ‘truncate’
Only relevant if other is a
MutableAbstractGenomeArray
If ‘same’ each set of corresponding chromosomes must have the same dimensions in both parent
SparseGenomeArrays
. In addition, bothSparseGenomeArrays
must have the same chromosomes, and the same strands.If ‘all’, corresponding chromosomes in the parent
SparseGenomeArrays
will be resized to the dimensions of the larger chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; all chromosomes and strands will be included in output.If ‘truncate’,corresponding chromosomes in the parent
SparseGenomeArrays
will be resized to the dimensions of the smaller chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; chromosomes or strands not common to both parents will be ignored.
- otherfloat, int, or
- Returns
SparseGenomeArray
new
SparseGenomeArray
after the operation is applied
- chroms()¶
Return a list of chromosomes in the GenomeArray
- Returns
- list
Chromosome names as strings
- get(roi, roi_order=True)[source]¶
Retrieve array of counts from a region of interest (roi) with values in vector ordered 5’ to 3’ relative to roi rather than genome (i.e. are reversed for reverse-strand features).
- Parameters
- roi
GenomicSegment
orSegmentChain
Region of interest in genome
- roi_orderbool, optional
If True (default) return vector of values 5’ to 3’ relative to roi rather than genome.
- roi
- Returns
numpy.ndarray
vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi
See also
SegmentChain.get_counts
Fetch a spliced vector of data covering a
SegmentChain
- iterchroms()¶
Return an iterator of each chromosome strand array
- Yields
- numpy.ndarray
Positionwise counts on a chromosome strand
- keys()¶
Return names of chromosomes in the GenomeArray
- lengths()[source]¶
Return a dictionary mapping chromosome names to lengths. In the case where two strands report different lengths for a chromosome, the max length is taken.
- Returns
- dict
mapping chromosome names to lengths
- static like(other)[source]¶
Return a
SparseGenomeArray
of same dimension as the input array- Parameters
- other
GenomeArray
orSparseGenomeArray
- other
- Returns
SparseGenomeArray
of same size as other
- nonzero()[source]¶
Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair. Results are returned as a hierarchical dictionary mapping chromosome names, to a dictionary of strands, which in turn map to to arrays of non-zero indices on that chromosome-strand.
- Returns
- dict
dict[chrom][strand] = numpy.ndarray of indices
- plot(chroms=None, strands=None, **plot_kw)¶
Create a plot of coverage along tracks or chromosomes
- Parameters
- chromslist
Chromosomes to plot (default: all)
- strandslist
Strands to plot (default: all)
- plot_kw
A dictionary of keywords to pass to matplotlib
- Returns
- reset_sum()¶
Reset the sum of the
GenomeArray
to the sum of all positions in the array
- set_normalize(value=True)¶
Toggle normalization of reported values to reads per million mapped in the dataset
- Parameters
- valuebool
If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.
- set_sum(val)¶
Set sum used for normalization to an arbitrary value (e.g. from another dataset)
- Parameters
- valint or float
a number
- strands()¶
Return a tuple of strands in the GenomeArray
- Returns
- tuple
Chromosome strands as strings
- sum()¶
Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
- Returns
- int or float
Notes
The true (i.e. unnormalized) sum is always reported, even if
set_normalize()
is set to True
- to_bedgraph(fh, trackname, strand, printer=None, **kwargs)¶
Write the contents of the GenomeArray to a bedGraph file
See the bedGraph spec for format details
- Parameters
- fhfile-like
Open filehandle to which data will be written
- tracknamestr
Name of browser track
- strandstr
Strand to export. ‘+’, ‘-’, or ‘.’
- printerfile-like, optional
Something implementing a write() method for output
- **kwargs
Any other key-value pairs to include in track definition line
- to_variable_step(fh, trackname, strand, printer=None, **kwargs)¶
Export the contents of the GenomeArray to a variable step Wiggle file. For sparse data, bedGraph can be more efficient format.
See the Wiggle spec for format details
- Parameters
- fhfile-like
Open filehandle to which data will be written
- tracknamestr
Name of browser track
- strandstr
Strand to export. ‘+’, ‘-’, or ‘.’
- printerfile-like, optional
Something implementing a write() method for output
- **kwargs
Any other key-value pairs to include in track definition line
- plastid.genomics.genome_array.center_map(feature, **kwargs)[source]¶
Center-mapping function used as an argument to
GenomeArray.add_from_bowtie()
. A user-specified number of bases is optionally removed from each side of each read alignment, and the N remaining bases are each apportioned 1/N of the read count, so that the entire read is counted once.- Parameters
- feature
SegmentChain
Ungapped genomic alignment
- kwargs[‘value’]float, optional
Total value to divide over aligning positions (Default: 1.0)
- kwargs[‘nibble’]int, optional
Positions to remove from each end before mapping (Default: 0)
- kwargs[‘offset’]int, optional
Mapping offset, if any, from 5’ end of read (Default: 0)
- feature
- Returns
- list
tuples of (GenomicSegment,float value over segment)
- plastid.genomics.genome_array.five_prime_map(feature, **kwargs)[source]¶
Fiveprime mapping function used as an argument to
GenomeArray.add_from_bowtie()
. Reads are mapped at a user-specified offset from the fiveprime end of the alignment.- Parameters
- feature
SegmentChain
Ungapped genomic alignment
- kwargs[‘value’]float or int, optional
Value to apportion (Default: 1)
- kwargs[‘offset’]int, optional
Mapping offset, if any, from 5’ toward 3’ end of read
- feature
- Returns
- list
tuples of (GenomicSegment,float value over segment)
- plastid.genomics.genome_array.three_prime_map(feature, **kwargs)[source]¶
Threeprime mapping function used as an argument to
GenomeArray.add_from_bowtie()
. Reads are mapped at a user-specified offset from the threeprime end of the alignment.- Parameters
- feature
SegmentChain
Ungapped genomic alignment
- kwargs[‘value’]float or int, optional
Value to apportion (Default: 1)
- kwargs[‘offset’]int, optional
Mapping offset, if any, from 3’ toward 5’ end of read.
- feature
- Returns
- list
tuples of (GenomicSegment,float value over segment)
- plastid.genomics.genome_array.variable_five_prime_map(feature, **kwargs)[source]¶
Fiveprime variable mapping function used as an argument to
GenomeArray.add_from_bowtie()
. Reads are mapped at a user-specified offset from the fiveprime end of the alignment. The offset for a read of a given length is supplied in kwargs[offset][readlen]- Parameters
- feature
SegmentChain
Ungapped genomic alignment
- kwargs[‘offset’]dict, required
Dictionary mapping read lengths to offsets
- kwargs[‘value’]float or int, optional
Value to apportion (Default: 1)
- feature
- Returns
- list
tuples of (GenomicSegment,float value over segment)