plastid.genomics.genome_array module¶
GenomeArrays map quantitative data or read alignments to genomic positions.
Summary¶
GenomeArrays are typically accessed using SegmentChains.
The get_counts() method of
SegmentChain and Transcript fetches arrays of data from GenomeArrays. The
result is a numpy array in which each position
corresponds to a position in the SegmentChain, moving from 5’ to 3’, in the
spliced SegmentChain or Transcript. For a Transcript called
my_transcript:
>>> # open a BAM file, map reads 15 nucleotides from 3' end
>>> genome_array = BAMGenomeArray("some_file.bam")
>>> genome_array.set_mapping(ThreePrimeMapFactory(offset=15))
>>> my_data = my_transcript.get_counts(genome_array)
>>> my_data
array([ 95, 133, 87, 491, 100, 73, 308, 217, 447, 204, 245, 264, 395,
76, 387, 319, 43, 171, 101, 152, 58, 167, 53, 397, 228, 44,
402, 285, 480, 55, 9, 51, 358, 307, 144, 182, 293, 455, 403,
375, 443, 442, 196, 133, 238, 268, 70, 91, 172, 490, 346, 294,
117, 424, 180, 14, 400, 414, 257, 127, 344, 307, 435, 152, 458,
423, 60, 173, 362, 277, 393, 297, 377, 357, 68, 143, 467, 7,
86, 450, 230, 62, 278, 65, 346, 103, 185, 7, 350, 162, 461,
274, 205, 185, 209, 275, 190, 434, 10, 241])
Alternatively, SegmentChains and GenomicSegments can be used directly as
dictionary keys:
>>> my_data = genome_array[my_transcript]
>>> my_data
array([ 95, 133, 87, 491, 100, 73, 308, 217, 447, 204, 245, 264, 395,
76, 387, 319, 43, 171, 101, 152, 58, 167, 53, 397, 228, 44,
402, 285, 480, 55, 9, 51, 358, 307, 144, 182, 293, 455, 403,
375, 443, 442, 196, 133, 238, 268, 70, 91, 172, 490, 346, 294,
117, 424, 180, 14, 400, 414, 257, 127, 344, 307, 435, 152, 458,
423, 60, 173, 362, 277, 393, 297, 377, 357, 68, 143, 467, 7,
86, 450, 230, 62, 278, 65, 346, 103, 185, 7, 350, 162, 461,
274, 205, 185, 209, 275, 190, 434, 10, 241])
Module contents¶
Several implementations are provided. These are tailored to data stored in specific formats, or for specific purposes. For example, mutable GenomeArrays allow values to be changed in-place, e.g. via mathematical operations or manual setting.
Implementation |
Format of data |
Mutable |
Use of mapping functions |
BAM file(s) |
No |
Inline; may be changed without reloading data |
|
BigWig file(s) |
No |
n/a |
|
Yes |
On import from bowtie files only. Data must be reimported to change mapping |
||
Yes |
On import from bowtie files only. Data must be reimported to change mapping |
Extended summary & examples¶
Mapping functions¶
GenomeArrays use Mapping functions to extract the biology of interest from read alignments, and to vectorize these data over genomic positions.
Depending on the data type and the purpose of the experiment, different functions may be appropriate. The following mapping rules are provided, although users are encouraged to provide their own if needed. These include:
Mapping strategy |
Description |
Sample use |
Fiveprime |
Each read alignment is mapped to its 5’ end, or at a fixed distance from its 5’ end. |
RNA-seq, ribosome profiling, DMS-seq |
Variable fiveprime |
Each read alignment is mapped at a fixed distance from its 5’ end, where the distance is determined by the length of the alignment |
Ribosome profiling |
Stratified variable fiveprime |
Like variable fiveprime mapping, except reads are stratified by length into separate rows of an array |
Ribosome profiling |
Threeprime |
Each read alignment is mapped to its 5’ end, or at a fixed distance from its 5’ end. |
RNA-seq, ribosome profiling |
Center/entire |
Zero or more positions are trimmed from each end of the read alignment, and the remaining N positions are incremented by 1/N |
Ribosome profiling with micrococcal nuclease, RNA-seq |
Mapping functions are rarely called directly. Instead, they are invoked via
BAMGenomeArray.set_mapping():
# open BAM file >>> ga = BAMGenomeArray("some_file.bam") # set mapping 15 bases from 3' end of read >>> ga.set_mapping(ThreePrimeMapFactory(offset=15)) # set mapping to 5' end of read >>> ga.set_mapping(FivePrimeMapFactory()) # cut 3 nt off each end of the read, and map fractionally over the rest >>> ga.set_mapping(CenterMapFactory(nibble=12)) # access reads in some region >>> ga[GenomicSegment("chrI",10000,100000,"+")] # output is numpy array
For further details see map_factories and
Read mapping functions
Export to browser tracks¶
GenomeArrays can export data to wiggle or bedGraph formats. To export forward-strand data to some_file.wig:
>>> genome_array.to_bedgraph(open("some_file.wig","w"),"my_track","+")
Normalization¶
GenomeArrays also offer utilities for summing and normalization:
>>> genome_array.sum() # get sum
71292045
# set sum to an arbitrary value- can be useful for normalizing between
# multiple alignment files
>>> genome_array.set_sum(61241234324)
>>> genome_array.sum()
61241234324
# reset sum to data in file
>>> genome_array.reset_sum()
>>> genome_array.sum()
71292045
# with set_normalize(), all arrays will be normalized to counts per million
# relative to the current sum
>>> genome_array.set_normalize(True)
>>> my_transcript.get_counts(genome_array)
array([ 1.33254699, 1.86556579, 1.22033251, 6.88716392, 1.40268104,
1.02395716, 4.32025761, 3.04381786, 6.26998426, 2.86146933,
3.43656855, 3.70307795, 5.54059012, 1.06603759, 5.42837563,
4.47455253, 0.60315285, 2.39858458, 1.41670785, 2.13207518,
0.813555 , 2.34247734, 0.74342095, 5.56864374, 3.19811278,
0.61717966, 5.63877779, 3.99764097, 6.732869 , 0.77147457,
0.12624129, 0.71536733, 5.02159813, 4.3062308 , 2.0198607 ,
2.5528795 , 4.10985545, 6.38219874, 5.6528046 , 5.26005391,
6.21387702, 6.19985021, 2.74925484, 1.86556579, 3.33838088,
3.75918519, 0.98187673, 1.27643975, 2.41261139, 6.87313711,
4.85327641, 4.12388226, 1.64113682, 5.94736762, 2.52482588,
0.19637535, 5.61072417, 5.80709952, 3.60489028, 1.78140492,
4.82522279, 4.3062308 , 6.10166253, 2.13207518, 6.42427917,
5.93334081, 0.84160863, 2.4266382 , 5.07770537, 3.88542649,
5.5125365 , 4.1659627 , 5.28810753, 5.00757132, 0.95382311,
2.00583389, 6.55052047, 0.09818767, 1.2063057 , 6.31206469,
3.2261664 , 0.86966225, 3.8994533 , 0.91174268, 4.85327641,
1.44476147, 2.59495993, 0.09818767, 4.90938365, 2.27234329,
6.46635961, 3.84334606, 2.87549614, 2.59495993, 2.93160338,
3.85737287, 2.66509398, 6.08763572, 0.1402681 , 3.38046131])
# similarly, with set_normalize() on, browser tracks we'll be normalized
>>> genome_array.to_bedgraph(open("some_file_normalized.wig","w"),"my_normalized_track","+")
- class plastid.genomics.genome_array.BAMGenomeArray(*bamfiles, mapping=CenterMapFactory())[source]¶
Bases:
plastid.genomics.genome_array.AbstractGenomeArrayA GenomeArray for read alignments in BAM files.
When a user requests read alignments or counts at a set of genomic positions,
BAMGenomeArrayuses a mapping rule to determine which reads from the BAM file should be returned and how these reads should be mapped to positions in a vector. Reads are not mapped until requested, and mapping rules are changeable at runtime viaBAMGenomeArray.set_mapping()- Parameters
- bamfileOne or more filenames or open
pysam.AlignmentFile Filename(s) of BAM files, or open
pysam.AlignmentFileto include in array. Note: all BAM files must be sorted and indexed by samtools.- mappingfunc
mapping function that determines how each read alignment is mapped to a count at a genomic position. Examples include mapping reads to their fiveprime ends, mapping reads to their threeprime ends, or mapping somewhere in the middle. Factories to produce such functions are provided. See references below. (Default:
CenterMapFactory())
- bamfileOne or more filenames or open
Notes
The alignment data in
BAMGenomeArrayobjects is immutable. If you need to change the data in-place, theBAMGenomeArraycan be converted to aGenomeArrayorSparseGenomeArrayviato_genome_array()- Attributes
- map_fnfunction
mapping function used to convert read alignments to counts
- bamfileslist
BAM files or read alignments, as a list of
pysam.AlignmentFile
Methods
add_filter(name, func)Apply a function to filter reads retrieved from regions before mapping and counting
chroms()Returns a list of chromosomes
get(roi[, roi_order])Retrieve array of counts from a region of interest, following the mapping rule set by
set_mapping().Return the docstring of the current mapping function
get_reads(roi)Returns reads covering a
GenomicSegment.get_reads_and_counts(roi[, roi_order])Return read alignments covering a
GenomicSegment, and a count vector mapping reads to each positions in theGenomicSegment, following the rule specified byset_mapping().lengths()Returns a dictionary mapping chromosome names to lengths.
remove_filter(name)Remove a generic filter
Reset the sum to the total number of mapped reads in the
BAMGenomeArrayset_mapping(mapping_function)Change the mapping rule
set_normalize([value])Toggle normalization of reported values to reads per million mapped in the dataset
set_sum(val)Set sum used for normalization to an arbitrary value (e.g.
strands()Return a tuple of strands in the GenomeArray
sum()Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
to_bedgraph(fh, trackname, strand[, ...])Write the contents of the
BAMGenomeArrayto a bedGraph file under the mapping rule set byset_mapping().to_genome_array([array_type])Converts
BAMGenomeArrayto aGenomeArrayorSparseGenomeArrayunder the mapping rule set byset_mapping()to_variable_step(fh, trackname, strand[, ...])Write the contents of the
BAMGenomeArrayto a variableStep Wiggle file under the mapping rule set byset_mapping().- add_filter(name, func)[source]¶
Apply a function to filter reads retrieved from regions before mapping and counting
- Parameters
- namestr
A name for the filter. If not unique, will overwrite previous filter
- funcfunc
Filter function. Function must take a
pysam.AlignedSegmentas a single parameter, and return True if that read should be included in output, or False otherwise
See also
plastid.genomics.map_factories.SizeFilterFactorygenerate filter functions that gate read alignments on size
Notes
In Python, lambda functions do NOT have their own scope! We strongly recomend defining filter functions using the
defsyntax to avoid namespace collisions.
- get(roi, roi_order=True)[source]¶
Retrieve array of counts from a region of interest, following the mapping rule set by
set_mapping(). Values in the vector are ordered 5’ to 3’ relative to roi rather than the genome (i.e. are reversed for reverse-strand features).- Parameters
- roi
GenomicSegmentorSegmentChain Region of interest in genome
- roi_orderbool, optional
If True (default) return vector of values 5’ to 3’ relative to vector rather than genome.
- roi
- Returns
- numpy.ndarray
vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi
- Raises
- ValueError
if bamfiles not sorted or not indexed
See also
plastid.genomics.roitools.SegmentChain.get_countsFetch a spliced vector of data covering a
SegmentChain
- get_reads(roi)[source]¶
Returns reads covering a
GenomicSegment. Reads are strand-matched to roi by default, and are included or excluded depending upon the rule specified byset_mapping(). To obtain unstranded reads, set the value of roi.strand to ‘.’- Parameters
- roi
GenomicSegment Region of interest
- roi
- Returns
- list
List of reads (as
pysam.AlignedSegment) covering region of interest, according to the rule set inset_mapping()
- Raises
- ValueError
if bamfiles are not sorted or not indexed
- get_reads_and_counts(roi, roi_order=True)[source]¶
Return read alignments covering a
GenomicSegment, and a count vector mapping reads to each positions in theGenomicSegment, following the rule specified byset_mapping(). Reads are strand-matched to roi by default. To obtain unstranded reads, set the value of roi.strand to ‘.’- Parameters
- roi
GenomicSegment Region of interest
- roi
- Returns
- list
List of reads (as
pysam.AlignedSegment) covering region of interest- numpy.ndarray
Counts at each position of roi, under whatever mapping rule was set by
set_mapping()
- Raises
- ValueError
if bamfiles not sorted or not indexed
- lengths()[source]¶
Returns a dictionary mapping chromosome names to lengths.
- Returns
- dict
mapping chromosome names to lengths
- remove_filter(name)[source]¶
Remove a generic filter
- Parameters
- namestr
A name for the filter
- Returns
- func
the removed filter function
- reset_sum()[source]¶
Reset the sum to the total number of mapped reads in the
BAMGenomeArrayNotes
Filters are not applied in this summation. See
BAMGenomeArray.set_sum()to manually set a sum, which e.g. could correspond to a total number of filtered reads.
- set_mapping(mapping_function)[source]¶
Change the mapping rule
- Parameters
- mappingfunc
Function that determines how each read alignment is mapped to a count at a position in the
BAMGenomeArray. Examples include mapping reads to their fiveprime or threeprime ends, with or without offsets. Factories to produce such functions are provided. See references below.
See also
plastid.genomics.map_factories.FivePrimeMapFactorymap reads to 5’ ends, with or without applying an offset
plastid.genomics.map_factories.VariableFivePrimeMapFactorymap reads to 5’ ends, choosing an offset determined by read length
plastid.genomics.map_factories.ThreePrimeMapFactorymap reads to 3’ ends, with or without applying an offset
plastid.genomics.map_factories.CenterMapFactorymap each read fractionally to every position in the read, optionally trimming positions from the ends first
- set_normalize(value=True)¶
Toggle normalization of reported values to reads per million mapped in the dataset
- Parameters
- valuebool
If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.
- set_sum(val)¶
Set sum used for normalization to an arbitrary value (e.g. from another dataset)
- Parameters
- valint or float
a number
- strands()¶
Return a tuple of strands in the GenomeArray
- Returns
- tuple
Chromosome strands as strings
- sum()¶
Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
- Returns
- int or float
Notes
The true (i.e. unnormalized) sum is always reported, even if
set_normalize()is set to True
- to_bedgraph(fh, trackname, strand, window_size=100000, printer=None, **kwargs)[source]¶
Write the contents of the
BAMGenomeArrayto a bedGraph file under the mapping rule set byset_mapping().See the bedGraph spec for format details.
- Parameters
- fhfile-like
Filehandle to write to
- tracknamestr
Name of browser track
- strandstr
Strand of
BAMGenomeArrayto export. ‘+’, ‘-’, or ‘.’- window_sizeint
Size of chromosome/contig to process at a time. Larger values are faster but less memory-efficient
- printerfile-like, optional
Something implementing a write() method for output
- **kwargs
Any other key-value pairs to include in track definition line
- to_genome_array(array_type=None)[source]¶
Converts
BAMGenomeArrayto aGenomeArrayorSparseGenomeArrayunder the mapping rule set byset_mapping()- Parameters
- array_typeclass
Type of GenomeArray to return.
GenomeArrayorSparseGenomeArray. (Default:GenomeArray)
- Returns
- to_variable_step(fh, trackname, strand, window_size=100000, printer=None, **kwargs)[source]¶
Write the contents of the
BAMGenomeArrayto a variableStep Wiggle file under the mapping rule set byset_mapping().See the Wiggle spec for format details.
- Parameters
- fhfile-like
Filehandle to write to
- tracknamestr
Name of browser track
- strandstr
Strand of
BAMGenomeArrayto export. ‘+’, ‘-’, or ‘.’- window_sizeint
Size of chromosome/contig to process at a time. Larger values are faster but less memory-efficient
- printerfile-like, optional
Something implementing a write() method for output
- **kwargs
Any other key-value pairs to include in track definition line
- class plastid.genomics.genome_array.BigWigGenomeArray(maxmem=0)[source]¶
Bases:
plastid.genomics.genome_array.AbstractGenomeArrayHigh-performance GenomeArray for BigWig files.
- Parameters
- maxmemfloat
Maximum desired memory footprint for C objects, in megabytes. May be temporarily exceeded if large queries are requested. (Default: 0, No maximum)
Methods
add_from_bigwig(filename, strand)Import additional data from a BigWig file
chroms()Return a list of chromosomes in the GenomeArray
get(roi[, roi_order])Retrieve array of counts from a region of interest.
lengths()Return a dictionary mapping chromosome names to lengths.
Reset sum to total of data in the
BigWigGenomeArrayset_normalize([value])Toggle normalization of reported values to reads per million mapped in the dataset
set_sum(val)Set sum used for normalization to an arbitrary value (e.g.
strands()Return a tuple of strands in the GenomeArray
sum()Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
Converts
BigWigGenomeArrayto aGenomeArray- add_from_bigwig(filename, strand)[source]¶
Import additional data from a BigWig file
- Parameters
- filenamestr
Path to BigWig file
- strandstr
Strand to which data should be added. ‘+’, ‘-’, or ‘.’
- chroms()[source]¶
Return a list of chromosomes in the GenomeArray
- Returns
- list
Chromosome names as strings
- get(roi, roi_order=True)[source]¶
Retrieve array of counts from a region of interest.
- Parameters
- roi
GenomicSegmentorSegmentChain Region of interest in genome
- roi_orderbool, optional
If True (default) return vector of values 5’ to 3’ relative to vector rather than genome.
- roi
- Returns
- numpy.ndarray
vector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi
See also
plastid.genomics.roitools.SegmentChain.get_countsFetch a spliced vector of data covering a
SegmentChain
- lengths()[source]¶
Return a dictionary mapping chromosome names to lengths. When two strands report different lengths for a chromosome, the max length is taken.
- Returns
- dict
Dictionary mapping chromosome names to chromosome lengths
- reset_sum()[source]¶
Reset sum to total of data in the
BigWigGenomeArray
- set_normalize(value=True)¶
Toggle normalization of reported values to reads per million mapped in the dataset
- Parameters
- valuebool
If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.
- set_sum(val)¶
Set sum used for normalization to an arbitrary value (e.g. from another dataset)
- Parameters
- valint or float
a number
- strands()[source]¶
Return a tuple of strands in the GenomeArray
- Returns
- tuple
Chromosome strands as strings
- sum()¶
Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
- Returns
- int or float
Notes
The true (i.e. unnormalized) sum is always reported, even if
set_normalize()is set to True
- to_genome_array()[source]¶
Converts
BigWigGenomeArrayto aGenomeArray- Returns
- class plastid.genomics.genome_array.GenomeArray(chr_lengths=None, strands=None, min_chr_size=10000000)[source]¶
Bases:
plastid.genomics.genome_array.MutableAbstractGenomeArrayArray-like data structure that maps numerical values (e.g. read counts or conservation data, et c) to nucleotide positions in a genome.
Data in the array may be manipulated manually and/or imported from Wiggle or BedGraph files, as well as bowtie alignments.
GenomeArraysupports basic mathematical operations elementwise, where elements are nucleotide positions.- Parameters
- chr_lengthsdict or None, optional
Dictionary mapping chromosome names to lengths. Suppyling this in advance yields considerable speed improvements, as memory can be pre-allocated for each chromosomal position. If not provided, a minimum chromosome size will be guessed, and chromosomes re-sized as needed.
- min_chr_sizeint
If chr_lengths is not supplied, min_chr_size is the default first guess of a chromosome size. If the genome has large chromosomes, it is much much better, speed-wise, to provide chr_lengths than to provide a guess here that is too small.
- strandssequence
Sequence of strand names for the
GenomeArray. (Default: (‘+’,’-‘))
Methods
add_from_bowtie(fh, mapfunc[, min_length, ...])Import alignment data in native bowtie format to the current GenomeArray
add_from_wiggle(fh, strand)Import data from a Wiggle or bedGraph file to current GenomeArray
apply_operation(other, func[, mode])Applies a binary operator to a copy of self and to other elementwise.
chroms()Return a list of chromosomes in the GenomeArray
get(roi[, roi_order])Retrieve array of counts from a region of interest (roi) with values in vector ordered 5' to 3' relative to roi rather than genome (i.e.
Return an iterator of each chromosome strand array
keys()Return names of chromosomes in the GenomeArray
lengths()Return a dictionary mapping chromosome names to lengths.
like(other)Return a
GenomeArrayof same dimension as the input arraynonzero()Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair.
plot([chroms, strands])Create a plot of coverage along tracks or chromosomes
Reset the sum of the
GenomeArrayto the sum of all positions in the arrayset_normalize([value])Toggle normalization of reported values to reads per million mapped in the dataset
set_sum(val)Set sum used for normalization to an arbitrary value (e.g.
strands()Return a tuple of strands in the GenomeArray
sum()Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
to_bedgraph(fh, trackname, strand[, printer])Write the contents of the GenomeArray to a bedGraph file
to_variable_step(fh, trackname, strand[, ...])Export the contents of the GenomeArray to a variable step Wiggle file.
- add_from_bowtie(fh, mapfunc, min_length=25, max_length=inf, **trans_args)[source]¶
Import alignment data in native bowtie format to the current GenomeArray
- Parameters
- fhfile-like
filehandle pointing to bowtie file
- mapfuncfunc
a mapping function that transforms alignment coordinates into a sequence of (
GenomicSegment,value) pairs- min_lengthint, optional
minimum length for read to be counted (Default: 25)
- max_lengthint or numpy.inf, optional
maximum length for read to be counted (Default: infinity)
- trans_argsdict, optional
Keyword arguments to pass to transformation function
See also
five_prime_mapmap reads to 5’ ends, with or without applying an offset
variable_five_prime_mapmap reads to 5’ ends, choosing an offset determined by read length
three_prime_mapmap reads to 3’ ends, with or without applying an offset
center_mapmap each read fractionally to every position in the read, optionally trimming positions from the ends first
- add_from_wiggle(fh, strand)[source]¶
Import data from a Wiggle or bedGraph file to current GenomeArray
- Parameters
- fhfile-like
filehandle pointing to wiggle file
- strandstr
Strand to which data should be added. ‘+’, ‘-’, or ‘.’
- apply_operation(other, func, mode='same')[source]¶
Applies a binary operator to a copy of self and to other elementwise. other may be a scalar quantity or another
GenomeArray. In both cases, a newGenomeArrayis returned, and self is left unmodified. Ifset_normalize()is set to True, it is disabled during the operation.- Parameters
- otherfloat, int, or
MutableAbstractGenomeArray Second argument to func
- funcfunc
Function to perform. This must take two arguments. a
numpy.ndarray(chromosome-strand) from theGenomeArraywill be supplied as the first argument, and the corresponding chromosome-strand from other as the second. This operation will be applied over all chromosomes and strands.If mode is set to all, and a chromosome or strand is not present in one of self or other, zero will be supplied as the missing argument to func. func must handle this gracefully.
- modestr, choice of ‘same’, ‘all’, or ‘truncate’
Only relevant if other is a
MutableAbstractGenomeArrayIf ‘same’ each set of corresponding chromosomes must have the same dimensions in both parent
GenomeArrays. In addition, bothGenomeArrays must have the same chromosomes, and the same strands.If ‘all’, corresponding chromosomes in the parent
GenomeArrays will be resized to the dimensions of the larger chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; all chromosomes and strands will be included in output.If ‘truncate’,corresponding chromosomes in the parent
GenomeArrays will be resized to the dimensions of the smaller chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; chromosomes or strands not common to both parents will be ignored.
- otherfloat, int, or
- Returns
GenomeArraynew
GenomeArrayafter the operation is applied
- chroms()¶
Return a list of chromosomes in the GenomeArray
- Returns
- list
Chromosome names as strings
- get(roi, roi_order=True)[source]¶
Retrieve array of counts from a region of interest (roi) with values in vector ordered 5’ to 3’ relative to roi rather than genome (i.e. are reversed for reverse-strand features).
- Parameters
- roi
GenomicSegmentorSegmentChain Region of interest in genome
- roi_orderbool, optional
If True (default) return vector of values 5’ to 3’ relative to vector rather than genome.
- roi
- Returns
numpy.ndarrayvector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi
See also
plastid.genomics.roitools.SegmentChain.get_countsFetch a spliced vector of data covering a
SegmentChain
- iterchroms()[source]¶
Return an iterator of each chromosome strand array
- Yields
- numpy.ndarray
Positionwise counts on a chromosome strand
- lengths()¶
Return a dictionary mapping chromosome names to lengths. When two strands report different lengths for a chromosome, the max length is taken.
- Returns
- dict
Dictionary mapping chromosome names to chromosome lengths
- static like(other)[source]¶
Return a
GenomeArrayof same dimension as the input array- Parameters
- other
GenomeArray
- other
- Returns
- GenomeArray
empty
GenomeArrayof same size as other
- nonzero()[source]¶
Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair. Results are returned as a hierarchical dictionary mapping chromosome names, to a dictionary of strands, which in turn map to to arrays of non-zero indices on that chromosome-strand.
- Returns
- dict
dict[chrom][strand] = numpy.ndarray of indices
- plot(chroms=None, strands=None, **plot_kw)[source]¶
Create a plot of coverage along tracks or chromosomes
- Parameters
- chromslist
Chromosomes to plot (default: all)
- strandslist
Strands to plot (default: all)
- plot_kw
A dictionary of keywords to pass to matplotlib
- Returns
- reset_sum()[source]¶
Reset the sum of the
GenomeArrayto the sum of all positions in the array
- set_normalize(value=True)¶
Toggle normalization of reported values to reads per million mapped in the dataset
- Parameters
- valuebool
If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.
- set_sum(val)¶
Set sum used for normalization to an arbitrary value (e.g. from another dataset)
- Parameters
- valint or float
a number
- strands()¶
Return a tuple of strands in the GenomeArray
- Returns
- tuple
Chromosome strands as strings
- sum()¶
Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
- Returns
- int or float
Notes
The true (i.e. unnormalized) sum is always reported, even if
set_normalize()is set to True
- to_bedgraph(fh, trackname, strand, printer=None, **kwargs)[source]¶
Write the contents of the GenomeArray to a bedGraph file
See the bedGraph spec for format details
- Parameters
- fhfile-like
Open filehandle to which data will be written
- tracknamestr
Name of browser track
- strandstr
Strand to export. ‘+’, ‘-’, or ‘.’
- printerfile-like, optional
Something implementing a write() method for output
- **kwargs
Any other key-value pairs to include in track definition line
- to_variable_step(fh, trackname, strand, printer=None, **kwargs)[source]¶
Export the contents of the GenomeArray to a variable step Wiggle file. For sparse data, bedGraph can be more efficient format.
See the Wiggle spec for format details
- Parameters
- fhfile-like
Open filehandle to which data will be written
- tracknamestr
Name of browser track
- strandstr
Strand to export. ‘+’, ‘-’, or ‘.’
- printerfile-like, optional
Something implementing a write() method for output
- **kwargs
Any other key-value pairs to include in track definition line
- class plastid.genomics.genome_array.MutableAbstractGenomeArray(chr_lengths=None, strands=None)[source]¶
Bases:
plastid.genomics.genome_array.AbstractGenomeArrayAbstract base class for
GenomeArray-like objects whose values can be changedMethods
chroms()Return a list of chromosomes in the GenomeArray
lengths()Return a dictionary mapping chromosome names to lengths.
Reset sum to total mapped reads in the GenomeArray
set_normalize([value])Toggle normalization of reported values to reads per million mapped in the dataset
set_sum(val)Set sum used for normalization to an arbitrary value (e.g.
strands()Return a tuple of strands in the GenomeArray
sum()Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
- chroms()¶
Return a list of chromosomes in the GenomeArray
- Returns
- list
Chromosome names as strings
- lengths()¶
Return a dictionary mapping chromosome names to lengths. When two strands report different lengths for a chromosome, the max length is taken.
- Returns
- dict
Dictionary mapping chromosome names to chromosome lengths
- abstract reset_sum()¶
Reset sum to total mapped reads in the GenomeArray
- set_normalize(value=True)¶
Toggle normalization of reported values to reads per million mapped in the dataset
- Parameters
- valuebool
If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.
- set_sum(val)¶
Set sum used for normalization to an arbitrary value (e.g. from another dataset)
- Parameters
- valint or float
a number
- strands()¶
Return a tuple of strands in the GenomeArray
- Returns
- tuple
Chromosome strands as strings
- sum()¶
Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
- Returns
- int or float
Notes
The true (i.e. unnormalized) sum is always reported, even if
set_normalize()is set to True
- class plastid.genomics.genome_array.SparseGenomeArray(chr_lengths=None, strands=None, min_chr_size=10000000)[source]¶
Bases:
plastid.genomics.genome_array.GenomeArrayA memory-efficient sublcass of
GenomeArrayusing sparse internal representation. Note, savings in memory may come at a cost in performance when repeatedly getting/setting values, compared to aGenomeArray.- Parameters
- chr_lengthsdict, optional
Dictionary mapping chromosome names to lengths. Suppyling this parameter yields considerable speed improvements, as memory can be pre-allocated for each chromosomal position. If not provided, a minimum chromosome size will be guessed, and chromosomes re-sized as needed (Default: {} )
- min_chr_sizeint,optional
If chr_lengths is not supplied, min_chr_size is the default first guess of a chromosome size. If your genome has large chromosomes, it is much much better, speed-wise, to provide a chr_lengths dict than to provide a guess here that is too small. (Default: %s)
- strandssequence
Sequence of strand names for the
GenomeArray. (Default: (‘+’,’-‘))
Methods
add_from_bowtie(fh, mapfunc[, min_length, ...])Import alignment data in native bowtie format to the current GenomeArray
add_from_wiggle(fh, strand)Import data from a Wiggle or bedGraph file to current GenomeArray
apply_operation(other, func[, mode])Apply a binary operator to a copy of self and to other elementwise.
chroms()Return a list of chromosomes in the GenomeArray
get(roi[, roi_order])Retrieve array of counts from a region of interest (roi) with values in vector ordered 5' to 3' relative to roi rather than genome (i.e.
Return an iterator of each chromosome strand array
keys()Return names of chromosomes in the GenomeArray
lengths()Return a dictionary mapping chromosome names to lengths.
like(other)Return a
SparseGenomeArrayof same dimension as the input arraynonzero()Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair.
plot([chroms, strands])Create a plot of coverage along tracks or chromosomes
Reset the sum of the
GenomeArrayto the sum of all positions in the arrayset_normalize([value])Toggle normalization of reported values to reads per million mapped in the dataset
set_sum(val)Set sum used for normalization to an arbitrary value (e.g.
strands()Return a tuple of strands in the GenomeArray
sum()Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
to_bedgraph(fh, trackname, strand[, printer])Write the contents of the GenomeArray to a bedGraph file
to_variable_step(fh, trackname, strand[, ...])Export the contents of the GenomeArray to a variable step Wiggle file.
- add_from_bowtie(fh, mapfunc, min_length=25, max_length=inf, **trans_args)¶
Import alignment data in native bowtie format to the current GenomeArray
- Parameters
- fhfile-like
filehandle pointing to bowtie file
- mapfuncfunc
a mapping function that transforms alignment coordinates into a sequence of (
GenomicSegment,value) pairs- min_lengthint, optional
minimum length for read to be counted (Default: 25)
- max_lengthint or numpy.inf, optional
maximum length for read to be counted (Default: infinity)
- trans_argsdict, optional
Keyword arguments to pass to transformation function
See also
five_prime_mapmap reads to 5’ ends, with or without applying an offset
variable_five_prime_mapmap reads to 5’ ends, choosing an offset determined by read length
three_prime_mapmap reads to 3’ ends, with or without applying an offset
center_mapmap each read fractionally to every position in the read, optionally trimming positions from the ends first
- add_from_wiggle(fh, strand)¶
Import data from a Wiggle or bedGraph file to current GenomeArray
- Parameters
- fhfile-like
filehandle pointing to wiggle file
- strandstr
Strand to which data should be added. ‘+’, ‘-’, or ‘.’
- apply_operation(other, func, mode=None)[source]¶
Apply a binary operator to a copy of self and to other elementwise. other may be a scalar quantity or another
GenomeArray. In both cases, a newSparseGenomeArrayis returned, and self is left unmodified. Ifset_normalize()is set to True, it is disabled during the operation.- Parameters
- otherfloat, int, or
MutableAbstractGenomeArray Second argument to func
- funcfunc
Function to perform. This must take two arguments. a
numpy.ndarray(chromosome-strand) from theSparseGenomeArraywill be supplied as the first argument, and the corresponding chromosome-strand from other as the second. This operation will be applied over all chromosomes and strands.If mode is set to all, and a chromosome or strand is not present in one of self or other, zero will be supplied as the missing argument to func. func must handle this gracefully.
- modestr, choice of ‘same’, ‘all’, or ‘truncate’
Only relevant if other is a
MutableAbstractGenomeArrayIf ‘same’ each set of corresponding chromosomes must have the same dimensions in both parent
SparseGenomeArrays. In addition, bothSparseGenomeArraysmust have the same chromosomes, and the same strands.If ‘all’, corresponding chromosomes in the parent
SparseGenomeArrayswill be resized to the dimensions of the larger chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; all chromosomes and strands will be included in output.If ‘truncate’,corresponding chromosomes in the parent
SparseGenomeArrayswill be resized to the dimensions of the smaller chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; chromosomes or strands not common to both parents will be ignored.
- otherfloat, int, or
- Returns
SparseGenomeArraynew
SparseGenomeArrayafter the operation is applied
- chroms()¶
Return a list of chromosomes in the GenomeArray
- Returns
- list
Chromosome names as strings
- get(roi, roi_order=True)[source]¶
Retrieve array of counts from a region of interest (roi) with values in vector ordered 5’ to 3’ relative to roi rather than genome (i.e. are reversed for reverse-strand features).
- Parameters
- roi
GenomicSegmentorSegmentChain Region of interest in genome
- roi_orderbool, optional
If True (default) return vector of values 5’ to 3’ relative to roi rather than genome.
- roi
- Returns
numpy.ndarrayvector of numbers, each position corresponding to a position in roi, from 5’ to 3’ relative to roi
See also
SegmentChain.get_countsFetch a spliced vector of data covering a
SegmentChain
- iterchroms()¶
Return an iterator of each chromosome strand array
- Yields
- numpy.ndarray
Positionwise counts on a chromosome strand
- keys()¶
Return names of chromosomes in the GenomeArray
- lengths()[source]¶
Return a dictionary mapping chromosome names to lengths. In the case where two strands report different lengths for a chromosome, the max length is taken.
- Returns
- dict
mapping chromosome names to lengths
- static like(other)[source]¶
Return a
SparseGenomeArrayof same dimension as the input array- Parameters
- other
GenomeArrayorSparseGenomeArray
- other
- Returns
SparseGenomeArrayof same size as other
- nonzero()[source]¶
Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair. Results are returned as a hierarchical dictionary mapping chromosome names, to a dictionary of strands, which in turn map to to arrays of non-zero indices on that chromosome-strand.
- Returns
- dict
dict[chrom][strand] = numpy.ndarray of indices
- plot(chroms=None, strands=None, **plot_kw)¶
Create a plot of coverage along tracks or chromosomes
- Parameters
- chromslist
Chromosomes to plot (default: all)
- strandslist
Strands to plot (default: all)
- plot_kw
A dictionary of keywords to pass to matplotlib
- Returns
- reset_sum()¶
Reset the sum of the
GenomeArrayto the sum of all positions in the array
- set_normalize(value=True)¶
Toggle normalization of reported values to reads per million mapped in the dataset
- Parameters
- valuebool
If True, all values fetched will be normalized to reads per million. If False, all values will not be normalized.
- set_sum(val)¶
Set sum used for normalization to an arbitrary value (e.g. from another dataset)
- Parameters
- valint or float
a number
- strands()¶
Return a tuple of strands in the GenomeArray
- Returns
- tuple
Chromosome strands as strings
- sum()¶
Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray
- Returns
- int or float
Notes
The true (i.e. unnormalized) sum is always reported, even if
set_normalize()is set to True
- to_bedgraph(fh, trackname, strand, printer=None, **kwargs)¶
Write the contents of the GenomeArray to a bedGraph file
See the bedGraph spec for format details
- Parameters
- fhfile-like
Open filehandle to which data will be written
- tracknamestr
Name of browser track
- strandstr
Strand to export. ‘+’, ‘-’, or ‘.’
- printerfile-like, optional
Something implementing a write() method for output
- **kwargs
Any other key-value pairs to include in track definition line
- to_variable_step(fh, trackname, strand, printer=None, **kwargs)¶
Export the contents of the GenomeArray to a variable step Wiggle file. For sparse data, bedGraph can be more efficient format.
See the Wiggle spec for format details
- Parameters
- fhfile-like
Open filehandle to which data will be written
- tracknamestr
Name of browser track
- strandstr
Strand to export. ‘+’, ‘-’, or ‘.’
- printerfile-like, optional
Something implementing a write() method for output
- **kwargs
Any other key-value pairs to include in track definition line
- plastid.genomics.genome_array.center_map(feature, **kwargs)[source]¶
Center-mapping function used as an argument to
GenomeArray.add_from_bowtie(). A user-specified number of bases is optionally removed from each side of each read alignment, and the N remaining bases are each apportioned 1/N of the read count, so that the entire read is counted once.- Parameters
- feature
SegmentChain Ungapped genomic alignment
- kwargs[‘value’]float, optional
Total value to divide over aligning positions (Default: 1.0)
- kwargs[‘nibble’]int, optional
Positions to remove from each end before mapping (Default: 0)
- kwargs[‘offset’]int, optional
Mapping offset, if any, from 5’ end of read (Default: 0)
- feature
- Returns
- list
tuples of (GenomicSegment,float value over segment)
- plastid.genomics.genome_array.five_prime_map(feature, **kwargs)[source]¶
Fiveprime mapping function used as an argument to
GenomeArray.add_from_bowtie(). Reads are mapped at a user-specified offset from the fiveprime end of the alignment.- Parameters
- feature
SegmentChain Ungapped genomic alignment
- kwargs[‘value’]float or int, optional
Value to apportion (Default: 1)
- kwargs[‘offset’]int, optional
Mapping offset, if any, from 5’ toward 3’ end of read
- feature
- Returns
- list
tuples of (GenomicSegment,float value over segment)
- plastid.genomics.genome_array.three_prime_map(feature, **kwargs)[source]¶
Threeprime mapping function used as an argument to
GenomeArray.add_from_bowtie(). Reads are mapped at a user-specified offset from the threeprime end of the alignment.- Parameters
- feature
SegmentChain Ungapped genomic alignment
- kwargs[‘value’]float or int, optional
Value to apportion (Default: 1)
- kwargs[‘offset’]int, optional
Mapping offset, if any, from 3’ toward 5’ end of read.
- feature
- Returns
- list
tuples of (GenomicSegment,float value over segment)
- plastid.genomics.genome_array.variable_five_prime_map(feature, **kwargs)[source]¶
Fiveprime variable mapping function used as an argument to
GenomeArray.add_from_bowtie(). Reads are mapped at a user-specified offset from the fiveprime end of the alignment. The offset for a read of a given length is supplied in kwargs[offset][readlen]- Parameters
- feature
SegmentChain Ungapped genomic alignment
- kwargs[‘offset’]dict, required
Dictionary mapping read lengths to offsets
- kwargs[‘value’]float or int, optional
Value to apportion (Default: 1)
- feature
- Returns
- list
tuples of (GenomicSegment,float value over segment)