Source code for plastid.genomics.genome_array

#!/usr/bin/env python
"""GenomeArrays map quantitative data or :term:`read alignments` to genomic positions.

.. contents::
   :local:

Summary
-------

GenomeArrays are typically accessed using |SegmentChains|.

The :meth:`~plastid.genomics.roitools.SegmentChain.get_counts` method of
|SegmentChain| and |Transcript| fetches arrays of data from GenomeArrays. The
result is a :class:`numpy array <numpy.ndarray>` in which each position
corresponds to a position in the |SegmentChain|, moving from 5' to 3', in the
spliced |SegmentChain| or |Transcript|. For a |Transcript| called
`my_transcript`::

    >>> # open a BAM file, map reads 15 nucleotides from 3' end
    >>> genome_array = BAMGenomeArray("some_file.bam")
    >>> genome_array.set_mapping(ThreePrimeMapFactory(offset=15))

    >>> my_data = my_transcript.get_counts(genome_array)
    >>> my_data
        array([ 95, 133,  87, 491, 100,  73, 308, 217, 447, 204, 245, 264, 395,
                76, 387, 319,  43, 171, 101, 152,  58, 167,  53, 397, 228,  44,
               402, 285, 480,  55,   9,  51, 358, 307, 144, 182, 293, 455, 403,
               375, 443, 442, 196, 133, 238, 268,  70,  91, 172, 490, 346, 294,
               117, 424, 180,  14, 400, 414, 257, 127, 344, 307, 435, 152, 458,
               423,  60, 173, 362, 277, 393, 297, 377, 357,  68, 143, 467,   7,
                86, 450, 230,  62, 278,  65, 346, 103, 185,   7, 350, 162, 461,
               274, 205, 185, 209, 275, 190, 434,  10, 241])


Alternatively, |SegmentChains| and |GenomicSegments| can be used directly as
dictionary keys::

    >>> my_data = genome_array[my_transcript]
    >>> my_data
        array([ 95, 133,  87, 491, 100,  73, 308, 217, 447, 204, 245, 264, 395,
                76, 387, 319,  43, 171, 101, 152,  58, 167,  53, 397, 228,  44,
               402, 285, 480,  55,   9,  51, 358, 307, 144, 182, 293, 455, 403,
               375, 443, 442, 196, 133, 238, 268,  70,  91, 172, 490, 346, 294,
               117, 424, 180,  14, 400, 414, 257, 127, 344, 307, 435, 152, 458,
               423,  60, 173, 362, 277, 393, 297, 377, 357,  68, 143, 467,   7,
                86, 450, 230,  62, 278,  65, 346, 103, 185,   7, 350, 162, 461,
               274, 205, 185, 209, 275, 190, 434,  10, 241])


Module contents
---------------

Several implementations are provided. These are tailored to data stored in 
specific formats, or for specific purposes. For example, mutable GenomeArrays
allow values to be changed in-place, e.g. via mathematical operations or manual
setting.


====================   ==================================   =====================   ===================================================
**Implementation**     **Format of data**                   **Mutable**             **Use of mapping functions**
--------------------   ----------------------------------   ---------------------   ---------------------------------------------------
|BAMGenomeArray|       `BAM`_ file(s)                       No                      Inline; may be changed without reloading data

|BigWigGenomeArray|    `BigWig`_ file(s)                    No                      n/a

|GenomeArray|          `bowtie`_, `wiggle`_, `bedGraph`_,   Yes                     On import from `bowtie`_ files only. Data must be
                       or data in memory                                            reimported to change mapping

|SparseGenomeArray|    `bowtie`_, `wiggle`_, `bedGraph`_,   Yes                     On import from `bowtie`_ files only. Data must be
                       or data in memory                                            reimported to change mapping
====================   ==================================   =====================   ===================================================


Extended summary & examples
---------------------------


Mapping functions
.................

GenomeArrays use :term:`Mapping functions <mapping rule>` to extract the biology
of interest from read alignments, and to vectorize these data over genomic
positions.

Depending on the data type and the purpose of the experiment, different functions
may be appropriate. The following mapping rules are provided, although users are
encouraged to provide their own if needed. These include:


================================   ===================================================    =====================================
**Mapping strategy**               **Description**                                        **Sample use**
--------------------------------   ---------------------------------------------------    -------------------------------------

Fiveprime                          Each read alignment is mapped to its 5' end, or        RNA-seq, ribosome profiling, DMS-seq
                                   at a fixed distance from its 5' end. 

Variable fiveprime                 Each read alignment is mapped at a fixed distance      Ribosome profiling
                                   from its 5' end, where the distance is determined
                                   by the length of the alignment

Stratified variable fiveprime      Like variable fiveprime mapping, except reads are      Ribosome profiling
                                   stratified by length into separate rows of an array

Threeprime                         Each read alignment is mapped to its 5' end, or        RNA-seq, ribosome profiling
                                   at a fixed distance from its 5' end. 

Center/entire                      Zero or more positions are trimmed from each end       Ribosome profiling with micrococcal
                                   of the read alignment, and the remaining `N`           nuclease, RNA-seq
                                   positions are incremented by `1/N`
================================   ===================================================    =====================================


Mapping functions are rarely called directly. Instead, they are invoked via
:meth:`BAMGenomeArray.set_mapping() <plastid.genomics.genome_array.BAMGenomeArray.set_mapping>`:

 .. code-block:: python

    # open BAM file
    >>> ga = BAMGenomeArray("some_file.bam")

    # set mapping 15 bases from 3' end of read
    >>> ga.set_mapping(ThreePrimeMapFactory(offset=15))

    # set mapping to 5' end of read
    >>> ga.set_mapping(FivePrimeMapFactory())

    # cut 3 nt off each end of the read, and map fractionally over the rest
    >>> ga.set_mapping(CenterMapFactory(nibble=12))

    # access reads in some region
    >>> ga[GenomicSegment("chrI",10000,100000,"+")]
    # output is numpy array

For further details see :mod:`~plastid.genomics.map_factories` and
:doc:`/concepts/mapping_rules`


Export to browser tracks
........................

GenomeArrays can export data to `wiggle`_ or `bedGraph`_ formats. To export 
forward-strand data to `some_file.wig`::

    >>> genome_array.to_bedgraph(open("some_file.wig","w"),"my_track","+")


Normalization
.............
        
GenomeArrays also offer utilities for summing and normalization::

    >>> genome_array.sum() # get sum
    71292045
    
    # set sum to an arbitrary value- can be useful for normalizing between
    # multiple alignment files
    >>> genome_array.set_sum(61241234324)
    >>> genome_array.sum()
    61241234324
    
    # reset sum to data in file
    >>> genome_array.reset_sum()
    >>> genome_array.sum()
    71292045
    
    # with set_normalize(), all arrays will be normalized to counts per million
    # relative to the current sum
    >>> genome_array.set_normalize(True)
    >>> my_transcript.get_counts(genome_array)
    array([ 1.33254699,  1.86556579,  1.22033251,  6.88716392,  1.40268104,
            1.02395716,  4.32025761,  3.04381786,  6.26998426,  2.86146933,
            3.43656855,  3.70307795,  5.54059012,  1.06603759,  5.42837563,
            4.47455253,  0.60315285,  2.39858458,  1.41670785,  2.13207518,
            0.813555  ,  2.34247734,  0.74342095,  5.56864374,  3.19811278,
            0.61717966,  5.63877779,  3.99764097,  6.732869  ,  0.77147457,
            0.12624129,  0.71536733,  5.02159813,  4.3062308 ,  2.0198607 ,
            2.5528795 ,  4.10985545,  6.38219874,  5.6528046 ,  5.26005391,
            6.21387702,  6.19985021,  2.74925484,  1.86556579,  3.33838088,
            3.75918519,  0.98187673,  1.27643975,  2.41261139,  6.87313711,
            4.85327641,  4.12388226,  1.64113682,  5.94736762,  2.52482588,
            0.19637535,  5.61072417,  5.80709952,  3.60489028,  1.78140492,
            4.82522279,  4.3062308 ,  6.10166253,  2.13207518,  6.42427917,
            5.93334081,  0.84160863,  2.4266382 ,  5.07770537,  3.88542649,
            5.5125365 ,  4.1659627 ,  5.28810753,  5.00757132,  0.95382311,
            2.00583389,  6.55052047,  0.09818767,  1.2063057 ,  6.31206469,
            3.2261664 ,  0.86966225,  3.8994533 ,  0.91174268,  4.85327641,
            1.44476147,  2.59495993,  0.09818767,  4.90938365,  2.27234329,
            6.46635961,  3.84334606,  2.87549614,  2.59495993,  2.93160338,
            3.85737287,  2.66509398,  6.08763572,  0.1402681 ,  3.38046131])

    # similarly, with set_normalize() on, browser tracks we'll be normalized 
    >>> genome_array.to_bedgraph(open("some_file_normalized.wig","w"),"my_normalized_track","+")

"""
__date__ = "May 3, 2011"
__author__ = "joshua"
from abc import abstractmethod
import itertools
import operator
import copy
import numpy
import scipy.sparse
import pysam

from collections import OrderedDict

from plastid.readers.wiggle import WiggleReader
from plastid.readers.bowtie import BowtieReader
from plastid.readers.bigwig import BigWigReader
from plastid.genomics.roitools import GenomicSegment, SegmentChain
from plastid.util.services.mini2to3 import xrange, ifilter
from plastid.util.services.exceptions import DataWarning, warn
from plastid.util.io.openers import NullWriter, multiopen

from plastid.genomics.map_factories import *

MIN_CHR_SIZE = int(10 * 1e6)  # 10 Mb minimum size for unspecified chromosomes

#===============================================================================
# INDEX: Mapping functions for GenomeArray and SparseGenomeArray.
#        these are used by add_from_bowtie() and add_from_tagalign()
#        to map read alignments to specific sites.
#
#        See function documentation for more details
#===============================================================================


[docs]def center_map(feature, **kwargs): """Center-mapping function used as an argument to :py:meth:`GenomeArray.add_from_bowtie`. A user-specified number of bases is optionally removed from each side of each read alignment, and the `N` remaining bases are each apportioned `1/N` of the read count, so that the entire read is counted once. Parameters ---------- feature : |SegmentChain| Ungapped genomic alignment kwargs['value'] : float, optional Total value to divide over aligning positions (Default: `1.0`) kwargs['nibble'] : int, optional Positions to remove from each end before mapping (Default: `0`) kwargs['offset'] : int, optional Mapping offset, if any, from 5' end of read (Default: `0`) Returns ------- list tuples of `(GenomicSegment,float value over segment)` """ nibble = kwargs["nibble"] read_length = feature.length if read_length <= 2 * nibble: warn( "File contains read alignments shorter (%s nt) than `2*'nibble'` value of %s nt. Ignoring these." % (read_length, 2 * nibble), DataWarning ) return [] # yapf: disable span = feature.spanning_segment strand = span.strand chrom = span.chrom offset = kwargs.get("offset", 0) value = float(kwargs.get("value", 1.0)) sign = -1 if strand == "-" else 1 start = span.start + nibble + sign * offset end = span.end - nibble + sign * offset seg = GenomicSegment(chrom, start, end, strand) frac = value / len(seg) return [(seg, frac)]
[docs]def five_prime_map(feature, **kwargs): """Fiveprime mapping function used as an argument to :py:meth:`GenomeArray.add_from_bowtie`. Reads are mapped at a user-specified offset from the fiveprime end of the alignment. Parameters ---------- feature : |SegmentChain| Ungapped genomic alignment kwargs['value'] : float or int, optional Value to apportion (Default: `1`) kwargs['offset'] : int, optional Mapping offset, if any, from 5' toward 3' end of read Returns ------- list tuples of `(GenomicSegment,float value over segment)` """ offset = kwargs.get("offset", 0) read_length = feature.length if offset > read_length: warn( "File contains read alignments shorter (%s nt) than offset (%s nt). Ignoring." % (read_length, offset), DataWarning ) return [] value = kwargs.get("value", 1.0) span = feature.spanning_segment strand = span.strand if strand in ("+", "."): start = span.start + offset else: start = span.end - 1 - offset seg = GenomicSegment(span.chrom, start, start + 1, strand) return [(seg, value)]
[docs]def three_prime_map(feature, **kwargs): """Threeprime mapping function used as an argument to :py:meth:`GenomeArray.add_from_bowtie`. Reads are mapped at a user-specified offset from the threeprime end of the alignment. Parameters ---------- feature : |SegmentChain| Ungapped genomic alignment kwargs['value'] : float or int, optional Value to apportion (Default: `1`) kwargs['offset'] : int, optional Mapping offset, if any, from 3' toward 5' end of read. Returns ------- list tuples of `(GenomicSegment,float value over segment)` """ offset = kwargs.get("offset", 0) read_length = feature.length if offset > read_length: warn( "File contains read alignments shorter (%s nt) than offset (%s nt). Ignoring." % (read_length, offset), DataWarning ) return [] value = kwargs.get("value", 1.0) span = feature.spanning_segment strand = span.strand if strand in ("+", "."): start = span.end - 1 - offset else: start = span.start + offset seg = GenomicSegment(span.chrom, start, start + 1, strand) return [(seg, value)]
[docs]def variable_five_prime_map(feature, **kwargs): """Fiveprime variable mapping function used as an argument to :py:meth:`GenomeArray.add_from_bowtie`. Reads are mapped at a user-specified offset from the fiveprime end of the alignment. The offset for a read of a given length is supplied in `kwargs[offset][readlen]` Parameters ---------- feature : |SegmentChain| Ungapped genomic alignment kwargs['offset'] : dict, required Dictionary mapping read lengths to offsets kwargs['value'] : float or int, optional Value to apportion (Default: `1`) Returns ------- list tuples of `(GenomicSegment,float value over segment)` """ span = feature.spanning_segment strand = span.strand value = kwargs.get("value", 1.0) offset = kwargs["offset"].get(len(span), kwargs["offset"].get("default", None)) feature_length = feature.length if offset is None: warn("No offset for reads of length %s. Ignoring." % feature_length, DataWarning) return [] if offset >= feature_length: warn("Offset (%s nt) longer than read length %s. Ignoring" % (offset, feature_length)) if strand in ("+", "."): start = span.start + offset else: start = span.end - 1 - offset seg = GenomicSegment(span.chrom, start, start + 1, strand) return [(seg, value)]
#=============================================================================== # GenomeArray classes #=============================================================================== _DEFAULT_STRANDS = ("+", "-") class AbstractGenomeArray(object): """Abstract base class for all |GenomeArray|-like objects""" def __init__(self, chr_lengths=None, strands=None): """ Parameters ---------- chr_lengths : dict Dictionary mapping chromosome names to lengths. Suppyling this in advance yields considerable speed improvements, as memory can be pre-allocated for each chromosomal position. If not provided, a minimum chromosome size will be guessed, and chromosomes re-sized as needed. strands : sequence, optional Strands for the |AbstractGenomeArray|. (Default: `("+","-")`) """ self._chroms = {} self._strands = _DEFAULT_STRANDS if strands is None else strands self._sum = None self._normalize = False def __str__(self): return repr(self) def __repr__(self): stmp = "<%s len=%s sum=%s chroms=" % (self.__class__.__name__, len(self), self.sum()) stmp += ",".join(self.chroms()) stmp += " strands=%s" % ",".join(self.strands()) stmp += ">" return stmp def __contains__(self, chrom): """Return `True` if `chrom` is defined in the array, otherwise False Returns ------- bool """ return chrom in self._chroms def __len__(self): """Return size of GenomeArray in nucleotide positions (not base pairs). To obtain size in base pairs, divide length by two. Returns ------- int """ return len(self._strands) * sum(self.lengths().values()) @abstractmethod def __getitem__(self, roi): """Retrieve array of counts from a region of interest. The values in the returned array are in 5' to 3' with respect to `seg` rather than the genome (i.e. are reversed for reverse-strand features). Parameters ---------- roi : |GenomicSegment| or |SegmentChain| Region of interest in genome Returns ------- :class:`numpy.ndarray` vector of numbers, each position corresponding to a position in `seg`, from 5' to 3' relative to `seg` See also -------- plastid.genomics.roitools.SegmentChain.get_counts Fetch a spliced vector of data covering a |SegmentChain| """ pass @abstractmethod def reset_sum(self): """Reset sum to total mapped reads in the GenomeArray""" def set_sum(self, val): """Set sum used for normalization to an arbitrary value (e.g. from another dataset) Parameters ---------- val : int or float a number """ self._sum = val def sum(self): """Return the total number of aligned reads or the sum of the quantitative data across all positions in the GenomeArray Notes ----- The true (i.e. unnormalized) sum is always reported, even if :meth:`set_normalize` is set to True Returns ------- int or float """ if self._sum is None: self.reset_sum() return self._sum def set_normalize(self, value=True): """Toggle normalization of reported values to reads per million mapped in the dataset Parameters ---------- value : bool If `True`, all values fetched will be normalized to reads per million. If `False`, all values will not be normalized. """ assert value in (True, False) self._normalize = value def chroms(self): """Return a list of chromosomes in the GenomeArray Returns ------- list Chromosome names as strings """ return self._chroms.keys() def strands(self): """Return a tuple of strands in the GenomeArray Returns ------- tuple Chromosome strands as strings """ return self._strands def lengths(self): """Return a dictionary mapping chromosome names to lengths. When two strands report different lengths for a chromosome, the max length is taken. Returns ------- dict Dictionary mapping chromosome names to chromosome lengths """ d_out = {}.fromkeys(self.keys()) for key in d_out: d_out[key] = max([len(self._chroms[key][X]) for X in self.strands()]) return d_out
[docs]class MutableAbstractGenomeArray(AbstractGenomeArray): """Abstract base class for |GenomeArray|-like objects whose values can be changed """ @abstractmethod def __setitem__(self, seg, val): """Set values in |MutableAbstractGenomeArray| over a region of interest. Parameters ---------- seg : |GenomicSegment| or |SegmentChain| Region of interest val : int, float, or :class:`numpy.ndarray` Scalar or vector of values to set in array over `seg`. If a vector, values should be ordered 5'-to-3' relative to `seg` rather than (i.e. for a reverse-strand feature, position 0 in the vector would correspond to seg.end) in the genome """ pass
[docs]class BAMGenomeArray(AbstractGenomeArray): """ BAMGenomeArray(*bamfiles,mapping=CenterMapFactory()) A GenomeArray for :term:`read alignments` in `BAM`_ files. When a user requests :term:`read alignments` or :term:`counts` at a set of genomic positions, |BAMGenomeArray| uses a :term:`mapping rule` to determine which reads from the `BAM`_ file should be returned and how these reads should be mapped to positions in a vector. Reads are not mapped until requested, and mapping rules are changeable at runtime via :meth:`BAMGenomeArray.set_mapping` Parameters ---------- bamfile : One or more filenames or open :class:`pysam.AlignmentFile` Filename(s) of `BAM`_ files, or open :class:`pysam.AlignmentFile` to include in array. Note: all `BAM`_ files must be sorted and indexed by `samtools`_. mapping : func :term:`mapping function` that determines how each read alignment is mapped to a count at a genomic position. Examples include mapping reads to their fiveprime ends, mapping reads to their threeprime ends, or mapping somewhere in the middle. Factories to produce such functions are provided. See references below. (Default: :func:`CenterMapFactory`) Attributes ---------- map_fn : function :term:`mapping function` used to convert :term:`read alignments` to :term:`counts` bamfiles : list `BAM`_ files or :term:`read alignments`, as a list of :py:class:`pysam.AlignmentFile` Notes ----- The alignment data in |BAMGenomeArray| objects is immutable. If you need to change the data in-place, the |BAMGenomeArray| can be converted to a |GenomeArray| or |SparseGenomeArray| via :meth:`~BAMGenomeArray.to_genome_array` """ def __init__(self, *bamfiles, **kwargs): #mapping=None): """Create a |BAMGenomeArray| Parameters ---------- bamfile : One or more filenames or open :class:`pysam.AlignmentFile` Filename(s) of `BAM`_ files, or open :class:`pysam.AlignmentFile` to include in array. Note: all `BAM`_ files must be sorted and indexed by `samtools`_. mapping : func :term:`mapping function` that determines how each read alignment is mapped to a count at a genomic position. Examples include mapping reads to their fiveprime ends, mapping reads to their threeprime ends, or mapping somewhere in the middle. Factories to produce such functions are provided. See references below. (Default: :func:`CenterMapFactory`) See also -------- plastid.genomics.map_factories.FivePrimeMapFactory map reads to 5' ends, with or without applying an offset plastid.genomics.map_factories.VariableFivePrimeMapFactory map reads to 5' ends, choosing an offset determined by read length plastid.genomics.map_factories.ThreePrimeMapFactory map reads to 3' ends, with or without applying an offset plastid.genomics.map_factories.CenterMapFactory map each read fractionally to every position in the read, optionally trimming positions from the ends first """ if len(bamfiles) == 1 and isinstance(bamfiles[0], list): bamfiles = bamfiles[0] bamfiles = list(multiopen(bamfiles, fn=pysam.AlignmentFile, args=("rb", ))) #bamfiles = [pysam.AlignmentFile(X,"rb") if isinstance(X,str) else X for X in bamfiles] self.bamfiles = bamfiles self.map_fn = kwargs.get("mapping", CenterMapFactory()) # if mapping is None else mapping self._strands = ("+", "-", ".") self._normalize = False self._chr_lengths = {} for bamfile in self.bamfiles: for k, v in zip(bamfile.references, bamfile.lengths): self._chr_lengths[k] = max(self._chr_lengths.get(k, 0), v) self._chroms = sorted(list(self._chr_lengths.keys())) self._filters = OrderedDict() self._update() def __del__(self): for bamfile in self.bamfiles: bamfile.close()
[docs] def reset_sum(self): """Reset the sum to the total number of mapped reads in the |BAMGenomeArray| Notes ----- Filters are not applied in this summation. See :meth:`BAMGenomeArray.set_sum` to manually set a sum, which e.g. could correspond to a total number of filtered reads. """ self._sum = sum([X.mapped for X in self.bamfiles])
def _update(self): """Updates mapping function to suit mapping rules """ self.reset_sum()
[docs] def add_filter(self, name, func): """Apply a function to filter reads retrieved from regions before mapping and counting Parameters ---------- name : str A name for the filter. If not unique, will overwrite previous filter func : func Filter function. Function must take a :class:`pysam.AlignedSegment` as a single parameter, and return `True` if that read should be included in output, or `False` otherwise Notes ----- In Python, `lambda` functions do NOT have their own scope! We strongly recomend defining filter functions using the ``def`` syntax to avoid namespace collisions. See also -------- plastid.genomics.map_factories.SizeFilterFactory generate filter functions that gate read alignments on size """ self._filters[name] = func
[docs] def remove_filter(self, name): """Remove a generic filter Parameters ---------- name : str A name for the filter Returns ------- func the removed filter function """ retval = self._filters.pop(name) return retval
[docs] def chroms(self): """Returns a list of chromosomes Returns ------- list sorted chromosome names as strings """ return self._chroms
[docs] def lengths(self): """Returns a dictionary mapping chromosome names to lengths. Returns ------- dict mapping chromosome names to lengths """ return self._chr_lengths
[docs] def get_reads_and_counts(self, roi, roi_order=True): """Return :term:`read alignments` covering a |GenomicSegment|, and a count vector mapping reads to each positions in the |GenomicSegment|, following the rule specified by :meth:`~BAMGenomeArray.set_mapping`. Reads are strand-matched to `roi` by default. To obtain unstranded reads, set the value of `roi.strand` to `'.'` Parameters ---------- roi : |GenomicSegment| Region of interest Returns ------- list List of reads (as :class:`pysam.AlignedSegment`) covering region of interest numpy.ndarray Counts at each position of `roi`, under whatever mapping rule was set by :py:meth:`~BAMGenomeArray.set_mapping` Raises ------ ValueError if bamfiles not sorted or not indexed """ # fetch reads chrom = roi.chrom strand = roi.strand start = roi.start end = roi.end if chrom not in self.chroms(): # FIXME: generalize to N-D shape = [1] + getattr(self.map_fn, "shape", []) return [], numpy.zeros(shape) reads = itertools.chain.from_iterable( ( X.fetch( reference=chrom, start=start, end=end, # until_eof=True, # this could speed things up. need to test/investigate ) for X in self.bamfiles ) ) # filter by strand if strand == "+": reads = ifilter(lambda x: x.is_reverse is False, reads) elif strand == "-": reads = ifilter(lambda x: x.is_reverse is True, reads) # Pass through additional filters (e.g. size filters, if they have # been added) for my_filter in self._filters.values(): reads = ifilter(my_filter, reads) # retrieve selected parts of regions reads, count_array = self.map_fn(list(reads), roi) # normalize to reads per million if normalization flag is set if self._normalize is True: count_array = count_array / float(self.sum()) * 1e6 if roi_order == True and strand == "-": count_array = count_array[..., ::-1] return reads, count_array
[docs] def get_reads(self, roi): """Returns reads covering a |GenomicSegment|. Reads are strand-matched to `roi` by default, and are included or excluded depending upon the rule specified by :meth:`~BAMGenomeArray.set_mapping`. To obtain unstranded reads, set the value of `roi.strand` to `'.'` Parameters ---------- roi : |GenomicSegment| Region of interest Returns ------- list List of reads (as :class:`pysam.AlignedSegment`) covering region of interest, according to the rule set in :meth:`~BAMGenomeArray.set_mapping` Raises ------ ValueError if bamfiles are not sorted or not indexed """ reads, _ = self.get_reads_and_counts(roi) return reads
def __getitem__(self, roi): """Retrieve array of counts from a region of interest, following the mapping rule set by :meth:`~BAMGenomeArray.set_mapping`. Values in the vector are ordered 5' to 3' relative to `roi` rather than the genome (i.e. are reversed for reverse-strand features). Parameters ---------- roi : |GenomicSegment| or |SegmentChain| Region of interest in genome Raises ------ ValueError if bamfiles not sorted or not indexed Returns ------- numpy.ndarray vector of numbers, each position corresponding to a position in `roi`, from 5' to 3' relative to `roi` See also -------- plastid.genomics.roitools.SegmentChain.get_counts Fetch a spliced vector of data covering a |SegmentChain| """ return self.get(roi, roi_order=True)
[docs] def get(self, roi, roi_order=True): """Retrieve array of counts from a region of interest, following the mapping rule set by :meth:`~BAMGenomeArray.set_mapping`. Values in the vector are ordered 5' to 3' relative to `roi` rather than the genome (i.e. are reversed for reverse-strand features). Parameters ---------- roi : |GenomicSegment| or |SegmentChain| Region of interest in genome roi_order : bool, optional If `True` (default) return vector of values 5' to 3' relative to vector rather than genome. Raises ------ ValueError if bamfiles not sorted or not indexed Returns ------- numpy.ndarray vector of numbers, each position corresponding to a position in `roi`, from 5' to 3' relative to `roi` See also -------- plastid.genomics.roitools.SegmentChain.get_counts Fetch a spliced vector of data covering a |SegmentChain| """ if isinstance(roi, SegmentChain): return roi.get_counts(self) _, count_array = self.get_reads_and_counts(roi, roi_order=roi_order) return count_array
[docs] def get_mapping(self): """Return the docstring of the current mapping function """ return self.map_fn.__doc__
[docs] def set_mapping(self, mapping_function): """Change the mapping rule Parameters ---------- mapping : func Function that determines how each read alignment is mapped to a count at a position in the |BAMGenomeArray|. Examples include mapping reads to their fiveprime or threeprime ends, with or without offsets. Factories to produce such functions are provided. See references below. See also -------- plastid.genomics.map_factories.FivePrimeMapFactory map reads to 5' ends, with or without applying an offset plastid.genomics.map_factories.VariableFivePrimeMapFactory map reads to 5' ends, choosing an offset determined by read length plastid.genomics.map_factories.ThreePrimeMapFactory map reads to 3' ends, with or without applying an offset plastid.genomics.map_factories.CenterMapFactory map each read fractionally to every position in the read, optionally trimming positions from the ends first """ self.map_fn = mapping_function self._update()
[docs] def to_genome_array(self, array_type=None): """Converts |BAMGenomeArray| to a |GenomeArray| or |SparseGenomeArray| under the mapping rule set by :meth:`~BAMGenomeArray.set_mapping` Parameters ---------- array_type : class Type of GenomeArray to return. |GenomeArray| or |SparseGenomeArray|. (Default: |GenomeArray|) Returns ------- |GenomeArray| or |SparseGenomeArray| """ if array_type is None: array_type = GenomeArray ga = array_type(chr_lengths=self.lengths(), strands=self.strands()) for chrom in self.chroms(): for strand in self.strands(): seg = GenomicSegment(chrom, 0, self.lengths()[chrom] - 1, strand) ga[seg] = self[seg] return ga
[docs] def to_variable_step(self, fh, trackname, strand, window_size=100000, printer=None, **kwargs): """Write the contents of the |BAMGenomeArray| to a variableStep `Wiggle`_ file under the mapping rule set by :meth:`~BAMGenomeArray.set_mapping`. See the `Wiggle spec <http://genome.ucsc.edu/goldenpath/help/wiggle.html>`_ for format details. Parameters ---------- fh : file-like Filehandle to write to trackname : str Name of browser track strand : str Strand of |BAMGenomeArray| to export. `'+'`, `'-'`, or `'.'` window_size : int Size of chromosome/contig to process at a time. Larger values are faster but less memory-efficient printer : file-like, optional Something implementing a write() method for output **kwargs Any other key-value pairs to include in track definition line """ assert strand in self.strands() printer = NullWriter() if printer is None else printer fh.write("track type=wiggle_0 name=%s" % trackname) if kwargs is not None: for k, v in sorted(kwargs.items(), key=lambda x: x[0]): fh.write(" %s=%s" % (k, v)) fh.write("\n") for chrom in sorted(self.chroms()): my_size = self.lengths()[chrom] printer.write("Writing chromosome %s..." % chrom) fh.write("variableStep chrom=%s span=1\n" % chrom) window_starts = xrange(0, self._chr_lengths[chrom], window_size) for my_start in window_starts: my_end = min(my_start + window_size, my_size) my_counts = self.get( GenomicSegment(chrom, my_start, my_end, strand), roi_order=False ) if my_counts.sum() > 0: for idx in my_counts.nonzero()[0]: genomic_x = my_start + idx val = my_counts[idx] fh.write("%s\t%s\n" % (genomic_x + 1, val))
[docs] def to_bedgraph(self, fh, trackname, strand, window_size=100000, printer=None, **kwargs): """Write the contents of the |BAMGenomeArray| to a `bedGraph`_ file under the mapping rule set by :meth:`~BAMGenomeArray.set_mapping`. See the `bedGraph spec <https://cgwb.nci.nih.gov/goldenPath/help/bedgraph.html>`_ for format details. Parameters ---------- fh : file-like Filehandle to write to trackname : str Name of browser track strand : str Strand of |BAMGenomeArray| to export. `'+'`, `'-'`, or `'.'` window_size : int Size of chromosome/contig to process at a time. Larger values are faster but less memory-efficient printer : file-like, optional Something implementing a write() method for output **kwargs Any other key-value pairs to include in track definition line """ assert strand in self.strands() assert window_size > 0 printer = NullWriter() if printer is None else printer # write header fh.write("track type=bedGraph name=%s" % trackname) if kwargs is not None: for k, v in sorted(kwargs.items(), key=lambda x: x[0]): fh.write(" %s=%s" % (k, v)) fh.write("\n") for chrom in sorted(self.chroms()): printer.write("Writing chromosome %s..." % chrom) window_starts = xrange(0, self._chr_lengths[chrom], window_size) my_size = self.lengths()[chrom] for my_start in window_starts: my_end = min(my_start + window_size, my_size) my_counts = self.get( GenomicSegment(chrom, my_start, my_end, strand), roi_order=False ) if my_counts.sum() > 0: genomic_start_x = my_start last_val = my_counts[0] for x, val in enumerate(my_counts[1:]): if val != last_val: genomic_end_x = 1 + x + my_start #write line: chrom chromStart chromEnd dataValue. 0-based half-open if last_val > 0: fh.write( "%s\t%s\t%s\t%s\n" % (chrom, genomic_start_x, genomic_end_x, last_val) ) #update variables last_val = val genomic_start_x = genomic_end_x else: continue # write out last values for window if last_val > 0: fh.write("%s\t%s\t%s\t%s\n" % (chrom, genomic_start_x, my_end, last_val))
[docs]class BigWigGenomeArray(AbstractGenomeArray): """BigWigGenomeArray(maxmem=0) High-performance GenomeArray for `BigWig`_ files. Parameters ---------- maxmem : float Maximum desired memory footprint for C objects, in megabytes. May be temporarily exceeded if large queries are requested. (Default: 0, No maximum) """ def __init__(self, maxmem=0, **kwargs): """Create a |BigWigGenomeArray|. `BigWig`_ files may be added to the array via :meth:`~BigWigGenomeArray.add_from_bigwig`. Parameters ---------- maxmem : float Maximum desired memory footprint for C objects, in megabytes. May be temporarily exceeded if large queries are requested. (Default: 0, No maximum) """ self._strand_dict = {} self._normalize = False self._chromset = None self._sum = None self._lengths = None self._strands = [] self._maxmem = maxmem self.fill = 0.0 # fill def __getitem__(self, roi): """Retrieve array of counts from a region of interest. Parameters ---------- roi : |GenomicSegment| or |SegmentChain| Region of interest in genome Returns ------- numpy.ndarray vector of numbers, each position corresponding to a position in `roi`, from 5' to 3' relative to `roi` See also -------- plastid.genomics.roitools.SegmentChain.get_counts Fetch a spliced vector of data covering a |SegmentChain| """ return self.get(roi, roi_order=True)
[docs] def get(self, roi, roi_order=True): """Retrieve array of counts from a region of interest. Parameters ---------- roi : |GenomicSegment| or |SegmentChain| Region of interest in genome roi_order : bool, optional If `True` (default) return vector of values 5' to 3' relative to vector rather than genome. Returns ------- numpy.ndarray vector of numbers, each position corresponding to a position in `roi`, from 5' to 3' relative to `roi` See also -------- plastid.genomics.roitools.SegmentChain.get_counts Fetch a spliced vector of data covering a |SegmentChain| """ if isinstance(roi, SegmentChain): return roi.get_counts(self) strand = roi.strand sdict = self._strand_dict count_vec = numpy.zeros(len(roi)) if strand in sdict: for bw in sdict[strand]: # we flip strand later to save operations count_vec += bw.get(roi, roi_order=False) else: warnings.warn( "Strand '%s' not in BigWigGenomeArray (has %s)." % (strand, ", ".join(sdict.keys())), DataWarning ) if self._normalize is True: count_vec = count_vec / float(self.sum()) * 1e6 if roi_order == True and strand == "-": count_vec = count_vec[::-1] return count_vec
[docs] def strands(self): """Return a tuple of strands in the GenomeArray Returns ------- tuple Chromosome strands as strings """ return self._strands
[docs] def chroms(self): """Return a list of chromosomes in the GenomeArray Returns ------- list Chromosome names as strings """ if self._chromset is not None: return self._chromset else: chromset = [] for ltmp in self._strand_dict.values(): for bw in ltmp: chromset.extend(bw.chroms.keys()) self._chromset = set(chromset) return self._chromset
[docs] def lengths(self): """Return a dictionary mapping chromosome names to lengths. When two strands report different lengths for a chromosome, the max length is taken. Returns ------- dict Dictionary mapping chromosome names to chromosome lengths """ if self._lengths is not None: return self._lengths else: lengths = {} for ltmp in self._strand_dict.values(): for bw in ltmp: sizedict = bw.chroms for k, v in sizedict.items(): lengths[k] = max(lengths.get(k, 0), v) self._lengths = lengths return lengths
[docs] def add_from_bigwig(self, filename, strand): """Import additional data from a `BigWig`_ file Parameters ---------- filename : str Path to `BigWig`_ file strand : str Strand to which data should be added. `'+'`, `'-'`, or `'.'` """ self._chromset = None self._lengths = None self._sum = None bw = BigWigReader(filename, fill=self.fill, maxmem=self._maxmem) try: self._strand_dict[strand].append(bw) except KeyError: self._strand_dict[strand] = [bw] self._strands = sorted(self._strand_dict.keys())
[docs] def reset_sum(self): """Reset sum to total of data in the |BigWigGenomeArray|""" my_sum = 0 for ltmp in self._strand_dict.values(): for bw in ltmp: my_sum += bw.sum() self._sum = my_sum return my_sum
[docs] def to_genome_array(self): """Converts |BigWigGenomeArray| to a |GenomeArray| Returns ------- |GenomeArray| """ ga = GenomeArray(chr_lengths=self.lengths(), strands=self.strands()) for chrom, length in self.lengths().items(): for strand in self._strands: region = GenomicSegment(chrom, 0, length, strand) for reader in self._strand_dict.get(strand, []): old = ga.get(region, roi_order=False) new = old + reader.get_chromosome_counts(chrom) ga.__setitem__(region, new, roi_order=False) return ga
[docs]class GenomeArray(MutableAbstractGenomeArray): """Array-like data structure that maps numerical values (e.g. read :term:`counts` or conservation data, et c) to nucleotide positions in a genome. Data in the array may be manipulated manually and/or imported from `Wiggle`_ or `BedGraph`_ files, as well as `bowtie`_ alignments. |GenomeArray| supports basic mathematical operations elementwise, where elements are nucleotide positions. Parameters ---------- chr_lengths : dict or `None`, optional Dictionary mapping chromosome names to lengths. Suppyling this in advance yields considerable speed improvements, as memory can be pre-allocated for each chromosomal position. If not provided, a minimum chromosome size will be guessed, and chromosomes re-sized as needed. min_chr_size : int If chr_lengths is not supplied, `min_chr_size` is the default first guess of a chromosome size. If the genome has large chromosomes, it is much much better, speed-wise, to provide `chr_lengths` than to provide a guess here that is too small. strands : sequence Sequence of strand names for the |GenomeArray|. (Default: `('+','-')`) """ def __init__(self, chr_lengths=None, strands=None, min_chr_size=MIN_CHR_SIZE): """Create a |GenomeArray| Parameters ---------- chr_lengths : dict or `None`, optional Dictionary mapping chromosome names to lengths. Suppyling this in advance yields considerable speed improvements, as memory can be pre-allocated for each chromosomal position. If not provided, a minimum chromosome size will be guessed, and chromosomes re-sized as needed. min_chr_size : int If chr_lengths is not supplied, `min_chr_size` is the default first guess of a chromosome size. If the genome has large chromosomes, it is much much better, speed-wise, to provide `chr_lengths` than to provide a guess here that is too small. strands : sequence Sequence of strand names for the |GenomeArray|. (Default: `('+','-')`) """ self._chroms = {} self._strands = _DEFAULT_STRANDS if strands is None else strands self.min_chr_size = min_chr_size self._sum = None self._normalize = False if chr_lengths is not None: for chrom in chr_lengths.keys(): self._chroms[chrom] = {} for strand in self._strands: l = chr_lengths[chrom] self._chroms[chrom][strand] = numpy.zeros(l)
[docs] def reset_sum(self): """Reset the sum of the |GenomeArray| to the sum of all positions in the array """ self._sum = sum([X.sum() for X in self.iterchroms()])
def _has_same_dimensions(self, other): """Return `True` if `self` and `other` have the chromosomes, strands, and chromosome lengths Parameters ---------- other : |GenomeArray| Returns ------- bool """ assert set(self.keys()) == set(other.keys()) assert self.strands() == other.strands() assert self.lengths() == other.lengths() def __eq__(self, other, tol=1e-10): """Test equality between `self` and `other`. To be equal, both |GenomeArrays| must have identical values at all nonzero positions in either array. Chromosome lengths need not be equal. Parameters ---------- other : |GenomeArray| or |SparseGenomeArray| Returns ------- bool """ snz = self.nonzero() onz = other.nonzero() for chrom in set(self.chroms()) | set(other.chroms()): for strand in set(self.strands()) | set(other.strands()): self_nonzero_vec = snz.get(chrom, {K: numpy.array([]) for K in self.strands()}).get(strand, numpy.array([])) other_nonzero_vec = onz.get(chrom, {K: numpy.array([]) for K in other.strands() }).get(strand, numpy.array([])) if len(self_nonzero_vec) != len(other_nonzero_vec): return False # indices of nonzero positions must be same if (self_nonzero_vec != other_nonzero_vec).any(): return False # values of nonzero positions must be same if len(self_nonzero_vec) > 0: test_seg = GenomicSegment( chrom, self_nonzero_vec.min(), self_nonzero_vec.max(), strand ) if not (self[test_seg] - other[test_seg] <= tol).all(): return False return True def __getitem__(self, roi): """Retrieve array of counts from a region of interest (`roi`) with values in vector ordered 5' to 3' relative to `roi` rather than genome (i.e. are reversed for reverse-strand features). Parameters ---------- roi : |GenomicSegment| or |SegmentChain| Region of interest in genome Returns ------- :class:`numpy.ndarray` vector of numbers, each position corresponding to a position in `roi`, from 5' to 3' relative to `roi` See also -------- plastid.genomics.roitools.SegmentChain.get_counts Fetch a spliced vector of data covering a |SegmentChain| """ return self.get(roi, roi_order=True)
[docs] def get(self, roi, roi_order=True): """Retrieve array of counts from a region of interest (`roi`) with values in vector ordered 5' to 3' relative to `roi` rather than genome (i.e. are reversed for reverse-strand features). Parameters ---------- roi : |GenomicSegment| or |SegmentChain| Region of interest in genome roi_order : bool, optional If `True` (default) return vector of values 5' to 3' relative to vector rather than genome. Returns ------- :class:`numpy.ndarray` vector of numbers, each position corresponding to a position in `roi`, from 5' to 3' relative to `roi` See also -------- plastid.genomics.roitools.SegmentChain.get_counts Fetch a spliced vector of data covering a |SegmentChain| """ if isinstance(roi, SegmentChain): return roi.get_counts(self) chrom = roi.chrom strand = roi.strand start = roi.start end = roi.end try: assert roi.end < len(self._chroms[chrom][strand]) except AssertionError: my_len = len(self._chroms[chrom][strand]) new_size = max(my_len + 10000, end + 10000) for my_strand in self.strands(): new_strand = copy.deepcopy(self._chroms[roi.chrom][my_strand]) new_strand.resize(new_size) self._chroms[chrom][my_strand] = new_strand except KeyError: assert strand in self.strands() if chrom not in self.keys(): self._chroms[chrom] = {} for my_strand in self.strands(): self._chroms[chrom][my_strand] = numpy.zeros(self.min_chr_size) vals = self._chroms[chrom][strand][start:end] if self._normalize is True: vals = 1e6 * vals / self.sum() if roi_order == True and strand == "-": vals = vals[::-1] return vals
def __setitem__(self, seg, val, roi_order=True): """Set values in the |GenomeArray| over a region of interest. If the `seg` is outside the bounds of the current |GenomeArray|, the |GenomeArray| will be automatically expanded to accomodate the genomic coordinates of `seg`. Parameters ---------- seg : |GenomicSegment| or |SegmentChain| Region of interest val : int, float, or :class:`numpy.ndarray` Scalar or vector of values to set in array over `seg`. If a vector, values should be ordered 5'-to-3' relative to `seg` rather than (i.e. for a reverse-strand feature, position 0 in the vector would correspond to seg.end) in the genome roi_order : bool, optional If `True` (default) and `val` is a vector, values in `val` are assorted to go 5' to 3' relative to `seg` rather than genome """ self._sum = None if isinstance(seg, SegmentChain): if isinstance(val, numpy.ndarray): if seg.spanning_segment.strand == "-": val = val[::-1] x = 0 for subseg in seg: if isinstance(val, numpy.ndarray): subval = val[x:x + len(subseg)] else: subval = val x += len(subseg) self.__setitem__(subseg, subval, roi_order=False) return old_normalize = self._normalize if old_normalize == True: warn( "Temporarily turning off normalization during value set. It will be re-enabled automatically when complete.", DataWarning ) self.set_normalize(False) chrom = seg.chrom strand = seg.strand start = seg.start end = seg.end if strand == "-" and isinstance(val, numpy.ndarray) and roi_order == True: val = val[::-1] try: assert end < len(self._chroms[seg.chrom][seg.strand]) except AssertionError: my_len = len(self._chroms[chrom][strand]) new_size = max(my_len + 10000, end + 10000) for my_strand in self.strands(): # copy then resize, because resize() can't work in-place with refs to array new_strand = copy.deepcopy(self._chroms[chrom][my_strand]) new_strand.resize(new_size) self._chroms[chrom][my_strand] = new_strand except KeyError: assert strand in self.strands() if chrom not in self.keys(): self._chroms[chrom] = {} for my_strand in self.strands(): self._chroms[chrom][my_strand] = numpy.zeros(self.min_chr_size) self._chroms[chrom][strand][start:end] = val self.set_normalize(old_normalize)
[docs] def keys(self): """Return names of chromosomes in the GenomeArray""" return self.chroms()
[docs] def iterchroms(self): """Return an iterator of each chromosome strand array Yields ------ numpy.ndarray Positionwise counts on a chromosome strand """ for chrom in self.keys(): for strand in self.strands(): yield self._chroms[chrom][strand]
#TODO: no unit test for this
[docs] def plot(self, chroms=None, strands=None, **plot_kw): """Create a plot of coverage along tracks or chromosomes Parameters ---------- chroms : list Chromosomes to plot (default: all) strands : list Strands to plot (default: all) plot_kw A dictionary of keywords to pass to matplotlib Returns ------- :py:class:`matplotlib.figure.Figure` """ import matplotlib matplotlib.use("Agg") import matplotlib.pyplot as plt fig_kw = {'figsize': (8, 10)} colors = {"+": "blue", "-": "orange", ".": "darkgreen"} if chroms is None: chroms = self.keys() if strands is None: strands = self.strands() fig_kw.update(plot_kw) f, axes = plt.subplots(nrows=len(chroms), **fig_kw) plt.subplots_adjust(hspace=0.3) for n, chrom in enumerate(sorted(chroms)): x = numpy.arange(0, self.lengths()[chrom]) for strand in strands: multiplier = -1 if strand == "-" else 1 axes[n].plot( x, self._chroms[chrom][strand] * multiplier, label=strand, color=colors[strand] ) axes[n].set_title(chrom) axes[n].set_ylabel("Counts") axes[n].set_xlabel("Position (nt)") axes[n].ticklabel_format(useOffset=1, style='plain', axis='x') return f
[docs] def nonzero(self): """Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair. Results are returned as a hierarchical dictionary mapping chromosome names, to a dictionary of strands, which in turn map to to arrays of non-zero indices on that chromosome-strand. Returns ------- dict `dict[chrom][strand]` = numpy.ndarray of indices """ d_out = {} for key in self.keys(): d_out[key] = {} for strand in self.strands(): d_out[key][strand] = self._chroms[key][strand].nonzero()[0] return d_out
[docs] def apply_operation(self, other, func, mode="same"): """Applies a binary operator to a copy of `self` and to `other` elementwise. `other` may be a scalar quantity or another |GenomeArray|. In both cases, a new |GenomeArray| is returned, and `self` is left unmodified. If :meth:`set_normalize` is set to `True`, it is disabled during the operation. Parameters ---------- other : float, int, or |MutableAbstractGenomeArray| Second argument to `func` func : func Function to perform. This must take two arguments. a :py:class:`numpy.ndarray` (chromosome-strand) from the |GenomeArray| will be supplied as the first argument, and the corresponding chromosome-strand from `other` as the second. This operation will be applied over all chromosomes and strands. If mode is set to `all`, and a chromosome or strand is not present in one of self or other, zero will be supplied as the missing argument to func. func must handle this gracefully. mode : str, choice of `'same'`, `'all'`, or `'truncate'` Only relevant if `other` is a |MutableAbstractGenomeArray| If '`same`' each set of corresponding chromosomes must have the same dimensions in both parent |GenomeArray| s. In addition, both |GenomeArray| s must have the same chromosomes, and the same strands. If '`all`', corresponding chromosomes in the parent |GenomeArray| s will be resized to the dimensions of the larger chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; all chromosomes and strands will be included in output. If `'truncate'`,corresponding chromosomes in the parent |GenomeArray| s will be resized to the dimensions of the smaller chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; chromosomes or strands not common to both parents will be ignored. Returns ------- |GenomeArray| new |GenomeArray| after the operation is applied """ new_array = GenomeArray.like(self) old_normalize = self._normalize if old_normalize == True: warn( "Temporarily turning off normalization during value set. It will be re-enabled automatically when complete.", DataWarning ) if type(other) == self.__class__: if mode == "same": self._has_same_dimensions(other) for chrom in self.keys(): for strand in self.strands(): seg = GenomicSegment(chrom, 0, len(self._chroms[chrom][strand]), strand) new_array[seg] = func(self[seg], other[seg]) elif mode == "all": chroms = {}.fromkeys(set(self.keys()) | set(other.keys())) strands = set(self.strands()) | set(other.strands()) new_array = GenomeArray(chroms, strands=strands) for chrom in chroms: if chrom in self.keys() and chrom in other.keys(): for strand in strands: if strand in self.strands() and strand in other.strands(): sc = copy.deepcopy(self._chroms[chrom][strand]) oc = copy.deepcopy(other._chroms[chrom][strand]) if len(sc) > len(oc): oc.resize(len(sc), refcheck=False) elif len(oc) > len(sc): sc.resize(len(oc), refcheck=False) new_array._chroms[chrom][strand] = func(sc, oc) elif strand in self.strands(): new_array._chroms[chrom][strand] = func( copy.deepcopy(self._chroms[chrom][strand]), 0 ) else: new_array._chroms[chrom][strand] = func( copy.deepcopy(other._chroms[chrom][strand]), 0 ) elif chrom in self.keys(): for strand in strands: new_array._chroms[chrom][strand] = func( copy.deepcopy(self[chrom][strand]), 0 ) else: for strand in strands: new_array._chroms[chrom][strand] = func( copy.deepcopy(other[chrom][strand]), 0 ) elif mode == "truncate": my_strands = set(self.strands()) & set(other.strands()) my_chroms = set(self.chroms()) & set(other.chroms()) for chrom in my_chroms: for strand in my_strands: sc = copy.deepcopy(self._chroms[chrom][strand]) oc = copy.deepcopy(other._chroms[chrom][strand]) if len(sc) > len(oc): sc.resize(len(oc), refcheck=False) elif len(oc) > len(sc): oc.resize(len(sc), refcheck=False) new_array._chroms[chrom][strand] = func(sc, oc) else: raise ValueError("Mode not understood. Must be 'same', 'all', or 'truncate.'") else: for chrom in self.keys(): for strand in self.strands(): new_array._chroms[chrom][strand] = func(self._chroms[chrom][strand], other) self.set_normalize(old_normalize) return new_array
def __add__(self, other, mode="all"): """Add `other` to `self`, returning a new |GenomeArray| and leaving `self` unchanged. `other` may be a scalar quantity or another |GenomeArray|. In the latter case, addition is elementwise. Parameters ---------- other : float, int, or |MutableAbstractGenomeArray| mode : str, choice of `'same'`, `'all'`, or `'truncate'` Only relevant if `other` is a |MutableAbstractGenomeArray| If '`same`' each set of corresponding chromosomes must have the same dimensions in both parent |GenomeArrays|. In addition, both |GenomeArrays| must have the same chromosomes, and the same strands. If '`all`', corresponding chromosomes in the parent |GenomeArrays| will be resized to the dimensions of the larger chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; all chromosomes and strands will be included in output. If `'truncate'`,corresponding chromosomes in the parent |GenomeArrays| will be resized to the dimensions of the smaller chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; chromosomes or strands not common to both parents will be ignored. Returns ------- |GenomeArray| """ return self.apply_operation(other, operator.add, mode=mode) def __mul__(self, other, mode="all"): """Multiply `other` by `self`, returning a new |GenomeArray| and leaving `self` unchanged. `other` may be a scalar quantity or another |GenomeArray|. In the latter case, multiplication is elementwise. Parameters ---------- other : float, int, or |MutableAbstractGenomeArray| mode : str, choice of `'same'`, `'all'`, or `'truncate'` Only relevant if `other` is a |MutableAbstractGenomeArray| If '`same`' each set of corresponding chromosomes must have the same dimensions in both parent |GenomeArrays|. In addition, both |GenomeArrays| s must have the same chromosomes, and the same strands. If '`all`', corresponding chromosomes in the parent |GenomeArrays| will be resized to the dimensions of the larger chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; all chromosomes and strands will be included in output. If `'truncate'`,corresponding chromosomes in the parent |GenomeArrays| will be resized to the dimensions of the smaller chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; chromosomes or strands not common to both parents will be ignored. Returns ------- |GenomeArray| """ return self.apply_operation(other, operator.mul, mode=mode) def __sub__(self, other, mode="all"): """Subtract `other` from `self`, returning a new |GenomeArray| and leaving `self` unchanged. `other` may be a scalar quantity or another |GenomeArray|. In the latter case, subtraction is elementwise. Parameters ---------- other : float, int, or |MutableAbstractGenomeArray| mode : str, choice of `'same'`, `'all'`, or `'truncate'` Only relevant if `other` is a |MutableAbstractGenomeArray| If '`same`' each set of corresponding chromosomes must have the same dimensions in both parent |GenomeArrays|. In addition, both |GenomeArrays| s must have the same chromosomes, and the same strands. If '`all`', corresponding chromosomes in the parent |GenomeArrays| will be resized to the dimensions of the larger chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; all chromosomes and strands will be included in output. If `'truncate'`,corresponding chromosomes in the parent |GenomeArrays| will be resized to the dimensions of the smaller chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; chromosomes or strands not common to both parents will be ignored. Returns ------- |GenomeArray| """ return self.__add__(other * -1)
[docs] def add_from_bowtie(self, fh, mapfunc, min_length=25, max_length=numpy.inf, **trans_args): """Import alignment data in native `bowtie`_ format to the current GenomeArray Parameters ---------- fh : file-like filehandle pointing to `bowtie`_ file mapfunc : func a :term:`mapping function` that transforms alignment coordinates into a sequence of (|GenomicSegment|,value) pairs min_length : int, optional minimum length for read to be counted (Default: 25) max_length : int or numpy.inf, optional maximum length for read to be counted (Default: infinity) trans_args : dict, optional Keyword arguments to pass to transformation function See also -------- five_prime_map map reads to 5' ends, with or without applying an offset variable_five_prime_map map reads to 5' ends, choosing an offset determined by read length three_prime_map map reads to 3' ends, with or without applying an offset center_map map each read fractionally to every position in the read, optionally trimming positions from the ends first """ for feature in BowtieReader(fh): span_len = len(feature.spanning_segment) if span_len >= min_length and span_len <= max_length: tuples = mapfunc(feature, **trans_args) for seg, val in tuples: self[seg] += val self._sum = None
[docs] def add_from_wiggle(self, fh, strand): """Import data from a `Wiggle`_ or `bedGraph`_ file to current GenomeArray Parameters ---------- fh : file-like filehandle pointing to wiggle file strand : str Strand to which data should be added. `'+'`, `'-'`, or `'.'` """ assert strand in self.strands() for chrom, start, stop, val in WiggleReader(fh): seg = GenomicSegment(chrom, start, stop, strand) self[seg] += val self._sum = None
[docs] def to_variable_step(self, fh, trackname, strand, printer=None, **kwargs): """Export the contents of the GenomeArray to a variable step `Wiggle`_ file. For sparse data, `bedGraph`_ can be more efficient format. See the `Wiggle spec <http://genome.ucsc.edu/goldenpath/help/wiggle.html>`_ for format details Parameters ---------- fh : file-like Open filehandle to which data will be written trackname : str Name of browser track strand : str Strand to export. `'+'`, `'-'`, or `'.'` printer : file-like, optional Something implementing a write() method for output **kwargs Any other key-value pairs to include in track definition line """ assert strand in self.strands() printer = NullWriter() if printer is None else printer fh.write("track type=wiggle_0 name=%s" % trackname) if kwargs is not None: for k, v in sorted(kwargs.items(), key=lambda x: x[0]): fh.write(" %s=%s" % (k, v)) fh.write("\n") nonzero = self.nonzero() for chrom in sorted(nonzero): printer.write("Writing chromosome %s..." % chrom) fh.write("variableStep chrom=%s span=1\n" % chrom) indices = nonzero[chrom][strand] for idx in indices: val = self._chroms[chrom][strand][self._slicewrap(idx)] fh.write("%s\t%s\n" % (idx + 1, val))
[docs] def to_bedgraph(self, fh, trackname, strand, printer=None, **kwargs): """Write the contents of the GenomeArray to a `bedGraph`_ file See the `bedGraph spec <https://cgwb.nci.nih.gov/goldenPath/help/bedgraph.html>`_ for format details Parameters ---------- fh : file-like Open filehandle to which data will be written trackname : str Name of browser track strand : str Strand to export. `'+'`, `'-'`, or `'.'` printer : file-like, optional Something implementing a write() method for output **kwargs Any other key-value pairs to include in track definition line """ assert strand in self.strands() printer = NullWriter() if printer is None else printer fh.write("track type=bedGraph name=%s" % trackname) if kwargs is not None: for k, v in sorted(kwargs.items(), key=lambda x: x[0]): fh.write(" %s=%s" % (k, v)) fh.write("\n") nonzero = self.nonzero() for chrom in sorted(nonzero): printer.write("Writing chromosome %s..." % chrom) if self._chroms[chrom][strand].sum() > 0: last_val = 0 last_x = 0 nz = nonzero[chrom][strand] for x in range(nz.min(), nz.max() + 1): val = self._chroms[chrom][strand][self._slicewrap(x)] if val != last_val: # line is: chrom chromStart chromEnd dataValue fh.write("%s\t%s\t%s\t%s\n" % (chrom, last_x, x, last_val)) # prepare for next loop last_val = val last_x = x else: continue # write final line, not included in loop fh.write("%s\t%s\t%s\t%s\n" % (chrom, last_x, x + 1, last_val))
[docs] @staticmethod def like(other): """Return a |GenomeArray| of same dimension as the input array Parameters ---------- other : |GenomeArray| Returns ------- GenomeArray empty |GenomeArray| of same size as `other` """ return GenomeArray(other.lengths(), strands=other.strands())
def _slicewrap(self, x): """Helper function to wrap coordinates for VariableStep/`bedGraph`_ export""" return x
[docs]class SparseGenomeArray(GenomeArray): """A memory-efficient sublcass of |GenomeArray| using sparse internal representation. Note, savings in memory may come at a cost in performance when repeatedly getting/setting values, compared to a |GenomeArray|. Parameters ---------- chr_lengths : dict, optional Dictionary mapping chromosome names to lengths. Suppyling this parameter yields considerable speed improvements, as memory can be pre-allocated for each chromosomal position. If not provided, a minimum chromosome size will be guessed, and chromosomes re-sized as needed (Default: `{}` ) min_chr_size : int,optional If `chr_lengths` is not supplied, `min_chr_size` is the default first guess of a chromosome size. If your genome has large chromosomes, it is much much better, speed-wise, to provide a chr_lengths dict than to provide a guess here that is too small. (Default: %s) strands : sequence Sequence of strand names for the |GenomeArray|. (Default: `('+','-')`) """ def __init__(self, chr_lengths=None, strands=None, min_chr_size=MIN_CHR_SIZE): """Create a |SparseGenomeArray| Parameters ---------- chr_lengths : dict, optional Dictionary mapping chromosome names to lengths. Suppyling this parameter yields considerable speed improvements, as memory can be pre-allocated for each chromosomal position. If not provided, a minimum chromosome size will be guessed, and chromosomes re-sized as needed (Default: `{}` ) min_chr_size : int,optional If `chr_lengths` is not supplied, `min_chr_size` is the default first guess of a chromosome size. If your genome has large chromosomes, it is much much better, speed-wise, to provide a chr_lengths dict than to provide a guess here that is too small. (Default: %s) strands : sequence Sequence of strand names for the |GenomeArray|. (Default: `('+','-')`) """ % MIN_CHR_SIZE self._chroms = {} self._strands = _DEFAULT_STRANDS if strands is None else strands self._sum = None self._normalize = False self.min_chr_size = min_chr_size if chr_lengths is not None: for chrom in chr_lengths.keys(): self._chroms[chrom] = {} for strand in self._strands: l = chr_lengths[chrom] self._chroms[chrom][strand] = scipy.sparse.dok_matrix((1, l))
[docs] def lengths(self): """Return a dictionary mapping chromosome names to lengths. In the case where two strands report different lengths for a chromosome, the max length is taken. Returns ------- dict mapping chromosome names to lengths """ d_out = {}.fromkeys(self.keys()) for key in d_out: d_out[key] = max([self._chroms[key][X].shape[1] for X in self.strands()]) return d_out
[docs] def get(self, roi, roi_order=True): """Retrieve array of counts from a region of interest (`roi`) with values in vector ordered 5' to 3' relative to `roi` rather than genome (i.e. are reversed for reverse-strand features). Parameters ---------- roi : |GenomicSegment| or |SegmentChain| Region of interest in genome roi_order : bool, optional If `True` (default) return vector of values 5' to 3' relative to `roi` rather than genome. Returns ------- :class:`numpy.ndarray` vector of numbers, each position corresponding to a position in `roi`, from 5' to 3' relative to `roi` See also -------- SegmentChain.get_counts Fetch a spliced vector of data covering a |SegmentChain| """ if isinstance(roi, SegmentChain): return roi.get_counts(self) if roi.chrom not in self: self._chroms[roi.chrom] = { K: copy.deepcopy(scipy.sparse.dok_matrix((1, self.min_chr_size))) for K in self.strands() } if roi.end > self._chroms[roi.chrom][roi.strand].shape[1]: for strand in self.strands(): self._chroms[roi.chrom][strand].resize((1, roi.end + 10000)) vals = self._chroms[roi.chrom][roi.strand][0, roi.start:roi.end] if self._normalize is True: vals = 1e6 * vals / self.sum() vals = vals.toarray().reshape(vals.shape[1], ) if roi.strand == "-" and roi_order == True: vals = vals[::-1] return vals
def __setitem__(self, seg, val, roi_order=True): """Set values in the |SparseGenomeArray| over a region of interest. If the `seg` is outside the bounds of the current |GenomeArray|, the |GenomeArray| will be automatically expanded to accomodate the genomic coordinates of `seg`. Parameters ---------- seg : |GenomicSegment| or |SegmentChain| Region of interest val : int, float, or :class:`numpy.ndarray` Scalar or vector of values to set in array over `seg`. If a vector, values should be ordered 5'-to-3' relative to `seg` rather than (i.e. for a reverse-strand feature, position 0 in the vector would correspond to seg.end) in the genome roi_order : bool, optional If `True` (default) and `val` is a vector, values in `val` are assorted to go 5' to 3' relative to `seg` rather than genome """ self._sum = None if isinstance(seg, SegmentChain): if isinstance(val, numpy.ndarray): if seg.spanning_segment.strand == "-": val = val[::-1] x = 0 for subseg in seg: if isinstance(val, numpy.ndarray): subval = val[x:x + len(subseg)] else: subval = val x += len(subseg) self.__setitem__(subseg, subval, roi_order=False) return old_normalize = self._normalize if isinstance(val, numpy.ndarray) and seg.strand == "-" and roi_order == True: val = val[::-1] if old_normalize == True: warn( "Temporarily turning off normalization during value set. It will be re-enabled automatically when complete.", DataWarning ) self.set_normalize(False) if seg.chrom not in self: self._chroms[seg.chrom] = { K: copy.deepcopy(scipy.sparse.dok_matrix((1, self.min_chr_size))) for K in self.strands() } if seg.end > self._chroms[seg.chrom][seg.strand].shape[1]: for strand in self.strands(): self._chroms[seg.chrom][strand].resize((1, seg.end + 10000)) self._chroms[seg.chrom][seg.strand][0, seg.start:seg.end] = val self.set_normalize(old_normalize) def __mul__(self, other, mode=None): """Multiply `other` by `self`, returning a new |SparseGenomeArray| and leaving `self` unchanged. `other` may be a scalar quantity or another |SparseGenomeArray|. In the latter case, multiplication is elementwise. Parameters ---------- other : float, int, or |MutableAbstractGenomeArray| mode : str, choice of `'same'`, `'all'`, or `'truncate'` Only relevant if `other` is a |MutableAbstractGenomeArray| If '`same`' each set of corresponding chromosomes must have the same dimensions in both parent |SparseGenomeArrays|. In addition, both |SparseGenomeArrays| must have the same chromosomes, and the same strands. If '`all`', corresponding chromosomes in the parent |SparseGenomeArrays| will be resized to the dimensions of the larger chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; all chromosomes and strands will be included in output. If `'truncate'`,corresponding chromosomes in the parent |SparseGenomeArrays| will be resized to the dimensions of the smaller chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; chromosomes or strands not common to both parents will be ignored. Returns ------- |SparseGenomeArray| """ if isinstance(other, GenomeArray): new_array = SparseGenomeArray.like(self) chroms = set(self.keys()) & set(other.keys()) strands = set(self.strands()) & set(other.strands()) for chrom in chroms: for strand in strands: new_array._chroms[chrom][strand] = self._chroms[chrom][strand].multiply( other._chroms[chrom][strand] ) return new_array else: return self.apply_operation(other, operator.mul, mode=mode)
[docs] def apply_operation(self, other, func, mode=None): """Apply a binary operator to a copy of `self` and to `other` elementwise. `other` may be a scalar quantity or another |GenomeArray|. In both cases, a new |SparseGenomeArray| is returned, and `self` is left unmodified. If :meth:`set_normalize` is set to `True`, it is disabled during the operation. Parameters ---------- other : float, int, or |MutableAbstractGenomeArray| Second argument to `func` func : func Function to perform. This must take two arguments. a :py:class:`numpy.ndarray` (chromosome-strand) from the |SparseGenomeArray| will be supplied as the first argument, and the corresponding chromosome-strand from `other` as the second. This operation will be applied over all chromosomes and strands. If mode is set to `all`, and a chromosome or strand is not present in one of self or other, zero will be supplied as the missing argument to func. func must handle this gracefully. mode : str, choice of `'same'`, `'all'`, or `'truncate'` Only relevant if `other` is a |MutableAbstractGenomeArray| If '`same`' each set of corresponding chromosomes must have the same dimensions in both parent |SparseGenomeArrays|. In addition, both |SparseGenomeArrays| must have the same chromosomes, and the same strands. If '`all`', corresponding chromosomes in the parent |SparseGenomeArrays| will be resized to the dimensions of the larger chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; all chromosomes and strands will be included in output. If `'truncate'`,corresponding chromosomes in the parent |SparseGenomeArrays| will be resized to the dimensions of the smaller chromosome before the operation is applied. Parents are not required to have the same chromosomes or strands; chromosomes or strands not common to both parents will be ignored. Returns ------- |SparseGenomeArray| new |SparseGenomeArray| after the operation is applied """ new_array = SparseGenomeArray.like(self) old_normalize = self._normalize if old_normalize == True: warn( "Temporarily turning off normalization during value set. It will be re-enabled automatically when complete.", DataWarning ) self.set_normalize(False) if isinstance(other, GenomeArray): chroms = {}.fromkeys(set(self.keys()) | set(other.keys()), 10) strands = set(self.strands()) | set(other.strands()) new_array = SparseGenomeArray(chroms, strands=strands) for chrom in chroms: if chrom in self.keys() and chrom in other.keys(): for strand in strands: if strand in self.strands() and strand in other.strands(): sc = self._chroms[chrom][strand] oc = other._chroms[chrom][strand] new_array._chroms[chrom][strand] = func(sc, oc) elif strand in self.strands(): new_array._chroms[chrom][strand] = func( copy.deepcopy(self._chroms[chrom][strand]), 0 ) else: new_array._chroms[chrom][strand] = func( copy.deepcopy(other._chroms[chrom][strand]), 0 ) elif chrom in self.keys(): for strand in strands: new_array._chroms[chrom][strand] = func( copy.deepcopy(self[chrom][strand]), 0 ) else: for strand in strands: new_array._chroms[chrom][strand] = func( copy.deepcopy(other[chrom][strand]), 0 ) else: for chrom in self.keys(): for strand in self.strands(): new_array._chroms[chrom][strand] = func(self._chroms[chrom][strand], other) self.set_normalize(old_normalize) return new_array
[docs] def nonzero(self): """Return the indices of chromosomal positions with non-zero values at each chromosome/strand pair. Results are returned as a hierarchical dictionary mapping chromosome names, to a dictionary of strands, which in turn map to to arrays of non-zero indices on that chromosome-strand. Returns ------- dict `dict[chrom][strand]` = numpy.ndarray of indices """ d_out = {} for key in self.keys(): d_out[key] = {} for strand in self.strands(): # need to sort because dok doens't guarantee sorted indices in nonzero() d_out[key][strand] = numpy.array(sorted(self._chroms[key][strand].nonzero()[1])) return d_out
def _slicewrap(self, x): """Helper function to wrap coordinates for VariableStep/`bedGraph`_ export""" return (0, x)
[docs] @staticmethod def like(other): """Return a |SparseGenomeArray| of same dimension as the input array Parameters ---------- other : |GenomeArray| or |SparseGenomeArray| Returns ------- |SparseGenomeArray| of same size as `other` """ return SparseGenomeArray(other.lengths(), strands=other.strands())