Source code for plastid.genomics.genome_array

#!/usr/bin/env python
"""GenomeArrays map quantitative data or :term:`read alignments` to genomic positions.

.. contents::
   :local:

Summary
-------

GenomeArrays are typically accessed using |SegmentChains|.

The :meth:`~plastid.genomics.roitools.SegmentChain.get_counts` method of
|SegmentChain| and |Transcript| fetches arrays of data from GenomeArrays. The
result is a :class:`numpy array <numpy.ndarray>` in which each position
corresponds to a position in the |SegmentChain|, moving from 5' to 3', in the
spliced |SegmentChain| or |Transcript|. For a |Transcript| called
`my_transcript`::

    >>> # open a BAM file, map reads 15 nucleotides from 3' end
    >>> genome_array = BAMGenomeArray("some_file.bam")
    >>> genome_array.set_mapping(ThreePrimeMapFactory(offset=15))

    >>> my_data = my_transcript.get_counts(genome_array)
    >>> my_data
        array([ 95, 133,  87, 491, 100,  73, 308, 217, 447, 204, 245, 264, 395,
                76, 387, 319,  43, 171, 101, 152,  58, 167,  53, 397, 228,  44,
               402, 285, 480,  55,   9,  51, 358, 307, 144, 182, 293, 455, 403,
               375, 443, 442, 196, 133, 238, 268,  70,  91, 172, 490, 346, 294,
               117, 424, 180,  14, 400, 414, 257, 127, 344, 307, 435, 152, 458,
               423,  60, 173, 362, 277, 393, 297, 377, 357,  68, 143, 467,   7,
                86, 450, 230,  62, 278,  65, 346, 103, 185,   7, 350, 162, 461,
               274, 205, 185, 209, 275, 190, 434,  10, 241])


Alternatively, |SegmentChains| and |GenomicSegments| can be used directly as
dictionary keys::

    >>> my_data = genome_array[my_transcript]
    >>> my_data
        array([ 95, 133,  87, 491, 100,  73, 308, 217, 447, 204, 245, 264, 395,
                76, 387, 319,  43, 171, 101, 152,  58, 167,  53, 397, 228,  44,
               402, 285, 480,  55,   9,  51, 358, 307, 144, 182, 293, 455, 403,
               375, 443, 442, 196, 133, 238, 268,  70,  91, 172, 490, 346, 294,
               117, 424, 180,  14, 400, 414, 257, 127, 344, 307, 435, 152, 458,
               423,  60, 173, 362, 277, 393, 297, 377, 357,  68, 143, 467,   7,
                86, 450, 230,  62, 278,  65, 346, 103, 185,   7, 350, 162, 461,
               274, 205, 185, 209, 275, 190, 434,  10, 241])


Module contents
---------------

Several implementations are provided. These are tailored to data stored in 
specific formats, or for specific purposes. For example, mutable GenomeArrays
allow values to be changed in-place, e.g. via mathematical operations or manual
setting.


====================   ==================================   =====================   ===================================================
**Implementation**     **Format of data**                   **Mutable**             **Use of mapping functions**
--------------------   ----------------------------------   ---------------------   ---------------------------------------------------
|BAMGenomeArray|       `BAM`_ file(s)                       No                      Inline; may be changed without reloading data

|BigWigGenomeArray|    `BigWig`_ file(s)                    No                      n/a

|GenomeArray|          `bowtie`_, `wiggle`_, `bedGraph`_,   Yes                     On import from `bowtie`_ files only. Data must be
                       or data in memory                                            reimported to change mapping

|SparseGenomeArray|    `bowtie`_, `wiggle`_, `bedGraph`_,   Yes                     On import from `bowtie`_ files only. Data must be
                       or data in memory                                            reimported to change mapping
====================   ==================================   =====================   ===================================================


Extended summary & examples
---------------------------


Mapping functions
.................

GenomeArrays use :term:`Mapping functions <mapping rule>` to extract the biology
of interest from read alignments, and to vectorize these data over genomic
positions.

Depending on the data type and the purpose of the experiment, different functions
may be appropriate. The following mapping rules are provided, although users are
encouraged to provide their own if needed. These include:


================================   ===================================================    =====================================
**Mapping strategy**               **Description**                                        **Sample use**
--------------------------------   ---------------------------------------------------    -------------------------------------

Fiveprime                          Each read alignment is mapped to its 5' end, or        RNA-seq, ribosome profiling, DMS-seq
                                   at a fixed distance from its 5' end. 

Variable fiveprime                 Each read alignment is mapped at a fixed distance      Ribosome profiling
                                   from its 5' end, where the distance is determined
                                   by the length of the alignment

Stratified variable fiveprime      Like variable fiveprime mapping, except reads are      Ribosome profiling
                                   stratified by length into separate rows of an array

Threeprime                         Each read alignment is mapped to its 5' end, or        RNA-seq, ribosome profiling
                                   at a fixed distance from its 5' end. 

Center/entire                      Zero or more positions are trimmed from each end       Ribosome profiling with micrococcal
                                   of the read alignment, and the remaining `N`           nuclease, RNA-seq
                                   positions are incremented by `1/N`
================================   ===================================================    =====================================


Mapping functions are rarely called directly. Instead, they are invoked via
:meth:`BAMGenomeArray.set_mapping() <plastid.genomics.genome_array.BAMGenomeArray.set_mapping>`:

 .. code-block:: python

    # open BAM file
    >>> ga = BAMGenomeArray("some_file.bam")

    # set mapping 15 bases from 3' end of read
    >>> ga.set_mapping(ThreePrimeMapFactory(offset=15))

    # set mapping to 5' end of read
    >>> ga.set_mapping(FivePrimeMapFactory())

    # cut 3 nt off each end of the read, and map fractionally over the rest
    >>> ga.set_mapping(CenterMapFactory(nibble=12))

    # access reads in some region
    >>> ga[GenomicSegment("chrI",10000,100000,"+")]
    # output is numpy array

For further details see :mod:`~plastid.genomics.map_factories` and
:doc:`/concepts/mapping_rules`


Export to browser tracks
........................

GenomeArrays can export data to `wiggle`_ or `bedGraph`_ formats. To export 
forward-strand data to `some_file.wig`::

    >>> genome_array.to_bedgraph(open("some_file.wig","w"),"my_track","+")


Normalization
.............
        
GenomeArrays also offer utilities for summing and normalization::

    >>> genome_array.sum() # get sum
    71292045
    
    # set sum to an arbitrary value- can be useful for normalizing between
    # multiple alignment files
    >>> genome_array.set_sum(61241234324)
    >>> genome_array.sum()
    61241234324
    
    # reset sum to data in file
    >>> genome_array.reset_sum()
    >>> genome_array.sum()
    71292045
    
    # with set_normalize(), all arrays will be normalized to counts per million
    # relative to the current sum
    >>> genome_array.set_normalize(True)
    >>> my_transcript.get_counts(genome_array)
    array([ 1.33254699,  1.86556579,  1.22033251,  6.88716392,  1.40268104,
            1.02395716,  4.32025761,  3.04381786,  6.26998426,  2.86146933,
            3.43656855,  3.70307795,  5.54059012,  1.06603759,  5.42837563,
            4.47455253,  0.60315285,  2.39858458,  1.41670785,  2.13207518,
            0.813555  ,  2.34247734,  0.74342095,  5.56864374,  3.19811278,
            0.61717966,  5.63877779,  3.99764097,  6.732869  ,  0.77147457,
            0.12624129,  0.71536733,  5.02159813,  4.3062308 ,  2.0198607 ,
            2.5528795 ,  4.10985545,  6.38219874,  5.6528046 ,  5.26005391,
            6.21387702,  6.19985021,  2.74925484,  1.86556579,  3.33838088,
            3.75918519,  0.98187673,  1.27643975,  2.41261139,  6.87313711,
            4.85327641,  4.12388226,  1.64113682,  5.94736762,  2.52482588,
            0.19637535,  5.61072417,  5.80709952,  3.60489028,  1.78140492,
            4.82522279,  4.3062308 ,  6.10166253,  2.13207518,  6.42427917,
            5.93334081,  0.84160863,  2.4266382 ,  5.07770537,  3.88542649,
            5.5125365 ,  4.1659627 ,  5.28810753,  5.00757132,  0.95382311,
            2.00583389,  6.55052047,  0.09818767,  1.2063057 ,  6.31206469,
            3.2261664 ,  0.86966225,  3.8994533 ,  0.91174268,  4.85327641,
            1.44476147,  2.59495993,  0.09818767,  4.90938365,  2.27234329,
            6.46635961,  3.84334606,  2.87549614,  2.59495993,  2.93160338,
            3.85737287,  2.66509398,  6.08763572,  0.1402681 ,  3.38046131])

    # similarly, with set_normalize() on, browser tracks we'll be normalized 
    >>> genome_array.to_bedgraph(open("some_file_normalized.wig","w"),"my_normalized_track","+")

"""
__date__ = "May 3, 2011"
__author__ = "joshua"
from abc import abstractmethod
import itertools
import operator
import copy
import numpy
import scipy.sparse
import pysam

from collections import OrderedDict

from plastid.readers.wiggle import WiggleReader
from plastid.readers.bowtie import BowtieReader
from plastid.readers.bigwig import BigWigReader
from plastid.genomics.roitools import GenomicSegment, SegmentChain
from plastid.util.services.mini2to3 import xrange, ifilter
from plastid.util.services.exceptions import DataWarning, warn
from plastid.util.io.openers import NullWriter, multiopen

from plastid.genomics.map_factories import *

MIN_CHR_SIZE = int(10 * 1e6)  # 10 Mb minimum size for unspecified chromosomes

#===============================================================================
# INDEX: Mapping functions for GenomeArray and SparseGenomeArray.
#        these are used by add_from_bowtie() and add_from_tagalign()
#        to map read alignments to specific sites.
#
#        See function documentation for more details
#===============================================================================


[docs]def center_map(feature, **kwargs):
    """Center-mapping function used as an argument to :py:meth:`GenomeArray.add_from_bowtie`.
    A user-specified number of bases is optionally removed from each side of each
    read alignment, and the `N` remaining bases are each apportioned `1/N`
    of the read count, so that the entire read is counted once.
    
    Parameters
    ----------
    feature : |SegmentChain|
        Ungapped genomic alignment
        
    kwargs['value'] : float, optional
        Total value to divide over aligning positions (Default: `1.0`)
        
    kwargs['nibble'] : int, optional
        Positions to remove from each end before mapping (Default: `0`)
        
    kwargs['offset'] : int, optional
        Mapping offset, if any, from 5' end of read (Default: `0`)
    
    Returns
    -------
    list
        tuples of `(GenomicSegment,float value over segment)`
    """
    nibble = kwargs["nibble"]
    read_length = feature.length
    if read_length <= 2 * nibble:
        warn(
            "File contains read alignments shorter (%s nt) than `2*'nibble'` value of %s nt. Ignoring these."
            % (read_length, 2 * nibble), DataWarning
        )
        return []

    # yapf: disable
    span = feature.spanning_segment
    strand = span.strand
    chrom  = span.chrom
    offset = kwargs.get("offset", 0)
    value  = float(kwargs.get("value", 1.0))
    sign   = -1 if strand == "-" else 1
    start  = span.start + nibble + sign * offset
    end    = span.end - nibble + sign * offset
    seg    = GenomicSegment(chrom, start, end, strand)
    frac   = value / len(seg)
    return [(seg, frac)]


[docs]def five_prime_map(feature, **kwargs):
    """Fiveprime mapping function used as an argument to :py:meth:`GenomeArray.add_from_bowtie`.
    Reads are mapped at a user-specified offset from the fiveprime end of the alignment.
    
    Parameters
    ----------
    feature : |SegmentChain|
        Ungapped genomic alignment
        
    kwargs['value'] : float or int, optional
        Value to apportion (Default: `1`)
        
    kwargs['offset'] : int, optional
        Mapping offset, if any, from 5' toward 3' end of read
    
    Returns
    -------
    list
        tuples of `(GenomicSegment,float value over segment)`
    """
    offset = kwargs.get("offset", 0)
    read_length = feature.length
    if offset > read_length:
        warn(
            "File contains read alignments shorter (%s nt) than offset (%s nt). Ignoring." %
            (read_length, offset), DataWarning
        )
        return []

    value = kwargs.get("value", 1.0)
    span = feature.spanning_segment
    strand = span.strand
    if strand in ("+", "."):
        start = span.start + offset
    else:
        start = span.end - 1 - offset
    seg = GenomicSegment(span.chrom, start, start + 1, strand)
    return [(seg, value)]


[docs]def three_prime_map(feature, **kwargs):
    """Threeprime mapping function used as an argument to :py:meth:`GenomeArray.add_from_bowtie`.
    Reads are mapped at a user-specified offset from the threeprime end of the alignment.
    
    Parameters
    ----------
    feature : |SegmentChain|
        Ungapped genomic alignment
        
    kwargs['value'] : float or int, optional
        Value to apportion (Default: `1`)
        
    kwargs['offset'] : int, optional
        Mapping offset, if any, from 3' toward 5' end of read.
    
    Returns
    -------
    list
        tuples of `(GenomicSegment,float value over segment)`
    """
    offset = kwargs.get("offset", 0)
    read_length = feature.length
    if offset > read_length:
        warn(
            "File contains read alignments shorter (%s nt) than offset (%s nt). Ignoring." %
            (read_length, offset), DataWarning
        )
        return []

    value = kwargs.get("value", 1.0)
    span = feature.spanning_segment
    strand = span.strand
    if strand in ("+", "."):
        start = span.end - 1 - offset
    else:
        start = span.start + offset
    seg = GenomicSegment(span.chrom, start, start + 1, strand)
    return [(seg, value)]


[docs]def variable_five_prime_map(feature, **kwargs):
    """Fiveprime variable mapping function used as an argument to :py:meth:`GenomeArray.add_from_bowtie`.
    Reads are mapped at a user-specified offset from the fiveprime end of the alignment.
    The offset for a read of a given length is supplied in `kwargs[offset][readlen]`
    
    Parameters
    ----------
    feature : |SegmentChain|
        Ungapped genomic alignment
        
    kwargs['offset'] : dict, required
        Dictionary mapping read lengths to offsets

    kwargs['value'] : float or int, optional
        Value to apportion (Default: `1`)
    
    Returns
    -------
    list
        tuples of `(GenomicSegment,float value over segment)`
    """
    span = feature.spanning_segment
    strand = span.strand
    value = kwargs.get("value", 1.0)
    offset = kwargs["offset"].get(len(span), kwargs["offset"].get("default", None))
    feature_length = feature.length
    if offset is None:
        warn("No offset for reads of length %s. Ignoring." % feature_length, DataWarning)
        return []
    if offset >= feature_length:
        warn("Offset (%s nt) longer than read length %s. Ignoring" % (offset, feature_length))

    if strand in ("+", "."):
        start = span.start + offset
    else:
        start = span.end - 1 - offset
    seg = GenomicSegment(span.chrom, start, start + 1, strand)
    return [(seg, value)]


#===============================================================================
# GenomeArray classes
#===============================================================================

_DEFAULT_STRANDS = ("+", "-")


class AbstractGenomeArray(object):
    """Abstract base class for all |GenomeArray|-like objects"""

    def __init__(self, chr_lengths=None, strands=None):
        """
        Parameters
        ----------
        chr_lengths : dict
            Dictionary mapping chromosome names to lengths. Suppyling this in
            advance yields considerable speed improvements, as memory can be
            pre-allocated for each chromosomal position. If not provided, a
            minimum chromosome size will be guessed, and chromosomes re-sized
            as needed.
        
        strands : sequence, optional
            Strands for the |AbstractGenomeArray|. (Default: `("+","-")`)
        """
        self._chroms = {}
        self._strands = _DEFAULT_STRANDS if strands is None else strands
        self._sum = None
        self._normalize = False

    def __str__(self):
        return repr(self)

    def __repr__(self):
        stmp = "<%s len=%s sum=%s chroms=" % (self.__class__.__name__, len(self), self.sum())
        stmp += ",".join(self.chroms())
        stmp += " strands=%s" % ",".join(self.strands())
        stmp += ">"
        return stmp

    def __contains__(self, chrom):
        """Return `True` if `chrom` is defined in the array, otherwise False
        
        Returns
        -------
        bool
        """
        return chrom in self._chroms

    def __len__(self):
        """Return size of GenomeArray in nucleotide positions (not base pairs).
        To obtain size in base pairs, divide length by two.
        
        Returns
        -------
        int
        """
        return len(self._strands) * sum(self.lengths().values())

    @abstractmethod
    def __getitem__(self, roi):
        """Retrieve array of counts from a region of interest. The values in
        the returned array are in 5' to 3' with respect to `seg` rather than
        the genome (i.e. are reversed for reverse-strand features).
        
        Parameters
        ----------
        roi : |GenomicSegment| or |SegmentChain|
            Region of interest in genome
        
        Returns
        -------
        :class:`numpy.ndarray`
            vector of numbers, each position corresponding to a position
            in `seg`, from 5' to 3' relative to `seg`
        
        See also
        --------
        plastid.genomics.roitools.SegmentChain.get_counts
            Fetch a spliced vector of data covering a |SegmentChain|
        """
        pass

    @abstractmethod
    def reset_sum(self):
        """Reset sum to total mapped reads in the GenomeArray"""

    def set_sum(self, val):
        """Set sum used for normalization to an arbitrary value (e.g. from another dataset)
        
        Parameters
        ----------
        val : int or float
            a number
        """
        self._sum = val

    def sum(self):
        """Return the total number of aligned reads or the sum of the 
        quantitative data across all positions in the GenomeArray
        
        Notes
        -----
        The true (i.e. unnormalized) sum is always reported, even if 
        :meth:`set_normalize` is set to True
        
        Returns
        -------
        int or float
        """
        if self._sum is None:
            self.reset_sum()

        return self._sum

    def set_normalize(self, value=True):
        """Toggle normalization of reported values to reads per million mapped in the dataset
        
        Parameters
        ----------
        value : bool
            If `True`, all values fetched will be normalized to reads
            per million. If `False`, all values will not be normalized.
        """
        assert value in (True, False)
        self._normalize = value

    def chroms(self):
        """Return a list of chromosomes in the GenomeArray
        
        Returns
        -------
        list
            Chromosome names as strings
        """
        return self._chroms.keys()

    def strands(self):
        """Return a tuple of strands in the GenomeArray
        
        Returns
        -------
        tuple
            Chromosome strands as strings
        """
        return self._strands

    def lengths(self):
        """Return a dictionary mapping chromosome names to lengths.
        When two strands report different lengths for a chromosome, the
        max length is taken.
        
        Returns
        -------
        dict
            Dictionary mapping chromosome names to chromosome lengths
        """
        d_out = {}.fromkeys(self.keys())
        for key in d_out:
            d_out[key] = max([len(self._chroms[key][X]) for X in self.strands()])
        return d_out


[docs]class MutableAbstractGenomeArray(AbstractGenomeArray):
    """Abstract base class for |GenomeArray|-like objects whose values can be
    changed
    """

    @abstractmethod
    def __setitem__(self, seg, val):
        """Set values in |MutableAbstractGenomeArray| over a region of interest.
        
        Parameters
        ----------
        seg : |GenomicSegment| or |SegmentChain|
            Region of interest

        val : int, float, or :class:`numpy.ndarray`
            Scalar or vector of values to set in array over `seg`.
            If a vector, values should be ordered 5'-to-3' relative
            to `seg` rather than (i.e. for a reverse-strand feature,
            position 0 in the vector would correspond to seg.end)
            in the genome
        """
        pass


[docs]class BAMGenomeArray(AbstractGenomeArray):
    """
    BAMGenomeArray(*bamfiles,mapping=CenterMapFactory())
    
    A GenomeArray for :term:`read alignments` in `BAM`_ files.
    
    When a user requests :term:`read alignments` or :term:`counts` at
    a set of genomic positions, |BAMGenomeArray| uses a :term:`mapping rule`
    to determine which reads from the `BAM`_ file should be returned and how
    these reads should be mapped to positions in a vector. Reads
    are not mapped until requested, and mapping rules are changeable at
    runtime via :meth:`BAMGenomeArray.set_mapping`
    
    Parameters
    ----------
    bamfile : One or more filenames or open :class:`pysam.AlignmentFile`
        Filename(s) of `BAM`_ files, or open :class:`pysam.AlignmentFile`
        to include in array. Note: all `BAM`_ files must be sorted and indexed
        by `samtools`_. 
    
    mapping : func
        :term:`mapping function` that determines how each read alignment is mapped to a 
        count at a genomic position. Examples include mapping reads to their
        fiveprime ends, mapping reads to their threeprime ends, or mapping
        somewhere in the middle. Factories to produce such functions are provided.
        See references below. (Default: :func:`CenterMapFactory`)

    
    Attributes
    ----------
    map_fn : function
        :term:`mapping function` used to convert :term:`read alignments` to :term:`counts`
    
    bamfiles : list
        `BAM`_ files or :term:`read alignments`, as a list of :py:class:`pysam.AlignmentFile`

    
    Notes
    -----
    The alignment data in |BAMGenomeArray| objects is immutable. If you need
    to change the data in-place, the |BAMGenomeArray| can be converted to a
    |GenomeArray| or |SparseGenomeArray| via :meth:`~BAMGenomeArray.to_genome_array`
    """

    def __init__(self, *bamfiles, **kwargs):  #mapping=None):
        """Create a |BAMGenomeArray|
        
        Parameters
        ----------
        bamfile : One or more filenames or open :class:`pysam.AlignmentFile`
            Filename(s) of `BAM`_ files, or open :class:`pysam.AlignmentFile`
            to include in array. Note: all `BAM`_ files must be sorted and indexed
            by `samtools`_. 
        
        mapping : func
            :term:`mapping function` that determines how each read alignment is mapped to a 
            count at a genomic position. Examples include mapping reads to their
            fiveprime ends, mapping reads to their threeprime ends, or mapping
            somewhere in the middle. Factories to produce such functions are provided.
            See references below. (Default: :func:`CenterMapFactory`)

        See also
        --------
        plastid.genomics.map_factories.FivePrimeMapFactory
            map reads to 5' ends, with or without applying an offset
        
        plastid.genomics.map_factories.VariableFivePrimeMapFactory
            map reads to 5' ends, choosing an offset determined by read length
        
        plastid.genomics.map_factories.ThreePrimeMapFactory
            map reads to 3' ends, with or without applying an offset
        
        plastid.genomics.map_factories.CenterMapFactory
            map each read fractionally to every position in the read, optionally trimming positions from the ends first
        """
        if len(bamfiles) == 1 and isinstance(bamfiles[0], list):
            bamfiles = bamfiles[0]

        bamfiles = list(multiopen(bamfiles, fn=pysam.AlignmentFile, args=("rb", )))
        #bamfiles = [pysam.AlignmentFile(X,"rb") if isinstance(X,str) else X for X in bamfiles]
        self.bamfiles = bamfiles
        self.map_fn = kwargs.get("mapping", CenterMapFactory())  # if mapping is None else mapping
        self._strands = ("+", "-", ".")
        self._normalize = False

        self._chr_lengths = {}
        for bamfile in self.bamfiles:
            for k, v in zip(bamfile.references, bamfile.lengths):
                self._chr_lengths[k] = max(self._chr_lengths.get(k, 0), v)

        self._chroms = sorted(list(self._chr_lengths.keys()))

        self._filters = OrderedDict()
        self._update()

    def __del__(self):
        for bamfile in self.bamfiles:
            bamfile.close()

[docs]    def reset_sum(self):
        """Reset the sum to the total number of mapped reads in the |BAMGenomeArray|
        
        Notes
        -----
        Filters are not applied in this summation. See :meth:`BAMGenomeArray.set_sum`
        to manually set a sum, which e.g. could correspond to a total number of filtered
        reads.
        """
        self._sum = sum([X.mapped for X in self.bamfiles])

    def _update(self):
        """Updates mapping function to suit mapping rules
        """
        self.reset_sum()

[docs]    def add_filter(self, name, func):
        """Apply a function to filter reads retrieved from regions before mapping and counting
        
        Parameters
        ----------
        name : str
            A name for the filter. If not unique, will overwrite
            previous filter
        
        func : func
            Filter function. Function must take a
            :class:`pysam.AlignedSegment` as a single parameter, and return
            `True` if that read should be included in output, or `False` otherwise
        
        Notes
        -----
        In Python, `lambda` functions do NOT have their own scope! We strongly
        recomend defining filter functions using the ``def`` syntax to avoid
        namespace collisions.
        
        See also
        --------
        plastid.genomics.map_factories.SizeFilterFactory
            generate filter functions that gate read alignments on size
        """
        self._filters[name] = func

[docs]    def remove_filter(self, name):
        """Remove a generic filter
        
        Parameters
        ----------
        name : str
            A name for the filter
        
        Returns
        -------
        func
            the removed filter function
        """
        retval = self._filters.pop(name)
        return retval

[docs]    def chroms(self):
        """Returns a list of chromosomes
        
        Returns
        -------
        list
            sorted chromosome names as strings
        """
        return self._chroms

[docs]    def lengths(self):
        """Returns a dictionary mapping chromosome names to lengths. 
        
        Returns
        -------
        dict
            mapping chromosome names to lengths
        """
        return self._chr_lengths

[docs]    def get_reads_and_counts(self, roi, roi_order=True):
        """Return :term:`read alignments` covering a |GenomicSegment|, and a
        count vector mapping reads to each positions in the |GenomicSegment|,
        following the rule specified by :meth:`~BAMGenomeArray.set_mapping`.
        Reads are strand-matched to `roi` by default. To obtain unstranded reads,
        set the value of `roi.strand` to `'.'`
        
        Parameters
        ----------
        roi : |GenomicSegment|
            Region of interest

        
        Returns
        -------
        list
            List of reads (as :class:`pysam.AlignedSegment`)
            covering region of interest

        numpy.ndarray
            Counts at each position of `roi`, under whatever mapping rule
            was set by :py:meth:`~BAMGenomeArray.set_mapping`


        Raises
        ------
        ValueError
            if bamfiles not sorted or not indexed
        """
        # fetch reads
        chrom = roi.chrom
        strand = roi.strand
        start = roi.start
        end = roi.end

        if chrom not in self.chroms():
            # FIXME: generalize to N-D
            shape = [1] + getattr(self.map_fn, "shape", [])
            return [], numpy.zeros(shape)

        reads = itertools.chain.from_iterable(
            (
                X.fetch(
                    reference=chrom,
                    start=start,
                    end=end,
                    # until_eof=True, # this could speed things up. need to test/investigate
                ) for X in self.bamfiles
            )
        )

        # filter by strand
        if strand == "+":
            reads = ifilter(lambda x: x.is_reverse is False, reads)
        elif strand == "-":
            reads = ifilter(lambda x: x.is_reverse is True, reads)

        # Pass through additional filters (e.g. size filters, if they have
        # been added)
        for my_filter in self._filters.values():
            reads = ifilter(my_filter, reads)

        # retrieve selected parts of regions
        reads, count_array = self.map_fn(list(reads), roi)

        # normalize to reads per million if normalization flag is set
        if self._normalize is True:
            count_array = count_array / float(self.sum()) * 1e6

        if roi_order == True and strand == "-":
            count_array = count_array[..., ::-1]

        return reads, count_array

[docs]    def get_reads(self, roi):
        """Returns reads covering a |GenomicSegment|. Reads are strand-matched
        to `roi` by default, and are included or excluded depending upon the
        rule specified by :meth:`~BAMGenomeArray.set_mapping`. To obtain unstranded
        reads, set the value of `roi.strand` to `'.'`
        
        Parameters
        ----------
        roi : |GenomicSegment|
            Region of interest

        
        Returns
        -------
        list
            List of reads (as :class:`pysam.AlignedSegment`) covering region of interest,
            according to the rule set in :meth:`~BAMGenomeArray.set_mapping`


        Raises
        ------
        ValueError
            if bamfiles are not sorted or not indexed
        """
        reads, _ = self.get_reads_and_counts(roi)
        return reads

    def __getitem__(self, roi):
        """Retrieve array of counts from a region of interest, following
        the mapping rule set by :meth:`~BAMGenomeArray.set_mapping`.
        Values in the vector are ordered 5' to 3' relative to `roi`
        rather than the genome (i.e. are reversed for reverse-strand
        features).
        
        Parameters
        ----------
        roi : |GenomicSegment| or |SegmentChain|
            Region of interest in genome

        Raises
        ------
        ValueError
            if bamfiles not sorted or not indexed

        Returns
        -------
        numpy.ndarray
            vector of numbers, each position corresponding to a position
            in `roi`, from 5' to 3' relative to `roi`
        
        See also
        --------
        plastid.genomics.roitools.SegmentChain.get_counts
            Fetch a spliced vector of data covering a |SegmentChain|
        """
        return self.get(roi, roi_order=True)

[docs]    def get(self, roi, roi_order=True):
        """Retrieve array of counts from a region of interest, following
        the mapping rule set by :meth:`~BAMGenomeArray.set_mapping`.
        Values in the vector are ordered 5' to 3' relative to `roi`
        rather than the genome (i.e. are reversed for reverse-strand
        features).
        
        Parameters
        ----------
        roi : |GenomicSegment| or |SegmentChain|
            Region of interest in genome
        
        roi_order : bool, optional
            If `True` (default) return vector of values 5' to 3' 
            relative to vector rather than genome.

        Raises
        ------
        ValueError
            if bamfiles not sorted or not indexed

        Returns
        -------
        numpy.ndarray
            vector of numbers, each position corresponding to a position
            in `roi`, from 5' to 3' relative to `roi`
        
        See also
        --------
        plastid.genomics.roitools.SegmentChain.get_counts
            Fetch a spliced vector of data covering a |SegmentChain|
        """
        if isinstance(roi, SegmentChain):
            return roi.get_counts(self)

        _, count_array = self.get_reads_and_counts(roi, roi_order=roi_order)

        return count_array

[docs]    def get_mapping(self):
        """Return the docstring of the current mapping function
        """
        return self.map_fn.__doc__

[docs]    def set_mapping(self, mapping_function):
        """Change the mapping rule
        
        Parameters
        ----------
        mapping : func
            Function that determines how each read alignment is mapped to a 
            count at a position in the |BAMGenomeArray|. Examples include
            mapping reads to their fiveprime or threeprime ends, with or
            without offsets. Factories to produce such functions are provided.
            See references below.

            
        See also
        --------
        plastid.genomics.map_factories.FivePrimeMapFactory
            map reads to 5' ends, with or without applying an offset
        
        plastid.genomics.map_factories.VariableFivePrimeMapFactory
            map reads to 5' ends, choosing an offset determined by read length
        
        plastid.genomics.map_factories.ThreePrimeMapFactory
            map reads to 3' ends, with or without applying an offset
        
        plastid.genomics.map_factories.CenterMapFactory
            map each read fractionally to every position in the read, optionally trimming positions from the ends first
        """
        self.map_fn = mapping_function
        self._update()

[docs]    def to_genome_array(self, array_type=None):
        """Converts |BAMGenomeArray| to a |GenomeArray| or |SparseGenomeArray|
        under the mapping rule set by :meth:`~BAMGenomeArray.set_mapping`

        Parameters
        ----------
        array_type : class
            Type of GenomeArray to return. |GenomeArray| or |SparseGenomeArray|.
            (Default: |GenomeArray|)
        
        Returns
        -------
        |GenomeArray| or |SparseGenomeArray|
        """
        if array_type is None:
            array_type = GenomeArray

        ga = array_type(chr_lengths=self.lengths(), strands=self.strands())
        for chrom in self.chroms():
            for strand in self.strands():
                seg = GenomicSegment(chrom, 0, self.lengths()[chrom] - 1, strand)
                ga[seg] = self[seg]

        return ga

[docs]    def to_variable_step(self, fh, trackname, strand, window_size=100000, printer=None, **kwargs):
        """Write the contents of the |BAMGenomeArray| to a variableStep `Wiggle`_ file
        under the mapping rule set by :meth:`~BAMGenomeArray.set_mapping`.
        
        See the `Wiggle spec <http://genome.ucsc.edu/goldenpath/help/wiggle.html>`_ for format details.
        
        Parameters
        ----------
        fh : file-like
            Filehandle to write to
        
        trackname : str
            Name of browser track
          
        strand : str
            Strand of |BAMGenomeArray| to export. `'+'`, `'-'`, or `'.'`
        
        window_size : int
            Size of chromosome/contig to process at a time.
            Larger values are faster but less memory-efficient
        
        printer : file-like, optional
            Something implementing a write() method for output

        **kwargs
            Any other key-value pairs to include in track definition line
        """
        assert strand in self.strands()
        printer = NullWriter() if printer is None else printer
        fh.write("track type=wiggle_0 name=%s" % trackname)
        if kwargs is not None:
            for k, v in sorted(kwargs.items(), key=lambda x: x[0]):
                fh.write(" %s=%s" % (k, v))
        fh.write("\n")

        for chrom in sorted(self.chroms()):
            my_size = self.lengths()[chrom]
            printer.write("Writing chromosome %s..." % chrom)
            fh.write("variableStep chrom=%s span=1\n" % chrom)
            window_starts = xrange(0, self._chr_lengths[chrom], window_size)
            for my_start in window_starts:
                my_end = min(my_start + window_size, my_size)
                my_counts = self.get(
                    GenomicSegment(chrom, my_start, my_end, strand), roi_order=False
                )
                if my_counts.sum() > 0:
                    for idx in my_counts.nonzero()[0]:
                        genomic_x = my_start + idx
                        val = my_counts[idx]
                        fh.write("%s\t%s\n" % (genomic_x + 1, val))

[docs]    def to_bedgraph(self, fh, trackname, strand, window_size=100000, printer=None, **kwargs):
        """Write the contents of the |BAMGenomeArray| to a `bedGraph`_ file
        under the mapping rule set by :meth:`~BAMGenomeArray.set_mapping`.
        
        See the `bedGraph spec <https://cgwb.nci.nih.gov/goldenPath/help/bedgraph.html>`_
        for format details.
            
        Parameters
        ----------
        fh : file-like
            Filehandle to write to
        
        trackname : str
            Name of browser track
          
        strand : str
            Strand of |BAMGenomeArray| to export. `'+'`, `'-'`, or `'.'`
        
        window_size : int
            Size of chromosome/contig to process at a time.
            Larger values are faster but less memory-efficient
         
        printer : file-like, optional
            Something implementing a write() method for output

        **kwargs
            Any other key-value pairs to include in track definition line
        """
        assert strand in self.strands()
        assert window_size > 0
        printer = NullWriter() if printer is None else printer
        # write header
        fh.write("track type=bedGraph name=%s" % trackname)
        if kwargs is not None:
            for k, v in sorted(kwargs.items(), key=lambda x: x[0]):
                fh.write(" %s=%s" % (k, v))
        fh.write("\n")

        for chrom in sorted(self.chroms()):
            printer.write("Writing chromosome %s..." % chrom)
            window_starts = xrange(0, self._chr_lengths[chrom], window_size)
            my_size = self.lengths()[chrom]
            for my_start in window_starts:
                my_end = min(my_start + window_size, my_size)
                my_counts = self.get(
                    GenomicSegment(chrom, my_start, my_end, strand),
                    roi_order=False
                )

                if my_counts.sum() > 0:
                    genomic_start_x = my_start
                    last_val = my_counts[0]

                    for x, val in enumerate(my_counts[1:]):
                        if val != last_val:
                            genomic_end_x = 1 + x + my_start
                            #write line: chrom chromStart chromEnd dataValue. 0-based half-open
                            if last_val > 0:
                                fh.write(
                                    "%s\t%s\t%s\t%s\n" %
                                    (chrom, genomic_start_x, genomic_end_x, last_val)
                                )

                            #update variables
                            last_val = val
                            genomic_start_x = genomic_end_x
                        else:
                            continue
                    # write out last values for window
                    if last_val > 0:
                        fh.write("%s\t%s\t%s\t%s\n" % (chrom, genomic_start_x, my_end, last_val))


[docs]class BigWigGenomeArray(AbstractGenomeArray):
    """BigWigGenomeArray(maxmem=0)
    
    High-performance GenomeArray for `BigWig`_ files.

    Parameters
    ----------
    maxmem : float
        Maximum desired memory footprint for C objects, in megabytes.
        May be temporarily exceeded if large queries are requested.
        (Default: 0, No maximum)        
    
    """

    def __init__(self, maxmem=0, **kwargs):
        """Create a |BigWigGenomeArray|.
        
        `BigWig`_ files may be added to the array via
        :meth:`~BigWigGenomeArray.add_from_bigwig`.
        
        Parameters
        ----------
        maxmem : float
            Maximum desired memory footprint for C objects, in megabytes.
            May be temporarily exceeded if large queries are requested.
            (Default: 0, No maximum)        
        """
        self._strand_dict = {}
        self._normalize = False
        self._chromset = None
        self._sum = None
        self._lengths = None
        self._strands = []
        self._maxmem = maxmem
        self.fill = 0.0  # fill

    def __getitem__(self, roi):
        """Retrieve array of counts from a region of interest.
        
        Parameters
        ----------
        roi : |GenomicSegment| or |SegmentChain|
            Region of interest in genome

        Returns
        -------
        numpy.ndarray
            vector of numbers, each position corresponding to a position
            in `roi`, from 5' to 3' relative to `roi`
        
        See also
        --------
        plastid.genomics.roitools.SegmentChain.get_counts
            Fetch a spliced vector of data covering a |SegmentChain|
        """
        return self.get(roi, roi_order=True)

[docs]    def get(self, roi, roi_order=True):
        """Retrieve array of counts from a region of interest.
        
        Parameters
        ----------
        roi : |GenomicSegment| or |SegmentChain|
            Region of interest in genome
        
        roi_order : bool, optional
            If `True` (default) return vector of values 5' to 3' 
            relative to vector rather than genome.

        Returns
        -------
        numpy.ndarray
            vector of numbers, each position corresponding to a position
            in `roi`, from 5' to 3' relative to `roi`
        
        See also
        --------
        plastid.genomics.roitools.SegmentChain.get_counts
            Fetch a spliced vector of data covering a |SegmentChain|
        """
        if isinstance(roi, SegmentChain):
            return roi.get_counts(self)

        strand = roi.strand
        sdict = self._strand_dict

        count_vec = numpy.zeros(len(roi))
        if strand in sdict:
            for bw in sdict[strand]:
                 # we flip strand later to save operations
                count_vec += bw.get(roi, roi_order=False)
        else:
            warnings.warn(
                "Strand '%s' not in BigWigGenomeArray (has %s)." %
                (strand, ", ".join(sdict.keys())), DataWarning
            )

        if self._normalize is True:
            count_vec = count_vec / float(self.sum()) * 1e6

        if roi_order == True and strand == "-":
            count_vec = count_vec[::-1]

        return count_vec

[docs]    def strands(self):
        """Return a tuple of strands in the GenomeArray
        
        Returns
        -------
        tuple
            Chromosome strands as strings
        """
        return self._strands

[docs]    def chroms(self):
        """Return a list of chromosomes in the GenomeArray
        
        Returns
        -------
        list
            Chromosome names as strings
        """
        if self._chromset is not None:
            return self._chromset
        else:
            chromset = []
            for ltmp in self._strand_dict.values():
                for bw in ltmp:
                    chromset.extend(bw.chroms.keys())

            self._chromset = set(chromset)
            return self._chromset

[docs]    def lengths(self):
        """Return a dictionary mapping chromosome names to lengths.
        When two strands report different lengths for a chromosome, the
        max length is taken.
        
        Returns
        -------
        dict
            Dictionary mapping chromosome names to chromosome lengths
        """
        if self._lengths is not None:
            return self._lengths
        else:
            lengths = {}
            for ltmp in self._strand_dict.values():
                for bw in ltmp:
                    sizedict = bw.chroms
                    for k, v in sizedict.items():
                        lengths[k] = max(lengths.get(k, 0), v)

            self._lengths = lengths
            return lengths

[docs]    def add_from_bigwig(self, filename, strand):
        """Import additional data from a `BigWig`_ file
        
        Parameters
        ----------
        filename : str
            Path to `BigWig`_ file
        
        strand : str
            Strand to which data should be added. `'+'`, `'-'`, or `'.'`
        """
        self._chromset = None
        self._lengths = None
        self._sum = None
        bw = BigWigReader(filename, fill=self.fill, maxmem=self._maxmem)

        try:
            self._strand_dict[strand].append(bw)
        except KeyError:
            self._strand_dict[strand] = [bw]

        self._strands = sorted(self._strand_dict.keys())

[docs]    def reset_sum(self):
        """Reset sum to total of data in the |BigWigGenomeArray|"""
        my_sum = 0
        for ltmp in self._strand_dict.values():
            for bw in ltmp:
                my_sum += bw.sum()

        self._sum = my_sum
        return my_sum

[docs]    def to_genome_array(self):
        """Converts |BigWigGenomeArray| to a |GenomeArray|

        Returns
        -------
        |GenomeArray|
        """
        ga = GenomeArray(chr_lengths=self.lengths(), strands=self.strands())
        for chrom, length in self.lengths().items():
            for strand in self._strands:
                region = GenomicSegment(chrom, 0, length, strand)
                for reader in self._strand_dict.get(strand, []):
                    old = ga.get(region, roi_order=False)
                    new = old + reader.get_chromosome_counts(chrom)
                    ga.__setitem__(region, new, roi_order=False)

        return ga


[docs]class GenomeArray(MutableAbstractGenomeArray):
    """Array-like data structure that maps numerical values (e.g. read :term:`counts`
    or conservation data, et c) to nucleotide positions in a genome.

    Data in the array may be manipulated manually and/or imported from `Wiggle`_
    or `BedGraph`_ files, as well as `bowtie`_ alignments.
    
    |GenomeArray| supports basic mathematical operations elementwise, where elements
    are nucleotide positions.
    
    Parameters
    ----------
    chr_lengths : dict or `None`, optional
        Dictionary mapping chromosome names to lengths.
        Suppyling this in advance yields considerable
        speed improvements, as memory can be pre-allocated
        for each chromosomal position. If not provided,
        a minimum chromosome size will be guessed, and
        chromosomes re-sized as needed.
    
    min_chr_size : int
        If chr_lengths is not supplied, `min_chr_size` is 
        the default first guess of a chromosome size. If
        the genome has large chromosomes, it is much
        much better, speed-wise, to provide `chr_lengths`
        than to provide a guess here that is too small.
    
    strands : sequence
        Sequence of strand names for the |GenomeArray|. (Default: `('+','-')`)    
    """

    def __init__(self, chr_lengths=None, strands=None, min_chr_size=MIN_CHR_SIZE):
        """Create a |GenomeArray|
        
        Parameters
        ----------
        chr_lengths : dict or `None`, optional
            Dictionary mapping chromosome names to lengths.
            Suppyling this in advance yields considerable
            speed improvements, as memory can be pre-allocated
            for each chromosomal position. If not provided,
            a minimum chromosome size will be guessed, and
            chromosomes re-sized as needed.
        
        min_chr_size : int
            If chr_lengths is not supplied, `min_chr_size` is 
            the default first guess of a chromosome size. If
            the genome has large chromosomes, it is much
            much better, speed-wise, to provide `chr_lengths`
            than to provide a guess here that is too small.
        
        strands : sequence
            Sequence of strand names for the |GenomeArray|. (Default: `('+','-')`)
        """
        self._chroms = {}
        self._strands = _DEFAULT_STRANDS if strands is None else strands
        self.min_chr_size = min_chr_size
        self._sum = None
        self._normalize = False
        if chr_lengths is not None:
            for chrom in chr_lengths.keys():
                self._chroms[chrom] = {}
                for strand in self._strands:
                    l = chr_lengths[chrom]
                    self._chroms[chrom][strand] = numpy.zeros(l)

[docs]    def reset_sum(self):
        """Reset the sum of the |GenomeArray| to the sum of all positions in the array
        """
        self._sum = sum([X.sum() for X in self.iterchroms()])

    def _has_same_dimensions(self, other):
        """Return `True` if `self` and `other` have the chromosomes, strands, and chromosome lengths
        
        Parameters
        ----------
        other : |GenomeArray|
        
        
        Returns
        -------
        bool
        """
        assert set(self.keys()) == set(other.keys())
        assert self.strands() == other.strands()
        assert self.lengths() == other.lengths()

    def __eq__(self, other, tol=1e-10):
        """Test equality between `self` and `other`.
        
        To be equal, both |GenomeArrays| must have identical values at all
        nonzero positions in either array. Chromosome lengths need not be equal. 
        
        Parameters
        ----------
        other : |GenomeArray| or |SparseGenomeArray|
        
        
        Returns
        -------
        bool
        """
        snz = self.nonzero()
        onz = other.nonzero()
        for chrom in set(self.chroms()) | set(other.chroms()):
            for strand in set(self.strands()) | set(other.strands()):
                self_nonzero_vec = snz.get(chrom,
                                           {K: numpy.array([])
                                            for K in self.strands()}).get(strand, numpy.array([]))
                other_nonzero_vec = onz.get(chrom, {K: numpy.array([])
                                                    for K in other.strands()
                                                    }).get(strand, numpy.array([]))

                if len(self_nonzero_vec) != len(other_nonzero_vec):
                    return False

                # indices of nonzero positions must be same
                if (self_nonzero_vec != other_nonzero_vec).any():
                    return False

                # values of nonzero positions must be same
                if len(self_nonzero_vec) > 0:
                    test_seg = GenomicSegment(
                        chrom, self_nonzero_vec.min(), self_nonzero_vec.max(), strand
                    )
                    if not (self[test_seg] - other[test_seg] <= tol).all():
                        return False

        return True

    def __getitem__(self, roi):
        """Retrieve array of counts from a region of interest (`roi`)
        with values in vector ordered 5' to 3' relative to `roi`
        rather than genome (i.e. are reversed for reverse-strand
        features).
        
        Parameters
        ----------
        roi : |GenomicSegment| or |SegmentChain|
            Region of interest in genome
        
        Returns
        -------
        :class:`numpy.ndarray`
            vector of numbers, each position corresponding to a position
            in `roi`, from 5' to 3' relative to `roi`
        
        See also
        --------
        plastid.genomics.roitools.SegmentChain.get_counts
            Fetch a spliced vector of data covering a |SegmentChain|
        """
        return self.get(roi, roi_order=True)

[docs]    def get(self, roi, roi_order=True):
        """Retrieve array of counts from a region of interest (`roi`)
        with values in vector ordered 5' to 3' relative to `roi`
        rather than genome (i.e. are reversed for reverse-strand
        features).
        
        Parameters
        ----------
        roi : |GenomicSegment| or |SegmentChain|
            Region of interest in genome
        
        roi_order : bool, optional
            If `True` (default) return vector of values 5' to 3' 
            relative to vector rather than genome.

        Returns
        -------
        :class:`numpy.ndarray`
            vector of numbers, each position corresponding to a position
            in `roi`, from 5' to 3' relative to `roi`
        
        See also
        --------
        plastid.genomics.roitools.SegmentChain.get_counts
            Fetch a spliced vector of data covering a |SegmentChain|
        """
        if isinstance(roi, SegmentChain):
            return roi.get_counts(self)

        chrom = roi.chrom
        strand = roi.strand
        start = roi.start
        end = roi.end
        try:
            assert roi.end < len(self._chroms[chrom][strand])
        except AssertionError:
            my_len = len(self._chroms[chrom][strand])
            new_size = max(my_len + 10000, end + 10000)
            for my_strand in self.strands():
                new_strand = copy.deepcopy(self._chroms[roi.chrom][my_strand])
                new_strand.resize(new_size)
                self._chroms[chrom][my_strand] = new_strand
        except KeyError:
            assert strand in self.strands()
            if chrom not in self.keys():
                self._chroms[chrom] = {}
                for my_strand in self.strands():
                    self._chroms[chrom][my_strand] = numpy.zeros(self.min_chr_size)

        vals = self._chroms[chrom][strand][start:end]
        if self._normalize is True:
            vals = 1e6 * vals / self.sum()

        if roi_order == True and strand == "-":
            vals = vals[::-1]

        return vals

    def __setitem__(self, seg, val, roi_order=True):
        """Set values in the |GenomeArray| over a region of interest.
        
        If the `seg` is outside the bounds of the current |GenomeArray|,
        the |GenomeArray| will be automatically expanded to accomodate
        the genomic coordinates of `seg`.

        Parameters
        ----------
        seg : |GenomicSegment| or |SegmentChain|
            Region of interest

        val : int, float, or :class:`numpy.ndarray`
            Scalar or vector of values to set in array over `seg`.
            If a vector, values should be ordered 5'-to-3' relative
            to `seg` rather than (i.e. for a reverse-strand feature,
            position 0 in the vector would correspond to seg.end)
            in the genome

        roi_order : bool, optional
            If `True` (default) and `val` is a vector, values in `val`
            are assorted to go 5' to 3' relative to `seg` rather than
            genome
        """
        self._sum = None

        if isinstance(seg, SegmentChain):
            if isinstance(val, numpy.ndarray):
                if seg.spanning_segment.strand == "-":
                    val = val[::-1]

            x = 0
            for subseg in seg:
                if isinstance(val, numpy.ndarray):
                    subval = val[x:x + len(subseg)]
                else:
                    subval = val
                x += len(subseg)
                self.__setitem__(subseg, subval, roi_order=False)

            return

        old_normalize = self._normalize
        if old_normalize == True:
            warn(
                "Temporarily turning off normalization during value set. It will be re-enabled automatically when complete.",
                DataWarning
            )

        self.set_normalize(False)

        chrom = seg.chrom
        strand = seg.strand
        start = seg.start
        end = seg.end
        if strand == "-" and isinstance(val, numpy.ndarray) and roi_order == True:
            val = val[::-1]

        try:
            assert end < len(self._chroms[seg.chrom][seg.strand])
        except AssertionError:
            my_len = len(self._chroms[chrom][strand])
            new_size = max(my_len + 10000, end + 10000)
            for my_strand in self.strands():
                # copy then resize, because resize() can't work in-place with refs to array
                new_strand = copy.deepcopy(self._chroms[chrom][my_strand])
                new_strand.resize(new_size)
                self._chroms[chrom][my_strand] = new_strand
        except KeyError:
            assert strand in self.strands()
            if chrom not in self.keys():
                self._chroms[chrom] = {}
                for my_strand in self.strands():
                    self._chroms[chrom][my_strand] = numpy.zeros(self.min_chr_size)

        self._chroms[chrom][strand][start:end] = val
        self.set_normalize(old_normalize)

[docs]    def keys(self):
        """Return names of chromosomes in the GenomeArray"""
        return self.chroms()

[docs]    def iterchroms(self):
        """Return an iterator of each chromosome strand array
        
        Yields
        ------
        numpy.ndarray
            Positionwise counts on a chromosome strand
        """
        for chrom in self.keys():
            for strand in self.strands():
                yield self._chroms[chrom][strand]

    #TODO: no unit test for this
[docs]    def plot(self, chroms=None, strands=None, **plot_kw):
        """Create a plot of coverage along tracks or chromosomes
        
        Parameters
        ----------
        chroms : list
            Chromosomes to plot (default: all)
            
        strands : list
            Strands to plot (default: all)
            
        plot_kw
            A dictionary of keywords to pass to matplotlib
        
        
        Returns
        -------
        :py:class:`matplotlib.figure.Figure`
        """
        import matplotlib
        matplotlib.use("Agg")
        import matplotlib.pyplot as plt

        fig_kw = {'figsize': (8, 10)}
        colors = {"+": "blue", "-": "orange", ".": "darkgreen"}

        if chroms is None:
            chroms = self.keys()

        if strands is None:
            strands = self.strands()

        fig_kw.update(plot_kw)
        f, axes = plt.subplots(nrows=len(chroms), **fig_kw)
        plt.subplots_adjust(hspace=0.3)
        for n, chrom in enumerate(sorted(chroms)):
            x = numpy.arange(0, self.lengths()[chrom])
            for strand in strands:
                multiplier = -1 if strand == "-" else 1
                axes[n].plot(
                    x, self._chroms[chrom][strand] * multiplier, label=strand, color=colors[strand]
                )
                axes[n].set_title(chrom)
                axes[n].set_ylabel("Counts")
                axes[n].set_xlabel("Position (nt)")
                axes[n].ticklabel_format(useOffset=1, style='plain', axis='x')

        return f

[docs]    def nonzero(self):
        """Return the indices of chromosomal positions with non-zero values
        at each chromosome/strand pair. Results are returned as a hierarchical
        dictionary mapping chromosome names, to a dictionary of strands,
        which in turn map to  to arrays of non-zero indices on that chromosome-strand.
        
        Returns
        -------
        dict
            `dict[chrom][strand]` = numpy.ndarray of indices
        """
        d_out = {}
        for key in self.keys():
            d_out[key] = {}
            for strand in self.strands():
                d_out[key][strand] = self._chroms[key][strand].nonzero()[0]
        return d_out

[docs]    def apply_operation(self, other, func, mode="same"):
        """Applies a binary operator to a copy of `self` and to `other` elementwise.
        `other` may be a scalar quantity or another |GenomeArray|. In both cases,
        a new |GenomeArray| is returned, and `self` is left unmodified.
        If :meth:`set_normalize` is set to `True`, it is disabled during the
        operation.
        
        Parameters
        ----------
        other : float, int, or |MutableAbstractGenomeArray|
            Second argument to `func`
        
        func : func
            Function to perform. This must take two arguments.
            a :py:class:`numpy.ndarray` (chromosome-strand) from the
            |GenomeArray| will be supplied as the first argument, and the
            corresponding chromosome-strand from `other` as the second.
            This operation will be applied over all chromosomes and strands.
            
            If mode is set to `all`, and a chromosome or strand is not
            present in one of self or other, zero will be supplied
            as the missing argument to func. func must handle this
            gracefully.
        
        mode : str, choice of `'same'`, `'all'`, or `'truncate'`
            Only relevant if `other` is a |MutableAbstractGenomeArray|
        
            If '`same`' each set of corresponding chromosomes must
            have the same dimensions in both parent |GenomeArray| s.
            In addition, both |GenomeArray| s must have the same
            chromosomes, and the same strands.

            If '`all`', corresponding chromosomes in the parent
            |GenomeArray| s will be resized to the dimensions of
            the larger chromosome before the operation is applied.
            Parents are not required to have the same chromosomes
            or strands; all chromosomes and strands will be
            included in output. 
                    
            If `'truncate'`,corresponding chromosomes in the parent
            |GenomeArray| s will be resized to the dimensions of
            the smaller chromosome before the operation is applied.
            Parents are not required to have the same chromosomes
            or strands; chromosomes or strands not common
            to both parents will be ignored.    
        
        Returns
        -------
        |GenomeArray|
            new |GenomeArray| after the operation is applied
        """
        new_array = GenomeArray.like(self)
        old_normalize = self._normalize
        if old_normalize == True:
            warn(
                "Temporarily turning off normalization during value set. It will be re-enabled automatically when complete.",
                DataWarning
            )

        if type(other) == self.__class__:
            if mode == "same":
                self._has_same_dimensions(other)
                for chrom in self.keys():
                    for strand in self.strands():
                        seg = GenomicSegment(chrom, 0, len(self._chroms[chrom][strand]), strand)
                        new_array[seg] = func(self[seg], other[seg])
            elif mode == "all":
                chroms = {}.fromkeys(set(self.keys()) | set(other.keys()))
                strands = set(self.strands()) | set(other.strands())
                new_array = GenomeArray(chroms, strands=strands)
                for chrom in chroms:
                    if chrom in self.keys() and chrom in other.keys():
                        for strand in strands:
                            if strand in self.strands() and strand in other.strands():
                                sc = copy.deepcopy(self._chroms[chrom][strand])
                                oc = copy.deepcopy(other._chroms[chrom][strand])
                                if len(sc) > len(oc):
                                    oc.resize(len(sc), refcheck=False)
                                elif len(oc) > len(sc):
                                    sc.resize(len(oc), refcheck=False)
                                new_array._chroms[chrom][strand] = func(sc, oc)
                            elif strand in self.strands():
                                new_array._chroms[chrom][strand] = func(
                                    copy.deepcopy(self._chroms[chrom][strand]), 0
                                )
                            else:
                                new_array._chroms[chrom][strand] = func(
                                    copy.deepcopy(other._chroms[chrom][strand]), 0
                                )
                    elif chrom in self.keys():
                        for strand in strands:
                            new_array._chroms[chrom][strand] = func(
                                copy.deepcopy(self[chrom][strand]), 0
                            )
                    else:
                        for strand in strands:
                            new_array._chroms[chrom][strand] = func(
                                copy.deepcopy(other[chrom][strand]), 0
                            )
            elif mode == "truncate":
                my_strands = set(self.strands()) & set(other.strands())
                my_chroms = set(self.chroms()) & set(other.chroms())
                for chrom in my_chroms:
                    for strand in my_strands:
                        sc = copy.deepcopy(self._chroms[chrom][strand])
                        oc = copy.deepcopy(other._chroms[chrom][strand])
                        if len(sc) > len(oc):
                            sc.resize(len(oc), refcheck=False)
                        elif len(oc) > len(sc):
                            oc.resize(len(sc), refcheck=False)
                        new_array._chroms[chrom][strand] = func(sc, oc)
            else:
                raise ValueError("Mode not understood. Must be 'same', 'all', or 'truncate.'")
        else:
            for chrom in self.keys():
                for strand in self.strands():
                    new_array._chroms[chrom][strand] = func(self._chroms[chrom][strand], other)

        self.set_normalize(old_normalize)
        return new_array

    def __add__(self, other, mode="all"):
        """Add `other` to `self`, returning a new |GenomeArray|
        and leaving `self` unchanged. `other` may be a scalar quantity or
        another |GenomeArray|. In the latter case, addition is elementwise.
        
        Parameters
        ----------
        other : float, int, or |MutableAbstractGenomeArray|
        
        mode : str, choice of `'same'`, `'all'`, or `'truncate'`
            Only relevant if `other` is a |MutableAbstractGenomeArray|
        
            If '`same`' each set of corresponding chromosomes must
            have the same dimensions in both parent |GenomeArrays|.
            In addition, both |GenomeArrays| must have the same
            chromosomes, and the same strands.

            If '`all`', corresponding chromosomes in the parent
            |GenomeArrays| will be resized to the dimensions of
            the larger chromosome before the operation is applied.
            Parents are not required to have the same chromosomes
            or strands; all chromosomes and strands will be
            included in output. 
                    
            If `'truncate'`,corresponding chromosomes in the parent
            |GenomeArrays| will be resized to the dimensions of
            the smaller chromosome before the operation is applied.
            Parents are not required to have the same chromosomes
            or strands; chromosomes or strands not common
            to both parents will be ignored.     
        
        Returns
        -------
        |GenomeArray|
        """
        return self.apply_operation(other, operator.add, mode=mode)

    def __mul__(self, other, mode="all"):
        """Multiply `other` by `self`, returning a new |GenomeArray|
        and leaving `self` unchanged. `other` may be a scalar quantity or
        another |GenomeArray|. In the latter case, multiplication is elementwise.
        
        Parameters
        ----------
        other : float, int, or |MutableAbstractGenomeArray|
        
        mode : str, choice of `'same'`, `'all'`, or `'truncate'`
            Only relevant if `other` is a |MutableAbstractGenomeArray|
        
            If '`same`' each set of corresponding chromosomes must
            have the same dimensions in both parent |GenomeArrays|.
            In addition, both |GenomeArrays| s must have the same
            chromosomes, and the same strands.

            If '`all`', corresponding chromosomes in the parent
            |GenomeArrays| will be resized to the dimensions of
            the larger chromosome before the operation is applied.
            Parents are not required to have the same chromosomes
            or strands; all chromosomes and strands will be
            included in output. 
                    
            If `'truncate'`,corresponding chromosomes in the parent
            |GenomeArrays| will be resized to the dimensions of
            the smaller chromosome before the operation is applied.
            Parents are not required to have the same chromosomes
            or strands; chromosomes or strands not common
            to both parents will be ignored.                 
        
        Returns
        -------
        |GenomeArray|
        """
        return self.apply_operation(other, operator.mul, mode=mode)

    def __sub__(self, other, mode="all"):
        """Subtract `other` from `self`, returning a new |GenomeArray|
        and leaving `self` unchanged. `other` may be a scalar quantity or
        another |GenomeArray|. In the latter case, subtraction is elementwise.
        
        Parameters
        ----------
        other : float, int, or |MutableAbstractGenomeArray|
        
        mode : str, choice of `'same'`, `'all'`, or `'truncate'`
            Only relevant if `other` is a |MutableAbstractGenomeArray|
        
            If '`same`' each set of corresponding chromosomes must
            have the same dimensions in both parent |GenomeArrays|.
            In addition, both |GenomeArrays| s must have the same
            chromosomes, and the same strands.

            If '`all`', corresponding chromosomes in the parent
            |GenomeArrays| will be resized to the dimensions of
            the larger chromosome before the operation is applied.
            Parents are not required to have the same chromosomes
            or strands; all chromosomes and strands will be
            included in output. 
                    
            If `'truncate'`,corresponding chromosomes in the parent
            |GenomeArrays| will be resized to the dimensions of
            the smaller chromosome before the operation is applied.
            Parents are not required to have the same chromosomes
            or strands; chromosomes or strands not common
            to both parents will be ignored.                 
        
        Returns
        -------
        |GenomeArray|
        """
        return self.__add__(other * -1)

[docs]    def add_from_bowtie(self, fh, mapfunc, min_length=25, max_length=numpy.inf, **trans_args):
        """Import alignment data in native `bowtie`_ format to the current GenomeArray
        
        Parameters
        ----------
        fh : file-like
            filehandle pointing to `bowtie`_ file
        
        mapfunc : func
            a :term:`mapping function` that transforms alignment
            coordinates into a sequence of (|GenomicSegment|,value) pairs

        min_length : int, optional
            minimum length for read to be counted
            (Default: 25)
        
        max_length : int or numpy.inf, optional
            maximum length for read to be counted
            (Default: infinity)
        
        trans_args : dict, optional
            Keyword arguments to pass to transformation
            function

            
        See also
        --------
        five_prime_map
            map reads to 5' ends, with or without applying an offset
        
        variable_five_prime_map
            map reads to 5' ends, choosing an offset determined by read length
        
        three_prime_map
            map reads to 3' ends, with or without applying an offset
        
        center_map
            map each read fractionally to every position in the read, optionally
            trimming positions from the ends first            
        """
        for feature in BowtieReader(fh):
            span_len = len(feature.spanning_segment)
            if span_len >= min_length and span_len <= max_length:
                tuples = mapfunc(feature, **trans_args)
                for seg, val in tuples:
                    self[seg] += val

        self._sum = None

[docs]    def add_from_wiggle(self, fh, strand):
        """Import data from a `Wiggle`_ or `bedGraph`_ file to current GenomeArray
        
        Parameters
        ----------
        fh : file-like
            filehandle pointing to wiggle file
        
        strand : str
            Strand to which data should be added. `'+'`, `'-'`, or `'.'`
        """
        assert strand in self.strands()
        for chrom, start, stop, val in WiggleReader(fh):
            seg = GenomicSegment(chrom, start, stop, strand)
            self[seg] += val

        self._sum = None

[docs]    def to_variable_step(self, fh, trackname, strand, printer=None, **kwargs):
        """Export the contents of the GenomeArray to a variable step
        `Wiggle`_ file. For sparse data, `bedGraph`_ can be more efficient format.
        
        See the `Wiggle spec <http://genome.ucsc.edu/goldenpath/help/wiggle.html>`_
        for format details
        
        Parameters
        ----------
        fh : file-like
            Open filehandle to which data will be written
        
        trackname : str
            Name of browser track
        
        strand : str
            Strand to export. `'+'`, `'-'`, or `'.'`
        
        printer : file-like, optional
            Something implementing a write() method for output
            
        **kwargs
            Any other key-value pairs to include in track definition line
        """
        assert strand in self.strands()
        printer = NullWriter() if printer is None else printer
        fh.write("track type=wiggle_0 name=%s" % trackname)
        if kwargs is not None:
            for k, v in sorted(kwargs.items(), key=lambda x: x[0]):
                fh.write(" %s=%s" % (k, v))
        fh.write("\n")
        nonzero = self.nonzero()
        for chrom in sorted(nonzero):
            printer.write("Writing chromosome %s..." % chrom)
            fh.write("variableStep chrom=%s span=1\n" % chrom)
            indices = nonzero[chrom][strand]
            for idx in indices:
                val = self._chroms[chrom][strand][self._slicewrap(idx)]
                fh.write("%s\t%s\n" % (idx + 1, val))

[docs]    def to_bedgraph(self, fh, trackname, strand, printer=None, **kwargs):
        """Write the contents of the GenomeArray to a `bedGraph`_ file
        
        See the `bedGraph spec <https://cgwb.nci.nih.gov/goldenPath/help/bedgraph.html>`_
        for format details
            
        Parameters
        ----------
        fh : file-like
            Open filehandle to which data will be written
        
        trackname : str
            Name of browser track
        
        strand : str
            Strand to export. `'+'`, `'-'`, or `'.'`

        printer : file-like, optional
            Something implementing a write() method for output       

        **kwargs
            Any other key-value pairs to include in track definition line
        """
        assert strand in self.strands()
        printer = NullWriter() if printer is None else printer
        fh.write("track type=bedGraph name=%s" % trackname)
        if kwargs is not None:
            for k, v in sorted(kwargs.items(), key=lambda x: x[0]):
                fh.write(" %s=%s" % (k, v))
        fh.write("\n")
        nonzero = self.nonzero()
        for chrom in sorted(nonzero):
            printer.write("Writing chromosome %s..." % chrom)
            if self._chroms[chrom][strand].sum() > 0:
                last_val = 0
                last_x = 0
                nz = nonzero[chrom][strand]
                for x in range(nz.min(), nz.max() + 1):
                    val = self._chroms[chrom][strand][self._slicewrap(x)]
                    if val != last_val:
                        # line is: chrom chromStart chromEnd dataValue
                        fh.write("%s\t%s\t%s\t%s\n" % (chrom, last_x, x, last_val))
                        # prepare for next loop
                        last_val = val
                        last_x = x
                    else:
                        continue
                # write final line, not included in loop
                fh.write("%s\t%s\t%s\t%s\n" % (chrom, last_x, x + 1, last_val))

[docs]    @staticmethod
    def like(other):
        """Return a |GenomeArray| of same dimension as the input array
        
        Parameters
        ----------
        other : |GenomeArray|
        
        Returns
        -------
        GenomeArray
            empty |GenomeArray| of same size as `other`
        """
        return GenomeArray(other.lengths(), strands=other.strands())

    def _slicewrap(self, x):
        """Helper function to wrap coordinates for VariableStep/`bedGraph`_ export"""
        return x


[docs]class SparseGenomeArray(GenomeArray):
    """A memory-efficient sublcass of |GenomeArray| using sparse internal representation.
    Note, savings in memory may come at a cost in performance when repeatedly
    getting/setting values, compared to a |GenomeArray|.
    

    Parameters
    ----------
    chr_lengths : dict, optional
        Dictionary mapping chromosome names to lengths.
        Suppyling this parameter yields considerable
        speed improvements, as memory can be pre-allocated
        for each chromosomal position. If not provided,
        a minimum chromosome size will be guessed, and
        chromosomes re-sized as needed (Default: `{}` )
    
    min_chr_size : int,optional
        If `chr_lengths` is not supplied, `min_chr_size` is 
        the default first guess of a chromosome size. If
        your genome has large chromosomes, it is much
        much better, speed-wise, to provide a chr_lengths
        dict than to provide a guess here that is too small.
        (Default: %s)
    
    strands : sequence
        Sequence of strand names for the |GenomeArray|. (Default: `('+','-')`)    
    """

    def __init__(self, chr_lengths=None, strands=None, min_chr_size=MIN_CHR_SIZE):
        """Create a |SparseGenomeArray|

        Parameters
        ----------
        chr_lengths : dict, optional
            Dictionary mapping chromosome names to lengths.
            Suppyling this parameter yields considerable
            speed improvements, as memory can be pre-allocated
            for each chromosomal position. If not provided,
            a minimum chromosome size will be guessed, and
            chromosomes re-sized as needed (Default: `{}` )
        
        min_chr_size : int,optional
            If `chr_lengths` is not supplied, `min_chr_size` is 
            the default first guess of a chromosome size. If
            your genome has large chromosomes, it is much
            much better, speed-wise, to provide a chr_lengths
            dict than to provide a guess here that is too small.
            (Default: %s)
        
        strands : sequence
            Sequence of strand names for the |GenomeArray|. (Default: `('+','-')`)
        """ % MIN_CHR_SIZE
        self._chroms = {}
        self._strands = _DEFAULT_STRANDS if strands is None else strands
        self._sum = None
        self._normalize = False
        self.min_chr_size = min_chr_size
        if chr_lengths is not None:
            for chrom in chr_lengths.keys():
                self._chroms[chrom] = {}
                for strand in self._strands:
                    l = chr_lengths[chrom]
                    self._chroms[chrom][strand] = scipy.sparse.dok_matrix((1, l))

[docs]    def lengths(self):
        """Return a dictionary mapping chromosome names to lengths. In the
        case where two strands report different lengths for a chromosome, the
        max length is taken.
        
        Returns
        -------
        dict
            mapping chromosome names to lengths
        """
        d_out = {}.fromkeys(self.keys())
        for key in d_out:
            d_out[key] = max([self._chroms[key][X].shape[1] for X in self.strands()])

        return d_out

[docs]    def get(self, roi, roi_order=True):
        """Retrieve array of counts from a region of interest (`roi`)
        with values in vector ordered 5' to 3' relative to `roi`
        rather than genome (i.e. are reversed for reverse-strand
        features).
        
        Parameters
        ----------
        roi : |GenomicSegment| or |SegmentChain|
            Region of interest in genome
        
        roi_order : bool, optional
            If `True` (default) return vector of values 5' to 3' 
            relative to `roi` rather than genome.

        Returns
        -------
        :class:`numpy.ndarray`
            vector of numbers, each position corresponding to a position
            in `roi`, from 5' to 3' relative to `roi`
        
        See also
        --------
        SegmentChain.get_counts
            Fetch a spliced vector of data covering a |SegmentChain|
        """
        if isinstance(roi, SegmentChain):
            return roi.get_counts(self)

        if roi.chrom not in self:
            self._chroms[roi.chrom] = {
                K: copy.deepcopy(scipy.sparse.dok_matrix((1, self.min_chr_size)))
                for K in self.strands()
            }
        if roi.end > self._chroms[roi.chrom][roi.strand].shape[1]:
            for strand in self.strands():
                self._chroms[roi.chrom][strand].resize((1, roi.end + 10000))

        vals = self._chroms[roi.chrom][roi.strand][0, roi.start:roi.end]
        if self._normalize is True:
            vals = 1e6 * vals / self.sum()

        vals = vals.toarray().reshape(vals.shape[1], )
        if roi.strand == "-" and roi_order == True:
            vals = vals[::-1]

        return vals

    def __setitem__(self, seg, val, roi_order=True):
        """Set values in the |SparseGenomeArray| over a region of interest.
        
        If the `seg` is outside the bounds of the current |GenomeArray|,
        the |GenomeArray| will be automatically expanded to accomodate
        the genomic coordinates of `seg`.

        Parameters
        ----------
        seg : |GenomicSegment| or |SegmentChain|
            Region of interest

        val : int, float, or :class:`numpy.ndarray`
            Scalar or vector of values to set in array over `seg`.
            If a vector, values should be ordered 5'-to-3' relative
            to `seg` rather than (i.e. for a reverse-strand feature,
            position 0 in the vector would correspond to seg.end)
            in the genome

        roi_order : bool, optional
            If `True` (default) and `val` is a vector, values in `val`
            are assorted to go 5' to 3' relative to `seg` rather than
            genome
       """
        self._sum = None

        if isinstance(seg, SegmentChain):
            if isinstance(val, numpy.ndarray):
                if seg.spanning_segment.strand == "-":
                    val = val[::-1]

            x = 0
            for subseg in seg:
                if isinstance(val, numpy.ndarray):
                    subval = val[x:x + len(subseg)]
                else:
                    subval = val
                x += len(subseg)
                self.__setitem__(subseg, subval, roi_order=False)

            return

        old_normalize = self._normalize

        if isinstance(val, numpy.ndarray) and seg.strand == "-" and roi_order == True:
            val = val[::-1]

        if old_normalize == True:
            warn(
                "Temporarily turning off normalization during value set. It will be re-enabled automatically when complete.",
                DataWarning
            )
        self.set_normalize(False)

        if seg.chrom not in self:
            self._chroms[seg.chrom] = {
                K: copy.deepcopy(scipy.sparse.dok_matrix((1, self.min_chr_size)))
                for K in self.strands()
            }
        if seg.end > self._chroms[seg.chrom][seg.strand].shape[1]:
            for strand in self.strands():
                self._chroms[seg.chrom][strand].resize((1, seg.end + 10000))

        self._chroms[seg.chrom][seg.strand][0, seg.start:seg.end] = val
        self.set_normalize(old_normalize)

    def __mul__(self, other, mode=None):
        """Multiply `other` by `self`, returning a new |SparseGenomeArray|
        and leaving `self` unchanged. `other` may be a scalar quantity or
        another |SparseGenomeArray|. In the latter case, multiplication is elementwise.
        
        Parameters
        ----------
        other : float, int, or |MutableAbstractGenomeArray|
        
        mode : str, choice of `'same'`, `'all'`, or `'truncate'`
            Only relevant if `other` is a |MutableAbstractGenomeArray|
        
            If '`same`' each set of corresponding chromosomes must
            have the same dimensions in both parent |SparseGenomeArrays|.
            In addition, both |SparseGenomeArrays| must have the same
            chromosomes, and the same strands.

            If '`all`', corresponding chromosomes in the parent
            |SparseGenomeArrays| will be resized to the dimensions of
            the larger chromosome before the operation is applied.
            Parents are not required to have the same chromosomes
            or strands; all chromosomes and strands will be
            included in output. 
                    
            If `'truncate'`,corresponding chromosomes in the parent
            |SparseGenomeArrays| will be resized to the dimensions of
            the smaller chromosome before the operation is applied.
            Parents are not required to have the same chromosomes
            or strands; chromosomes or strands not common
            to both parents will be ignored.                 
        
        Returns
        -------
        |SparseGenomeArray|
        """
        if isinstance(other, GenomeArray):
            new_array = SparseGenomeArray.like(self)
            chroms = set(self.keys()) & set(other.keys())
            strands = set(self.strands()) & set(other.strands())
            for chrom in chroms:
                for strand in strands:
                    new_array._chroms[chrom][strand] = self._chroms[chrom][strand].multiply(
                        other._chroms[chrom][strand]
                    )

            return new_array
        else:
            return self.apply_operation(other, operator.mul, mode=mode)

[docs]    def apply_operation(self, other, func, mode=None):
        """Apply a binary operator to a copy of `self` and to `other` elementwise.
        `other` may be a scalar quantity or another |GenomeArray|. In both cases,
        a new |SparseGenomeArray| is returned, and `self` is left unmodified.
        If :meth:`set_normalize` is set to `True`, it is disabled during the
        operation.
        
        Parameters
        ----------
        other : float, int, or |MutableAbstractGenomeArray|
            Second argument to `func`
        
        func : func
            Function to perform. This must take two arguments.
            a :py:class:`numpy.ndarray` (chromosome-strand) from the
            |SparseGenomeArray| will be supplied as the first argument, and the
            corresponding chromosome-strand from `other` as the second.
            This operation will be applied over all chromosomes and strands.
            
            If mode is set to `all`, and a chromosome or strand is not
            present in one of self or other, zero will be supplied
            as the missing argument to func. func must handle this
            gracefully.
        
        mode : str, choice of `'same'`, `'all'`, or `'truncate'`
            Only relevant if `other` is a |MutableAbstractGenomeArray|
        
            If '`same`' each set of corresponding chromosomes must
            have the same dimensions in both parent |SparseGenomeArrays|.
            In addition, both |SparseGenomeArrays| must have the same
            chromosomes, and the same strands.

            If '`all`', corresponding chromosomes in the parent
            |SparseGenomeArrays| will be resized to the dimensions of
            the larger chromosome before the operation is applied.
            Parents are not required to have the same chromosomes
            or strands; all chromosomes and strands will be
            included in output. 
                    
            If `'truncate'`,corresponding chromosomes in the parent
            |SparseGenomeArrays| will be resized to the dimensions of
            the smaller chromosome before the operation is applied.
            Parents are not required to have the same chromosomes
            or strands; chromosomes or strands not common
            to both parents will be ignored.   
        
        Returns
        -------
        |SparseGenomeArray|
            new |SparseGenomeArray| after the operation is applied
        """
        new_array = SparseGenomeArray.like(self)

        old_normalize = self._normalize
        if old_normalize == True:
            warn(
                "Temporarily turning off normalization during value set. It will be re-enabled automatically when complete.",
                DataWarning
            )
        self.set_normalize(False)

        if isinstance(other, GenomeArray):
            chroms = {}.fromkeys(set(self.keys()) | set(other.keys()), 10)
            strands = set(self.strands()) | set(other.strands())
            new_array = SparseGenomeArray(chroms, strands=strands)
            for chrom in chroms:
                if chrom in self.keys() and chrom in other.keys():
                    for strand in strands:
                        if strand in self.strands() and strand in other.strands():
                            sc = self._chroms[chrom][strand]
                            oc = other._chroms[chrom][strand]
                            new_array._chroms[chrom][strand] = func(sc, oc)
                        elif strand in self.strands():
                            new_array._chroms[chrom][strand] = func(
                                copy.deepcopy(self._chroms[chrom][strand]), 0
                            )
                        else:
                            new_array._chroms[chrom][strand] = func(
                                copy.deepcopy(other._chroms[chrom][strand]), 0
                            )
                elif chrom in self.keys():
                    for strand in strands:
                        new_array._chroms[chrom][strand] = func(
                            copy.deepcopy(self[chrom][strand]), 0
                        )
                else:
                    for strand in strands:
                        new_array._chroms[chrom][strand] = func(
                            copy.deepcopy(other[chrom][strand]), 0
                        )
        else:
            for chrom in self.keys():
                for strand in self.strands():
                    new_array._chroms[chrom][strand] = func(self._chroms[chrom][strand], other)

        self.set_normalize(old_normalize)
        return new_array

[docs]    def nonzero(self):
        """Return the indices of chromosomal positions with non-zero values
        at each chromosome/strand pair. Results are returned as a hierarchical
        dictionary mapping chromosome names, to a dictionary of strands,
        which in turn map to  to arrays of non-zero indices on that chromosome-strand.
        
        Returns
        -------
        dict
            `dict[chrom][strand]` = numpy.ndarray of indices
        """
        d_out = {}
        for key in self.keys():
            d_out[key] = {}
            for strand in self.strands():
                # need to sort because dok doens't guarantee sorted indices in nonzero()
                d_out[key][strand] = numpy.array(sorted(self._chroms[key][strand].nonzero()[1]))

        return d_out

    def _slicewrap(self, x):
        """Helper function to wrap coordinates for VariableStep/`bedGraph`_ export"""
        return (0, x)

[docs]    @staticmethod
    def like(other):
        """Return a |SparseGenomeArray| of same dimension as the input array
        
        Parameters
        ----------
        other : |GenomeArray| or |SparseGenomeArray|
        
        Returns
        -------
        |SparseGenomeArray|
            of same size as `other`
        """
        return SparseGenomeArray(other.lengths(), strands=other.strands())