Source code for plastid.bin.metagene

#!/usr/bin/env python
"""Performs :term:`metagene` analyses. The workflow is separated into the
following subprograms:

Generate
    A :term:`metagene` profile is a position-wise average over all genes
    in the vicinity of an interesting landmark (e.g. a start codon). Because
    genes can have multiple transcript isoforms that may cover different
    genomic positions, which transcript positions (and therefore which
    genomic positions) to include in the average can be ambiguous when
    the isoforms are not known.
    
    To handle this problem, we define for each gene the :term:`maximal spanning window`
    over which every position at a given distance from the landmark of interest
    (e.g. a start or stop codon) maps to the same genomic coordinates in all
    transcript isoforms. The windows are defined by the following algorithm: 
    
     #. Transcripts are grouped by gene.
    
     #. Landmarks are detected on each transcript for each gene. If the
        transcripts of a gene do not all share the same genomic coordinate for
        the landmark of interest, no window can be defined, and that gene is
        excluded from further analysis.
    
     #. For each set of transcripts that passes step (2), the maximal spanning
        window is created by aligning the set of transcripts at the landmark, and
        bidirectionally growing the maximal spanning window until either:
        
           - the next nucleotide position added no longer corresponds to 
             the same genomic position in all transcripts
            
           - the window reaches the maximum user-specified size

    **Note**: if annotations are supplied as `BED`_ files, transcripts cannot be
    grouped by gene, because `BED`_ files don't contain this information.
    In this case one ROI is generated per transcript.
    
    
    .. rubric:: Output files
    
    OUTBASE_rois.txt
        A tab-delimited text file describing the maximal spanning window for
        each gene, with columns as follows:
        
        ================   ==================================================
        **Column**         **Contains**
        ----------------   --------------------------------------------------

        alignment_offset   Offset to align window to all other windows in the
                           file, if the window happens to be shorter on the 5'
                           end than specified in ``--flank_upstream``. Typically
                           this is `0`.

        region_id          ID of region (e.g. gene) from which window was made
        
        region             maximal spanning window, formatted as
                           `chromosome:start-end:(strand)`
        
        window_size        width of window
        
        zero_point         distance from 5' end of window to landmark
        ================   ==================================================
        
    
    OUTBASE_rois.bed
        Maximal spanning windows in `BED`_ format for visualization in
        a :term:`genome browser`. The thickly-rendered portion of a window
        indicates its landmark.

    where `OUTBASE` is supplied by the user.
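
The window-growing rule described under ``Generate`` can be sketched in plain
Python. This is only an illustration of the prose above, not the code used by
this module: ``shared_window`` and its arguments are hypothetical names,
positions are assumed to be listed 5' to 3', and masking and strand handling
are left out.

```python
def shared_window(position_lists, landmark_indices, flank_upstream, flank_downstream):
    # position_lists   : genomic positions of each transcript, 5' to 3'
    # landmark_indices : index of the landmark within each position list
    # Returns (shared positions, alignment offset), or ([], None) if the
    # transcripts disagree on the landmark itself.

    def positions_at(offset):
        # Set of genomic positions found `offset` nt from the landmark;
        # None marks a transcript that does not extend that far.
        found = set()
        for positions, idx in zip(position_lists, landmark_indices):
            i = idx + offset
            found.add(positions[i] if 0 <= i < len(positions) else None)
        return found

    def agreed(found):
        return len(found) == 1 and None not in found

    landmark = positions_at(0)
    if not agreed(landmark):
        return [], None

    # grow in the 5' direction until transcripts diverge or the flank is full
    upstream = []
    for offset in range(-1, -flank_upstream - 1, -1):
        found = positions_at(offset)
        if not agreed(found):
            break
        upstream.append(found.pop())

    # grow in the 3' direction; the landmark itself counts as the first
    # downstream position
    downstream = []
    for offset in range(1, flank_downstream):
        found = positions_at(offset)
        if not agreed(found):
            break
        downstream.append(found.pop())

    window = upstream[::-1] + [landmark.pop()] + downstream
    alignment_offset = flank_upstream - len(upstream)
    return window, alignment_offset


# two toy isoforms that share positions 101-105 around a landmark at 103
tx_a = [100, 101, 102, 103, 104, 105, 106, 107]
tx_b = [90, 101, 102, 103, 104, 105, 200, 201]
window, offset = shared_window([tx_a, tx_b], [3, 3], flank_upstream=5, flank_downstream=5)
print(window, offset)  # [101, 102, 103, 104, 105] 3
```

Here the isoforms diverge three nucleotides upstream of the landmark, so only
two of the five requested upstream positions are kept and the window needs an
alignment offset of 3 to line up with full-length windows.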
    
    
Count
    This program generates :term:`metagene` profiles from a dataset of
    :term:`counts` or :term:`alignments`, taking the following steps:
    
     1. The **raw counts** at each position in each :term:`maximal spanning window`
        (from the ``generate`` subprogram) are fetched as a raw count vector for
        the window.

     2. A **normalized count vector** is created for each window by dividing
        its raw count vector by the total number of counts occurring within a
        user-defined normalization region within the window.
    
     3. A **metagene average** is created by aligning all of the
        normalized count vectors, and taking the median normalized counts
        over all vectors at each nucleotide position. Count vectors deriving
        from windows that don't meet a minimum count threshold (set via the
        ``--min_counts`` option) are excluded.
    
    
    .. rubric:: Output files

    Raw count vectors, normalized count vectors, and metagene profiles are all
    saved as tab-delimited text files, for subsequent plotting, filtering,
    or reanalysis.
    
    OUTBASE_metagene_profile.txt
        Tab-delimited table of metagene profile, containing the following
        columns:

        ================   ==================================================
        **Column**         **Contains**
        ----------------   --------------------------------------------------
        x                  Distance in nucleotides from the landmark
        
        metagene_average   Value of metagene average at that position
        
        regions_counted    Number of maximal spanning windows included at
                           that position in the average, i.e. windows that
                           both met the threshold set by ``--min_counts`` and
                           were not masked at that position by a :term:`mask file`
        ================   ==================================================        
        
    OUTBASE_rawcounts.txt
        Saved if ``--keep`` is specified. Table of raw counts. Each row is a
        maximal spanning window for a gene, and each column a nucleotide position
        in that window. All windows are aligned at the landmark.
    
    OUTBASE_normcounts.txt
        Saved if ``--keep`` is specified. Table of normalized counts, produced
        by dividing each row in the raw count table by the total number of counts in that
        row within the columns specified by ``--normalize_over``.

    OUTBASE_mask.txt
        Saved if ``--keep`` is specified. Matrix of masks indicating which
        cells in ``normcounts`` were excluded from computations.

    OUTBASE_metagene_overview.[png | svg | pdf | etc.]
        Metagene average plotted above a heatmap of normalized counts,
        in which each row of pixels is a maximal spanning window for a gene,
        and rows are sorted by the column in which they reach their
        max value. To facilitate visualization, colors in the heatmap are scaled
        from 0 to the 95th percentile of non-zero numbers in the normalized counts.
        
    `OUTBASE` is supplied by the user.
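
The three steps listed under ``Count`` can be sketched in plain Python. This
is a simplified illustration rather than the module's implementation:
``metagene_average`` and its arguments are hypothetical names, mask files are
ignored, and the count threshold is assumed to apply to the counts inside the
normalization region.

```python
from statistics import median


def metagene_average(raw_counts, norm_start, norm_end, min_counts):
    # raw_counts: equal-length count vectors, one per window, already
    # aligned at the landmark. Returns (profile, number of windows used).
    normalized = []
    for vector in raw_counts:
        norm_total = sum(vector[norm_start:norm_end])
        # skip windows below the count threshold (and avoid dividing by zero)
        if norm_total < min_counts or norm_total == 0:
            continue
        normalized.append([count / norm_total for count in vector])

    # median of the normalized counts at each nucleotide position
    profile = [median(column) for column in zip(*normalized)]
    return profile, len(normalized)


raw = [
    [0, 2, 4, 4],  # 8 counts in the normalization region
    [1, 1, 1, 3],  # 4 counts in the normalization region
    [0, 0, 1, 0],  # 1 count: below the threshold, so excluded
]
profile, n_windows = metagene_average(raw, norm_start=2, norm_end=4, min_counts=2)
print(profile, n_windows)  # [0.125, 0.25, 0.375, 0.625] 2
```

Because each vector is divided by its own total, highly and lowly expressed
genes contribute on the same scale, and the median keeps a single outlier
window from dominating the profile.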

    
Chart
    One or more metagene profiles generated by the ``count`` subprogram,
    for example, on different datasets, are plotted against each other. 


See the command-line help for each subprogram for details on its parameters.
"""
__author__ = "joshua"

import gc
import sys
import warnings
import argparse
import inspect
import numpy
import pandas as pd
from plastid.genomics.roitools import SegmentChain, positions_to_segments
from plastid.util.io.filters import NameDateWriter
from plastid.util.io.openers import get_short_name, argsopener, NullWriter
from plastid.util.scriptlib.help_formatters import format_module_docstring
from plastid.util.services.exceptions import (
    ArgumentWarning,
    DataWarning,
    FileFormatWarning,
)
from plastid.util.scriptlib.argparsers import (
    AnnotationParser,
    AlignmentParser,
    PlottingParser,
    MaskParser,
    BaseParser,
)
from plastid.readers.bigbed import BigBedReader

warnings.simplefilter("once", DataWarning)

printer = NameDateWriter(get_short_name(inspect.stack()[-1][1]))

#===============================================================================
# helper functions to generate/handle maximal spanning windows / ROIs
#===============================================================================


def window_landmark(region, flank_upstream=50, flank_downstream=50, ref_delta=0, landmark=0):
    """Define a window surrounding a landmark in a region (e.g. a start codon
    in a transcript), if the region has such a landmark. Splicing is accounted
    for if the region is discontinuous.
    
    Parameters
    ----------
    region : |SegmentChain| or |Transcript|
        Region on which to generate a window surrounding a landmark
    
    landmark : int
        Position of the landmark within `region`

    flank_upstream : int
        Nucleotides upstream of `landmark` to include in window
    
    flank_downstream : int
        Nucleotides downstream of `landmark` to include in window
    
    ref_delta : int
        Offset from `landmark` to the reference point. If 0, the landmark
        is the reference point. Default: 0
    
    
    Returns
    -------
    |SegmentChain|
        Window of `region` surrounding landmark
    
    int
        alignment offset to the window start, if `region` itself wasn't long
        enough in the 5' direction to include the entire distance specified by
        `flank_upstream`. Use this to align windows generated around similar
        landmarks from different `regions` (e.g. windows surrounding start codons
        in various transcripts).

    (str, int, str)
        Genomic coordinate of reference point as *(chromosome name, coordinate, strand)*
    """
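    if landmark + ref_delta >= flank_upstream:
        fiveprime_offset = 0
        my_start = landmark + ref_delta - flank_upstream
    else:
        # region is too short on its 5' end: pad by the distance missing
        # between the window start and the reference point
        fiveprime_offset = flank_upstream - (landmark + ref_delta)
        my_start = 0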

    my_end = min(region.length, landmark + ref_delta + flank_downstream)
    roi = region.get_subchain(my_start, my_end)
    span = region.spanning_segment
    chrom = span.chrom
    strand = span.strand

    if landmark + ref_delta == region.length:
        if span.strand == "+":
            ref_point = (chrom, span.end, strand)
        else:
            ref_point = (chrom, span.start - 1, strand)
    else:
        ref_point = region.get_genomic_coordinate(landmark + ref_delta)

    return roi, fiveprime_offset, ref_point


def window_cds_start(transcript, flank_upstream, flank_downstream, ref_delta=0):
    """Returns a window surrounding a start codon.
    
    Parameters
    ----------
    transcript : |Transcript|
        Transcript on which to generate window
    
    flank_upstream : int
        Nucleotide length upstream of start codon to include in window,
        if `transcript` has a start codon
    
    flank_downstream : int
        Nucleotide length downstream of start codon to include in window,
        if `transcript` has a start codon
    
    ref_delta : int, optional
        Offset from start codon to the reference point. If `0`, the landmark
        is the reference point. (Default: `0`)
    
    Returns
    -------
    |SegmentChain|
        Window surrounding start codon if `transcript` is coding. Otherwise,
        zero-length |SegmentChain| 
    
    int
        Alignment offset to the window start, if `transcript` itself wasn't long
        enough in the 5' direction to include the entire distance specified by
        `flank_upstream`. Use this to align this window to other windows generated
        around start codons in other transcripts. If transcript is not coding,
        returns :obj:`numpy.nan`

    (str, int, str)
        Genomic coordinate of reference point as *(chromosome name, coordinate, strand)*.
        If `transcript` has no start codon, returns :obj:`numpy.nan`
    """
    if transcript.cds_start is None:
        return SegmentChain(), numpy.nan, numpy.nan

    return window_landmark(
        transcript,
        flank_upstream,
        flank_downstream,
        ref_delta=ref_delta,
        landmark=transcript.cds_start
    )


def window_cds_stop(transcript, flank_upstream, flank_downstream, ref_delta=0):
    """Returns a window surrounding a stop codon.

    Parameters
    ----------
    transcript : |Transcript|
        Transcript on which to generate window
    
    flank_upstream : int
        Nucleotide length upstream of stop codon to include in window,
        if `transcript` has a stop codon
    
    flank_downstream : int
        Nucleotide length downstream of stop codon to include in window,
        if `transcript` has a stop codon
    
    ref_delta : int, optional
        Offset from stop codon to the reference point. If `0`, the landmark
        is the reference point. (Default: `0`)
    
    Returns
    -------
    |SegmentChain|
        Window surrounding stop codon if transcript is coding. Otherwise,
        zero-length |SegmentChain| 
    
    int
        alignment offset to the window start, if `transcript` itself wasn't long
        enough in the 5' direction to include the entire distance specified by
        `flank_upstream`. Use this to align this window to other windows generated
        around stop codons in other transcripts. If transcript is not coding,
        returns :obj:`numpy.nan`

    (str, int, str)
        Genomic coordinate of reference point as *(chromosome name, coordinate, strand)*.
        If `transcript` has no stop codon, returns :obj:`numpy.nan`
    """
    if transcript.cds_end is None:
        return SegmentChain(), numpy.nan, numpy.nan

    return window_landmark(
        transcript,
        flank_upstream,
        flank_downstream,
        ref_delta=ref_delta,
        landmark=transcript.cds_end - 3
    )


def maximal_spanning_window(
        regions,
        mask_hash,
        flank_upstream,
        flank_downstream,
        window_func=window_cds_start,
        name=None,
        printer=NullWriter()
):
    """Create a maximal spanning window over `regions` surrounding a landmark.
    
    The maximal spanning window is created by:
    
     #. Applying `window_func` to each `region` in `regions` to create a sub-window
        of `region` that surrounds a landmark identified by `window_func`, 
        with up to `flank_upstream` bases 5' of the landmark, and `flank_downstream`
        bases 3' of the landmark.

     #. If the landmark in all regions corresponds to the same genomic position,
        a maximal spanning window is created by starting at the landmark,
        and growing the window in the 5' and 3' directions along all regions
        until either:
        
          - the next nucleotide position added no longer corresponds to 
            the same genomic position in all regions 
            
          - the window reaches the maximum size specified by `flank_upstream`
            (in 5' direction) or `flank_downstream` (in 3' direction)
        
    
    Parameters
    ----------
    regions : list
        List of |SegmentChains| or |Transcripts|
    
    mask_hash : |GenomeHash|
        |GenomeHash| containing regions to exclude from analysis
    
    flank_upstream : int
        Number of nucleotides upstream of landmark to include in maximal
        spanning window, if possible

    flank_downstream: int
        Number of nucleotides downstream of landmark to include in maximal
        spanning window, if possible
    
    window_func : func, optional
        Function that defines a landmark in an individual region, and builds
        a window around that landmark over that region.  As examples,
        :func:`window_cds_start` and :func:`window_cds_stop` are provided,
        though any function that meets the following criteria can be used:
        
         1. It must take the same parameters as :func:`window_cds_start`
            
         2. It must return the same types as :func:`window_cds_start`

        Such functions could choose arbitrary features as landmarks, such as
        peaks in ribosome density, nucleic acid sequence features, transcript
        start or end sites, or any property that can be deduced from a
        |Transcript|. (Default: :func:`window_cds_start`)
    
    name : str or None, optional
        Name for maximal spanning window, to which its `ID` attribute will 
        be set. If `None`, a name will be generated. 
        
    printer : file-like, optional
        filehandle to write logging info to (Default: :func:`NullWriter`)
    
    
    Returns
    -------
    |SegmentChain|
        Maximal spanning window, if `regions` share the same landmark. Otherwise,
        0-length |SegmentChain|
    
    int or :obj:`numpy.nan`
        Alignment offset to the window start, if the maximal spanning window
        itself is not long enough in the 5' direction to include the entire
        distance specified by `flank_upstream`. Use this to align this window
        to other maximal spanning windows.
        
        If `regions` do not share the same landmark, :obj:`numpy.nan`
    """
    refpoints = []
    window_size = flank_upstream + flank_downstream

    # find common positions
    position_matrix = numpy.tile(numpy.nan, (len(regions), window_size))
    for n, region in enumerate(regions):
        try:
            my_roi, my_offset, genomic_refpoint = window_func(
                region, flank_upstream, flank_downstream
            )
            refpoints.append(genomic_refpoint)

            if genomic_refpoint is not numpy.nan and len(my_roi) > 0:
                pos_list = my_roi.get_position_list()  # ascending list of positions
                my_len = len(pos_list)
                assert my_offset + my_len <= window_size
                if my_roi.spanning_segment.strand == "+":
                    position_matrix[n, my_offset:my_offset + my_len] = pos_list
                else:
                    # reverse so that minus-strand positions run 5' to 3'
                    position_matrix[n, my_offset:my_offset + my_len] = pos_list[::-1]

        except IndexError:
            warnings.warn(
                "IndexError finding common positions at region '%s'. Ignoring region." %
                region.get_name(), DataWarning
            )

    # continue only if refpoints all match
    if len(set(refpoints)) == 1 and numpy.nan not in refpoints:
        # keep only columns in which every region maps to the same genomic position
        new_shared_positions = []
        for i in range(0, position_matrix.shape[1]):
            col = position_matrix[:, i]
            if len(set(col)) == 1 and not numpy.isnan(col[0]):
                new_shared_positions.append(int(col[0]))

        # continue only if there exist positions shared between all regions
        if len(set(new_shared_positions)) > 0:

            # define new ROI covering all positions common to all transcripts
            new_roi = SegmentChain(
                *positions_to_segments(regions[0].chrom, regions[0].strand, new_shared_positions)
            )
            if name is None:
                name = new_roi.get_name()

            new_roi.attr["ID"] = name

            # the landmark is rendered as the thick portion of the BED feature;
            # its coordinates are computed the same way on both strands
            new_roi.attr["thickstart"] = genomic_refpoint[1]
            new_roi.attr["thickend"] = genomic_refpoint[1] + 1

            # having made sure that refpoint is same for all transcripts,
            # we use last ROI and last offset to find new offset
            # this fails if ref point is at the 3' end of the roi,
            # due to quirks of half-open coordinate systems
            # so we test it explicitly
            if flank_upstream - my_offset == my_roi.length:
                new_offset = my_offset
            else:
                zero_point_roi = new_roi.get_segmentchain_coordinate(*genomic_refpoint)
                new_offset = flank_upstream - zero_point_roi

            masks = mask_hash.get_overlapping_features(new_roi)
            mask_segs = []
            for mask in masks:
                mask_segs.extend(mask.segments)

            new_roi.add_masks(*mask_segs)

            return new_roi, new_offset

    return SegmentChain(), numpy.nan


#===============================================================================
# Subprograms
#===============================================================================


def group_regions_make_windows(
        source,
        mask_hash,
        flank_upstream,
        flank_downstream,
        window_func=window_cds_start,
        is_sorted=False,
        group_by="gene_id",
        printer=NullWriter()
):
    """Group regions of interest by a shared attribute, and generate
    maximal spanning windows for them. Results are given in a table
    suitable for use in ``count`` subprogram. Windows are generated by
    the following algorithm:

     1. Transcripts are grouped by `group_by` attribute (default: by gene).
        If all transcripts in a group share the same
        genomic coordinate for the landmark of interest (for example, if all 
        share the same start codon), then the set of transcripts is
        included in the analysis. If not, the set of transcripts and their
        associated gene are excluded from further processing.
    
     2. For each set of transcripts that pass step (1), the maximal spanning
        window is created by aligning the set of transcripts at the landmark, and
        adding nucleotide positions in transcript coordinates to the growing
        window in both 5' and 3' directions until either:
        
          - the next nucleotide position added no longer corresponds to 
            the same genomic position in all transcripts
            
          - the window reaches the maximum user-specified size
    
    Parameters
    ----------
    source : list or generator
        Source of |SegmentChain| or |Transcript| objects, preferably with `gene_id` and
        `transcript_id` (e.g. transcripts assembled from a `GTF2`_ or `GFF3`_
        file), so that transcripts can be grouped by gene when making maximal
        spanning windows.
    
    mask_hash : |GenomeHash|
        |GenomeHash| containing regions to exclude from analysis
    
    flank_upstream : int
        Number of nucleotides upstream of landmark to include in windows
        (in transcript coordinates)

    flank_downstream: int
        Number of nucleotides downstream of landmark to include in windows
        (in transcript coordinates)
    
    window_func : func, optional
        Function that defines a landmark in an individual transcript, and builds
        a window around that landmark over that region.  As examples,
        :func:`window_cds_start` and :func:`window_cds_stop` are provided,
        though any function that meets the following criteria can be used:
        
          1. It must take the same parameters as :func:`window_cds_start`
            
          2. It must return the same types as :func:`window_cds_start`

        Such functions could choose arbitrary features as landmarks, such as
        peaks in ribosome density, nucleic acid sequence features, transcript
        start or end sites, or any property that can be deduced from a
        |Transcript|. (Default: :func:`window_cds_start`)
    
    group_by : str, optional
        Attribute by which |SegmentChains| or |Transcripts| should be grouped
        before generating maximal spanning windows (Default: `"gene_id"`)

    is_sorted : bool, optional
        Input file is sorted and/or `tabix`_-indexed. If `True`, 
        :func:`group_regions_make_windows` will take advantage of this to save memory.
        (Default: `False`)

    printer : file-like, optional
        filehandle to write logging info to (Default: :func:`NullWriter`)
    
    Returns
    -------
    :class:`pandas.DataFrame`
        A :class:`pandas.DataFrame` containing the following columns describing the
        maximal spanning windows:

            ====================   ==================================================
            *Column*               *Contains*
            --------------------   --------------------------------------------------
    
            alignment_offset       Offset to align window to all other windows in the
                                   file from the 5' end, if the window happens to be
                                   shorter on the 5' end than specified in
                                   `flank_upstream`
    
            region_id              ID of region, given by shared value of `group_by`
                                   parameter (i.e. by default is 'gene_id')
            
            region                 Maximal spanning window, formatted as
                                   `chromosome:start-end:(strand)`

            region_length          Length of maximal spanning window
            
            region_bed             Maximal spanning window, formatted as a `BED`_ line

            masked                 Masked positions within the window, formatted
                                   like `region`
            
            window_size            Requested length of maximal spanning window.
                                   May be larger than actual window if
                                   `alignment_offset` or `threeprime_offset` is 
                                   nonzero
            
            zero_point             Distance from 5' end of window to landmark,
                                   including `alignment_offset`
            
            threeprime_offset      Offset to align window to all other windows in
                                   the file from the 3' end, if the window happens
                                   to be shorter on the 3' end than specified in
                                   `flank_downstream`
            ====================   ==================================================        
    
    
    Notes
    -----         
    Not all genes will be included in the output if, for example, there isn't a
    position set common to all transcripts surrounding the landmark
    """
    import itertools
    window_size = flank_upstream + flank_downstream

    dtmp = {
        "region_id"        : [],
        "region"           : [],
        "region_length"    : [],
        "masked"           : [],
        "alignment_offset" : [],
        "window_size"      : [],
        "zero_point"       : [],
        "region_bed"       : [],
        "threeprime_offset": [],
    } # yapf: disable

    transcripts = []
    group_transcript = {}
    last_chrom = None
    do_loop = True
    source = iter(source)
    c = -1

    # to save memory, we process one chromosome at a time if input file is sorted
    # knowing that at that moment all transcript parts are assembled
    while do_loop:
        try:
            tx = next(source)
        except StopIteration:
            # source is exhausted; `tx` still holds the last transcript read
            do_loop = False

        try:
            # end of chromosome or end of file
            if (is_sorted and tx.spanning_segment.chrom != last_chrom) or not do_loop:
                last_chrom = tx.spanning_segment.chrom
                if do_loop:
                    # push `tx` back so it is read again with its own chromosome
                    source = itertools.chain([tx], source)

                for tx_chain in transcripts:
                    # if attr is missing, use transcript name, which should be unique
                    attr = tx_chain.attr
                    if group_by == "gene_id":
                        if "gene_id" in attr:
                            group_attr = attr["gene_id"]
                        else:
                            group_attr = tx_chain.get_gene()
                            warnings.warn(
                                "Region '%s' has no gene_id. Inferring gene_id to be '%s'" %
                                (tx_chain.get_name(), group_attr), DataWarning
                            )
                    else:
                        if group_by in attr:
                            group_attr = attr[group_by]
                        else:
                            warnings.warn(
                                "Region '%s' has no attribute '%s', and will not be grouped. Using region name as default group."
                                % (tx_chain.get_name(), group_by), DataWarning
                            )
                            group_attr = tx_chain.get_name()

                    try:
                        group_transcript[group_attr].append(tx_chain)
                    except KeyError:
                        group_transcript[group_attr] = [tx_chain]

                # for each gene, find maximal window in which all points
                # are represented in all transcripts. return window and offset
                for region_id, tx_list in group_transcript.items():
                    c += 1
                    if c % 1000 == 1:
                        printer.write(
                            "Processed %s genes, included %s ..." %
                            (c, len(list(dtmp.values())[0]))
                        )

                    name = region_id  # name regions after gene id
                    max_spanning_window, offset = maximal_spanning_window(
                        tx_list,
                        mask_hash,
                        flank_upstream,
                        flank_downstream,
                        window_func=window_func,
                        name=name,
                        printer=printer
                    )

                    if len(max_spanning_window) > 0:
                        mask_chain = max_spanning_window.get_masks_as_segmentchain()
                        dtmp["region_id"].append(region_id)
                        dtmp["window_size"].append(window_size)

                        # need to cast `region` to string to keep numpy from converting to array
                        dtmp["region"].append(str(max_spanning_window))
                        dtmp["masked"].append(str(mask_chain))
                        dtmp["alignment_offset"].append(offset)
                        dtmp["zero_point"].append(flank_upstream)
                        dtmp["region_bed"].append(max_spanning_window.as_bed())
                        dtmp["region_length"].append(max_spanning_window.length)
                        dtmp["threeprime_offset"
                             ].append(window_size - offset - max_spanning_window.length)

                # clean up
                del transcripts
                del group_transcript
                gc.collect()
                del gc.garbage[:]
                transcripts = []
                group_transcript = {}

            else:
                transcripts.append(tx)

        except UnboundLocalError:
            # `tx` was never assigned, i.e. `source` was empty
            pass

    df = pd.DataFrame(dtmp)
    df.sort_values(["region_id"], inplace=True)
    printer.write("Processed %s genes total. Included %s." % (c + 1, len(df)))

    # Warn in case of annotation problems
    if (df["alignment_offset"] == flank_upstream).all():
        warnings.warn(
            "All maximal spanning windows lack flanks upstream of reference landmark. This occurs e.g. for start codons when annotation files don't contain UTR data. Please check your annotation file.",
            DataWarning
        )

    # N.b. This warning will only be invoked for zero-length landmarks
    # e.g. won't work for stop codons, which are 3nt wide
    if (df["threeprime_offset"] == flank_downstream).all():
        warnings.warn(
            "All maximal spanning windows lack flanks downstream of reference landmark. This occurs e.g. for stop codons when annotation files don't contain UTR data. Please check your annotation file.",
            DataWarning
        )

    return df


_NORM_START_DEFAULT = 20
_NORM_END_DEFAULT = 50


def _get_norm_region(roi_table, args):
    """Helper function to get normalization region from current and deprecated
    command-line arguments.  This function will be removed in plastid v0.6.1,
    when the deprecated command-line arguments will also be removed.


    Parameters
    ----------
    roi_table : :class:`pandas.DataFrame`
        DataFrame of output from |metagene| generate subprogram
        
    args : :class:`argparse.Namespace`
        Command-line arguments to |metagene| or |psite| count subprograms
        
        
    Returns
    -------
    (int, int)
        Start and end of normalization region, with respect to start of window,
        including `alignment_offset`.
    """
    # TODO: remove --norm_region in Plastid v0.6.1
    flank_upstream = roi_table["zero_point"][0]
    if args.normalize_over is not None:
        norm_start, norm_end = args.normalize_over
        norm_start += flank_upstream
        norm_end += flank_upstream
        if args.norm_region is not None:
            warnings.warn(
                "`--normalize_over` replaces `--norm_region`, which is deprecated. Ignoring `--norm_region` and using `--normalize_over`. See `metagene count --help` for differences.",
                ArgumentWarning
            )
    elif args.norm_region is not None:
        warnings.warn(
            "`--norm_region` is deprecated and will be removed in plastid v0.6.1. Use `--normalize_over` instead. See `metagene count --help` for differences.",
            ArgumentWarning
        )
        norm_start, norm_end = args.norm_region
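    else:
        # defaults are given relative to the landmark, so shift them into
        # window coordinates, as is done for `--normalize_over`
        norm_start = _NORM_START_DEFAULT + flank_upstream
        norm_end = _NORM_END_DEFAULT + flank_upstream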

    return norm_start, norm_end
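

# Illustrative sketch only (not part of plastid's API): how the
# landmark-relative coordinates given to ``--normalize_over`` relate to
# window-relative coordinates. The helper name below is hypothetical, and
# ``zero_point`` is assumed to be the distance from the 5' end of the window
# to the landmark, as read from the ROI file. With the default
# ``--upstream 50``, the region (20, 50) maps to (70, 100), which matches the
# documented default of the deprecated ``--norm_region`` option.
def _example_landmark_to_window_coords(normalize_over, zero_point):
    """Shift a (start, end) pair from landmark-relative to window-relative."""
    start, end = normalize_over
    return start + zero_point, end + zero_point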


def do_count(args, alignment_parser, plot_parser, printer=NullWriter()):
    """Calculate a metagene average over the maximal spanning windows listed in
    the ROI file (``args.roi_file``), taking the following steps:

     1. The **raw counts** at each position in each :term:`maximal spanning window`
        (from the ``generate`` subprogram) are totaled to create a raw count vector
        for the ROI.

     2. A **normalized count vector** is created for each window by dividing
        its raw count vector by the total number of counts occurring within a
        user-defined normalization region inside that window.
    
     3. A **metagene average** is created by aligning all of the normalized
        count vectors and taking the median (or, with ``--use_mean``, the mean)
        of the normalized counts over all vectors at each nucleotide position.
        Count vectors deriving from ROIs that don't meet a minimum count
        threshold (set via the ``--min_counts`` option) are excluded.
            
    Parameters
    ----------
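    args : :class:`argparse.Namespace`
        Namespace containing arguments

    alignment_parser : AlignmentParser
        Parser used to open the read alignments or count files specified in `args`

    plot_parser : PlottingParser
        Parser used to set plot style, colors, and figure from `args`

    printer : file-like, optional
        Anything implementing a ``write()`` method, for logging purposes.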

    
        
    Returns
    -------
    :py:class:`numpy.ndarray`
        raw counts at each position (column) in each window (row)
    
    :py:class:`numpy.ndarray`
        counts at each position (column) in each window (row), normalized by
        the total number of counts in that row from `norm_start` to `norm_end`
    
    :class:`pandas.DataFrame`
        Metagene profile of median normalized counts at each position across
        all windows, and the number of windows included in the calculation of each
        median
    """
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt
    from plastid.plotting.plots import profile_heatmap

    plot_parser.set_style_from_args(args)

    # yapf: disable
    outbase = args.outbase
    count_fn     = "%s_rawcounts.txt.gz" % outbase
    normcount_fn = "%s_normcounts.txt.gz" % outbase
    mask_fn      = "%s_mask.txt.gz" % outbase
    profile_fn   = "%s_metagene_profile.txt" % outbase
    fig_fn       = "%s_metagene_overview.%s" % (outbase,args.figformat)
    # yapf: enable

    printer.write("Opening ROI file %s ..." % args.roi_file)
    with open(args.roi_file) as fh:
        roi_table = pd.read_table(fh, sep="\t", comment="#", index_col=None, header=0)

    # wrapper to deal with multiple command-line arguments
    norm_start, norm_end = _get_norm_region(roi_table, args)

    #norm_start, norm_end = args.norm_region
    min_counts = args.min_counts

    # open count files
    ga = alignment_parser.get_genome_array_from_args(args, printer=printer)

    # the following values are identical for all genes, so the 0th value is fine
    window_size = roi_table["window_size"][0]
    upstream_flank = roi_table["zero_point"][0]
    cshape = (len(roi_table), window_size)

    # by default, mask everything
    counts = numpy.ma.MaskedArray(numpy.tile(numpy.nan, cshape), mask=numpy.tile(True, cshape))

    for i, row in roi_table.iterrows():
        if i % 1000 == 1:
            printer.write("Counted %s ROIs ..." % (i))

        roi = SegmentChain.from_str(row["region"])
        mask = SegmentChain.from_str(row["masked"])
        roi.add_masks(*mask)
        offset = int(round((row["alignment_offset"])))
        assert offset + roi.length <= window_size

        # take away from masked array
        mvec = roi.get_masked_counts(ga)
        counts.data[i, offset:offset + roi.length] = mvec.data
        counts.mask[i, offset:offset + roi.length] = mvec.mask
        #counts[i,offset:offset+roi.length] = roi.get_masked_counts(ga)

    printer.write("Counted %s ROIs total." % (i + 1))

    denominator = numpy.nansum(counts[:, norm_start:norm_end], axis=1)
    row_select = denominator >= min_counts

    norm_counts = (counts.T.astype(float) / denominator).T
    norm_counts = numpy.ma.MaskedArray(norm_counts, mask=counts.mask)
    norm_counts.mask[numpy.isnan(norm_counts)] = True
    norm_counts.mask[numpy.isinf(norm_counts)] = True

    if args.keep == True:
        printer.write("Saving counts to %s ..." % count_fn)
        numpy.savetxt(count_fn, counts, delimiter="\t", fmt='%.8f')
        printer.write("Saving normalized counts to %s ..." % normcount_fn)
        numpy.savetxt(normcount_fn, norm_counts, delimiter="\t")
        printer.write("Saving masks used in profile building to %s ..." % mask_fn)
        numpy.savetxt(mask_fn, norm_counts.mask, delimiter="\t")

    try:
        if args.use_mean:
            pfunc = numpy.ma.mean
        else:
            pfunc = numpy.ma.median

        profile = pfunc(norm_counts[row_select], axis=0)
    except (IndexError, ValueError):
        # fall back to an all-zero profile with one entry per window position
        profile = numpy.zeros(norm_counts.shape[1])

    if profile.sum() == 0:
        printer.write(
            "Metagene profile is zero at all positions. %s ROIs made the minimum count cutoff." %
            row_select.sum()
        )
        printer.write("Consider lowering --min_counts (currently %s)." % min_counts)

    num_genes = ((~norm_counts.mask)[row_select]).sum(0)
    profile_table = pd.DataFrame(
        {
            "metagene_average": profile,
            "regions_counted": num_genes,
            "x": numpy.arange(-upstream_flank, window_size - upstream_flank),
        }
    )

    printer.write("Saving metagene profile to %s ..." % profile_fn)
    with argsopener(profile_fn, args, "w") as profile_out:
        profile_table.to_csv(
            profile_out,
            sep="\t",
            header=True,
            index=False,
            na_rep="nan",
            columns=["x", "metagene_average", "regions_counted"],
        )

    # plot
    printer.write("Plotting to %s ..." % fig_fn)
    rs = norm_counts[row_select, :]
    p95 = numpy.nanpercentile(rs[rs > 0], 95)
    im_args = {
        "interpolation": "none",
        "vmin": 0,
        "vmax": p95,  # limit color space for better plotting
    }
    if args.cmap is not None:
        im_args["cmap"] = args.cmap

    plot_args = {}
    plot_args["color"] = plot_parser.get_colors_from_args(args, 1)[0]

    fig = plot_parser.get_figure_from_args(args)
    ax = plt.gca()
    fig, ax = profile_heatmap(
        norm_counts[row_select],
        profile=profile_table["metagene_average"],
        axes=ax,
        x=profile_table["x"],
        im_args=im_args,
        plot_args=plot_args
    )

    title = args.title if args.title is not None else "Metagene overview for %s" % outbase
    fig.suptitle(title)

    ax["main"].set_ylabel("Normalized ribosome density (au), by gene")
    landmark = args.landmark
    if args.landmark is not None:
        ax["main"].set_xlabel("Distance (nt) from %s" % landmark)

    printer.write("Saving image to %s ..." % fig_fn)
    fig.savefig(fig_fn, dpi=args.dpi, bbox_inches="tight")

    return counts, norm_counts, profile_table
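

# Illustrative sketch only (not part of plastid's API): steps 2 and 3 of
# ``do_count`` applied to a small array of raw counts. The helper name and its
# arguments are hypothetical. As simplifying assumptions, NaN stands in for
# masked positions (``do_count`` itself uses numpy masked arrays), and
# ``min_counts`` is taken to be greater than zero so no window is divided by
# a zero total.
def _example_metagene_profile(counts, norm_start, norm_end, min_counts=10, use_mean=False):
    """Toy version of the normalize-then-average steps."""
    import numpy  # local import keeps the sketch self-contained

    counts = numpy.asarray(counts, dtype=float)

    # total counts per window inside the normalization region, ignoring NaNs
    denominator = numpy.nansum(counts[:, norm_start:norm_end], axis=1)

    # keep only windows that meet the minimum count threshold
    row_select = denominator >= min_counts

    # divide each retained window by its own normalization total
    norm_counts = counts[row_select] / denominator[row_select][:, None]

    # column-wise median (or mean) over the retained windows
    pfunc = numpy.nanmean if use_mean else numpy.nanmedian
    return pfunc(norm_counts, axis=0)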


def do_chart(args, plot_parser, printer=NullWriter()):
    """Plot metagene profiles against one another
    
    Parameters
    ----------
    args : :class:`argparse.Namespace`
        Namespace containing arguments

    plot_parser : PlottingParser
        Parser used to set plot style, colors, and figure from `args`

    printer : file-like, optional
        Anything implementing a ``write()`` method, for logging purposes.
    
    Returns
    -------
    :py:class:`matplotlib.figure.Figure`
    """
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    if len(args.labels) == len(args.infiles):
        samples = {
            K: pd.read_table(V, sep="\t", comment="#", header=0, index_col=None)
            for K, V in zip(args.labels, args.infiles)
        }
    else:
        if len(args.labels) > 0:
            warnings.warn(
                "Expected %s labels for %s infiles; found %s. Ignoring labels." %
                (len(args.infiles), len(args.infiles), len(args.labels)), ArgumentWarning
            )
        samples = {
            get_short_name(V): pd.read_table(V, sep="\t", comment="#", header=0, index_col=None)
            for V in args.infiles
        }

    plot_parser.set_style_from_args(args)
    title = args.title
    colors = plot_parser.get_colors_from_args(args, len(args.infiles))

    fig = plot_parser.get_figure_from_args(args)
    ax = plt.gca()
    min_x = numpy.inf
    max_x = -numpy.inf
    for n, (k, v) in enumerate(samples.items()):
        plt.plot(v["x"], v["metagene_average"], label=k, color=colors[n])
        min_x = min(min_x, min(v["x"]))
        max_x = max(max_x, max(v["x"]))

    plt.xlim(min_x, max_x)
    ylim = ax.get_ylim()
    plt.ylim(0, ylim[1])

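    # `--landmark` is optional; avoid labeling the axis "Distance from None"
    if args.landmark is not None:
        plt.xlabel("Distance from %s (nt)" % args.landmark)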
    plt.ylabel("Normalized read density (au)")

    if title is not None:
        plt.title(title)

    plt.legend(bbox_to_anchor=(1.02, 1.02), loc="upper left", borderaxespad=0)

    fn = "%s.%s" % (args.outbase, args.figformat)
    printer.write("Saving to %s ..." % fn)
    fig.savefig(fn, dpi=args.dpi, bbox_inches="tight")

    return fig


#===============================================================================
# PROGRAM BODY
#===============================================================================


def main(argv=sys.argv[1:]):
    """Command-line program
    
    Parameters
    ----------
    argv : list, optional
        A list of command-line arguments, which will be processed
        as if the script were called from the command line if
        :py:func:`main` is called directly.

        Default: `sys.argv[1:]`. The command-line arguments, if the script is
        invoked from the command line
    """
    al = AlignmentParser(disabled=["normalize"])
    an = AnnotationParser()
    pp = PlottingParser()
    mp = MaskParser()
    bp = BaseParser()

    alignment_file_parser = al.get_parser()
    annotation_file_parser = an.get_parser()
    mask_file_parser = mp.get_parser()
    plotting_parser = pp.get_parser()
    base_parser = bp.get_parser()

    generator_help = "Create ROI file from genome annotation"
    generator_desc = format_module_docstring(group_regions_make_windows.__doc__)

    count_help = "Count reads falling into regions of interest, normalize, and average into a metagene profile"
    count_desc = format_module_docstring(do_count.__doc__)

    chart_help = "Plot metagene profiles"
    chart_desc = format_module_docstring(do_chart.__doc__)

    parser = argparse.ArgumentParser(
        description=format_module_docstring(__doc__),
        formatter_class=argparse.RawDescriptionHelpFormatter
    )
    subparsers = parser.add_subparsers(
        title="subcommands", description="choose one of the following", dest="program"
    )
    gparser = subparsers.add_parser(
        "generate",
        help=generator_help,
        description=generator_desc,
        parents=[base_parser, annotation_file_parser, mask_file_parser],
        formatter_class=argparse.RawDescriptionHelpFormatter
    )
    cparser = subparsers.add_parser(
        "count",
        help=count_help,
        description=count_desc,
        parents=[base_parser, alignment_file_parser, plotting_parser],
        formatter_class=argparse.RawDescriptionHelpFormatter
    )
    pparser = subparsers.add_parser(
        "chart",
        help=chart_help,
        description=chart_desc,
        parents=[base_parser, plotting_parser],
        formatter_class=argparse.RawDescriptionHelpFormatter
    )

    # generate subprogram options
    gparser.add_argument(
        "--landmark",
        type=str,
        choices=("cds_start", "cds_stop"),
        default="cds_start",
        help="Landmark around which to build metagene profile (Default: cds_start)"
    )
    gparser.add_argument(
        "--upstream",
        type=int,
        default=50,
        help="Nucleotides to include upstream of landmark (Default: 50)"
    )
    gparser.add_argument(
        "--downstream",
        type=int,
        default=50,
        help="Nucleotides to include downstream of landmark (Default: 50)"
    )
    gparser.add_argument(
        "--group_by",
        type=str,
        default="gene_id",
        help="Attribute (e.g. in GTF2/GFF3 column 9) by which to group regions " +
        "before generating maximal spanning windows " +
        "(Default: group transcripts by gene using 'gene_id' attribute from GTF2, or 'Parent' attribute in GFF3)"
    )
    gparser.add_argument("outbase", type=str, help="Basename for output files")

    # count subprogram options
    cparser.add_argument(
        "roi_file",
        type=str,
        help="Text file containing maximal spanning windows and offsets, " +
        "generated by the ``generate`` subprogram."
    )
    cparser.add_argument(
        "--min_counts",
        type=int,
        default=10,
        metavar="N",
        help="Minimum counts required in normalization region " +
        "to be included in metagene average (Default: 10)"
    )
    cparser.add_argument(
        "--normalize_over",
        type=int,
        nargs=2,
        metavar="N",
        default=None,
        #default=(20,50),
        help="Portion of each window against which its individual raw count profile" +
        " will be normalized. Specify two integers, in nucleotide" +
        " distance from landmark (negative for upstream, positive for downstream;" +
        " surround negative numbers with quotes). (Default: 20 50)"
    )
    cparser.add_argument(
        "--norm_region",
        type=int,
        nargs=2,
        metavar="N",
        default=None,
        help="Deprecated. Use ``--normalize_over`` instead. " +
        "Formerly, the portion of each window against which its individual raw count profile" +
        " was normalized. Specify two integers, in nucleotide" +
        " distance from 5' end of window. (Default: 70 100)"
    )
    cparser.add_argument(
        "--landmark", type=str, default=None, help="Name of landmark at zero point, optional."
    )
    cparser.add_argument(
        "--use_mean",
        default=False,
        action="store_true",
        help="If supplied, use columnwise mean to generate profile (Default: use median)"
    )
    cparser.add_argument(
        "--keep",
        default=False,
        action="store_true",
        help="Save intermediate count files. Useful for additional computations (Default: False)"
    )
    cparser.add_argument("outbase", type=str, help="Basename for output files")

    # chart subprogram arguments
    pparser.add_argument("outbase", type=str, help="Basename for output file.")
    pparser.add_argument(
        "infiles",
        type=str,
        nargs="+",
        help="One or more metagene profiles, generated by the" +
        " ``count`` subprogram, which will be plotted together."
    )
    pparser.add_argument(
        "--labels",
        type=str,
        nargs="+",
        default=[],
        help="Sample names for each metagene profile (optional)."
    )
    pparser.add_argument(
        "--landmark",
        type=str,
        default=None,
        help="Name of landmark at zero point (e.g. 'CDS start' or 'CDS stop'; optional)"
    )

    args = parser.parse_args(argv)
    bp.get_base_ops_from_args(args)

    # 'generate' subprogram
    if args.program == "generate":
        printer.write("Generating ROI file ...")
        if args.landmark == "cds_start":
            map_function = window_cds_start
        elif args.landmark == "cds_stop":
            map_function = window_cds_stop

        # check sorting
        is_sorted = (args.sorted == True) or \
                    (args.tabix == True) or \
                    (args.annotation_format == "BigBed")

        # open annotations
        printer.write("Opening annotation files: %s ..." % ", ".join(args.annotation_files))

        annotation_message = """`metagene` relies upon relationships between
        transcripts/genes to make maximal spanning windows that cover them. The
        `%s` attribute used to group these is not found in your %s file.
        Consider either (1) using a GTF2 or GFF3 file, (2) creating an extended
        BED file with this additional column, or (3) creating a BigBed file
        containing this extra column.""".replace("        ", "").replace(
            "\n", " "
        ) % (args.group_by, args.annotation_format)

        if args.annotation_format == "BED":
            if not isinstance(args.bed_extra_columns,
                              list) or args.group_by not in args.bed_extra_columns:
                warnings.warn(annotation_message, FileFormatWarning)
        elif args.annotation_format == "BigBed":
            reader = BigBedReader(args.annotation_files[0])
            if args.group_by not in reader.extension_fields:
                warnings.warn(annotation_message, FileFormatWarning)

        transcripts = an.get_transcripts_from_args(args, printer=printer)
        mask_hash = mp.get_genome_hash_from_args(args)

        # get ROIs
        printer.write("Generating regions of interest ...")
        roi_table = group_regions_make_windows(
            transcripts,
            mask_hash,
            args.upstream,
            args.downstream,
            window_func=map_function,
            printer=printer,
            is_sorted=is_sorted,
            group_by=args.group_by
        )

        roi_file = "%s_rois.txt" % args.outbase
        bed_file = "%s_rois.bed" % args.outbase
        printer.write("Saving ROIs to %s ..." % roi_file)
        with argsopener(roi_file, args, "w") as roi_fh:
            roi_table.to_csv(
                roi_fh,
                sep="\t",
                header=True,
                index=False,
                na_rep="nan",
                columns=[
                    "region_id", "window_size", "region", "masked", "alignment_offset", "zero_point"
                ]
            )

        printer.write("Saving BED output as %s ..." % bed_file)
        with argsopener(bed_file, args, "w") as bed_fh:
            for roi in roi_table["region_bed"]:
                bed_fh.write(roi)

    # 'count' subprogram
    elif args.program == "count":
        do_count(args, al, pp, printer)

    # 'chart' subprogram
    elif args.program == "chart":
        assert len(args.labels) in (0, len(args.infiles))
        do_chart(args, pp, printer)

    printer.write("Done.")


if __name__ == "__main__":
    main()