plastid.bin.metagene module

Performs metagene analyses. The workflow is separated into the following subprograms:

Generate

A metagene profile is a position-wise average over all genes in the vicinity of an interesting landmark (e.g. a start codon). Because genes can have multiple transcript isoforms that may cover different genomic positions, which transcript positions (and therefore which genomic positions) to include in the average can be ambiguous when the isoforms are not knnow.

To handle this problem, we define for each gene the maximal spanning window over which every position at a given distance from the landmark of interest (e.g. a start or stop codon) maps to the same genomic coordinates in all transcript isoforms. The windows are defined by the following algorithm:

  1. Transcripts are grouped by gene.

  2. Landmarks are detected on each transcript for each gene. For genes in which all transcripts do not share the same genomic coordinate for the landmark of interest, no window can be defined, and that gene is excluded from further analysis.

  3. For each set of transcripts that passes step (2), the maximal spanning window is created by aligning the set of transcripts at the landmark, and bidirectionally growing the maximal spanning window until either:

    • the next nucleotide position added no longer corresponds to the same genomic position in all transcripts

    • the window reaches the maximum user-specified size

Note: if annotations are supplied as BED files, transcripts cannot be grouped by gene, because BED files don’t contain this information. In this case one ROI is generated per transcript.

Output files

OUTBASE_rois.txt

A tab-delimited text file describing the maximal spanning window for each gene, with columns as follows:

Column

Contains

alignment_offset

Offset to align window to all other windows in the file, if the window happens to be shorter on the 5’ end than specified in --flank_upstream. Typically this is 0.

region_id

ID of region (e.g. gene) from which window was made

region

maximal spanning window, formatted as chromosome:start-end:(strand)

window_size

with of window

zero_point

distance from 5’ end of window to landmark

OUTBASE_rois.bed

Maximal spanning windows in BED format for visualization in a genome browser. The thickly-rendered portion of a window indicates its landmark

where OUTBASE is supplied by the user.

Count

This program generates metagene profiles from a dataset of counts or alignments, taking the following steps:

  1. The raw counts at each position in each maximal spanning window (from the generate subprogram) fetched as a raw count vector for the window.

  2. A normalized count vector is created for each window by dividing its raw count vector by the total number of counts occurring within a user-defined normalization region within the window.

  3. A metagene average is created by taking aligning all of the normalized count vectors, and taking the median normalized counts over all vectors at each nucleotide position. Count vectors deriving from windows that don’t meet a minimum count threshold (set via the --norm_region option) are excluded.

Output files

Raw count vectors, normalized count vectors, and metagene profiles are all saved as tab-delimited text files, for subsequent plotting, filtering, or reanalysis.

OUTBASE_metagene_profile.txt

Tab-delimited table of metagene profile, containing the following columns:

Column

Contains

x

Distance in nucleotides from the landmark

metagene_average

Value of metagene average at that position

regions_counted

Number of maximal spanning windows included at that position in the average. i.e. windows that both met the threshold set by --min_counts and were not masked at that position by a mask file

OUTBASE_rawcounts.txt

Saved if --keep is specified. Table of raw counts. Each row is a maximal spanning window for a gene, and each column a nucleotide position in that window. All windows are aligned at the landmark.

OUTBASE_normcounts.txt

Saved if --keep is specified. Table of normalized counts, produced by dividing each row in the raw count table by the of counts in that row within the columns specified by --normalize_over.

OUTBASE_mask.txt

Saved if --keep is specified. Matrix of masks indicating which cells in normcounts were excluded from computations.

OUTBASE_metagene_overview.[png | svg | pdf | et c…]

Metagene average plotted above a heatmap of normalized counts, in which each row of pixels is a maximal spanning window for a gene, and rows are sorted by the column in which they reach their max value. To facilitate visualization, colors in the heatmap are scaled from 0 to the 95th percentile of non-zero numbers in the normalized counts

OUTBASE is supplied by the user.

Chart

One or more metagene profiles generated by the count subprogram, for example, on different datasets, are plotted against each other.

See command-line help for each subprogram for details on parameters for each

plastid.bin.metagene.do_chart(args, plot_parser, printer=NullWriter())[source]

Plot metagene profiles against one another

Parameters
argsargparse.Namespace

Namespace containing arguments

printerfile-like, optional

Anything implementing a write() method, for logging purposes.

Returns
matplotlib.Figure
plastid.bin.metagene.do_count(args, alignment_parser, plot_parser, printer=NullWriter())[source]

Calculate a metagene average over maximal spanning windows specified in roi_table, taking the following steps:

  1. The raw counts at each position in each maximal spanning window (from the generate subprogram) are totaled to create a raw count vector for the ROI.

  2. A normalized count vector is created fore each window by dividing its raw count vector by the total number of counts occurring within a user-defined normalization window within the window.

  3. A metagene average is created by taking aligning all of the normalized count vectors, and taking the median normalized counts over all vectors at each nucleotide position. Count vectors deriving from ROIs that don’t meet a minimum count threshold (set via the --norm_region option) are excluded.

Parameters
argsargparse.Namespace

Namespace containing arguments

printerfile-like, optional

Anything implementing a write() method, for logging purposes.

Returns
numpy.ndarray

raw counts at each position (column) in each window (row)

numpy.ndarray

counts at each position (column) in each window (row), normalized by the total number of counts in that row from norm_start to norm_end

pandas.DataFrame

Metagene profile of median normalized counts at each position across all windows, and the number of windows included in the calculation of each median

plastid.bin.metagene.group_regions_make_windows(source, mask_hash, flank_upstream, flank_downstream, window_func=<function window_cds_start>, is_sorted=False, group_by='gene_id', printer=NullWriter())[source]

Group regions of interest by a shared attribute, and generate maximal spanning windows for them. Results are given in a table suitable for use in count subprogram. Windows are generated by the following algorithm:

  1. Transcripts are grouped by group_by attribute (default: by gene). If all transcripts in a group share the same genomic coordinate for the landmark of interest (for example, if all share the same start codon), then the set of transcripts is included in the analysis. If not, the set of transcripts and their associated gene are excluded from further processing.

  2. For each set of transcripts that pass step (1), the maximal spanning window is created by aligning the set of transcripts at the landmark, and adding nucleotide positions in transcript coordinates to the growing window in both 5’ and 3’ directions until either:

    • the next nucleotide position added is no longer corresponds to the same genomic position in all transcripts

    • the window reaches the maximum user-specified size

Parameters
sourcelist or generator

Source of SegmentChain or Transcript objects, preferably with gene_id and transcript_id (e.g. transcripts assembled from a GTF2 or GFF3 file), so that transcripts can be grouped by gene when making maximal spanning windows.

mask_hashGenomeHash

GenomeHash containing regions to exclude from analysis

flank_upstreamint

Number of nucleotides upstream of landmark to include in windows (in transcript coordinates)

flank_downstream: int

Number of nucleotides downstream of landmark to include in windows (in transcript coordinates)

window_funcfunc, optional

Function that defines a landmark in an individual transcript, and builds a window around that landmark over that region. As examples, window_cds_start() and window_cds_stop() are provided, though any function that meets the following criteria can be used:

  1. It must take the same parameters as window_cds_start()

  2. It must return the same types as window_cds_start()

Such functions could choose arbitrary features as landmarks, such as peaks in ribosome density, nucleic acid sequence features, transcript start or end sites, or any property that can be deduced from a Transcript. (Default: window_cds_start())

group_bystr, optional

Attribute by which SegmentChains or Transcripts should be grouped before generating maximal spanning windows (Default: “gene_id”)

is_sortedbool, optional

Input file is sorted and/or tabix-indexed. If True, group_regions_make_windows() will take advantage of this to save memory. (Default: False)

printerfile-like, optional

filehandle to write logging info to (Default: NullWriter())

Returns
pandas.DataFrame

A pandas.DataFrame containing the following columns describing the maximal spanning windows:

Column

Contains

alignment_offset

Offset to align window to all other windows in the file from the 5’ end, if the window happens to be shorter on the 5’ end than specified in flank_upstream

region_id

ID of region, given by shared value of group_by parameter (i.e. by default is ‘gene_id’)

region

Maximal spanning window, formatted as chromosome:start-end:(strand)

region_length

Length of maximal spanning window

region_bed

Maximal spanning window, formatted as a BED line

window_size

Requested length of maximal spanning window. May be larger than actual window if alignment_offset or threeprime_offset is nonzero

zero_point

Distance from 5’ end of window to landmark, including alignment_offset

threeprime_offset

Offset to align window to all other windows the file from the 3’ end, if the window happens to be shorter on the 3’ end than specified in flank_downstream

list

List of SegmentChain representing each window. These data are also represented as strings in the pandas.DataFrame

Notes

Not all genes will be included in the output if, for example, there isn’t a position set common to all transcripts surrounding the landmark

plastid.bin.metagene.main(argv=['-T', '-E', '-b', 'html', '-d', '_build/doctrees', '-D', 'language=en', '.', '_build/html'])[source]

Command-line program

Parameters
argvlist, optional

A list of command-line arguments, which will be processed as if the script were called from the command line if main() is called directly.

Default: sys.argv[1:]. The command-line arguments, if the script is invoked from the command line

plastid.bin.metagene.maximal_spanning_window(regions, mask_hash, flank_upstream, flank_downstream, window_func=<function window_cds_start>, name=None, printer=NullWriter())[source]

Create a maximal spanning window over regions surrounding a landmark,

The maximal spanning window is created by:

  1. Applying window_func to each region in regions to create a sub-window of region that surrounds a landmark identified by window_func, with up to flank_upstream bases 5’ of the landmark, and flank_downstream bases 3` of the landmark.

  2. If the landmark in all regions corresponds to the same genomic position, a maximal spanning window is created by starting at the landmark, and growing the window in the 5’ and 3’ directions along all regions until either:

    • the next nucleotide position added no longer corresponds to the same genomic position in all regions

    • the window reaches the maximum size specified by`flank_upstream` (in 5’ direction) or flank_downstream (in 3’ direction)

Parameters
regionslist

List of SegmentChains or Transcripts

mask_hashGenomeHash

GenomeHash containing regions to exclude from analysis

flank_upstreamint

Number of nucleotides upstream of landmark to include in maximal spanning window, if possible

flank_downstream: int

Number of nucleotides downstream of landmark to include in maximal spanning window, if possible

window_funcfunc, optional

Function that defines a landmark in an individual region, and builds a window around that landmark over that region. As examples, window_cds_start() and window_cds_stop() are provided, though any function that meets the following criteria can be used:

  1. It must take the same parameters as window_cds_start()

  2. It must return the same types as window_cds_start()

Such functions could choose arbitrary features as landmarks, such as peaks in ribosome density, nucleic acid sequence features, transcript start or end sites, or any property that can be deduced from a Transcript. (Default: window_cds_start())

namestr or None, optional

Name for maximal spanning window, to which it’s ID attribute will be set. If None, a name will be generated.

printerfile-like, optional

filehandle to write logging info to (Default: NullWriter())

Returns
SegmentChain

Maximal spanning window, if regions share the same landmark. Otherwise, 0-length SegmentChain

int orobj:numpy.nan

Alignment offset to the window start, if the maximal spanning window itself is not long enough in the 5’ direction to include the entire distance specified by flank_upstream. Use this to align this window to other maximal spanning windows.

If regions do not share the same landmark, numpy.nan

plastid.bin.metagene.window_cds_start(transcript, flank_upstream, flank_downstream, ref_delta=0)[source]

Returns a window surrounding a start codon.

Parameters
transcriptTranscript

Transcript on which to generate window

flank_upstreamint

Nucleotide length upstream of start codon to include in window, if transcript has a start codon

flank_downstreamint

Nucleotide length downstream of start codon to include in window, if transcript has a start codon

ref_deltaint, optional

Offset from start codon to the reference point. If 0, the landmark is the reference point. (Default: 0)

Returns
SegmentChain

Window surrounding start codon if transcript is coding. Otherwise, zero-length SegmentChain

int

Alignment offset to the window start, if transcript itself wasn’t long enough in the 5’ direction to include the entire distance specified by flank_upstream. Use this to align this window to other windows generated around start codons in other transcripts. If transcript is not coding, returns numpy.nan

(str, int, str)

Genomic coordinate of reference point as (chromosome name, coordinate, strand). If transcript has no start codon, returns numpy.nan

plastid.bin.metagene.window_cds_stop(transcript, flank_upstream, flank_downstream, ref_delta=0)[source]

Returns a window surrounding a stop codon.

Parameters
transcriptTranscript

Transcript on which to generate window

flank_upstreamint

Nucleotide length upstream of stop codon to include in window, if transcript has a stop codon

flank_downstreamint

Nucleotide length downstream of stop codon to include in window, if transcript has a stop codon

ref_deltaint, optional

Offset from stop codon to the reference point. If 0, the landmark is the reference point. (Default: 0)

Returns
SegmentChain

Window surrounding stop codon if transcript is coding. Otherwise, zero-length SegmentChain

int

alignment offset to the window start, if transcript itself wasn’t long enough in the 5’ direction to include the entire distance specified by flank_upstream. Use this to align this window to other windows generated around stop codons in other transcripts. If transcript is not coding, returns numpy.nan

(str, int, str)

Genomic coordinate of reference point as (chromosome name, coordinate, strand). If transcript has no stop codon, returns numpy.nan

plastid.bin.metagene.window_landmark(region, flank_upstream=50, flank_downstream=50, ref_delta=0, landmark=0)[source]

Define a window surrounding a landmark in a region, if the region has such a landmark, (e.g. a start codon in a transcript), accounting for splicing of the region, if the region is discontinuous

Parameters
transcriptSegmentChain or Transcript

Region on which to generate a window surrounding a landmark

landmarkint

Position of the landmark within region

flank_upstreamint

Nucleotides upstream of landmark to include in window

flank_downstreamint

Nucleotides downstream of landmark to include in window

ref_deltaint

Offset from landmark to the reference point. If 0, the landmark is the reference point. Default: 0

Returns
SegmentChain

Window of region surrounding landmark

int

alignment offset to the window start, if region itself wasn’t long enough in the 5’ direction to include the entire distance specified by flank_upstream. Use this to align windows generated around similar landmarks from different regions (e.g. windows surrounding start codons in various transcripts).

(str, int, str)

Genomic coordinate of reference point as (chromosome name, coordinate, strand)