plastid.bin.metagene module¶
Performs metagene analyses. The workflow is separated into the following subprograms:
- Generate
A metagene profile is a position-wise average over all genes in the vicinity of an interesting landmark (e.g. a start codon). Because genes can have multiple transcript isoforms that may cover different genomic positions, which transcript positions (and therefore which genomic positions) to include in the average can be ambiguous when the isoforms are not knnow.
To handle this problem, we define for each gene the maximal spanning window over which every position at a given distance from the landmark of interest (e.g. a start or stop codon) maps to the same genomic coordinates in all transcript isoforms. The windows are defined by the following algorithm:
Transcripts are grouped by gene.
Landmarks are detected on each transcript for each gene. For genes in which all transcripts do not share the same genomic coordinate for the landmark of interest, no window can be defined, and that gene is excluded from further analysis.
For each set of transcripts that passes step (2), the maximal spanning window is created by aligning the set of transcripts at the landmark, and bidirectionally growing the maximal spanning window until either:
the next nucleotide position added no longer corresponds to the same genomic position in all transcripts
the window reaches the maximum user-specified size
Note: if annotations are supplied as BED files, transcripts cannot be grouped by gene, because BED files don’t contain this information. In this case one ROI is generated per transcript.
Output files
- OUTBASE_rois.txt
A tab-delimited text file describing the maximal spanning window for each gene, with columns as follows:
Column
Contains
alignment_offset
Offset to align window to all other windows in the file, if the window happens to be shorter on the 5’ end than specified in
--flank_upstream
. Typically this is 0.region_id
ID of region (e.g. gene) from which window was made
region
maximal spanning window, formatted as chromosome:start-end:(strand)
window_size
with of window
zero_point
distance from 5’ end of window to landmark
- OUTBASE_rois.bed
Maximal spanning windows in BED format for visualization in a genome browser. The thickly-rendered portion of a window indicates its landmark
where OUTBASE is supplied by the user.
- Count
This program generates metagene profiles from a dataset of counts or alignments, taking the following steps:
The raw counts at each position in each maximal spanning window (from the
generate
subprogram) fetched as a raw count vector for the window.A normalized count vector is created for each window by dividing its raw count vector by the total number of counts occurring within a user-defined normalization region within the window.
A metagene average is created by taking aligning all of the normalized count vectors, and taking the median normalized counts over all vectors at each nucleotide position. Count vectors deriving from windows that don’t meet a minimum count threshold (set via the
--norm_region
option) are excluded.
Output files
Raw count vectors, normalized count vectors, and metagene profiles are all saved as tab-delimited text files, for subsequent plotting, filtering, or reanalysis.
- OUTBASE_metagene_profile.txt
Tab-delimited table of metagene profile, containing the following columns:
Column
Contains
x
Distance in nucleotides from the landmark
metagene_average
Value of metagene average at that position
regions_counted
Number of maximal spanning windows included at that position in the average. i.e. windows that both met the threshold set by
--min_counts
and were not masked at that position by a mask file- OUTBASE_rawcounts.txt
Saved if
--keep
is specified. Table of raw counts. Each row is a maximal spanning window for a gene, and each column a nucleotide position in that window. All windows are aligned at the landmark.- OUTBASE_normcounts.txt
Saved if
--keep
is specified. Table of normalized counts, produced by dividing each row in the raw count table by the of counts in that row within the columns specified by--normalize_over
.- OUTBASE_mask.txt
Saved if
--keep
is specified. Matrix of masks indicating which cells innormcounts
were excluded from computations.- OUTBASE_metagene_overview.[png | svg | pdf | et c…]
Metagene average plotted above a heatmap of normalized counts, in which each row of pixels is a maximal spanning window for a gene, and rows are sorted by the column in which they reach their max value. To facilitate visualization, colors in the heatmap are scaled from 0 to the 95th percentile of non-zero numbers in the normalized counts
OUTBASE is supplied by the user.
- Chart
One or more metagene profiles generated by the
count
subprogram, for example, on different datasets, are plotted against each other.
See command-line help for each subprogram for details on parameters for each
- plastid.bin.metagene.do_chart(args, plot_parser, printer=NullWriter())[source]¶
Plot metagene profiles against one another
- Parameters
- args
argparse.Namespace
Namespace containing arguments
- printerfile-like, optional
Anything implementing a
write()
method, for logging purposes.
- args
- Returns
matplotlib.Figure
- plastid.bin.metagene.do_count(args, alignment_parser, plot_parser, printer=NullWriter())[source]¶
Calculate a metagene average over maximal spanning windows specified in roi_table, taking the following steps:
The raw counts at each position in each maximal spanning window (from the
generate
subprogram) are totaled to create a raw count vector for the ROI.A normalized count vector is created fore each window by dividing its raw count vector by the total number of counts occurring within a user-defined normalization window within the window.
A metagene average is created by taking aligning all of the normalized count vectors, and taking the median normalized counts over all vectors at each nucleotide position. Count vectors deriving from ROIs that don’t meet a minimum count threshold (set via the
--norm_region
option) are excluded.
- Parameters
- args
argparse.Namespace
Namespace containing arguments
- printerfile-like, optional
Anything implementing a
write()
method, for logging purposes.
- args
- Returns
numpy.ndarray
raw counts at each position (column) in each window (row)
numpy.ndarray
counts at each position (column) in each window (row), normalized by the total number of counts in that row from norm_start to norm_end
pandas.DataFrame
Metagene profile of median normalized counts at each position across all windows, and the number of windows included in the calculation of each median
- plastid.bin.metagene.group_regions_make_windows(source, mask_hash, flank_upstream, flank_downstream, window_func=<function window_cds_start>, is_sorted=False, group_by='gene_id', printer=NullWriter())[source]¶
Group regions of interest by a shared attribute, and generate maximal spanning windows for them. Results are given in a table suitable for use in
count
subprogram. Windows are generated by the following algorithm:Transcripts are grouped by group_by attribute (default: by gene). If all transcripts in a group share the same genomic coordinate for the landmark of interest (for example, if all share the same start codon), then the set of transcripts is included in the analysis. If not, the set of transcripts and their associated gene are excluded from further processing.
For each set of transcripts that pass step (1), the maximal spanning window is created by aligning the set of transcripts at the landmark, and adding nucleotide positions in transcript coordinates to the growing window in both 5’ and 3’ directions until either:
the next nucleotide position added is no longer corresponds to the same genomic position in all transcripts
the window reaches the maximum user-specified size
- Parameters
- sourcelist or generator
Source of
SegmentChain
orTranscript
objects, preferably with gene_id and transcript_id (e.g. transcripts assembled from a GTF2 or GFF3 file), so that transcripts can be grouped by gene when making maximal spanning windows.- mask_hash
GenomeHash
GenomeHash
containing regions to exclude from analysis- flank_upstreamint
Number of nucleotides upstream of landmark to include in windows (in transcript coordinates)
- flank_downstream: int
Number of nucleotides downstream of landmark to include in windows (in transcript coordinates)
- window_funcfunc, optional
Function that defines a landmark in an individual transcript, and builds a window around that landmark over that region. As examples,
window_cds_start()
andwindow_cds_stop()
are provided, though any function that meets the following criteria can be used:It must take the same parameters as
window_cds_start()
It must return the same types as
window_cds_start()
Such functions could choose arbitrary features as landmarks, such as peaks in ribosome density, nucleic acid sequence features, transcript start or end sites, or any property that can be deduced from a
Transcript
. (Default:window_cds_start()
)- group_bystr, optional
Attribute by which
SegmentChains
orTranscripts
should be grouped before generating maximal spanning windows (Default: “gene_id”)- is_sortedbool, optional
Input file is sorted and/or tabix-indexed. If True,
group_regions_make_windows()
will take advantage of this to save memory. (Default: False)- printerfile-like, optional
filehandle to write logging info to (Default:
NullWriter()
)
- Returns
pandas.DataFrame
A
pandas.DataFrame
containing the following columns describing the maximal spanning windows:Column
Contains
alignment_offset
Offset to align window to all other windows in the file from the 5’ end, if the window happens to be shorter on the 5’ end than specified in flank_upstream
region_id
ID of region, given by shared value of group_by parameter (i.e. by default is ‘gene_id’)
region
Maximal spanning window, formatted as chromosome:start-end:(strand)
region_length
Length of maximal spanning window
region_bed
Maximal spanning window, formatted as a BED line
window_size
Requested length of maximal spanning window. May be larger than actual window if alignment_offset or threeprime_offset is nonzero
zero_point
Distance from 5’ end of window to landmark, including alignment_offset
threeprime_offset
Offset to align window to all other windows the file from the 3’ end, if the window happens to be shorter on the 3’ end than specified in flank_downstream
- list
List of
SegmentChain
representing each window. These data are also represented as strings in thepandas.DataFrame
Notes
Not all genes will be included in the output if, for example, there isn’t a position set common to all transcripts surrounding the landmark
- plastid.bin.metagene.main(argv=['-T', '-E', '-b', 'html', '-d', '_build/doctrees', '-D', 'language=en', '.', '_build/html'])[source]¶
Command-line program
- Parameters
- argvlist, optional
A list of command-line arguments, which will be processed as if the script were called from the command line if
main()
is called directly.Default: sys.argv[1:]. The command-line arguments, if the script is invoked from the command line
- plastid.bin.metagene.maximal_spanning_window(regions, mask_hash, flank_upstream, flank_downstream, window_func=<function window_cds_start>, name=None, printer=NullWriter())[source]¶
Create a maximal spanning window over regions surrounding a landmark,
The maximal spanning window is created by:
Applying window_func to each region in regions to create a sub-window of region that surrounds a landmark identified by window_func, with up to flank_upstream bases 5’ of the landmark, and flank_downstream bases 3` of the landmark.
If the landmark in all regions corresponds to the same genomic position, a maximal spanning window is created by starting at the landmark, and growing the window in the 5’ and 3’ directions along all regions until either:
the next nucleotide position added no longer corresponds to the same genomic position in all regions
the window reaches the maximum size specified by`flank_upstream` (in 5’ direction) or flank_downstream (in 3’ direction)
- Parameters
- regionslist
List of
SegmentChains
orTranscripts
- mask_hash
GenomeHash
GenomeHash
containing regions to exclude from analysis- flank_upstreamint
Number of nucleotides upstream of landmark to include in maximal spanning window, if possible
- flank_downstream: int
Number of nucleotides downstream of landmark to include in maximal spanning window, if possible
- window_funcfunc, optional
Function that defines a landmark in an individual region, and builds a window around that landmark over that region. As examples,
window_cds_start()
andwindow_cds_stop()
are provided, though any function that meets the following criteria can be used:It must take the same parameters as
window_cds_start()
It must return the same types as
window_cds_start()
Such functions could choose arbitrary features as landmarks, such as peaks in ribosome density, nucleic acid sequence features, transcript start or end sites, or any property that can be deduced from a
Transcript
. (Default:window_cds_start()
)- namestr or None, optional
Name for maximal spanning window, to which it’s ID attribute will be set. If None, a name will be generated.
- printerfile-like, optional
filehandle to write logging info to (Default:
NullWriter()
)
- Returns
- SegmentChain
Maximal spanning window, if regions share the same landmark. Otherwise, 0-length
SegmentChain
- int orobj:numpy.nan
Alignment offset to the window start, if the maximal spanning window itself is not long enough in the 5’ direction to include the entire distance specified by flank_upstream. Use this to align this window to other maximal spanning windows.
If regions do not share the same landmark,
numpy.nan
- plastid.bin.metagene.window_cds_start(transcript, flank_upstream, flank_downstream, ref_delta=0)[source]¶
Returns a window surrounding a start codon.
- Parameters
- transcript
Transcript
Transcript on which to generate window
- flank_upstreamint
Nucleotide length upstream of start codon to include in window, if transcript has a start codon
- flank_downstreamint
Nucleotide length downstream of start codon to include in window, if transcript has a start codon
- ref_deltaint, optional
Offset from start codon to the reference point. If 0, the landmark is the reference point. (Default: 0)
- transcript
- Returns
SegmentChain
Window surrounding start codon if transcript is coding. Otherwise, zero-length
SegmentChain
- int
Alignment offset to the window start, if transcript itself wasn’t long enough in the 5’ direction to include the entire distance specified by flank_upstream. Use this to align this window to other windows generated around start codons in other transcripts. If transcript is not coding, returns
numpy.nan
- (str, int, str)
Genomic coordinate of reference point as (chromosome name, coordinate, strand). If transcript has no start codon, returns
numpy.nan
- plastid.bin.metagene.window_cds_stop(transcript, flank_upstream, flank_downstream, ref_delta=0)[source]¶
Returns a window surrounding a stop codon.
- Parameters
- transcript
Transcript
Transcript on which to generate window
- flank_upstreamint
Nucleotide length upstream of stop codon to include in window, if transcript has a stop codon
- flank_downstreamint
Nucleotide length downstream of stop codon to include in window, if transcript has a stop codon
- ref_deltaint, optional
Offset from stop codon to the reference point. If 0, the landmark is the reference point. (Default: 0)
- transcript
- Returns
SegmentChain
Window surrounding stop codon if transcript is coding. Otherwise, zero-length
SegmentChain
- int
alignment offset to the window start, if transcript itself wasn’t long enough in the 5’ direction to include the entire distance specified by flank_upstream. Use this to align this window to other windows generated around stop codons in other transcripts. If transcript is not coding, returns
numpy.nan
- (str, int, str)
Genomic coordinate of reference point as (chromosome name, coordinate, strand). If transcript has no stop codon, returns
numpy.nan
- plastid.bin.metagene.window_landmark(region, flank_upstream=50, flank_downstream=50, ref_delta=0, landmark=0)[source]¶
Define a window surrounding a landmark in a region, if the region has such a landmark, (e.g. a start codon in a transcript), accounting for splicing of the region, if the region is discontinuous
- Parameters
- transcript
SegmentChain
orTranscript
Region on which to generate a window surrounding a landmark
- landmarkint
Position of the landmark within region
- flank_upstreamint
Nucleotides upstream of landmark to include in window
- flank_downstreamint
Nucleotides downstream of landmark to include in window
- ref_deltaint
Offset from landmark to the reference point. If 0, the landmark is the reference point. Default: 0
- transcript
- Returns
SegmentChain
Window of region surrounding landmark
- int
alignment offset to the window start, if region itself wasn’t long enough in the 5’ direction to include the entire distance specified by flank_upstream. Use this to align windows generated around similar landmarks from different regions (e.g. windows surrounding start codons in various transcripts).
- (str, int, str)
Genomic coordinate of reference point as (chromosome name, coordinate, strand)