plastid.bin.metagene module

Performs metagene analyses. The workflow is separated into the following subprograms:

Generate

A metagene profile is a position-wise average over all genes in the vicinity of an interesting landmark (e.g. a start codon). Because genes can have multiple transcript isoforms that may cover different genomic positions, which transcript positions (and therefore which genomic positions) to include in the average can be ambiguous when the isoforms are not knnow.

To handle this problem, we define for each gene the maximal spanning window over which every position at a given distance from the landmark of interest (e.g. a start or stop codon) maps to the same genomic coordinates in all transcript isoforms. The windows are defined by the following algorithm:

  1. Transcripts are grouped by gene.

  2. Landmarks are detected on each transcript for each gene. For genes in which all transcripts do not share the same genomic coordinate for the landmark of interest, no window can be defined, and that gene is excluded from further analysis.

  3. For each set of transcripts that passes step (2), the maximal spanning window is created by aligning the set of transcripts at the landmark, and bidirectionally growing the maximal spanning window until either:

    • the next nucleotide position added no longer corresponds to the same genomic position in all transcripts
    • the window reaches the maximum user-specified size

Note: if annotations are supplied as BED files, transcripts cannot be grouped by gene, because BED files don’t contain this information. In this case one ROI is generated per transcript.

Output files

OUTBASE_rois.txt

A tab-delimited text file describing the maximal spanning window for each gene, with columns as follows:

Column Contains
alignment_offset Offset to align window to all other windows in the file, if the window happens to be shorter on the 5’ end than specified in --flank_upstream. Typically this is 0.
region_id ID of region (e.g. gene) from which window was made
region maximal spanning window, formatted as chromosome:start-end:(strand)
window_size with of window
zero_point distance from 5’ end of window to landmark
OUTBASE_rois.bed
Maximal spanning windows in BED format for visualization in a genome browser. The thickly-rendered portion of a window indicates its landmark

where OUTBASE is supplied by the user.

Count

This program generates metagene profiles from a dataset of counts or alignments, taking the following steps:

  1. The raw counts at each position in each maximal spanning window (from the generate subprogram) fetched as a raw count vector for the window.
  2. A normalized count vector is created for each window by dividing its raw count vector by the total number of counts occurring within a user-defined normalization region within the window.
  3. A metagene average is created by taking aligning all of the normalized count vectors, and taking the median normalized counts over all vectors at each nucleotide position. Count vectors deriving from windows that don’t meet a minimum count threshold (set via the --norm_region option) are excluded.

Output files

Raw count vectors, normalized count vectors, and metagene profiles are all saved as tab-delimited text files, for subsequent plotting, filtering, or reanalysis.

OUTBASE_metagene_profile.txt

Tab-delimited table of metagene profile, containing the following columns:

Column Contains
x Distance in nucleotides from the landmark
metagene_average Value of metagene average at that position
regions_counted Number of maximal spanning windows included at that position in the average. i.e. windows that both met the threshold set by --min_counts and were not masked at that position by a mask file
OUTBASE_rawcounts.txt
Saved if --keep is specified. Table of raw counts. Each row is a maximal spanning window for a gene, and each column a nucleotide position in that window. All windows are aligned at the landmark.
OUTBASE_normcounts.txt
Saved if --keep is specified. Table of normalized counts, produced by dividing each row in the raw count table by the of counts in that row within the columns specified by --normalize_over.
OUTBASE_mask.txt
Saved if --keep is specified. Matrix of masks indicating which cells in normcounts were excluded from computations.
OUTBASE_metagene_overview.[png | svg | pdf | et c…]
Metagene average plotted above a heatmap of normalized counts, in which each row of pixels is a maximal spanning window for a gene, and rows are sorted by the column in which they reach their max value. To facilitate visualization, colors in the heatmap are scaled from 0 to the 95th percentile of non-zero numbers in the normalized counts

OUTBASE is supplied by the user.

Chart
One or more metagene profiles generated by the count subprogram, for example, on different datasets, are plotted against each other.

See command-line help for each subprogram for details on parameters for each


Command-line arguments

Optional arguments

Argument Description
-h, --help show this help message and exit

Subcommands

choose one of the following

Argument Description
generate Create ROI file from genome annotation
count Count reads falling into regions of interest, normalize, and average into a metagene profile
chart Plot metagene profiles

generate subcommand


Group regions of interest by a shared attribute, and generate

maximal spanning windows for them. Results are given in a table suitable for use in count subprogram. Windows are generated by the following algorithm:

  1. Transcripts are grouped by group_by attribute (default: by gene). If all transcripts in a group share the same genomic coordinate for the landmark of interest (for example, if all share the same start codon), then the set of transcripts is included in the analysis. If not, the set of transcripts and their associated gene are excluded from further processing.

  2. For each set of transcripts that pass step (1), the maximal spanning window is created by aligning the set of transcripts at the landmark, and adding nucleotide positions in transcript coordinates to the growing window in both 5’ and 3’ directions until either:

    • the next nucleotide position added is no longer corresponds to the same genomic position in all transcripts
    • the window reaches the maximum user-specified size

Positional arguments
Argument Description
outbase Basename for output files
Optional arguments
Argument Description
-h, --help show this help message and exit
--landmark  {cds_start,cds_stop} Landmark around which to build metagene profile (Default: cds_start)
--upstream  UPSTREAM Nucleotides to include upstream of landmark (Default: 50)
--downstream  DOWNSTREAM Nucleotides to include downstream of landmark (Default: 50)
--group_by  GROUP_BY Attribute (e.g. in GTF2/GFF3 column 9) by which to group regions before generating maximal spanning windows (Default: group transcripts by gene using ‘gene_id’ attribute from GTF2, or ‘Parent’ attribute in GFF3)
Warning/error options
Argument Description
-q, --quiet Suppress all warning messages. Cannot use with ‘-v’.
-v, --verbose Increase verbosity. With ‘-v’, show every warning. With ‘-vv’, turn warnings into exceptions. Cannot use with ‘-q’. (Default: show each type of warning once)
Annotation file options (one or more annotation files required)

Open one or more genome annotation files

Argument Description
--annotation_files  infile.[BED | BigBed | GTF2 | GFF3] [infile.[BED | BigBed | GTF2 | GFF3] ...] Zero or more annotation files (max 1 file if BigBed)
--annotation_format  {BED,BigBed,GTF2,GFF3} Format of annotation_files (Default: GTF2). Note: GFF3 assembly assumes SO v.2.5.2 feature ontologies, which may or may not match your specific file.
--add_three If supplied, coding regions will be extended by 3 nucleotides at their 3’ ends (except for GTF2 files that explicitly include stop_codon features). Use if your annotation file excludes stop codons from CDS.
--tabix annotation_files are tabix-compressed and indexed (Default: False). Ignored for BigBed files.
--sorted annotation_files are sorted by chromosomal position (Default: False)
Bed-specific options
Argument Description
--bed_extra_columns  BED_EXTRA_COLUMNS [BED_EXTRA_COLUMNS ...] Number of extra columns in BED file (e.g. in custom ENCODE formats) or list of names for those columns. (Default: 0).
--mask_bed_extra_columns  MASK_BED_EXTRA_COLUMNS [MASK_BED_EXTRA_COLUMNS ...] Number of extra columns in BED file (e.g. in custom ENCODE formats) or list of names for those columns. (Default: 0).
Bigbed-specific options
Argument Description
--maxmem  MAXMEM Maximum desired memory footprint in MB to devote to BigBed/BigWig files. May be exceeded by large queries. (Default: 0, No maximum)
--mask_maxmem  MASK_MAXMEM Maximum desired memory footprint in MB to devote to BigBed/BigWig files. May be exceeded by large queries. (Default: 0, No maximum)
Gff3-specific options
Argument Description
--gff_transcript_types  GFF_TRANSCRIPT_TYPES [GFF_TRANSCRIPT_TYPES ...] GFF3 feature types to include as transcripts, even if no exons are present (for GFF3 only; default: use SO v2.5.3 specification)
--gff_exon_types  GFF_EXON_TYPES [GFF_EXON_TYPES ...] GFF3 feature types to include as exons (for GFF3 only; default: use SO v2.5.3 specification)
--gff_cds_types  GFF_CDS_TYPES [GFF_CDS_TYPES ...] GFF3 feature types to include as CDS (for GFF3 only; default: use SO v2.5.3 specification)
--mask_gff_transcript_types  MASK_GFF_TRANSCRIPT_TYPES [MASK_GFF_TRANSCRIPT_TYPES ...] GFF3 feature types to include as transcripts, even if no exons are present (for GFF3 only; default: use SO v2.5.3 specification)
--mask_gff_exon_types  MASK_GFF_EXON_TYPES [MASK_GFF_EXON_TYPES ...] GFF3 feature types to include as exons (for GFF3 only; default: use SO v2.5.3 specification)
--mask_gff_cds_types  MASK_GFF_CDS_TYPES [MASK_GFF_CDS_TYPES ...] GFF3 feature types to include as CDS (for GFF3 only; default: use SO v2.5.3 specification)
Mask file options (optional)

Add mask file(s) that annotate regions that should be excluded from analyses (e.g. repetitive genomic regions).

Argument Description
--mask_annotation_files  infile.[BED | BigBed | GTF2 | GFF3 | PSL] [infile.[BED | BigBed | GTF2 | GFF3 | PSL] ...] Zero or more annotation files (max 1 file if BigBed)
--mask_annotation_format  {BED,BigBed,GTF2,GFF3,PSL} Format of mask_annotation_files (Default: GTF2). Note: GFF3 assembly assumes SO v.2.5.2 feature ontologies, which may or may not match your specific file.
--mask_add_three If supplied, coding regions will be extended by 3 nucleotides at their 3’ ends (except for GTF2 files that explicitly include stop_codon features). Use if your annotation file excludes stop codons from CDS.
--mask_tabix mask_annotation_files are tabix-compressed and indexed (Default: False). Ignored for BigBed files.
--mask_sorted mask_annotation_files are sorted by chromosomal position (Default: False)

count subcommand


Calculate a metagene average over maximal spanning windows specified in roi_table, taking the following steps:

  1. The raw counts at each position in each maximal spanning window (from the generate subprogram) are totaled to create a raw count vector for the ROI.
  2. A normalized count vector is created fore each window by dividing its raw count vector by the total number of counts occurring within a user-defined normalization window within the window.
  3. A metagene average is created by taking aligning all of the normalized count vectors, and taking the median normalized counts over all vectors at each nucleotide position. Count vectors deriving from ROIs that don’t meet a minimum count threshold (set via the --norm_region option) are excluded.

Positional arguments
Argument Description
roi_file Text file containing maximal spanning windows and offsets, generated by the generate subprogram.
outbase Basename for output files
Optional arguments
Argument Description
-h, --help show this help message and exit
--min_counts  N Minimum counts required in normalization region to be included in metagene average (Default: 10)
--normalize_over  N N Portion of each window against which its individual raw count profile will be normalized. Specify two integers, in nucleotide distance from landmark (negative for upstream, positive for downstream. Surround negative numbers with quotes.). (Default: 20 50)
--norm_region  N N Deprecated. Use --normalize_over instead. Formerly, Portion of each window against which its individual raw count profile will be normalized. Specify two integers, in nucleotide distance, from 5’ end of window. (Default: 70 100)
--landmark  LANDMARK Name of landmark at zero point, optional.
--use_mean If supplied, use columnwise mean to generate profile (Default: use median)
--keep Save intermediate count files. Useful for additional computations (Default: False)
Warning/error options
Argument Description
-q, --quiet Suppress all warning messages. Cannot use with ‘-v’.
-v, --verbose Increase verbosity. With ‘-v’, show every warning. With ‘-vv’, turn warnings into exceptions. Cannot use with ‘-q’. (Default: show each type of warning once)
Count & alignment file options

Open alignment or count files and optionally set mapping rules

Argument Description
--count_files  COUNT_FILES [COUNT_FILES ...] One or more count or alignment file(s) from a single sample or set of samples to be pooled.
--countfile_format  {BAM,bigwig,bowtie,wiggle} Format of file containing alignments or counts (Default: BAM)
--sum  SUM Sum used in normalization of counts and RPKM/RPNT calculations (Default: total mapped reads/counts in dataset)
--min_length  N Minimum read length required to be included (BAM & bowtie files only. Default: 25)
--max_length  N Maximum read length permitted to be included (BAM & bowtie files only. Default: 100)
--maxmem  MAXMEM Maximum desired memory footprint in MB to devote to BigBed/BigWig files. May be exceeded by large queries. (Default: 0, No maximum)
--big_genome Use slower but memory-efficient implementation for big genomes or for memory-limited computers. For wiggle & bowtie files only.
Alignment mapping functions (bam & bowtie files only)

For BAM or bowtie files, one of the mutually exclusive read mapping functions is required:

Argument Description
--fiveprime_variable Map read alignment to a variable offset from 5’ position of read, with offset determined by read length. Requires –offset below
--fiveprime Map read alignment to 5’ position.
--threeprime Map read alignment to 3’ position
--center Subtract N positions from each end of read, and add 1/(length-N), to each remaining position, where N is specified by –nibble
Filtering and alignment mapping options

The remaining arguments are optional and affect the behavior of specific mapping functions:

Argument Description
--offset  OFFSET For –fiveprime or –threeprime, provide an integer representing the offset into the read, starting from either the 5’ or 3’ end, at which data should be plotted. For –fiveprime_variable, provide the filename of a two-column tab-delimited text file, in which first column represents read length or the special keyword ‘default’, and the second column represents the offset from the five prime end of that read length at which the read should be mapped. (Default: 0)
--nibble  N For use with –center only. nt to remove from each end of read before mapping (Default: 0)

—stylesheet {seaborn-darkgrid,seaborn-notebook,classic,seaborn-ticks,grayscale,bmh,seaborn-talk,dark_background,ggplot,fivethirtyeight,seaborn-colorblind,seaborn-deep,seaborn-whitegrid,seaborn-bright,seaborn-poster,seaborn-muted,seaborn-paper,seaborn-white,seaborn-pastel,seaborn-dark,seaborn-dark-palette}

Plotting options
Argument Description
--figformat  {eps,jpeg,jpg,pdf,pgf,png,ps,raw,rgba,svg,svgz,tif,tiff} File format for figure(s); Default: png)
--figsize  N N Figure width and height, in inches. (Default: use matplotlibrc params)
--title  TITLE Base title for plot(s).
--cmap  CMAP Matplotlib color map from which palette will be made (e.g. ‘Blues’,’autumn’,’Set1’; default: use colors from --stylesheet if given, or color cycle in matplotlibrc)
--dpi  DPI Figure resolution (Default: 150) Use this matplotlib stylesheet instead of matplotlibrc params

chart subcommand


Plot metagene profiles against one another


Positional arguments
Argument Description
outbase Basename for output file.
infiles One or more metagene profiles, generated by the count subprogram, which will be plotted together.
Optional arguments
Argument Description
-h, --help show this help message and exit
--labels  LABELS [LABELS ...] Sample names for each metagene profile (optional).
--landmark  LANDMARK Name of landmark at zero point (e.g. ‘CDS start’ or ‘CDS stop’; optional)
Warning/error options
Argument Description
-q, --quiet Suppress all warning messages. Cannot use with ‘-v’.
-v, --verbose Increase verbosity. With ‘-v’, show every warning. With ‘-vv’, turn warnings into exceptions. Cannot use with ‘-q’. (Default: show each type of warning once)

—stylesheet {seaborn-darkgrid,seaborn-notebook,classic,seaborn-ticks,grayscale,bmh,seaborn-talk,dark_background,ggplot,fivethirtyeight,seaborn-colorblind,seaborn-deep,seaborn-whitegrid,seaborn-bright,seaborn-poster,seaborn-muted,seaborn-paper,seaborn-white,seaborn-pastel,seaborn-dark,seaborn-dark-palette}

Plotting options
Argument Description
--figformat  {eps,jpeg,jpg,pdf,pgf,png,ps,raw,rgba,svg,svgz,tif,tiff} File format for figure(s); Default: png)
--figsize  N N Figure width and height, in inches. (Default: use matplotlibrc params)
--title  TITLE Base title for plot(s).
--cmap  CMAP Matplotlib color map from which palette will be made (e.g. ‘Blues’,’autumn’,’Set1’; default: use colors from --stylesheet if given, or color cycle in matplotlibrc)
--dpi  DPI Figure resolution (Default: 150) Use this matplotlib stylesheet instead of matplotlibrc params

Script contents

plastid.bin.metagene.do_chart(args, plot_parser, printer=NullWriter())[source]

Plot metagene profiles against one another

Parameters:
args : argparse.Namespace

Namespace containing arguments

printer : file-like, optional

Anything implementing a write() method, for logging purposes.

Returns:
:py:class:`matplotlib.Figure`
plastid.bin.metagene.do_count(args, alignment_parser, plot_parser, printer=NullWriter())[source]

Calculate a metagene average over maximal spanning windows specified in roi_table, taking the following steps:

  1. The raw counts at each position in each maximal spanning window (from the generate subprogram) are totaled to create a raw count vector for the ROI.
  2. A normalized count vector is created fore each window by dividing its raw count vector by the total number of counts occurring within a user-defined normalization window within the window.
  3. A metagene average is created by taking aligning all of the normalized count vectors, and taking the median normalized counts over all vectors at each nucleotide position. Count vectors deriving from ROIs that don’t meet a minimum count threshold (set via the --norm_region option) are excluded.
Parameters:
args : argparse.Namespace

Namespace containing arguments

printer : file-like, optional

Anything implementing a write() method, for logging purposes.

Returns:
:py:class:`numpy.ndarray`

raw counts at each position (column) in each window (row)

:py:class:`numpy.ndarray`

counts at each position (column) in each window (row), normalized by the total number of counts in that row from norm_start to norm_end

:class:`pandas.DataFrame`

Metagene profile of median normalized counts at each position across all windows, and the number of windows included in the calculation of each median

plastid.bin.metagene.group_regions_make_windows(source, mask_hash, flank_upstream, flank_downstream, window_func=<function window_cds_start>, is_sorted=False, group_by='gene_id', printer=NullWriter())[source]

Group regions of interest by a shared attribute, and generate maximal spanning windows for them. Results are given in a table suitable for use in count subprogram. Windows are generated by the following algorithm:

  1. Transcripts are grouped by group_by attribute (default: by gene). If all transcripts in a group share the same genomic coordinate for the landmark of interest (for example, if all share the same start codon), then the set of transcripts is included in the analysis. If not, the set of transcripts and their associated gene are excluded from further processing.

  2. For each set of transcripts that pass step (1), the maximal spanning window is created by aligning the set of transcripts at the landmark, and adding nucleotide positions in transcript coordinates to the growing window in both 5’ and 3’ directions until either:

    • the next nucleotide position added is no longer corresponds to the same genomic position in all transcripts
    • the window reaches the maximum user-specified size
Parameters:
source : list or generator

Source of SegmentChain or Transcript objects, preferably with gene_id and transcript_id (e.g. transcripts assembled from a GTF2 or GFF3 file), so that transcripts can be grouped by gene when making maximal spanning windows.

mask_hash : GenomeHash

GenomeHash containing regions to exclude from analysis

flank_upstream : int

Number of nucleotides upstream of landmark to include in windows (in transcript coordinates)

flank_downstream: int

Number of nucleotides downstream of landmark to include in windows (in transcript coordinates)

window_func : func, optional

Function that defines a landmark in an individual transcript, and builds a window around that landmark over that region. As examples, window_cds_start() and window_cds_stop() are provided, though any function that meets the following criteria can be used:

  1. It must take the same parameters as window_cds_start()
  2. It must return the same types as window_cds_start()

Such functions could choose arbitrary features as landmarks, such as peaks in ribosome density, nucleic acid sequence features, transcript start or end sites, or any property that can be deduced from a Transcript. (Default: window_cds_start())

group_by : str, optional

Attribute by which SegmentChains or Transcripts should be grouped before generating maximal spanning windows (Default: “gene_id”)

is_sorted : bool, optional

Input file is sorted and/or tabix-indexed. If True, group_regions_make_windows() will take advantage of this to save memory. (Default: False)

printer : file-like, optional

filehandle to write logging info to (Default: NullWriter())

Returns:
:class:`pandas.DataFrame`

A pandas.DataFrame containing the following columns describing the maximal spanning windows:

Column Contains
alignment_offset Offset to align window to all other windows in the file from the 5’ end, if the window happens to be shorter on the 5’ end than specified in flank_upstream
region_id ID of region, given by shared value of group_by parameter (i.e. by default is ‘gene_id’)
region Maximal spanning window, formatted as chromosome:start-end:(strand)
region_length Length of maximal spanning window
region_bed Maximal spanning window, formatted as a BED line
window_size Requested length of maximal spanning window. May be larger than actual window if alignment_offset or threeprime_offset is nonzero
zero_point Distance from 5’ end of window to landmark, including alignment_offset
threeprime_offset Offset to align window to all other windows the file from the 3’ end, if the window happens to be shorter on the 3’ end than specified in flank_downstream
list

List of SegmentChain representing each window. These data are also represented as strings in the pandas.DataFrame

Notes

Not all genes will be included in the output if, for example, there isn’t a position set common to all transcripts surrounding the landmark

plastid.bin.metagene.main(argv=['-T', '-E', '-b', 'readthedocs', '-d', '_build/doctrees-readthedocs', '-D', 'language=en', '.', '_build/html'])[source]

Command-line program

Parameters:
argv : list, optional

A list of command-line arguments, which will be processed as if the script were called from the command line if main() is called directly.

Default: sys.argv[1:]. The command-line arguments, if the script is invoked from the command line

plastid.bin.metagene.maximal_spanning_window(regions, mask_hash, flank_upstream, flank_downstream, window_func=<function window_cds_start>, name=None, printer=NullWriter())[source]

Create a maximal spanning window over regions surrounding a landmark,

The maximal spanning window is created by:

  1. Applying window_func to each region in regions to create a sub-window of region that surrounds a landmark identified by window_func, with up to flank_upstream bases 5’ of the landmark, and flank_downstream bases 3` of the landmark.

  2. If the landmark in all regions corresponds to the same genomic position, a maximal spanning window is created by starting at the landmark, and growing the window in the 5’ and 3’ directions along all regions until either:

    • the next nucleotide position added no longer corresponds to the same genomic position in all regions
    • the window reaches the maximum size specified by`flank_upstream` (in 5’ direction) or flank_downstream (in 3’ direction)
Parameters:
regions : list

List of SegmentChains or Transcripts

mask_hash : GenomeHash

GenomeHash containing regions to exclude from analysis

flank_upstream : int

Number of nucleotides upstream of landmark to include in maximal spanning window, if possible

flank_downstream: int

Number of nucleotides downstream of landmark to include in maximal spanning window, if possible

window_func : func, optional

Function that defines a landmark in an individual region, and builds a window around that landmark over that region. As examples, window_cds_start() and window_cds_stop() are provided, though any function that meets the following criteria can be used:

  1. It must take the same parameters as window_cds_start()
  2. It must return the same types as window_cds_start()

Such functions could choose arbitrary features as landmarks, such as peaks in ribosome density, nucleic acid sequence features, transcript start or end sites, or any property that can be deduced from a Transcript. (Default: window_cds_start())

name : str or None, optional

Name for maximal spanning window, to which it’s ID attribute will be set. If None, a name will be generated.

printer : file-like, optional

filehandle to write logging info to (Default: NullWriter())

Returns:
SegmentChain

Maximal spanning window, if regions share the same landmark. Otherwise, 0-length SegmentChain

int or :obj:`numpy.nan`

Alignment offset to the window start, if the maximal spanning window itself is not long enough in the 5’ direction to include the entire distance specified by flank_upstream. Use this to align this window to other maximal spanning windows.

If regions do not share the same landmark, numpy.nan

plastid.bin.metagene.window_cds_start(transcript, flank_upstream, flank_downstream, ref_delta=0)[source]

Returns a window surrounding a start codon.

Parameters:
transcript : Transcript

Transcript on which to generate window

flank_upstream : int

Nucleotide length upstream of start codon to include in window, if transcript has a start codon

flank_downstream : int

Nucleotide length downstream of start codon to include in window, if transcript has a start codon

ref_delta : int, optional

Offset from start codon to the reference point. If 0, the landmark is the reference point. (Default: 0)

Returns:
|SegmentChain|

Window surrounding start codon if transcript is coding. Otherwise, zero-length SegmentChain

int

Alignment offset to the window start, if transcript itself wasn’t long enough in the 5’ direction to include the entire distance specified by flank_upstream. Use this to align this window to other windows generated around start codons in other transcripts. If transcript is not coding, returns numpy.nan

(str, int, str)

Genomic coordinate of reference point as (chromosome name, coordinate, strand). If transcript has no start codon, returns numpy.nan

plastid.bin.metagene.window_cds_stop(transcript, flank_upstream, flank_downstream, ref_delta=0)[source]

Returns a window surrounding a stop codon.

Parameters:
transcript : Transcript

Transcript on which to generate window

flank_upstream : int

Nucleotide length upstream of stop codon to include in window, if transcript has a stop codon

flank_downstream : int

Nucleotide length downstream of stop codon to include in window, if transcript has a stop codon

ref_delta : int, optional

Offset from stop codon to the reference point. If 0, the landmark is the reference point. (Default: 0)

Returns:
|SegmentChain|

Window surrounding stop codon if transcript is coding. Otherwise, zero-length SegmentChain

int

alignment offset to the window start, if transcript itself wasn’t long enough in the 5’ direction to include the entire distance specified by flank_upstream. Use this to align this window to other windows generated around stop codons in other transcripts. If transcript is not coding, returns numpy.nan

(str, int, str)

Genomic coordinate of reference point as (chromosome name, coordinate, strand). If transcript has no stop codon, returns numpy.nan

plastid.bin.metagene.window_landmark(region, flank_upstream=50, flank_downstream=50, ref_delta=0, landmark=0)[source]

Define a window surrounding a landmark in a region, if the region has such a landmark, (e.g. a start codon in a transcript), accounting for splicing of the region, if the region is discontinuous

Parameters:
transcript : SegmentChain or Transcript

Region on which to generate a window surrounding a landmark

landmark : int

Position of the landmark within region

flank_upstream : int

Nucleotides upstream of landmark to include in window

flank_downstream : int

Nucleotides downstream of landmark to include in window

ref_delta : int

Offset from landmark to the reference point. If 0, the landmark is the reference point. Default: 0

Returns:
|SegmentChain|

Window of region surrounding landmark

int

alignment offset to the window start, if region itself wasn’t long enough in the 5’ direction to include the entire distance specified by flank_upstream. Use this to align windows generated around similar landmarks from different regions (e.g. windows surrounding start codons in various transcripts).

(str, int, str)

Genomic coordinate of reference point as (chromosome name, coordinate, strand)