plastid.bin.psite module

This script estimates P-site offsets, stratified by read length, in a ribosome profiling dataset. To do so, read alignments are mapped to their fiveprime ends, and a metagene profile surrounding the start codon is calculated separately for each read length.

The start codon peak for each read length is heuristically identified as the largest peak upstream of the start codon, or within a region defined by the user. The distance between that peak and the start codon itself is taken to be the P-site offset for that read length.
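The peak-picking heuristic described above can be sketched in a few lines of NumPy. This is an illustration only, not the script's actual implementation: the function name `estimate_offset` and the toy profile are hypothetical, and positions are assumed to be expressed relative to the start codon (position 0, negative upstream).

```python
import numpy as np

def estimate_offset(profile, positions):
    """Return the distance between the largest upstream peak and the start codon.

    `profile` holds metagene signal for one read length; `positions` holds the
    matching coordinates relative to the start codon (hypothetical layout).
    """
    profile = np.asarray(profile, dtype=float)
    positions = np.asarray(positions)
    upstream = np.where(positions < 0)[0]          # indices upstream of the start codon
    peak_index = upstream[np.argmax(profile[upstream])]
    return int(-positions[peak_index])             # distance from peak to start codon

# Toy profile: flat background with one peak 12 nt upstream of the start codon
positions = np.arange(-20, 11)
profile = np.ones(len(positions))
profile[positions == -12] = 9.0
offset = estimate_offset(profile, positions)
```

In this toy case the peak sits at position -12, so the estimated offset for that read length is 12.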

Notes

Generate an ROI file first
This script requires an ROI file of maximal spanning windows surrounding each gene’s start codon. This file can be generated by the generate subprogram of the metagene script.
Check the data
Users should examine the graphical output to make sure the P-site estimates are reasonable. If the data lack clear start codon peaks, the heuristic described above may select a spurious peak.
For RNase I only
This algorithm presumes that the RNase used to prepare the ribosome-protected footprints has no appreciable cutting bias, so that footprints may be clearly resolved to the edge of the ribosome.

Output files

OUTBASE_p_offsets.txt
Tab-delimited text file with two columns. The first is read length, and the second is the offset from the fiveprime end of reads of that length to the ribosomal P-site. This table can be supplied as the argument for --offset when using --fiveprime_variable mapping in any of the other scripts in plastid.bin.
OUTBASE_p_offsets.[svg | png | pdf | etc.]
Plot of metagene profiles for each read length, when reads are mapped to their 5’ ends and P-site offsets are applied.
OUTBASE_metagene_profiles.txt
Metagene profiles, stratified by read length, before P-site offsets are applied.
OUTBASE_K_rawcounts.txt
Saved if --keep is given on command line. Raw count vectors for each metagene window specified in input ROI file, without P-site mapping rules applied, for reads of length K.
OUTBASE_K_normcounts.txt
Saved if --keep is given on command line. Normalized count vectors for each metagene window specified in input ROI file, without P-site mapping rules applied, for reads of length K.

where OUTBASE is supplied by the user.
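As a rough illustration of how the two-column offsets table might be consumed downstream, the sketch below reads it into a dictionary keyed by read length. The helper name `read_p_offsets` is hypothetical, and the presence of a header row or `#` comment lines is an assumption; the parser simply skips any row whose two fields are not both integers.

```python
import io

def read_p_offsets(handle):
    """Parse a tab-delimited table of (read length, P-site offset) rows."""
    offsets = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):   # assumed comment convention
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        try:
            offsets[int(fields[0])] = int(fields[1])
        except ValueError:
            continue                            # header or other non-numeric row
    return offsets

# Hypothetical file contents standing in for OUTBASE_p_offsets.txt
example = io.StringIO("length\tp_offset\n28\t12\n29\t13\n")
table = read_p_offsets(example)
```

With a real file, `open("OUTBASE_p_offsets.txt")` would replace the `io.StringIO` object.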


Command-line arguments

Positional arguments

Argument Description
roi_file ROI file surrounding start codons, from metagene generate subprogram
outbase Basename for output files

Optional arguments

Argument Description
-h, --help show this help message and exit
--min_counts  N Minimum counts required in normalization region to be included in metagene average (Default: 10)
--normalize_over  N N Portion of each window against which its individual raw count profile will be normalized. Specify two integers, in nucleotide distance from landmark (negative for upstream, positive for downstream; surround negative numbers with quotes). (Default: 20 50)
--norm_region  N N Deprecated. Use --normalize_over instead. Formerly, the portion of each window against which its individual raw count profile was normalized. Specify two integers, in nucleotide distance from the 5’ end of the window. (Default: 70 100)
--require_upstream If supplied, the P-site offset is taken to be the distance between the largest peak upstream of the start codon and the start codon itself. Otherwise, the P-site offset is taken to be the distance between the largest peak in the entire ROI and the start codon. Ignored if --constrain is used.
--constrain  X X Constrain P-site offset to lie between the two specified distances from the start codon. Useful for noisy data. (Reasonable set: 10 15; default: not constrained)
--aggregate Estimate P-site from aggregate reads at each position, instead of median normalized read density. Noisier, but helpful for lower-count data or read lengths with few counts. (Default: False)
--keep Save intermediate count files. Useful for additional computations (Default: False)
--default  DEFAULT Default 5’ P-site offset for read lengths that are not present or evaluated in the dataset. Unaffected by --constrain (Default: 13)
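The interaction between --constrain and --default can be pictured with the following sketch. It is not the script's code: `choose_offset` is a hypothetical helper, the constraint is assumed to be inclusive at both ends, and an all-zero profile is used as a stand-in for a read length that is "not present or evaluated".

```python
import numpy as np

def choose_offset(profile, positions, constrain=None, default=13):
    """Pick an offset from the largest peak, optionally within a constrained range."""
    profile = np.asarray(profile, dtype=float)
    positions = np.asarray(positions)
    offsets = -positions                      # distance from each position to the start codon
    if profile.size == 0 or profile.sum() == 0:
        return default                        # nothing to evaluate: fall back to the default
    if constrain is None:
        mask = np.ones(profile.shape, dtype=bool)
    else:
        low, high = constrain                 # assumed inclusive bounds
        mask = (offsets >= low) & (offsets <= high)
    if not mask.any():
        return default
    candidates = np.where(mask)[0]
    best = candidates[np.argmax(profile[candidates])]
    return int(offsets[best])

positions = np.arange(-20, 11)
noisy = np.ones(len(positions))
noisy[positions == -12] = 5.0                 # plausible start codon peak
noisy[positions == 3] = 9.0                   # larger, spurious downstream peak

unconstrained = choose_offset(noisy, positions)
constrained = choose_offset(noisy, positions, constrain=(10, 15))
missing = choose_offset(np.zeros(len(positions)), positions)
```

Here the unconstrained search is drawn to the spurious peak, while constraining the offset to 10 through 15 recovers the peak 12 nt upstream. That is the kind of noisy data for which --constrain is suggested.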

Warning/error options

Argument Description
-q, --quiet Suppress all warning messages. Cannot use with ‘-v’.
-v, --verbose Increase verbosity. With ‘-v’, show every warning. With ‘-vv’, turn warnings into exceptions. Cannot use with ‘-q’. (Default: show each type of warning once)

Count & alignment file options

Open alignment or count files and optionally set mapping rules

Argument Description
--count_files  COUNT_FILES [COUNT_FILES ...] One or more count or alignment file(s) from a single sample or set of samples to be pooled.
--countfile_format  {BAM} Format of file containing alignments or counts (Default: BAM)
--sum  SUM Sum used in normalization of counts and RPKM/RPNT calculations (Default: total mapped reads/counts in dataset)
--min_length  N Minimum read length required to be included (BAM & bowtie files only. Default: 25)
--max_length  N Maximum read length permitted to be included (BAM & bowtie files only. Default: 100)

Plotting options

Argument Description
--figformat  {eps,jpeg,jpg,pdf,pgf,png,ps,raw,rgba,svg,svgz,tif,tiff} File format for figure(s) (Default: png)
--figsize  N N Figure width and height, in inches. (Default: use matplotlibrc params)
--title  TITLE Base title for plot(s).
--cmap  CMAP Matplotlib color map from which palette will be made (e.g. ‘Blues’, ‘autumn’, ‘Set1’; default: use colors from --stylesheet if given, or color cycle in matplotlibrc)
--dpi  DPI Figure resolution (Default: 150)
--stylesheet  {seaborn-darkgrid,seaborn-notebook,classic,seaborn-ticks,grayscale,bmh,seaborn-talk,dark_background,ggplot,fivethirtyeight,seaborn-colorblind,seaborn-deep,seaborn-whitegrid,seaborn-bright,seaborn-poster,seaborn-muted,seaborn-paper,seaborn-white,seaborn-pastel,seaborn-dark,seaborn-dark-palette} Use this matplotlib stylesheet instead of matplotlibrc params

Script contents

plastid.bin.psite.do_count(roi_table, ga, norm_start, norm_end, min_counts, min_len, max_len, aggregate=False, printer=NullWriter())

Calculate a metagene profile for each read length in the dataset

Parameters:
roi_table : pandas.DataFrame

Table specifying regions of interest, generated by plastid.bin.metagene.do_generate()

ga : BAMGenomeArray

Count data

norm_start : int

Coordinate in window specifying normalization region start

norm_end : int

Coordinate in window specifying normalization region end

min_counts : float

Minimum number of counts in window[norm_start:norm_end] required for inclusion in metagene profile

min_len : int

Minimum read length to include

max_len : int

Maximum read length to include

aggregate : bool, optional

Estimate P-site from aggregate reads at each position, instead of median normalized read density. Potentially noisier, but helpful for lower-count data or read lengths with few counts. (Default: False)

printer : file-like, optional

filehandle to write logging info to (Default: NullWriter())

Returns:
dict

Dictionary of numpy.ndarrays of raw counts at each position (column) for each window (row)

dict

Dictionary of numpy.ndarrays of normalized counts at each position (column) for each window (row), normalized by the total number of counts in that row from norm_start to norm_end

pandas.DataFrame

Metagene profile of median normalized counts at each position across all windows, and the number of windows included in the calculation of each median, stratified by read length
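The normalization and averaging that the parameters and return values above describe can be approximated as follows. This is a simplified sketch under stated assumptions, not the body of do_count(): the helper `metagene_profile` is hypothetical, it operates on a plain count matrix for a single read length rather than a BAMGenomeArray, and whether the min_counts filter also applies in aggregate mode is assumed here.

```python
import numpy as np

def metagene_profile(raw_counts, norm_start, norm_end, min_counts=10, aggregate=False):
    """Return (profile, number of windows used) for one read length.

    Rows of `raw_counts` are windows; columns are positions within the window.
    """
    raw = np.asarray(raw_counts, dtype=float)
    norm_sums = raw[:, norm_start:norm_end].sum(axis=1)
    # Keep windows with enough counts in the normalization region
    keep = (norm_sums >= min_counts) & (norm_sums > 0)
    kept = raw[keep]
    n_windows = int(keep.sum())
    if n_windows == 0:
        return np.zeros(raw.shape[1]), 0
    if aggregate:
        return kept.sum(axis=0), n_windows            # summed raw counts per position
    normalized = kept / norm_sums[keep][:, None]      # per-window normalization
    return np.median(normalized, axis=0), n_windows   # median across windows

raw = [
    [0, 10, 0, 5, 5],
    [0, 20, 0, 10, 10],
    [1, 0, 0, 1, 0],      # too few counts in columns 3..4; excluded
]
profile, n = metagene_profile(raw, norm_start=3, norm_end=5, min_counts=10)
agg_profile, agg_n = metagene_profile(raw, 3, 5, min_counts=10, aggregate=True)
```

Taking the median of per-window normalized counts keeps a few highly expressed genes from dominating the profile. The aggregate mode trades that robustness for more signal when counts are sparse.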

plastid.bin.psite.main(argv=sys.argv[1:])

Command-line program

Parameters:
argv : list, optional

A list of command-line arguments, which will be processed as if the script were called from the command line if main() is called directly.

Default: sys.argv[1:], the command-line arguments supplied when the script is invoked from the command line.
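Because main() accepts an argument list, the script can in principle be driven from Python. The sketch below assembles such a list from options documented above. The helper names `build_psite_argv` and `run_psite` and all file names are hypothetical, and the final call requires plastid to be installed, so it is wrapped in a function rather than executed.

```python
def build_psite_argv(roi_file, outbase, count_files,
                     min_length=25, max_length=35, constrain=None):
    """Assemble a command-line argument list for the psite script."""
    argv = [roi_file, outbase, "--count_files"] + list(count_files)
    argv += ["--min_length", str(min_length), "--max_length", str(max_length)]
    if constrain is not None:
        argv += ["--constrain", str(constrain[0]), str(constrain[1])]
    return argv

def run_psite(argv):
    """Invoke the script's entry point (requires plastid to be installed)."""
    from plastid.bin.psite import main
    main(argv)

# Hypothetical input files; replace with a real ROI file and alignment file
argv = build_psite_argv("demo_rois.txt", "demo", ["reads.bam"], constrain=(10, 15))
```

Calling `run_psite(argv)` would then behave as if the same arguments had been typed on the command line.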