This script estimates P-site offsets, stratified by read length, in a ribosome profiling dataset. To do so, read alignments are mapped to their fiveprime ends, and a metagene profile surrounding the start codon is calculated separately for each read length.
The start codon peak for each read length is heuristically identified as the largest peak upstream of the start codon, or within a region defined by the user. The distance between that peak and the start codon itself is taken to be the P-site offset for that read length.
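The peak-finding heuristic described above can be sketched as follows. This is a minimal illustration, not the script's actual implementation: the function name `estimate_offset`, the `upstream_flank` parameter (the index of the start codon within the metagene window), and the toy profiles are all assumptions made for this example.

```python
def estimate_offset(profile, upstream_flank, upstream_only=True):
    """Estimate a P-site offset from one read length's metagene profile.

    profile        : sequence of metagene values for 5'-end-mapped reads
    upstream_flank : hypothetical index of the start codon in the window
    upstream_only  : if True, search only positions upstream of the start codon
    """
    end = upstream_flank if upstream_only else len(profile)
    # Position of the largest peak in the searched region (first one on ties)
    peak = max(range(end), key=lambda i: profile[i])
    # Offset = distance from the peak to the start codon
    return upstream_flank - peak


# Toy profiles (made-up numbers), one per read length; start codon at index 15
p28 = [1.0] * 20
p28[3] = 10.0
p29 = [1.0] * 20
p29[2] = 8.0
p29[17] = 20.0  # larger peak downstream of the start codon

profiles = {28: p28, 29: p29}
offsets = {length: estimate_offset(p, 15) for length, p in profiles.items()}
whole_window_29 = estimate_offset(p29, 15, upstream_only=False)
```

In this toy case, restricting the search to upstream positions keeps the downstream peak in the 29-nt profile from being mistaken for the start codon peak.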
- Generate an ROI file first
- This script requires an ROI file of maximal spanning windows
surrounding each gene’s start codon. This file can be generated by the
generate subprogram of the
- Check the data
- Users should examine the graphical output to confirm that the P-site estimates are reasonable. If the data lack clear start codon peaks, the heuristic described above will give unreliable estimates.
- For RNase I only
- This algorithm presumes that the RNase used to prepare the ribosome-protected footprints has no appreciable cutting bias, so that footprints may be clearly resolved to the edge of the ribosome.
- Tab-delimited text file with two columns. The first is the read length, and the second is the offset from the fiveprime end of reads of that length to the ribosomal P-site. This table can be supplied as the argument for
--fiveprime_variable mapping in any of the other scripts in
- OUTBASE_p_offsets.[svg | png | pdf | etc]
- Plot of metagene profiles for each read length, when reads are mapped to their 5’ ends, before P-site offsets are applied.
- Metagene profiles, stratified by read length, before P-site offsets are applied.
- Saved if
--keep is given on the command line. Raw count vectors for each metagene window specified in the input ROI file, without P-site mapping rules applied, for reads of length K.
- Saved if
--keep is given on the command line. Normalized count vectors for each metagene window specified in the input ROI file, without P-site mapping rules applied, for reads of length K.
where OUTBASE is supplied by the user.
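For downstream use, the two-column offsets table can be read with a few lines of Python. The sketch below assumes only what is stated above (tab-delimited, read length then offset). Skipping `#` comment lines and rows whose first column is not an integer is a defensive assumption, not a documented property of the file. The sample content is made up.

```python
import io


def read_offsets(handle):
    """Parse a two-column, tab-delimited table of read length -> P-site offset."""
    table = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank and comment lines carry no data
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        try:
            table[int(fields[0])] = int(fields[1])
        except ValueError:
            continue  # assumption: skip header rows or non-numeric entries
    return table


# Made-up example content standing in for a real offsets file
sample = io.StringIO("# made-up example\n28\t12\n29\t13\n30\t13\n")
table = read_offsets(sample)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle.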
ROI file surrounding start codons, from
Basename for output files
show this help message and exit
Minimum counts required in normalization region to be included in metagene average (Default: 10)
--normalize_over N N
Portion of each window against which its individual raw count profile will be normalized. Specify two integers, in nucleotide distance from the landmark (negative for upstream, positive for downstream). Surround negative numbers with quotes. (Default: 20 50)
--norm_region N N
--normalize_over instead. Formerly, portion of each window against which its individual raw count profile will be normalized. Specify two integers, in nucleotide distance, from 5’ end of window. (Default: 70 100)
If supplied, the P-site offset is taken to be the distance between the largest peak upstream of the start codon and the start codon itself. Otherwise, the P-site offset is taken to be the distance between the largest peak in the entire ROI and the start codon. Ignored if
--constrain X X
Constrain the P-site offset to lie between the two specified distances from the start codon. Useful for noisy data. (Reasonable set: 10 15; default: not constrained)
Estimate P-site from aggregate reads at each position, instead of median normalized read density. Noisier, but helpful for lower-count data or read lengths with few counts. (Default: False)
Save intermediate count files. Useful for additional computations (Default: False)
Default 5’ P-site offset for read lengths that are not present or evaluated in the dataset. Unaffected by
Suppress all warning messages. Cannot use with ‘-v’.
Increase verbosity. With ‘-v’, show every warning. With ‘-vv’, turn warnings into exceptions. Cannot use with ‘-q’. (Default: show each type of warning once)
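How `--constrain` and the default offset might interact with the peak search can be sketched as below. This is an illustration under stated assumptions, not the script's code: the function name, the inclusive treatment of both bounds, the rule that an all-zero profile counts as "not evaluated", and the fallback value 13 used in the example are all hypothetical.

```python
def constrained_offset(profile, upstream_flank, default, constrain=None):
    """Pick a P-site offset, optionally limiting the search to a range of offsets.

    profile        : metagene values for one read length (5'-end-mapped reads)
    upstream_flank : hypothetical index of the start codon in the window
    default        : offset to report when the profile holds no counts
    constrain      : optional (min_offset, max_offset), treated as inclusive
    """
    if sum(profile) == 0:
        return default  # assumption: empty profile means "not evaluated"
    if constrain is None:
        lo, hi = 0, upstream_flank  # all positions upstream of the start codon
    else:
        min_off, max_off = sorted(constrain)
        lo = max(upstream_flank - max_off, 0)
        hi = min(upstream_flank - min_off + 1, len(profile))
    peak = max(range(lo, hi), key=lambda i: profile[i])
    return upstream_flank - peak


# Made-up noisy profile: start codon at index 20, spurious peak far upstream
noisy = [1.0] * 30
noisy[2] = 50.0   # noise, 18 nt upstream
noisy[8] = 30.0   # plausible start codon peak, 12 nt upstream
empty = [0.0] * 30

unconstrained = constrained_offset(noisy, 20, default=13)
constrained = constrained_offset(noisy, 20, default=13, constrain=(10, 15))
fallback = constrained_offset(empty, 20, default=13)
```

In this toy profile, the unconstrained search locks onto the spurious peak, while the constrained search recovers the peak inside the 10 to 15 nt range.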
Count & alignment file options
Open alignment or count files and optionally set mapping rules
--count_files COUNT_FILES [COUNT_FILES ...]
One or more count or alignment file(s) from a single sample or set of samples to be pooled.
Format of file containing alignments or counts (Default: BAM)
Sum used in normalization of counts and RPKM/RPNT calculations (Default: total mapped reads/counts in dataset)
Minimum read length required to be included (BAM & bowtie files only. Default: 25)
Maximum read length permitted to be included (BAM & bowtie files only. Default: 100)
File format for figure(s) (Default: png)
--figsize N N
Figure width and height, in inches. (Default: use matplotlibrc params)
Base title for plot(s).
Matplotlib color map from which palette will be made (e.g. ‘Blues’, ‘autumn’, ‘Set1’; default: use colors from
--stylesheet if given, or color cycle in matplotlibrc)
Figure resolution (Default: 150)
Use this matplotlib stylesheet instead of matplotlibrc params
do_count(roi_table, ga, norm_start, norm_end, min_counts, min_len, max_len, aggregate=False, printer=NullWriter())
Calculate a metagene profile for each read length in the dataset
- roi_table :
Table specifying regions of interest, generated by
- ga :
- norm_start : int
Coordinate in window specifying normalization region start
- norm_end : int
Coordinate in window specifying normalization region end
- min_counts : float
Minimum number of counts in window[norm_start:norm_end] required for inclusion in metagene profile
- min_len : int
Minimum read length to include
- max_len : int
Maximum read length to include
- aggregate : bool, optional
Estimate P-site from aggregate reads at each position, instead of median normalized read density. Potentially noisier, but helpful for lower-count data or read lengths with few counts. (Default: False)
- printer : file-like, optional
filehandle to write logging info to (Default:
numpy.ndarrays of raw counts at each position (column) for each window (row)
numpy.ndarrays of normalized counts at each position (column) for each window (row), normalized by the total number of counts in that row from norm_start to norm_end
Metagene profile of median normalized counts at each position across all windows, and the number of windows included in the calculation of each median, stratified by read length
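The normalization and median steps described in the return values can be sketched with NumPy. This is a simplified illustration for a single read length, not the body of `do_count`: the function name is hypothetical, the `>=` comparison against `min_counts` is an assumption, and a single window count is returned rather than one per position.

```python
import numpy as np


def metagene_from_counts(raw, norm_start, norm_end, min_counts):
    """Normalize each window (row) and take the per-position median.

    raw : 2D array of raw counts, one row per window, one column per position
    """
    raw = np.asarray(raw, dtype=float)
    # Total counts in each window's normalization region
    norm_totals = raw[:, norm_start:norm_end].sum(axis=1)
    # Assumption: windows with at least min_counts in the region are kept
    keep = norm_totals >= min_counts
    normed = raw[keep] / norm_totals[keep, None]
    if normed.shape[0] == 0:
        profile = np.full(raw.shape[1], np.nan)  # nothing passed the filter
    else:
        profile = np.median(normed, axis=0)
    return normed, profile, int(keep.sum())


# Made-up counts: three windows, four positions each
raw = [[4, 2, 2, 2],
       [0, 1, 0, 0],   # only 1 count in the normalization region: excluded
       [8, 6, 2, 2]]
normed, profile, n_windows = metagene_from_counts(raw, 1, 4, min_counts=5)
```

Taking the median of normalized rows, rather than summing raw counts, keeps a few highly expressed genes from dominating the profile. The `aggregate` option described above trades that robustness for sensitivity at low counts.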
main(argv=sys.argv[1:])
- argv : list, optional
A list of command-line arguments, which will be processed as if the script were called from the command line if
main() is called directly.
Default: sys.argv[1:]. The command-line arguments, if the script is invoked from the command line