Read phasing in ribosome profiling¶
During each cycle of peptide elongation, the ribosome takes a 3-nucleotide step along its mRNA. This physical process creates triplet periodicity in ribosome profiling data, which becomes visible when the read alignments are mapped to their P-site offsets ([IGNW09]):
Triplet periodicity or sub-codon phasing, is a useful feature, because it can be used to infer the reading frame(s) in which a coding region is decoded:
In this tutorial, we show how to measure sub-codon phasing in the following sections:
The examples below use the Demo dataset.
Note
Heavily biased nucleases – for example, micrococcal nuclease, used for ribosome profiling of E. coli ([OBS+11]) and D. melanogaster ([DFB+13]) – introduce uncertainty into P-site mapping, which obscures phasing in these datasets.
Here we are using ribosome profiling data prepared with RNase I
Examining phasing manually¶
In this section, we tabulate sub-codon phasing manually. First, we need to open the read alignments and transcript annotation:
>>> import numpy
>>> from plastid import Transcript, BED_Reader, BAMGenomeArray, FivePrimeMapFactory
>>> # retrieve an iterator over transcripts
>>> transcripts = BED_Reader("merlin_orfs.bed",return_type=Transcript)
>>> # open read alignments and map to P-sites
>>> alignments = BAMGenomeArray("SRR609197_riboprofile_5hr_rep1.bam")
>>> alignments.set_mapping(FivePrimeMapFactory(offset=14))
Ribosome-protected footprints of varying lengths exhibit variable phasing. So, we’ll look at this particular dataset’s most highly-phased population of reads, 33-mers. To do so, we’ll add a size filter:
>>> from plastid import SizeFilterFactory
>>> size_filter = SizeFilterFactory(min=33,max=34)
>>> alignments.add_filter("size",size_filter)
Next, we can count phasing:
# create a holder for phasing
>>> phasing = numpy.zeros(3)
# start codons are hyper-phased; stop codons can have differnt
# phasing or even be de-phased depending on experimental protocol
# so, we'll ignore 5 codons after the start, and 5 before the stop
>>> codon_buffer = 5*3
# count
>>> for my_transcript in transcripts:
>>> cds = my_transcript.get_cds()
>>> # if transcript is coding
>>> if len(cds) > 0:
>>> try:
>>>
>>> # get numpy.ndarray of counts in coding region
>>> counts = cds.get_counts(alignments)[codon_buffer:-codon_buffer]
>>>
>>> # reshape to Nx3, where N = number of codons
>>> counts = counts.reshape((len(counts)/3,3))
>>>
>>> # sum over codon positions to get a 3-vector,
>>> # and add to data holder
>>> phasing += counts.sum(0)
>>>
>>> except: # raise exception if coding region is not n*3 nucleotides long
>>> print("Length (%s nt) of CDS for `%s` contains partial codons. Frameshift?" % (len(counts),my_transcript.get_name()))
# compute fraction of phased reads
>>> phasing_proportions = phasing.astype(float) / phasing.sum()
>>> phasing_proportions
array([ 0.51042163, 0.29362327, 0.19595509])
Note
- Avoid double-counting
If the transcript annotation includes multiple transcript isoforms for the same gene, codons that appear in more than one isoform will be double-counted in the phasing estimate.
This may be avoided by instead estimating phasing over maximal spanning windows generated by the
metagene
script.- Dually-coding regions
If the annotation file contains overlapping coding regions which appear in different frames, including these in the phasing tabulation will under-estimate phasing. It makes sense to exclude such areas using a mask file.
Measuring phasing using the phase_by_size script¶
The phase_by_size
script automates the steps given above. It separately
calculates read phasing separately for read alignments of each length
between a user-defined minimum and maximum. Here, we’ll use the
--codon_buffer
argument to exclude the 5 first and last five codons of each
open reading frame, as these can have variable phasing.
To avoid double-counting, it is ideal to use an ROI file of
maximal spanning windows created by the
metagene
script. To create the ROI file:
# generate metagene `roi` file. See `metagene` documentation for details
$ metagene generate merlin_orfs_cds_start \
--landmark cds_start \
--annotation_files merlin_orfs.gtf
To use the ROI file, give its name as the first parameter:
# use ROI file `merlin_orfs_cds_start_rois.txt` from metagne run above
$ phase_by_size merlin_orfs_cds_start_rois.txt SRR609197 \
--count_files SRR609197_riboprofile_5hr_rep1.bam \
--fiveprime --offset 14 \
--codon_buffer 5 \
--min_length 25 --max_length 35
Alternatively, to use an annotation file, don’t include the ROI file in the
command call, and specify an annotation using --annotation_files
:
$ phase_by_size SRR609197 \
--count_files SRR609197_riboprofile_5hr_rep1.bam \
--annotation_files merlin_orfs.bed \
--annotation_format BED \
--fiveprime --offset 14 \
--codon_buffer 5 \
--min_length 25 --max_length 35
In either case, phase_by_size
will create a text file showing the proportion
of reads whose P-sites map to each codon position for each read length (columns
phase0, phase1, & phase0) and the proportion of total reads that
each read length represents (column fraction_reads_counted):
#read_length reads_counted fraction_reads_counted phase0 phase1 phase2
25 6511 0.009640 0.326832 0.327599 0.345569
26 9952 0.014735 0.385953 0.295217 0.318830
27 17636 0.026111 0.320934 0.282717 0.396348
28 42976 0.063629 0.251792 0.381794 0.366414
29 93754 0.138809 0.309309 0.370971 0.319720
30 148400 0.219716 0.318733 0.367635 0.313632
31 155684 0.230501 0.336624 0.421713 0.241663
32 118565 0.175543 0.445578 0.374141 0.180281
33 58761 0.087000 0.511121 0.299076 0.189803
34 18818 0.027861 0.508237 0.276597 0.215166
35 4360 0.006455 0.514450 0.236468 0.249083
Note
At present, phase_by_size
only supports read alignments
in BAM format.
See also¶
Determine P-site offsets for ribosome profiling data for a discussion on how to determine the P-site offsets to use for a given ribosome profiling dataset.
Read mapping functions for a discussion on how to apply P-site offsets or other mapping rules