Read phasing in ribosome profiling¶
During each cycle of peptide elongation, the ribosome takes a 3-nucleotide step along its mRNA. This physical process creates triplet periodicity in ribosome profiling data, which becomes visible when the read alignments are mapped to their P-site offsets ([IGNW09]):
In this tutorial, we show how to measure sub-codon phasing in the following sections:
The examples below use the Demo dataset.
Heavily biased nucleases – for example, micrococcal nuclease, used for ribosome profiling of E. coli ([OBS+11]) and D. melanogaster ([DFB+13]) – introduce uncertainty into P-site mapping, which obscures phasing in these datasets.
Here we are using ribosome profiling data prepared with RNase I
In this section, we tabulate sub-codon phasing manually. First, we need to open the read alignments and transcript annotation:
>>> import numpy >>> from plastid import Transcript, BED_Reader, BAMGenomeArray, FivePrimeMapFactory >>> # retrieve an iterator over transcripts >>> transcripts = BED_Reader("merlin_orfs.bed",return_type=Transcript) >>> # open read alignments and map to P-sites >>> alignments = BAMGenomeArray("SRR609197_riboprofile_5hr_rep1.bam") >>> alignments.set_mapping(FivePrimeMapFactory(offset=14))
Ribosome-protected footprints of varying lengths exhibit variable phasing. So, we’ll look at this particular dataset’s most highly-phased population of reads, 33-mers. To do so, we’ll add a size filter:
>>> from plastid import SizeFilterFactory >>> size_filter = SizeFilterFactory(min=33,max=34) >>> alignments.add_filter("size",size_filter)
Next, we can count phasing:
# create a holder for phasing >>> phasing = numpy.zeros(3) # start codons are hyper-phased; stop codons can have differnt # phasing or even be de-phased depending on experimental protocol # so, we'll ignore 5 codons after the start, and 5 before the stop >>> codon_buffer = 5*3 # count >>> for my_transcript in transcripts: >>> cds = my_transcript.get_cds() >>> # if transcript is coding >>> if len(cds) > 0: >>> try: >>> >>> # get numpy.ndarray of counts in coding region >>> counts = cds.get_counts(alignments)[codon_buffer:-codon_buffer] >>> >>> # reshape to Nx3, where N = number of codons >>> counts = counts.reshape((len(counts)/3,3)) >>> >>> # sum over codon positions to get a 3-vector, >>> # and add to data holder >>> phasing += counts.sum(0) >>> >>> except: # raise exception if coding region is not n*3 nucleotides long >>> print("Length (%s nt) of CDS for `%s` contains partial codons. Frameshift?" % (len(counts),my_transcript.get_name())) # compute fraction of phased reads >>> phasing_proportions = phasing.astype(float) / phasing.sum() >>> phasing_proportions array([ 0.51042163, 0.29362327, 0.19595509])
- Avoid double-counting
If the transcript annotation includes multiple transcript isoforms for the same gene, codons that appear in more than one isoform will be double-counted in the phasing estimate.
- Dually-coding regions
- If the annotation file contains overlapping coding regions which appear in different frames, including these in the phasing tabulation will under-estimate phasing. It makes sense to exclude such areas using a mask file.
phase_by_size script automates the steps given above. It separately
calculates read phasing separately for read alignments of each length
between a user-defined minimum and maximum. Here, we’ll use the
--codon_buffer argument to exclude the 5 first and last five codons of each
open reading frame, as these can have variable phasing.
# generate metagene `roi` file. See `metagene` documentation for details $ metagene generate merlin_orfs_cds_start \ --landmark cds_start \ --annotation_files merlin_orfs.gtf
To use the ROI file, give its name as the first parameter:
# use ROI file `merlin_orfs_cds_start_rois.txt` from metagne run above $ phase_by_size merlin_orfs_cds_start_rois.txt SRR609197 \ --count_files SRR609197_riboprofile_5hr_rep1.bam \ --fiveprime --offset 14 \ --codon_buffer 5 \ --min_length 25 --max_length 35
Alternatively, to use an annotation file, don’t include the ROI file in the
command call, and specify an annotation using
$ phase_by_size SRR609197 \ --count_files SRR609197_riboprofile_5hr_rep1.bam \ --annotation_files merlin_orfs.bed \ --annotation_format BED \ --fiveprime --offset 14 \ --codon_buffer 5 \ --min_length 25 --max_length 35
In either case,
phase_by_size will create a text file showing the proportion
of reads whose P-sites map to each codon position for each read length (columns
phase0, phase1, & phase0) and the proportion of total reads that
each read length represents (column fraction_reads_counted):
#read_length reads_counted fraction_reads_counted phase0 phase1 phase2 25 6511 0.009640 0.326832 0.327599 0.345569 26 9952 0.014735 0.385953 0.295217 0.318830 27 17636 0.026111 0.320934 0.282717 0.396348 28 42976 0.063629 0.251792 0.381794 0.366414 29 93754 0.138809 0.309309 0.370971 0.319720 30 148400 0.219716 0.318733 0.367635 0.313632 31 155684 0.230501 0.336624 0.421713 0.241663 32 118565 0.175543 0.445578 0.374141 0.180281 33 58761 0.087000 0.511121 0.299076 0.189803 34 18818 0.027861 0.508237 0.276597 0.215166 35 4360 0.006455 0.514450 0.236468 0.249083