Gene expression analysis

This tutorial illustrates how to measure read density over regions. As an example, we look at gene expression (in raw read counts and RPKM) using matched samples of RNA-seq and ribosome profiling data. However, the analysis below can apply to any type of high-throughput sequencing data (e.g. ClIP-seq, ChIP-seq, DMS-seq, et c).

This document aimed at beginners, and is broken into the following sections:

The examples below the Demo dataset.


For simplicity, we are:

  1. performing these analyses without using a mask file to exclude repetitive genome seqence from analysis. In practice, we would consider doing so. See Excluding (masking) regions of the genome for a discussion of the issues involved
  2. not excluding start and stop codons from read density measurements in ribosome profiling data. In practice, we would do so, because start and stop codon peaks, artificially inflate estimates of steady-state translation.

Tabulating read counts & RPKM

Via command-line scripts

plastid includes two scripts for measuring gene expression:

  • cs, which pre-processes a genome annotation and makes various heuristic corrections to gene boundaries (e.g. if genes overlap)
  • counts_in_region, which does not

The differences between the scripts are further explained in What are the differences between counts_in_region and cs?. Here we will use counts_in_region.

Our first dataset is ribosome profiling, and we will map the ribosomal P-site at 14 nucleotides from the 5’ end of each read (approximating [SGWM+12]). To specify this, we use the arguments --fiveprime --offset 14.

The data we want to count is in the file SRR609197_riboprofile_5hr_rep1.bam, which we pass via --count_files. The genes we are interested in counting in this example are on chromosome I, in the annotation file merlin_orfs.gtf. Finally, we will tell the script to save the output in riboprofile.txt.

Putting this together, the script is run from the terminal as:

$ counts_in_region riboprofile.txt --count_files SRR609197_riboprofile_5hr_rep1.bam \
                                   --annotation_files merlin_orfs.gtf \
                                   --fiveprime --offset 14

counts_in_region will create a tab-delimited text file called riboprofile.txt containing the results. The first few lines of the file look like this:

## total_dataset_counts: 500477
#region_name    region                  counts          counts_per_nucleotide   rpkm            length
ORFL1W_(RL1)    merlin:1316-2398(+)     1.14000000e+02  1.05360444e-01          2.10520051e+02  1082
ORFL2C          merlin:2401-2772(-)     1.00000000e+01  2.69541779e-02          5.38569762e+01  371
ORFL3C          merlin:2834-3064(-)     1.50000000e+01  6.52173913e-02          1.30310466e+02  230
ORFL4C          merlin:2929-3201(-)     1.40000000e+01  5.14705882e-02          1.02843064e+02  272
ORFL5C          merlin:4074-4307(-)     2.30000000e+01  9.87124464e-02          1.97236729e+02  233
ORFL6C          merlin:4078-4488(-)     6.10000000e+01  1.48780488e-01          2.97277373e+02  410
ORFL7C          merlin:4335-4739(-)     6.20000000e+01  1.53465347e-01          3.06638160e+02  404
[rest of output omitted]

For detailed documentation of the output and command-line arguments, see the module documentation for counts_in_region.


Gene expression – or, more broadly, read density over from any high-throughput sequencing experiment over any genomic region – can be calculated easily in an interactive Python session.

In this example, we separately caclulate read density over:

  • entire transcripts
  • 5’ UTRs
  • coding regions
  • 3’ UTRs

First, we need to import a few things:

>>> import copy
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> from plastid import *

First, open the read alignments, storing each dataset in a BAMGenomeArray:

>>> my_datasets = { "ribosome_profiling" : "SRR609197_riboprofile_5hr_rep1.bam",
>>>                 "RNA-seq"            : "SRR592963_rnaseq_5hr_rep1.bam",
>>>               }

>>> my_datasets = { K : BAMGenomeArray(V) for K,V in my_datasets.items() }

Next, we tell the BAMGenomeArrays which mapping rule to use. We will map the ribosome-protected footprints to their P-sites, which we estimate as 14 nucleotides from the 5’ end of each read:

>>> my_datasets["ribosome_profiling"].set_mapping(FivePrimeMapFactory(offset=14))

We will map the RNA-seq data along the entire length of each read alignment. Each position in each alignment will be attributed \(1.0 / \ell\), where \(\ell\) is the length of the read alignment. CenterMapFactory() can do this for us:

>>> my_datasets["RNA-seq"].set_mapping(CenterMapFactory())

Now, we need to create a place to hold our data. We’ll use dictionary of lists. The call to copy.deepcopy() on the empty list is necessary to prevent all of these dictionary keys from pointing to the same list, which is a weird side effect of the order in which things are evaluated inside comprehensions:

>>> # we will count gene sub-regions in addition to entire genes
>>> regions = ("exon","5UTR","CDS","3UTR")

>>> # we will calculate both total counts and RPKM
>>> metrics = ("counts","rpkm")

>>> # create an empty list for each sample, region, and metric
>>> my_data = { "%s_%s_%s" % (SAMPLE,REGION,METRIC) : copy.deepcopy([])\
>>>                                                   for SAMPLE in my_datasets.keys()\
>>>                                                   for REGION in regions\
>>>                                                   for METRIC in metrics }

>>> # add a list to our dictionary of lists to store transcript IDs
>>> my_data["transcript_id"] = []

>>> # add additional lists to store information about each region
>>> for region in regions:
>>>     my_data["%s_chain"  % region] = []  # SegmentChain representing region
>>>     my_data["%s_length" % region] = []  # Length of that SegmentChain, in nucleotides

Now that we have an empty dictionary of lists to hold our data, we’re ready to start making measurements. We’ll use nested for loops to count expression in the 5’ UTR, CDS, 3’UTR and total region (exon) of each transcript (note: this will run for a while; you might want to get some coffee):

>>> for transcript in BED_Reader("merlin_orfs.bed",return_type=Transcript):
>>>     # First, save ID of transcript we are evaluating
>>>     my_data["transcript_id"].append(transcript.get_name())

>>>     # Next, get transcript sub-regions, save them in a dict
>>>     # mapping region names to genomic regions (SegmentChains)
>>>     my_dict = { "exon" : transcript,
>>>                 "5UTR" : transcript.get_utr5(),
>>>                 "CDS"  : transcript.get_cds(),
>>>                 "3UTR" : transcript.get_utr3()
>>>                }

>>>     # Iterate over these sub-regions for each transcript
>>>     for region,subchain in my_dict.items():
>>>         # Save the length for each sub-region
>>>         my_data["%s_length" % region].append(subchain.length)
>>>         my_data["%s_chain"  % region].append(str(subchain))

>>>         # Iterate over each sample, getting the counts over each region
>>>         for sample_name, sample_data in my_datasets.items():
>>>             # subchain.get_counts() fetches a list of counts at each position
>>>             # here we just want the sum
>>>             counts = sum(subchain.get_counts(sample_data))
>>>             rpkm   = float(counts) / subchain.length * 1000 * 1e6 / sample_data.sum()
>>>             my_data["%s_%s_counts" % (sample_name,region)].append(counts)
>>>             my_data["%s_%s_rpkm"   % (sample_name,region)].append(rpkm)

Finally, we can save the calculated values to a file. It is easiest to do this by converting the dictionary of lists into a pandas.DataFrame:

>>> # convert to DataFrame, then save as tab-delimited text file
>>> df = pd.DataFrame(my_data)
>>> df.to_csv("gene_expression_demo.txt",sep="\t",index=False,header=True)

The text files may be re-loaded for further analysis, or plotted. For example, to plot the RPKM measurements for translation (ribosome profiling) and transcription (RNA-seq) against each other:

>>> my_figure = plt.figure()
>>> plt.loglog() # log-scaling makes it easier

>>> # make a copy of dataframe for plotting
>>> # this is because 0-values cannot be plotted in log-space,
>>> # so we set them to a pseudo value called `MIN_VAL`
>>> MIN_VAL = 1
>>> plot_df = copy.deepcopy(df)
>>> df["RNA-seq_exon_rpkm"][df["RNA-seq_exon_rpkm"] == 0] = MIN_VAL
>>> df["ribosome_profiling_CDS_rpkm"][df["ribosome_profiling_CDS_rpkm"] == 0] = MIN_VAL

>>> # now, make a scatter plot
>>> plt.scatter(plot_df["RNA-seq_exon_rpkm"],
>>>             plot_df["ribosome_profiling_CDS_rpkm"],
>>>             marker="o",alpha=0.5,facecolor="none",edgecolor="#007ADF")
>>> plt.xlabel("Transcript levels (RPKM of mRNA fragments over all exons)")
>>> plt.ylabel("Translation (RPKM of footprints over CDS)")


This produces the following plot:

Scatter plot of translation versus transcription levels

Translation versus transcription levels for each gene

Estimating translation efficiency

Translation efficiency is a measurement of how much protein is made from a single mRNA. Translation efficiency thus reports specifically on the translational control of gene expression.

Translation efficiency can be estimated by normalizing an mRNA’s translating ribosome density (in RPKM, as measured by ribosome profiling) by the mRNA’s abundance (in RPKM, measured by RNA-Seq) ([IGNW09]).

Making this estimate from the calculations above is simple:

>>> df["translation_efficiency"] = df["ribosome_profiling_CDS_rpkm"] / df["RNA-seq_exon_rpkm"]

Then, we can compare the effects of transcriptional and translational control:

>>> plt.loglog()
>>> plot_df = copy.deepcopy(df)
>>> plot_df["RNA-seq_exon_rpkm"][df["RNA-seq_exon_rpkm"] == 0] = MIN_VAL
>>> plot_df["translation_efficiency"][df["translation_efficiency"] == 0] = MIN_VAL

>>> # now, make a scatter plot
>>> plt.scatter(plot_df["RNA-seq_exon_rpkm"],
>>>             plot_df["translation_efficiency"],
>>>             marker="o",alpha=0.2,facecolor="none",edgecolor="#007ADF")
>>> plt.xlabel("Transcript levels (RPKM of mRNA fragments over all exons)")
>>> plt.ylabel("Translation efficiency")
>>> plt.xlim(1,plt.xlim()[1])
>>> plt.ylim(plt.ylim()[0]/10.0,100)


Testing for differential expression

This portion requires part two of the demo dataset. Download it and place it in the same folder as part 1 of the dataset.

RNA-seq, specifically

There are many strategies for significance testing of differential gene expression between multiple datasets, many of which (e.g. cufflinks and kallisto) are specifically developed for RNA-seq These packages don’t require plastid at all. For further information on them packages, see their documentation.

Any high-throughput sequencing experiment, including RNA-seq

For other experimental data types – e.g. ribosome profiling, DMS-seq, ChIP-Seq, ClIP-Seq, et c – the assumptions made by many packages specifically developed for RNA-seq analysis (e.g. even coverage across transcripts) do not hold.

In contrast, the R package DESeq2 ([AH10][AMC+13][LHA14]) offer a generally applicable statistical approach that is appropriate to virtually any count-based sequencing data.


The discussion below is heavily simplified and largely draws upon guidance in Analysing RNA-Seq data with the “DESeq2” package, hosted on the DESeq2 website.

Users are strongly encouraged to read the DESeq2 documentation for a fuller discussion of DESeq’s statistical models with additional examples.

As input, DESeq2 takes two tables and an equation:

  1. A table of uncorrected, unnormalized counts, in which:

    • each table row corresponds to a genomic region
    • each column corresponds to an experimental sample
    • the value in a each cell corresponds ot the number of counts in the corresponding genomic region and sample
  2. A sample design table describing the properties of each sample (e.g. if any are technical or biological replicates, or any treatments or conditions that differ between samples)

  3. A design equation, describing how the samples or treatments relate to one another

From these, DESeq2 separately model intrinsic counting error (Poisson noise) as well as additional inter-replicate error from biological or experimental variability.

From these error models DESeq2 can detect significant differences in count numbers between non-replicate samples, accounting for different sequencing depth between samples.

The count table

The first table may be constructed by running cs or counts_in_region on each biological sample to obtain counts. Here, we’ll use RNA-seq data:

$ counts_in_region rnaseq_5hr_rep1.txt --count_files  SRR592963_rnaseq_5hr_rep1.bam   --fiveprime --annotation_files merlin_orfs.bed --annotation_format BED
$ counts_in_region rnaseq_5hr_rep2.txt --count_files  SRR2064027_rnaseq_5hr_rep2.bam  --fiveprime --annotation_files merlin_orfs.bed --annotation_format BED
$ counts_in_region rnaseq_24hr_rep1.txt --count_files SRR592964_rnaseq_24hr_rep1.bam  --fiveprime --annotation_files merlin_orfs.bed --annotation_format BED
$ counts_in_region rnaseq_24hr_rep2.txt --count_files SRR2064029_rnaseq_24hr_rep2.bam --fiveprime --annotation_files merlin_orfs.bed --annotation_format BED

We combine the data into a single table in Python:

>>> import pandas as pd
>>> sample_names = ["rnaseq_5hr_rep1","rnaseq_5hr_rep2","rnaseq_24hr_rep1","rnaseq_24hr_rep2"]

>>> # load samples as DataFrames
>>> samples = { K : pd.read_table("%s.txt" % K,sep="\t",header=0,comment="#",index_col=None) for K in sample_names }

>>> # combine count columns to single DataFrame
>>> combined_df = samples["rnaseq_5hr_rep1"][["region_name"]]
>>> for k,v in samples.items():
>>>     combined_df[k] = v["counts"]

>>> combined_df.head()

>>> # save
>>> combined_df.to_csv("combined_counts.txt",sep="\t",header=True,index=False,
>>>                    columns=["region_name","rnaseq_5hr_rep1","rnaseq_5hr_rep2",
>>>                             "rnaseq_24hr_rep1","rnaseq_24hr_rep2"])

The design table

The second table (in this example, rnaseq_sample_table.txt) is a description of the experimental design. This can be created in any text editor and saved as a tab-delimited text file. In this example, the we have two conditions, representing timepoints at 5 and 24 hours post-infection:

sample_name          condition
rnaseq_5hr_rep1      5_hours
rnaseq_5hr_rep2      5_hours
rnaseq_24hr_rep1     24_hours
rnaseq_24hr_rep2     24_hours

The design equation is this case is very simple:

design = ~ condition

Putting it together

With the count table, design table, and equation ready, everything can be loaded into R:

> # workflow modified from DESeq2 vignette at
> #

> # import DESeq2
> library("DESeq2")

> # load RNA seq data into a data.frame
> # first line of file are colum headers
> # "region" column specifies a list of row names
> count_table <- read.delim("combined_counts.txt",
>                           sep="\t",
>                           header=TRUE,
>                           row.names="region_name")

> # load experiment design table
> sample_table <- read.delim("rnaseq_sample_table.txt",
>                            sep="\t",
>                            header=TRUE,
>                            row.names="sample_name")

> # note, design parameter below tells DESeq2 that the 'condition' column
> # distinguishes replicates from non-replicates
> dds <- DESeqDataSetFromMatrix(countData = count_table,
>                               colData = sample_table,
>                               design = ~ condition)

> # filter out rows with zero counts
> dds <- dds[rowSums(counts(dds)) > 1,]

> # set baseline for comparison to 5 hour timepoint
> dds$condition <- relevel(dds$condition,ref="5_hours")

> # run analysis
> dds <- DESeq(dds)
> res <- results(dds)
> # sort results ascending by adjusted p-value, column `padj`
> resOrdered <- res[order(res$padj),]

> # export sorted data to text file for further analysis
> write.table(,sep="\t",quote=FALSE,
>             file="5_vs_24hr_rnaseq_p_values.txt")

> # or, just select genes whose adjusted P-values meet significance level
> res[res$padj < 0.05,]

> # look at MA plot
> plotMA(res,main="DESeq2")

> # see DESeq2 vignette for further information

Differential translation efficiency

Tests for differential translation efficiency can also be implemented in DESeq2. The discussion below follows a reply from DESeq2 author Mike Love (source here). We’ll use the RNA-seq samples from above, plus some matched ribosome profiling.

Sample table

First, collect the count data for the ribosome profiling:

$ counts_in_region riboprofile_5hr_rep1.txt  --count_files SRR609197_riboprofile_5hr_rep1.bam   --fiveprime_variable --offset demo_p_offset.txt --annotation_files merlin_orfs.bed --annotation_format BED
$ counts_in_region riboprofile_5hr_rep2.txt  --count_files SRR2064020_riboprofile_5hr_rep2.bam  --fiveprime_variable --offset demo_p_offset.txt --annotation_files merlin_orfs.bed --annotation_format BED
$ counts_in_region riboprofile_24hr_rep1.txt --count_files SRR592954_riboprofile_24hr_rep1.bam  --fiveprime_variable --offset demo_p_offset.txt --annotation_files merlin_orfs.bed --annotation_format BED
$ counts_in_region riboprofile_24hr_rep2.txt --count_files SRR2064022_riboprofile_24hr_rep2.bam --fiveprime_variable --offset demo_p_offset.txt --annotation_files merlin_orfs.bed --annotation_format BED

Combine the data into a new table, as before:

>>> import pandas as pd
>>> sample_names = ["rnaseq_5hr_rep1",
>>>                 "rnaseq_5hr_rep2",
>>>                 "rnaseq_24hr_rep1",
>>>                 "rnaseq_24hr_rep2",
>>>                 "riboprofile_5hr_rep1",
>>>                 "riboprofile_5hr_rep2",
>>>                 "riboprofile_24hr_rep1",
>>>                 "riboprofile_24hr_rep2"
>>>                ]

>>> # load samples as DataFrames
>>> samples = { K : pd.read_table("%s.txt" % K,sep="\t",header=0,comment="#",index_col=None) for K in sample_names }

>>> # combine count columns to single DataFrame
>>> combined_df = samples["rnaseq_5hr_rep1"][["region_name"]]
>>> for k,v in samples.items():
>>>     combined_df[k] = v["counts"]

>>> # save
>>> combined_df.to_csv("te_combined_counts.txt",sep="\t",header=True,index=False,
>>>                            columns=["region_name","rnaseq_5hr_rep1","rnaseq_5hr_rep2",
>>>                                    "rnaseq_24hr_rep1","rnaseq_24hr_rep2",
>>>                                    "riboprofile_5hr_rep1","riboprofile_5hr_rep2",
>>>                                    "riboprofile_24hr_rep1","riboprofile_24hr_rep2"
>>>                                    ])

Design table

The design table now includes an additional column to indicate which assay was used for each sample:

sample_name                 time         assay
rnaseq_5hr_rep1             5_hours      rnaseq
rnaseq_5hr_rep2             5_hours      rnaseq
rnaseq_24hr_rep1            24_hours     rnaseq
rnaseq_24hr_rep2            24_hours     rnaseq
riboprofile_5hr_rep1        5_hours      riboprof
riboprofile_5hr_rep2        5_hours      riboprof
riboprofile_24hr_rep1       24_hours     riboprof
riboprofile_24hr_rep2       24_hours     riboprof

Similarly, the design equation now needs an interaction term to alert DESeq2 that we want to test whether the relationship between the two assays to differ as a function of timepoint:

design = ~ assay + time + assay:time

The analysis

This is similar to what we did above. In R:

> # load DESeq2
> library("DESeq2")

> # load RNA seq data into a data.frame
> # first line of file are colum headers
> # "region" column specifies a list of row names
> te_combined_data <- read.delim("te_combined_counts.txt",
>                                sep="\t",
>                                header=TRUE,
>                                row.names="region_name")
> te_sample_table <- read.delim("te_sample_table.txt",
>                               sep="\t",
>                               header=TRUE,
>                               row.names="sample_name")

> # set up the experiment
> # note the interaction term in the design below:
> dds <- DESeqDataSetFromMatrix(countData = te_combined_data,
>                               colData = te_sample_table,
>                               design = ~ assay + time + assay:time)

> # run the experiment
> dds <- estimateSizeFactors(dds)
> dds <- estimateDispersions(dds)
> # likelihood ratio test on samples
> # compare model with and without interaction term
> dds <- nbinomLRT(dds,
>                  full= ~ assay + time + assay:time,
>                  reduced= ~ assay + time )
> res <- DESeq(dds)
> res <- results(dds)
> summary(res)

> # Order results by P-value.
> # Column `padj` contains adjusted P-values for changes
> # in translation efficiency over time.
> resOrdered <- res[order(res$padj),]
> # export
> write.table(,
>             sep="\t",quote=FALSE,
>             file="te_change_over_time.txt")

> # sneak a peak
> head(resOrdered)


The table te_change_over_time.txt can then be further manipulated in R, Excel, or Python (using Pandas.read_table()).

See also