A simple alignment and quantitation workflow¶
This tutorial covers a rudimentary workflow for aligning reads and doing some QC on the dataset. It assumes that the genome has been set up, as described in Setting up a genome for analysis.
This tutorial is very much under consruction.
This document includes the following sections:
- Preflight checks
- Perform alignments
- Visualization in genome browsers
This stage is now frequently taken care of by sequencing facilities. But, if not, you may need to remove untrimmed adaptor sequences manually.
Good options for this include:
Exact options will depend upon the cloning strategies used in your library prep. Please refer to their documentation as needed.
A number of packages exist for checking the quality of the raw data themselves, and are useful, for example, for spotting a failed cycle of sequencing. The Illumina pipeline creates HTML reports for a number of useful metrics during basecalling. Your sequencing facility can provide these to you.
If not, FastQC produces useful plots of base-wise quality scores, read length distributions, and nucleotide distributions. Its documentation contains examples of what you should expect to see if things work out.
If working on a sample derived from RNA (RNA-seq, ribosome profiling, et c), it is frequently useful to computationally filter out fragments derived from rRNA. Because these are abundant – even after rRNA depletion or polyA selection – removing them saves processing time and disk space downstream.
bowtie is an excellent tool for filtering out rRNA-derived fragments. To use it, one must first obtain or construct a fasta file of rRNA sequences from the organism of interest. These may be obtained from GenBank. Once constructed (in this example, assume it is named species_rRNA.fa), build a bowtie index (in this example, called species_rRNA). From the terminal:
$ bowtie-build species_rRNA species_rRNA.fa
Then, align the data, discarding any reads mapping to the rRNA, and saving those that do not align to a separate file, here called my_reads_rRNA_unalign.fq:
$ bowtie -p4 -v3 species_rRNA my_reads.fastq \ --un my_reads_rRNA_unalign.fq >/dev/null
Reads that do not align (in my_reads_rRNA_unalign.fq), may then be used as input for additional stages of prealignment, or as input for chromosomal alignment.
Depending upon how you intend to analyze your data, it may be useful to separate out other classes of reads with additional stages of prealignment:
To remove expected sequence contaminants:
For example, flies eat yeast. But, often, if we’re performing an experiment on flies, we aren’t too concerned with which genes the yeast are expressing. In this case, it is useful to make a bowtie index of the yeast transcriptome, and filter those reads out, just as we did with the rRNA above.
To measure transposon expression:
Because transposons exist throughout the genome in multiple, quite similar copies, they produce multimapping reads.
If retrotransposon activity is of interest, one way to capture it is to make a database of consensus sequences of each type of transposon in your genome of interest, and to pre-align to it allowing a generous number of mismatches. Instead of discaring the reads (as we did with rRNA), retain the alignments, using them for expression quantitation.
As previously, take all the reads which did not align in the pre-alignment stages, and put them into chromosomal alignment, below.
Having filtered out reads derived from rRNA, we can align the remaining data to
the genome. In this example, we use Tophat, because it is capable of
performing spliced (and other gapped) alignments. We’ll use splice junctions
juncs file made in the genome setup tutorial.
# run tophat $ tophat -o my_chr_alignments \ --bowtie1 \ --raw-juncs splice_junctions.juncs \ --no-novel-juncs \ /path/to/chromosme/bowtie/index \ my_reads_rRNA_unalign.fq # rename output file $ mv my_chr_alignments/accepted_hits.bam chr_alignments.bam # index the BAM file for use in plastid, IGV, et c $ samtools index chr_alignments.bam
Selection of chromosomal alignment parameters is a complex topic, and beyond the scope of this tutorial.
For more details, see manual for Tophat (or your favorite aligner) and other forums dedicated to this purpose. It is important to choose alignment parameters tailored to your own needs.
To do so, make sure you have generated maximal spanning windows from
your genome annotation using the
(see the metagene tutorial and the
script documentation for details).
The output from
metagene generate can then be used for P-site estimation
and analysis of sub-codon phasing.
See Determine P-site offsets for ribosome profiling data for background and step-by-step instructions.
This is a large topic, and the details of how to do this depend upon your experiment. One workflow we like is to:
csto create a corrected genome annotation
csto count the number of reads aligning to each gene in each dataset
- Combine relevant columns the output from
csfrom each sample into a single master table
- Using the master table from (3), perform some clustering to make sure samples behave as expected. For example, replicates should cluster together, while differently treated samples might not. We don’t have a tutorial on this right now, but DESeq2 has a fabulous tutorial on this on its home page, in its vignettes. We recommend you check it out!
- Use DESeq2 to perform differential expression analysis and test for significance
Modern genome browsers can import BAM files of alignments directly. By default, they tend to plot individual alignments and a summary track of read coverage over each nucleotide.
Frequently it is useful to plot some aspect of the data, rather than raw read coverage, as a function of nucleotide position in the genome. For example, in ribosome profiling data, it is useful to plot the total number of ribosomes translating a particular nucleotide.
To export transformed data as a browser track in bedGraph or wiggle
formats, see the
make_wiggle script. If working on a large (e.g. plant or
metazoan) genome, it might be helpful to convert the output from
into to a BigWig file, using Jim Kent’s utilities from UCSC.