Setting up a genome for analysis

When on-boarding a new genome, it helps to do some pre-processing. Most of this involves downloading and converting files into formats that can be read more quickly. Some of it involves creating new files.

Details on why this is important can be found in Categories and formats of genomics data.

Fortunately, these steps only need to be performed when the genome annotation or sequence changes.

Note

This tutorial is under construction.

Note

The examples in this tutorial are sequential. You don’t need to do every step, but some steps require subsequent steps to be modified if they are skipped. Instructions for doing so are included in each step.

If you miss this information, you may get spurious results or frustrating errors.

This document includes the following sections:

Download some tools

In this tutorial, we use the following tools. They are useful to have on-hand in many circumstances:

Obtain a genome sequence & matching annotation

Choose a curator

For many organisms, multiple genome assemblies exist. Often, there are multiple versions, and also multiple curators. For example, Ensembl, UCSC, and RefSeq all host versions of the mouse and human genomes.

Because chromosomal sequences differ between assemblies, it is essential that the genome annotation (containing the locations of genes and transcripts) come from the same curator as assembly (sequences). It is also important that assembly and annotation verions match.

In other words, if using the current UCSC mouse genome assembly, be sure to use their annotation as well. Do not mix and match!

Sources for genome assemblies and annotations include:

  • SGD for S. cerevisiae S228C (reference strain), as well as many others (available in their downloads page)

  • FlyBase and modENCODE for D. melanogaster and other insects

  • ENCODE/GENCODE, Ensembl, UCSC, and RefSeq for human, mouse, and other model organsisms

  • NCBI Assembly and JGI for others (e.g. non-model organisms alternate strains, data hot off the press)

Download files

Once you have chosen a data source, you will need:

  • Genome sequence: this will generally be in fasta or twobit format. Occasionally, it will be included in a GFF3 file, in which case it will need to be extracted.

  • Genome annotation: this will appear in GTF2 (just transcripts and coding regions), GFF3 (all types of features- e.g. genes, origens of replication, microRNAs, et c), or BED format.

    BED format files are easy to work with, but usually contain transcripts divorced from their genes. We recommend grabbing the GTF2 or GFF3 file as well as a BED file if available. In many cases, we’ll re-make a new BED file using the info from the GTF2/GFF3 later.

Convert file formats

Extract genome sequence from GFF3

If the annotation and/or sequence is in GFF3 format, we made need to extract the genome sequence if a separate fasta file is not available. To make a fasta file from data embedded in a GFF3, open a terminal and type:

$ cat my_file.gff | awk 'BEGIN { doprint = 0}; \
                         doprint == 1 { print $0 }; \
                         $0 ~ /#FASTA/ { doprint = 1 }' >my_file.fa

Make an indexed sequence file

If you’re using a large genome and plan to do lots of sequence manipulation downstream, consider making either a twobit file or an indexed bgzf-compressed fasta file from the fasta file. BioPython reads bgzf-compressed fasta files natively, and twobitreader reads twobit files. plastid works with objects produced by both packages.

To compress the fasta file with bgzf, make sure to have htslib installed, then type from the terminal:

$ bgzip my_file.fa

To make a twobit file, make sure to have Jim Kent’s utilities installed, then type from the terminal:

$ faToTwoBit my_file.fa my_file.2bit

Sort your files

Because of the way features are stored in GFF3 and GTF2, we can get away with using far less memory if they are sorted. In general, it is good practice to sort annotation files by chromosome, starting coordinate, ending coordinate, and strand.

There are multiple ways to sort these file. One easy one is to use the sort utility. Follow the examples below:

# sort a GFF3 file
$ cat my_file.gff | grep -v "#" | sort -k1,1 -k4,4n >my_file_sorted.gff

# sort a GTF2 file - the same way!
$ cat my_file.gtf | grep -v "#" | sort -k1,1 -k4,4n >my_file_sorted.gtf

# sort a BED file
$ cat my_file.bed | grep -v "#" | -k1,1 -k2,2n >my_file_sorted.bed

When working with sorted files in plastid, use the --sorted command-line argument for command line scripts (as in the examples below), or the keyword argument is_sorted=True when using various readers interactively (for an example, see the documentation for GTF2_TranscriptAssembler).

Assemble transcripts from GFF3 or GTF2

GFF3 files include many genomic features aside from transcripts. It often helps to filter these out, and make a more compact, more standardized GTF2 or BED file.

That said, both GFF3 and GTF2 files are memory hogs, and difficult to use with large (e.g. mammal, ciliate, plant) genomes. BED files use far less memory, but in their native format fail to include useful information like parent-child relationships between genes and transcripts.

A useful compromise is to use an extended BED file or a BigBed file, which are still memory efficient but can preserve extra information like gene-to-transcript relationships. However, comparatively few tools support these formats (plastid does), while GFF3, GTF2 and BED are more or less universally supported.

So, here we provide information on how to make all of them. plastid includes a script called reformat_transcripts for this. If starting with a GTF2 file, replace --annotation_format GFF3 with --annotation_format GTF2 in the examples below. We assume your files are sorted by chromosome. If not, drop the --sorted argument from below, and go do laundry or something while things run.

To make a GTF2 file from a GFF3 file:

$ reformat_transcripts --annotation_files my_file.gff \
                       --annotation_format GFF3 \
                       --sorted \
                       --output_format GTF2 \
                       myfile.gtf2

To make a BED file:

$ reformat_transcripts --annotation_files my_file.gff \
                       --annotation_format GFF3 \
                       --sorted \
                       --output_format BED \
                       myfile.bed

To make an extended BED file that include extra columns for gene-transcript relationships and notes (assuming an attribute called Notes is defined in the GFF3/GTF2 file), add the --extra_columns argument:

$ reformat_transcripts --annotation_files my_file.gff \
                       --annotation_format GFF3 \
                       --output_format BED \
                       --sorted \
                       --extra_columns gene_id Notes\
                       -- \
                       myfile_extended.bed

To convert a BED or extended BED file to a BigBed file, use the bedToBigBed program from Jim Kent’s utilities. In this example, we’ll use the extended BED file.

To take advantage of the extra columns we output, bedToBigBed needs to know what they contain. Fortunately, the screen output from reformat_transcripts will contain a table declaration for this purpose. In this example, it reads:

table bigbed_columns "myfile_extended columns"
(
    string            chrom;          "Chromosome"
    uint              chromStart;     "chr start"
    uint              chromEnd;       "chr end"
    string            name;           "item name"
    uint              score;          "score"
    char[1]           strand;         "strand"
    uint              thickStart;     "thickstart"
    uint              thickEnd;       "thickend"
    uint              reserved;       "normally itemRgb"
    int               blockCount;     "block count"
    int[blockCount]   blockSizes;     "block sizes"
    int[blockCount]   chromStarts;    "block starts"
    string            gene_id;        "description of custom field contents"
    string            Notes;          "description of custom field contents"
)

Copy and paste that into a file called my_fields.as, editing the descriptions of gene_id and Notes as you see fit.

Next, we need to know the chromosome sizes so the BigBed file can be indexed. bedToBigBed expects a two-column tab-delimited text file where the columns are (chromosome name, chromosome size). If you have the genome sequence as a fasta file, create a file of chromosome sizes by entering the following in a Python terminal:

>>> from Bio import SeqIO
>>> genome = SeqIO.parse(open("my_genome_sequence.fa"),"fasta")

>>> outfile = open("mychroms.sizes","w")
>>> for my_seq in genome:
>>>     outfile.write(my_seq.id + "\t" + str(len(my_seq)) + "\n")
>>>

>>> outfile.close()

Finally, sort the extended BED file and run bedToBigBed. If you’re using the non-extended BED file, drop the arguments -type=12+2 --extraIndex=name,gene_id, -as=my_fields.as from the example. Type from the bash terminal:

$ sort -k1,1 -k2,2n myfile_extended.bed >myfile_sorted.bed
$ bedToBigBed -tab -type=bed12+2 \
              -extraIndex=name,gene_id \
              -as=my_fields.as \
              myfile_sorted.bed mychroms.sizes myfile.bb

The output file, myfile.bb, can be used in plastid and by other programs and genome browsers that support BigBed files. The use of -extraIndex (if using the extended BED file) is a bonus- it throws extra indices on the names and gene IDs of transcripts in the BigBed file, enabling Plastid to search the file efficiently by name. See plastid.readers.bigbed for examples.

Using any of these files with plastid’s command-line scripts is easy. Just be sure to set the --annotation_format argument to BED, BigBed, GTF2 or GFF3, depending upon which format you are using.

If using an extended BED file, set --annotation_format to BED, and use the --bed_extra_columns argument to specify the column names. For example, to convert the extended BED back to a GTF2:

$ reformat_transcripts --annotation_files myfile_extended.bed \
                       --annotation_format BED \
                       --bed_extra_columns gene_id Notes\
                       --sorted \
                       --output_format GTF2 \
                       myfile.gtf

Consider excluding transcripts supported by scant evidence

Curators of genome annotations require different thresholds of evidence when adding transcripts to the annotation.

Some curators are very conservative, and at the risk of exluding valid transcripts, require lots of verification to add a transcript model. Others are very inclusive, and include many potentially dubious trancripts in their annotations. Obviously, this is a tradeoff, and depending upon your needs, you may wish to be conservative or inclusive.

A discussion of whether or when to filter transcripts is beyond the scope of this tutorial. However, removing transcripts that have little or no biological evidence can improve processing times, in particular for vertebrate genomes.

A useful resource for this is the APPRIS database ([RME+13]), which uses multiple sources of evidence (e.g. conservation among vertebrates, presence of protein-coding domains, et c) to label transcript isoforms as principal, alternative, or minor, and within each of these categories, provides a reliability score.

One option, then, is to use APPRIS annotations to remove all isoforms considered minor, retaining those considered principal or alternative.

Then, re-sort, index, convert the resulting annotation file as described in the previous section.

Make files for aligners

bowtie indexes

bowtie and Tophat require pre-built genome indexes before alignment. These are built from fasta files of genome sequence. From the terminal:

$ bowtie-build my_file.fa my_genome_index

This will create six files, all beginning with my_genome_index.

.juncs file for Tophat

Tophat uses a custom file format (.juncs) to specify splice junctions. While Tophat can extract junctions from a GTF2 file, it is often convenient to have a pre-built .juncs file. Plastid includes a script called findjuncs for this. To use it, type from the terminal:

$ findjuncs --annotation_files my_file.gtf \
            --annotation_format GTF2 \
            --sorted \
            --export_tophat \
            my_junctions

This will create two files: my_junctions.juncs and my_junctions.bed, a BED format file containing the same splice junctions represented as exon pairs. findjuncs can perform a number of other useful pieces of information- see its documentation to see what it can do.

Make files for analysis in Plastid

If you plan to do your analysis using plastid’s tools, it is helpful to pre-compute a number of files, in the following order:

Crossmap

Repetitive elements and recently-duplicated genes often contain sequence that can multimap, or align equally well to multiple places in the genome. In a sequencing dataset, the origin of multimapping reads is often ambiguous, and it can be useful to exclude such reads, as well as the regions from the genome that give rise to them, from analysis. Plastid includes a tool called crossmap for this purpose. The algorithm is described in detail here.

To use crossmap, you need a sequence file, a bowtie index, and bowtie (not bowtie 2) installed on your machine. Follow the example below, but set the parameter -k to the expected size of your read alignments (or a little shorter, to be conservative):

$ crossmap -k 26 --mismatches 2 --sequence_file my_file.fa /path/to/bowtie/indexes/my_genome_index my_crossmap

Note

Mismatches

Enabling mismatches with short read sizes will make the crossmap build take a lot longer, because it dramatically increases the search space that bowtie needs to traverse.

Memory usage

Building a crossmap on a large (e.g. mouse, human, or plant) genome requires a lot of memory if the genome sequence is stored in a fasta file. Converting the fasta file to a `2bit`_ file will save substantial amounts of memory

Processes

Using multiple processes will speed crossmap’s execution time, but will also increase the memory footprint, because each process needs its own memory space

For further discusison of crossmap and examples of how to use its output, see Excluding (masking) regions of the genome.

As with all annotation files (and, a crossmap is an annotation), we recommend sorting and, for large genomes, converting the output from BED to BigBed format. Instructions for how to do this can be found in the screen output from crossmap, or above.

Examples below assume you have made a crossmap. If you decide not to, drop all of the --mask_annotation* arguments from the examples below.

Maximal spanning windows

plastid uses maximal spanning windows for metagene analysis, estimation of P-site offsets, and determination of sub-codon phasing for ribosome profiling data.

If you plan to do any of these analyses, complete these steps, use plastid’s metagene script to build maximal spanning windows from a genome annotation.

The algorithm for window generation is described in depth in the metagene tutorial. To generate the windows, type from the terminal:

$ metagene generate --annotation_files my_file.gtf --sorted \
           --mask_annotation_files my_crossmap.bb \
           --mask_annotation_format BigBed \
           --downstream 200 \
           my_windows

Corrected gene boundaries

Plastid includes tools for measuring gene expression. One of these, called cs, includes a generate mode, that creates a corrected genome annotation. This correct annotation differs from an off-the-shelf genome annotation in that it:

  • collapses genes whose transcripts share exact exons to “merged” genes

  • excludes regions of the genome that are overlapped by more than one merged gene

  • excludes regions flagged in a mask file or crossmap

  • classifies sub-regions of genes as CDS, 5’ UTR, or 3’ UTR based upon how they appear in that gene’s transcripts

The corrected annotation may then be used by cs or other tookits (within plastid or outside it), for gene expression measurement. For details, see the documentation for cs.

Sessions for genome browsing

We can’t stress enough the value of visual examination of your data in a genome browser. We like The Broad Institute’s IGV for this purpose. To create a session for your genome, follow the instructions for creating a .genome file, here.

Supply your custom GTF2, BED, or BigBed file as the gene file, and the exact genome sequence (fasta format) used above as the sequence file.

We also recommend loading a BigBed version of any crossmap file; it will help explain curious increases or decreases in read density that you might see when you load your alignments.