Creating custom BED, BigBed, and GTF2 annotation files

In this tutorial, we describe how to make custom genome annotations in BED, BigBed and GTF2 formats. These can then be used like any other annotation file, for example:

  • to view custom regions of interest in a genome browser
  • to perform analyses, using pipelines in plastid or other programs

Note

plastid can also make custom GFF3 files. For this see Working with GFF files

For a detailed comparison of the various annotation formats, see Which annotation format should I use?.

This document contains the following sections:

Workflow

Annotation files are tab-delimited formats, and could manually be made in a text editor or spreadsheet. However, an easier and potentially less error-prone process is to:

  1. Enter an interactive Python session
  2. Construct SegmentChains and/or Transcripts that represent the features of interest, using 0-indexed, half-open coordinates. This can be done completely de novo, or using other features as starting points.
  3. Use the as_bed() or as_gtf() methods to export the SegmentChains or Transcripts in the desired format. To convert the BED file to a BigBed file, see Making BigBed files.

plastid will automatically handle all coordinate conversions, splicing, text formatting, and attribute conversion.

Examples

Make a GTF2 for reporter construct

Suppose we made a cell line containing a reporter construct, and that we want to measure the expression level of this reporter along with the expression levels of all other genes in an RNA-seq experiment. Let’s call the vector lenti223 and suppose it contains a GFP and an mCherry, each driven by some promoter.

An analysis pipeline for this experiment might look the following:

  1. Create a custom BED or GTF2 file describing the reporter construct(s)
  2. Combine the custom annotation file with an annotation from the host genome
  3. Download the appropriate genome sequence, and add the vector sequence as another contig
  4. Align sequencing data to the combined genome
  5. Analyze sequencing data using custom annotation file.

Here, we’ll focus on just making the custom GTF2 file. Interactively we’ll represent the reporter transcripts as Transcripts and define coordinates manually:

>>> from plastid import GenomicSegment, SegmentChain, Transcript

# GFP transcript, containing 100 bp of 5' UTR and 150 bp of 3' UTR
# 714bp coding region from bases 945-1659
>>> gfp = Transcript(GenomicSegment("lenti223",845,1809,"+"),ID="sfGFP",cds_genome_start=945,cds_genome_end=1659)

# mCherry transcript, similarly constructed
>>> rfp = Transcript(GenomicSegment("lenti223",2100,3061,"+"),ID="mCherry",cds_genome_start=2200,cds_genome_end=2911)

# now, write out features
# we could have made a BED file using as_bed() in place of as_gtf()
>>> with open("custom.gtf","w") as fout:
>>>     fout.write(gfp.as_gtf())
>>>     fout.write(rfp.as_gtf())
>>>     fout.close()

The file custom.gtf should look something like this:

lenti223    .    exon           846     1809    .    +    .    gene_id "gene_sfGFP"; transcript_id "sfGFP"; ID "sfGFP";
lenti223    .    CDS            946     1656    .    +    0    gene_id "gene_sfGFP"; transcript_id "sfGFP"; ID "sfGFP";
lenti223    .    start_codon    946     948     .    +    .    gene_id "gene_sfGFP"; transcript_id "sfGFP"; cds_start "100"; cds_end "814"; ID "sfGFP";
lenti223    .    stop_codon     1657    1659    .    +    .    gene_id "gene_sfGFP"; transcript_id "sfGFP"; cds_start "100"; cds_end "814"; ID "sfGFP";
lenti223    .    exon           2101    3061    .    +    .    gene_id "gene_mCherry"; transcript_id "mCherry"; ID "mCherry";
lenti223    .    CDS            2201    2908    .    +    0    gene_id "gene_mCherry"; transcript_id "mCherry"; ID "mCherry";
lenti223    .    start_codon    2201    2203    .    +    .    gene_id "gene_mCherry"; transcript_id "mCherry"; cds_start "100"; cds_end "811"; ID "mCherry";
lenti223    .    stop_codon     2909    2911    .    +    .    gene_id "gene_mCherry"; transcript_id "mCherry"; cds_start "100"; cds_end "811"; ID "mCherry";

Make a BED file containing halves of each coding region

Manually entering coordinates is laborious. More frequently, novel annotations are derived from existing ones. Let’s suppose we’d like to make a BED file containing halves of coding regions. For this we’ll use the demo dataset.

We’ll load the transcripts, create new SegmentChains from those, and save them:

>>> from plastid import BED_Reader

# read transcripts
>>> reader = BED_Reader("some_file.bed")

# open file for writing
>>> halfbed = open("cds_halves.bed","w")

>>> for transcript in reader:
>>>     cds = transcript.get_cds()
>>>
>>>     # make sure transcript is coding before divide CDS
>>>     if cds.length > 0:
>>>         name = transcript.get_name()
>>>         halflength = cds.length // 2
>>>
>>>         # get halves, name each half after the parent CDS
>>>         first_half  = cds.get_subchain(0,halflength,ID=name + "_firsthalf")
>>>         second_half = cds.get_subchain(halflength,cds.length,ID=name + "_secondhalf")
>>>
>>>         # save output
>>>         halfbed.write(first_half.as_bed())
>>>         halfbed.write(second_half.as_bed())

# close file
>>> halfbed.close()

Making custom extended BED files

Extended BED and BigBed files can contain extra columns, such as a gene ID. This can be extremely useful.

To export attributes of a SegmentChain or Transcript as extra columns in a extended BED format, pass a list of the attribute names (from the dictionary attr) to the extra_columns keyword of SegmentChain.as_bed. Attributes will be exported in the order they appear in extra_columns, and will be given an empty value of “” when they are not defined

>>> attr = { "ID" : "some feature ID",
             "extra_field_1" : 542,
             "extra_field_2" : "some extra field",
           }

>>> my_chain = Transcript(GenomicSegment("chrA",100,150,"+"),
                          GenomicSegment("chrA",500,550,"+"),
                          **attr)
>>> print(my_chain.as_bed()
chrA    100    550    some feature ID    0    +    100    100    0,0,0    2    50,50,    0,400,

>>> print(my_chain.as_bed(extra_columns=["extra_field_1","extra_field_2"]))
chrA    100    550    some feature ID    0    +    100    100    0,0,0    2    50,50,    0,400,    542    some extra field

If an attribute is not defined, the column will be left empty “”:

>>> print(my_chain.as_bed(extra_columns=["extra_field_1","nonexistent_field","extra_field_2"]))
chrA    100    550    some feature ID    0    +    100    100    0,0,0    2    50,50,    0,400,    542        some extra field

Making BigBed files

BigBed files are easily made from BED or Extended BED files using Jim Kent’s utilities. To make a BigBed file:

  1. Create a custom BED or extended BED file. For extended BED files, consider making an optional autoSql description of the names & data types of the extra columns. This will allow parsers to convert these to native types when reading the BigBed file.

  2. Sort the BED file by chromosome and start position. This is easily done in a terminal session:

    $ sort -k1,1n -k2,2n my_annotation.bed >my_annotation_sorted.bed
    
  3. Download and install Jim Kent’s utilities, which include the bedToBigBed program.

  4. Obtain a chromosome/contig .sizes file. If using genome builds from UCSC, these can be downloaded using the fetchChromSizes program included with Jim Kent’s utilities. For example:

    $ fetchChromSizes hg38 >>hg38.sizes
    
  5. Run bedToBigBed. From the terminal:

    $ bedToBigBed my_annotation_sorted.bed my_genome.sizes my_annotation.bb
    

    Your annotation will be saved as my_annotation.bb.


See also