Creating custom BED, BigBed, and GTF2 annotation files¶
In this tutorial, we describe how to make custom genome annotations in BED, BigBed and GTF2 formats. These can then be used like any other annotation file, for example:
to view custom regions of interest in a genome browser
to perform analyses, using pipelines in
plastid
or other programs
Note
plastid
can also make custom GFF3 files. For this see Working with GFF files
For a detailed comparison of the various annotation formats, see Which annotation format should I use?.
This document contains the following sections:
Workflow¶
Annotation files are tab-delimited formats, and could manually be made in a text editor or spreadsheet. However, an easier and potentially less error-prone process is to:
Enter an interactive Python session
Construct
SegmentChains
and/orTranscripts
that represent the features of interest, using 0-indexed, half-open coordinates. This can be done completely de novo, or using other features as starting points.Use the
as_bed()
oras_gtf()
methods to export theSegmentChains
orTranscripts
in the desired format. To convert the BED file to a BigBed file, see Making BigBed files.
plastid
will automatically handle all coordinate conversions, splicing,
text formatting, and attribute conversion.
Examples¶
Make a GTF2 for reporter construct¶
Suppose we made a cell line containing a reporter construct, and that we want to measure the expression level of this reporter along with the expression levels of all other genes in an RNA-seq experiment. Let’s call the vector lenti223 and suppose it contains a GFP and an mCherry, each driven by some promoter.
An analysis pipeline for this experiment might look the following:
Create a custom BED or GTF2 file describing the reporter construct(s)
Combine the custom annotation file with an annotation from the host genome
Download the appropriate genome sequence, and add the vector sequence as another contig
Align sequencing data to the combined genome
Analyze sequencing data using custom annotation file.
Here, we’ll focus on just making the custom GTF2 file. Interactively we’ll
represent the reporter transcripts as Transcripts
and define coordinates
manually:
>>> from plastid import GenomicSegment, SegmentChain, Transcript
# GFP transcript, containing 100 bp of 5' UTR and 150 bp of 3' UTR
# 714bp coding region from bases 945-1659
>>> gfp = Transcript(GenomicSegment("lenti223",845,1809,"+"),ID="sfGFP",cds_genome_start=945,cds_genome_end=1659)
# mCherry transcript, similarly constructed
>>> rfp = Transcript(GenomicSegment("lenti223",2100,3061,"+"),ID="mCherry",cds_genome_start=2200,cds_genome_end=2911)
# now, write out features
# we could have made a BED file using as_bed() in place of as_gtf()
>>> with open("custom.gtf","w") as fout:
>>> fout.write(gfp.as_gtf())
>>> fout.write(rfp.as_gtf())
>>> fout.close()
The file custom.gtf
should look something like this:
lenti223 . exon 846 1809 . + . gene_id "gene_sfGFP"; transcript_id "sfGFP"; ID "sfGFP";
lenti223 . CDS 946 1656 . + 0 gene_id "gene_sfGFP"; transcript_id "sfGFP"; ID "sfGFP";
lenti223 . start_codon 946 948 . + . gene_id "gene_sfGFP"; transcript_id "sfGFP"; cds_start "100"; cds_end "814"; ID "sfGFP";
lenti223 . stop_codon 1657 1659 . + . gene_id "gene_sfGFP"; transcript_id "sfGFP"; cds_start "100"; cds_end "814"; ID "sfGFP";
lenti223 . exon 2101 3061 . + . gene_id "gene_mCherry"; transcript_id "mCherry"; ID "mCherry";
lenti223 . CDS 2201 2908 . + 0 gene_id "gene_mCherry"; transcript_id "mCherry"; ID "mCherry";
lenti223 . start_codon 2201 2203 . + . gene_id "gene_mCherry"; transcript_id "mCherry"; cds_start "100"; cds_end "811"; ID "mCherry";
lenti223 . stop_codon 2909 2911 . + . gene_id "gene_mCherry"; transcript_id "mCherry"; cds_start "100"; cds_end "811"; ID "mCherry";
Make a BED file containing halves of each coding region¶
Manually entering coordinates is laborious. More frequently, novel annotations are derived from existing ones. Let’s suppose we’d like to make a BED file containing halves of coding regions. For this we’ll use the demo dataset.
We’ll load the transcripts, create new SegmentChains
from those, and save
them:
>>> from plastid import BED_Reader
# read transcripts
>>> reader = BED_Reader("some_file.bed")
# open file for writing
>>> halfbed = open("cds_halves.bed","w")
>>> for transcript in reader:
>>> cds = transcript.get_cds()
>>>
>>> # make sure transcript is coding before divide CDS
>>> if cds.length > 0:
>>> name = transcript.get_name()
>>> halflength = cds.length // 2
>>>
>>> # get halves, name each half after the parent CDS
>>> first_half = cds.get_subchain(0,halflength,ID=name + "_firsthalf")
>>> second_half = cds.get_subchain(halflength,cds.length,ID=name + "_secondhalf")
>>>
>>> # save output
>>> halfbed.write(first_half.as_bed())
>>> halfbed.write(second_half.as_bed())
# close file
>>> halfbed.close()
Making custom extended BED files¶
Extended BED and BigBed files can contain extra columns, such as a gene ID. This can be extremely useful.
To export attributes of a SegmentChain
or Transcript
as extra columns
in a extended BED format, pass a list of the attribute names (from
the dictionary attr) to the extra_columns keyword of
SegmentChain.as_bed
. Attributes will be
exported in the order they appear in extra_columns, and will be given an empty
value of “” when they are not defined
>>> attr = { "ID" : "some feature ID",
"extra_field_1" : 542,
"extra_field_2" : "some extra field",
}
>>> my_chain = Transcript(GenomicSegment("chrA",100,150,"+"),
GenomicSegment("chrA",500,550,"+"),
**attr)
>>> print(my_chain.as_bed()
chrA 100 550 some feature ID 0 + 100 100 0,0,0 2 50,50, 0,400,
>>> print(my_chain.as_bed(extra_columns=["extra_field_1","extra_field_2"]))
chrA 100 550 some feature ID 0 + 100 100 0,0,0 2 50,50, 0,400, 542 some extra field
If an attribute is not defined, the column will be left empty “”:
>>> print(my_chain.as_bed(extra_columns=["extra_field_1","nonexistent_field","extra_field_2"]))
chrA 100 550 some feature ID 0 + 100 100 0,0,0 2 50,50, 0,400, 542 some extra field
Making BigBed files¶
BigBed files are easily made from BED or Extended BED files using Jim Kent’s utilities. To make a BigBed file:
Create a custom BED or extended BED file. For extended BED files, consider making an optional autoSql description of the names & data types of the extra columns. This will allow parsers to convert these to native types when reading the BigBed file.
Sort the BED file by chromosome and start position. This is easily done in a terminal session:
$ sort -k1,1n -k2,2n my_annotation.bed >my_annotation_sorted.bed
Download and install Jim Kent’s utilities, which include the
bedToBigBed
program.Obtain a chromosome/contig
.sizes
file. If using genome builds from UCSC, these can be downloaded using thefetchChromSizes
program included with Jim Kent’s utilities. For example:$ fetchChromSizes hg38 >>hg38.sizes
Run
bedToBigBed
. From the terminal:$ bedToBigBed my_annotation_sorted.bed my_genome.sizes my_annotation.bb
Your annotation will be saved as
my_annotation.bb
.
See also¶
Which annotation format should I use? for a brief overview of the costs & benefits of BED, BigBed, GTF2 and GFF3 files.
Working with GFF files for information on making GFF3 files
SegmentChain
andTranscript
for details on these classesThe UCSC file format FAQ for details on file formats and further discussion of their capabilities, advantages, and disadvantages
The GFF3 specification for details on GFF3 files
Coordinate systems used in genomics for information on genomic coordinates
Sequence Ontology (SO) v2.53, for a description of a common GFF3 feature ontology
SO releases, for the current SO consortium release.
Jim Kent’s utilities for more info on making BigBed files.