Manipulating feature sequences¶
In this tutorial we will fetch the sequences of genomic features, like transcripts, in the following sections:
Get the sequence of a feature or transcript¶
SegmentChains
and Transcripts
can fetch their own genomic sequence via
their get_sequence()
method.
As an argument, the method expects a dictionary-like object of string-like objects, where the dictionary keys correspond to chromosome names, and the strings correspond to the forward-strand sequence. For example:
>>> little_genome = { "chrA" : "NNNNNNNNNNTCTAGACGATACANNNNNNNNNNCTACGATA" }
Sequences of multi-segment features are returned fully-spliced:
>>> from plastid import GenomicSegment, SegmentChain
>>> little_chain = SegmentChain(GenomicSegment("chrA",10,23,"+"),
>>> GenomicSegment("chrA",33,41,"+"),
>>> ID="little_chain")
>>> little_chain.get_sequence(little_genome)
'TCTAGACGATACACTACGATA'
For reverse-strand features, sequences are automatically reverse-complemented:
>>> antisense_chain = little_chain.get_antisense()
>>> antisense_chain
<SegmentChain segments=2 bounds=chrA:10-41(-) name=chrA:10-23^33-41(-)>
>>> antisense_chain.get_sequence(little_genome)
'TATCGTAGTGTATCGTCTAGA'
For convenience, SegmentChain
and Transcript
also implement a method called
get_fasta()
, which formats output
in FASTA format:
>>> print(little_chain.get_fasta(little_genome))
>little_chain
TCTAGACGATACACTACGATA
Supported sequence file formats¶
Genomic sequence can be stored in a number of formats. Two of the most common are:
plastid
doesn’t implement any of its own sequence file readers, but it is compatible
with any parser that represents sequences as a dictionary-like object of string-like
objects.
This allows users to use implementations in Biopython (for FASTA, EMBL, & GenBank formats) and twobitreader (for 2bit) files. For example, to load a FASTA file using Biopython:
>>> # load CMV genome from test dataset
>>> from Bio import SeqIO
>>> genome = SeqIO.to_dict(SeqIO.parse(open("merlin_NC006273-2.fa"),"fasta"))
>>> genome
{'merlin': SeqRecord(seq=Seq('CCATTCCGGGCCGTGTGCTGGGTCCCCGAGGGGCGGGGGGGTGTTTTCTGCGGG...GCT', SingleLetterAlphabet()), id='merlin', name='merlin', description='merlin gi|155573622|ref|NC_006273.2| Human herpesvirus 5 strain Merlin, complete genome', dbxrefs=[])}
In Biopython, nucleic acids are represented as SeqRecord objects
rather than strings. SegmentChain
and Transcript
don’t mind:
>>> # load CMV annotations from test dataset
>>> from plastid import Transcript, BED_Reader
>>> transcripts = list(BED_Reader(open("merlin_orfs.bed"),return_type=Transcript))
>>> transcripts[0]
<Transcript segments=1 bounds=merlin:1316-2398(+) name=ORFL1W_(RL1)>
>>> transcripts[0].get_sequence(genome)[:200]
'GCTCGCCTATTTAACCTCCACCCACTACAACACACACATGCCGCACAATCATGCCAGCCACAGACACAAACAGCACCCACACCACGCCGCTTCACCCAGAGGACCAACACACGTTACCCTTACACCACAGCACCACACAACCTCATGTCCAAACTTCGGACAAACACGCCGACAAACAACACCGCACGCAGATGGAGCTC'
Similarly, TwoBitFile objects from twobitreader can be directly passed
to get_sequence()
, because they
inherit from dict
and return strings:
>>> # load CMV genome as a 2bit file
>>> from twobitreader import TwoBitFile
>>> twobit_genome = TwoBitFile("merlin_NC006273-2.2bit")
>>> twobit_genome.keys()
['merlin']
>>> transcripts[0].get_sequence(twobit_genome)[:200]
'GCTCGCCTATTTAACCTCCACCCACTACAACACACACATGCCGCACAATCATGCCAGCCACAGACACAAACAGCACCCACACCACGCCGCTTCACCCAGAGGACCAACACACGTTACCCTTACACCACAGCACCACACAACCTCATGTCCAAACTTCGGACAAACACGCCGACAAACAACACCGCACGCAGATGGAGCTC'
Manipulating sequence¶
Tools for further manipulating sequence (e.g. reverse-complementing, translating) are supplied in Biopython’s Seq and SeqRecord objects:
>>> # SeqRecord examples
>>> from Bio.Seq import Seq
>>> seq = Seq(transcripts[0].get_cds().get_sequence(genome))
>>> seq.translate()
Seq('MPATDTNSTHTTPLHPEDQHTLPLHHSTTQPHVQTSDKHADKQHRTQMELDAAD...PW*', HasStopCodon(ExtendedIUPACProtein(), '*'))
Fuller explanations and further examples can be found in the Biopython documentation for Seq and SeqRecord.
See also¶
Biopython documentation for manipulation of nucleic acid sequences.
twobitreader documentation
UCSC file format FAQ for details on sequence file formats