plastid.readers.bigbed module

BigBedReader, a parser for BigBed files.

Summary

In contrast to BED, GTF2, and GFF3 files, BigBed files are binary, indexed, and randomly accessible. This means:

  • BigBedReader can be used to iterate over records, like a reader, or to fetch records that cover a region of interest, in the manner of a GenomeHash.
  • BigBed files use less memory, because their records don’t all need to be loaded into memory to be parsed or accessed.
  • Indexed BigBed files can be searched for matching records.

Module Contents

BigBedReader(filename, …) Reader for BigBed files.
BigBedIterator(BigBedReader reader[, maxmem]) Iterate over records in the BigBed file, sorted lexically by chromosome and position.

Examples

Iterate over all features in a BigBed file:

>>> my_reader = BigBedReader("some_file.bb",return_type=Transcript)
>>> for feature in my_reader:
...     pass # do something with each Transcript

BigBed files can be accessed as dictionaries. To find features overlapping a region of interest:

>>> roi = GenomicSegment("chrI",0,100000,"+")
>>> overlapping_features = my_reader[roi]
>>> list(overlapping_features)
[ list of SegmentChains/Transcripts ]

Find features that match keyword(s) in a certain field:

>>> # which fields are indexed and searchable?
>>> my_reader.indexed_fields
['name', 'gene_id']

>>> # find all entries whose 'gene_id' matches 'nanos'
>>> list(my_reader.search('gene_id','nanos'))
[ list of matching SegmentChains/Transcripts ]

See also

Kent2010
Description of BigBed and BigWig formats. Especially see supplemental data.
UCSC file format FAQ
Descriptions of BED, GTF2, GFF3 and other text-based formats.
class plastid.readers.bigbed.BigBedReader(filename, return_type = SegmentChain, add_three_for_stop = False, maxmem = 0)

Bases: plastid.readers.bbifile._BBI_Reader

Reader for BigBed files. This class is useful both for iterating over genomic features one by one (like a reader) and for random access to genomic features that overlap a region of interest (like a GenomeHash).

Parameters:
filename : str

Path to BigBed file

return_type : SegmentChain or subclass, optional

Type of feature to return from assembled subfeatures (Default: SegmentChain)

add_three_for_stop : bool, optional

Some annotation files exclude the stop codon from CDS annotations. If set to True, three nucleotides will be added to the three-prime end of each CDS annotation, unless the annotated transcript contains an explicit stop_codon feature. (Default: False)

maxmem : float, optional

Maximum desired memory footprint for C objects, in megabytes. May be temporarily exceeded if large queries are requested. Does not include memory footprint of Python objects. (Default: 0, no limit)

Examples

Iterate over all features in a BigBed file:

>>> my_reader = BigBedReader("some_file.bb")
>>> for feature in my_reader:
...     pass # do something with each feature

BigBed files can be accessed as dictionaries. To find features overlapping a region of interest:

>>> roi = GenomicSegment("chrI",0,100000,"+")
>>> for feature in my_reader[roi]:
...     pass # do something with that feature

Find features overlapping a genomic region of interest roi, on either strand:

>>> for feature in my_reader.get(roi,stranded=False):
...     pass # do something with that feature
Attributes:
extension_fields : OrderedDict

Dictionary of names and types of extra fields included in BigWig/BigBed file

extension_types : OrderedDict

Dictionary mapping custom field names to objects that parse their types from strings

filename : str

Name of BigWig or BigBed file

num_records : int

Number of features in file

num_chroms : int

Number of chromosomes in the BigBed file

chroms : dict

Dictionary mapping chromosome names to lengths

return_type : class implementing a from_bed() method, or str

Return type of reader

Methods

get(self, roi, bool stranded=True, …) Iterate over features that share genomic positions with a region of interest
search(self, field_name, *values) Search indexed fields in the BigBed file for records matching one or more values. See self.indexed_fields for names of indexed fields and self.extension_fields for descriptions of extension fields.
get(self, roi, bool stranded=True, bool check_unique=True)

Iterate over features that share genomic positions with a region of interest

Parameters:
roi : SegmentChain or GenomicSegment

Query feature representing region of interest

stranded : bool, optional

If True, retrieve only features on same strand as query feature. Otherwise, retrieve features on both strands. (Default: True)

check_unique: bool, optional

If True, ensure that all results in the generator are unique. (Default: True)

Yields:
object

self.return_type of each record in the BigBed file

Raises:
TypeError

If roi is not a GenomicSegment or SegmentChain
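
To illustrate the effect of the stranded parameter, the sketch below counts the features returned by a stranded and an unstranded query for the same region. The helper name count_overlaps is hypothetical and not part of plastid; it assumes only that reader exposes the get() method with the signature documented above.

```python
def count_overlaps(reader, roi):
    """Count features overlapping `roi`, with and without strand filtering.

    Hypothetical helper, not part of plastid. `reader` is assumed to be an
    open BigBedReader (or any object with a compatible `get` method), and
    `roi` a GenomicSegment or SegmentChain.
    """
    # get() yields features lazily, so count them as they arrive
    same_strand = sum(1 for _ in reader.get(roi, stranded=True))
    either_strand = sum(1 for _ in reader.get(roi, stranded=False))
    return {"same_strand": same_strand, "either_strand": either_strand}
```

Since an unstranded query retrieves features on both strands, the "either_strand" count should be at least as large as the "same_strand" count.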

search(self, field_name, *values)

Search indexed fields in the BigBed file for records matching one or more values. See self.indexed_fields for names of indexed fields and self.extension_fields for descriptions of extension fields.

Parameters:
field_name : str

Name of field to search

*values : one or more str

Value(s) to match. If multiple are given, records matching any value will be returned.

Yields:
object

self.return_type of matching record in the BigBed file

Raises:
IndexError

If field field_name is not indexed
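
Because search() raises IndexError for a field that is not indexed, a caller may prefer to check indexed_fields first. The following is a minimal sketch rather than part of plastid: the helper name search_if_indexed is hypothetical, and it relies only on the indexed_fields attribute and the search() method documented here.

```python
def search_if_indexed(reader, field_name, *values):
    """Return a list of records matching any of `values` in `field_name`.

    Hypothetical helper, not part of plastid. `reader` is assumed to be an
    open BigBedReader (or any object exposing `indexed_fields` and `search`).
    Returns an empty list instead of raising if the field is not indexed.
    """
    if field_name not in reader.indexed_fields:
        return []
    # search() yields records lazily; list() collects them
    return list(reader.search(field_name, *values))
```

Returning an empty list for an unindexed field is a design choice of this sketch; code that needs to distinguish "no matches" from "field not searchable" should let the IndexError propagate instead.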

Examples

Find all entries matching a given gene ID:

>>> # open file
>>> bb = BigBedReader("some_file.bb")

>>> # which fields are searchable?
>>> bb.indexed_fields
['name', 'gene_id']

>>> # find all entries whose 'gene_id' matches 'nanos'
>>> list(bb.search('gene_id','nanos'))
[ list of matching SegmentChains ]

>>> # find all entries whose 'gene_id' matches 'nanos' or 'oskar'
>>> list(bb.search('gene_id','nanos','oskar'))
[ list of matching SegmentChains ]
bed_fields

Number of standard BED format columns included in file

chrom_sizes

DEPRECATED: Use .chroms instead of .chrom_sizes

chromids
chroms

Dictionary mapping chromosome names to lengths

custom_fields

BigBedReader.custom_fields is DEPRECATED. Will be removed in plastid v0.5.0. Use BigBedReader.extension_fields in future

extension_fields

Dictionary of names and types of extra fields included in BigWig/BigBed file

filename

Name of BigWig or BigBed file

indexed_fields

Names of indexed fields in BigBed file. These are searchable by self.search

num_chroms

Number of chromosomes in the BigBed file

num_records

Number of features in file

return_type

Return type of reader

uncompress_buf_size

Size of buffer needed to uncompress blocks. If 0, the data is uncompressed

version

Version of BigWig or BigBed file format

plastid.readers.bigbed.BigBedIterator(reader, maxmem = 0)

Iterate over records in the BigBed file, sorted lexically by chromosome and position.

Parameters:
reader : BigBedReader

Reader to iterate over

maxmem : float

Maximum desired memory footprint for C objects, in megabytes. May be temporarily exceeded if large queries are requested. Does not include memory footprint of Python objects. (Default: 0, no limit)

Yields:
object

reader.return_type of BED record

Raises:
MemoryError

If memory cannot be allocated