plastid.readers.bigbed module

BigBedReader, a parser for BigBed files.

Summary

In contrast to BED, GTF2, and GFF3 files, BigBed files are binary, indexed, and randomly-accessible. This means:

  • BigBedReader can be used to iterate over records, like a reader, or to fetch records that cover a region of interest, in the manner of a GenomeHash.

  • BigBed files use less memory, because their records don’t all need to be loaded into memory before they can be parsed or accessed.

  • Indexed BigBed files can be searched for matching records.

Module Contents

BigBedReader(filename[, return_type, ...])

Reader for BigBed files.

BigBedIterator(reader[, maxmem])

Iterate over records in the BigBed file, sorted lexically by chromosome and position.

Examples

Iterate over all features in a BigBed file:

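>>> from plastid.readers.bigbed import BigBedReader
>>> from plastid.genomics.roitools import GenomicSegment, Transcript
>>> my_reader = BigBedReader("some_file.bb", return_type=Transcript)
>>> for feature in my_reader:
...     pass # do something with each Transcript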

BigBed files can be accessed as dictionaries. To find features overlapping a region of interest:

>>> roi = GenomicSegment("chrI", 0, 100000, "+")
>>> overlapping_features = my_reader[roi]
>>> list(overlapping_features)
[ list of SegmentChains/Transcripts ]

Find features that match keyword(s) in a certain field:

>>> # which fields are indexed and searchable?
>>> my_reader.indexed_fields
['name', 'gene_id']

>>> # find all entries whose 'gene_id' matches 'nanos'
>>> list(my_reader.search('gene_id', 'nanos'))
[ list of matching SegmentChains/Transcripts ]

See also

Kent2010

Description of BigBed and BigWig formats. Especially see supplemental data.

UCSC file format FAQ

Descriptions of BED, GTF2, GFF3 and other text-based formats.

class plastid.readers.bigbed.BigBedReader(filename, return_type=SegmentChain, add_three_for_stop=False, maxmem=0)

Bases: plastid.readers.bbifile._BBI_Reader

Reader for BigBed files. This class is useful both for iterating over genomic features one by one (like a reader) and for random access to genomic features that overlap a region of interest (like a GenomeHash).

Parameters
filename : str

Path to BigBed file

return_type : SegmentChain or subclass, optional

Type of feature to return from assembled subfeatures (Default: SegmentChain)

add_three_for_stop : bool, optional

Some annotation files exclude the stop codon from CDS annotations. If set to True, three nucleotides will be added to the 3' end of each CDS annotation, UNLESS the annotated transcript contains an explicit stop_codon feature. (Default: False)

maxmem : float, optional

Maximum desired memory footprint for C objects, in megabytes. May be temporarily exceeded if large queries are requested. Does not include memory footprint of Python objects. (Default: 0, no limit)
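
The effect of add_three_for_stop can be pictured with a small, library-independent sketch. The function name extend_cds_for_stop, its arguments, and the single-interval CDS are hypothetical simplifications, and coordinates are assumed to be 0-indexed and half-open; this illustrates the idea and is not plastid's implementation.

```python
def extend_cds_for_stop(cds_start, cds_end, strand, has_explicit_stop=False):
    """Return (start, end) of a CDS grown by 3 nt at its 3' end.

    Hypothetical helper: treats the CDS as one half-open interval.
    """
    # An explicit stop_codon feature means the stop is already annotated,
    # so the coordinates are left untouched.
    if has_explicit_stop:
        return cds_start, cds_end
    if strand == "+":
        # On the plus strand the 3' end is the right-hand edge.
        return cds_start, cds_end + 3
    if strand == "-":
        # On the minus strand the 3' end is the left-hand edge.
        return cds_start - 3, cds_end
    raise ValueError("strand must be '+' or '-', got %r" % (strand,))


plus_cds = extend_cds_for_stop(100, 400, "+")
minus_cds = extend_cds_for_stop(100, 400, "-")
```

Note that the extension is strand-aware: the same flag moves a different coordinate depending on which strand the CDS lies on.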

Examples

Iterate over all features in a BigBed file:

>>> my_reader = BigBedReader("some_file.bb")
>>> for feature in my_reader:
...     pass # do something with each feature

BigBed files can be accessed as dictionaries. To find features overlapping a region of interest:

>>> roi = GenomicSegment("chrI", 0, 100000, "+")
>>> for feature in my_reader[roi]:
...     pass # do something with that feature

Find features overlapping a genomic region of interest roi, on either strand:

>>> for feature in my_reader.get(roi, stranded=False):
...     pass # do something with that feature

Attributes
extension_fields : OrderedDict

Dictionary of names and types of extra fields included in the BigWig/BigBed file

extension_types : OrderedDict

Dictionary mapping custom field names to objects that parse their types from strings

filename : str

Name of BigWig or BigBed file

num_records : int

Number of features in file

num_chroms : int

Number of chromosomes in the BigBed file

chroms : dict

Dictionary mapping chromosome names to lengths

return_type : class implementing a from_bed() method, or str

Return type of reader

Methods

get(self, roi, bool stranded=True, ...)

Iterate over features that share genomic positions with a region of interest

search(self, field_name, *values)

Search indexed fields in the BigBed file for records matching value. See self.indexed_fields for names of indexed fields and self.extension_fields for descriptions of extension fields.

get(self, roi, bool stranded=True, bool check_unique=True)

Iterate over features that share genomic positions with a region of interest

Parameters
roi : SegmentChain or GenomicSegment

Query feature representing region of interest

stranded : bool, optional

If True, retrieve only features on same strand as query feature. Otherwise, retrieve features on both strands. (Default: True)

check_unique : bool, optional

If True, ensure that all results in the generator are unique. (Default: True)

Yields
object

self.return_type of each record in the BigBed file

Raises
TypeError

If roi is not a GenomicSegment or SegmentChain
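
The roles of stranded and check_unique can be sketched without a BigBed file. Everything below is a hypothetical stand-in: Segment, overlaps, and find_overlapping are illustrative names, features are reduced to single half-open intervals, and the linear scan replaces the file's index. It shows the filtering semantics described above, not how BigBedReader.get is implemented.

```python
from collections import namedtuple

# Hypothetical single-interval feature; coordinates assumed half-open.
Segment = namedtuple("Segment", "chrom start end strand")


def overlaps(feature, roi, stranded=True):
    """True if feature shares at least one position with roi."""
    if feature.chrom != roi.chrom:
        return False
    if stranded and feature.strand != roi.strand:
        return False
    # Half-open intervals overlap when each starts before the other ends.
    return feature.start < roi.end and roi.start < feature.end


def find_overlapping(features, roi, stranded=True, check_unique=True):
    """Yield features overlapping roi, optionally dropping repeats."""
    seen = set()
    for feature in features:
        if not overlaps(feature, roi, stranded):
            continue
        if check_unique:
            if feature in seen:
                continue
            seen.add(feature)
        yield feature


features = [
    Segment("chrI", 10, 50, "+"),
    Segment("chrI", 40, 90, "-"),
    Segment("chrI", 10, 50, "+"),    # repeat of the first record
    Segment("chrII", 10, 50, "+"),   # different chromosome
    Segment("chrI", 100, 200, "+"),  # starts exactly where roi ends
]
roi = Segment("chrI", 0, 100, "+")
same_strand = list(find_overlapping(features, roi))
either_strand = list(find_overlapping(features, roi, stranded=False))
```

With the default arguments only the plus-strand record is returned, once; passing stranded=False also admits the minus-strand record.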

search(self, field_name, *values)

Search indexed fields in the BigBed file for records matching value. See self.indexed_fields for names of indexed fields and self.extension_fields for descriptions of extension fields.

Parameters
field_name : str

Name of field to search

*values : one or more str

Value(s) to match. If multiple are given, records matching any value will be returned.

Yields
object

self.return_type of matching record in the BigBed file

Raises
IndexError

If field field_name is not indexed

Examples

Find all entries matching a given gene ID:

>>> # open file
>>> bb = BigBedReader("some_file.bb")

>>> # which fields are searchable?
>>> bb.indexed_fields
['name', 'gene_id']

>>> # find all entries whose 'gene_id' matches 'nanos'
>>> list(bb.search('gene_id', 'nanos'))
[ list of matching SegmentChains ]

>>> # find all entries whose 'gene_id' matches 'nanos' or 'oskar'
>>> list(bb.search('gene_id', 'nanos', 'oskar'))
[ list of matching SegmentChains ]
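
Because search raises IndexError for fields that are not indexed, callers may prefer to consult indexed_fields first. The wrapper below is a minimal sketch of that pattern: search_or_empty is a hypothetical helper, and _StandInReader is a toy object with exact-match lookup that only mimics the two documented members (indexed_fields and search) so the example runs without a BigBed file.

```python
def search_or_empty(reader, field_name, *values):
    """Return a list of matches, or [] if field_name is not indexed."""
    if field_name not in reader.indexed_fields:
        return []
    # search() yields records, so materialize them into a list.
    return list(reader.search(field_name, *values))


class _StandInReader:
    """Toy stand-in for a reader; NOT part of plastid."""

    indexed_fields = ["name", "gene_id"]

    _records = [
        {"name": "nos-RA", "gene_id": "nanos"},
        {"name": "osk-RA", "gene_id": "oskar"},
        {"name": "bcd-RA", "gene_id": "bicoid"},
    ]

    def search(self, field_name, *values):
        if field_name not in self.indexed_fields:
            raise IndexError("field %r is not indexed" % field_name)
        for record in self._records:
            # A record matches if its field equals ANY requested value.
            if record[field_name] in values:
                yield record


reader = _StandInReader()
one_gene = search_or_empty(reader, "gene_id", "nanos")
two_genes = search_or_empty(reader, "gene_id", "nanos", "oskar")
not_indexed = search_or_empty(reader, "score", "5")
```

Any object exposing indexed_fields and search in the way documented above could be passed in place of the stand-in.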

bed_fields

Number of standard BED format columns included in file

chrom_sizes

DEPRECATED: Use .chroms instead of .chrom_sizes

chromids

chroms

Dictionary mapping chromosome names to lengths

extension_fields

Dictionary of names and types of extra fields included in the BigWig/BigBed file

filename

Name of BigWig or BigBed file

indexed_fields

Names of indexed fields in BigBed file. These are searchable by self.search.

num_chroms

Number of chromosomes in the BigBed file

num_records

Number of features in file

return_type

Return type of reader

uncompress_buf_size

Size of buffer needed to uncompress blocks. If 0, the data is uncompressed.

version

Version of BigWig or BigBed file format
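
The properties listed above can be gathered into a quick summary of an open file. The helper below is a sketch only: describe_reader is a hypothetical name, it relies solely on the attributes documented in this section, and the SimpleNamespace object with made-up values stands in for a real reader so the example runs without a BigBed file.

```python
from types import SimpleNamespace


def describe_reader(reader):
    """Collect a few documented reader properties into a plain dict."""
    return {
        "filename": reader.filename,
        "num_records": reader.num_records,
        "num_chroms": reader.num_chroms,
        # chroms maps chromosome names to lengths.
        "total_length": sum(reader.chroms.values()),
        "indexed_fields": list(reader.indexed_fields),
        # A buffer size of 0 means the data is uncompressed.
        "compressed": reader.uncompress_buf_size > 0,
    }


# Stand-in object with illustrative values; NOT a real BigBedReader.
fake_reader = SimpleNamespace(
    filename="some_file.bb",
    num_records=3,
    num_chroms=2,
    chroms={"chrI": 1000, "chrII": 2500},
    indexed_fields=["name", "gene_id"],
    uncompress_buf_size=0,
)
summary = describe_reader(fake_reader)
```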

plastid.readers.bigbed.BigBedIterator(reader, maxmem=0)

Iterate over records in the BigBed file, sorted lexically by chromosome and position.

Parameters
reader : BigBedReader

Reader to iterate over

maxmem : float

Maximum desired memory footprint for C objects, in megabytes. May be temporarily exceeded if large queries are requested. Does not include memory footprint of Python objects. (Default: 0, no limit)

Yields
object

reader.return_type of BED record

Raises
MemoryError

If memory cannot be allocated
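
Lexical ordering of chromosome names differs from the "natural" numeric order many users expect. The short sketch below uses hypothetical (chromosome, start) tuples and assumes that start positions compare numerically within a chromosome; it only illustrates what string ordering does to names such as chr2 and chr10.

```python
# Hypothetical (chromosome, start) pairs in arbitrary order.
records = [("chr2", 500), ("chr10", 100), ("chr1", 900), ("chr10", 50)]

# Tuples sort by chromosome name as a string first, then by start position.
ordered = sorted(records)

for chrom, start in ordered:
    print(chrom, start)
```

Under string comparison chr10 sorts before chr2, because the character "1" precedes "2".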