plastid.genomics.genome_hash module

This module contains tools for looking up features in a region of interest within a genome.

Summary

It is frequently useful to retrieve features that overlap specific regions of interest in the genome. GenomeHashes index features by location, providing quick lookup.

Module contents

Several implementations are provided, depending on how the data are formatted:

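==================  ================================================================
Implementation      Format of feature data
==================  ================================================================
GenomeHash          Objects in memory or in unindexed BED, GTF2, GFF3, or PSL files
BigBedGenomeHash    Annotations in BigBed files
TabixGenomeHash     Annotations in tabix-compressed BED, GTF2, GFF3, or PSL files
==================  ================================================================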

Examples

Create a GenomeHash:

>>> from plastid import *

# from objects in memory
>>> one_hash = GenomeHash(list_of_transcripts)

# from a non-indexed file
>>> my_hash = GenomeHash(list(GFF3_Reader("some_file.gff")))

# from a BigBed file
>>> bigbed_hash = BigBedGenomeHash("some_file.bb")

# from tabix-compressed BED file
>>> tabix_hash = TabixGenomeHash("some_file.bed.gz","BED")

To find features overlapping a region of interest, pass the feature coordinates to a GenomeHash as a GenomicSegment, SegmentChain, or Transcript:

>>> overlapping = my_hash[GenomicSegment("chrII",50,10000,"+")]
>>> overlapping
[ list of SegmentChains / Transcripts, etc. ]

# SegmentChains & Transcripts can also be keys:
>>> tx = Transcript(GenomicSegment("chrII",50,300,"+"),GenomicSegment("chrII",9000,10000,"+"))
>>> overlapping2 = my_hash[tx]
>>> overlapping2
[ list of SegmentChains / Transcripts, etc. ]

# find features that overlap `roi` on either strand
>>> roi = GenomicSegment("chrII",50,10000,"+")
>>> either_strand_overlap = my_hash.get_overlapping_features(roi,stranded=False)

class plastid.genomics.genome_hash.BigBedGenomeHash(*filenames, return_type=SegmentChain)

Bases: plastid.genomics.genome_hash.AbstractGenomeHash

Find features overlapping query regions in BigBed files.

Parameters
*filenames : str

One or more filenames to open (NOT open filehandles)

return_type : class implementing a from_bed() method

Class of object to return (Default: SegmentChain)
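
As an illustration of what "a class implementing a from_bed() method" could look like, here is a minimal sketch. The class `SimpleSpan` is hypothetical, and the assumption that from_bed() receives a single tab-delimited BED line is ours; the arguments BigBedGenomeHash actually passes are not stated in this section.

```python
class SimpleSpan:
    """Hypothetical lightweight feature built from the first six BED columns."""

    def __init__(self, chrom, start, end, name=None, strand="."):
        self.chrom = chrom
        self.start = start
        self.end = end
        self.name = name
        self.strand = strand

    @classmethod
    def from_bed(cls, line):
        """Parse one tab-delimited BED line (assumed input) into a SimpleSpan."""
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Columns 4 (name) and 6 (strand) are optional in BED
        name = fields[3] if len(fields) > 3 else None
        strand = fields[5] if len(fields) > 5 else "."
        return cls(chrom, start, end, name=name, strand=strand)


span = SimpleSpan.from_bed("chrII\t50\t300\tgeneA\t0\t+\n")
print(span.chrom, span.start, span.end, span.name, span.strand)
```

Under that assumption, such a class would be supplied through the return_type parameter in place of the default SegmentChain.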

Attributes
bigbedreaders : BigBedReader

BigBedReaders connecting to BigBed file(s)

Methods

get_overlapping_features(roi[, stranded])

Return list of features overlapping roi

get_overlapping_features(roi, stranded=True)

Return list of features overlapping roi

Parameters
roi : GenomicSegment or SegmentChain

Query feature indicating region of interest

stranded : bool

If True, retrieve only features on same strand as query feature. Otherwise, retrieve features on both strands

Returns
list

Features that overlap roi

Raises
TypeError

if roi is not a GenomicSegment or SegmentChain

class plastid.genomics.genome_hash.GenomeHash(features=None, binsize=20000, do_copy=False)

Bases: plastid.genomics.genome_hash.AbstractGenomeHash

Index memory-resident features (e.g. SegmentChains or Transcripts) by genomic position for quick lookup later.

Parameters
features : dict or list, optional

dict or list of features, as SegmentChain objects or subclasses (Default: [])

binsize : int, optional

Size in nucleotides of neighborhood for hash. (Default: 20000)

do_copy : bool, optional

If True, features will be copied before being stored in the hash. This comes at a speed cost, but will prevent unexpected side effects if the features are being changed outside the hash.

If False (default), creation of the GenomeHash will be much faster.
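
The side effect that do_copy guards against is ordinary Python aliasing. The sketch below uses a plain list-backed container rather than a GenomeHash; `ToyStore` and its `do_copy` argument are hypothetical and only mimic the behavior described above.

```python
import copy


class ToyStore:
    """Hypothetical container that optionally copies what it stores."""

    def __init__(self, features, do_copy=False):
        # With do_copy=True each feature is deep-copied, which costs time
        # but isolates the store from later changes made by the caller.
        self.features = [copy.deepcopy(f) if do_copy else f for f in features]


gene = {"name": "geneA", "start": 50, "end": 300}

shared = ToyStore([gene], do_copy=False)
isolated = ToyStore([gene], do_copy=True)

gene["end"] = 9999  # caller mutates the feature after storing it

print(shared.features[0]["end"])    # the stored object is the caller's object
print(isolated.features[0]["end"])  # the stored copy is unaffected
```

The uncopied store sees the mutation because it holds the very same object, which is the trade-off between speed and safety that the parameter exposes.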

Notes

Because all features are stored in memory, for large genomes, a TabixGenomeHash or BigBedGenomeHash is much more memory-efficient.

Methods

get_nearby_feature_names(roi[, stranded])

Return list of the names of features in all the bins occupied by roi

get_nearby_features(roi[, stranded])

Return list of features in all the bins occupied by roi

get_overlapping_features(roi[, stranded])

Return list of features overlapping roi.

update(features)

Add features to the GenomeHash

get_nearby_feature_names(roi, stranded=True)

Return list of the names of features in all the bins occupied by roi

Parameters
roi : GenomicSegment or SegmentChain

Query feature

stranded : bool

If True, retrieve only features on same strand as query feature. Otherwise, retrieve features on both strands

Returns
list

Names of features near (within self.binsize distance from) roi

Raises
TypeError

if roi is not a GenomicSegment or SegmentChain
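
To illustrate what "all the bins occupied by roi" means, here is a minimal pure-Python sketch of a bin-based index. It is not plastid's implementation: the helper names (`bins_spanned`, `build_index`, `nearby_names`), the tuple-based feature layout, and the 0-based half-open coordinate convention are assumptions made for illustration.

```python
from collections import defaultdict

BINSIZE = 20000  # same value as the documented default for `binsize`


def bins_spanned(start, end, binsize=BINSIZE):
    """Return the bin numbers touched by the half-open interval [start, end)."""
    return range(start // binsize, (end - 1) // binsize + 1)


def build_index(features, binsize=BINSIZE):
    """Map (chrom, strand, bin number) to names of features touching that bin.

    `features` is an iterable of (name, chrom, start, end, strand) tuples,
    a hypothetical stand-in for SegmentChain objects.
    """
    index = defaultdict(list)
    for name, chrom, start, end, strand in features:
        for bin_number in bins_spanned(start, end, binsize):
            index[(chrom, strand, bin_number)].append(name)
    return index


def nearby_names(index, chrom, start, end, strand, binsize=BINSIZE):
    """Return sorted names of features sharing at least one bin with the query."""
    names = set()
    for bin_number in bins_spanned(start, end, binsize):
        names.update(index.get((chrom, strand, bin_number), []))
    return sorted(names)


features = [
    ("geneA", "chrII", 50, 300, "+"),       # bin 0
    ("geneB", "chrII", 19000, 21000, "+"),  # spans bins 0 and 1
    ("geneC", "chrII", 45000, 46000, "+"),  # bin 2 only
    ("geneD", "chrII", 100, 200, "-"),      # bin 0, opposite strand
]
index = build_index(features)
hits = nearby_names(index, "chrII", 500, 600, "+")
print(hits)
```

In this toy index the query at 500-600 shares bin 0 with geneA and geneB, so both are reported as nearby even though neither overlaps the query. A lookup only inspects the bins the query touches, which is what makes it fast.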

get_nearby_features(roi, stranded=True)

Return list of features in all the bins occupied by roi

Parameters
roi : GenomicSegment or SegmentChain

Query feature

stranded : bool

If True, retrieve only features on same strand as query feature. Otherwise, retrieve features on both strands

Returns
list

Features near (within self.binsize distance from) roi

Raises
TypeError

if roi is not a GenomicSegment or SegmentChain

get_overlapping_features(roi, stranded=True)

Return list of features overlapping roi.

Parameters
roi : GenomicSegment or SegmentChain

Query feature indicating region of interest

stranded : bool

If True, retrieve only features on same strand as query feature. Otherwise, retrieve features on both strands

Returns
list

Features overlapping roi

Raises
TypeError

if roi is not a GenomicSegment or SegmentChain
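
The difference between a nearby feature and an overlapping one, and the effect of stranded, can be sketched as follows. This is an illustration, not plastid's code: the `Interval` tuple and the `overlaps` and `overlapping_features` helpers are hypothetical, and 0-based half-open coordinates are assumed.

```python
from collections import namedtuple

# Hypothetical stand-in for a single-segment feature
Interval = namedtuple("Interval", ["name", "chrom", "start", "end", "strand"])


def overlaps(query, feature, stranded=True):
    """Return True if two half-open intervals share at least one position."""
    if query.chrom != feature.chrom:
        return False
    if stranded and query.strand != feature.strand:
        return False
    return query.start < feature.end and feature.start < query.end


def overlapping_features(query, candidates, stranded=True):
    """Keep only the candidates that truly overlap the query."""
    return [f for f in candidates if overlaps(query, f, stranded)]


roi = Interval("roi", "chrII", 250, 600, "+")
candidates = [
    Interval("geneA", "chrII", 50, 300, "+"),       # overlaps roi, same strand
    Interval("geneB", "chrII", 19000, 21000, "+"),  # nearby but not overlapping
    Interval("geneD", "chrII", 500, 700, "-"),      # overlaps roi, opposite strand
]

same_strand = [f.name for f in overlapping_features(roi, candidates)]
either_strand = [f.name for f in overlapping_features(roi, candidates, stranded=False)]
print(same_strand, either_strand)
```

Here only geneA is returned for a stranded query, while geneD is added once stranded=False. For a multi-segment feature such as a Transcript, a comparable test would presumably be applied segment by segment rather than to the full span.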

update(features)

Add features to the GenomeHash

Parameters
features : dict or list

dict or list of features, as SegmentChain objects or subclasses

class plastid.genomics.genome_hash.TabixGenomeHash(*filenames, data_format='GTF2')

Bases: plastid.genomics.genome_hash.AbstractGenomeHash

Find features overlapping query regions in Tabix-indexed files.

Parameters
filenames : str or list of str

Filename or list of filenames of Tabix-compressed files

data_format : str

Format of tabix-compressed file(s). Choices are: 'GTF2', 'GFF3', 'BED', 'PSL' (Default: 'GTF2')

Attributes
filenames : str or list

Name of file to open or list of filenames to open (NOT open filehandles)

tabix_readers : list of pysam.Tabixfile

Pysam interfaces to underlying data files

Methods

get_overlapping_features(roi[, stranded])

Return list of features overlapping roi

get_overlapping_features(roi, stranded=True)

Return list of features overlapping roi

Parameters
roi : GenomicSegment or SegmentChain

Query feature indicating region of interest

stranded : bool

If True, retrieve only features on same strand as query feature. Otherwise, retrieve features on both strands

Returns
list

Features that overlap roi

Raises
TypeError

if roi is not a GenomicSegment or SegmentChain