plastid.genomics.genome_hash module

This module contains tools for looking up features in a region of interest within a genome.

Summary

It is frequently useful to retrieve features that overlap specific regions of interest in the genome. GenomeHashes index features by location, providing quick lookup.

Module contents

Several implementations are provided, depending on how the data are formatted:

Implementation      Format of feature data
GenomeHash          Objects in memory or in unindexed BED, GTF2, GFF3, or PSL files
BigBedGenomeHash    Annotations in BigBed files
TabixGenomeHash     Annotations in tabix-compressed BED, GTF2, GFF3, or PSL files

Examples

Create a GenomeHash:

>>> from plastid import *

# from objects in memory
>>> one_hash = GenomeHash(list_of_transcripts)

# from a non-indexed file
>>> my_hash = GenomeHash(list(GFF3_Reader("some_file.gff")))

# from a BigBed file
>>> bigbed_hash = BigBedGenomeHash("some_file.bb")

# from tabix-compressed BED file
>>> tabix_hash = TabixGenomeHash("some_file.bed.gz","BED")

To find features overlapping a region of interest, pass the feature coordinates to a GenomeHash as a GenomicSegment, SegmentChain, or Transcript:

>>> overlapping = my_hash[GenomicSegment("chrII",50,10000,"+")]
>>> overlapping
[ list of SegmentChains / Transcripts, etc. ]

# SegmentChains & Transcripts can also be keys:
>>> tx = Transcript(GenomicSegment("chrII",50,300,"+"),GenomicSegment("chrII",9000,10000,"+"))
>>> overlapping2 = my_hash[tx]
>>> overlapping2
[ list of SegmentChains / Transcripts, etc. ]

# find features that overlap `roi` on either strand
>>> roi = GenomicSegment("chrII",50,10000,"+")
>>> either_strand_overlap = my_hash.get_overlapping_features(roi,stranded=False)

class plastid.genomics.genome_hash.BigBedGenomeHash(*filenames, return_type=SegmentChain)

Bases: plastid.genomics.genome_hash.AbstractGenomeHash

Find features overlapping query regions in BigBed files.

Parameters:
*filenames : str

One or more filenames to open (NOT open filehandles)

return_type : class implementing a from_bed() method

Class of object to return (Default: SegmentChain)

Attributes:
bigbedreaders : BigBedReader

BigBedReaders connecting to BigBed file(s)

Methods

get_overlapping_features(roi[, stranded])    Return list of features overlapping roi

get_overlapping_features(roi, stranded=True)

Return list of features overlapping roi

Parameters:
roi : GenomicSegment or SegmentChain

Query feature indicating region of interest

stranded : bool

If True, retrieve only features on same strand as query feature. Otherwise, retrieve features on both strands

Returns:
list

Features that overlap roi

Raises:
TypeError

if roi is not a GenomicSegment or SegmentChain
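
The effect of the stranded flag can be illustrated without any annotation file. The following is a minimal sketch, not plastid's implementation: features and the query are plain (chrom, start, end, strand) tuples, coordinates are assumed to be half-open, and the overlaps() helper is hypothetical.

```python
def overlaps(feature, roi, stranded=True):
    """Test two (chrom, start, end, strand) tuples for overlap.

    Coordinates are treated as half-open [start, end); this is an
    assumption of the sketch, not a statement about plastid internals.
    """
    f_chrom, f_start, f_end, f_strand = feature
    r_chrom, r_start, r_end, r_strand = roi
    if f_chrom != r_chrom:
        return False
    # With stranded=True, features on the opposite strand never match
    if stranded and f_strand != r_strand:
        return False
    return f_start < r_end and r_start < f_end


plus_gene = ("chrII", 100, 500, "+")
minus_gene = ("chrII", 200, 600, "-")
roi = ("chrII", 50, 10000, "+")

candidates = (plus_gene, minus_gene)
same_strand = [f for f in candidates if overlaps(f, roi)]
either_strand = [f for f in candidates if overlaps(f, roi, stranded=False)]

print(same_strand)    # only the plus-strand feature
print(either_strand)  # both features
```

With stranded=True the minus-strand feature is filtered out even though its coordinates intersect the query; with stranded=False both features are returned.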

class plastid.genomics.genome_hash.GenomeHash(features=None, binsize=20000, do_copy=False)

Bases: plastid.genomics.genome_hash.AbstractGenomeHash

Index memory-resident features (e.g. SegmentChains or Transcripts) by genomic position for quick lookup later.

Parameters:
features : dict or list, optional

dict or list of features, as SegmentChain objects or subclasses (Default: [])

binsize : int, optional

Size in nucleotides of neighborhood for hash. (Default: 20000)

do_copy : bool

If True, features will be copied before being stored in the hash. This comes at a speed cost, but will prevent unexpected side effects if the features are being changed outside the hash.

If False (default), creation of the GenomeHash will be much faster.
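
The trade-off behind do_copy can be shown with a toy container. This is a sketch under stated assumptions: ToyFeature and store() are hypothetical stand-ins, not plastid classes or functions, and copy.deepcopy is used only to illustrate the idea of copying before storing.

```python
import copy


class ToyFeature:
    """Hypothetical mutable feature; not a plastid class."""

    def __init__(self, name, start, end):
        self.name = name
        self.start = start
        self.end = end


def store(features, do_copy=False):
    """Return the list a toy index would hold, optionally copying each feature."""
    if do_copy:
        # Slower, but later changes to the originals cannot leak in
        return [copy.deepcopy(f) for f in features]
    # Faster: the index holds references to the caller's objects
    return list(features)


gene = ToyFeature("geneA", 50, 300)
shared = store([gene], do_copy=False)
isolated = store([gene], do_copy=True)

# Change the feature after it has been stored
gene.end = 5000

print(shared[0].end)    # reflects the outside change
print(isolated[0].end)  # still the value at storage time
```

Without copying, the stored entry silently follows the outside change, which is the kind of side effect the do_copy option is described as preventing.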

Notes

A GenomeHash stores all of its features in memory. For large genomes, a TabixGenomeHash or BigBedGenomeHash is much more memory-efficient.

Methods

get_nearby_feature_names(roi[, stranded])    Return list of the names of features in all the bins occupied by roi
get_nearby_features(roi[, stranded])         Return list of features in all the bins occupied by roi
get_overlapping_features(roi[, stranded])    Return list of features overlapping roi
update(features)                             Add features to the GenomeHash

get_nearby_feature_names(roi, stranded=True)

Return list of the names of features in all the bins occupied by roi

Parameters:
roi : GenomicSegment or SegmentChain

Query feature

stranded : bool

If True, retrieve only features on same strand as query feature. Otherwise, retrieve features on both strands

Returns:
list

Names of features near (within self.binsize distance from) roi

Raises:
TypeError

if roi is not a GenomicSegment or SegmentChain

get_nearby_features(roi, stranded=True)

Return list of features in all the bins occupied by roi

Parameters:
roi : GenomicSegment or SegmentChain

Query feature

stranded : bool

If True, retrieve only features on same strand as query feature. Otherwise, retrieve features on both strands

Returns:
list

Features near (within self.binsize distance from) roi

Raises:
TypeError

if roi is not a GenomicSegment or SegmentChain
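
The difference between "nearby" features (those sharing a bin with roi) and overlapping features can be sketched with a toy bin index. This is an illustration of the general binning idea only, not plastid's code: the helper names, the (name, chrom, start, end) tuples, and the half-open coordinates are all assumptions of the sketch, and strand is ignored for brevity.

```python
from collections import defaultdict

BINSIZE = 20000  # mirrors the default binsize in the GenomeHash signature


def bins_for(start, end, binsize=BINSIZE):
    """Return the bin numbers touched by the half-open interval [start, end)."""
    return range(start // binsize, (end - 1) // binsize + 1)


def build_index(features, binsize=BINSIZE):
    """Map (chrom, bin number) -> list of features that touch that bin."""
    index = defaultdict(list)
    for feature in features:
        name, chrom, start, end = feature
        for b in bins_for(start, end, binsize):
            index[(chrom, b)].append(feature)
    return index


def nearby(index, chrom, start, end, binsize=BINSIZE):
    """Every feature sharing at least one bin with the query."""
    found = []
    for b in bins_for(start, end, binsize):
        for feature in index.get((chrom, b), []):
            if feature not in found:
                found.append(feature)
    return found


def overlapping(index, chrom, start, end, binsize=BINSIZE):
    """Nearby features whose coordinates actually intersect the query."""
    return [f for f in nearby(index, chrom, start, end, binsize)
            if f[2] < end and start < f[3]]


features = [
    ("geneA", "chrII", 50, 300),
    ("geneB", "chrII", 9000, 10000),
    ("geneC", "chrII", 19000, 21000),  # spans bins 0 and 1
    ("geneD", "chrIII", 100, 200),
]
index = build_index(features)

near_names = [f[0] for f in nearby(index, "chrII", 100, 200)]
hit_names = [f[0] for f in overlapping(index, "chrII", 100, 200)]
print(near_names)  # everything in the same bin on chrII
print(hit_names)   # only features that intersect positions 100-200
```

In this sketch the bin lookup is cheap but coarse, so the overlap query uses it only to narrow the candidates before comparing coordinates.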

get_overlapping_features(roi, stranded=True)

Return list of features overlapping roi.

Parameters:
roi : GenomicSegment or SegmentChain

Query feature indicating region of interest

stranded : bool

If True, retrieve only features on same strand as query feature. Otherwise, retrieve features on both strands

Returns:
list

Features overlapping roi

Raises:
TypeError

if roi is not a GenomicSegment or SegmentChain

update(features)

Add features to the GenomeHash

Parameters:
features : dict or list

dict or list of features, as SegmentChain objects or subclasses

class plastid.genomics.genome_hash.TabixGenomeHash(*filenames, data_format='GTF2')

Bases: plastid.genomics.genome_hash.AbstractGenomeHash

Find features overlapping query regions in Tabix-indexed files.

Parameters:
filenames : str or list of str

Filename or list of filenames of Tabix-compressed files

data_format : str

Format of tabix-compressed file(s). Choices are: 'GTF2', 'GFF3', 'BED', 'PSL' (Default: 'GTF2')

Attributes:
filenames : str or list

Name of file to open or list of filenames to open (NOT open filehandles)

tabix_readers : list of pysam.Tabixfile

Pysam interfaces to underlying data files

Methods

get_overlapping_features(roi[, stranded])    Return list of features overlapping roi

get_overlapping_features(roi, stranded=True)

Return list of features overlapping roi

Parameters:
roi : GenomicSegment or SegmentChain

Query feature indicating region of interest

stranded : bool

If True, retrieve only features on same strand as query feature. Otherwise, retrieve features on both strands

Returns:
list

Features that overlap roi

Raises:
TypeError

if roi is not a GenomicSegment or SegmentChain