plastid.genomics.genome_hash module
This module contains tools for looking up features in a region of interest within a genome.
Summary
It is frequently useful to retrieve features that overlap specific regions
of interest in the genome. GenomeHashes
index features by location,
providing quick lookup.
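The idea of indexing by location can be sketched in plain Python. This is a minimal illustration, not plastid's implementation: the tuple layout, the helper names `bins_for`, `build_index`, and `candidates`, and the use of half-open coordinates are all assumptions made for the example. Only the bin size of 20000 comes from the `binsize` default documented for GenomeHash.

```python
from collections import defaultdict

BINSIZE = 20000  # matches the documented GenomeHash default; everything else is hypothetical


def bins_for(start, end, binsize=BINSIZE):
    """Return the bin numbers touched by the half-open interval [start, end)."""
    return range(start // binsize, (end - 1) // binsize + 1)


def build_index(features, binsize=BINSIZE):
    """Map (chrom, strand, bin) to the list of features occupying that bin."""
    index = defaultdict(list)
    for feature in features:
        name, chrom, start, end, strand = feature
        for b in bins_for(start, end, binsize):
            index[(chrom, strand, b)].append(feature)
    return index


def candidates(index, chrom, start, end, strand, binsize=BINSIZE):
    """Collect features from every bin the query touches, without duplicates."""
    found = []
    for b in bins_for(start, end, binsize):
        for feature in index.get((chrom, strand, b), []):
            if feature not in found:
                found.append(feature)
    return found


# Toy features as (name, chrom, start, end, strand) tuples
features = [
    ("geneA", "chrII", 100, 500, "+"),
    ("geneB", "chrII", 19000, 21000, "+"),
    ("geneC", "chrII", 45000, 46000, "-"),
]
index = build_index(features)
```

A query then inspects only the bins it touches instead of scanning every feature, which is where the speedup comes from.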
Module contents
Several implementations are provided, depending on how the data are formatted:
Implementation | Format of feature data
GenomeHash | Objects in memory or in unindexed BED, GTF2, GFF3, or PSL files
BigBedGenomeHash | Annotations in BigBed files
TabixGenomeHash | Annotations in tabix-compressed BED, GTF2, GFF3, or PSL files
Examples
Create a GenomeHash:
>>> from plastid import *
# from objects in memory
>>> one_hash = GenomeHash(list_of_transcripts)
# from a non-indexed file
>>> my_hash = GenomeHash(list(GFF3_Reader("some_file.gff")))
# from a BigBed file
>>> bigbed_hash = BigBedGenomeHash("some_file.bb")
# from tabix-compressed BED file
>>> tabix_hash = TabixGenomeHash("some_file.bed.gz","BED")
To find features overlapping a region of interest, pass the feature coordinates to a GenomeHash as a GenomicSegment, SegmentChain, or Transcript:
>>> overlapping = my_hash[GenomicSegment("chrII",50,10000,"+")]
>>> overlapping
[ list of SegmentChains / Transcripts, etc. ]
# SegmentChains & Transcripts can also be keys:
>>> tx = Transcript(GenomicSegment("chrII",50,300,"+"),GenomicSegment("chrII",9000,10000,"+"))
>>> overlapping2 = my_hash[tx]
>>> overlapping2
[ list of SegmentChains / Transcripts, etc. ]
# find features that overlap `roi` on either strand
>>> either_strand_overlap = my_hash.get_overlapping_features(roi,stranded=False)
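The effect of the `stranded` argument can be illustrated without plastid. The sketch below assumes features are simple `(chrom, start, end, strand)` tuples with half-open coordinates, and the function name `overlaps` is hypothetical. It models a single interval only, not a multi-segment SegmentChain.

```python
def overlaps(a, b, stranded=True):
    """Return True if intervals a and b share at least one base.

    a and b are (chrom, start, end, strand) tuples with half-open coordinates.
    If stranded is False, the strand of each interval is ignored.
    """
    same_chrom = a[0] == b[0]
    shares_bases = a[1] < b[2] and b[1] < a[2]
    same_strand = a[3] == b[3]
    return same_chrom and shares_bases and (same_strand or not stranded)


roi = ("chrII", 50, 10000, "+")
minus_feature = ("chrII", 200, 400, "-")

stranded_hit = overlaps(roi, minus_feature)                    # strands differ
unstranded_hit = overlaps(roi, minus_feature, stranded=False)  # strand ignored
```

With `stranded=True` the minus-strand feature is excluded even though its coordinates fall inside the query; with `stranded=False` it is reported.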
- class plastid.genomics.genome_hash.BigBedGenomeHash(*filenames, return_type=SegmentChain)
Bases: plastid.genomics.genome_hash.AbstractGenomeHash
Find features overlapping query regions in BigBed files.
- Parameters
- *filenames : str
One or more filenames to open (NOT open filehandles)
- return_type : class implementing a from_bed() method
Class of object to return (Default: SegmentChain)
- Attributes
- bigbedreaders : BigBedReader
BigBedReaders connecting to BigBed file(s)
Methods
get_overlapping_features(roi[, stranded]) | Return list of features overlapping roi
- get_overlapping_features(roi, stranded=True)
Return list of features overlapping roi
- Parameters
- roi : GenomicSegment or SegmentChain
Query feature indicating region of interest
- stranded : bool
If True, retrieve only features on the same strand as the query feature. Otherwise, retrieve features on both strands
- Returns
- list
Features that overlap roi
- Raises
- TypeError
if roi is not a GenomicSegment or SegmentChain
- class plastid.genomics.genome_hash.GenomeHash(features=None, binsize=20000, do_copy=False)
Bases: plastid.genomics.genome_hash.AbstractGenomeHash
Index memory-resident features (e.g. SegmentChains or Transcripts) by genomic position for quick lookup later.
- Parameters
- features : dict or list, optional
dict or list of features, as SegmentChain objects or subclasses (Default: [])
- binsize : int, optional
Size in nucleotides of neighborhood for hash. (Default: 20000)
- do_copy : bool
If True, features will be copied before being stored in the hash. This comes at a speed cost, but will prevent unexpected side effects if the features are being changed outside the hash.
If False (default), creation of the GenomeHash will be much faster.
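The side effect that `do_copy` guards against can be shown with a toy container. `TinyIndex` below is a hypothetical stand-in, not plastid's GenomeHash, and features are modeled as plain dictionaries purely for illustration.

```python
import copy


class TinyIndex:
    """Hypothetical stand-in for a feature index; not plastid's GenomeHash."""

    def __init__(self, features, do_copy=False):
        if do_copy:
            # Slower: every feature is duplicated, so later outside edits are invisible here
            self.features = [copy.deepcopy(f) for f in features]
        else:
            # Faster: the index holds references to the caller's objects
            self.features = list(features)


feature = {"name": "geneA", "start": 100, "end": 500}
shared = TinyIndex([feature], do_copy=False)
isolated = TinyIndex([feature], do_copy=True)

feature["end"] = 9999  # mutate the feature outside either index
```

After the mutation, the index built without copying sees the changed coordinates, while the copying index still holds the original values.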
Notes
Because all features are stored in memory, a TabixGenomeHash or BigBedGenomeHash is much more memory-efficient for large genomes.
Methods
get_nearby_feature_names(roi[, stranded]) | Return list of the names of features in all the bins occupied by roi
get_nearby_features(roi[, stranded]) | Return list of features in all the bins occupied by roi
get_overlapping_features(roi[, stranded]) | Return list of features overlapping roi.
update(features) | Add features to the GenomeHash
- get_nearby_feature_names(roi, stranded=True)
Return list of the names of features in all the bins occupied by roi
- Parameters
- roi : GenomicSegment or SegmentChain
Query feature
- stranded : bool
If True, retrieve only features on the same strand as the query feature. Otherwise, retrieve features on both strands
- Returns
- list<str>
Names of features near (within self.binsize distance from) roi
- Raises
- TypeError
if roi is not a GenomicSegment or SegmentChain
- get_nearby_features(roi, stranded=True)
Return list of features in all the bins occupied by roi
- Parameters
- roi : GenomicSegment or SegmentChain
Query feature
- stranded : bool
If True, retrieve only features on the same strand as the query feature. Otherwise, retrieve features on both strands
- Returns
- list
Features near (within self.binsize distance from) roi
- Raises
- TypeError
if roi is not a GenomicSegment or SegmentChain
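The distinction between "nearby" and "overlapping" features can be sketched as a coarse bin filter followed by an exact coordinate check. This is an illustration under stated assumptions, not plastid's code: features are `(name, start, end)` tuples on a single chromosome and strand, coordinates are half-open, and the helper names are hypothetical.

```python
BINSIZE = 20000  # documented GenomeHash default; used here only for illustration


def occupied_bins(start, end, binsize=BINSIZE):
    """Return the set of bins touched by the half-open interval [start, end)."""
    return set(range(start // binsize, (end - 1) // binsize + 1))


def nearby(roi, features, binsize=BINSIZE):
    """Coarse filter: features sharing at least one bin with roi."""
    roi_bins = occupied_bins(roi[0], roi[1], binsize)
    return [f for f in features if roi_bins & occupied_bins(f[1], f[2], binsize)]


def overlapping(roi, features, binsize=BINSIZE):
    """Exact filter: nearby features that also share at least one base with roi."""
    return [f for f in nearby(roi, features, binsize)
            if f[1] < roi[1] and roi[0] < f[2]]


features = [("geneA", 100, 500), ("geneB", 15000, 16000), ("geneC", 45000, 46000)]
roi = (50, 1000)  # (start, end)

near_names = [f[0] for f in nearby(roi, features)]
overlap_names = [f[0] for f in overlapping(roi, features)]
```

Here `geneB` counts as nearby because it shares a bin with the query, but it does not overlap the query coordinates, so only `geneA` survives the exact check.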
- get_overlapping_features(roi, stranded=True)
Return list of features overlapping roi.
- Parameters
- roi : GenomicSegment or SegmentChain
Query feature indicating region of interest
- stranded : bool
If True, retrieve only features on the same strand as the query feature. Otherwise, retrieve features on both strands
- Returns
- list
Features overlapping roi
- Raises
- TypeError
if roi is not a GenomicSegment or SegmentChain
- update(features)
Add features to the GenomeHash
- Parameters
- features : dict or list
dict or list of features, as SegmentChain objects or subclasses
- class plastid.genomics.genome_hash.TabixGenomeHash(*filenames, data_format='GTF2')
Bases: plastid.genomics.genome_hash.AbstractGenomeHash
Find features overlapping query regions in Tabix-indexed files.
- Parameters
- filenames : str or list of str
Filename or list of filenames of Tabix-compressed files
- data_format : str
Format of tabix-compressed file(s). Choices are: 'GTF2', 'GFF3', 'BED', 'PSL' (Default: 'GTF2')
- Attributes
- filenames : str or list
Name of file to open or list of filenames to open (NOT open filehandles)
- tabix_readers : list of pysam.Tabixfile
Pysam interfaces to underlying data files
Methods
get_overlapping_features(roi[, stranded]) | Return list of features overlapping roi
- get_overlapping_features(roi, stranded=True)
Return list of features overlapping roi
- Parameters
- roi : GenomicSegment or SegmentChain
Query feature indicating region of interest
- stranded : bool
If True, retrieve only features on the same strand as the query feature. Otherwise, retrieve features on both strands
- Returns
- list
Features that overlap roi
- Raises
- TypeError
if roi is not a GenomicSegment or SegmentChain