plastid.genomics.roitools module¶
This module contains classes for representing and manipulating genomic features.
Summary¶
Genomic features are represented as SegmentChains, which can contain zero or more continuous spans of the genome (GenomicSegments), as well as rich annotation data. For the specific case of RNA transcripts, a subclass of SegmentChain, called Transcript, is provided.
Module contents¶
|
Building block for |
|
Base class for genomic features, composed of zero or more |
|
Subclass of |
|
Construct |
|
Extend an annotated CDS region, if present, by three nucleotides at the threeprime end. |
Examples¶
SegmentChains may be read directly from annotation files using the readers in plastid.readers:
>>> from plastid import *
>>> chains = list(BED_Reader(open("some_file.bed")))
or constructed from GenomicSegments:
>>> seg1 = GenomicSegment("chrA", 5, 200, "-")
>>> seg2 = GenomicSegment("chrA", 250, 300, "-")
>>> my_chain = SegmentChain(seg1, seg2, ID="some_chain", ... , some_attribute="some_value")
SegmentChains contain convenience methods for a number of common tasks, for example:
- converting coordinates between the spliced space of the chain and the genome:

>>> # get coordinate of 50th position from 5' end
>>> my_chain.get_genomic_coordinate(50)
('chrA', 199, '-')

>>> # get coordinate of 49th position. splicing is taken care of!
>>> my_chain.get_genomic_coordinate(49)
('chrA', 250, '-')

>>> # get coordinate in chain corresponding to genomic coordinate 118
>>> my_chain.get_segmentchain_coordinate("chrA", 118, "-")
131

>>> # get a subchain containing positions 45-70
>>> subchain = my_chain.get_subchain(45, 70)
>>> subchain
<SegmentChain segments=2 bounds=chrA:180-255(-) name=some_chain_subchain>

>>> # the subchain preserves the discontinuity found in `my_chain`
>>> subchain.segments
[<GenomicSegment chrA:180-200 strand='-'>, <GenomicSegment chrA:250-255 strand='-'>]

- fetching numpy arrays of data at each position in the chain. The data is assumed to be kept in a GenomeArray:

>>> ga = BAMGenomeArray(["some_file.bam"], mapping=ThreePrimeMapFactory(offset=15))
>>> my_chain.get_counts(ga)
array([843, 854, 153, 86, 462, 359, 290, 38, 38, 758, 342, 299, 430,
       628, 324, 437, 231, 417, 536, 673, 243, 981, 661, 415, 207, 446,
       197, 520, 653, 468, 863, 3, 272, 754, 352, 960, 966, 913, 367, ... ])

- similarly, fetching spliced sequence, reverse-complemented if necessary for minus-strand features. As input, the SegmentChain expects a dictionary-like object mapping chromosome names to string-like sequences (e.g. as in BioPython or twobitreader):

>>> seqdict = { "chrA" : "TCTACATA ..." } # some string of chrA sequence
>>> my_chain.get_sequence(seqdict)
"ACTGTGTACTGTACGATCGATCGTACGTACGATCGATCGTACGTAGCTAGTCAGCTAGCTAGCTAGCTGA..."

- testing for overlap, containment, equality with other SegmentChains:

>>> other_chain = SegmentChain(GenomicSegment("chrA", 200, 300, "-"),
...                            GenomicSegment("chrA", 800, 900, "-"))
>>> my_chain.overlaps(other_chain)
True
>>> other_chain in my_chain
False
>>> my_chain in my_chain
True
>>> my_chain.covers(other_chain)
False
>>> my_chain == other_chain
False
>>> my_chain == my_chain
True

>>> my_chain.as_bed()
chrA 5 300 some_chain 0 - 5 5 0,0,0 2 195,50, 0,245,
>>> my_chain.as_gtf()
chrA . exon 6 200 . - . gene_id "gene_some_chain"; transcript_id "some_chain"; some_attribute "some_value"; ID "some_chain";
chrA . exon 251 300 . - . gene_id "gene_some_chain"; transcript_id "some_chain"; some_attribute "some_value"; ID "some_chain";
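The coordinate conversion in the examples above can be sketched in plain Python. This is only an illustration of the arithmetic, not plastid's implementation: the helper `chain_to_genome` and the representation of segments as `(start, end)` tuples are hypothetical, and segments are assumed to be non-overlapping, 0-indexed, and half-open.

```python
def chain_to_genome(segments, strand, x):
    """Map position x (0-indexed, counted from the 5' end of a chain)
    to a genomic coordinate.

    `segments` is a list of (start, end) tuples; this is a hypothetical
    stand-in for the GenomicSegments of a SegmentChain.
    """
    # Enumerate every genomic position covered by the chain, left to right.
    positions = [p for start, end in sorted(segments) for p in range(start, end)]
    if not 0 <= x < len(positions):
        raise IndexError("position %d is outside a chain of length %d" % (x, len(positions)))
    # On the minus strand, the 5' end is the rightmost genomic position.
    if strand == "-":
        positions.reverse()
    return positions[x]


segments = [(5, 200), (250, 300)]
print(chain_to_genome(segments, "-", 50))  # 199
print(chain_to_genome(segments, "-", 49))  # 250
```

Enumerating every position is wasteful for long features, but it keeps the effect of splicing and strand easy to see.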
- class plastid.genomics.roitools.GenomicSegment(chrom, start, end, strand)¶
Bases:
object
Building block for SegmentChain: a continuous segment of the genome defined by a chromosome name, start coordinate, end coordinate, and strand.
Examples
GenomicSegments sort lexically by chromosome, start position, end position, and finally strand:
>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrB", 0, 10, "+")
True
>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrA", 75, 100, "+")
True
>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrA", 55, 75, "+")
True
>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrA", 50, 150, "+")
True
>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrA", 50, 100, "-")
True
They also provide a few convenience methods for containment or overlap. To be contained, a segment must be on the same chromosome and strand as its container, and its coordinates must be within or equal to its endpoints:
>>> GenomicSegment("chrA", 50, 100, "+") in GenomicSegment("chrA", 25, 100, "+")
True
>>> GenomicSegment("chrA", 50, 100, "+") in GenomicSegment("chrA", 50, 100, "+")
True
>>> GenomicSegment("chrA", 50, 100, "+") in GenomicSegment("chrA", 25, 100, "-")
False
>>> GenomicSegment("chrA", 50, 100, "+") in GenomicSegment("chrA", 75, 200, "+")
False
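The ordering and containment rules above can be mimicked with a plain named tuple. The sketch below is an assumption-laden stand-in, not the plastid class: `Seg`, `contains`, and `overlaps` are hypothetical names, and coordinates are taken to be 0-indexed and half-open.

```python
from collections import namedtuple

# Hypothetical stand-in for GenomicSegment. Tuple comparison gives the
# lexical sort order: chromosome, start, end, then strand ("+" < "-").
Seg = namedtuple("Seg", ["chrom", "start", "end", "strand"])


def contains(outer, inner):
    """True if every position of `inner` lies in `outer`,
    on the same chromosome and strand."""
    return (outer.chrom == inner.chrom
            and outer.strand == inner.strand
            and outer.start <= inner.start
            and inner.end <= outer.end)


def overlaps(a, b):
    """True if `a` and `b` share a chromosome, a strand,
    and at least one position."""
    return (a.chrom == b.chrom
            and a.strand == b.strand
            and a.start < b.end
            and b.start < a.end)


query = Seg("chrA", 50, 100, "+")
print(contains(Seg("chrA", 25, 100, "+"), query))  # True
print(contains(Seg("chrA", 25, 100, "-"), query))  # False
print(overlaps(Seg("chrA", 75, 200, "+"), query))  # True
```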
Similarly, to overlap, GenomicSegments must be on the same strand and chromosome.
- Attributes
- chrom : str
Chromosome where GenomicSegment resides
- start : int
Zero-indexed (Pythonic) start coordinate of GenomicSegment
- end : int
Zero-indexed, half-open (Pythonic) end coordinate of GenomicSegment
- strand : str
Strand of GenomicSegment
Methods
| as_igv_str(self) | Format as an IGV location string |
| contains(self, GenomicSegment other) | Test whether this segment contains other, where containment is defined as all positions in other being present in self, when both self and other share the same chromosome and strand. |
| from_igv_str(unicode loc_str, unicode strand=u'.') | Construct GenomicSegment from IGV location string |
| from_str(unicode inp) | Construct a GenomicSegment from its str() representation |
| overlaps(self, GenomicSegment other) | Test whether this segment overlaps other, where overlap is defined as sharing: a chromosome, a strand, and a subset of coordinates. |
- as_igv_str(self) unicode ¶
Format as an IGV location string
- contains(self, GenomicSegment other) bool ¶
Test whether this segment contains other, where containment is defined as all positions in other being present in self, when both self and other share the same chromosome and strand.
- Parameters
- other : GenomicSegment
Query segment
- Returns
- bool
- static from_igv_str(unicode loc_str, unicode strand=u'.')¶
Construct GenomicSegment from IGV location string
- Parameters
- loc_str : str
IGV location string, in format ‘chromosome:start-end’, where start and end are 1-indexed and half-open
- strand : str
The chromosome strand (‘+’, ‘-’, or ‘.’)
- Returns
- static from_str(unicode inp)¶
Construct a GenomicSegment from its str() representation
- Parameters
- inp : str
String representation of GenomicSegment as chrom:start-end(strand) where start and end are in 0-indexed, half-open coordinates
- Returns
- overlaps(self, GenomicSegment other) bool ¶
Test whether this segment overlaps other, where overlap is defined as sharing: a chromosome, a strand, and a subset of coordinates.
- Parameters
- other : GenomicSegment
Query segment
- Returns
- bool
- c_strand¶
- chrom¶
Chromosome where
GenomicSegment
resides
- end¶
Zero-indexed, half-open (Pythonic) end coordinate of
GenomicSegment
- start¶
Zero-indexed (Pythonic) start coordinate of
GenomicSegment
- strand¶
Strand of
GenomicSegment
‘+’ for forward / Watson strand
‘-’ for reverse / Crick strand
‘.’ for unstranded / both strands
- class plastid.genomics.roitools.SegmentChain(*segments, **attributes)¶
Bases:
object
Base class for genomic features, composed of zero or more GenomicSegments. SegmentChains can therefore model discontinuous features, such as multi-exon transcripts or gapped alignments, in addition to continuous features.
Numerous convenience functions are supplied for:
- converting between coordinates relative to the genome and relative to the internal coordinates of a spliced SegmentChain
- fetching genomic sequence, read alignments, or count data, accounting for splicing of the segments, and, in the case of reverse-strand features, reverse-complementing
- slicing or fetching sub-regions of a SegmentChain
- testing equality, inequality, overlap, containment, coverage of, or sharing of segments with other SegmentChain or GenomicSegment objects
- import/export to BED, PSL, GTF2, and GFF3 formats, for use in other software packages or in a genome browser.
Intervals are sorted from lowest to greatest starting coordinate on their reference sequence, regardless of strand. Iteration over the SegmentChain will yield intervals from left to right in the genome.
- Parameters
- *segments : GenomicSegment
0 or more GenomicSegments on the same strand
- **attr : keyword arguments
Arbitrary attributes, including, for example:
| Attribute | Description |
| type | A feature type used for GTF2/GFF3 export of each interval in the SegmentChain. (Default: ‘exon’) |
| ID | A unique ID for the SegmentChain. |
| transcript_id | A transcript ID used for GTF2 export |
| gene_id | A gene ID used for GTF2 export |
See also
Transcript
Transcript subclass, additionally providing richer GTF2, GFF3, and BED export, as well as methods for fetching coding regions and UTRs as subsegments
- Attributes
- spanning_segment : GenomicSegment
A GenomicSegment spanning the endpoints of the SegmentChain
- strand : str
Strand of the SegmentChain
- chrom : str
Chromosome the SegmentChain resides on
- attr : dict
- segments : list
Copy of list of GenomicSegments that comprise self.
- mask_segments : list
Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.
Methods
| add_masks(self, *mask_segments) | Adds one or more GenomicSegment to the collection of masks. |
| add_segments(self, *segments) | Add 1 or more GenomicSegments to the SegmentChain. |
| antisense_overlaps(self, other) | Returns True if self and other share genomic positions on opposite strands |
| as_bed(self[, thickstart, thickend, as_int, ...]) | Format SegmentChain as a string of BED12[+X] output. |
| as_gff3(self, unicode feature_type=None, ...) | Format self as a line of GFF3 output. |
| as_gtf(self, unicode feature_type=None, ...) | Format SegmentChain as a block of GTF2 output. |
| as_psl(self) | Formats SegmentChain as PSL (blat) output. |
| covers(self, other) | Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self. |
| from_bed(unicode line[, extra_columns]) | Create a SegmentChain from a line from a BED file. |
| from_psl(psl_line) | Create a SegmentChain from a line from a PSL (BLAT) file |
| from_str(unicode inp) | Create a SegmentChain from a string formatted by SegmentChain.__str__() |
| get_antisense(self) | Returns a SegmentChain antisense to self, with empty attr dict. |
| get_counts(self, ga[, stranded]) | Return array of counts or values drawn from ga at each position in self |
| get_fasta(self, genome[, stranded]) | Formats sequence of SegmentChain as FASTA output |
| get_gene(self) | Return name of gene associated with SegmentChain, if any, by searching through self.attr for the keys gene_id and Parent. |
| get_genomic_coordinate(self, x[, stranded]) | Finds genomic coordinate corresponding to position x in self |
| get_junctions(self) | Returns a list of GenomicSegments representing spaces between the GenomicSegments in self. In the case of a transcript, these would represent introns. |
| get_masked_counts(self, ga[, stranded, copy]) | Return counts covering self in dataset ga as a masked array, in transcript coordinates. |
| get_masked_position_set(self) | Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using SegmentChain.add_masks() |
| get_masks(self) | Return masked positions as a list of GenomicSegments |
| get_masks_as_segmentchain(self) | Return masked positions as a SegmentChain |
| get_name(self) | Returns the name of this SegmentChain, first searching through self.attr for the keys ID, Name, and name. |
| get_position_list(self) | Retrieve a sorted end-inclusive numpy array of genomic coordinates in this SegmentChain |
| get_position_set(self) | Retrieve an end-inclusive set of genomic coordinates included in this SegmentChain |
| get_segmentchain_coordinate(self, ...) | Finds the SegmentChain coordinate corresponding to a genomic position |
| get_sequence(self, genome[, stranded]) | Return spliced genomic sequence of SegmentChain as a string |
| get_subchain(self, long start, long end, ...) | Retrieves a sub-SegmentChain corresponding to a range of positions specified in coordinates relative to this SegmentChain. |
| get_unstranded(self) | Returns an unstranded copy of self, with empty attr dict. |
| next(self) | Return next GenomicSegment in the SegmentChain, from left to right on the chromosome |
| overlaps(self, other) | Return True if self and other share genomic positions on the same strand |
| reset_masks(self) | Removes masks added by add_masks() |
| shares_segments_with(self, other) | Returns a list of GenomicSegment that are shared between self and other |
| sort(self) | |
| unstranded_overlaps(self, other) | Return True if self and other share genomic positions on the same chromosome, regardless of their strands |
- add_masks(self, *mask_segments)¶
Adds one or more GenomicSegment to the collection of masks. Masks will be trimmed to the positions of the SegmentChain during addition.
- Parameters
- mask_segments : GenomicSegment
One or more segments, in genomic coordinates, covering positions to exclude from return values of get_masked_position_set(), get_masked_counts(), or get_masked_length()
- add_segments(self, *segments)¶
Add 1 or more GenomicSegments to the SegmentChain. If there are already segments in the chain, the incoming segments must be on the same strand and chromosome as all others present.
- Parameters
- segments : GenomicSegment
One or more GenomicSegment to add to SegmentChain
- antisense_overlaps(self, other)¶
Returns True if self and other share genomic positions on opposite strands
- Parameters
- other : SegmentChain or GenomicSegment
Query feature
- Returns
- bool
True if self and other share genomic positions on the same chromosome but opposite strand; False otherwise.
- Raises
- TypeError
if other is not a GenomicSegment or SegmentChain
- as_bed(self, thickstart=None, thickend=None, as_int=True, color=None, extra_columns=None, empty_value='')¶
Format SegmentChain as a string of BED12[+X] output.
If the SegmentChain was imported as a BED file with extra columns, these will be output in the same order, after the BED columns.
- Parameters
- thickstart : int or None, optional
If not None, overrides the genome coordinate that starts thick plotting in genome browser found in self.attr[‘thickstart’]
- thickend : int or None, optional
If not None, overrides the genome coordinate that stops thick plotting in genome browser found in self.attr[‘thickend’]
- as_int : bool, optional
Force score to integer (Default: True)
- color : str or None, optional
Color represented as RGB hex string. If not None, overrides the color in self.attr[‘color’]
- extra_columns : None or list-like, optional
If None, and the SegmentChain was imported using the extra_columns keyword of from_bed(), the SegmentChain will be exported in BED 12+X format, in which extra columns are in the same order as they were upon import. If no extra columns were present, the SegmentChain will be exported as a BED12 line.
If a list of attribute names, these attributes will be exported as extra columns in order, overriding whatever happened upon import. If an attribute name is not in the attr dict of the SegmentChain, it will be exported with the value of empty_value.
If an empty list, no extra columns will be exported; the SegmentChain will be formatted as a BED12 line.
- empty_value : str, optional
Value to export for extra_columns that are not defined (Default: “”)
- Returns
- str
Line of BED12[+X]-formatted text
Notes
- BED12 columns are as follows:
| Column | Contains |
| 1 | Contig or chromosome |
| 2 | Start of first block in feature (0-indexed) |
| 3 | End of last block in feature (half-open) |
| 4 | Feature name |
| 5 | Feature score |
| 6 | Strand |
| 7 | thickstart (in chromosomal coordinates) |
| 8 | thickend (in chromosomal coordinates) |
| 9 | Feature color as RGB tuple |
| 10 | Number of blocks in feature |
| 11 | Block lengths |
| 12 | Block starts, relative to start of first block |
- For more details
See the UCSC file format faq
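The way the twelve columns relate to a chain's segments can be sketched as follows. This is a minimal illustration rather than plastid's code: `to_bed12` is a hypothetical helper, segments are plain `(start, end)` tuples, and thickstart and thickend simply default to the chain start, which is what the example output in the Examples section shows for a chain without a coding region.

```python
def to_bed12(chrom, strand, segments, name, score=0, color="0,0,0"):
    """Format sorted, half-open (start, end) segments as one BED12 line."""
    segments = sorted(segments)
    chain_start = segments[0][0]   # column 2: start of first block
    chain_end = segments[-1][1]    # column 3: end of last block
    # Columns 11 and 12 are comma-terminated lists, one entry per block.
    block_sizes = "".join("%d," % (end - start) for start, end in segments)
    block_starts = "".join("%d," % (start - chain_start) for start, _ in segments)
    fields = [chrom, chain_start, chain_end, name, score, strand,
              chain_start, chain_start,  # thickstart, thickend (assumed defaults)
              color, len(segments), block_sizes, block_starts]
    return "\t".join(str(field) for field in fields)


print(to_bed12("chrA", "-", [(5, 200), (250, 300)], "some_chain"))
```

For the two-segment chain used in the Examples section, the block lengths come out as `195,50,` and the block starts as `0,245,`.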
- as_gff3(self, unicode feature_type=None, bool escape=True, list excludes=None)¶
Format self as a line of GFF3 output.
Because GFF3 files permit many schemas of parent-child hierarchy, and in order to reduce confusion and overhead, attempts to export a multi-interval SegmentChain will raise an AttributeError.
Instead, users may export the individual features from which the multi-interval SegmentChain was constructed, or construct features for them, setting ID, Parent, and type attributes following their own conventions.
- Parameters
- feature_type : str
If not None, overrides the type attribute of self.attr
- escape : bool, optional
Escape tokens in column 9 of GFF3 output (Default: True)
- excludes : list, optional
List of attribute key names to exclude from column 9 (Default: [])
- Returns
- str
Line of GFF3-formatted text
- Raises
- AttributeError
if the SegmentChain has multiple intervals
Notes
- Columns of GFF3 are as follows
| Column | Contains |
| 1 | Contig or chromosome |
| 2 | Source of annotation |
| 3 | Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”) |
| 4 | Start (1-indexed) |
| 5 | End (fully-closed) |
| 6 | Score |
| 7 | Strand |
| 8 | Frame. Number of bases within feature before first in-frame codon (if coding) |
| 9 | Attributes |
- For further information, see
- as_gtf(self, unicode feature_type=None, bool escape=True, list excludes=None)¶
Format SegmentChain as a block of GTF2 output.
The frame or phase attribute (GTF2 column 8) is valid only for ‘CDS’ features, and, if not present in self.attr, is calculated assuming the SegmentChain contains the entire coding region. If the SegmentChain contains multiple intervals, the frame or phase attribute will always be recalculated.
All attributes in self.attr, except those created upon import, will be propagated to all of the features that are generated.
- Parameters
- feature_type : str
If not None, overrides the “type” attribute of self.attr
- escape : bool, optional
Escape tokens in column 9 of GTF output (Default: True)
- excludes : list, optional
List of attribute key names to exclude from column 9 (Default: [])
- Returns
- str
Block of GTF2-formatted text
Notes
- gene_id and transcript_id are required
The GTF2 specification requires that attributes gene_id and transcript_id be defined. If these are not present in self.attr, their values will be guessed following the rules in SegmentChain.get_gene() and SegmentChain.get_name(), respectively.
- Beware of attribute loss
To save memory, only the attributes shared by all of the individual sub-features (e.g. exons) that were used to assemble this Transcript have been stored in self.attr. This means that upon re-export to GTF2, these sub-features will be lacking any attributes that were specific to them individually. Formally, this is compliant with the GTF2 specification, which states explicitly that only the attributes gene_id and transcript_id are supported.
- Columns of GTF2 are as follows
| Column | Contains |
| 1 | Contig or chromosome |
| 2 | Source of annotation |
| 3 | Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”) |
| 4 | Start (1-indexed) |
| 5 | End (fully-closed) |
| 6 | Score |
| 7 | Strand |
| 8 | Frame. Number of bases within feature before first in-frame codon (if coding) |
| 9 | Attributes. “gene_id” and “transcript_id” are required |
- For more info
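Columns 4 and 5 use 1-indexed, fully-closed coordinates, whereas segments are 0-indexed and half-open, so only the start needs shifting on export. The sketch below illustrates that conversion under stated assumptions: `gtf_exon_lines` is a hypothetical helper, segments are `(start, end)` tuples, and score and frame are left as `.`.

```python
def gtf_exon_lines(chrom, strand, segments, gene_id, transcript_id,
                   source=".", feature="exon"):
    """Emit one GTF2 line per segment, converting coordinates on the way."""
    attributes = 'gene_id "%s"; transcript_id "%s";' % (gene_id, transcript_id)
    lines = []
    for start, end in sorted(segments):
        # 0-indexed half-open [start, end) becomes 1-indexed closed [start + 1, end]
        fields = [chrom, source, feature, start + 1, end, ".", strand, ".", attributes]
        lines.append("\t".join(str(field) for field in fields))
    return "\n".join(lines)


print(gtf_exon_lines("chrA", "-", [(5, 200), (250, 300)],
                     "gene_some_chain", "some_chain"))
```

The segments `chrA:5-200` and `chrA:250-300` therefore appear as `6 200` and `251 300`, matching the `as_gtf()` output in the Examples section.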
- as_psl(self)¶
Formats SegmentChain as PSL (blat) output.
- Returns
- str
PSL-representation of BLAT alignment
- Raises
- AttributeError
If not all of the attributes listed in the Notes below are defined
Notes
This will raise an AttributeError unless the following keys are present and defined in self.attr, corresponding to the columns of a PSL file:
| Column | Key |
| 1 | match_length |
| 2 | mismatches |
| 3 | rep_matches |
| 4 | N |
| 5 | query_gap_count |
| 6 | query_gap_bases |
| 7 | target_gap_count |
| 8 | target_gap_bases |
| 9 | strand |
| 10 | query_name |
| 11 | query_length |
| 12 | query_start |
| 13 | query_end |
| 14 | target_name |
| 15 | target_length |
| 16 | target_start |
| 17 | target_end |
| 19 | q_starts : list of integers |
| 20 | l_starts : list of integers |
These keys are defined only if the SegmentChain was created by SegmentChain.from_psl(), or if the user has defined them.
See the PSL spec for more information.
- covers(self, other)¶
Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self. By convention, zero-length SegmentChains are not covered by other chains.
- Parameters
- other : SegmentChain or GenomicSegment
Query feature
- Returns
- bool
True if self and other share a chromosome and strand, and all genomic positions in other are present in self. Otherwise False
- Raises
- TypeError
if other is not a GenomicSegment or SegmentChain
- static from_bed(unicode line, extra_columns=0)¶
Create a SegmentChain from a line from a BED file. The BED line may contain 4 to 12 columns, per the specification. These will be auto-detected and parsed appropriately.
See the UCSC file format faq for more details.
- Parameters
- line : str
Line from a BED file, containing 4 or more columns
- extra_columns : int or list, optional
Extra, non-BED columns in an Extended BED format file, corresponding to feature attributes. This is common in ENCODE-specific BED variants.
If extra_columns is:
- an int: it is taken to be the number of attribute columns. Attributes will be stored in the attr dictionary of the SegmentChain, under names like custom0, custom1, … , customN.
- a list of str: it is taken to be the names of the attribute columns, in order, from left to right in the file. In this case, attributes in extra columns will be stored under their respective names in the attr dict.
- a list of tuple: each tuple is taken to be a pair of (attribute_name, formatter_func). In this case, the value of attribute_name in the attr dict of the SegmentChain will be set to formatter_func(column_value).
(Default: 0)
- Returns
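The three accepted forms of extra_columns can be illustrated with a small stand-alone sketch. It is not the parser plastid uses: `parse_extra_columns` is a hypothetical helper that receives only the values found after the standard BED columns, and values without a formatter are assumed to be kept as strings.

```python
def parse_extra_columns(values, extra_columns=0):
    """Turn the trailing column values of a BED line into an attribute dict."""
    if isinstance(extra_columns, int):
        # An int gives the number of columns; names are generated.
        names = ["custom%d" % i for i in range(extra_columns)]
        formatters = [str] * extra_columns
    else:
        names, formatters = [], []
        for item in extra_columns:
            if isinstance(item, tuple):
                # (attribute_name, formatter_func) pair
                name, formatter = item
            else:
                # bare attribute name; value kept as a string (assumption)
                name, formatter = item, str
            names.append(name)
            formatters.append(formatter)
    return {name: formatter(value)
            for name, formatter, value in zip(names, formatters, values)}


print(parse_extra_columns(["YFG1", "3"], 2))
print(parse_extra_columns(["YFG1", "3"], ["gene", "n_reads"]))
print(parse_extra_columns(["YFG1", "3"], [("gene", str), ("n_reads", int)]))
```

The column names `gene` and `n_reads` are made up for the example.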
- static from_psl(psl_line)¶
Create a SegmentChain from a line from a PSL (BLAT) file
See the PSL spec
- Parameters
- psl_line : str
Line from a PSL file
- Returns
- static from_str(unicode inp)¶
Create a SegmentChain from a string formatted by SegmentChain.__str__():
chrom:start-end^start-end(strand)
where ‘^’ indicates a splice junction between regions specified by start and end and strand is ‘+’, ‘-’, or ‘.’. Coordinates are 0-indexed and half-open.
- Parameters
- inp : str
String formatted in manner of SegmentChain.__str__()
- Returns
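The string format described above is simple enough to parse with a regular expression. The following is a sketch only: `parse_chain_str` and `CHAIN_PATTERN` are hypothetical names, the function returns plain tuples rather than a SegmentChain, and it assumes at least one segment is present.

```python
import re

# chrom:start-end^start-end(strand), with one or more start-end spans
CHAIN_PATTERN = re.compile(
    r"^(?P<chrom>.+):(?P<spans>\d+-\d+(?:\^\d+-\d+)*)\((?P<strand>[+.-])\)$"
)


def parse_chain_str(text):
    """Return (chrom, [(start, end), ...], strand) parsed from `text`."""
    match = CHAIN_PATTERN.match(text)
    if match is None:
        raise ValueError("cannot parse %r as a chain string" % text)
    segments = []
    for span in match.group("spans").split("^"):  # '^' marks a splice junction
        start, end = span.split("-")
        segments.append((int(start), int(end)))
    return match.group("chrom"), segments, match.group("strand")


print(parse_chain_str("chrA:5-200^250-300(-)"))
# ('chrA', [(5, 200), (250, 300)], '-')
```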
- get_antisense(self) SegmentChain ¶
Returns a SegmentChain antisense to self, with empty attr dict.
- Returns
- SegmentChain
SegmentChain antisense to self
- get_counts(self, ga, stranded=True)¶
Return array of counts or values drawn from ga at each position in self
- Parameters
- ga
GenomeArray from which to fetch counts
- stranded : bool, optional
If True and self is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)
- Returns
- numpy.ndarray
Array of counts from ga covering self
- get_fasta(self, genome, stranded=True)¶
Formats sequence of SegmentChain as FASTA output
- Parameters
- genome : dict or twobitreader.TwoBitFile
Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects
- stranded : bool
If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)
- Returns
- str
FASTA-formatted sequence of SegmentChain extracted from genome
- get_gene(self)¶
Return name of gene associated with SegmentChain, if any, by searching through self.attr for the keys gene_id and Parent. If one is not found, a generated gene name for the SegmentChain is made from get_name().
- Returns
- str
Returns, in order of preference, gene_id from self.attr, Parent from self.attr, or 'gene_%s' % self.get_name()
- get_genomic_coordinate(self, x, stranded=True)¶
Finds genomic coordinate corresponding to position x in self
- Parameters
- x : int
position of interest, relative to SegmentChain
- stranded : bool, optional
If True, x is assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, coordinates are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)
- Returns
- str
Chromosome name
- long
Genomic coordinate corresponding to position x
- str
Chromosome strand (‘+’, ‘-’, or ‘.’)
- Raises
- IndexError
if x is outside the bounds of the SegmentChain
- get_junctions(self)¶
Returns a list of GenomicSegments representing spaces between the GenomicSegments in self. In the case of a transcript, these would represent introns. In the case of an alignment, these would represent gaps in the query compared to the reference.
- Returns
- list
List of GenomicSegments covering spaces between the intervals in self (e.g. introns in the case of a transcript, or gaps in the case of an alignment)
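Because segments are half-open and sorted from left to right, each junction runs from the end of one segment to the start of the next. A minimal sketch, using a hypothetical `junctions` helper on `(start, end)` tuples instead of GenomicSegments:

```python
def junctions(segments):
    """Return the half-open gaps between consecutive sorted segments."""
    segments = sorted(segments)
    return [(left_end, right_start)
            for (_, left_end), (right_start, _) in zip(segments, segments[1:])]


print(junctions([(5, 200), (250, 300)]))  # [(200, 250)]
print(junctions([(5, 200)]))              # []
```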
- get_masked_counts(self, ga, stranded=True, copy=False)¶
Return counts covering self in dataset ga as a masked array, in transcript coordinates. Positions masked by SegmentChain.add_masks() will be masked in the array
- Parameters
- ga : non-abstract subclass of AbstractGenomeArray
GenomeArray from which to fetch counts
- stranded : bool, optional
If True and the SegmentChain is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)
- copy : bool, optional
If False (default) returns a view of the data; so changing values in the view changes the values in the GenomeArray if it is mutable. If True, a copy is returned instead.
- Returns
numpy.ma.masked_array
- get_masked_position_set(self) set ¶
Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using
SegmentChain.add_masks()
- Returns
- set
Set of genomic coordinates, as integers
- get_masks(self)¶
Return masked positions as a list of
GenomicSegments
- Returns
- list
list of
GenomicSegments
representing masked positions
- get_masks_as_segmentchain(self)¶
Return masked positions as a
SegmentChain
- Returns
SegmentChain
Masked positions
- get_name(self)¶
Returns the name of this SegmentChain, first searching through self.attr for the keys ID, Name, and name. If no value is found for any of those keys, a name is generated using SegmentChain.__str__()
- Returns
- str
In order of preference, ID from self.attr, Name from self.attr, name from self.attr, or str(self)
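The fallback orders documented for get_gene() and get_name() amount to a lookup through a short list of keys. The sketch below restates them on a plain dict; the functions `lookup_name` and `lookup_gene` and the `fallback` argument (standing in for str(self)) are hypothetical and ignore any special handling plastid may apply to attribute values.

```python
def lookup_name(attr, fallback):
    """ID, then Name, then name, then the string form of the chain."""
    for key in ("ID", "Name", "name"):
        if key in attr:
            return attr[key]
    return fallback


def lookup_gene(attr, fallback):
    """gene_id, then Parent, then a name generated from lookup_name()."""
    for key in ("gene_id", "Parent"):
        if key in attr:
            return attr[key]
    return "gene_%s" % lookup_name(attr, fallback)


attr = {"ID": "some_chain", "some_attribute": "some_value"}
print(lookup_name(attr, "chrA:5-200^250-300(-)"))  # some_chain
print(lookup_gene(attr, "chrA:5-200^250-300(-)"))  # gene_some_chain
```

This is why the GTF2 output in the Examples section carries `gene_id "gene_some_chain"` even though no gene_id was supplied.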
- get_position_list(self)¶
Retrieve a sorted end-inclusive numpy array of genomic coordinates in this
SegmentChain
- Returns
- list
Genomic coordinates in self, as integers, in genomic order
- get_position_set(self)¶
Retrieve an end-inclusive set of genomic coordinates included in this
SegmentChain
- Returns
- set
Set of genomic coordinates, as integers
- get_segmentchain_coordinate(self, unicode chrom, long genomic_x, unicode strand, bool stranded=True)¶
Finds the SegmentChain coordinate corresponding to a genomic position
- Parameters
- chrom : str
Chromosome name
- genomic_x : int
coordinate, in genomic space
- strand : str
Chromosome strand (‘+’, ‘-’, or ‘.’)
- stranded : bool, optional
If True, coordinates are given in stranded space (i.e. from 5’ end of chain, as one might expect for a transcript). If False, coordinates are given from the left end of self, regardless of strand. (Default: True)
- Returns
- int
Position in SegmentChain
- Raises
- KeyError
if position outside bounds of SegmentChain
- get_sequence(self, genome, stranded=True)¶
Return spliced genomic sequence of SegmentChain as a string
- Parameters
- genome : dict or twobitreader.TwoBitFile
Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects
- stranded : bool
If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)
- Returns
- str
Nucleotide sequence of the SegmentChain extracted from genome
- get_subchain(self, long start, long end, bool stranded=True, **extra_attr)¶
Retrieves a sub-SegmentChain corresponding to a range of positions specified in coordinates relative to this SegmentChain. Attributes in self.attr are copied to the child SegmentChain, with the exception of ID, to which the suffix ‘subchain’ is appended.
- Parameters
- start : int
position of interest in SegmentChain coordinates, 0-indexed
- end : int
position of interest in SegmentChain coordinates, 0-indexed and half-open
- stranded : bool, optional
If True, start and end are assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, they are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)
- extra_attr : keyword arguments
Values that will be included in the subchain’s attr dict. These can be used to overwrite values already present.
- Returns
SegmentChain
covering parent chain positions start to end of self
- Raises
- IndexError
if start or end is outside the bounds of the SegmentChain
- TypeError
if start or end is None
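How a range in chain coordinates maps back onto genomic segments can be sketched without plastid. The helper `subchain_segments` below is hypothetical; it works on `(start, end)` tuples, assumes the parent segments do not touch or overlap, and returns only coordinates, not a SegmentChain with attributes.

```python
def subchain_segments(segments, strand, start, end):
    """Genomic segments covering chain positions [start, end),
    counted from the 5' end of the chain."""
    positions = [p for s, e in sorted(segments) for p in range(s, e)]
    if not 0 <= start <= end <= len(positions):
        raise IndexError("range %d-%d is outside the chain" % (start, end))
    if strand == "-":
        positions.reverse()  # 5' end is the rightmost position
    selected = sorted(positions[start:end])
    # Merge runs of consecutive positions back into half-open segments.
    merged = []
    for p in selected:
        if merged and merged[-1][1] == p:
            merged[-1][1] = p + 1
        else:
            merged.append([p, p + 1])
    return [tuple(m) for m in merged]


print(subchain_segments([(5, 200), (250, 300)], "-", 45, 70))
# [(180, 200), (250, 255)]
```

The result reproduces the segments of the subchain shown in the Examples section, including the preserved discontinuity.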
- get_unstranded(self) SegmentChain ¶
Returns an unstranded copy of self, with empty attr dict.
- Returns
- SegmentChain
Unstranded copy of self
- next(self)¶
Return next GenomicSegment in the SegmentChain, from left to right on the chromosome
- overlaps(self, other)¶
Return True if self and other share genomic positions on the same strand
- Parameters
- other : SegmentChain or GenomicSegment
Query feature
- Returns
- bool
True if self and other share genomic positions on the same chromosome and strand; False otherwise.
- Raises
- TypeError
if other is not a GenomicSegment or SegmentChain
- reset_masks(self)¶
Removes masks added by add_masks()
See also
- shares_segments_with(self, other)¶
Returns a list of GenomicSegment that are shared between self and other
- Parameters
- other : SegmentChain or GenomicSegment
Query feature
- Returns
- list
List of GenomicSegments common to self and other
- Raises
- TypeError
if other is not a GenomicSegment or SegmentChain
- sort(self)¶
- unstranded_overlaps(self, other)¶
Return True if self and other share genomic positions on the same chromosome, regardless of their strands
- Parameters
- other : SegmentChain or GenomicSegment
Query feature
- Returns
- bool
True if self and other share genomic positions on the same chromosome, False otherwise. Strands of self and other need not match
- Raises
- TypeError
if other is not a GenomicSegment or SegmentChain
- attr¶
attr: dict
- c_strand¶
- chrom¶
Chromosome the SegmentChain resides on
- length¶
- mask_segments¶
Copy of list of
GenomicSegments
representing regions masked in self. Changing this list will do nothing to the masks in self.
- masked_length¶
- segments¶
Copy of list of
GenomicSegments
that comprise self. Changing this list will do nothing to self.
- spanning_segment¶
- strand¶
Strand of the SegmentChain
- class plastid.genomics.roitools.Transcript(*segments, **attributes)¶
Bases:
plastid.genomics.roitools.SegmentChain
Subclass of SegmentChain specifically for RNA transcripts. In addition to coordinate-conversion, count fetching, sequence fetching, and various other methods inherited from SegmentChain, Transcript provides convenience methods for fetching sub-chains corresponding to CDS features, 5’ UTRs, and 3’ UTRs.
- Parameters
- *segments : GenomicSegment
0 or more GenomicSegments on the same strand
- **attr : keyword arguments
Arbitrary attributes, including, for example:
| Attribute | Description |
| cds_genome_start | Location of CDS start, in genomic coordinates |
| cds_genome_end | Location of CDS end, in genomic coordinates |
| ID | A unique ID for the SegmentChain. |
| transcript_id | A transcript ID used for GTF2 export |
| gene_id | A gene ID used for GTF2 export |
- Attributes
- cds_genome_start : int or None
Starting coordinate of coding region, relative to genome (i.e.
- cds_genome_end : int or None
Ending coordinate of coding region, relative to genome (i.e.
- cds_start : int or None
Start of coding region relative to 5’ end of transcript, in direction of transcript.
- cds_end : int or None
End of coding region relative to 5’ end of transcript, in direction of transcript.
- spanning_segment : GenomicSegment
A GenomicSegment spanning the endpoints of the Transcript
- strand : str
Strand of the SegmentChain
- chrom : str
Chromosome the SegmentChain resides on
- segments : list
Copy of list of GenomicSegments that comprise self.
- mask_segments : list
Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.
- attr : dict
Methods
add_masks
(self, *mask_segments)Adds one or more
GenomicSegment
to the collection of masks.add_segments
(self, *segments)Add 1 or more
GenomicSegments
to theSegmentChain
.antisense_overlaps
(self, other)Returns True if self and other share genomic positions on opposite strands
as_bed
(self[, as_int, color, extra_columns, ...])Format self as a BED12[+X] line, assigning CDS boundaries to the thickstart and thickend columns from self.attr
as_gff3
(self, bool escape=True, ...)Format a
Transcript
as a block of GFF3 output, following the schema set out in the Sequence Ontology (SO) v2.53as_gtf
(self, unicode feature_type=u, ...)Format self as a GTF2 block.
as_psl
(self)Formats
SegmentChain
as PSL (blat) output.covers
(self, other)Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self.
from_bed
(unicode line[, extra_columns])Create a
Transcript
from a BED line with 4 or more columns.from_psl
(unicode psl_line)from_str
(unicode inp)Create a
SegmentChain
from a string formatted bySegmentChain.__str__()
:get_antisense
(self)Returns an
SegmentChain
antisense to self, with empty attr dict.get_cds
(self, **extra_attr)Retrieve
SegmentChain
covering the coding region of self, including the stop codon.get_counts
(self, ga[, stranded])Return list of counts or values drawn from ga at each position in self
get_fasta
(self, genome[, stranded])Formats sequence of SegmentChain as FASTA output
get_gene
(self)Return name of gene associated with
SegmentChain
, if any, by searching through self.attr for the keys gene_id and Parent.get_genomic_coordinate
(self, x[, stranded])Finds genomic coordinate corresponding to position x in self
get_junctions
(self)Returns a list of
GenomicSegments
representing spaces between theGenomicSegments
in self. In the case of a transcript, these would represent introns.
get_masked_counts
(self, ga[, stranded, copy])Return counts covering self in dataset ga as a masked array, in transcript coordinates.
get_masked_position_set
(self)Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using
SegmentChain.add_masks()
get_masks
(self)Return masked positions as a list of
GenomicSegments
get_masks_as_segmentchain
(self)Return masked positions as a
SegmentChain
get_name
(self)Return the name of self, first searching through self.attr for the keys transcript_id, ID, Name, and name.
get_position_list
(self)Retrieve a sorted end-inclusive numpy array of genomic coordinates in this
SegmentChain
get_position_set
(self)Retrieve an end-inclusive set of genomic coordinates included in this
SegmentChain
get_segmentchain_coordinate
(self, ...)Finds the
SegmentChain
coordinate corresponding to a genomic positionget_sequence
(self, genome[, stranded])Return spliced genomic sequence of
SegmentChain
as a stringget_subchain
(self, long start, long end, ...)Retrieves a sub-
SegmentChain
corresponding to a range of positions specified in coordinates relative to this SegmentChain
.get_unstranded
(self)Returns an unstranded
SegmentChain
copy of self (strand ‘.’), with empty attr dict.
get_utr3
(self, **extra_attr)Retrieve sub-
SegmentChain
covering 3'UTR of self, excluding the stop codon.get_utr5
(self, **extra_attr)Retrieve sub-
SegmentChain
covering 5'UTR of self.next
(self)Return next
GenomicSegment
in theSegmentChain
, from left to right on the chromosome
overlaps
(self, other)Return True if self and other share genomic positions on the same strand
reset_masks
(self)Removes masks added by
add_masks()
shares_segments_with
(self, other)Returns a list of
GenomicSegment
that are shared between self and othersort
(self)unstranded_overlaps
(self, other)Return True if self and other share genomic positions on the same chromosome, regardless of their strands
- add_masks(self, *mask_segments)¶
Adds one or more
GenomicSegment
to the collection of masks. Masks will be trimmed to the positions of theSegmentChain
during addition.- Parameters
- mask_segments
GenomicSegment
One or more segments, in genomic coordinates, covering positions to exclude from return values of
get_masked_position_set()
,get_masked_counts()
, orget_masked_length()
- add_segments(self, *segments)¶
Add 1 or more
GenomicSegments
to theSegmentChain
. If there are already segments in the chain, the incoming segments must be on the same strand and chromosome as all others present.- Parameters
- segments
GenomicSegment
One or more
GenomicSegment
to add toSegmentChain
- antisense_overlaps(self, other)¶
Returns True if self and other share genomic positions on opposite strands
- Parameters
- other
SegmentChain
orGenomicSegment
Query feature
- Returns
- bool
True if self and other share genomic positions on the same chromosome but opposite strand; False otherwise.
- Raises
- TypeError
if other is not a
GenomicSegment
orSegmentChain
- as_bed(self, as_int=True, color=None, extra_columns=None, empty_value='')¶
Format self as a BED12[+X] line, assigning CDS boundaries to the thickstart and thickend columns from self.attr
If the
SegmentChain
was imported as a BED file with extra columns, these will be output in the same order, after the BED columns.- Parameters
- as_intbool, optional
Force “score” to integer (Default: True)
- colorstr or None, optional
Color represented as RGB hex string. If not None, overrides the color in self.attr[“color”]
- extra_columnsNone or list-like, optional
If None, and the SegmentChain was imported using the extra_columns keyword of from_bed(), the SegmentChain will be exported in BED12+X format, in which extra columns are in the same order as they were upon import. If no extra columns were present, the SegmentChain will be exported as a BED12 line.
If a list of attribute names, these attributes will be exported as extra columns in order, overriding whatever happened upon import. If an attribute name is not in the attr dict of the SegmentChain, it will be exported with the value of empty_value.
If an empty list, no extra columns will be exported; the SegmentChain will be formatted as a BED12 line.
- empty_valuestr, optional
- Returns
- str
Line of BED12-formatted text
Notes
- BED12 columns are as follows
Column
Contains
0
Contig or chromosome
1
Start of first block in feature (0-indexed)
2
End of last block in feature (half-open)
3
Feature name
4
Feature score
5
Strand
6
thickstart
7
thickend
8
Feature color as RGB tuple
9
Number of blocks in feature
10
Block lengths
11
Block starts, relative to start of first block
- For more information
See the UCSC file format faq
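The column table above can be illustrated with a small pure-Python sketch. It does not use plastid; `bed12_line` is a hypothetical helper, segments are plain `(start, end)` tuples in 0-indexed, half-open coordinates, and setting thickstart and thickend to the feature start for non-coding features is an assumption of this sketch rather than documented behavior.

```python
def bed12_line(chrom, strand, segments, name, thickstart=None, thickend=None,
               score=0, color="0,0,0"):
    """Build one BED12 line from (start, end) tuples (hypothetical helper)."""
    segments = sorted(segments)
    start = segments[0][0]      # column 1: start of first block (0-indexed)
    end = segments[-1][1]       # column 2: end of last block (half-open)
    if thickstart is None or thickend is None:
        # Assumption: a feature without a CDS gets an empty thick region
        thickstart = thickend = start
    # Columns 10 and 11: comma-terminated block lengths and relative starts
    block_lengths = ",".join(str(e - s) for s, e in segments) + ","
    block_starts = ",".join(str(s - start) for s, e in segments) + ","
    columns = [chrom, start, end, name, score, strand, thickstart, thickend,
               color, len(segments), block_lengths, block_starts]
    return "\t".join(str(c) for c in columns)


line = bed12_line("chrA", "-", [(5, 200), (250, 300)], "some_chain", 50, 260)
print(line)
```

Block starts are measured from the start of the first block, which is why the second block of this two-segment chain begins at offset 245 rather than at 250.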
- as_gff3(self, bool escape=True, list excludes=None, unicode rna_type=u'mRNA')¶
Format a Transcript as a block of GFF3 output, following the schema set out in the Sequence Ontology (SO) v2.53.
The Transcript will be formatted according to the following rules:
A feature of type rna_type will be created, with Parent attribute set to the value of self.get_gene(), and ID attribute set to self.get_name().
For each GenomicSegment in self, a child feature of type exon will be created. The Parent attribute of these features will be set to the value of self.get_name(). These will have unique IDs generated from self.get_name().
If self is coding (i.e. has non-None values for self.cds_genome_start and self.cds_genome_end), child features of type ‘five_prime_UTR’, ‘CDS’, and ‘three_prime_UTR’ will be created, with Parent attributes set to self.get_name(). These will have unique IDs generated from self.get_name().
- Parameters
- escapebool, optional
Escape tokens in column 9 of GFF3 output (Default: True)
- excludeslist, optional
List of attribute key names to exclude from column 9 (Default: [])
- rna_typestr, optional
Feature type to export RNA as (e.g. ‘tRNA’, ‘noncoding_RNA’, etc. Default: ‘mRNA’)
- Returns
- str
Multiline block of GFF3-formatted text
Notes
- Beware of attribute loss
This
Transcript
was assembled from multiple individual component features (e.g. single exons), which may or may not have had their own unique attributes in their original annotation. To reduce overhead, these individual attributes (if they were present) have not been (entirely) stored, and consequently will not (all) be exported. If this poses problems, consider instead importing, modifying, and exporting the component features.
- GFF3 schemas vary
Different GFF3s have different schemas (parent-child relationships between features). Here we adopt the commonly-used schema set by Sequence Ontology (SO) v2.53, which may or may not match your schema.
- Columns of GFF3 are as follows
Column
Contains
1
Contig or chromosome
2
Source of annotation
3
Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4
Start (1-indexed)
5
End (fully-closed)
6
Score
7
Strand
8
Frame. Number of bases within feature before first in-frame codon (if coding)
9
Attributes
- For further information, see
Sequence Ontology (SO) v2.53 <http://www.sequenceontology.org/browser/>
- as_gtf(self, unicode feature_type=u'exon', bool escape=True, list excludes=None)¶
Format self as a GTF2 block.
GenomicSegments
are formatted as GTF2 ‘exon’ features. Coding regions, if present, are formatted as GTF2 ‘CDS’ features. Stop codons are excluded in the ‘CDS’ features, per the GTF2 specification, and exported separately.
All attributes from self.attr are propagated to the exon and CDS features that are generated.
- Parameters
- feature_typestr
If not None, overrides the ‘type’ attribute of self.attr
- escapebool, optional
URL escape tokens in column 9 of GTF2 output (Default: True)
- Returns
- str
Block of GTF2-formatted text
Notes
- gene_id and transcript_id are required
The GTF2 specification requires that attributes gene_id and transcript_id be defined. If these are not present in self.attr, their values will be guessed following the rules in
SegmentChain.get_gene()
andSegmentChain.get_name()
, respectively.- Beware of attribute loss
To save memory, only the attributes shared by all of the individual sub-features (e.g. exons) that were used to assemble this
Transcript
have been stored in self.attr. This means that upon re-export to GTF2, these sub-features will be lacking any attributes that were specific to them individually. Formally, this is compliant with the GTF2 specification, which states explicitly that only the attributes gene_id and transcript_id are supported.
Columns of GTF2 are as follows:
Column
Contains
1
Contig or chromosome
2
Source of annotation
3
Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4
Start (1-indexed)
5
End (fully-closed)
6
Score
7
Strand
8
Frame. Number of bases within feature before first in-frame codon (if coding)
9
Attributes. “gene_id” and “transcript_id” are required
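Because chains use 0-indexed, half-open coordinates while GTF2 columns 4 and 5 are 1-indexed and fully closed, the start shifts by one on export and the end does not. A minimal pure-Python sketch of that conversion follows; `gtf2_line` is a hypothetical helper, not part of plastid, and it writes only the two required attributes.

```python
def gtf2_line(chrom, start, end, strand, feature_type, gene_id, transcript_id,
              source=".", score=".", frame="."):
    """Format one 0-indexed, half-open interval as a GTF2 line (hypothetical)."""
    # Column 9: the two attributes that GTF2 requires
    attributes = 'gene_id "%s"; transcript_id "%s";' % (gene_id, transcript_id)
    columns = [chrom, source, feature_type,
               start + 1,   # 0-indexed start becomes 1-indexed
               end,         # half-open end equals the fully-closed end
               score, strand, frame, attributes]
    return "\t".join(str(c) for c in columns)


exon = gtf2_line("chrA", 5, 200, "-", "exon", "gene_some_chain", "some_chain")
print(exon)
```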
- For more info
- as_psl(self)¶
Formats
SegmentChain
as PSL (blat) output.- Returns
- str
PSL-representation of BLAT alignment
- Raises
- AttributeError
If not all of the attributes listed in the Notes below are defined
Notes
This will raise an
AttributeError
unless the following keys are present and defined in self.attr, corresponding to the columns of a PSL file:Column
Key
1
match_length
2
mismatches
3
rep_matches
4
N
5
query_gap_count
6
query_gap_bases
7
target_gap_count
8
target_gap_bases
9
strand
10
query_name
11
query_length
12
query_start
13
query_end
14
target_name
15
target_length
16
target_start
17
target_end
19
q_starts
: list of integers20
l_starts
: list of integersThese keys are defined only if the
SegmentChain
was created bySegmentChain.from_psl()
, or if the user has defined them.See the PSL spec for more information.
- covers(self, other)¶
Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self. By convention, zero-length
SegmentChains
are not covered by other chains.- Parameters
- other
SegmentChain
orGenomicSegment
Query feature
- Returns
- bool
True if self and other share a chromosome and strand, and all genomic positions in other are present in self. Otherwise False
- Raises
- TypeError
if other is not a
GenomicSegment
orSegmentChain
- static from_bed(unicode line, extra_columns=0)¶
Create a
Transcript
from a BED line with 4 or more columns. thickstart and thickend columns, if present, are assumed to specify CDS boundaries, a convention that, while common, is formally outside the BED specification.See the UCSC file format faq for more details.
- Parameters
- line
Line from a BED file with at least 4 columns
- extra_columns: int or list, optional
Extra, non-BED columns in BED file corresponding to feature attributes. This is common in ENCODE-specific BED variants.
If extra_columns is:
an int: it is taken to be the number of attribute columns. Attributes will be stored in the attr dictionary of the SegmentChain, under names like custom0, custom1, … , customN.
a list of str: it is taken to be the names of the attribute columns, in order, from left to right in the file. In this case, attributes in extra columns will be stored under their respective names.
a list of tuple: each tuple is taken to be a pair of (attribute_name, formatter_func). In this case, the value of attribute_name in the attr dict of the SegmentChain will be set to formatter_func(column_value).
(Default: 0)
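The three accepted forms of extra_columns can be summarized in a short pure-Python sketch. `parse_extra_columns` is a hypothetical stand-in written for illustration; it is not plastid's parser, and it assumes the extra column values have already been split out of the BED line as strings.

```python
def parse_extra_columns(values, extra_columns=0):
    """Map extra BED column values to attribute names (hypothetical helper)."""
    if isinstance(extra_columns, int):
        # An int gives the number of columns; names are generated
        names = ["custom%s" % i for i in range(extra_columns)]
        formatters = [str] * extra_columns
    elif all(isinstance(col, str) for col in extra_columns):
        # A list of str gives the attribute names, left to right
        names = list(extra_columns)
        formatters = [str] * len(names)
    else:
        # A list of (attribute_name, formatter_func) pairs
        names = [name for name, _ in extra_columns]
        formatters = [func for _, func in extra_columns]
    return {name: func(value)
            for name, func, value in zip(names, formatters, values)}


by_count = parse_extra_columns(["a", "3"], 2)
by_name = parse_extra_columns(["a", "3"], ["tag", "n"])
by_pair = parse_extra_columns(["a", "3"], [("tag", str), ("n", int)])
```

Only the tuple form converts types; in the other two forms the values stay as the strings read from the file.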
- Returns
- static from_psl(unicode psl_line)¶
- static from_str(unicode inp)¶
Create a
SegmentChain
from a string formatted bySegmentChain.__str__()
:chrom:start-end^start-end(strand)
where ‘^’ indicates a splice junction between regions specified by start and end and strand is ‘+’, ‘-’, or ‘.’. Coordinates are 0-indexed and half-open.
- Parameters
- inpstr
String formatted in manner of
SegmentChain.__str__()
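To make the string format concrete, here is a minimal pure-Python parser for it. It is a sketch, not plastid's from_str(): `parse_chain_string` is a hypothetical name, it returns plain tuples rather than a SegmentChain, and it does not handle whatever string an empty chain produces.

```python
import re

# chrom:start-end^start-end(strand), coordinates 0-indexed and half-open
CHAIN_PATTERN = re.compile(
    r"^(?P<chrom>.+):(?P<spans>\d+-\d+(?:\^\d+-\d+)*)\((?P<strand>[+.\-])\)$"
)


def parse_chain_string(inp):
    """Return (chrom, [(start, end), ...], strand) for a chain string."""
    match = CHAIN_PATTERN.match(inp)
    if match is None:
        raise ValueError("not a chain string: %r" % inp)
    segments = []
    for span in match.group("spans").split("^"):   # '^' marks a splice junction
        start, end = span.split("-")
        segments.append((int(start), int(end)))
    return match.group("chrom"), segments, match.group("strand")


parsed = parse_chain_string("chrA:5-200^250-300(-)")
```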
- Returns
- get_antisense(self) SegmentChain ¶
Returns an
SegmentChain
antisense to self, with empty attr dict.- Returns
- SegmentChain
SegmentChain
antisense to self
- get_cds(self, **extra_attr)¶
Retrieve
SegmentChain
covering the coding region of self, including the stop codon. If no coding region is present, returns an emptySegmentChain
.The following attributes are passed from self.attr to the new
SegmentChain
transcript_id, taken from
SegmentChain.get_name()
gene_id, taken from
SegmentChain.get_gene()
ID, generated as “%s_CDS” % self.get_name()
- Parameters
- extra_attrkeyword arguments
Values that will be included in the CDS subchain’s attr dict. These can be used to overwrite values already present.
- Returns
SegmentChain
CDS region of self if present, otherwise empty
SegmentChain
- get_counts(self, ga, stranded=True)¶
Return list of counts or values drawn from ga at each position in self
- Parameters
- gaGenomeArray from which to fetch counts
- strandedbool, optional
If True and self is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)
- Returns
- numpy.ndarray
Array of counts from ga covering self
- get_fasta(self, genome, stranded=True)¶
Formats sequence of SegmentChain as FASTA output
- Parameters
- genomedict or
twobitreader.TwoBitFile
Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or
Bio.Seq.SeqRecord
objects- strandedbool
If True and the
SegmentChain
is on the minus strand, sequence will be reverse-complemented (Default: True)
- Returns
- str
FASTA-formatted sequence of
SegmentChain
extracted from genome
- get_gene(self)¶
Return name of gene associated with
SegmentChain
, if any, by searching through self.attr for the keys gene_id and Parent. If one is not found, a generated gene name for the SegmentChain is made fromget_name()
.- Returns
- str
Returns in order of preference, gene_id from self.attr, Parent from self.attr or
'gene_%s' % self.get_name()
- get_genomic_coordinate(self, x, stranded=True)¶
Finds genomic coordinate corresponding to position x in self
- Parameters
- xint
position of interest, relative to
SegmentChain
- strandedbool, optional
If True, x is assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, coordinates are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)
- Returns
- str
Chromosome name
- long
Genomic coordinate corresponding to position x
- str
Chromosome strand (‘+’, ‘-’, or ‘.’)
- Raises
- IndexError
if x is outside the bounds of the
SegmentChain
- get_junctions(self)¶
Returns a list of
GenomicSegments
representing spaces between theGenomicSegments
in self. In the case of a transcript, these would represent introns. In the case of an alignment, these would represent gaps in the query compared to the reference.
- Returns
- list
List of
GenomicSegments
covering spaces between the intervals in self (e.g. introns in the case of a transcript, or gaps in the case of an alignment)
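A junction is simply the gap between the end of one segment and the start of the next. The pure-Python sketch below shows the idea with `(start, end)` tuples in 0-indexed, half-open coordinates; `junctions` is a hypothetical helper and assumes the segments do not overlap.

```python
def junctions(segments):
    """Return the gaps between consecutive (start, end) segments."""
    ordered = sorted(segments)
    return [(left_end, right_start)
            for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])]


introns = junctions([(250, 300), (5, 200)])
```

A chain with a single segment has no junctions, so the result is an empty list.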
- get_masked_counts(self, ga, stranded=True, copy=False)¶
Return counts covering self in dataset ga as a masked array, in transcript coordinates. Positions masked by
SegmentChain.add_masks()
will be masked in the array.
- Parameters
- ga : non-abstract subclass of
AbstractGenomeArray
GenomeArray from which to fetch counts
- strandedbool, optional
If true and the
SegmentChain
is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)- copybool, optional
If False (default) returns a view of the data; so changing values in the view changes the values in the
GenomeArray
if it is mutable. If True, a copy is returned instead.
- Returns
numpy.ma.masked_array
- get_masked_position_set(self) set ¶
Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using
SegmentChain.add_masks()
- Returns
- set
Set of genomic coordinates, as integers
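Despite the method name, the returned set holds the positions that remain after masking. A small pure-Python sketch of that set arithmetic follows; `unmasked_positions` is a hypothetical helper operating on `(start, end)` tuples, not on plastid objects.

```python
def unmasked_positions(segments, masks):
    """Positions covered by segments but not by any mask (half-open tuples)."""
    covered = {x for start, end in segments for x in range(start, end)}
    masked = {x for start, end in masks for x in range(start, end)}
    # Mask positions outside the chain have no effect, which mirrors the
    # trimming of masks to the chain described for add_masks()
    return covered - masked


remaining = unmasked_positions([(5, 10)], [(0, 7)])
```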
- get_masks(self)¶
Return masked positions as a list of
GenomicSegments
- Returns
- list
list of
GenomicSegments
representing masked positions
- get_masks_as_segmentchain(self)¶
Return masked positions as a
SegmentChain
- Returns
SegmentChain
Masked positions
- get_name(self)¶
Return the name of self, first searching through self.attr for the keys transcript_id, ID, Name, and name. If no value is found,
Transcript.__str__()
is used.- Returns
- str
Returns in order of preference, transcript_id, ID, Name, or name from self.attr. If not found, returns
str(self)
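The lookup order for names, and the related order documented for get_gene(), can be sketched with a plain dict standing in for self.attr. The functions below are hypothetical illustrations, and `fallback` plays the role of `str(self)`.

```python
def lookup_name(attr, fallback):
    """First of transcript_id, ID, Name, name found in attr, else fallback."""
    for key in ("transcript_id", "ID", "Name", "name"):
        if key in attr:
            return attr[key]
    return fallback


def lookup_gene(attr, fallback):
    """gene_id, then Parent, else a name generated from lookup_name()."""
    for key in ("gene_id", "Parent"):
        if key in attr:
            return attr[key]
    return "gene_%s" % lookup_name(attr, fallback)


name = lookup_name({"ID": "tx1", "name": "ignored"}, "chrA:5-200(-)")
gene = lookup_gene({"ID": "tx1"}, "chrA:5-200(-)")
```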
- get_position_list(self)¶
Retrieve a sorted end-inclusive numpy array of genomic coordinates in this
SegmentChain
- Returns
- list
Genomic coordinates in self, as integers, in genomic order
- get_position_set(self)¶
Retrieve an end-inclusive set of genomic coordinates included in this
SegmentChain
- Returns
- set
Set of genomic coordinates, as integers
- get_segmentchain_coordinate(self, unicode chrom, long genomic_x, unicode strand, bool stranded=True)¶
Finds the
SegmentChain
coordinate corresponding to a genomic position- Parameters
- chromstr
Chromosome name
- genomic_xint
coordinate, in genomic space
- strandstr
Chromosome strand (‘+’, ‘-’, or ‘.’)
- strandedbool, optional
If True, coordinates are given in stranded space (i.e. from 5’ end of chain, as one might expect for a transcript). If False, coordinates are given from the left end of self, regardless of strand. (Default: True)
- Returns
- int
Position in
SegmentChain
- Raises
- KeyError
if position outside bounds of
SegmentChain
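The mapping between chain coordinates and genomic coordinates, in both directions, can be sketched in pure Python by listing the genomic positions of the chain in 5’ to 3’ order. This is an illustration only: the helper names are hypothetical, segments are `(start, end)` tuples, and building the full position list is a simplification that a real implementation would likely avoid.

```python
def chain_positions(segments, strand, stranded=True):
    """Genomic positions of a chain, ordered from its 5' end if stranded."""
    positions = [x for start, end in sorted(segments) for x in range(start, end)]
    if stranded and strand == "-":
        positions.reverse()
    return positions


def genomic_coordinate(segments, strand, x, stranded=True):
    positions = chain_positions(segments, strand, stranded)
    if not 0 <= x < len(positions):
        raise IndexError("position %s is outside the chain" % x)
    return positions[x]


def chain_coordinate(segments, strand, genomic_x, stranded=True):
    positions = chain_positions(segments, strand, stranded)
    if genomic_x not in positions:
        raise KeyError("genomic position %s is not in the chain" % genomic_x)
    return positions.index(genomic_x)


segs = [(5, 200), (250, 300)]                 # the minus-strand chain from the Examples
across = genomic_coordinate(segs, "-", 50)    # first position past the junction
before = genomic_coordinate(segs, "-", 49)    # last position before the junction
back = chain_coordinate(segs, "-", 118)
```

With the two-segment minus-strand chain from the module examples, these calls reproduce the values shown there: 199, 250, and 131.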
- get_sequence(self, genome, stranded=True)¶
Return spliced genomic sequence of
SegmentChain
as a string- Parameters
- genomedict or
twobitreader.TwoBitFile
Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or
Bio.Seq.SeqRecord
objects- strandedbool
If True and the
SegmentChain
is on the minus strand, sequence will be reverse-complemented (Default: True)
- Returns
- str
Nucleotide sequence of the
SegmentChain
extracted from genome
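Splicing and strand handling can be illustrated with a plain dict of strings as the genome. This is a minimal sketch: `spliced_sequence` is a hypothetical helper, it takes tuples rather than a SegmentChain, and it complements only the unambiguous bases A, C, G and T.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def spliced_sequence(genome, chrom, segments, strand, stranded=True):
    """Join segment sequences; reverse-complement minus-strand chains."""
    sequence = "".join(genome[chrom][start:end] for start, end in sorted(segments))
    if stranded and strand == "-":
        sequence = sequence.translate(COMPLEMENT)[::-1]
    return sequence


genome = {"chrA": "AACCGGTTAC"}
plus_seq = spliced_sequence(genome, "chrA", [(0, 2), (4, 6)], "+")
minus_seq = spliced_sequence(genome, "chrA", [(0, 2), (4, 6)], "-")
```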
- get_subchain(self, long start, long end, bool stranded=True, **extra_attr)¶
Retrieves a sub-
SegmentChain
corresponding to a range of positions specified in coordinates relative to this SegmentChain. Attributes in self.attr are copied to the child SegmentChain, with the exception of ID, to which the suffix ‘subchain’ is appended.
- Parameters
- startint
position of interest in SegmentChain coordinates, 0-indexed
- endint
position of interest in SegmentChain coordinates, 0-indexed and half-open
- strandedbool, optional
If True, start and end are assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, they are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)
- extra_attrkeyword arguments
Values that will be included in the subchain’s attr dict. These can be used to overwrite values already present.
- Returns
SegmentChain
covering parent chain positions start to end of self
- Raises
- IndexError
if start or end is outside the bounds of the
SegmentChain
- TypeError
if start or end is None
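How a subchain keeps the discontinuities of its parent can be shown with a pure-Python sketch that selects chain positions and regroups them into segments. `subchain_segments` is a hypothetical helper, not plastid's algorithm, and it returns `(start, end)` tuples instead of a SegmentChain.

```python
def subchain_segments(segments, strand, start, end, stranded=True):
    """Genomic (start, end) tuples for chain positions start..end (half-open)."""
    positions = [x for s, e in sorted(segments) for x in range(s, e)]
    if stranded and strand == "-":
        positions.reverse()            # count from the 5' end
    if not 0 <= start <= end <= len(positions):
        raise IndexError("start or end is outside the chain")
    grouped = []
    for x in sorted(positions[start:end]):
        if grouped and grouped[-1][1] == x:
            grouped[-1][1] = x + 1     # extend the current run
        else:
            grouped.append([x, x + 1]) # start a new run after a gap
    return [tuple(run) for run in grouped]


sub = subchain_segments([(5, 200), (250, 300)], "-", 45, 70)
```

For positions 45 to 70 of the minus-strand example chain, this yields the same two segments shown in the module examples, chrA:180-200 and chrA:250-255.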
- get_unstranded(self) SegmentChain ¶
Returns an unstranded
SegmentChain
copy of self (strand ‘.’), with empty attr dict.
- Returns
- SegmentChain
Unstranded (strand ‘.’) version of self
- get_utr3(self, **extra_attr)¶
Retrieve sub-
SegmentChain
covering 3’UTR of self, excluding the stop codon. If no coding region, returns an emptySegmentChain
The following attributes are passed from
self.attr
to the newSegmentChain
transcript_id, taken from
SegmentChain.get_name()
gene_id, taken from
SegmentChain.get_gene()
ID, generated as “%s_3UTR” % self.get_name()
- Parameters
- extra_attrkeyword arguments
Values that will be included in the 3’ UTR subchain’s attr dict. These can be used to overwrite values already present.
- Returns
SegmentChain
3’ UTR region of self if present, otherwise empty
SegmentChain
- get_utr5(self, **extra_attr)¶
Retrieve sub-
SegmentChain
covering 5’UTR of self. If no coding region, returns an emptySegmentChain
The following attributes are passed from self.attr to the new
SegmentChain
transcript_id, taken from
SegmentChain.get_name()
gene_id, taken from
SegmentChain.get_gene()
ID, generated as “%s_5UTR” % self.get_name()
- Parameters
- extra_attrkeyword arguments
Values that will be included in the 5’UTR subchain’s attr dict. These can be used to overwrite values already present.
- Returns
SegmentChain
5’ UTR region of self if present, otherwise empty
SegmentChain
- next(self)¶
Return next
GenomicSegment
in theSegmentChain
, from left to right on the chromosome
- overlaps(self, other)¶
Return True if self and other share genomic positions on the same strand
- Parameters
- other
SegmentChain
orGenomicSegment
Query feature
- Returns
- bool
True if self and other share genomic positions on the same chromosome and strand; False otherwise.
- Raises
- TypeError
if other is not a
GenomicSegment
orSegmentChain
- reset_masks(self)¶
Removes masks added by
add_masks()
- shares_segments_with(self, other)¶
Returns a list of
GenomicSegment
that are shared between self and other- Parameters
- other
SegmentChain
orGenomicSegment
Query feature
- Returns
- list
List of
GenomicSegments
common to self and other
- Raises
- TypeError
if other is not a
GenomicSegment
orSegmentChain
- sort(self)¶
- unstranded_overlaps(self, other)¶
Return True if self and other share genomic positions on the same chromosome, regardless of their strands
- Parameters
- other
SegmentChain
orGenomicSegment
Query feature
- Returns
- bool
True if self and other share genomic positions on the same chromosome, False otherwise. Strands of self and other need not match
- Raises
- TypeError
if other is not a
GenomicSegment
orSegmentChain
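The comparison methods differ only in how they treat strand. The pure-Python sketch below restates the documented semantics of overlaps(), antisense_overlaps(), unstranded_overlaps() and covers() using `(chrom, start, end, strand)` tuples; the functions are hypothetical and compare position sets directly, which is simple but not efficient.

```python
def positions(segments):
    """Set of (chrom, position) pairs covered by half-open segments."""
    return {(chrom, x) for chrom, start, end, _ in segments
            for x in range(start, end)}


def strand_of(segments):
    return segments[0][3] if segments else "."


def unstranded_overlaps(a, b):
    return bool(positions(a) & positions(b))


def overlaps(a, b):
    return unstranded_overlaps(a, b) and strand_of(a) == strand_of(b)


def antisense_overlaps(a, b):
    opposite = {"+": "-", "-": "+"}
    return unstranded_overlaps(a, b) and opposite.get(strand_of(a)) == strand_of(b)


def covers(a, b):
    # By convention, zero-length chains are not covered by other chains
    b_positions = positions(b)
    return bool(b_positions) and strand_of(a) == strand_of(b) \
        and b_positions <= positions(a)


chain = [("chrA", 5, 200, "-"), ("chrA", 250, 300, "-")]
sense = [("chrA", 190, 260, "-")]
anti = [("chrA", 190, 260, "+")]
```

Here `sense` overlaps `chain` but is not covered by it, because positions 200 to 249 fall in the gap between the two segments of `chain`.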
- attr¶
attr: dict
- c_strand¶
- cds_end¶
End of coding region relative to 5’ end of transcript, in direction of transcript. Setting to None also sets self.cds_start, self.cds_genome_start and self.cds_genome_end to None
- cds_genome_end¶
Ending coordinate of coding region, relative to genome (i.e. rightmost; is stop codon for forward-strand features, start codon for reverse-strand features). Setting to None also sets self.cds_start, self.cds_end, and self.cds_genome_start to None
- cds_genome_start¶
Starting coordinate of coding region, relative to genome (i.e. leftmost; is start codon for forward-strand features, stop codon for reverse-strand features). Setting to None also sets self.cds_start, self.cds_end, and self.cds_genome_end to None
- cds_start¶
Start of coding region relative to 5’ end of transcript, in direction of transcript. Setting to None also sets self.cds_end, self.cds_genome_start and self.cds_genome_end to None
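The relationship between the genomic and transcript-relative CDS attributes can be sketched as follows. This is an illustration under stated assumptions: `cds_transcript_coords` is a hypothetical helper, and it assumes that both coordinate pairs are 0-indexed and half-open, as with BED thickstart and thickend.

```python
def cds_transcript_coords(segments, strand, cds_genome_start, cds_genome_end):
    """Convert half-open genomic CDS bounds to (cds_start, cds_end)."""
    positions = [x for s, e in sorted(segments) for x in range(s, e)]
    if strand == "-":
        positions.reverse()                 # index 0 is the 5' end
        first, last = cds_genome_end - 1, cds_genome_start
    else:
        first, last = cds_genome_start, cds_genome_end - 1
    # first and last are the genomic positions of the first and last CDS bases
    return positions.index(first), positions.index(last) + 1


minus = cds_transcript_coords([(5, 200), (250, 300)], "-", 100, 280)
plus = cds_transcript_coords([(0, 100), (200, 300)], "+", 50, 250)
```

On the minus strand the roles swap: the rightmost genomic CDS position is the first base of the CDS in transcript coordinates.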
- chrom¶
Chromosome the SegmentChain resides on
- length¶
- mask_segments¶
Copy of list of
GenomicSegments
representing regions masked in self. Changing this list will do nothing to the masks in self.
- masked_length¶
- segments¶
Copy of list of
GenomicSegments
that comprise self. Changing this list will do nothing to self.
- spanning_segment¶
- strand¶
Strand of the SegmentChain
- plastid.genomics.roitools.add_three_for_stop_codon(Transcript tx) Transcript ¶
Extend an annotated CDS region, if present, by three nucleotides at the threeprime end. Use in cases when annotation files exclude the stop codon from the annotated CDS.
- Parameters
- tx
Transcript
query transcript
- Returns
Transcript
Transcript
with same attributes as tx, but with CDS extended by one codon
- Raises
- IndexError
if a three prime UTR is defined that terminates before the complete stop codon
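In transcript coordinates the adjustment amounts to moving the CDS end three nucleotides toward the 3’ end, provided the transcript is long enough. The sketch below shows only that arithmetic; `extend_cds_for_stop_codon` is a hypothetical helper working on integers, not on a Transcript.

```python
def extend_cds_for_stop_codon(cds_start, cds_end, transcript_length):
    """Return (cds_start, cds_end) with the end moved 3 nt downstream."""
    if cds_start is None or cds_end is None:
        return cds_start, cds_end          # non-coding: nothing to extend
    new_end = cds_end + 3
    if new_end > transcript_length:
        raise IndexError("transcript ends before the complete stop codon")
    return cds_start, new_end


extended = extend_cds_for_stop_codon(20, 147, 245)
```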
- plastid.genomics.roitools.merge_segments(list segments) list ¶
Merge all overlapping
GenomicSegments
in segments, so that all segments returned are guaranteed to be sorted and non-overlapping.
Note
All segments are assumed to be on the same strand and chromosome.
- Parameters
- segmentslist
List of
GenomicSegments
, all on the same strand and chromosome
- Returns
- list
List of sorted, non-overlapping
GenomicSegments
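The usual way to merge intervals is a single pass over the sorted input. The sketch below is a generic pure-Python version on `(start, end)` tuples, not plastid's code; `merge_intervals` is a hypothetical name, and leaving segments that merely touch unmerged is an assumption of this sketch.

```python
def merge_intervals(segments):
    """Merge overlapping half-open (start, end) tuples into a sorted list."""
    merged = []
    for start, end in sorted(segments):
        if merged and start < merged[-1][1]:
            # Overlaps the previous interval: extend it if needed
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


result = merge_intervals([(10, 20), (5, 12), (30, 40), (35, 50)])
```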
- plastid.genomics.roitools.positionlist_to_segments(unicode chrom, unicode strand, list positions) list ¶
Construct
GenomicSegments
from a chromosome name, a strand, and a list of chromosomal positions.- Parameters
- chromstr
Chromosome name
- strandstr
Chromosome strand (‘+’, ‘-’, or ‘.’)
- positionslist of unique integers
Sorted, end-inclusive list of positions to include in final
GenomicSegment
- Returns
- list
List of
GenomicSegments
covering positions
Warning
This function is meant to run quickly, without excessive type conversions, so the elements of positions must be UNIQUE and SORTED. If they are not, use
positions_to_segments()
instead.
- plastid.genomics.roitools.positions_to_segments(unicode chrom, unicode strand, positions) list ¶
Construct
GenomicSegments
from a chromosome name, a strand, and a list of chromosomal positions.- Parameters
- chromstr
Chromosome name
- strandstr
Chromosome strand (‘+’, ‘-’, or ‘.’)
- positionslist of integers
End-inclusive list, tuple, or set of positions to include in final
GenomicSegment
- Returns
- list
List of
GenomicSegments
covering positions
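Converting positions to segments means grouping runs of consecutive integers. A minimal pure-Python sketch follows; `positions_to_intervals` is a hypothetical helper that returns 0-indexed, half-open `(start, end)` tuples, and it sorts and de-duplicates its input first, as the more permissive of the two functions above is described as allowing.

```python
def positions_to_intervals(positions):
    """Group integer positions into half-open (start, end) runs."""
    intervals = []
    for x in sorted(set(positions)):
        if intervals and intervals[-1][1] == x:
            intervals[-1][1] = x + 1       # x continues the current run
        else:
            intervals.append([x, x + 1])   # gap: start a new run
    return [tuple(interval) for interval in intervals]


runs = positions_to_intervals([7, 5, 6, 6, 10])
```

Skipping the sort and de-duplication step is what lets a function like positionlist_to_segments() run faster, at the cost of requiring clean input.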