plastid.genomics.roitools module¶
This module contains classes for representing and manipulating genomic features.
Summary¶
Genomic features are represented as SegmentChains, which can contain zero or
more continuous spans of the genome (GenomicSegments), as well as rich
annotation data. For the specific case of RNA transcripts, a subclass of
SegmentChain, called Transcript is provided.
Module contents¶
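GenomicSegment | Building block for SegmentChain
SegmentChain | Base class for genomic features, composed of zero or more GenomicSegments
Transcript | Subclass of SegmentChain specifically for RNA transcripts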
Construct |
|
Extend an annotated CDS region, if present, by three nucleotides at the 3’ end. |
Examples¶
SegmentChains may be read directly from annotation files using the readers
in plastid.readers:
>>> from plastid import *
>>> chains = list(BED_Reader(open("some_file.bed")))
or constructed from GenomicSegments:
>>> seg1 = GenomicSegment("chrA", 5, 200, "-")
>>> seg2 = GenomicSegment("chrA", 250, 300, "-")
>>> my_chain = SegmentChain(seg1, seg2, ID="some_chain", ... , some_attribute="some_value")
SegmentChains contain convenience methods for a number of common tasks, for
example:
converting coordinates between the spliced space of the chain, and the genome:
>>> # get coordinate of 50th position from 5' end
>>> my_chain.get_genomic_coordinate(50)
('chrA', 199, '-')
>>> # get coordinate of 49th position. splicing is taken care of!
>>> my_chain.get_genomic_coordinate(49)
('chrA', 250, '-')
>>> # get coordinate in chain corresponding to genomic coordinate 118
>>> my_chain.get_segmentchain_coordinate("chrA", 118, "-")
131
>>> # get a subchain containing positions 45-70
>>> subchain = my_chain.get_subchain(45, 70)
>>> subchain
<SegmentChain segments=2 bounds=chrA:180-255(-) name=some_chain_subchain>
>>> # the subchain preserves the discontinuity found in `my_chain`
>>> subchain.segments
[<GenomicSegment chrA:180-200 strand='-'>,
 <GenomicSegment chrA:250-255 strand='-'>]
fetching numpy arrays of data at each position in the chain. The data is assumed to be kept in a GenomeArray:
>>> ga = BAMGenomeArray(["some_file.bam"], mapping=ThreePrimeMapFactory(offset=15))
>>> my_chain.get_counts(ga)
array([843, 854, 153, 86, 462, 359, 290, 38, 38, 758, 342, 299, 430,
       628, 324, 437, 231, 417, 536, 673, 243, 981, 661, 415, 207, 446,
       197, 520, 653, 468, 863, 3, 272, 754, 352, 960, 966, 913, 367,
       ... ])
similarly, fetching spliced sequence, reverse-complemented if necessary for minus-strand features. As input, the SegmentChain expects a dictionary-like object mapping chromosome names to string-like sequences (e.g. as in BioPython or twobitreader):
>>> seqdict = { "chrA" : "TCTACATA ..." } # some string of chrA sequence
>>> my_chain.get_sequence(seqdict)
"ACTGTGTACTGTACGATCGATCGTACGTACGATCGATCGTACGTAGCTAGTCAGCTAGCTAGCTAGCTGA..."
testing for overlap, containment, equality with other SegmentChains:
>>> other_chain = SegmentChain(GenomicSegment("chrA", 200, 300, "-"),
...                            GenomicSegment("chrA", 800, 900, "-"))
>>> my_chain.overlaps(other_chain)
True
>>> other_chain in my_chain
False
>>> my_chain in my_chain
True
>>> my_chain.covers(other_chain)
False
>>> my_chain == other_chain
False
>>> my_chain == my_chain
True
>>> my_chain.as_bed()
chrA 5 300 some_chain 0 - 5 5 0,0,0 2 195,50, 0,245,
>>> my_chain.as_gtf()
chrA . exon 6 200 . - . gene_id "gene_some_chain"; transcript_id "some_chain"; some_attribute "some_value"; ID "some_chain";
chrA . exon 251 300 . - . gene_id "gene_some_chain"; transcript_id "some_chain"; some_attribute "some_value"; ID "some_chain";
- class plastid.genomics.roitools.GenomicSegment(chrom, start, end, strand)¶
Bases: object
Building block for SegmentChain: a continuous segment of the genome defined by a chromosome name, start coordinate, end coordinate, and strand.
Examples
GenomicSegments sort lexically by chromosome, start position, end position, and finally strand:
>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrB", 0, 10, "+")
True
>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrA", 75, 100, "+")
True
>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrA", 55, 75, "+")
True
>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrA", 50, 150, "+")
True
>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrA", 50, 100, "-")
True
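The ordering described above behaves like ordinary tuple comparison. The sketch below is not plastid code: `Seg` is a hypothetical namedtuple standing in for GenomicSegment, and it assumes strands compare as plain strings, so "+" sorts before "-".

```python
from collections import namedtuple

# Hypothetical stand-in for GenomicSegment (illustration only).
Seg = namedtuple("Seg", ["chrom", "start", "end", "strand"])

segs = [
    Seg("chrA", 50, 100, "-"),
    Seg("chrB", 0, 10, "+"),
    Seg("chrA", 50, 150, "+"),
    Seg("chrA", 50, 100, "+"),
    Seg("chrA", 55, 75, "+"),
]

# Tuples compare field by field: chromosome, then start, then end,
# and finally strand.
ordered = sorted(segs)
```

Here the two segments at chrA:50-100 sort next to each other, with the plus-strand copy first.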
They also provide a few convenience methods for containment or overlap. To be contained, a segment must be on the same chromosome and strand as its container, and its coordinates must be within or equal to its endpoints:
>>> GenomicSegment("chrA", 50, 100, "+") in GenomicSegment("chrA", 25, 100, "+")
True
>>> GenomicSegment("chrA", 50, 100, "+") in GenomicSegment("chrA", 50, 100, "+")
True
>>> GenomicSegment("chrA", 50, 100, "+") in GenomicSegment("chrA", 25, 100, "-")
False
>>> GenomicSegment("chrA", 50, 100, "+") in GenomicSegment("chrA", 75, 200, "+")
False
Similarly, to overlap, GenomicSegments must be on the same strand and chromosome.
- Attributes
- chrom : str
Chromosome where GenomicSegment resides
- start : int
Zero-indexed (Pythonic) start coordinate of GenomicSegment
- end : int
Zero-indexed, half-open (Pythonic) end coordinate of GenomicSegment
- strand : str
Strand of GenomicSegment
Methods
as_igv_str(self) | Format as an IGV location string
contains(self, GenomicSegment other) | Test whether this segment contains other, where containment is defined as all positions in other being present in self, when both self and other share the same chromosome and strand.
from_igv_str(unicode loc_str, unicode strand=u'.') | Construct GenomicSegment from IGV location string
from_str(unicode inp) | Construct a GenomicSegment from its str() representation
overlaps(self, GenomicSegment other) | Test whether this segment overlaps other, where overlap is defined as sharing: a chromosome, a strand, and a subset of coordinates.
- as_igv_str(self) unicode¶
Format as an IGV location string
- contains(self, GenomicSegment other) bool¶
Test whether this segment contains other, where containment is defined as all positions in other being present in self, when both self and other share the same chromosome and strand.
- Parameters
- other : GenomicSegment
Query segment
- Returns
- bool
- static from_igv_str(unicode loc_str, unicode strand=u'.')¶
Construct GenomicSegment from IGV location string
- Parameters
- loc_str : str
IGV location string, in format ‘chromosome:start-end’, where start and end are 1-indexed and half-open
- strand : str
The chromosome strand (‘+’, ‘-’, or ‘.’)
- Returns
- static from_str(unicode inp)¶
Construct a GenomicSegment from its str() representation
- Parameters
- inp : str
String representation of GenomicSegment as chrom:start-end(strand), where start and end are in 0-indexed, half-open coordinates
- Returns
- overlaps(self, GenomicSegment other) bool¶
Test whether this segment overlaps other, where overlap is defined as sharing: a chromosome, a strand, and a subset of coordinates.
- Parameters
- other : GenomicSegment
Query segment
- Returns
- bool
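The containment and overlap rules for half-open segments can be sketched in a few lines. This is an illustration of the definitions given above, not plastid's implementation; `Seg`, `contains`, and `overlaps` are hypothetical names.

```python
from collections import namedtuple

# Hypothetical stand-in for GenomicSegment; coordinates are 0-indexed, half-open.
Seg = namedtuple("Seg", ["chrom", "start", "end", "strand"])


def _same_chrom_and_strand(a, b):
    return a.chrom == b.chrom and a.strand == b.strand


def contains(container, other):
    """True if every position of `other` lies inside `container`."""
    return (
        _same_chrom_and_strand(container, other)
        and container.start <= other.start
        and other.end <= container.end
    )


def overlaps(a, b):
    """True if `a` and `b` share at least one position on the same strand."""
    return _same_chrom_and_strand(a, b) and a.start < b.end and b.start < a.end
```

Because the end coordinate is half-open, two segments that merely touch (for example 50-100 and 100-150) do not overlap under this definition.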
- c_strand¶
- chrom¶
Chromosome where
GenomicSegmentresides
- end¶
Zero-indexed, half-open (Pythonic) end coordinate of
GenomicSegment
- start¶
Zero-indexed (Pythonic) start coordinate of
GenomicSegment
- strand¶
Strand of GenomicSegment:
‘+’ for forward / Watson strand
‘-’ for reverse / Crick strand
‘.’ for unstranded / both strands
- class plastid.genomics.roitools.SegmentChain(*segments, **attributes)¶
Bases: object
Base class for genomic features, composed of zero or more GenomicSegments. SegmentChains can therefore model discontinuous features, such as multi-exon transcripts or gapped alignments, in addition to continuous features.
Numerous convenience functions are supplied for:
converting between coordinates relative to the genome and relative to the internal coordinates of a spliced SegmentChain
fetching genomic sequence, read alignments, or count data, accounting for splicing of the segments, and, in the case of reverse-strand features, reverse-complementing
slicing or fetching sub-regions of a SegmentChain
testing equality, inequality, overlap, containment, coverage of, or sharing of segments with other SegmentChain or GenomicSegment objects
import/export to BED, PSL, GTF2, and GFF3 formats, for use in other software packages or in a genome browser.
Intervals are sorted from lowest to greatest starting coordinate on their reference sequence, regardless of strand. Iteration over the SegmentChain will yield intervals from left to right in the genome.
- Parameters
- *segments : GenomicSegment
0 or more GenomicSegments on the same strand
- **attr : keyword arguments
Arbitrary attributes, including, for example:
Attribute | Description
type | A feature type used for GTF2/GFF3 export of each interval in the SegmentChain. (Default: ‘exon’)
ID | A unique ID for the SegmentChain.
transcript_id | A transcript ID used for GTF2 export
gene_id | A gene ID used for GTF2 export
See also
Transcript
Transcript subclass, additionally providing richer GTF2, GFF3, and BED export, as well as methods for fetching coding regions and UTRs as subsegments
- Attributes
- spanning_segment : GenomicSegment
A GenomicSegment spanning the endpoints of the SegmentChain
- strand : str
Strand of the SegmentChain
- chrom : str
Chromosome the SegmentChain resides on
- attr : dict
Dictionary of arbitrary attributes (see Parameters)
- segments : list
Copy of list of GenomicSegments that comprise self.
- mask_segments : list
Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.
Methods
add_masks(self, *mask_segments) | Adds one or more GenomicSegment to the collection of masks.
add_segments(self, *segments) | Add 1 or more GenomicSegments to the SegmentChain.
antisense_overlaps(self, other) | Returns True if self and other share genomic positions on opposite strands
as_bed(self[, thickstart, thickend, as_int, ...]) | Format SegmentChain as a string of BED12[+X] output.
as_gff3(self, unicode feature_type=None, ...) | Format self as a line of GFF3 output.
as_gtf(self, unicode feature_type=None, ...) | Format SegmentChain as a block of GTF2 output.
as_psl(self) | Formats SegmentChain as PSL (blat) output.
covers(self, other) | Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self.
from_bed(unicode line[, extra_columns]) | Create a SegmentChain from a line from a BED file.
from_psl(psl_line) | Create a SegmentChain from a line from a PSL (BLAT) file
from_str(unicode inp) | Create a SegmentChain from a string formatted by SegmentChain.__str__()
get_antisense(self) | Returns a SegmentChain antisense to self, with empty attr dict.
get_counts(self, ga[, stranded]) | Return array of counts or values drawn from ga at each position in self
get_fasta(self, genome[, stranded]) | Formats sequence of SegmentChain as FASTA output
get_gene(self) | Return name of gene associated with SegmentChain, if any, by searching through self.attr for the keys gene_id and Parent.
get_genomic_coordinate(self, x[, stranded]) | Finds genomic coordinate corresponding to position x in self
get_junctions(self) | Returns a list of GenomicSegments representing spaces between the GenomicSegments in self. In the case of a transcript, these would represent introns.
get_masked_counts(self, ga[, stranded, copy]) | Return counts covering self in dataset ga as a masked array, in transcript coordinates.
get_masked_position_set(self) | Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using SegmentChain.add_masks()
get_masks(self) | Return masked positions as a list of GenomicSegments
get_masks_as_segmentchain(self) | Return masked positions as a SegmentChain
get_name(self) | Returns the name of this SegmentChain, first searching through self.attr for the keys ID, Name, and name.
get_position_list(self) | Retrieve a sorted end-inclusive numpy array of genomic coordinates in this SegmentChain
get_position_set(self) | Retrieve an end-inclusive set of genomic coordinates included in this SegmentChain
get_segmentchain_coordinate(self, ...) | Finds the SegmentChain coordinate corresponding to a genomic position
get_sequence(self, genome[, stranded]) | Return spliced genomic sequence of SegmentChain as a string
get_subchain(self, long start, long end, ...) | Retrieves a sub-SegmentChain corresponding to a range of positions specified in coordinates relative to this SegmentChain.
get_unstranded(self) | Returns an unstranded SegmentChain covering the same positions as self, with empty attr dict.
next(self) | Return next GenomicSegment in the SegmentChain, from left to right on the chromosome
overlaps(self, other) | Return True if self and other share genomic positions on the same strand
reset_masks(self) | Removes masks added by add_masks()
shares_segments_with(self, other) | Returns a list of GenomicSegment that are shared between self and other
sort(self) |
unstranded_overlaps(self, other) | Return True if self and other share genomic positions on the same chromosome, regardless of their strands
- add_masks(self, *mask_segments)¶
Adds one or more GenomicSegment to the collection of masks. Masks will be trimmed to the positions of the SegmentChain during addition.
- Parameters
- mask_segments : GenomicSegment
One or more segments, in genomic coordinates, covering positions to exclude from return values of get_masked_position_set(), get_masked_counts(), or get_masked_length()
- add_segments(self, *segments)¶
Add 1 or more GenomicSegments to the SegmentChain. If there are already segments in the chain, the incoming segments must be on the same strand and chromosome as all others present.
- Parameters
- segments : GenomicSegment
One or more GenomicSegment to add to SegmentChain
- antisense_overlaps(self, other)¶
Returns True if self and other share genomic positions on opposite strands
- Parameters
- other : SegmentChain or GenomicSegment
Query feature
- Returns
- bool
True if self and other share genomic positions on the same chromosome but opposite strand; False otherwise.
- Raises
- TypeError
if other is not a GenomicSegment or SegmentChain
- as_bed(self, thickstart=None, thickend=None, as_int=True, color=None, extra_columns=None, empty_value='')¶
Format SegmentChain as a string of BED12[+X] output.
If the SegmentChain was imported as a BED file with extra columns, these will be output in the same order, after the BED columns.
- Parameters
- thickstart : int or None, optional
If not None, overrides the genome coordinate that starts thick plotting in genome browser found in self.attr[‘thickstart’]
- thickend : int or None, optional
If not None, overrides the genome coordinate that stops thick plotting in genome browser found in self.attr[‘thickend’]
- as_int : bool, optional
Force score to integer (Default: True)
- color : str or None, optional
Color represented as RGB hex string. If not None, overrides the color in self.attr[‘color’]
- extra_columns : None or list-like, optional
If None, and the SegmentChain was imported using the extra_columns keyword of from_bed(), the SegmentChain will be exported in BED 12+X format, in which extra columns are in the same order as they were upon import. If no extra columns were present, the SegmentChain will be exported as a BED12 line.
If a list of attribute names, these attributes will be exported as extra columns in order, overriding whatever happened upon import. If an attribute name is not in the attr dict of the SegmentChain, it will be exported with the value of empty_value.
If an empty list, no extra columns will be exported; the SegmentChain will be formatted as a BED12 line.
- empty_value : str, optional
Value to export for extra_columns that are not defined (Default: “”)
- Returns
- str
Line of BED12[+X]-formatted text
Notes
- BED12 columns are as follows:
Column | Contains
1 | Contig or chromosome
2 | Start of first block in feature (0-indexed)
3 | End of last block in feature (half-open)
4 | Feature name
5 | Feature score
6 | Strand
7 | thickstart (in chromosomal coordinates)
8 | thickend (in chromosomal coordinates)
9 | Feature color as RGB tuple
10 | Number of blocks in feature
11 | Block lengths
12 | Block starts, relative to start of first block
- For more details
See the UCSC file format faq
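To make columns 2, 3, and 10 through 12 concrete, the sketch below derives them from a list of 0-indexed, half-open (start, end) pairs. It is an independent illustration rather than plastid's code, and `bed_block_fields` is a hypothetical helper name.

```python
def bed_block_fields(segments):
    """Compute BED12 block fields from (start, end) pairs.

    Returns (chain_start, chain_end, block_count, block_sizes, block_starts),
    i.e. the values for BED columns 2, 3, 10, 11, and 12.
    """
    ordered = sorted(segments)
    chain_start = ordered[0][0]
    chain_end = ordered[-1][1]
    # Column 11: length of each block.
    block_sizes = [end - start for start, end in ordered]
    # Column 12: block starts, relative to the start of the first block.
    block_starts = [start - chain_start for start, _ in ordered]
    return chain_start, chain_end, len(ordered), block_sizes, block_starts
```

For the two-segment chain used in the module examples (chrA:5-200 and chrA:250-300), this gives the block sizes 195,50 and block starts 0,245 that appear in the `as_bed()` output shown there.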
- as_gff3(self, unicode feature_type=None, bool escape=True, list excludes=None)¶
Format self as a line of GFF3 output.
Because GFF3 files permit many schemas of parent-child hierarchy, and in order to reduce confusion and overhead, attempts to export a multi-interval SegmentChain will raise an AttributeError.
Instead, users may export the individual features from which the multi-interval SegmentChain was constructed, or construct features for them, setting ID, Parent, and type attributes following their own conventions.
- Parameters
- feature_type : str
If not None, overrides the type attribute of self.attr
- escape : bool, optional
Escape tokens in column 9 of GFF3 output (Default: True)
- excludes : list, optional
List of attribute key names to exclude from column 9 (Default: [])
- Returns
- str
Line of GFF3-formatted text
- Raises
- AttributeError
if the SegmentChain has multiple intervals
Notes
- Columns of GFF3 are as follows
Column | Contains
1 | Contig or chromosome
2 | Source of annotation
3 | Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4 | Start (1-indexed)
5 | End (fully-closed)
6 | Score
7 | Strand
8 | Frame. Number of bases within feature before first in-frame codon (if coding)
9 | Attributes
- For further information, see
- as_gtf(self, unicode feature_type=None, bool escape=True, list excludes=None)¶
Format SegmentChain as a block of GTF2 output.
The frame or phase attribute (GTF2 column 8) is valid only for ‘CDS’ features, and, if not present in self.attr, is calculated assuming the SegmentChain contains the entire coding region. If the SegmentChain contains multiple intervals, the frame or phase attribute will always be recalculated.
All attributes in self.attr, except those created upon import, will be propagated to all of the features that are generated.
- Parameters
- feature_type : str
If not None, overrides the “type” attribute of self.attr
- escape : bool, optional
Escape tokens in column 9 of GTF output (Default: True)
- excludes : list, optional
List of attribute key names to exclude from column 9 (Default: [])
- Returns
- str
Block of GTF2-formatted text
Notes
- gene_id and transcript_id are required
The GTF2 specification requires that attributes gene_id and transcript_id be defined. If these are not present in self.attr, their values will be guessed following the rules in SegmentChain.get_gene() and SegmentChain.get_name(), respectively.
- Beware of attribute loss
To save memory, only the attributes shared by all of the individual sub-features (e.g. exons) that were used to assemble this Transcript have been stored in self.attr. This means that upon re-export to GTF2, these sub-features will be lacking any attributes that were specific to them individually. Formally, this is compliant with the GTF2 specification, which states explicitly that only the attributes gene_id and transcript_id are supported.
- Columns of GTF2 are as follows
Column | Contains
1 | Contig or chromosome
2 | Source of annotation
3 | Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4 | Start (1-indexed)
5 | End (fully-closed)
6 | Score
7 | Strand
8 | Frame. Number of bases within feature before first in-frame codon (if coding)
9 | Attributes. “gene_id” and “transcript_id” are required
- For more info
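Columns 4 and 5 use 1-indexed, fully-closed coordinates, whereas segments in this module are 0-indexed and half-open. A minimal sketch of the conversion follows; the helper names are hypothetical and this is not plastid's own code.

```python
def to_one_based_closed(start, end):
    """Convert a 0-indexed, half-open interval to 1-indexed, fully-closed."""
    return start + 1, end


def to_zero_based_half_open(start, end):
    """Convert a 1-indexed, fully-closed interval to 0-indexed, half-open."""
    return start - 1, end
```

Under this conversion the segments chrA:5-200 and chrA:250-300 become 6-200 and 251-300, which matches the exon lines in the `as_gtf()` output shown in the module examples.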
- as_psl(self)¶
Formats SegmentChain as PSL (blat) output.
- Returns
- str
PSL-representation of BLAT alignment
- Raises
- AttributeError
If not all of the attributes listed in the Notes below are defined
Notes
This will raise an AttributeError unless the following keys are present and defined in self.attr, corresponding to the columns of a PSL file:
Column | Key
1 | match_length
2 | mismatches
3 | rep_matches
4 | N
5 | query_gap_count
6 | query_gap_bases
7 | target_gap_count
8 | target_gap_bases
9 | strand
10 | query_name
11 | query_length
12 | query_start
13 | query_end
14 | target_name
15 | target_length
16 | target_start
17 | target_end
19 | q_starts: list of integers
20 | l_starts: list of integers
These keys are defined only if the SegmentChain was created by SegmentChain.from_psl(), or if the user has defined them.
See the PSL spec for more information.
- covers(self, other)¶
Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self. By convention, zero-length SegmentChains are not covered by other chains.
- Parameters
- other : SegmentChain or GenomicSegment
Query feature
- Returns
- bool
True if self and other share a chromosome and strand, and all genomic positions in other are present in self. Otherwise False
- Raises
- TypeError
if other is not a GenomicSegment or SegmentChain
- static from_bed(unicode line, extra_columns=0)¶
Create a SegmentChain from a line from a BED file. The BED line may contain 4 to 12 columns, per the specification. These will be auto-detected and parsed appropriately.
See the UCSC file format faq for more details.
- Parameters
- line
Line from a BED file, containing 4 or more columns
- extra_columns : int or list, optional
Extra, non-BED columns in an extended BED format file, corresponding to feature attributes. This is common in ENCODE-specific BED variants.
If extra_columns is:
an int: it is taken to be the number of attribute columns. Attributes will be stored in the attr dictionary of the SegmentChain, under names like custom0, custom1, … , customN.
a list of str: it is taken to be the names of the attribute columns, in order, from left to right in the file. In this case, attributes in extra columns will be stored under their respective names in the attr dict.
a list of tuple: each tuple is taken to be a pair of (attribute_name, formatter_func). In this case, the value of attribute_name in the attr dict of the SegmentChain will be set to formatter_func(column_value).
(Default: 0)
- Returns
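The three accepted forms of extra_columns can be illustrated with a small stand-alone parser. This is a sketch of the behavior described above, not plastid's implementation; `parse_extra_columns` is a hypothetical helper, and leaving unformatted values as strings is an assumption.

```python
def parse_extra_columns(values, extra_columns=0):
    """Map extra BED column values to attribute names.

    `values` is the list of raw strings found after the standard BED columns.
    """
    if isinstance(extra_columns, int):
        # An int: generate names custom0, custom1, ...
        specs = [("custom%d" % i, str) for i in range(extra_columns)]
    else:
        specs = []
        for item in extra_columns:
            if isinstance(item, str):
                # A bare name: keep the raw string value (assumption).
                specs.append((item, str))
            else:
                # A (attribute_name, formatter_func) pair.
                name, formatter = item
                specs.append((name, formatter))
    return {name: formatter(value) for (name, formatter), value in zip(specs, values)}
```

For example, `parse_extra_columns(["1.5", "abc"], [("signal", float), ("label", str)])` stores the first value as a float under the hypothetical name `signal`.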
- static from_psl(psl_line)¶
Create a SegmentChain from a line from a PSL (BLAT) file
See the PSL spec
- Parameters
- psl_linestr
Line from a PSL file
- Returns
- static from_str(unicode inp)¶
Create a SegmentChain from a string formatted by SegmentChain.__str__():
chrom:start-end^start-end(strand)
where ‘^’ indicates a splice junction between regions specified by start and end and strand is ‘+’, ‘-’, or ‘.’. Coordinates are 0-indexed and half-open.
- Parameters
- inp : str
String formatted in manner of SegmentChain.__str__()
- Returns
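The string format above is simple enough to parse with a regular expression. The following is a stand-alone sketch for illustration; `parse_chain_str` is a hypothetical helper that returns plain tuples rather than plastid objects.

```python
import re

# chrom:start-end^start-end(strand), with one or more start-end spans.
_CHAIN_PATTERN = re.compile(
    r"^(?P<chrom>.+):(?P<spans>\d+-\d+(?:\^\d+-\d+)*)\((?P<strand>[+\-.])\)$"
)


def parse_chain_str(text):
    """Return (chrom, [(start, end), ...], strand) parsed from `text`."""
    match = _CHAIN_PATTERN.match(text)
    if match is None:
        raise ValueError("expected chrom:start-end^start-end(strand), got %r" % text)
    segments = []
    for span in match.group("spans").split("^"):
        start, end = span.split("-")
        segments.append((int(start), int(end)))
    return match.group("chrom"), segments, match.group("strand")
```

A single-segment string such as `chrA:50-100(+)` parses the same way, yielding a one-element segment list.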
- get_antisense(self) SegmentChain¶
Returns a SegmentChain antisense to self, with empty attr dict.
- Returns
- SegmentChain
SegmentChain antisense to self
- get_counts(self, ga, stranded=True)¶
Return array of counts or values drawn from ga at each position in self
- Parameters
- ga
GenomeArray from which to fetch counts
- stranded : bool, optional
If True and self is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)
- Returns
- numpy.ndarray
Array of counts from ga covering self
- get_fasta(self, genome, stranded=True)¶
Formats sequence of SegmentChain as FASTA output
- Parameters
- genome : dict or twobitreader.TwoBitFile
Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects
- stranded : bool
If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)
- Returns
- str
FASTA-formatted sequence of SegmentChain extracted from genome
- get_gene(self)¶
Return name of gene associated with SegmentChain, if any, by searching through self.attr for the keys gene_id and Parent. If one is not found, a generated gene name for the SegmentChain is made from get_name().
- Returns
- str
Returns, in order of preference, gene_id from self.attr, Parent from self.attr, or 'gene_%s' % self.get_name()
- get_genomic_coordinate(self, x, stranded=True)¶
Finds genomic coordinate corresponding to position x in self
- Parameters
- x : int
position of interest, relative to SegmentChain
- stranded : bool, optional
If True, x is assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, coordinates are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)
- Returns
- str
Chromosome name
- long
Genomic coordinate corresponding to position x
- str
Chromosome strand (‘+’, ‘-’, or ‘.’)
- Raises
- IndexError
if x is outside the bounds of the SegmentChain
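The stranded lookup can be pictured as indexing into the chain's genomic positions, reading from the right for minus-strand chains. The sketch below illustrates that idea on plain (start, end) tuples; it is not how plastid implements the method, and `genomic_coordinate` is a hypothetical name.

```python
def genomic_coordinate(segments, strand, x, stranded=True):
    """Return the genomic coordinate of chain position `x`.

    `segments` holds 0-indexed, half-open (start, end) pairs on one chromosome.
    """
    # All genomic positions covered by the chain, from left to right.
    positions = [p for start, end in sorted(segments) for p in range(start, end)]
    if not 0 <= x < len(positions):
        raise IndexError("position %d is outside the chain" % x)
    if stranded and strand == "-":
        # On the minus strand the 5' end is the rightmost position.
        x = len(positions) - 1 - x
    return positions[x]
```

With the segments 5-200 and 250-300 on the minus strand, positions 49 and 50 map to 250 and 199, as in the module examples.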
- get_junctions(self)¶
Returns a list of GenomicSegments representing spaces between the GenomicSegments in self. In the case of a transcript, these would represent introns. In the case of an alignment, these would represent gaps in the query compared to the reference.
- Returns
- list
List of GenomicSegments covering spaces between the intervals in self (e.g. introns in the case of a transcript, or gaps in the case of an alignment)
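Junctions are simply the gaps between consecutive segments once they are sorted. A minimal sketch on plain (start, end) tuples follows, assuming non-overlapping segments; `junctions` is a hypothetical helper, not the plastid method.

```python
def junctions(segments):
    """Return the half-open gaps between consecutive (start, end) segments."""
    ordered = sorted(segments)
    return [
        (left_end, right_start)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    ]
```

A chain with a single segment has no junctions, so the result is an empty list.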
- get_masked_counts(self, ga, stranded=True, copy=False)¶
Return counts covering self in dataset ga as a masked array, in transcript coordinates. Positions masked by SegmentChain.add_masks() will be masked in the array
- Parameters
- ga : non-abstract subclass of AbstractGenomeArray
GenomeArray from which to fetch counts
- stranded : bool, optional
If True and the SegmentChain is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)
- copy : bool, optional
If False (default) returns a view of the data; so changing values in the view changes the values in the GenomeArray if it is mutable. If True, a copy is returned instead.
- Returns
numpy.ma.masked_array
- get_masked_position_set(self) set¶
Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using SegmentChain.add_masks()
- Returns
- set
Set of genomic coordinates, as integers
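In set terms, the result is the chain's positions minus the masked positions. The sketch below shows this on plain (start, end) tuples; it is illustrative only, and `unmasked_position_set` is a hypothetical name.

```python
def unmasked_position_set(segments, masks):
    """Return genomic positions in `segments` not covered by any mask."""
    positions = {p for start, end in segments for p in range(start, end)}
    masked = {p for start, end in masks for p in range(start, end)}
    # Mask positions outside the chain simply have no effect.
    return positions - masked
```

Mask positions that fall outside the chain are ignored here, which mirrors the note that masks are trimmed to the positions of the SegmentChain when they are added.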
- get_masks(self)¶
Return masked positions as a list of GenomicSegments
- Returns
- list
list of GenomicSegments representing masked positions
- get_masks_as_segmentchain(self)¶
Return masked positions as a SegmentChain
- Returns
SegmentChain
Masked positions
- get_name(self)¶
Returns the name of this SegmentChain, first searching through self.attr for the keys ID, Name, and name. If no value is found for any of those keys, a name is generated using SegmentChain.__str__()
- Returns
- str
In order of preference, ID from self.attr, Name from self.attr, name from self.attr, or str(self)
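The fallback order used for names, and the related order described for get_gene(), can be sketched as simple dictionary lookups. The helper names below are hypothetical, and `fallback` stands in for the value of `str(self)`.

```python
def pick_name(attr, fallback):
    """Return attr['ID'], attr['Name'], or attr['name'], else `fallback`."""
    for key in ("ID", "Name", "name"):
        if key in attr:
            return attr[key]
    return fallback


def pick_gene(attr, fallback):
    """Return attr['gene_id'] or attr['Parent'], else a generated gene name."""
    for key in ("gene_id", "Parent"):
        if key in attr:
            return attr[key]
    return "gene_%s" % pick_name(attr, fallback)
```

For a chain whose only identifying attribute is `ID="some_chain"`, this produces the gene name `gene_some_chain`, consistent with the `as_gtf()` output in the module examples.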
- get_position_list(self)¶
Retrieve a sorted end-inclusive numpy array of genomic coordinates in this SegmentChain
- Returns
- list
Genomic coordinates in self, as integers, in genomic order
- get_position_set(self)¶
Retrieve an end-inclusive set of genomic coordinates included in this SegmentChain
- Returns
- set
Set of genomic coordinates, as integers
- get_segmentchain_coordinate(self, unicode chrom, long genomic_x, unicode strand, bool stranded=True)¶
Finds the SegmentChain coordinate corresponding to a genomic position
- Parameters
- chromstr
Chromosome name
- genomic_xint
coordinate, in genomic space
- strandstr
Chromosome strand (‘+’, ‘-’, or ‘.’)
- strandedbool, optional
If True, coordinates are given in stranded space (i.e. from 5’ end of chain, as one might expect for a transcript). If False, coordinates are given from the left end of self, regardless of strand. (Default: True)
- Returns
- int
Position in SegmentChain
- Raises
- KeyError
if position is outside the bounds of the SegmentChain
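The reverse lookup walks the segments from left to right, accumulating their lengths until the query position is found. This sketch works on plain (start, end) tuples and is not plastid's implementation; `chain_coordinate` is a hypothetical name.

```python
def chain_coordinate(segments, strand, genomic_x, stranded=True):
    """Return the chain position of genomic coordinate `genomic_x`."""
    total_length = sum(end - start for start, end in segments)
    offset = 0  # chain positions consumed by segments to the left
    for start, end in sorted(segments):
        if start <= genomic_x < end:
            x = offset + (genomic_x - start)
            if stranded and strand == "-":
                # Count from the 5' end, which is the right end on the minus strand.
                x = total_length - 1 - x
            return x
        offset += end - start
    raise KeyError("genomic position %d is not in the chain" % genomic_x)
```

Positions that fall in a gap between segments raise `KeyError`, matching the exception type documented above.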
- get_sequence(self, genome, stranded=True)¶
Return spliced genomic sequence of SegmentChain as a string
- Parameters
- genome : dict or twobitreader.TwoBitFile
Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects
- stranded : bool
If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)
- Returns
- str
Nucleotide sequence of the SegmentChain extracted from genome
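Splicing and reverse-complementing can be sketched with plain strings. The example below assumes the genome is a dict of ordinary Python strings and handles only the bases A, C, G, T, and N; `spliced_sequence` is a hypothetical helper, not the plastid method.

```python
# Translation table for complementing nucleotides (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def spliced_sequence(chrom, segments, strand, genome, stranded=True):
    """Join the sequence of each (start, end) segment, left to right."""
    seq = "".join(genome[chrom][start:end] for start, end in sorted(segments))
    if stranded and strand == "-":
        # Reverse-complement so the result reads 5' to 3' on the minus strand.
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq
```

With `genome = {"chrA": "AACCGGTTAC"}` and segments 0-2 and 4-6, the plus-strand result is `AAGG` and the minus-strand result is `CCTT`.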
- get_subchain(self, long start, long end, bool stranded=True, **extra_attr)¶
Retrieves a sub-SegmentChain corresponding to a range of positions specified in coordinates relative to this SegmentChain. Attributes in self.attr are copied to the child SegmentChain, with the exception of ID, to which the suffix ‘subchain’ is appended.
- Parameters
- start : int
position of interest in SegmentChain coordinates, 0-indexed
- end : int
position of interest in SegmentChain coordinates, 0-indexed and half-open
- stranded : bool, optional
If True, start and end are assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, they are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)
- extra_attr : keyword arguments
Values that will be included in the subchain’s attr dict. These can be used to overwrite values already present.
- Returns
SegmentChain covering parent chain positions start to end of self
- Raises
- IndexError
if start or end is outside the bounds of the SegmentChain
- TypeError
if start or end is None
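One way to picture subchain extraction is to slice the chain's list of genomic positions and then merge consecutive positions back into segments. The sketch below does this on plain (start, end) tuples; it is an illustration under that simplification, and `subchain_segments` is a hypothetical name.

```python
def subchain_segments(segments, strand, start, end, stranded=True):
    """Return (start, end) segments covering chain positions start..end."""
    positions = [p for s, e in sorted(segments) for p in range(s, e)]
    length = len(positions)
    if not 0 <= start <= end <= length:
        raise IndexError("start or end is outside the chain")
    if stranded and strand == "-":
        # Convert 5'-relative positions to left-relative positions.
        start, end = length - end, length - start
    merged = []
    for p in positions[start:end]:
        if merged and merged[-1][1] == p:
            merged[-1][1] = p + 1  # extend the current run
        else:
            merged.append([p, p + 1])  # start a new segment
    return [tuple(run) for run in merged]
```

For the minus-strand chain with segments 5-200 and 250-300, positions 45 to 70 yield the segments 180-200 and 250-255, as in the module examples.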
- get_unstranded(self) SegmentChain¶
Returns an unstranded SegmentChain covering the same positions as self, with empty attr dict.
- Returns
- SegmentChain
Unstranded SegmentChain covering the same positions as self
- next(self)¶
Return next GenomicSegment in the SegmentChain, from left to right on the chromosome
- overlaps(self, other)¶
Return True if self and other share genomic positions on the same strand
- Parameters
- other : SegmentChain or GenomicSegment
Query feature
- Returns
- bool
True if self and other share genomic positions on the same chromosome and strand; False otherwise.
- Raises
- TypeError
if other is not a GenomicSegment or SegmentChain
- reset_masks(self)¶
Removes masks added by add_masks()
- shares_segments_with(self, other)¶
Returns a list of GenomicSegment that are shared between self and other
- Parameters
- other : SegmentChain or GenomicSegment
Query feature
- Returns
- list
List of GenomicSegments common to self and other
- Raises
- TypeError
if other is not a GenomicSegment or SegmentChain
- sort(self)¶
- unstranded_overlaps(self, other)¶
Return True if self and other share genomic positions on the same chromosome, regardless of their strands
- Parameters
- other : SegmentChain or GenomicSegment
Query feature
- Returns
- bool
True if self and other share genomic positions on the same chromosome, False otherwise. Strands of self and other need not match
- Raises
- TypeError
if other is not a GenomicSegment or SegmentChain
- attr¶
attr: dict
- c_strand¶
- chrom¶
Chromosome the SegmentChain resides on
- length¶
- mask_segments¶
Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.
- masked_length¶
- segments¶
Copy of list of GenomicSegments that comprise self. Changing this list will do nothing to self.
- spanning_segment¶
- strand¶
Strand of the SegmentChain
- class plastid.genomics.roitools.Transcript(*segments, **attributes)¶
Bases: plastid.genomics.roitools.SegmentChain
Subclass of SegmentChain specifically for RNA transcripts. In addition to coordinate-conversion, count fetching, sequence fetching, and various other methods inherited from SegmentChain, Transcript provides convenience methods for fetching sub-chains corresponding to CDS features, 5’ UTRs, and 3’ UTRs.
- Parameters
- *segments : GenomicSegment
0 or more GenomicSegments on the same strand
- **attr : keyword arguments
Arbitrary attributes, including, for example:
Attribute | Description
cds_genome_start | Location of CDS start, in genomic coordinates
cds_genome_end | Location of CDS end, in genomic coordinates
ID | A unique ID for the SegmentChain.
transcript_id | A transcript ID used for GTF2 export
gene_id | A gene ID used for GTF2 export
- Attributes
- cds_genome_start : int or None
Starting coordinate of coding region, relative to genome
- cds_genome_end : int or None
Ending coordinate of coding region, relative to genome
- cds_start : int or None
Start of coding region relative to 5’ end of transcript, in direction of transcript.
- cds_end : int or None
End of coding region relative to 5’ end of transcript, in direction of transcript.
- spanning_segment : GenomicSegment
A GenomicSegment spanning the endpoints of the Transcript
- strand : str
Strand of the SegmentChain
- chrom : str
Chromosome the SegmentChain resides on
- segments : list
Copy of list of GenomicSegments that comprise self.
- mask_segments : list
Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.
- attr : dict
attr: dict
Methods
add_masks(self, *mask_segments): Adds one or more GenomicSegments to the collection of masks.
add_segments(self, *segments): Add 1 or more GenomicSegments to the SegmentChain.
antisense_overlaps(self, other): Returns True if self and other share genomic positions on opposite strands
as_bed(self[, as_int, color, extra_columns, ...]): Format self as a BED12[+X] line, assigning CDS boundaries to the thickstart and thickend columns from self.attr
as_gff3(self, bool escape=True, ...): Format a Transcript as a block of GFF3 output, following the schema set out in the Sequence Ontology (SO) v2.53
as_gtf(self, unicode feature_type=u'exon', ...): Format self as a GTF2 block.
as_psl(self): Formats SegmentChain as PSL (blat) output.
covers(self, other): Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self.
from_bed(unicode line[, extra_columns]): Create a Transcript from a BED line with 4 or more columns.
from_psl(unicode psl_line)
from_str(unicode inp): Create a SegmentChain from a string formatted by SegmentChain.__str__()
get_antisense(self): Returns a SegmentChain antisense to self, with empty attr dict.
get_cds(self, **extra_attr): Retrieve SegmentChain covering the coding region of self, including the stop codon.
get_counts(self, ga[, stranded]): Return array of counts or values drawn from ga at each position in self
get_fasta(self, genome[, stranded]): Formats sequence of SegmentChain as FASTA output
get_gene(self): Return name of gene associated with SegmentChain, if any, by searching through self.attr for the keys gene_id and Parent.
get_genomic_coordinate(self, x[, stranded]): Finds genomic coordinate corresponding to position x in self
get_junctions(self): Returns a list of GenomicSegments representing spaces between the GenomicSegments in self. In the case of a transcript, these would represent introns.
get_masked_counts(self, ga[, stranded, copy]): Return counts covering self in dataset ga as a masked array, in transcript coordinates.
get_masked_position_set(self): Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using SegmentChain.add_masks()
get_masks(self): Return masked positions as a list of GenomicSegments
get_masks_as_segmentchain(self): Return masked positions as a SegmentChain
get_name(self): Return the name of self, first searching through self.attr for the keys transcript_id, ID, Name, and name.
get_position_list(self): Retrieve a sorted end-inclusive numpy array of genomic coordinates in this SegmentChain
get_position_set(self): Retrieve an end-inclusive set of genomic coordinates included in this SegmentChain
get_segmentchain_coordinate(self, ...): Finds the SegmentChain coordinate corresponding to a genomic position
get_sequence(self, genome[, stranded]): Return spliced genomic sequence of SegmentChain as a string
get_subchain(self, long start, long end, ...): Retrieves a sub-SegmentChain corresponding to a range of positions specified in coordinates relative to this SegmentChain.
get_unstranded(self): Returns an unstranded SegmentChain covering the same positions as self, with empty attr dict.
get_utr3(self, **extra_attr): Retrieve sub-SegmentChain covering 3’UTR of self, excluding the stop codon.
get_utr5(self, **extra_attr): Retrieve sub-SegmentChain covering 5’UTR of self.
next(self): Return next GenomicSegment in the SegmentChain, from left to right on the chromosome
overlaps(self, other): Return True if self and other share genomic positions on the same strand
reset_masks(self): Removes masks added by add_masks()
shares_segments_with(self, other): Returns a list of GenomicSegments that are shared between self and other
sort(self)
unstranded_overlaps(self, other): Return True if self and other share genomic positions on the same chromosome, regardless of their strands
- add_masks(self, *mask_segments)¶
Adds one or more GenomicSegments to the collection of masks. Masks will be trimmed to the positions of the SegmentChain during addition.
- Parameters
- mask_segments : GenomicSegment
One or more segments, in genomic coordinates, covering positions to exclude from return values of get_masked_position_set(), get_masked_counts(), or get_masked_length()
- add_segments(self, *segments)¶
Add 1 or more GenomicSegments to the SegmentChain. If there are already segments in the chain, the incoming segments must be on the same strand and chromosome as all others present.
- Parameters
- segments : GenomicSegment
One or more GenomicSegments to add to the SegmentChain
- antisense_overlaps(self, other)¶
Returns True if self and other share genomic positions on opposite strands
- Parameters
- other : SegmentChain or GenomicSegment
Query feature
- Returns
- bool
True if self and other share genomic positions on the same chromosome but opposite strand; False otherwise.
- Raises
- TypeError
if other is not a GenomicSegment or SegmentChain
- as_bed(self, as_int=True, color=None, extra_columns=None, empty_value='')¶
Format self as a BED12[+X] line, assigning CDS boundaries to the thickstart and thickend columns from self.attr.
If the SegmentChain was imported as a BED file with extra columns, these will be output in the same order, after the BED columns.
- Parameters
- as_int : bool, optional
Force “score” to integer (Default: True)
- color : str or None, optional
Color represented as RGB hex string. If not None, overrides the color in self.attr[“color”]
- extra_columns : None or list-like, optional
If None, and the SegmentChain was imported using the extra_columns keyword of from_bed(), the SegmentChain will be exported in BED 12+X format, in which extra columns are in the same order as they were upon import. If no extra columns were present, the SegmentChain will be exported as a BED12 line.
If a list of attribute names, these attributes will be exported as extra columns in order, overriding whatever happened upon import. If an attribute name is not in the attr dict of the SegmentChain, it will be exported with the value of empty_value.
If an empty list, no extra columns will be exported; the SegmentChain will be formatted as a BED12 line.
- empty_value : str, optional
- Returns
- str
Line of BED12-formatted text
Notes
- BED12 columns are as follows
Column  Contains
0       Contig or chromosome
1       Start of first block in feature (0-indexed)
2       End of last block in feature (half-open)
3       Feature name
4       Feature score
5       Strand
6       thickstart
7       thickend
8       Feature color as RGB tuple
9       Number of blocks in feature
10      Block lengths
11      Block starts, relative to start of first block
- For more information
See the UCSC file format faq
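As an illustration of the BED12 columns listed above, the following sketch assembles one BED12 line from a list of half-open blocks. It is not plastid's implementation: the function name format_bed12 and its arguments are hypothetical, and setting thickstart and thickend to the feature start when no CDS is given is an assumption of this sketch.

```python
def format_bed12(chrom, blocks, strand, name="feature", score=0,
                 thickstart=None, thickend=None, color="0,0,0"):
    """Hypothetical helper: build one BED12 line from half-open (start, end) blocks."""
    blocks = sorted(blocks)
    start = blocks[0][0]    # column 1: start of first block, 0-indexed
    end = blocks[-1][1]     # column 2: end of last block, half-open
    # Assumption of this sketch: with no CDS, both thick columns equal the start
    thickstart = start if thickstart is None else thickstart
    thickend = start if thickend is None else thickend
    # Columns 10 and 11: comma-separated block lengths and relative block starts
    sizes = ",".join(str(e - s) for s, e in blocks) + ","
    starts = ",".join(str(s - start) for s, e in blocks) + ","
    fields = [chrom, start, end, name, score, strand,
              thickstart, thickend, color, len(blocks), sizes, starts]
    return "\t".join(str(f) for f in fields)


line = format_bed12("chrA", [(5, 200), (250, 300)], "-", name="some_chain")
print(line)
```

Block starts are stored relative to the feature start, which is why the second block above is written as 245 rather than 250.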
- as_gff3(self, bool escape=True, list excludes=None, unicode rna_type=u'mRNA')¶
Format a Transcript as a block of GFF3 output, following the schema set out in the Sequence Ontology (SO) v2.53.
The Transcript will be formatted according to the following rules:
- A feature of type rna_type will be created, with Parent attribute set to the value of self.get_gene(), and ID attribute set to self.get_name().
- For each GenomicSegment in self, a child feature of type exon will be created. The Parent attribute of these features will be set to the value of self.get_name(). These will have unique IDs generated from self.get_name().
- If self is coding (i.e. has non-None values for self.cds_genome_start and self.cds_genome_end), child features of type ‘five_prime_UTR’, ‘CDS’, and ‘three_prime_UTR’ will be created, with Parent attributes set to self.get_name(). These will have unique IDs generated from self.get_name().
- Parameters
- escapebool, optional
Escape tokens in column 9 of GFF3 output (Default: True)
- excludeslist, optional
List of attribute key names to exclude from column 9 (Default: [])
- rna_typestr, optional
Feature type to export RNA as (e.g. ‘tRNA’, ‘noncoding_RNA’, etc. Default: ‘mRNA’)
- Returns
- str
Multiline block of GFF3-formatted text
Notes
- Beware of attribute loss
This Transcript was assembled from multiple individual component features (e.g. single exons), which may or may not have had their own unique attributes in their original annotation. To reduce overhead, these individual attributes (if they were present) have not been (entirely) stored, and consequently will not (all) be exported. If this poses problems, consider instead importing, modifying, and exporting the component features.
- GFF3 schemas vary
Different GFF3s have different schemas (parent-child relationships between features). Here we adopt the commonly-used schema set by Sequence Ontology (SO) v2.53, which may or may not match your schema.
- Columns of GFF3 are as follows
Column  Contains
1       Contig or chromosome
2       Source of annotation
3       Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4       Start (1-indexed)
5       End (fully-closed)
6       Score
7       Strand
8       Frame. Number of bases within feature before first in-frame codon (if coding)
9       Attributes
- For further information, see
Sequence Ontology (SO) v2.53 <http://www.sequenceontology.org/browser/>
- as_gtf(self, unicode feature_type=u'exon', bool escape=True, list excludes=None)¶
Format self as a GTF2 block.
GenomicSegments are formatted as GTF2 ‘exon’ features. Coding regions, if present, are formatted as GTF2 ‘CDS’ features. Stop codons are excluded from the ‘CDS’ features, per the GTF2 specification, and exported separately.
All attributes from self.attr are propagated to the exon and CDS features that are generated.
- Parameters
- feature_type : str
If not None, overrides the ‘type’ attribute of self.attr
- escape : bool, optional
URL escape tokens in column 9 of GTF2 output (Default: True)
- Returns
- str
Block of GTF2-formatted text
Notes
- gene_id and transcript_id are required
The GTF2 specification requires that attributes gene_id and transcript_id be defined. If these are not present in self.attr, their values will be guessed following the rules in SegmentChain.get_gene() and SegmentChain.get_name(), respectively.
- Beware of attribute loss
To save memory, only the attributes shared by all of the individual sub-features (e.g. exons) that were used to assemble this Transcript have been stored in self.attr. This means that upon re-export to GTF2, these sub-features will be lacking any attributes that were specific to them individually. Formally, this is compliant with the GTF2 specification, which states explicitly that only the attributes gene_id and transcript_id are supported.
Columns of GTF2 are as follows:
Column  Contains
1       Contig or chromosome
2       Source of annotation
3       Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4       Start (1-indexed)
5       End (fully-closed)
6       Score
7       Strand
8       Frame. Number of bases within feature before first in-frame codon (if coding)
9       Attributes. “gene_id” and “transcript_id” are required
- For more info
- as_psl(self)¶
Formats SegmentChain as PSL (blat) output.
- Returns
- str
PSL-representation of BLAT alignment
- Raises
- AttributeError
If not all of the attributes listed in the Notes below are defined
Notes
This will raise an AttributeError unless the following keys are present and defined in self.attr, corresponding to the columns of a PSL file:
Column  Key
1       match_length
2       mismatches
3       rep_matches
4       N
5       query_gap_count
6       query_gap_bases
7       target_gap_count
8       target_gap_bases
9       strand
10      query_name
11      query_length
12      query_start
13      query_end
14      target_name
15      target_length
16      target_start
17      target_end
19      q_starts: list of integers
20      l_starts: list of integers
These keys are defined only if the SegmentChain was created by SegmentChain.from_psl(), or if the user has defined them.
See the PSL spec for more information.
- covers(self, other)¶
Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self. By convention, zero-length SegmentChains are not covered by other chains.
- Parameters
- other : SegmentChain or GenomicSegment
Query feature
- Returns
- bool
True if self and other share a chromosome and strand, and all genomic positions in other are present in self. Otherwise False
- Raises
- TypeError
if other is not a GenomicSegment or SegmentChain
- static from_bed(unicode line, extra_columns=0)¶
Create a Transcript from a BED line with 4 or more columns. thickstart and thickend columns, if present, are assumed to specify CDS boundaries, a convention that, while common, is formally outside the BED specification.
See the UCSC file format faq for more details.
- Parameters
- line
Line from a BED file with at least 4 columns
- extra_columns : int or list, optional
Extra, non-BED columns in BED file corresponding to feature attributes. This is common in ENCODE-specific BED variants.
If extra_columns is:
- an int: it is taken to be the number of attribute columns. Attributes will be stored in the attr dictionary of the SegmentChain, under names like custom0, custom1, … , customN.
- a list of str: it is taken to be the names of the attribute columns, in order, from left to right in the file. In this case, attributes in extra columns will be stored under those names in the attr dict.
- a list of tuple: each tuple is taken to be a pair of (attribute_name, formatter_func). In this case, the value of attribute_name in the attr dict of the SegmentChain will be set to formatter_func(column_value).
(Default: 0)
- Returns
- static from_psl(unicode psl_line)¶
- static from_str(unicode inp)¶
Create a SegmentChain from a string formatted by SegmentChain.__str__():
chrom:start-end^start-end(strand)
where ‘^’ indicates a splice junction between regions specified by start and end, and strand is ‘+’, ‘-’, or ‘.’. Coordinates are 0-indexed and half-open.
- Parameters
- inp : str
String formatted in the manner of SegmentChain.__str__()
- Returns
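The string format described for from_str() can be illustrated with a small standalone parser. This is a sketch only, not the library's parsing code: the names CHAIN_PATTERN and parse_chain_string are hypothetical, and the regular expression is an assumption based on the format shown above.

```python
import re

# Assumed pattern for "chrom:start-end^start-end(strand)"
CHAIN_PATTERN = re.compile(
    r"^(?P<chrom>.+):(?P<spans>[0-9^-]+)\((?P<strand>[+\-.])\)$"
)


def parse_chain_string(text):
    """Hypothetical parser returning (chrom, [(start, end), ...], strand)."""
    match = CHAIN_PATTERN.match(text)
    if match is None:
        raise ValueError("not a chain string: %r" % text)
    spans = []
    # '^' separates the spliced regions; each region is "start-end"
    for span in match.group("spans").split("^"):
        start, end = span.split("-")
        spans.append((int(start), int(end)))   # 0-indexed, half-open
    return match.group("chrom"), spans, match.group("strand")


parsed = parse_chain_string("chrA:5-200^250-300(-)")
print(parsed)
```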
- get_antisense(self) SegmentChain¶
Returns a SegmentChain antisense to self, with empty attr dict.
- Returns
- SegmentChain
SegmentChain antisense to self
- get_cds(self, **extra_attr)¶
Retrieve SegmentChain covering the coding region of self, including the stop codon. If no coding region is present, returns an empty SegmentChain.
The following attributes are passed from self.attr to the new SegmentChain:
- transcript_id, taken from SegmentChain.get_name()
- gene_id, taken from SegmentChain.get_gene()
- ID, generated as “%s_CDS” % self.get_name()
- Parameters
- extra_attr : keyword arguments
Values that will be included in the CDS subchain’s attr dict. These can be used to overwrite values already present.
- Returns
- SegmentChain
CDS region of self if present, otherwise empty SegmentChain
- get_counts(self, ga, stranded=True)¶
Return array of counts or values drawn from ga at each position in self
- Parameters
- ga : GenomeArray
GenomeArray from which to fetch counts
- stranded : bool, optional
If True and self is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)
- Returns
- numpy.ndarray
Array of counts from ga covering self
- get_fasta(self, genome, stranded=True)¶
Formats sequence of SegmentChain as FASTA output
- Parameters
- genome : dict or twobitreader.TwoBitFile
Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects
- stranded : bool
If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)
- Returns
- str
FASTA-formatted sequence of SegmentChain extracted from genome
- get_gene(self)¶
Return name of gene associated with SegmentChain, if any, by searching through self.attr for the keys gene_id and Parent. If one is not found, a generated gene name for the SegmentChain is made from get_name().
- Returns
- str
Returns, in order of preference, gene_id from self.attr, Parent from self.attr, or 'gene_%s' % self.get_name()
- get_genomic_coordinate(self, x, stranded=True)¶
Finds genomic coordinate corresponding to position x in self
- Parameters
- x : int
Position of interest, relative to SegmentChain
- stranded : bool, optional
If True, x is assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, coordinates are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)
- Returns
- str
Chromosome name
- long
Genomic coordinate corresponding to position x
- str
Chromosome strand (‘+’, ‘-’, or ‘.’)
- Raises
- IndexError
if x is outside the bounds of the SegmentChain
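The stranded conversion described above can be sketched without plastid, using plain (start, end) tuples in place of GenomicSegments. The function chain_to_genomic is a hypothetical helper written for this illustration, and it returns only the coordinate rather than the (chromosome, coordinate, strand) tuple.

```python
def chain_to_genomic(segments, strand, x, stranded=True):
    """Hypothetical helper: map chain position x to a genomic coordinate.

    segments is a sorted list of half-open (start, end) tuples on one chromosome.
    """
    length = sum(end - start for start, end in segments)
    if not 0 <= x < length:
        raise IndexError("position %d outside chain of length %d" % (x, length))
    # On the minus strand, position 0 is the rightmost genomic base
    if stranded and strand == "-":
        x = length - 1 - x
    for start, end in segments:
        span = end - start
        if x < span:
            return start + x
        x -= span   # skip this segment and continue into the next one


segments = [(5, 200), (250, 300)]
print(chain_to_genomic(segments, "-", 50))   # 199
print(chain_to_genomic(segments, "-", 49))   # 250
```

The two calls use the same segments as the chain in the module examples, so position 50 falls on the far side of the splice junction from position 49.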
- get_junctions(self)¶
Returns a list of GenomicSegments representing spaces between the GenomicSegments in self. In the case of a transcript, these would represent introns. In the case of an alignment, these would represent gaps in the query compared to the reference.
- Returns
- list
List of GenomicSegments covering spaces between the intervals in self (e.g. introns in the case of a transcript, or gaps in the case of an alignment)
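As a rough sketch of what a junction is, the gaps between consecutive segments can be computed from half-open (start, end) tuples. The helper name junctions is hypothetical, and the sketch assumes the input segments do not overlap.

```python
def junctions(segments):
    """Hypothetical helper: half-open gaps between consecutive segments."""
    ordered = sorted(segments)
    return [
        (left_end, right_start)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
        if right_start > left_end   # adjacent segments leave no gap
    ]


introns = junctions([(250, 300), (5, 200)])
print(introns)   # [(200, 250)]
```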
- get_masked_counts(self, ga, stranded=True, copy=False)¶
Return counts covering self in dataset ga as a masked array, in transcript coordinates. Positions masked by SegmentChain.add_masks() will be masked in the array.
- Parameters
- ga : non-abstract subclass of AbstractGenomeArray
GenomeArray from which to fetch counts
- stranded : bool, optional
If True and the SegmentChain is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)
- copy : bool, optional
If False (default), returns a view of the data, so changing values in the view changes the values in the GenomeArray if it is mutable. If True, a copy is returned instead.
- Returns
- numpy.ma.masked_array
- get_masked_position_set(self) set¶
Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using SegmentChain.add_masks()
- Returns
- set
Set of genomic coordinates, as integers
- get_masks(self)¶
Return masked positions as a list of GenomicSegments
- Returns
- list
List of GenomicSegments representing masked positions
- get_masks_as_segmentchain(self)¶
Return masked positions as a SegmentChain
- Returns
- SegmentChain
Masked positions
- get_name(self)¶
Return the name of self, first searching through self.attr for the keys transcript_id, ID, Name, and name. If no value is found, Transcript.__str__() is used.
- Returns
- str
Returns, in order of preference, transcript_id, ID, Name, or name from self.attr. If not found, returns str(self)
- get_position_list(self)¶
Retrieve a sorted end-inclusive numpy array of genomic coordinates in this SegmentChain
- Returns
- list
Genomic coordinates in self, as integers, in genomic order
- get_position_set(self)¶
Retrieve an end-inclusive set of genomic coordinates included in this SegmentChain
- Returns
- set
Set of genomic coordinates, as integers
- get_segmentchain_coordinate(self, unicode chrom, long genomic_x, unicode strand, bool stranded=True)¶
Finds the SegmentChain coordinate corresponding to a genomic position
- Parameters
- chrom : str
Chromosome name
- genomic_x : int
Coordinate, in genomic space
- strand : str
Chromosome strand (‘+’, ‘-’, or ‘.’)
- stranded : bool, optional
If True, coordinates are given in stranded space (i.e. from 5’ end of chain, as one might expect for a transcript). If False, coordinates are given from the left end of self, regardless of strand. (Default: True)
- Returns
- int
Position in SegmentChain
- Raises
- KeyError
if position is outside the bounds of the SegmentChain
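The inverse mapping, from a genomic position to a position in the chain, can be sketched in the same spirit. The helper genomic_to_chain is hypothetical and works on plain (start, end) tuples; it ignores chromosome and strand matching, which a real implementation would need to check.

```python
def genomic_to_chain(segments, strand, genomic_x, stranded=True):
    """Hypothetical helper: map a genomic coordinate to a chain position.

    segments is a sorted list of half-open (start, end) tuples.
    """
    length = sum(end - start for start, end in segments)
    offset = 0   # number of chain positions in segments already passed
    for start, end in segments:
        if start <= genomic_x < end:
            x = offset + (genomic_x - start)
            # Count from the 5' end for minus-strand chains
            if stranded and strand == "-":
                x = length - 1 - x
            return x
        offset += end - start
    raise KeyError("genomic position %d is not in the chain" % genomic_x)


segments = [(5, 200), (250, 300)]
print(genomic_to_chain(segments, "-", 118))   # 131
```

Positions that fall in the gap between segments (for example 220 here) are not part of the chain, so the sketch raises KeyError for them.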
- get_sequence(self, genome, stranded=True)¶
Return spliced genomic sequence of SegmentChain as a string
- Parameters
- genome : dict or twobitreader.TwoBitFile
Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects
- stranded : bool
If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)
- Returns
- str
Nucleotide sequence of the SegmentChain extracted from genome
- get_subchain(self, long start, long end, bool stranded=True, **extra_attr)¶
Retrieves a sub-SegmentChain corresponding to a range of positions specified in coordinates relative to this SegmentChain. Attributes in self.attr are copied to the child SegmentChain, with the exception of ID, to which the suffix ‘subchain’ is appended.
- Parameters
- start : int
Position of interest in SegmentChain coordinates, 0-indexed
- end : int
Position of interest in SegmentChain coordinates, 0-indexed and half-open
- stranded : bool, optional
If True, start and end are assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, they are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)
- extra_attr : keyword arguments
Values that will be included in the subchain’s attr dict. These can be used to overwrite values already present.
- Returns
- SegmentChain
SegmentChain covering parent chain positions start to end of self
- Raises
- IndexError
if start or end is outside the bounds of the SegmentChain
- TypeError
if start or end is None
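To show how a half-open range of chain positions maps back onto spliced genomic segments, here is a minimal standalone sketch using (start, end) tuples. The helper subchain_segments is hypothetical; attribute copying is omitted, and raising IndexError when start exceeds end is an assumption of the sketch.

```python
def subchain_segments(segments, strand, start, end, stranded=True):
    """Hypothetical helper: genomic segments covering chain positions [start, end).

    segments is a sorted list of half-open (start, end) tuples.
    """
    length = sum(e - s for s, e in segments)
    if not 0 <= start <= end <= length:
        raise IndexError("range %d-%d outside chain of length %d" % (start, end, length))
    # Convert 5'-relative positions to left-relative positions on the minus strand
    if stranded and strand == "-":
        start, end = length - end, length - start
    result = []
    offset = 0   # chain position of the first base in the current segment
    for seg_start, seg_end in segments:
        span = seg_end - seg_start
        lo = max(start, offset)
        hi = min(end, offset + span)
        if lo < hi:   # the requested range intersects this segment
            result.append((seg_start + lo - offset, seg_start + hi - offset))
        offset += span
    return result


sub = subchain_segments([(5, 200), (250, 300)], "-", 45, 70)
print(sub)   # [(180, 200), (250, 255)]
```

The result keeps the discontinuity of the parent chain, matching the subchain shown in the module examples.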
- get_unstranded(self) SegmentChain¶
Returns an unstranded SegmentChain covering the same positions as self, with empty attr dict.
- Returns
- SegmentChain
Unstranded version of self
- get_utr3(self, **extra_attr)¶
Retrieve sub-SegmentChain covering 3’UTR of self, excluding the stop codon. If no coding region is present, returns an empty SegmentChain.
The following attributes are passed from self.attr to the new SegmentChain:
- transcript_id, taken from SegmentChain.get_name()
- gene_id, taken from SegmentChain.get_gene()
- ID, generated as “%s_3UTR” % self.get_name()
- Parameters
- extra_attr : keyword arguments
Values that will be included in the 3’ UTR subchain’s attr dict. These can be used to overwrite values already present.
- Returns
- SegmentChain
3’ UTR region of self if present, otherwise empty SegmentChain
- get_utr5(self, **extra_attr)¶
Retrieve sub-SegmentChain covering 5’UTR of self. If no coding region is present, returns an empty SegmentChain.
The following attributes are passed from self.attr to the new SegmentChain:
- transcript_id, taken from SegmentChain.get_name()
- gene_id, taken from SegmentChain.get_gene()
- ID, generated as “%s_5UTR” % self.get_name()
- Parameters
- extra_attr : keyword arguments
Values that will be included in the 5’ UTR subchain’s attr dict. These can be used to overwrite values already present.
- Returns
- SegmentChain
5’ UTR region of self if present, otherwise empty SegmentChain
- next(self)¶
Return next GenomicSegment in the SegmentChain, from left to right on the chromosome
- overlaps(self, other)¶
Return True if self and other share genomic positions on the same strand
- Parameters
- other : SegmentChain or GenomicSegment
Query feature
- Returns
- bool
True if self and other share genomic positions on the same chromosome and strand; False otherwise.
- Raises
- TypeError
if other is not a GenomicSegment or SegmentChain
- reset_masks(self)¶
Removes masks added by add_masks()
- shares_segments_with(self, other)¶
Returns a list of GenomicSegments that are shared between self and other
- Parameters
- other : SegmentChain or GenomicSegment
Query feature
- Returns
- list
List of GenomicSegments common to self and other
- Raises
- TypeError
if other is not a GenomicSegment or SegmentChain
- sort(self)¶
- unstranded_overlaps(self, other)¶
Return True if self and other share genomic positions on the same chromosome, regardless of their strands
- Parameters
- other : SegmentChain or GenomicSegment
Query feature
- Returns
- bool
True if self and other share genomic positions on the same chromosome, False otherwise. Strands of self and other need not match.
- Raises
- TypeError
if other is not a GenomicSegment or SegmentChain
- attr¶
attr: dict
- c_strand¶
- cds_end¶
End of coding region relative to 5’ end of transcript, in direction of transcript. Setting to None also sets self.cds_start, self.cds_genome_start and self.cds_genome_end to None
- cds_genome_end¶
Ending coordinate of coding region, relative to genome (i.e. rightmost; is stop codon for forward-strand features, start codon for reverse-strand features). Setting to None also sets self.cds_start, self.cds_end, and self.cds_genome_start to None
- cds_genome_start¶
Starting coordinate of coding region, relative to genome (i.e. leftmost; is start codon for forward-strand features, stop codon for reverse-strand features). Setting to None also sets self.cds_start, self.cds_end, and self.cds_genome_end to None
- cds_start¶
Start of coding region relative to 5’ end of transcript, in direction of transcript. Setting to None also sets self.cds_end, self.cds_genome_start and self.cds_genome_end to None
- chrom¶
Chromosome the SegmentChain resides on
- length¶
- mask_segments¶
Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.
- masked_length¶
- segments¶
Copy of list of GenomicSegments that comprise self. Changing this list will do nothing to self.
- spanning_segment¶
- strand¶
Strand of the SegmentChain
- plastid.genomics.roitools.add_three_for_stop_codon(Transcript tx) Transcript¶
Extend an annotated CDS region, if present, by three nucleotides at the threeprime end. Use in cases when annotation files exclude the stop codon from the annotated CDS.
- Parameters
- tx : Transcript
Query transcript
- Returns
- Transcript
Transcript with same attributes as tx, but with CDS extended by one codon
- Raises
- IndexError
if a three prime UTR is defined that terminates before the complete stop codon
- plastid.genomics.roitools.merge_segments(list segments) list¶
Merge all overlapping GenomicSegments in segments, so that all segments returned are guaranteed to be sorted and non-overlapping.
Note
All segments are assumed to be on the same strand and chromosome.
- Parameters
- segments : list
List of GenomicSegments, all on the same strand and chromosome
- Returns
- list
List of sorted, non-overlapping GenomicSegments
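The general idea of merging overlapping intervals can be sketched with plain half-open (start, end) tuples standing in for GenomicSegments on one chromosome and strand. The name merge_intervals is hypothetical, and merging intervals that merely touch (end equal to the next start) is a choice made in this sketch that may differ from the library's behavior.

```python
def merge_intervals(intervals):
    """Hypothetical helper: merge overlapping half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


result = merge_intervals([(10, 20), (15, 30), (40, 50), (5, 8)])
print(result)   # [(5, 8), (10, 30), (40, 50)]
```

Sorting first means each interval only needs to be compared with the most recently merged one.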
- plastid.genomics.roitools.positionlist_to_segments(unicode chrom, unicode strand, list positions) list¶
Construct GenomicSegments from a chromosome name, a strand, and a list of chromosomal positions.
- Parameters
- chrom : str
Chromosome name
- strand : str
Chromosome strand (‘+’, ‘-’, or ‘.’)
- positions : list of unique integers
Sorted, end-inclusive list of positions to include in final GenomicSegment
- Returns
- list
List of GenomicSegments covering positions
Warning
This function is meant to run quickly, without excessive type conversions. So, the elements of positions must be UNIQUE and SORTED. If they are not, use positions_to_segments() instead.
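The conversion from end-inclusive positions to half-open segments can be sketched as follows, with (start, end) tuples standing in for GenomicSegments. The helper positions_to_intervals is hypothetical; it sorts and de-duplicates its input first, which is the preparation that the warning above says positionlist_to_segments() leaves to the caller.

```python
def positions_to_intervals(positions):
    """Hypothetical helper: group integer positions into half-open intervals."""
    intervals = []
    for pos in sorted(set(positions)):
        if intervals and pos == intervals[-1][1]:
            # pos directly follows the current run: extend the half-open end
            intervals[-1][1] = pos + 1
        else:
            intervals.append([pos, pos + 1])
    return [tuple(interval) for interval in intervals]


runs = positions_to_intervals([3, 4, 5, 9, 10, 20])
print(runs)   # [(3, 6), (9, 11), (20, 21)]
```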
- plastid.genomics.roitools.positions_to_segments(unicode chrom, unicode strand, positions) list¶
Construct GenomicSegments from a chromosome name, a strand, and a list of chromosomal positions.
- Parameters
- chrom : str
Chromosome name
- strand : str
Chromosome strand (‘+’, ‘-’, or ‘.’)
- positions : list of integers
End-inclusive list, tuple, or set of positions to include in final GenomicSegment
- Returns
- list
List of GenomicSegments covering positions