plastid.genomics.roitools module

This module contains classes for representing and manipulating genomic features.

Summary

Genomic features are represented as SegmentChains, which can contain zero or more continuous spans of the genome (GenomicSegments), as well as rich annotation data. For the specific case of RNA transcripts, a subclass of SegmentChain called Transcript is provided.

Module contents

GenomicSegment Building block for SegmentChain: a continuous segment of the genome defined by a chromosome name, start coordinate, end coordinate, and strand.
SegmentChain Base class for genomic features, composed of zero or more GenomicSegments.
Transcript Subclass of SegmentChain specifically for RNA transcripts.
positions_to_segments(str chrom, str strand, …) Construct GenomicSegments from a chromosome name, a strand, and a list of chromosomal positions
add_three_for_stop_codon(Transcript tx) Extend an annotated CDS region, if present, by three nucleotides at the 3’ end.

Examples

SegmentChains may be read directly from annotation files using the readers in plastid.readers:

>>> from plastid import *
>>> chains = list(BED_Reader(open("some_file.bed")))

or constructed from GenomicSegments:

>>> seg1 = GenomicSegment("chrA",5,200,"-")
>>> seg2 = GenomicSegment("chrA",250,300,"-")
>>> my_chain = SegmentChain(seg1,seg2,ID="some_chain", ... , some_attribute="some_value")

SegmentChains contain convenience methods for a number of common tasks, for example:

  • converting coordinates between the spliced space of the chain, and the genome:

    >>> # get coordinate of 50th position from 5' end
    >>> my_chain.get_genomic_coordinate(50)
    ('chrA', 199, '-')
            
    >>> # get coordinate of 49th position. splicing is taken care of!
    >>> my_chain.get_genomic_coordinate(49)
    ('chrA', 250, '-')
    
    >>> # get coordinate in chain corresponding to genomic coordinate 118
    >>> my_chain.get_segmentchain_coordinate("chrA",118,"-")
    131
    
    >>> # get a subchain containing positions 45-70
    >>> subchain = my_chain.get_subchain(45,70)
    >>> subchain
    <SegmentChain segments=2 bounds=chrA:180-255(-) name=some_chain_subchain>
    
    >>> # the subchain preserves the discontinuity found in `my_chain`
    >>> subchain.segments
    [<GenomicSegment chrA:180-200 strand='-'>,
     <GenomicSegment chrA:250-255 strand='-'>]
    
  • fetching numpy arrays of data at each position in the chain. The data is assumed to be kept in a GenomeArray:

    >>> ga = BAMGenomeArray(["some_file.bam"],mapping=ThreePrimeMapFactory(offset=15))
    >>> my_chain.get_counts(ga)
    array([843, 854, 153,  86, 462, 359, 290,  38,  38, 758, 342, 299, 430,
           628, 324, 437, 231, 417, 536, 673, 243, 981, 661, 415, 207, 446,
           197, 520, 653, 468, 863,   3, 272, 754, 352, 960, 966, 913, 367,
           ...
           ])
    
  • similarly, fetching spliced sequence, reverse-complemented if necessary for minus-strand features. As input, the SegmentChain expects a dictionary-like object mapping chromosome names to string-like sequences (e.g. as in BioPython or twobitreader):

    >>> seqdict = { "chrA" : "TCTACATA ..." } # some string of chrA sequence
    >>> my_chain.get_sequence(seqdict)
    "ACTGTGTACTGTACGATCGATCGTACGTACGATCGATCGTACGTAGCTAGTCAGCTAGCTAGCTAGCTGA..." 
    
  • testing for overlap, containment, equality with other SegmentChains:

    >>> other_chain = SegmentChain(GenomicSegment("chrA",200,300,"-"),
    ...                            GenomicSegment("chrA",800,900,"-"))

    >>> my_chain.overlaps(other_chain)
    True
    
    >>> other_chain in my_chain
    False
    
    >>> my_chain in my_chain
    True
    
    >>> my_chain.covers(other_chain)
    False
    
    >>> my_chain == other_chain
    False
    
    >>> my_chain == my_chain
    True
    
  • export to BED, GTF2, or GFF3:

    >>> my_chain.as_bed()
    chrA    5    300    some_chain    0    -    5    5    0,0,0    2    195,50,    0,245,
    
    >>> my_chain.as_gtf()
    chrA    .    exon    6    200    .    -    .    gene_id "gene_some_chain"; transcript_id "some_chain"; some_attribute "some_value"; ID "some_chain";
    chrA    .    exon    251  300    .    -    .    gene_id "gene_some_chain"; transcript_id "some_chain"; some_attribute "some_value"; ID "some_chain";
    
class plastid.genomics.roitools.GenomicSegment

Bases: object

Building block for SegmentChain: a continuous segment of the genome defined by a chromosome name, start coordinate, end coordinate, and strand.

Examples

GenomicSegments sort lexically by chromosome, start position, end position, and finally strand:

>>> GenomicSegment("chrA",50,100,"+") < GenomicSegment("chrB",0,10,"+")
True

>>> GenomicSegment("chrA",50,100,"+") < GenomicSegment("chrA",75,100,"+")
True

>>> GenomicSegment("chrA",50,100,"+") < GenomicSegment("chrA",55,75,"+")
True

>>> GenomicSegment("chrA",50,100,"+") < GenomicSegment("chrA",50,150,"+")
True

>>> GenomicSegment("chrA",50,100,"+") < GenomicSegment("chrA",50,100,"-")
True
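
The ordering shown above is what a plain tuple comparison over (chromosome, start, end, strand) would give. A minimal sketch, assuming a hypothetical `Seg` namedtuple as a stand-in (this is not plastid's GenomicSegment implementation):

```python
from collections import namedtuple

# Hypothetical stand-in for GenomicSegment; not plastid's implementation.
Seg = namedtuple("Seg", ["chrom", "start", "end", "strand"])

segs = [
    Seg("chrA", 50, 100, "-"),
    Seg("chrB", 0, 10, "+"),
    Seg("chrA", 50, 150, "+"),
    Seg("chrA", 50, 100, "+"),
    Seg("chrA", 55, 75, "+"),
]

# Tuples compare field by field: chromosome, then start, then end, then strand.
# As plain strings, "+" sorts before "-", consistent with the last example above.
ordered = sorted(segs)
for seg in ordered:
    print(seg)
```

Sorting a list of segments this way groups them by chromosome first, which is usually what is wanted before writing an annotation file.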

They also provide a few convenience methods for containment or overlap. To be contained, a segment must be on the same chromosome and strand as its container, and its coordinates must lie within or equal to the container's endpoints:

>>> GenomicSegment("chrA",50,100,"+") in GenomicSegment("chrA",25,100,"+")
True

>>> GenomicSegment("chrA",50,100,"+") in GenomicSegment("chrA",50,100,"+")
True

>>> GenomicSegment("chrA",50,100,"+") in GenomicSegment("chrA",25,100,"-")
False

>>> GenomicSegment("chrA",50,100,"+") in GenomicSegment("chrA",75,200,"+")
False

Similarly, to overlap, GenomicSegments must be on the same strand and chromosome.
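
The containment and overlap rules described above can be sketched with simple interval arithmetic on 0-indexed, half-open coordinates. The `Seg` namedtuple and both functions are hypothetical illustrations, not plastid's code:

```python
from collections import namedtuple

# Hypothetical stand-in for GenomicSegment; not plastid's implementation.
Seg = namedtuple("Seg", ["chrom", "start", "end", "strand"])

def same_chrom_and_strand(a, b):
    return a.chrom == b.chrom and a.strand == b.strand

def contains(outer, inner):
    """True if every position of `inner` lies within `outer`."""
    return (same_chrom_and_strand(outer, inner)
            and outer.start <= inner.start
            and inner.end <= outer.end)

def overlaps(a, b):
    """True if `a` and `b` share at least one position (half-open coordinates)."""
    return same_chrom_and_strand(a, b) and a.start < b.end and b.start < a.end

query = Seg("chrA", 50, 100, "+")
print(contains(Seg("chrA", 25, 100, "+"), query))
print(contains(Seg("chrA", 25, 100, "-"), query))
print(overlaps(Seg("chrA", 75, 200, "+"), query))
```

Because end coordinates are half-open, two segments that merely touch (one ends at 100, the next starts at 100) do not overlap in this sketch.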

Attributes:
chrom : str

Chromosome where GenomicSegment resides

start : int

Zero-indexed (Pythonic) start coordinate of GenomicSegment

end : int

Zero-indexed, half-open (Pythonic) end coordinate of GenomicSegment

strand : str

Strand of GenomicSegment

Methods

as_igv_str(self) Format as an IGV location string
contains(self, GenomicSegment other) Test whether this segment contains other, where containment is defined as all positions in other being present in self, when both self and other share the same chromosome and strand.
from_igv_str(str loc_str, str strand='.') Construct GenomicSegment from IGV location string
from_str(str inp) Construct a GenomicSegment from its str() representation
overlaps(self, GenomicSegment other) Test whether this segment overlaps other, where overlap is defined as sharing: a chromosome, a strand, and a subset of coordinates.
as_igv_str(self) → str

Format as an IGV location string

contains(self, GenomicSegment other) → bool

Test whether this segment contains other, where containment is defined as all positions in other being present in self, when both self and other share the same chromosome and strand.

Parameters:
other : GenomicSegment

Query segment

Returns:
bool
static from_igv_str(str loc_str, str strand='.')

Construct GenomicSegment from IGV location string

Parameters:
loc_str : str

IGV location string, in format ‘chromosome:start-end’, where start and end are 1-indexed and half-open

strand : str

The chromosome strand (‘+’, ‘-’, or ‘.’)

Returns:
GenomicSegment
static from_str(str inp)

Construct a GenomicSegment from its str() representation

Parameters:
inp : str

String representation of GenomicSegment as chrom:start-end(strand) where start and end are in 0-indexed, half-open coordinates

Returns:
GenomicSegment
overlaps(self, GenomicSegment other) → bool

Test whether this segment overlaps other, where overlap is defined as sharing: a chromosome, a strand, and a subset of coordinates.

Parameters:
other : GenomicSegment

Query segment

Returns:
bool
c_strand
chrom

Chromosome where GenomicSegment resides

end

Zero-indexed, half-open (Pythonic) end coordinate of GenomicSegment

start

Zero-indexed (Pythonic) start coordinate of GenomicSegment

strand

Strand of GenomicSegment

  • ‘+’ for forward / Watson strand
  • ‘-’ for reverse / Crick strand
  • ‘.’ for unstranded / both strands
class plastid.genomics.roitools.SegmentChain

Bases: object

Base class for genomic features, composed of zero or more GenomicSegments. SegmentChains can therefore model discontinuous features, such as multi-exon transcripts or gapped alignments, in addition to continuous features.

Numerous convenience functions are supplied for:

  • converting between coordinates relative to the genome and relative to the internal coordinates of a spliced SegmentChain
  • fetching genomic sequence, read alignments, or count data, accounting for splicing of the segments, and, in the case of reverse-strand features, reverse-complementing
  • slicing or fetching sub-regions of a SegmentChain
  • testing equality, inequality, overlap, containment, coverage of, or sharing of segments with other SegmentChain or GenomicSegment objects
  • import/export to BED, PSL, GTF2, and GFF3 formats, for use in other software packages or in a genome browser.

Intervals are sorted from lowest to greatest starting coordinate on their reference sequence, regardless of strand. Iteration over the SegmentChain will yield intervals from left-to-right in the genome.

Parameters:
*segments : GenomicSegment

0 or more GenomicSegments on the same strand

**attr : keyword arguments

Arbitrary attributes, including, for example:

Attribute      Description
type           A feature type used for GTF2/GFF3 export of each interval in the SegmentChain. (Default: ‘exon’)
ID             A unique ID for the SegmentChain.
transcript_id  A transcript ID used for GTF2 export
gene_id        A gene ID used for GTF2 export
Attributes:
spanning_segment : GenomicSegment

A GenomicSegment spanning the endpoints of the SegmentChain

strand : str

Strand of the SegmentChain

chrom : str

Chromosome the SegmentChain resides on

attr : dict

Dictionary of arbitrary attributes describing the SegmentChain

segments : list

Copy of list of GenomicSegments that comprise self.

mask_segments : list

Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.

Methods

add_masks(self, *mask_segments) Adds one or more GenomicSegments to the collection of masks.
add_segments(self, *segments) Add 1 or more GenomicSegments to the SegmentChain.
antisense_overlaps(self, other) Returns True if self and other share genomic positions on opposite strands
as_bed(self[, thickstart, thickend, as_int, …]) Format SegmentChain as a string of BED12[+X] output.
as_gff3(self, str feature_type=None, …) Format self as a line of GFF3 output.
as_gtf(self, str feature_type=None, …) Format SegmentChain as a block of GTF2 output.
as_psl(self) Formats SegmentChain as PSL (blat) output.
covers(self, other) Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self.
from_bed(str line[, extra_columns]) Create a SegmentChain from a line from a BED file.
from_psl(psl_line) Create a SegmentChain from a line from a PSL (BLAT) file
from_str(str inp) Create a SegmentChain from a string formatted by SegmentChain.__str__():
get_antisense(self) Returns a SegmentChain antisense to self, with empty attr dict.
get_counts(self, ga[, stranded]) Return array of counts or values drawn from ga at each position in self
get_fasta(self, genome[, stranded]) Formats sequence of SegmentChain as FASTA output
get_gene(self) Return name of gene associated with SegmentChain, if any, by searching through self.attr for the keys gene_id and Parent.
get_genomic_coordinate(self, x[, stranded]) Finds genomic coordinate corresponding to position x in self
get_junctions(self) Returns a list of GenomicSegments representing spaces between the GenomicSegments in self. In the case of a transcript, these would represent introns.
get_length(self) Return total length, in nucleotides, of self
get_masked_counts(self, ga[, stranded, copy]) Return counts covering self in dataset ga as a masked array, in transcript coordinates.
get_masked_length(self) Return the total length, in nucleotides, of positions in self that have not been masked using SegmentChain.add_masks()
get_masked_position_set(self) Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using SegmentChain.add_masks()
get_masks(self) Return masked positions as a list of GenomicSegments
get_masks_as_segmentchain(self) Return masked positions as a SegmentChain
get_name(self) Returns the name of this SegmentChain, first searching through self.attr for the keys ID, Name, and name.
get_position_list(self) Retrieve a sorted end-inclusive numpy array of genomic coordinates in this SegmentChain
get_position_set(self) Retrieve an end-inclusive set of genomic coordinates included in this SegmentChain
get_segmentchain_coordinate(self, str chrom, …) Finds the SegmentChain coordinate corresponding to a genomic position
get_sequence(self, genome[, stranded]) Return spliced genomic sequence of SegmentChain as a string
get_subchain(self, long start, long end, …) Retrieves a sub-SegmentChain corresponding to a range of positions specified in coordinates relative to this SegmentChain.
get_unstranded(self) Returns an unstranded version of self (strand ‘.’), with empty attr dict.
next
overlaps(self, other) Return True if self and other share genomic positions on the same strand
reset_masks(self) Removes masks added by add_masks()
shares_segments_with(self, other) Returns a list of GenomicSegments that are shared between self and other
sort(self)
unstranded_overlaps(self, other) Return True if self and other share genomic positions on the same chromosome, regardless of their strands
add_masks(self, *mask_segments)

Adds one or more GenomicSegments to the collection of masks. Masks will be trimmed to the positions of the SegmentChain during addition.

Parameters:
mask_segments : GenomicSegment

One or more segments, in genomic coordinates, covering positions to exclude from return values of get_masked_position_set(), get_masked_counts(), or get_masked_length()

add_segments(self, *segments)

Add 1 or more GenomicSegments to the SegmentChain. If there are already segments in the chain, the incoming segments must be on the same strand and chromosome as all others present.

Parameters:
segments : GenomicSegment

One or more GenomicSegments to add to the SegmentChain

antisense_overlaps(self, other)

Returns True if self and other share genomic positions on opposite strands

Parameters:
other : SegmentChain or GenomicSegment

Query feature

Returns:
bool

True if self and other share genomic positions on the same chromosome but opposite strand; False otherwise.

Raises:
TypeError

if other is not a GenomicSegment or SegmentChain

as_bed(self, thickstart=None, thickend=None, as_int=True, color=None, extra_columns=None, empty_value='')

Format SegmentChain as a string of BED12[+X] output.

If the SegmentChain was imported as a BED file with extra columns, these will be output in the same order, after the BED columns.

Parameters:
thickstart : int or None, optional

If not None, overrides the genome coordinate that starts thick plotting in genome browser found in self.attr[‘thickstart’]

thickend : int or None, optional

If not None, overrides the genome coordinate that stops thick plotting in genome browser found in self.attr[‘thickend’]

as_int : bool, optional

Force score to integer (Default: True)

color : str or None, optional

Color represented as RGB hex string. If not None, overrides the color in self.attr[‘color’]

extra_columns : None or list-like, optional

If None, and the SegmentChain was imported using the extra_columns keyword of from_bed(), the SegmentChain will be exported in BED 12+X format, in which extra columns are in the same order as they were upon import. If no extra columns were present, the SegmentChain will be exported as a BED12 line.

If a list of attribute names, these attributes will be exported as extra columns in order, overriding whatever happened upon import. If an attribute name is not in the attr dict of the SegmentChain, it will be exported with the value of empty_value

If an empty list, no extra columns will be exported; the SegmentChain will be formatted as a BED12 line.

empty_value : str, optional

Value to export for extra_columns that are not defined (Default: “”)

Returns:
str

Line of BED12[+X]-formatted text

Notes

BED12 columns are as follows:

Column  Contains
1       Contig or chromosome
2       Start of first block in feature (0-indexed)
3       End of last block in feature (half-open)
4       Feature name
5       Feature score
6       Strand
7       thickstart (in chromosomal coordinates)
8       thickend (in chromosomal coordinates)
9       Feature color as RGB tuple
10      Number of blocks in feature
11      Block lengths
12      Block starts, relative to start of first block

For more details, see the UCSC file format FAQ.
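
To make the block columns concrete, here is a minimal sketch that builds a BED12 line from a list of 0-indexed, half-open segments. `bed12_line` is a hypothetical helper written for illustration, not plastid's `as_bed()`; it assumes no thick region, so thickstart and thickend are both set to the feature start:

```python
def bed12_line(chrom, strand, segments, name, score=0, color="0,0,0"):
    """Format (start, end) half-open segments as one BED12 line.

    Hypothetical helper for illustration; thickstart and thickend are both
    set to the feature start, i.e. no thick region is drawn.
    """
    segments = sorted(segments)
    chain_start = segments[0][0]
    chain_end = segments[-1][1]
    # Columns 11 and 12 are comma-terminated lists
    block_sizes = "".join("%d," % (end - start) for start, end in segments)
    block_starts = "".join("%d," % (start - chain_start) for start, _ in segments)
    fields = [chrom, chain_start, chain_end, name, score, strand,
              chain_start, chain_start, color, len(segments),
              block_sizes, block_starts]
    return "\t".join(str(f) for f in fields)

line = bed12_line("chrA", "-", [(5, 200), (250, 300)], "some_chain")
print(line)
```

With the two segments of `my_chain` from the module examples, the block lengths are 195 and 50, and the block starts are 0 and 245, measured from the start of the first block.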
as_gff3(self, str feature_type=None, bool escape=True, list excludes=None)

Format self as a line of GFF3 output.

Because GFF3 files permit many schemas of parent-child hierarchy, and in order to reduce confusion and overhead, attempts to export a multi-interval SegmentChain will raise an AttributeError.

Instead, users may export the individual features from which the multi-interval SegmentChain was constructed, or construct features for them, setting ID, Parent, and type attributes following their own conventions.

Parameters:
feature_type : str

If not None, overrides the type attribute of self.attr

escape : bool, optional

Escape tokens in column 9 of GFF3 output (Default: True)

excludes : list, optional

List of attribute key names to exclude from column 9 (Default: [])

Returns:
str

Line of GFF3-formatted text

Raises:
AttributeError

if the SegmentChain has multiple intervals

Notes

Columns of GFF3 are as follows:

Column  Contains
1       Contig or chromosome
2       Source of annotation
3       Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4       Start (1-indexed)
5       End (fully-closed)
6       Score
7       Strand
8       Frame. Number of bases within feature before first in-frame codon (if coding)
9       Attributes
For further information, see
as_gtf(self, str feature_type=None, bool escape=True, list excludes=None)

Format SegmentChain as a block of GTF2 output.

The frame or phase attribute (GTF2 column 8) is valid only for ‘CDS’ features, and, if not present in self.attr, is calculated assuming the SegmentChain contains the entire coding region. If the SegmentChain contains multiple intervals, the frame or phase attribute will always be recalculated.

All attributes in self.attr, except those created upon import, will be propagated to all of the features that are generated.

Parameters:
feature_type : str

If not None, overrides the “type” attribute of self.attr

escape : bool, optional

Escape tokens in column 9 of GTF output (Default: True)

excludes : list, optional

List of attribute key names to exclude from column 9 (Default: [])

Returns:
str

Block of GTF2-formatted text

Notes

gene_id and transcript_id are required
The GTF2 specification requires that attributes gene_id and transcript_id be defined. If these are not present in self.attr, their values will be guessed following the rules in SegmentChain.get_gene() and SegmentChain.get_name(), respectively.
Beware of attribute loss
To save memory, only the attributes shared by all of the individual sub-features (e.g. exons) that were used to assemble this Transcript have been stored in self.attr. This means that upon re-export to GTF2, these sub-features will be lacking any attributes that were specific to them individually. Formally, this is compliant with the GTF2 specification, which states explicitly that only the attributes gene_id and transcript_id are supported.
Columns of GTF2 are as follows:

Column  Contains
1       Contig or chromosome
2       Source of annotation
3       Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4       Start (1-indexed)
5       End (fully-closed)
6       Score
7       Strand
8       Frame. Number of bases within feature before first in-frame codon (if coding)
9       Attributes. “gene_id” and “transcript_id” are required
For more info
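
Columns 4 and 5 use 1-indexed, fully-closed coordinates, whereas GenomicSegments use 0-indexed, half-open coordinates. The conversion only shifts the start. A minimal sketch, with hypothetical helper names:

```python
def to_gtf_coords(start, end):
    """Convert 0-indexed, half-open (start, end) to 1-indexed, fully-closed."""
    return start + 1, end

def from_gtf_coords(start, end):
    """Convert 1-indexed, fully-closed (start, end) to 0-indexed, half-open."""
    return start - 1, end

segments = [(5, 200), (250, 300)]
gtf_coords = [to_gtf_coords(start, end) for start, end in segments]
print(gtf_coords)
```

For the segments of `my_chain` this gives 6-200 and 251-300, matching the `as_gtf()` output in the module examples.
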
as_psl(self)

Formats SegmentChain as PSL (blat) output.

Returns:
str

PSL-representation of BLAT alignment

Raises:
AttributeError

If not all of the attributes listed in the Notes below are defined

Notes

This will raise an AttributeError unless the following keys are present and defined in self.attr, corresponding to the columns of a PSL file:

Column  Key
1       match_length
2       mismatches
3       rep_matches
4       N
5       query_gap_count
6       query_gap_bases
7       target_gap_count
8       target_gap_bases
9       strand
10      query_name
11      query_length
12      query_start
13      query_end
14      target_name
15      target_length
16      target_start
17      target_end
19      q_starts : list of integers
20      l_starts : list of integers

These keys are defined only if the SegmentChain was created by SegmentChain.from_psl(), or if the user has defined them.

See the PSL spec for more information.

covers(self, other)

Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self. By convention, zero-length SegmentChains are not covered by other chains.

Parameters:
other : SegmentChain or GenomicSegment

Query feature

Returns:
bool

True if self and other share a chromosome and strand, and all genomic positions in other are present in self. Otherwise False

Raises:
TypeError

if other is not a GenomicSegment or SegmentChain

static from_bed(str line, extra_columns=0)

Create a SegmentChain from a line from a BED file. The BED line may contain 4 to 12 columns, per the specification. These will be auto-detected and parsed appropriately.

See the UCSC file format faq for more details.

Parameters:
line : str

Line from a BED file, containing 4 or more columns

extra_columns : int or list, optional

Extra, non-BED columns in Extended BED format file corresponding to feature attributes. This is common in ENCODE-specific BED variants.

If extra_columns is:

  • an int: it is taken to be the number of attribute columns. Attributes will be stored in the attr dictionary of the SegmentChain, under names like custom0, custom1, … , customN.
  • a list of str: it is taken to be the names of the attribute columns, in order, from left to right in the file. In this case, attributes in extra columns will be stored under their respective names in the attr dict.
  • a list of tuple: each tuple is taken to be a pair of (attribute_name, formatter_func). In this case, the value of attribute_name in the attr dict of the SegmentChain will be set to formatter_func(column_value).

(Default: 0)
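
The three forms of extra_columns described above can be sketched as follows. `parse_extra_columns` is a hypothetical helper that operates on the already-split values of the extra columns. It only illustrates the documented behavior and is not the parser used by from_bed():

```python
def parse_extra_columns(values, extra_columns=0):
    """Map extra BED column values to an attribute dict.

    Hypothetical helper mirroring the three cases described above.
    `values` is the list of strings found after the standard BED columns.
    """
    if isinstance(extra_columns, int):
        # Case 1: a count of columns, stored under generated names
        names = ["custom%d" % i for i in range(extra_columns)]
        return dict(zip(names, values))

    attr = {}
    for spec, value in zip(extra_columns, values):
        if isinstance(spec, tuple):
            # Case 3: (attribute_name, formatter_func) pairs
            name, formatter = spec
            attr[name] = formatter(value)
        else:
            # Case 2: plain attribute names, values kept as strings
            attr[spec] = value
    return attr

values = ["0.75", "liver"]
by_count = parse_extra_columns(values, 2)
by_name = parse_extra_columns(values, ["score2", "tissue"])
by_formatter = parse_extra_columns(values, [("score2", float), ("tissue", str)])
print(by_count, by_name, by_formatter)
```

The tuple form is the only one that converts types here. In the other two cases the values stay as the strings read from the file.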

Returns:
SegmentChain
static from_psl(psl_line)

Create a SegmentChain from a line from a PSL (BLAT) file

See the PSL spec

Parameters:
psl_line : str

Line from a PSL file

Returns:
SegmentChain
static from_str(str inp)

Create a SegmentChain from a string formatted by SegmentChain.__str__():

chrom:start-end^start-end(strand)

where ‘^’ indicates a splice junction between regions specified by start and end and strand is ‘+’, ‘-’, or ‘.’. Coordinates are 0-indexed and half-open.

Parameters:
inp : str

String formatted in manner of SegmentChain.__str__()

Returns:
SegmentChain
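
As an illustration of the string format above, the following sketch parses and re-creates `chrom:start-end^start-end(strand)` strings using plain tuples. Both functions are hypothetical and stand in for from_str() and __str__() only for the purpose of showing the layout:

```python
def parse_chain_str(inp):
    """Parse 'chrom:start-end^start-end(strand)' into (chrom, segments, strand).

    Hypothetical parser for illustration; coordinates are 0-indexed, half-open.
    """
    body, strand = inp.rstrip(")").rsplit("(", 1)
    chrom, spans = body.rsplit(":", 1)
    segments = []
    for span in spans.split("^"):
        start, end = span.split("-")
        segments.append((int(start), int(end)))
    return chrom, segments, strand

def format_chain_str(chrom, segments, strand):
    """Inverse of parse_chain_str()."""
    spans = "^".join("%d-%d" % (start, end) for start, end in segments)
    return "%s:%s(%s)" % (chrom, spans, strand)

chrom, segments, strand = parse_chain_str("chrA:5-200^250-300(-)")
print(chrom, segments, strand)
print(format_chain_str(chrom, segments, strand))
```

Splitting from the right on ‘(’ and ‘:’ keeps the sketch tolerant of chromosome names that themselves contain a colon.
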
get_antisense(self) → SegmentChain

Returns a SegmentChain antisense to self, with empty attr dict.

Returns:
SegmentChain

SegmentChain antisense to self

get_counts(self, ga, stranded=True)

Return array of counts or values drawn from ga at each position in self

Parameters:
ga : GenomeArray

GenomeArray from which to fetch counts

stranded : bool, optional

If True and self is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)

Returns:
numpy.ndarray

Array of counts from ga covering self

get_fasta(self, genome, stranded=True)

Formats sequence of SegmentChain as FASTA output

Parameters:
genome : dict or twobitreader.TwoBitFile

Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects

stranded : bool

If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)

Returns:
str

FASTA-formatted sequence of SegmentChain extracted from genome

get_gene(self)

Return name of gene associated with SegmentChain, if any, by searching through self.attr for the keys gene_id and Parent. If one is not found, a generated gene name for the SegmentChain is made from get_name().

Returns:
str

Returns in order of preference, gene_id from self.attr, Parent from self.attr or 'gene_%s' % self.get_name()

get_genomic_coordinate(self, x, stranded=True)

Finds genomic coordinate corresponding to position x in self

Parameters:
x : int

position of interest, relative to SegmentChain

stranded : bool, optional

If True, x is assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, coordinates are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)

Returns:
str

Chromosome name

long

Genomic coordinate corresponding to position x

str

Chromosome strand (‘+’, ‘-’, or ‘.’)

Raises:
IndexError

if x is outside the bounds of the SegmentChain
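
The mapping from chain coordinates to genomic coordinates can be sketched as a walk over the segments. `chain_to_genome` is a hypothetical helper operating on plain (start, end) tuples. It returns only the genomic position and is not plastid's implementation:

```python
def chain_to_genome(segments, strand, x, stranded=True):
    """Return the genomic coordinate of chain position `x`.

    `segments` are sorted, 0-indexed, half-open (start, end) pairs.
    Hypothetical helper; raises IndexError outside the chain.
    """
    length = sum(end - start for start, end in segments)
    if not 0 <= x < length:
        raise IndexError("position %d outside chain of length %d" % (x, length))
    if stranded and strand == "-":
        # Count from the 5' end, which is the right end on the minus strand
        x = length - 1 - x
    for start, end in segments:
        seg_len = end - start
        if x < seg_len:
            return start + x
        # Skip this segment and keep walking rightward
        x -= seg_len

segments = [(5, 200), (250, 300)]
print(chain_to_genome(segments, "-", 50))
print(chain_to_genome(segments, "-", 49))
```

With the segments of `my_chain`, positions 50 and 49 fall on either side of the splice junction, as in the module examples.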

get_junctions(self)

Returns a list of GenomicSegments representing spaces between the GenomicSegments in self. In the case of a transcript, these would represent introns. In the case of an alignment, these would represent gaps in the query compared to the reference.

Returns:
list

List of GenomicSegments covering spaces between the intervals in self (e.g. introns in the case of a transcript, or gaps in the case of an alignment)
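
Conceptually, each junction runs from the end of one segment to the start of the next. A minimal sketch on plain (start, end) tuples, using a hypothetical `junctions` function rather than plastid's get_junctions():

```python
def junctions(segments):
    """Return half-open (start, end) gaps between consecutive segments."""
    segments = sorted(segments)
    return [(left_end, right_start)
            for (_, left_end), (right_start, _) in zip(segments, segments[1:])]

print(junctions([(5, 200), (250, 300)]))
print(junctions([(5, 200)]))
```

A chain with a single segment has no junctions, so the result is an empty list.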

get_length(self)

Return total length, in nucleotides, of self

Returns:
int
get_masked_counts(self, ga, stranded=True, copy=False)

Return counts covering self in dataset ga as a masked array, in transcript coordinates. Positions masked by SegmentChain.add_masks() will be masked in the array

Parameters:
ga : non-abstract subclass of AbstractGenomeArray

GenomeArray from which to fetch counts

stranded : bool, optional

If true and the SegmentChain is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)

copy : bool, optional

If False (default), returns a view of the data, so changing values in the view changes the values in the GenomeArray if it is mutable. If True, a copy is returned instead.

Returns:
numpy.ma.masked_array
get_masked_length(self)

Return the total length, in nucleotides, of positions in self that have not been masked using SegmentChain.add_masks()

Returns:
int
get_masked_position_set(self) → set

Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using SegmentChain.add_masks()

Returns:
set

Set of genomic coordinates, as integers
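
The relationship between masks and the unmasked position set can be sketched with ordinary set arithmetic. The functions below are hypothetical and work on plain half-open (start, end) tuples. They illustrate the idea and are not plastid's implementation:

```python
def position_set(segments):
    """Return the set of genomic positions covered by half-open segments."""
    positions = set()
    for start, end in segments:
        positions.update(range(start, end))
    return positions

def masked_position_set(segments, masks):
    """Return positions in `segments` that are not covered by any mask."""
    return position_set(segments) - position_set(masks)

segments = [(5, 200), (250, 300)]
masks = [(190, 260)]  # extends into the gap; only positions in the chain matter
unmasked = masked_position_set(segments, masks)
print(len(position_set(segments)), len(unmasked))
```

Because the difference is taken against the chain's own positions, a mask that extends beyond the chain has no effect outside it, which mirrors the trimming described for add_masks(). The size of the resulting set corresponds to the quantity reported by get_masked_length().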

get_masks(self)

Return masked positions as a list of GenomicSegments

Returns:
list

list of GenomicSegments representing masked positions

get_masks_as_segmentchain(self)

Return masked positions as a SegmentChain

Returns:
SegmentChain

Masked positions

get_name(self)

Returns the name of this SegmentChain, first searching through self.attr for the keys ID, Name, and name. If no value is found for any of those keys, a name is generated using SegmentChain.__str__()

Returns:
str

In order of preference, ID from self.attr, Name from self.attr, name from self.attr or str(self)

get_position_list(self)

Retrieve a sorted end-inclusive numpy array of genomic coordinates in this SegmentChain

Returns:
numpy.ndarray

Genomic coordinates in self, as integers, in genomic order

get_position_set(self)

Retrieve an end-inclusive set of genomic coordinates included in this SegmentChain

Returns:
set

Set of genomic coordinates, as integers

get_segmentchain_coordinate(self, str chrom, long genomic_x, str strand, bool stranded=True)

Finds the SegmentChain coordinate corresponding to a genomic position

Parameters:
chrom : str

Chromosome name

genomic_x : int

coordinate, in genomic space

strand : str

Chromosome strand (‘+’, ‘-’, or ‘.’)

stranded : bool, optional

If True, coordinates are given in stranded space (i.e. from 5’ end of chain, as one might expect for a transcript). If False, coordinates are given from the left end of self, regardless of strand. (Default: True)

Returns:
int

Position in SegmentChain

Raises:
KeyError

if position outside bounds of SegmentChain
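
The reverse mapping, from a genomic position to a chain coordinate, accumulates the lengths of the segments to the left of the position. `genome_to_chain` is a hypothetical helper on plain (start, end) tuples, shown only to illustrate the arithmetic:

```python
def genome_to_chain(segments, strand, genomic_x, stranded=True):
    """Return the chain coordinate of genomic position `genomic_x`.

    `segments` are sorted, 0-indexed, half-open (start, end) pairs.
    Hypothetical helper; raises KeyError if the position is not in the chain.
    """
    length = sum(end - start for start, end in segments)
    offset = 0  # number of chain positions to the left of the current segment
    for start, end in segments:
        if start <= genomic_x < end:
            x = offset + (genomic_x - start)
            if stranded and strand == "-":
                # Report the position as counted from the 5' end
                x = length - 1 - x
            return x
        offset += end - start
    raise KeyError("position %d is not in the chain" % genomic_x)

segments = [(5, 200), (250, 300)]
print(genome_to_chain(segments, "-", 118))
```

For the segments of `my_chain`, genomic position 118 maps to chain position 131, as in the module examples. A position inside the gap between segments raises KeyError.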

get_sequence(self, genome, stranded=True)

Return spliced genomic sequence of SegmentChain as a string

Parameters:
genome : dict or twobitreader.TwoBitFile

Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects

stranded : bool

If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)

Returns:
str

Nucleotide sequence of the SegmentChain extracted from genome
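
Splicing and reverse-complementing can be sketched with plain strings. `spliced_sequence` is a hypothetical helper that assumes `genome` is a dict of ordinary Python strings containing only A, C, G, and T. It is not plastid's get_sequence():

```python
# Translation table for complementing unambiguous nucleotides
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def spliced_sequence(genome, chrom, strand, segments, stranded=True):
    """Join the sequence of each segment, left to right in the genome.

    If `stranded` is True and the strand is '-', the joined sequence is
    reverse-complemented. Hypothetical helper for illustration.
    """
    seq = "".join(genome[chrom][start:end] for start, end in sorted(segments))
    if stranded and strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq

genome = {"chrA": "AACCGGTTACGT"}
segments = [(0, 4), (8, 12)]
print(spliced_sequence(genome, "chrA", "+", segments))
print(spliced_sequence(genome, "chrA", "-", segments))
```

The segments are joined before reverse-complementing, so the result reads 5’ to 3’ along the feature regardless of strand.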

get_subchain(self, long start, long end, bool stranded=True, **extra_attr)

Retrieves a sub-SegmentChain corresponding to a range of positions specified in coordinates relative to this SegmentChain. Attributes in self.attr are copied to the child SegmentChain, with the exception of ID, to which the suffix ‘subchain’ is appended.

Parameters:
start : int

position of interest in SegmentChain coordinates, 0-indexed

end : int

position of interest in SegmentChain coordinates, 0-indexed and half-open

stranded : bool, optional

If True, start and end are assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, they are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)

extra_attr : keyword arguments

Values that will be included in the subchain’s attr dict. These can be used to overwrite values already present.

Returns:
SegmentChain

covering parent chain positions start to end of self

Raises:
IndexError

if start or end is outside the bounds of the SegmentChain

TypeError

if start or end is None
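
One way to picture subchain extraction is to list the chain's genomic positions in 5’ to 3’ order, slice them, and merge the result back into segments. The sketch below does this on plain (start, end) tuples. `subchain_segments` is a hypothetical helper chosen for clarity rather than efficiency, and is not plastid's get_subchain():

```python
def subchain_segments(segments, strand, start, end, stranded=True):
    """Return the genomic segments covering chain positions [start, end).

    Hypothetical helper for illustration; raises IndexError when the
    requested range falls outside the chain.
    """
    # Genomic positions in left-to-right order
    positions = [p for seg_start, seg_end in sorted(segments)
                 for p in range(seg_start, seg_end)]
    if not 0 <= start <= end <= len(positions):
        raise IndexError("requested range is outside the chain")
    if stranded and strand == "-":
        # Position 0 is the 5' end, i.e. the rightmost genomic position
        positions.reverse()
    selected = sorted(positions[start:end])

    # Merge consecutive positions back into half-open segments
    result = []
    for p in selected:
        if result and result[-1][1] == p:
            result[-1][1] = p + 1
        else:
            result.append([p, p + 1])
    return [tuple(seg) for seg in result]

print(subchain_segments([(5, 200), (250, 300)], "-", 45, 70))
```

With the segments of `my_chain`, positions 45 to 70 span the junction, so the result keeps two segments, matching the subchain shown in the module examples.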

get_unstranded(self) → SegmentChain

Returns an unstranded version of self (strand ‘.’), with empty attr dict.

Returns:
SegmentChain

Unstranded version of self

overlaps(self, other)

Return True if self and other share genomic positions on the same strand

Parameters:
other : SegmentChain or GenomicSegment

Query feature

Returns:
bool

True if self and other share genomic positions on the same chromosome and strand; False otherwise.

Raises:
TypeError

if other is not a GenomicSegment or SegmentChain

reset_masks(self)

Removes masks added by add_masks()

shares_segments_with(self, other)

Returns a list of GenomicSegments that are shared between self and other

Parameters:
other : SegmentChain or GenomicSegment

Query feature

Returns:
list

List of GenomicSegments common to self and other

Raises:
TypeError

if other is not a GenomicSegment or SegmentChain

sort(self)
unstranded_overlaps(self, other)

Return True if self and other share genomic positions on the same chromosome, regardless of their strands

Parameters:
other : SegmentChain or GenomicSegment

Query feature

Returns:
bool

True if self and other share genomic positions on the same chromosome, False otherwise. Strands of self and other need not match

Raises:
TypeError

if other is not a GenomicSegment or SegmentChain

attr

Dictionary of arbitrary attributes describing the SegmentChain

c_strand
chrom

Chromosome the SegmentChain resides on

length
mask_segments

Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.

masked_length
next
segments

Copy of list of GenomicSegments that comprise self. Changing this list will do nothing to self.

spanning_segment
strand

Strand of the SegmentChain

class plastid.genomics.roitools.Transcript

Bases: plastid.genomics.roitools.SegmentChain

Subclass of SegmentChain specifically for RNA transcripts. In addition to coordinate-conversion, count fetching, sequence fetching, and various other methods inherited from SegmentChain, Transcript provides convenience methods for fetching sub-chains corresponding to CDS features, 5’ UTRs, and 3’ UTRs.

Parameters:
*segments : GenomicSegment

0 or more GenomicSegments on the same strand

**attr : keyword arguments

Arbitrary attributes, including, for example:

Attribute Description
cds_genome_start Location of CDS start, in genomic coordinates
cds_genome_end Location of CDS end, in genomic coordinates
ID A unique ID for the SegmentChain.
transcript_id A transcript ID used for GTF2 export
gene_id A gene ID used for GTF2 export
Attributes:
cds_genome_start : int or None

Starting coordinate of coding region, relative to genome (i.e. leftmost; is start codon for forward-strand features, stop codon for reverse-strand features).

cds_genome_end : int or None

Ending coordinate of coding region, relative to genome (i.e. rightmost; is stop codon for forward-strand features, start codon for reverse-strand features).

cds_start : int or None

Start of coding region relative to 5’ end of transcript, in direction of transcript.

cds_end : int or None

End of coding region relative to 5’ end of transcript, in direction of transcript.

spanning_segment : GenomicSegment

A GenomicSegment spanning the endpoints of the Transcript

strand : str

Strand of the SegmentChain

chrom : str

Chromosome the SegmentChain resides on

segments : list

Copy of list of GenomicSegments that comprise self.

mask_segments : list

Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.

attr : dict

attr: dict

Methods

add_masks(self, *mask_segments) Adds one or more GenomicSegment to the collection of masks.
add_segments(self, *segments) Add 1 or more GenomicSegments to the SegmentChain.
antisense_overlaps(self, other) Returns True if self and other share genomic positions on opposite strands
as_bed(self[, as_int, color, extra_columns, …]) Format self as a BED12[+X] line, assigning CDS boundaries to the thickstart and thickend columns from self.attr
as_gff3(self, bool escape=True, …) Format a Transcript as a block of GFF3 output, following the schema set out in the Sequence Ontology (SO) v2.53
as_gtf(self, str feature_type=, …) Format self as a GTF2 block.
as_psl(self) Formats SegmentChain as PSL (blat) output.
covers(self, other) Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self.
from_bed(str line[, extra_columns]) Create a Transcript from a BED line with 4 or more columns.
from_psl(str psl_line)
from_str(str inp) Create a SegmentChain from a string formatted by SegmentChain.__str__():
get_antisense(self) Returns a SegmentChain antisense to self, with an empty attr dict.
get_cds(self, **extra_attr) Retrieve SegmentChain covering the coding region of self, including the stop codon.
get_counts(self, ga[, stranded]) Return list of counts or values drawn from ga at each position in self
get_fasta(self, genome[, stranded]) Formats sequence of SegmentChain as FASTA output
get_gene(self) Return name of gene associated with SegmentChain, if any, by searching through self.attr for the keys gene_id and Parent.
get_genomic_coordinate(self, x[, stranded]) Finds genomic coordinate corresponding to position x in self
get_junctions(self) Returns a list of GenomicSegments representing spaces between the GenomicSegments in self. In the case of a transcript, these would represent introns.
get_length(self) Return total length, in nucleotides, of self
get_masked_counts(self, ga[, stranded, copy]) Return counts covering self in dataset ga as a masked array, in transcript coordinates.
get_masked_length(self) Return the total length, in nucleotides, of positions in self that have not been masked using SegmentChain.add_masks()
get_masked_position_set(self) Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using SegmentChain.add_masks()
get_masks(self) Return masked positions as a list of GenomicSegments
get_masks_as_segmentchain(self) Return masked positions as a SegmentChain
get_name(self) Return the name of self, first searching through self.attr for the keys transcript_id, ID, Name, and name.
get_position_list(self) Retrieve a sorted end-inclusive numpy array of genomic coordinates in this SegmentChain
get_position_set(self) Retrieve an end-inclusive set of genomic coordinates included in this SegmentChain
get_segmentchain_coordinate(self, str chrom, …) Finds the SegmentChain coordinate corresponding to a genomic position
get_sequence(self, genome[, stranded]) Return spliced genomic sequence of SegmentChain as a string
get_subchain(self, long start, long end, …) Retrieves a sub-SegmentChain corresponding to a range of positions specified in coordinates relative to this SegmentChain.
get_unstranded(self) Returns an unstranded copy of self (strand ‘.’), with an empty attr dict.
get_utr3(self, **extra_attr) Retrieve sub-SegmentChain covering 3’UTR of self, excluding the stop codon.
get_utr5(self, **extra_attr) Retrieve sub-SegmentChain covering 5’UTR of self.
next
overlaps(self, other) Return True if self and other share genomic positions on the same strand
reset_masks(self) Removes masks added by add_masks()
shares_segments_with(self, other) Returns a list of GenomicSegment that are shared between self and other
sort(self)
unstranded_overlaps(self, other) Return True if self and other share genomic positions on the same chromosome, regardless of their strands
add_masks(self, *mask_segments)

Adds one or more GenomicSegment to the collection of masks. Masks will be trimmed to the positions of the SegmentChain during addition.

Parameters:
mask_segments : GenomicSegment

One or more segments, in genomic coordinates, covering positions to exclude from return values of get_masked_position_set(), get_masked_counts(), or get_masked_length()
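The trimming behaviour described above can be pictured with plain sets of positions. This is only a minimal sketch, not plastid's implementation: the function name `trim_masks_to_chain` is hypothetical, and segments are reduced to `(start, end)` tuples with 0-indexed, half-open coordinates on a single chromosome and strand.

```python
def trim_masks_to_chain(chain_segments, mask_segments):
    """Return (chain_positions, masked_positions) as sets of integers.

    Mask positions that fall outside the chain (e.g. in an intron)
    are discarded, mirroring the "trimmed to the positions of the
    SegmentChain" behaviour described in the text.
    """
    chain_positions = set()
    for start, end in chain_segments:
        chain_positions.update(range(start, end))

    masked_positions = set()
    for start, end in mask_segments:
        masked_positions.update(range(start, end))

    # keep only the masked positions that are actually part of the chain
    masked_positions &= chain_positions
    return chain_positions, masked_positions


chain = [(5, 200), (250, 300)]   # two exons, 245 nt in total
masks = [(190, 260)]             # spans the end of exon 1, the intron, and the start of exon 2
chain_positions, masked_positions = trim_masks_to_chain(chain, masks)
unmasked_positions = chain_positions - masked_positions
```

In this toy example, `len(unmasked_positions)` plays the role of the masked length, and `unmasked_positions` the role of the masked position set.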

add_segments(self, *segments)

Add 1 or more GenomicSegments to the SegmentChain. If there are already segments in the chain, the incoming segments must be on the same strand and chromosome as all others present.

Parameters:
segments : GenomicSegment

One or more GenomicSegment to add to SegmentChain

antisense_overlaps(self, other)

Returns True if self and other share genomic positions on opposite strands

Parameters:
other : SegmentChain or GenomicSegment

Query feature

Returns:
bool

True if self and other share genomic positions on the same chromosome but opposite strand; False otherwise.

Raises:
TypeError

if other is not a GenomicSegment or SegmentChain
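The difference between the stranded, antisense, and unstranded overlap tests can be sketched with position sets. The code below is an illustration under stated assumptions, not plastid's code: all function names are hypothetical, chains are lists of `(chrom, start, end, strand)` tuples with 0-indexed, half-open coordinates, and only the ‘+’ and ‘-’ strands are treated as having an opposite.

```python
OPPOSITE = {"+": "-", "-": "+"}  # assumption: '.' has no antisense partner here


def position_set(chain):
    """Set of genomic positions covered by a chain of segments."""
    positions = set()
    for _chrom, start, end, _strand in chain:
        positions.update(range(start, end))
    return positions


def shares_positions(a, b):
    """True if both chains are on one chromosome and share a position."""
    if not a or not b:
        return False
    return a[0][0] == b[0][0] and bool(position_set(a) & position_set(b))


def overlaps(a, b):
    """Shared positions on the same chromosome and the same strand."""
    return shares_positions(a, b) and a[0][3] == b[0][3]


def antisense_overlaps(a, b):
    """Shared positions on the same chromosome but opposite strands."""
    return shares_positions(a, b) and OPPOSITE.get(a[0][3]) == b[0][3]


def unstranded_overlaps(a, b):
    """Shared positions on the same chromosome, ignoring strand."""
    return shares_positions(a, b)


chain = [("chrA", 5, 200, "-"), ("chrA", 250, 300, "-")]
same_strand = [("chrA", 190, 260, "-")]
other_strand = [("chrA", 190, 260, "+")]
intron_only = [("chrA", 200, 250, "-")]
```

Note that `intron_only` lies entirely between the two segments of `chain`, so it shares no positions with it even though the spanning regions intersect.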

as_bed(self, as_int=True, color=None, extra_columns=None, empty_value='')

Format self as a BED12[+X] line, assigning CDS boundaries to the thickstart and thickend columns from self.attr

If the SegmentChain was imported as a BED file with extra columns, these will be output in the same order, after the BED columns.

Parameters:
as_int : bool, optional

Force “score” to integer (Default: True)

color : str or None, optional

Color represented as RGB hex string. If not None, overrides the color in self.attr["color"]

extra_columns : None or list-like, optional

If None, and the SegmentChain was imported using the extra_columns keyword of from_bed(), the SegmentChain will be exported in BED 12+X format, in which extra columns are in the same order as they were upon import. If no extra columns were present, the SegmentChain will be exported as a BED12 line.

If a list of attribute names, these attributes will be exported as extra columns in order, overriding whatever happened upon import. If an attribute name is not in the attr dict of the SegmentChain, it will be exported with the value of empty_value

If an empty list, no extra columns will be exported; the SegmentChain will be formatted as a BED12 line.

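empty_value : str, optional

Value exported for any attribute named in extra_columns that is not in the attr dict of the SegmentChain (Default: '')
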
Returns:
str

Line of BED12-formatted text

Notes

BED12 columns are as follows
Column Contains
0 Contig or chromosome
1 Start of first block in feature (0-indexed)
2 End of last block in feature (half-open)
3 Feature name
4 Feature score
5 Strand
6 thickstart
7 thickend
8 Feature color as RGB tuple
9 Number of blocks in feature
10 Block lengths
11 Block starts, relative to start of first block
For more information
See the UCSC file format faq
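To make the column layout above concrete, here is a rough sketch of how the twelve BED columns could be derived from a list of blocks. It is not the body of as_bed(): the function `bed12_line` and its parameters are hypothetical, blocks are `(start, end)` tuples in 0-indexed, half-open genomic coordinates, and the thick columns fall back to the feature start when no CDS is given (a common convention for features without a thick region).

```python
def bed12_line(chrom, blocks, strand, name="feature", score=0,
               thickstart=None, thickend=None, color="0,0,0"):
    """Format one feature as a tab-delimited BED12 line (illustrative only)."""
    blocks = sorted(blocks)
    chrom_start = blocks[0][0]    # column 1: start of first block
    chrom_end = blocks[-1][1]     # column 2: end of last block (half-open)

    # assumption: no CDS means thickstart == thickend == feature start
    if thickstart is None or thickend is None:
        thickstart = thickend = chrom_start

    # columns 10 and 11: comma-separated lists with a trailing comma
    block_sizes = ",".join(str(end - start) for start, end in blocks) + ","
    block_starts = ",".join(str(start - chrom_start) for start, _ in blocks) + ","

    fields = [chrom, chrom_start, chrom_end, name, score, strand,
              thickstart, thickend, color, len(blocks),
              block_sizes, block_starts]
    return "\t".join(str(field) for field in fields)


line = bed12_line("chrA", [(5, 200), (250, 300)], "-",
                  name="some_chain", thickstart=100, thickend=280)
```

Block starts are relative to the start of the first block, which is why the second block in this example starts at 245 rather than 250.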
as_gff3(self, bool escape=True, list excludes=None, str rna_type='mRNA')

Format a Transcript as a block of GFF3 output, following the schema set out in the Sequence Ontology (SO) v2.53

The Transcript will be formatted according to the following rules:

  1. A feature of type rna_type will be created, with Parent attribute set to the value of self.get_gene(), and ID attribute set to self.get_name()
  2. For each GenomicSegment in self, a child feature of type exon will be created. The Parent attribute of these features will be set to the value of self.get_name(). These will have unique IDs generated from self.get_name().
  3. If self is coding (i.e. has non-None values for self.cds_genome_start and self.cds_genome_end), child features of type ‘five_prime_UTR’, ‘CDS’, and ‘three_prime_UTR’ will be created, with Parent attributes set to self.get_name(). These will have unique IDs generated from self.get_name().
Parameters:
escape : bool, optional

Escape tokens in column 9 of GFF3 output (Default: True)

excludes : list, optional

List of attribute key names to exclude from column 9 (Default: [])

rna_type : str, optional

Feature type to export RNA as (e.g. ‘tRNA’, ‘noncoding_RNA’, etc. Default: ‘mRNA’)

Returns:
str

Multiline block of GFF3-formatted text

Notes

Beware of attribute loss
This Transcript was assembled from multiple individual component features (e.g. single exons), which may or may not have had their own unique attributes in their original annotation. To reduce overhead, these individual attributes (if they were present) have not been (entirely) stored, and consequently will not (all) be exported. If this poses problems, consider instead importing, modifying, and exporting the component features
GFF3 schemas vary
Different GFF3s have different schemas (parent-child relationships between features). Here we adopt the commonly-used schema set by Sequence Ontology (SO) v2.53, which may or may not match your schema.
Columns of GFF3 are as follows
Column Contains
1 Contig or chromosome
2 Source of annotation
3 Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4 Start (1-indexed)
5 End (fully-closed)
6 Score
7 Strand
8 Frame. Number of bases within feature before first in-frame codon (if coding)
9 Attributes
For further information, see
as_gtf(self, str feature_type='exon', bool escape=True, list excludes=None)

Format self as a GTF2 block. GenomicSegments are formatted as GTF2 ‘exon’ features. Coding regions, if present, are formatted as GTF2 ‘CDS’ features. Stop codons are excluded from the ‘CDS’ features, per the GTF2 specification, and exported separately.

All attributes from self.attr are propagated to the exon and CDS features that are generated.

Parameters:
feature_type : str

If not None, overrides the ‘type’ attribute of self.attr

escape : bool, optional

URL escape tokens in column 9 of GTF2 output (Default: True)

Returns:
str

Block of GTF2-formatted text

Notes

gene_id and transcript_id are required
The GTF2 specification requires that attributes gene_id and transcript_id be defined. If these are not present in self.attr, their values will be guessed following the rules in SegmentChain.get_gene() and SegmentChain.get_name(), respectively.
Beware of attribute loss
To save memory, only the attributes shared by all of the individual sub-features (e.g. exons) that were used to assemble this Transcript have been stored in self.attr. This means that upon re-export to GTF2, these sub-features will be lacking any attributes that were specific to them individually. Formally, this is compliant with the GTF2 specification, which states explicitly that only the attributes gene_id and transcript_id are supported.

Columns of GTF2 are as follows:

Column Contains
1 Contig or chromosome
2 Source of annotation
3 Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4 Start (1-indexed)
5 End (fully-closed)
6 Score
7 Strand
8 Frame. Number of bases within feature before first in-frame codon (if coding)
9 Attributes. “gene_id” and “transcript_id” are required
For more info
as_psl(self)

Formats SegmentChain as PSL (blat) output.

Returns:
str

PSL-representation of BLAT alignment

Raises:
AttributeError

If not all of the attributes listed in the Notes below are defined

Notes

This will raise an AttributeError unless the following keys are present and defined in self.attr, corresponding to the columns of a PSL file:

Column Key
1 match_length
2 mismatches
3 rep_matches
4 N
5 query_gap_count
6 query_gap_bases
7 target_gap_count
8 target_gap_bases
9 strand
10 query_name
11 query_length
12 query_start
13 query_end
14 target_name
15 target_length
16 target_start
17 target_end
19 q_starts : list of integers
20 l_starts : list of integers

These keys are defined only if the SegmentChain was created by SegmentChain.from_psl(), or if the user has defined them.

See the PSL spec for more information.

covers(self, other)

Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self. By convention, zero-length SegmentChains are not covered by other chains.

Parameters:
other : SegmentChain or GenomicSegment

Query feature

Returns:
bool

True if self and other share a chromosome and strand, and all genomic positions in other are present in self. Otherwise False

Raises:
TypeError

if other is not a GenomicSegment or SegmentChain

static from_bed(str line, extra_columns=0)

Create a Transcript from a BED line with 4 or more columns. thickstart and thickend columns, if present, are assumed to specify CDS boundaries, a convention that, while common, is formally outside the BED specification.

See the UCSC file format faq for more details.

Parameters:
line : str

Line from a BED file with at least 4 columns

extra_columns: int or list, optional

Extra, non-BED columns in BED file corresponding to feature attributes. This is common in ENCODE-specific BED variants.

If extra_columns is:

  • an int: it is taken to be the number of attribute columns. Attributes will be stored in the attr dictionary of the SegmentChain, under names like custom0, custom1, … , customN.
  • a list of str: it is taken to be the names of the attribute columns, in order, from left to right in the file. In this case, attributes in extra columns will be stored under those names in the attr dict.
  • a list of tuple, each tuple is taken to be a pair of (attribute_name, formatter_func). In this case, the value of attribute_name in the attr dict of the SegmentChain will be set to formatter_func(column_value).

(Default: 0)

Returns:
Transcript
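The three accepted forms of extra_columns can be illustrated with a small stand-alone parser. This is a sketch of the described behaviour only, not the code behind from_bed(): the function `extra_attrs` is hypothetical, and it assumes the extra columns are the last columns of a tab-delimited line.

```python
def extra_attrs(line, extra_columns=0):
    """Return a dict of attributes parsed from the trailing extra columns."""
    fields = line.rstrip("\n").split("\t")

    if isinstance(extra_columns, int):
        # an int: that many columns, named custom0, custom1, ...
        spec = [("custom%d" % i, str) for i in range(extra_columns)]
    else:
        # a list of names, or a list of (name, formatter_func) pairs
        spec = [(col, str) if isinstance(col, str) else tuple(col)
                for col in extra_columns]

    # slice from the right; this also yields [] when there are no extras
    extras = fields[len(fields) - len(spec):]
    return {name: func(value) for (name, func), value in zip(spec, extras)}


bed_line = "chrA\t5\t300\tsome_chain\t0\t-\t0.75\thigh"
by_count = extra_attrs(bed_line, 2)
by_name = extra_attrs(bed_line, ["ratio", "label"])
by_formatter = extra_attrs(bed_line, [("ratio", float), ("label", str.upper)])
```

With formatter functions, values arrive in the dict already converted, so `by_formatter["ratio"]` is a float rather than a string.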
static from_psl(str psl_line)
static from_str(str inp)

Create a SegmentChain from a string formatted by SegmentChain.__str__():

chrom:start-end^start-end(strand)

where ‘^’ indicates a splice junction between regions specified by start and end, and strand is ‘+’, ‘-’, or ‘.’. Coordinates are 0-indexed and half-open.

Parameters:
inp : str

String formatted in manner of SegmentChain.__str__()

Returns:
SegmentChain
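The string format above is simple enough to parse with a regular expression. The following is a minimal sketch of such a parser, not the implementation of from_str(): `parse_chain_string` and `CHAIN_PATTERN` are hypothetical names, the result is a list of plain tuples rather than GenomicSegments, and empty chains are not handled.

```python
import re

# chrom:start-end^start-end(strand), e.g. "chrA:5-200^250-300(-)"
CHAIN_PATTERN = re.compile(
    r"^(?P<chrom>.+):(?P<spans>[0-9^-]+)\((?P<strand>[+.-])\)$"
)


def parse_chain_string(text):
    """Parse a chain string into a list of (chrom, start, end, strand) tuples."""
    match = CHAIN_PATTERN.match(text)
    if match is None:
        raise ValueError("not a chain string: %r" % text)

    chrom = match.group("chrom")
    strand = match.group("strand")
    segments = []
    # '^' separates the spliced segments; '-' separates start from end
    for span in match.group("spans").split("^"):
        start, end = span.split("-")
        segments.append((chrom, int(start), int(end), strand))
    return segments


parsed = parse_chain_string("chrA:5-200^250-300(-)")
```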
get_antisense(self) → SegmentChain

Returns a SegmentChain antisense to self, with an empty attr dict.

Returns:
SegmentChain

SegmentChain antisense to self

get_cds(self, **extra_attr)

Retrieve SegmentChain covering the coding region of self, including the stop codon. If no coding region is present, returns an empty SegmentChain.

The following attributes are passed from self.attr to the new SegmentChain

  1. transcript_id, taken from SegmentChain.get_name()
  2. gene_id, taken from SegmentChain.get_gene()
  3. ID, generated as “%s_CDS” % self.get_name()
Parameters:
extra_attr : keyword arguments

Values that will be included in the CDS subchain’s attr dict. These can be used to overwrite values already present.

Returns:
SegmentChain

CDS region of self if present, otherwise empty SegmentChain

get_counts(self, ga, stranded=True)

Return list of counts or values drawn from ga at each position in self

Parameters:
ga : GenomeArray

GenomeArray from which to fetch counts

stranded : bool, optional

If True and self is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)

Returns:
numpy.ndarray

Array of counts from ga covering self

get_fasta(self, genome, stranded=True)

Formats sequence of SegmentChain as FASTA output

Parameters:
genome : dict or twobitreader.TwoBitFile

Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects

stranded : bool

If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)

Returns:
str

FASTA-formatted sequence of SegmentChain extracted from genome

get_gene(self)

Return name of gene associated with SegmentChain, if any, by searching through self.attr for the keys gene_id and Parent. If one is not found, a generated gene name for the SegmentChain is made from get_name().

Returns:
str

Returns in order of preference, gene_id from self.attr, Parent from self.attr or 'gene_%s' % self.get_name()

get_genomic_coordinate(self, x, stranded=True)

Finds genomic coordinate corresponding to position x in self

Parameters:
x : int

position of interest, relative to SegmentChain

stranded : bool, optional

If True, x is assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, x is assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)

Returns:
str

Chromosome name

long

Genomic coordinate corresponding to position x

str

Chromosome strand (‘+’, ‘-’, or ‘.’)

Raises:
IndexError

if x is outside the bounds of the SegmentChain
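The effect of the stranded flag is easiest to see on a small example. The sketch below is not plastid's implementation: `chain_positions`, `genomic_coordinate`, and `chain_coordinate` are hypothetical helpers that work on `(start, end)` tuples (0-indexed, half-open), and rejecting negative x with an IndexError is an assumption of this sketch. It uses the two-segment minus-strand chain from the module examples.

```python
def chain_positions(segments):
    """All genomic positions in the chain, in left-to-right genomic order."""
    positions = []
    for start, end in sorted(segments):
        positions.extend(range(start, end))
    return positions


def genomic_coordinate(segments, strand, x, stranded=True):
    """Genomic position corresponding to position x in the chain."""
    positions = chain_positions(segments)
    if not 0 <= x < len(positions):
        raise IndexError("position %s is outside the chain" % x)
    if stranded and strand == "-":
        # on the minus strand the 5' end is the rightmost position
        return positions[len(positions) - 1 - x]
    return positions[x]


def chain_coordinate(segments, strand, genomic_x, stranded=True):
    """Position in the chain corresponding to a genomic position."""
    positions = chain_positions(segments)
    if genomic_x not in positions:
        raise KeyError("position %s is not in the chain" % genomic_x)
    index = positions.index(genomic_x)
    if stranded and strand == "-":
        return len(positions) - 1 - index
    return index


segments = [(5, 200), (250, 300)]   # same coordinates as my_chain in the examples
```

Because positions are collected segment by segment, the intron between 200 and 250 never appears in the list, which is why consecutive chain positions can map to genomic coordinates that are far apart.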

get_junctions(self)

Returns a list of GenomicSegments representing spaces between the GenomicSegments in self. In the case of a transcript, these would represent introns. In the case of an alignment, these would represent gaps in the query compared to the reference.

Returns:
list

List of GenomicSegments covering spaces between the intervals in self (e.g. introns in the case of a transcript, or gaps in the case of an alignment)
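Computing junctions amounts to pairing the end of each segment with the start of the next. A minimal sketch, assuming sorted, non-overlapping `(start, end)` tuples in 0-indexed, half-open coordinates (the function name `junctions` is hypothetical and this is not plastid's code):

```python
def junctions(segments):
    """Return the gaps between consecutive segments as (start, end) tuples."""
    ordered = sorted(segments)
    return [
        (left_end, right_start)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    ]


introns = junctions([(250, 300), (5, 200)])
```

A chain with a single segment has no junctions, so the result is an empty list.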

get_length(self)

Return total length, in nucleotides, of self

Returns:
int
get_masked_counts(self, ga, stranded=True, copy=False)

Return counts covering self in dataset ga as a masked array, in transcript coordinates. Positions masked by SegmentChain.add_masks() will be masked in the array.

Parameters:
ga : non-abstract subclass of AbstractGenomeArray

GenomeArray from which to fetch counts

stranded : bool, optional

If true and the SegmentChain is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)

copy : bool, optional

If False (default) returns a view of the data; so changing values in the view changes the values in the GenomeArray if it is mutable. If True, a copy is returned instead.

Returns:
numpy.ma.masked_array
get_masked_length(self)

Return the total length, in nucleotides, of positions in self that have not been masked using SegmentChain.add_masks()

Returns:
int
get_masked_position_set(self) → set

Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using SegmentChain.add_masks()

Returns:
set

Set of genomic coordinates, as integers

get_masks(self)

Return masked positions as a list of GenomicSegments

Returns:
list

list of GenomicSegments representing masked positions

get_masks_as_segmentchain(self)

Return masked positions as a SegmentChain

Returns:
SegmentChain

Masked positions

get_name(self)

Return the name of self, first searching through self.attr for the keys transcript_id, ID, Name, and name. If no value is found, Transcript.__str__() is used.

Returns:
str

Returns in order of preference, transcript_id, ID, Name, or name from self.attr. If not found, returns str(self)
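The lookup order used for names (and, analogously, for gene names in get_gene()) is a first-match search through the attr dict with a generated fallback. The sketch below only restates that order in code; `first_present`, `guess_name`, and `guess_gene` are hypothetical helpers, and `fallback` stands in for str(self).

```python
def first_present(attr, keys, default):
    """Return the value of the first key found in attr, else default."""
    for key in keys:
        if key in attr:
            return attr[key]
    return default


def guess_name(attr, fallback):
    """Preference order: transcript_id, ID, Name, name, then the fallback."""
    return first_present(attr, ("transcript_id", "ID", "Name", "name"), fallback)


def guess_gene(attr, fallback):
    """Preference order: gene_id, Parent, then 'gene_%s' % name."""
    return first_present(attr, ("gene_id", "Parent"),
                         "gene_%s" % guess_name(attr, fallback))


name = guess_name({"ID": "tx1", "Name": "ignored"}, "chrA:5-200(-)")
gene = guess_gene({"ID": "tx1"}, "chrA:5-200(-)")
```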

get_position_list(self)

Retrieve a sorted end-inclusive numpy array of genomic coordinates in this SegmentChain

Returns:
list

Genomic coordinates in self, as integers, in genomic order

get_position_set(self)

Retrieve an end-inclusive set of genomic coordinates included in this SegmentChain

Returns:
set

Set of genomic coordinates, as integers

get_segmentchain_coordinate(self, str chrom, long genomic_x, str strand, bool stranded=True)

Finds the SegmentChain coordinate corresponding to a genomic position

Parameters:
chrom : str

Chromosome name

genomic_x : int

coordinate, in genomic space

strand : str

Chromosome strand (‘+’, ‘-’, or ‘.’)

stranded : bool, optional

If True, coordinates are given in stranded space (i.e. from 5’ end of chain, as one might expect for a transcript). If False, coordinates are given from the left end of self, regardless of strand. (Default: True)

Returns:
int

Position in SegmentChain

Raises:
KeyError

if position outside bounds of SegmentChain

get_sequence(self, genome, stranded=True)

Return spliced genomic sequence of SegmentChain as a string

Parameters:
genome : dict or twobitreader.TwoBitFile

Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects

stranded : bool

If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)

Returns:
str

Nucleotide sequence of the SegmentChain extracted from genome
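Splicing and reverse-complementing can be sketched with a plain dict of strings standing in for the genome. This is an illustration rather than the library's code: `spliced_sequence` is a hypothetical function, segments are `(start, end)` tuples in 0-indexed, half-open coordinates, and only the unambiguous bases A, C, G, and T are complemented.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def spliced_sequence(genome, chrom, segments, strand, stranded=True):
    """Join the sequence of each segment, left to right, then orient it."""
    sequence = "".join(genome[chrom][start:end] for start, end in sorted(segments))
    if stranded and strand == "-":
        # reverse complement for minus-strand features
        sequence = sequence.translate(COMPLEMENT)[::-1]
    return sequence


genome = {"chrA": "AACCGGTTAC"}
exons = [(0, 3), (6, 9)]
plus_sequence = spliced_sequence(genome, "chrA", exons, "+")
minus_sequence = spliced_sequence(genome, "chrA", exons, "-")
```

Passing `stranded=False` for a minus-strand feature returns the sequence as it reads on the forward strand of the genome.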

get_subchain(self, long start, long end, bool stranded=True, **extra_attr)

Retrieves a sub-SegmentChain corresponding to a range of positions specified in coordinates relative to this SegmentChain. Attributes in self.attr are copied to the child SegmentChain, with the exception of ID, to which the suffix ‘subchain’ is appended.

Parameters:
start : int

position of interest in SegmentChain coordinates, 0-indexed

end : int

position of interest in SegmentChain coordinates, 0-indexed and half-open

stranded : bool, optional

If True, start and end are assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, they are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)

extra_attr : keyword arguments

Values that will be included in the subchain’s attr dict. These can be used to overwrite values already present.

Returns:
SegmentChain

covering parent chain positions start to end of self

Raises:
IndexError

if start or end is outside the bounds of the SegmentChain

TypeError

if start or end is None

get_unstranded(self) → SegmentChain

Returns an unstranded copy of self (strand ‘.’), with an empty attr dict.

Returns:
SegmentChain

Unstranded copy of self

get_utr3(self, **extra_attr)

Retrieve sub-SegmentChain covering 3’UTR of self, excluding the stop codon. If no coding region, returns an empty SegmentChain

The following attributes are passed from self.attr to the new SegmentChain

  1. transcript_id, taken from SegmentChain.get_name()
  2. gene_id, taken from SegmentChain.get_gene()
  3. ID, generated as “%s_3UTR” % self.get_name()
Parameters:
extra_attr : keyword arguments

Values that will be included in the 3’ UTR subchain’s attr dict. These can be used to overwrite values already present.

Returns:
SegmentChain

3’ UTR region of self if present, otherwise empty SegmentChain

get_utr5(self, **extra_attr)

Retrieve sub-SegmentChain covering 5’UTR of self. If no coding region, returns an empty SegmentChain

The following attributes are passed from self.attr to the new SegmentChain

  1. transcript_id, taken from SegmentChain.get_name()
  2. gene_id, taken from SegmentChain.get_gene()
  3. ID, generated as “%s_5UTR” % self.get_name()
Parameters:
extra_attr : keyword arguments

Values that will be included in the 5’UTR subchain’s attr dict. These can be used to overwrite values already present.

Returns:
SegmentChain

5’ UTR region of self if present, otherwise empty SegmentChain
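The relationship between the 5’ UTR, CDS, and 3’ UTR sub-chains can be sketched in transcript coordinates. This is an illustration only, not the library's code. It assumes cds_start and cds_end are 0-indexed, half-open positions counted from the 5’ end, with cds_end falling after the stop codon; `split_transcript` is a hypothetical helper that operates on a list of genomic positions already ordered 5’ to 3’.

```python
def split_transcript(positions_5_to_3, cds_start, cds_end):
    """Split positions into (utr5, cds, utr3); all empty if non-coding."""
    if cds_start is None or cds_end is None:
        return [], [], []
    utr5 = positions_5_to_3[:cds_start]
    cds = positions_5_to_3[cds_start:cds_end]   # includes the stop codon
    utr3 = positions_5_to_3[cds_end:]           # starts after the stop codon
    return utr5, cds, utr3


# a 30-nt, single-exon, plus-strand transcript starting at genomic position 100
positions = list(range(100, 130))
utr5, cds, utr3 = split_transcript(positions, 6, 21)
```

The three parts are contiguous and together cover the whole transcript, and a non-coding transcript yields three empty lists, matching the empty SegmentChains described above.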

overlaps(self, other)

Return True if self and other share genomic positions on the same strand

Parameters:
other : SegmentChain or GenomicSegment

Query feature

Returns:
bool

True if self and other share genomic positions on the same chromosome and strand; False otherwise.

Raises:
TypeError

if other is not a GenomicSegment or SegmentChain

reset_masks(self)

Removes masks added by add_masks()

shares_segments_with(self, other)

Returns a list of GenomicSegment that are shared between self and other

Parameters:
other : SegmentChain or GenomicSegment

Query feature

Returns:
list

List of GenomicSegments common to self and other

Raises:
TypeError

if other is not a GenomicSegment or SegmentChain

sort(self)
unstranded_overlaps(self, other)

Return True if self and other share genomic positions on the same chromosome, regardless of their strands

Parameters:
other : SegmentChain or GenomicSegment

Query feature

Returns:
bool

True if self and other share genomic positions on the same chromosome, False otherwise. Strands of self and other need not match

Raises:
TypeError

if other is not a GenomicSegment or SegmentChain

attr

attr: dict

c_strand
cds_end

End of coding region relative to 5’ end of transcript, in direction of transcript. Setting to None also sets self.cds_start, self.cds_genome_start and self.cds_genome_end to None

cds_genome_end

Ending coordinate of coding region, relative to genome (i.e. rightmost; is stop codon for forward-strand features, start codon for reverse-strand features). Setting to None also sets self.cds_start, self.cds_end, and self.cds_genome_start to None

cds_genome_start

Starting coordinate of coding region, relative to genome (i.e. leftmost; is start codon for forward-strand features, stop codon for reverse-strand features). Setting to None also sets self.cds_start, self.cds_end, and self.cds_genome_end to None

cds_start

Start of coding region relative to 5’ end of transcript, in direction of transcript. Setting to None also sets self.cds_end, self.cds_genome_start and self.cds_genome_end to None

chrom

Chromosome the SegmentChain resides on

length
mask_segments

Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.

masked_length
next
segments

Copy of list of GenomicSegments that comprise self. Changing this list will do nothing to self.

spanning_segment
strand

Strand of the SegmentChain

plastid.genomics.roitools.add_three_for_stop_codon(Transcript tx) → Transcript

Extend an annotated CDS region, if present, by three nucleotides at the threeprime end. Use in cases when annotation files exclude the stop codon from the annotated CDS.

Parameters:
tx : Transcript

query transcript

Returns:
Transcript

Transcript with same attributes as tx, but with CDS extended by one codon

Raises:
IndexError

if a three prime UTR is defined that terminates before the complete stop codon
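In transcript coordinates, the extension is a three-nucleotide shift of the CDS end with a bounds check. The sketch below is an assumption-laden illustration, not the function's actual body: `extend_cds_end` is a hypothetical helper that takes 0-indexed, half-open transcript coordinates and the transcript length, and returns the new CDS boundaries.

```python
def extend_cds_end(cds_start, cds_end, transcript_length):
    """Return (cds_start, cds_end + 3), or the inputs if there is no CDS."""
    if cds_start is None or cds_end is None:
        # no annotated CDS: nothing to extend
        return cds_start, cds_end

    new_end = cds_end + 3
    if new_end > transcript_length:
        # the transcript ends before a complete stop codon would fit
        raise IndexError("transcript too short to add a stop codon")
    return cds_start, new_end


extended = extend_cds_end(6, 21, 30)
```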

plastid.genomics.roitools.merge_segments(list segments) → list

Merge all overlapping GenomicSegments in segments, so that all segments returned are guaranteed to be sorted and non-overlapping.

Note

All segments are assumed to be on the same strand and chromosome.

Parameters:
segments : list

List of GenomicSegments, all on the same strand and chromosome

Returns:
list

List of sorted, non-overlapping GenomicSegments
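Merging is the classic sort-then-sweep interval algorithm. A minimal sketch, not plastid's implementation: `merge_intervals` is a hypothetical function over `(start, end)` tuples (0-indexed, half-open) on one chromosome and strand. It joins only intervals that share at least one position; whether intervals that merely touch are also joined is not stated in the text, so that choice is an assumption.

```python
def merge_intervals(intervals):
    """Return sorted, non-overlapping (start, end) tuples."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            # overlaps the previous interval: extend it if necessary
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


merged = merge_intervals([(250, 300), (5, 100), (50, 200), (260, 270)])
```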

plastid.genomics.roitools.positionlist_to_segments(str chrom, str strand, list positions) → list

Construct GenomicSegments from a chromosome name, a strand, and a list of chromosomal positions

Parameters:
chrom : str

Chromosome name

strand : str

Chromosome strand (‘+’, ‘-’, or ‘.’)

positions : list of unique integers

Sorted, end-inclusive list of positions to include in final GenomicSegment

Returns:
list

List of GenomicSegments covering positions

Warning

This function is meant to run quickly, without excessive type conversions. So, the elements of positions must be UNIQUE and SORTED. If they are not, use positions_to_segments() instead.

plastid.genomics.roitools.positions_to_segments(str chrom, str strand, positions) → list

Construct GenomicSegments from a chromosome name, a strand, and a list of chromosomal positions

Parameters:
chrom : str

Chromosome name

strand : str

Chromosome strand (‘+’, ‘-’, or ‘.’)

positions : list of integers

End-inclusive list, tuple, or set of positions to include in final GenomicSegment

Returns:
list

List of GenomicSegments covering positions
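Converting positions to segments means collapsing each run of consecutive integers into one half-open interval. The sketch below is not the library's implementation: `positions_to_intervals` is a hypothetical function that returns `(start, end)` tuples instead of GenomicSegments, and it sorts and de-duplicates its input first, which is the step positionlist_to_segments() expects the caller to have done already.

```python
def positions_to_intervals(positions):
    """Collapse positions into sorted, half-open (start, end) intervals."""
    intervals = []
    for position in sorted(set(positions)):
        if intervals and position == intervals[-1][1]:
            # position directly follows the current run: extend it
            intervals[-1][1] = position + 1
        else:
            # gap found: start a new run
            intervals.append([position, position + 1])
    return [tuple(interval) for interval in intervals]


intervals = positions_to_intervals([7, 5, 6, 6, 10, 11, 20])
```

Each included position p contributes the half-open range from p to p + 1, so an isolated position such as 20 becomes the interval (20, 21).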