plastid.genomics.roitools module

This module contains classes for representing and manipulating genomic features.

Summary

Genomic features are represented as SegmentChains, which can contain zero or more continuous spans of the genome (GenomicSegments), as well as rich annotation data. For the specific case of RNA transcripts, a subclass of SegmentChain called Transcript is provided.

Module contents

GenomicSegment(chrom, start, end, strand)

Building block for SegmentChain: a continuous segment of the genome defined by a chromosome name, start coordinate, end coordinate, and strand.

SegmentChain(*segments, **attributes)

Base class for genomic features, composed of zero or more GenomicSegments.

Transcript(*segments, **attributes)

Subclass of SegmentChain specifically for RNA transcripts.

positions_to_segments(unicode chrom, ...)

Construct GenomicSegments from a chromosome name, a strand, and a list of chromosomal positions.

add_three_for_stop_codon(Transcript tx)

Extend an annotated CDS region, if present, by three nucleotides at the 3’ end.

Examples

SegmentChains may be read directly from annotation files using the readers in plastid.readers:

>>> from plastid import *
>>> chains = list(BED_Reader(open("some_file.bed")))

or constructed from GenomicSegments:

>>> seg1 = GenomicSegment("chrA", 5, 200, "-")
>>> seg2 = GenomicSegment("chrA", 250, 300, "-")
>>> my_chain = SegmentChain(seg1, seg2, ID="some_chain", ... , some_attribute="some_value")

SegmentChains contain convenience methods for a number of common tasks, for example:

  • converting coordinates between the spliced space of the chain, and the genome:

    >>> # get coordinate of 50th position from 5' end
    >>> my_chain.get_genomic_coordinate(50)
    ('chrA', 199, '-')
    
    >>> # get coordinate of 49th position. Splicing is taken care of!
    >>> my_chain.get_genomic_coordinate(49)
    ('chrA', 250, '-')
    
    >>> # get coordinate in chain corresponding to genomic coordinate 118
    >>> my_chain.get_segmentchain_coordinate("chrA", 118, "-")
    131
    
    >>> # get a subchain containing positions 45-70
    >>> subchain = my_chain.get_subchain(45, 70)
    >>> subchain
    <SegmentChain segments=2 bounds=chrA:180-255(-) name=some_chain_subchain>
    
    >>> # the subchain preserves the discontinuity found in `my_chain`
    >>> subchain.segments
    [<GenomicSegment chrA:180-200 strand='-'>,
     <GenomicSegment chrA:250-255 strand='-'>]
    
  • fetching numpy arrays of data at each position in the chain. The data is assumed to be kept in a GenomeArray:

    >>> ga = BAMGenomeArray(["some_file.bam"], mapping=ThreePrimeMapFactory(offset=15))
    >>> my_chain.get_counts(ga)
    array([843, 854, 153,  86, 462, 359, 290,  38,  38, 758, 342, 299, 430,
           628, 324, 437, 231, 417, 536, 673, 243, 981, 661, 415, 207, 446,
           197, 520, 653, 468, 863,   3, 272, 754, 352, 960, 966, 913, 367,
           ...
           ])
    
  • similarly, fetching spliced sequence, reverse-complemented if necessary for minus-strand features. As input, the SegmentChain expects a dictionary-like object mapping chromosome names to string-like sequences (e.g. as in BioPython or twobitreader):

    >>> seqdict = { "chrA" : "TCTACATA ..." } # some string of chrA sequence
    >>> my_chain.get_sequence(seqdict)
    "ACTGTGTACTGTACGATCGATCGTACGTACGATCGATCGTACGTAGCTAGTCAGCTAGCTAGCTAGCTGA..."
    
  • testing for overlap, containment, equality with other SegmentChains:

    >>> other_chain = SegmentChain(GenomicSegment("chrA", 200, 300, "-"),
    ...                            GenomicSegment("chrA", 800, 900, "-"))
    
    >>> my_chain.overlaps(other_chain)
    True
    
    >>> other_chain in my_chain
    False
    
    >>> my_chain in my_chain
    True
    
    >>> my_chain.covers(other_chain)
    False
    
    >>> my_chain == other_chain
    False
    
    >>> my_chain == my_chain
    True
    
  • export to BED, GTF2, or GFF3:

    >>> my_chain.as_bed()
    chrA    5    300    some_chain    0    -    5    5    0,0,0    2    195,50,    0,245,
    
    >>> my_chain.as_gtf()
    chrA    .    exon    6    200    .    -    .    gene_id "gene_some_chain"; transcript_id "some_chain"; some_attribute "some_value"; ID "some_chain";
    chrA    .    exon    251  300    .    -    .    gene_id "gene_some_chain"; transcript_id "some_chain"; some_attribute "some_value"; ID "some_chain";
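The coordinate conversions shown in the examples above can be approximated in plain Python. This is a minimal sketch and not plastid's implementation: the helper names `chain_to_genome` and `genome_to_chain`, and the representation of segments as `(start, end)` tuples, are assumptions made for illustration.

```python
# Segments of the example chain as (start, end) tuples: 0-indexed, half-open,
# sorted from left to right on the chromosome.
SEGMENTS = [(5, 200), (250, 300)]


def positions(segments, strand):
    """List every genomic position in the chain, ordered from 5' to 3'."""
    pos = [x for start, end in segments for x in range(start, end)]
    return pos[::-1] if strand == "-" else pos


def chain_to_genome(segments, strand, x):
    """Genomic coordinate of position x in the spliced chain."""
    return positions(segments, strand)[x]


def genome_to_chain(segments, strand, genomic_x):
    """Position in the spliced chain of a genomic coordinate."""
    return positions(segments, strand).index(genomic_x)


print(chain_to_genome(SEGMENTS, "-", 50))   # 199
print(chain_to_genome(SEGMENTS, "-", 49))   # 250
print(genome_to_chain(SEGMENTS, "-", 118))  # 131
```

Building the full position list keeps the sketch short; a real implementation would more likely walk segment lengths instead of enumerating every position.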
    
class plastid.genomics.roitools.GenomicSegment(chrom, start, end, strand)

Bases: object

Building block for SegmentChain: a continuous segment of the genome defined by a chromosome name, start coordinate, end coordinate, and strand.

Examples

GenomicSegments sort lexically by chromosome, start position, end position, and finally strand:

>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrB", 0, 10, "+")
True

>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrA", 75, 100, "+")
True

>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrA", 55, 75, "+")
True

>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrA", 50, 150, "+")
True

>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrA", 50, 100, "-")
True
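The ordering above behaves like ordinary tuple comparison on (chromosome, start, end, strand). The sketch below uses plain tuples as a stand-in for GenomicSegment; it assumes strands compare by character code, which agrees with the last example ("+" sorts before "-") but is not confirmed here as plastid's rule for every strand value.

```python
# Plain tuples (chrom, start, end, strand) as a stand-in for GenomicSegment.
segments = [
    ("chrA", 50, 150, "+"),
    ("chrB", 0, 10, "+"),
    ("chrA", 50, 100, "-"),
    ("chrA", 50, 100, "+"),
    ("chrA", 55, 75, "+"),
]

# Tuples compare field by field: chromosome, then start, then end, then strand.
ordered = sorted(segments)

for seg in ordered:
    print(seg)
```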

They also provide a few convenience methods for containment or overlap. To be contained, a segment must be on the same chromosome and strand as its container, and its coordinates must be within or equal to its endpoints:

>>> GenomicSegment("chrA", 50, 100, "+") in GenomicSegment("chrA", 25, 100, "+")
True

>>> GenomicSegment("chrA", 50, 100, "+") in GenomicSegment("chrA", 50, 100, "+")
True

>>> GenomicSegment("chrA", 50, 100, "+") in GenomicSegment("chrA", 25, 100, "-")
False

>>> GenomicSegment("chrA", 50, 100, "+") in GenomicSegment("chrA", 75, 200, "+")
False

Similarly, to overlap, GenomicSegments must be on the same strand and chromosome.
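The containment and overlap rules can be written out for half-open intervals as follows. This is an illustrative sketch only: `Seg`, `contains`, and `overlaps` are hypothetical names, not the GenomicSegment class or its methods.

```python
from collections import namedtuple

# Hypothetical stand-in for GenomicSegment; coordinates are 0-indexed, half-open.
Seg = namedtuple("Seg", ["chrom", "start", "end", "strand"])


def contains(outer, inner):
    """True if every position of inner lies within outer (same chrom and strand)."""
    same_place = outer.chrom == inner.chrom and outer.strand == inner.strand
    return same_place and outer.start <= inner.start and inner.end <= outer.end


def overlaps(a, b):
    """True if a and b share at least one position (same chrom and strand)."""
    same_place = a.chrom == b.chrom and a.strand == b.strand
    return same_place and a.start < b.end and b.start < a.end


query = Seg("chrA", 50, 100, "+")
print(contains(Seg("chrA", 25, 100, "+"), query))   # True
print(contains(Seg("chrA", 25, 100, "-"), query))   # False
print(overlaps(query, Seg("chrA", 75, 200, "+")))   # True
print(overlaps(query, Seg("chrA", 100, 150, "+")))  # False: half-open ends do not touch
```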

Attributes
chrom : str

Chromosome where GenomicSegment resides

start : int

Zero-indexed (Pythonic) start coordinate of GenomicSegment

end : int

Zero-indexed, half-open (Pythonic) end coordinate of GenomicSegment

strand : str

Strand of GenomicSegment

Methods

as_igv_str(self)

Format as an IGV location string

contains(self, GenomicSegment other)

Test whether this segment contains other, where containment is defined as all positions in other being present in self, when both self and other share the same chromosome and strand.

from_igv_str(unicode loc_str, unicode strand=u'.')

Construct GenomicSegment from IGV location string

from_str(unicode inp)

Construct a GenomicSegment from its str() representation

overlaps(self, GenomicSegment other)

Test whether this segment overlaps other, where overlap is defined as sharing: a chromosome, a strand, and a subset of coordinates.

as_igv_str(self) → unicode

Format as an IGV location string

contains(self, GenomicSegment other) → bool

Test whether this segment contains other, where containment is defined as all positions in other being present in self, when both self and other share the same chromosome and strand.

Parameters
other : GenomicSegment

Query segment

Returns
bool

static from_igv_str(unicode loc_str, unicode strand=u'.')

Construct GenomicSegment from IGV location string

Parameters
igvloc : str

IGV location string, in format ‘chromosome:start-end’, where start and end are 1-indexed and half-open

strand : str

The chromosome strand (‘+’, ‘-’, or ‘.’)

Returns
GenomicSegment

static from_str(unicode inp)

Construct a GenomicSegment from its str() representation

Parameters
inp : str

String representation of GenomicSegment as chrom:start-end(strand) where start and end are in 0-indexed, half-open coordinates

Returns
GenomicSegment

overlaps(self, GenomicSegment other) → bool

Test whether this segment overlaps other, where overlap is defined as sharing: a chromosome, a strand, and a subset of coordinates.

Parameters
other : GenomicSegment

Query segment

Returns
bool

c_strand

chrom

Chromosome where GenomicSegment resides

end

Zero-indexed, half-open (Pythonic) end coordinate of GenomicSegment

start

Zero-indexed (Pythonic) start coordinate of GenomicSegment

strand

Strand of GenomicSegment

  • ‘+’ for forward / Watson strand

  • ‘-’ for reverse / Crick strand

  • ‘.’ for unstranded / both strands

class plastid.genomics.roitools.SegmentChain(*segments, **attributes)

Bases: object

Base class for genomic features, composed of zero or more GenomicSegments. SegmentChains can therefore model discontinuous features, such as multi-exon transcripts or gapped alignments, in addition to continuous features.

Numerous convenience functions are supplied for:

  • converting between coordinates relative to the genome and relative to the internal coordinates of a spliced SegmentChain

  • fetching genomic sequence, read alignments, or count data, accounting for splicing of the segments, and, in the case of reverse-strand features, reverse-complementing

  • slicing or fetching sub-regions of a SegmentChain

  • testing equality, inequality, overlap, containment, coverage of, or sharing of segments with other SegmentChain or GenomicSegment objects

  • import/export to BED, PSL, GTF2, and GFF3 formats, for use in other software packages or in a genome browser.

Intervals are sorted from lowest to greatest starting coordinate on their reference sequence, regardless of strand. Iteration over the SegmentChain will yield intervals from left-to-right in the genome.

Parameters
*segments : GenomicSegment

0 or more GenomicSegments on the same strand

**attr : keyword arguments

Arbitrary attributes, including, for example:

Attribute        Description
type             A feature type used for GTF2/GFF3 export of each interval in the SegmentChain. (Default: ‘exon’)
ID               A unique ID for the SegmentChain.
transcript_id    A transcript ID used for GTF2 export
gene_id          A gene ID used for GTF2 export

See also

Transcript

Transcript subclass, additionally providing richer GTF2, GFF3, and BED export, as well as methods for fetching coding regions and UTRs as subsegments

Attributes
spanning_segment : GenomicSegment

A GenomicSegment spanning the endpoints of the SegmentChain

strand : str

Strand of the SegmentChain

chrom : str

Chromosome the SegmentChain resides on

attr : dict

Dictionary of arbitrary attributes of the SegmentChain

segments : list

Copy of list of GenomicSegments that comprise self.

mask_segments : list

Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.

Methods

add_masks(self, *mask_segments)

Adds one or more GenomicSegment to the collection of masks.

add_segments(self, *segments)

Add 1 or more GenomicSegments to the SegmentChain.

antisense_overlaps(self, other)

Returns True if self and other share genomic positions on opposite strands

as_bed(self[, thickstart, thickend, as_int, ...])

Format SegmentChain as a string of BED12[+X] output.

as_gff3(self, unicode feature_type=None, ...)

Format self as a line of GFF3 output.

as_gtf(self, unicode feature_type=None, ...)

Format SegmentChain as a block of GTF2 output.

as_psl(self)

Formats SegmentChain as PSL (blat) output.

covers(self, other)

Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self.

from_bed(unicode line[, extra_columns])

Create a SegmentChain from a line from a BED file.

from_psl(psl_line)

Create a SegmentChain from a line from a PSL (BLAT) file

from_str(unicode inp)

Create a SegmentChain from a string formatted by SegmentChain.__str__():

get_antisense(self)

Returns a SegmentChain antisense to self, with empty attr dict.

get_counts(self, ga[, stranded])

Return list of counts or values drawn from ga at each position in self

get_fasta(self, genome[, stranded])

Formats sequence of SegmentChain as FASTA output

get_gene(self)

Return name of gene associated with SegmentChain, if any, by searching through self.attr for the keys gene_id and Parent.

get_genomic_coordinate(self, x[, stranded])

Finds genomic coordinate corresponding to position x in self

get_junctions(self)

Returns a list of GenomicSegments representing spaces between the GenomicSegments in self. In the case of a transcript, these would represent introns.

get_masked_counts(self, ga[, stranded, copy])

Return counts covering self in dataset ga as a masked array, in transcript coordinates.

get_masked_position_set(self)

Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using SegmentChain.add_masks()

get_masks(self)

Return masked positions as a list of GenomicSegments

get_masks_as_segmentchain(self)

Return masked positions as a SegmentChain

get_name(self)

Returns the name of this SegmentChain, first searching through self.attr for the keys ID, Name, and name.

get_position_list(self)

Retrieve a sorted end-inclusive numpy array of genomic coordinates in this SegmentChain

get_position_set(self)

Retrieve an end-inclusive set of genomic coordinates included in this SegmentChain

get_segmentchain_coordinate(self, ...)

Finds the SegmentChain coordinate corresponding to a genomic position

get_sequence(self, genome[, stranded])

Return spliced genomic sequence of SegmentChain as a string

get_subchain(self, long start, long end, ...)

Retrieves a sub-SegmentChain corresponding to a range of positions specified in coordinates relative to this SegmentChain.

get_unstranded(self)

Returns an unstranded (strand ‘.’) copy of self, with empty attr dict.

next(self)

Return next GenomicSegment in the SegmentChain, from left to right on the chromosome

overlaps(self, other)

Return True if self and other share genomic positions on the same strand

reset_masks(self)

Removes masks added by add_masks()

shares_segments_with(self, other)

Returns a list of GenomicSegment that are shared between self and other

sort(self)

unstranded_overlaps(self, other)

Return True if self and other share genomic positions on the same chromosome, regardless of their strands

add_masks(self, *mask_segments)

Adds one or more GenomicSegment to the collection of masks. Masks will be trimmed to the positions of the SegmentChain during addition.

Parameters
mask_segments : GenomicSegment

One or more segments, in genomic coordinates, covering positions to exclude from return values of get_masked_position_set(), get_masked_counts(), or get_masked_length()

add_segments(self, *segments)

Add 1 or more GenomicSegments to the SegmentChain. If there are already segments in the chain, the incoming segments must be on the same strand and chromosome as all others present.

Parameters
segments : GenomicSegment

One or more GenomicSegment to add to SegmentChain

antisense_overlaps(self, other)

Returns True if self and other share genomic positions on opposite strands

Parameters
other : SegmentChain or GenomicSegment

Query feature

Returns
bool

True if self and other share genomic positions on the same chromosome but opposite strand; False otherwise.

Raises
TypeError

if other is not a GenomicSegment or SegmentChain

as_bed(self, thickstart=None, thickend=None, as_int=True, color=None, extra_columns=None, empty_value='')

Format SegmentChain as a string of BED12[+X] output.

If the SegmentChain was imported as a BED file with extra columns, these will be output in the same order, after the BED columns.

Parameters
thickstart : int or None, optional

If not None, overrides the genome coordinate that starts thick plotting in genome browser found in self.attr[‘thickstart’]

thickend : int or None, optional

If not None, overrides the genome coordinate that stops thick plotting in genome browser found in self.attr[‘thickend’]

as_int : bool, optional

Force score to integer (Default: True)

color : str or None, optional

Color represented as RGB hex string. If not None, overrides the color in self.attr[‘color’]

extra_columns : None or list-like, optional

If None, and the SegmentChain was imported using the extra_columns keyword of from_bed(), the SegmentChain will be exported in BED 12+X format, in which extra columns are in the same order as they were upon import. If no extra columns were present, the SegmentChain will be exported as a BED12 line.

If a list of attribute names, these attributes will be exported as extra columns in order, overriding whatever happened upon import. If an attribute name is not in the attr dict of the SegmentChain, it will be exported with the value of empty_value.

If an empty list, no extra columns will be exported; the SegmentChain will be formatted as a BED12 line.

empty_value : str, optional

Value to export for extra_columns that are not defined (Default: “”)

Returns
str

Line of BED12[+X]-formatted text

Notes

BED12 columns are as follows:

Column    Contains
1         Contig or chromosome
2         Start of first block in feature (0-indexed)
3         End of last block in feature (half-open)
4         Feature name
5         Feature score
6         Strand
7         thickstart (in chromosomal coordinates)
8         thickend (in chromosomal coordinates)
9         Feature color as RGB tuple
10        Number of blocks in feature
11        Block lengths
12        Block starts, relative to start of first block

For more details, see the UCSC file format FAQ.
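To make the block columns (10 to 12) concrete, the sketch below derives all twelve BED fields from a list of segments. It is not plastid's as_bed(): the function name `bed12_fields` is hypothetical, and setting both thickstart and thickend to the chain start when no thick region is defined is an assumption taken from the as_bed() example output in the module summary.

```python
def bed12_fields(chrom, strand, segments, name, score=0, color="0,0,0"):
    """Build the 12 BED columns from sorted, half-open (start, end) segments."""
    chain_start = segments[0][0]
    chain_end = segments[-1][1]
    # Columns 11 and 12 are comma-terminated lists, one entry per block.
    block_sizes = "".join("%d," % (end - start) for start, end in segments)
    block_starts = "".join("%d," % (start - chain_start) for start, _ in segments)
    return [
        chrom, chain_start, chain_end, name, score, strand,
        chain_start, chain_start,  # thickstart, thickend (assumed: no thick region)
        color, len(segments), block_sizes, block_starts,
    ]


fields = bed12_fields("chrA", "-", [(5, 200), (250, 300)], "some_chain")
line = "\t".join(str(x) for x in fields)
print(line)
```

For the two-segment example chain this gives block lengths `195,50,` and block starts `0,245,`.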

as_gff3(self, unicode feature_type=None, bool escape=True, list excludes=None)

Format self as a line of GFF3 output.

Because GFF3 files permit many schemas of parent-child hierarchy, and in order to reduce confusion and overhead, attempts to export a multi-interval SegmentChain will raise an AttributeError.

Instead, users may export the individual features from which the multi-interval SegmentChain was constructed, or construct features for them, setting ID, Parent, and type attributes following their own conventions.

Parameters
feature_type : str

If not None, overrides the type attribute of self.attr

escape : bool, optional

Escape tokens in column 9 of GFF3 output (Default: True)

excludes : list, optional

List of attribute key names to exclude from column 9 (Default: [])

Returns
str

Line of GFF3-formatted text

Raises
AttributeError

if the SegmentChain has multiple intervals

Notes

Columns of GFF3 are as follows

Column    Contains
1         Contig or chromosome
2         Source of annotation
3         Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4         Start (1-indexed)
5         End (fully-closed)
6         Score
7         Strand
8         Frame. Number of bases within feature before first in-frame codon (if coding)
9         Attributes

For further information, see the GFF3 specification.

as_gtf(self, unicode feature_type=None, bool escape=True, list excludes=None)

Format SegmentChain as a block of GTF2 output.

The frame or phase attribute (GTF2 column 8) is valid only for ‘CDS’ features, and, if not present in self.attr, is calculated assuming the SegmentChain contains the entire coding region. If the SegmentChain contains multiple intervals, the frame or phase attribute will always be recalculated.

All attributes in self.attr, except those created upon import, will be propagated to all of the features that are generated.

Parameters
feature_type : str

If not None, overrides the “type” attribute of self.attr

escape : bool, optional

Escape tokens in column 9 of GTF output (Default: True)

excludes : list, optional

List of attribute key names to exclude from column 9 (Default: [])

Returns
str

Block of GTF2-formatted text

Notes

gene_id and transcript_id are required

The GTF2 specification requires that attributes gene_id and transcript_id be defined. If these are not present in self.attr, their values will be guessed following the rules in SegmentChain.get_gene() and SegmentChain.get_name(), respectively.

Beware of attribute loss

To save memory, only the attributes shared by all of the individual sub-features (e.g. exons) that were used to assemble this Transcript have been stored in self.attr. This means that upon re-export to GTF2, these sub-features will be lacking any attributes that were specific to them individually. Formally, this is compliant with the GTF2 specification, which states explicitly that only the attributes gene_id and transcript_id are supported.

Columns of GTF2 are as follows

Column    Contains
1         Contig or chromosome
2         Source of annotation
3         Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4         Start (1-indexed)
5         End (fully-closed)
6         Score
7         Strand
8         Frame. Number of bases within feature before first in-frame codon (if coding)
9         Attributes. “gene_id” and “transcript_id” are required
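Columns 4 and 5 use 1-indexed, fully-closed coordinates, whereas GenomicSegments are 0-indexed and half-open. A minimal sketch of the conversion, using hypothetical helper names that are not part of plastid:

```python
def to_gtf_coords(start, end):
    """Convert 0-indexed, half-open (start, end) to 1-indexed, fully-closed."""
    return start + 1, end


def from_gtf_coords(start, end):
    """Convert 1-indexed, fully-closed (start, end) to 0-indexed, half-open."""
    return start - 1, end


print(to_gtf_coords(5, 200))    # (6, 200)
print(to_gtf_coords(250, 300))  # (251, 300)
```

Only the start shifts: the last included base of a half-open interval is `end - 1` in 0-indexed terms, which is `end` in 1-indexed terms. This is why the segments 5-200 and 250-300 of the example chain appear as 6-200 and 251-300 in its GTF output.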

For more info, see the GTF2 specification.

as_psl(self)

Formats SegmentChain as PSL (blat) output.

Returns
str

PSL-representation of BLAT alignment

Raises
AttributeError

If any of the attributes listed in the Notes below is not defined

Notes

This will raise an AttributeError unless the following keys are present and defined in self.attr, corresponding to the columns of a PSL file:

Column    Key
1         match_length
2         mismatches
3         rep_matches
4         N
5         query_gap_count
6         query_gap_bases
7         target_gap_count
8         target_gap_bases
9         strand
10        query_name
11        query_length
12        query_start
13        query_end
14        target_name
15        target_length
16        target_start
17        target_end
19        q_starts : list of integers
20        l_starts : list of integers

These keys are defined only if the SegmentChain was created by SegmentChain.from_psl(), or if the user has defined them.

See the PSL spec for more information.

covers(self, other)

Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self. By convention, zero-length SegmentChains are not covered by other chains.

Parameters
other : SegmentChain or GenomicSegment

Query feature

Returns
bool

True if self and other share a chromosome and strand, and all genomic positions in other are present in self. Otherwise False

Raises
TypeError

if other is not a GenomicSegment or SegmentChain

static from_bed(unicode line, extra_columns=0)

Create a SegmentChain from a line from a BED file. The BED line may contain 4 to 12 columns, per the specification. These will be auto-detected and parsed appropriately.

See the UCSC file format FAQ for more details.

Parameters
line : str

Line from a BED file, containing 4 or more columns

extra_columns : int or list, optional

Extra, non-BED columns in Extended BED format file corresponding to feature attributes. This is common in ENCODE-specific BED variants.

If extra_columns is:

  • an int: it is taken to be the number of attribute columns. Attributes will be stored in the attr dictionary of the SegmentChain, under names like custom0, custom1, … , customN.

  • a list of str: it is taken to be the names of the attribute columns, in order, from left to right in the file. In this case, attributes in extra columns will be stored under their respective names in the attr dict.

  • a list of tuple: each tuple is taken to be a pair of (attribute_name, formatter_func). In this case, the value of attribute_name in the attr dict of the SegmentChain will be set to formatter_func(column_value).

(Default: 0)

Returns
SegmentChain
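The three accepted forms of extra_columns can be illustrated with a small stand-alone function. This is a sketch under stated assumptions, not the parser used by from_bed(): the name `parse_extra_columns` is hypothetical, and leaving values as strings when no formatter is given is an assumption.

```python
def parse_extra_columns(values, extra_columns=0):
    """Map the values of extra (non-BED) columns to attribute names."""
    names, formatters = [], []
    if isinstance(extra_columns, int):
        # An int gives the number of columns; names are generated.
        names = ["custom%d" % i for i in range(extra_columns)]
        formatters = [str] * extra_columns
    else:
        for item in extra_columns:
            if isinstance(item, tuple):
                name, func = item      # (attribute_name, formatter_func)
            else:
                name, func = item, str  # bare name: keep value as a string
            names.append(name)
            formatters.append(func)
    return {n: f(v) for n, f, v in zip(names, formatters, values)}


extra = ["7", "abc"]  # text of the columns following the standard BED columns
print(parse_extra_columns(extra, 2))
print(parse_extra_columns(extra, ["count", "label"]))
print(parse_extra_columns(extra, [("count", int), ("label", str)]))
```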
static from_psl(psl_line)

Create a SegmentChain from a line from a PSL (BLAT) file

See the PSL spec

Parameters
psl_line : str

Line from a PSL file

Returns
SegmentChain
static from_str(unicode inp)

Create a SegmentChain from a string formatted by SegmentChain.__str__():

chrom:start-end^start-end(strand)

where ‘^’ indicates a splice junction between regions specified by start and end and strand is ‘+’, ‘-’, or ‘.’. Coordinates are 0-indexed and half-open.

Parameters
inp : str

String formatted in manner of SegmentChain.__str__()

Returns
SegmentChain
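The string format described above is simple enough to parse with a regular expression. The following is a hedged sketch, not plastid's parser: `parse_chain_str` and `CHAIN_PATTERN` are hypothetical names, and the function returns plain tuples instead of a SegmentChain.

```python
import re

# chrom:start-end^start-end(strand); '^' separates spliced segments.
CHAIN_PATTERN = re.compile(
    r"^(?P<chrom>.+):(?P<spans>\d+-\d+(?:\^\d+-\d+)*)\((?P<strand>[+.\-])\)$"
)


def parse_chain_str(inp):
    """Split a chain string into (chrom, [(start, end), ...], strand)."""
    match = CHAIN_PATTERN.match(inp)
    if match is None:
        raise ValueError("Not a SegmentChain string: %r" % inp)
    segments = []
    for span in match.group("spans").split("^"):
        start, end = span.split("-")
        segments.append((int(start), int(end)))  # 0-indexed, half-open
    return match.group("chrom"), segments, match.group("strand")


print(parse_chain_str("chrA:5-200^250-300(-)"))
```

The greedy `.+` for the chromosome name lets names that themselves contain a colon parse correctly, because the match backtracks to the last colon that is followed by valid coordinates.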
get_antisense(self) → SegmentChain

Returns a SegmentChain antisense to self, with empty attr dict.

Returns
SegmentChain

SegmentChain antisense to self

get_counts(self, ga, stranded=True)

Return list of counts or values drawn from ga at each position in self

Parameters
ga

GenomeArray from which to fetch counts

stranded : bool, optional

If True and self is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)

Returns
numpy.ndarray

Array of counts from ga covering self
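The effect of the stranded flag can be shown without a real GenomeArray. In this sketch a plain dict mapping genomic coordinate to count stands in for ga, and `spliced_values` is a hypothetical helper; neither reflects how plastid stores or fetches data.

```python
def spliced_values(per_position, segments, strand, stranded=True):
    """Collect one value per chain position, ordered 5' to 3' if stranded."""
    values = [
        per_position.get(x, 0)       # positions without data count as 0
        for start, end in segments   # segments sorted left to right
        for x in range(start, end)
    ]
    if stranded and strand == "-":
        values.reverse()             # minus strand: 5' end is the rightmost base
    return values


counts = {10: 1, 11: 2, 20: 3}        # toy data: coordinate -> count
segments = [(10, 12), (20, 21)]
print(spliced_values(counts, segments, "-"))                  # [3, 2, 1]
print(spliced_values(counts, segments, "-", stranded=False))  # [1, 2, 3]
```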

get_fasta(self, genome, stranded=True)

Formats sequence of SegmentChain as FASTA output

Parameters
genome : dict or twobitreader.TwoBitFile

Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects

stranded : bool

If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)

Returns
str

FASTA-formatted sequence of SegmentChain extracted from genome

get_gene(self)

Return name of gene associated with SegmentChain, if any, by searching through self.attr for the keys gene_id and Parent. If one is not found, a generated gene name for the SegmentChain is made from get_name().

Returns
str

In order of preference: gene_id from self.attr, Parent from self.attr, or 'gene_%s' % self.get_name()

get_genomic_coordinate(self, x, stranded=True)

Finds genomic coordinate corresponding to position x in self

Parameters
x : int

position of interest, relative to SegmentChain

stranded : bool, optional

If True, x is assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, x is assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)

Returns
str

Chromosome name

long

Genomic coordinate corresponding to position x

str

Chromosome strand (‘+’, ‘-’, or ‘.’)

Raises
IndexError

if x is outside the bounds of the SegmentChain

get_junctions(self)

Returns a list of GenomicSegments representing spaces between the GenomicSegments in self. In the case of a transcript, these would represent introns. In the case of an alignment, these would represent gaps in the query compared to the reference.

Returns
list

List of GenomicSegments covering spaces between the intervals in self (e.g. introns in the case of a transcript, or gaps in the case of an alignment)
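Conceptually, each junction runs from the end of one segment to the start of the next. A minimal sketch on `(start, end)` tuples, assuming the segments are already sorted and non-overlapping; the function name `junctions` is illustrative and this is not the get_junctions() implementation:

```python
def junctions(segments):
    """Return the gaps between consecutive half-open (start, end) segments."""
    return [
        (left_end, right_start)
        for (_, left_end), (right_start, _) in zip(segments, segments[1:])
        if right_start > left_end  # skip segments that directly abut
    ]


print(junctions([(5, 200), (250, 300)]))  # [(200, 250)]
print(junctions([(5, 200)]))              # []
```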

get_masked_counts(self, ga, stranded=True, copy=False)

Return counts covering self in dataset ga as a masked array, in transcript coordinates. Positions masked by SegmentChain.add_masks() will be masked in the array.

Parameters
ga : non-abstract subclass of AbstractGenomeArray

GenomeArray from which to fetch counts

stranded : bool, optional

If True and the SegmentChain is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)

copy : bool, optional

If False (default), returns a view of the data, so changing values in the view changes the values in the GenomeArray if it is mutable. If True, a copy is returned instead.

Returns
numpy.ma.masked_array

get_masked_position_set(self) → set

Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using SegmentChain.add_masks()

Returns
set

Set of genomic coordinates, as integers

get_masks(self)

Return masked positions as a list of GenomicSegments

Returns
list

list of GenomicSegments representing masked positions

get_masks_as_segmentchain(self)

Return masked positions as a SegmentChain

Returns
SegmentChain

Masked positions

get_name(self)

Returns the name of this SegmentChain, first searching through self.attr for the keys ID, Name, and name. If no value is found for any of those keys, a name is generated using SegmentChain.__str__()

Returns
str

In order of preference, ID from self.attr, Name from self.attr, name from self.attr or str(self)
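The fallback order documented for get_name() and get_gene() can be mirrored with two small functions operating on a plain attribute dict. This is an illustrative sketch, not the methods themselves: the stand-alone signatures and the `default` argument (standing in for str(self)) are assumptions.

```python
def get_name(attr, default):
    """Return ID, Name, or name from attr, in that order, else the default."""
    for key in ("ID", "Name", "name"):
        if key in attr:
            return attr[key]
    return default  # stands in for str(self) in the documented behaviour


def get_gene(attr, default_name):
    """Return gene_id or Parent from attr, else a name derived from get_name()."""
    for key in ("gene_id", "Parent"):
        if key in attr:
            return attr[key]
    return "gene_%s" % get_name(attr, default_name)


chain_str = "chrA:5-200^250-300(-)"
print(get_name({"ID": "some_chain"}, chain_str))  # some_chain
print(get_gene({"ID": "some_chain"}, chain_str))  # gene_some_chain
```

The second result matches the gene_id written in the as_gtf() example in the module summary, where the chain defines only ID.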

get_position_list(self)

Retrieve a sorted end-inclusive numpy array of genomic coordinates in this SegmentChain

Returns
list

Genomic coordinates in self, as integers, in genomic order

get_position_set(self)

Retrieve an end-inclusive set of genomic coordinates included in this SegmentChain

Returns
set

Set of genomic coordinates, as integers
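Position sets and masks combine naturally as set operations. The sketch below assumes each segment is half-open, so a segment contributes positions start through end - 1; the function names are hypothetical and the code is independent of plastid.

```python
def position_set(segments):
    """Set of genomic coordinates covered by half-open (start, end) segments."""
    return {x for start, end in segments for x in range(start, end)}


def masked_position_set(segments, masks):
    """Positions of the chain that are NOT covered by any mask."""
    # Set difference ignores mask positions outside the chain, which has the
    # same effect as trimming masks to the chain before applying them.
    return position_set(segments) - position_set(masks)


chain = [(5, 8), (10, 12)]
masks = [(0, 6), (11, 20)]
print(sorted(position_set(chain)))                # [5, 6, 7, 10, 11]
print(sorted(masked_position_set(chain, masks)))  # [6, 7, 10]
```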

get_segmentchain_coordinate(self, unicode chrom, long genomic_x, unicode strand, bool stranded=True)

Finds the SegmentChain coordinate corresponding to a genomic position

Parameters
chrom : str

Chromosome name

genomic_x : int

coordinate, in genomic space

strand : str

Chromosome strand (‘+’, ‘-’, or ‘.’)

stranded : bool, optional

If True, coordinates are given in stranded space (i.e. from 5’ end of chain, as one might expect for a transcript). If False, coordinates are given from the left end of self, regardless of strand. (Default: True)

Returns
int

Position in SegmentChain

Raises
KeyError

if position outside bounds of SegmentChain

get_sequence(self, genome, stranded=True)

Return spliced genomic sequence of SegmentChain as a string

Parameters
genome : dict or twobitreader.TwoBitFile

Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects

stranded : bool

If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)

Returns
str

Nucleotide sequence of the SegmentChain extracted from genome
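Splicing and reverse-complementing can be sketched with a plain dict of strings as the genome. The helper `spliced_sequence` is hypothetical, and the complement table covers only A, C, G, and T in either case; other characters pass through unchanged, which is a simplification.

```python
# Translation table for complementing unambiguous bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def spliced_sequence(genome, chrom, segments, strand, stranded=True):
    """Join segment sequences left to right; reverse-complement if on '-'."""
    seq = "".join(genome[chrom][start:end] for start, end in segments)
    if stranded and strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


genome = {"chrA": "AACCGGTTAC"}  # toy sequence
segments = [(0, 3), (6, 9)]       # covers "AAC" and "TTA"
print(spliced_sequence(genome, "chrA", segments, "+"))  # AACTTA
print(spliced_sequence(genome, "chrA", segments, "-"))  # TAAGTT
```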

get_subchain(self, long start, long end, bool stranded=True, **extra_attr)

Retrieves a sub-SegmentChain corresponding to a range of positions specified in coordinates relative to this SegmentChain. Attributes in self.attr are copied to the child SegmentChain, with the exception of ID, to which the suffix ‘subchain’ is appended.

Parameters
start : int

position of interest in SegmentChain coordinates, 0-indexed

end : int

position of interest in SegmentChain coordinates, 0-indexed and half-open

stranded : bool, optional

If True, start and end are assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, they are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)

extra_attr : keyword arguments

Values that will be included in the subchain’s attr dict. These can be used to overwrite values already present.

Returns
SegmentChain

covering parent chain positions start to end of self

Raises
IndexError

if start or end is outside the bounds of the SegmentChain

TypeError

if start or end is None
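One way to picture what a subchain is: select the requested chain positions, then regroup the corresponding genomic coordinates into contiguous runs. The sketch below does this with plain tuples. `subchain_segments` is a hypothetical helper, and plastid's own method may be implemented quite differently.

```python
def subchain_segments(segments, strand, start, end, stranded=True):
    """Segments covering chain positions start..end (half-open)."""
    pos = [x for s, e in segments for x in range(s, e)]
    if stranded and strand == "-":
        pos.reverse()  # chain position 0 is the 5' end
    if not (0 <= start <= end <= len(pos)):
        raise IndexError("start and end must lie within the chain")
    runs = []
    for x in sorted(pos[start:end]):
        if runs and runs[-1][1] == x:
            runs[-1][1] = x + 1      # extend the current contiguous run
        else:
            runs.append([x, x + 1])  # begin a new run after a gap
    return [tuple(run) for run in runs]


print(subchain_segments([(5, 200), (250, 300)], "-", 45, 70))
# [(180, 200), (250, 255)]
```

The result agrees with the segments of the subchain shown in the module summary for positions 45-70 of the example chain.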

get_unstranded(self) → SegmentChain

Returns an unstranded (strand ‘.’) copy of self, with empty attr dict.

Returns
SegmentChain

Unstranded copy of self

next(self)

Return next GenomicSegment in the SegmentChain, from left to right on the chromosome

overlaps(self, other)

Return True if self and other share genomic positions on the same strand

Parameters
other : SegmentChain or GenomicSegment

Query feature

Returns
bool

True if self and other share genomic positions on the same chromosome and strand; False otherwise.

Raises
TypeError

if other is not a GenomicSegment or SegmentChain

reset_masks(self)

Removes masks added by add_masks()

shares_segments_with(self, other)

Returns a list of GenomicSegment that are shared between self and other

Parameters
other : SegmentChain or GenomicSegment

Query feature

Returns
list

List of GenomicSegments common to self and other

Raises
TypeError

if other is not a GenomicSegment or SegmentChain

sort(self)

unstranded_overlaps(self, other)

Return True if self and other share genomic positions on the same chromosome, regardless of their strands

Parameters
other : SegmentChain or GenomicSegment

Query feature

Returns
bool

True if self and other share genomic positions on the same chromosome, False otherwise. Strands of self and other need not match

Raises
TypeError

if other is not a GenomicSegment or SegmentChain
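The three overlap tests (overlaps, antisense_overlaps, and unstranded_overlaps) differ only in how they treat strand. The sketch below models chains as dicts and is not plastid code. How plastid handles unstranded (‘.’) chains in antisense_overlaps() is not stated here, so this sketch simply returns False for them.

```python
OPPOSITE = {"+": "-", "-": "+"}


def _shared(a, b):
    """True if a and b are on the same chromosome and share a position."""
    if a["chrom"] != b["chrom"]:
        return False
    pos_a = {x for s, e in a["segments"] for x in range(s, e)}
    pos_b = {x for s, e in b["segments"] for x in range(s, e)}
    return bool(pos_a & pos_b)


def overlaps(a, b):
    return a["strand"] == b["strand"] and _shared(a, b)


def antisense_overlaps(a, b):
    # Assumption: unstranded chains never count as antisense to anything.
    return OPPOSITE.get(a["strand"]) == b["strand"] and _shared(a, b)


def unstranded_overlaps(a, b):
    return _shared(a, b)


my_chain = {"chrom": "chrA", "strand": "-", "segments": [(5, 200), (250, 300)]}
other = {"chrom": "chrA", "strand": "-", "segments": [(200, 300), (800, 900)]}
flipped = dict(other, strand="+")

print(overlaps(my_chain, other), antisense_overlaps(my_chain, other))      # True False
print(overlaps(my_chain, flipped), antisense_overlaps(my_chain, flipped))  # False True
print(unstranded_overlaps(my_chain, flipped))                              # True
```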

attr

Dictionary of arbitrary attributes of the SegmentChain

c_strand

chrom

Chromosome the SegmentChain resides on

length

mask_segments

Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.

masked_length

segments

Copy of list of GenomicSegments that comprise self. Changing this list will do nothing to self.

spanning_segment

strand

Strand of the SegmentChain

class plastid.genomics.roitools.Transcript(*segments, **attributes)

Bases: plastid.genomics.roitools.SegmentChain

Subclass of SegmentChain specifically for RNA transcripts. In addition to coordinate-conversion, count fetching, sequence fetching, and various other methods inherited from SegmentChain, Transcript provides convenience methods for fetching sub-chains corresponding to CDS features, 5’ UTRs, and 3’ UTRs.

Parameters
*segments : GenomicSegment

0 or more GenomicSegments on the same strand

**attr : keyword arguments

Arbitrary attributes, including, for example:

Attribute           Description
cds_genome_start    Location of CDS start, in genomic coordinates
cds_genome_end      Location of CDS end, in genomic coordinates
ID                  A unique ID for the SegmentChain.
transcript_id       A transcript ID used for GTF2 export
gene_id             A gene ID used for GTF2 export

Attributes
cds_genome_start : int or None

Starting coordinate of coding region, relative to genome (i.e. leftmost; is start codon for forward-strand features, stop codon for reverse-strand features).

cds_genome_end : int or None

Ending coordinate of coding region, relative to genome (i.e. rightmost; is stop codon for forward-strand features, start codon for reverse-strand features).

cds_start : int or None

Start of coding region relative to 5’ end of transcript, in direction of transcript.

cds_end : int or None

End of coding region relative to 5’ end of transcript, in direction of transcript.

spanning_segment : GenomicSegment

A GenomicSegment spanning the endpoints of the Transcript

strand : str

Strand of the SegmentChain

chrom : str

Chromosome the SegmentChain resides on

segments : list

Copy of list of GenomicSegments that comprise self.

mask_segments : list

Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.

attr : dict

attr: dict

Methods

add_masks(self, *mask_segments)

Adds one or more GenomicSegment to the collection of masks.

add_segments(self, *segments)

Add 1 or more GenomicSegments to the SegmentChain.

antisense_overlaps(self, other)

Returns True if self and other share genomic positions on opposite strands

as_bed(self[, as_int, color, extra_columns, ...])

Format self as a BED12[+X] line, assigning CDS boundaries to the thickstart and thickend columns from self.attr

as_gff3(self, bool escape=True, ...)

Format a Transcript as a block of GFF3 output, following the schema set out in the Sequence Ontology (SO) v2.53

as_gtf(self, unicode feature_type=u, ...)

Format self as a GTF2 block.

as_psl(self)

Formats SegmentChain as PSL (blat) output.

covers(self, other)

Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self.

from_bed(unicode line[, extra_columns])

Create a Transcript from a BED line with 4 or more columns.

from_psl(unicode psl_line)

from_str(unicode inp)

Create a SegmentChain from a string formatted by SegmentChain.__str__():

get_antisense(self)

Returns a SegmentChain antisense to self, with empty attr dict.

get_cds(self, **extra_attr)

Retrieve SegmentChain covering the coding region of self, including the stop codon.

get_counts(self, ga[, stranded])

Return list of counts or values drawn from ga at each position in self

get_fasta(self, genome[, stranded])

Formats sequence of SegmentChain as FASTA output

get_gene(self)

Return name of gene associated with SegmentChain, if any, by searching through self.attr for the keys gene_id and Parent.

get_genomic_coordinate(self, x[, stranded])

Finds genomic coordinate corresponding to position x in self

get_junctions(self)

Returns a list of GenomicSegments representing spaces between the GenomicSegments in self. In the case of a transcript, these would represent introns.

get_masked_counts(self, ga[, stranded, copy])

Return counts covering self in dataset ga as a masked array, in transcript coordinates.

get_masked_position_set(self)

Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using SegmentChain.add_masks()

get_masks(self)

Return masked positions as a list of GenomicSegments

get_masks_as_segmentchain(self)

Return masked positions as a SegmentChain

get_name(self)

Return the name of self, first searching through self.attr for the keys transcript_id, ID, Name, and name.

get_position_list(self)

Retrieve a sorted end-inclusive numpy array of genomic coordinates in this SegmentChain

get_position_set(self)

Retrieve an end-inclusive set of genomic coordinates included in this SegmentChain

get_segmentchain_coordinate(self, ...)

Finds the SegmentChain coordinate corresponding to a genomic position

get_sequence(self, genome[, stranded])

Return spliced genomic sequence of SegmentChain as a string

get_subchain(self, long start, long end, ...)

Retrieves a sub-SegmentChain corresponding to a range of positions specified in coordinates relative to this SegmentChain.

get_unstranded(self)

Returns an unstranded SegmentChain (strand ‘.’) covering the same positions as self, with empty attr dict.

get_utr3(self, **extra_attr)

Retrieve sub-SegmentChain covering 3'UTR of self, excluding the stop codon.

get_utr5(self, **extra_attr)

Retrieve sub-SegmentChain covering 5'UTR of self.

next(self)

Return next GenomicSegment in the SegmentChain, from left to right on the chromosome

overlaps(self, other)

Return True if self and other share genomic positions on the same strand

reset_masks(self)

Removes masks added by add_masks()

shares_segments_with(self, other)

Returns a list of GenomicSegments that are shared between self and other

sort(self)

unstranded_overlaps(self, other)

Return True if self and other share genomic positions on the same chromosome, regardless of their strands

add_masks(self, *mask_segments)

Adds one or more GenomicSegments to the collection of masks. Masks will be trimmed to the positions of the SegmentChain during addition.

Parameters
mask_segments : GenomicSegment

One or more segments, in genomic coordinates, covering positions to exclude from return values of get_masked_position_set() and get_masked_counts(), and from masked_length

add_segments(self, *segments)

Add 1 or more GenomicSegments to the SegmentChain. If there are already segments in the chain, the incoming segments must be on the same strand and chromosome as all others present.

Parameters
segments : GenomicSegment

One or more GenomicSegments to add to the SegmentChain

antisense_overlaps(self, other)

Returns True if self and other share genomic positions on opposite strands

Parameters
other : SegmentChain or GenomicSegment

Query feature

Returns
bool

True if self and other share genomic positions on the same chromosome but opposite strand; False otherwise.

Raises
TypeError

if other is not a GenomicSegment or SegmentChain

as_bed(self, as_int=True, color=None, extra_columns=None, empty_value='')

Format self as a BED12[+X] line, assigning CDS boundaries to the thickstart and thickend columns from self.attr

If the SegmentChain was imported as a BED file with extra columns, these will be output in the same order, after the BED columns.

Parameters
as_int : bool, optional

Force “score” to integer (Default: True)

color : str or None, optional

Color represented as RGB hex string. If not None, overrides the color in self.attr[“color”]

extra_columns : None or list-like, optional

If None, and the SegmentChain was imported using the extra_columns keyword of from_bed(), the SegmentChain will be exported in BED 12+X format, in which extra columns are in the same order as they were upon import. If no extra columns were present, the SegmentChain will be exported as a BED12 line.

If a list of attribute names, these attributes will be exported as extra columns in order, overriding whatever happened upon import. If an attribute name is not in the attr dict of the SegmentChain, it will be exported with the value of empty_value.

If an empty list, no extra columns will be exported; the SegmentChain will be formatted as a BED12 line.

empty_value : str, optional

Value exported for any requested attribute that is missing from the attr dict (Default: '')

Returns
str

Line of BED12-formatted text

Notes

BED12 columns are as follows:

Column    Contains
0         Contig or chromosome
1         Start of first block in feature (0-indexed)
2         End of last block in feature (half-open)
3         Feature name
4         Feature score
5         Strand
6         thickstart
7         thickend
8         Feature color as RGB tuple
9         Number of blocks in feature
10        Block lengths
11        Block starts, relative to start of first block

For more information

See the UCSC file format FAQ
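The column layout above can be illustrated with a minimal, pure-Python sketch that does not use plastid itself. The function name bed12_line, the tuple representation of blocks, and the defaults for thickstart, thickend, and color are assumptions made for illustration; as_bed() may choose different defaults.

```python
def bed12_line(chrom, segments, strand, name, score=0,
               thickstart=None, thickend=None, color="0,0,0"):
    """Build one BED12 line from sorted, half-open (start, end) blocks.

    Hypothetical helper: tuples stand in for GenomicSegments.
    """
    chain_start, chain_end = segments[0][0], segments[-1][1]
    # Assumption: with no CDS, thickstart and thickend both equal the start
    thickstart = chain_start if thickstart is None else thickstart
    thickend = chain_start if thickend is None else thickend
    # Columns 10 and 11 are comma-separated lists, one entry per block
    block_sizes = ",".join(str(end - start) for start, end in segments) + ","
    block_starts = ",".join(str(start - chain_start) for start, _ in segments) + ","
    columns = [chrom, chain_start, chain_end, name, score, strand,
               thickstart, thickend, color, len(segments),
               block_sizes, block_starts]
    return "\t".join(str(column) for column in columns)


line = bed12_line("chrA", [(5, 200), (250, 300)], "-", "some_chain")
print(line.split("\t"))
```

Block starts are written relative to the start of the first block, which is why the second block of the example is listed at 245 rather than 250.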

as_gff3(self, bool escape=True, list excludes=None, unicode rna_type=u'mRNA')

Format a Transcript as a block of GFF3 output, following the schema set out in the Sequence Ontology (SO) v2.53

The Transcript will be formatted according to the following rules:

  1. A feature of type rna_type will be created, with Parent attribute set to the value of self.get_gene(), and ID attribute set to self.get_name()

  2. For each GenomicSegment in self, a child feature of type exon will be created. The Parent attribute of these features will be set to the value of self.get_name(). These will have unique IDs generated from self.get_name().

  3. If self is coding (i.e. has non-None values for self.cds_genome_start and self.cds_genome_end), child features of type ‘five_prime_UTR’, ‘CDS’, and ‘three_prime_UTR’ will be created, with Parent attributes set to self.get_name(). These will have unique IDs generated from self.get_name().

Parameters
escape : bool, optional

Escape tokens in column 9 of GFF3 output (Default: True)

excludes : list, optional

List of attribute key names to exclude from column 9 (Default: [])

rna_type : str, optional

Feature type to export RNA as (e.g. ‘tRNA’, ‘noncoding_RNA’, etc. Default: ‘mRNA’)

Returns
str

Multiline block of GFF3-formatted text

Notes

Beware of attribute loss

This Transcript was assembled from multiple individual component features (e.g. single exons), which may or may not have had their own unique attributes in their original annotation. To reduce overhead, these individual attributes (if they were present) have not been (entirely) stored, and consequently will not (all) be exported. If this poses problems, consider instead importing, modifying, and exporting the component features

GFF3 schemas vary

Different GFF3s have different schemas (parent-child relationships between features). Here we adopt the commonly-used schema set by Sequence Ontology (SO) v2.53, which may or may not match your schema.

Columns of GFF3 are as follows:

Column    Contains
1         Contig or chromosome
2         Source of annotation
3         Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4         Start (1-indexed)
5         End (fully-closed)
6         Score
7         Strand
8         Frame. Number of bases within feature before first in-frame codon (if coding)
9         Attributes

For further information, see
as_gtf(self, unicode feature_type=u'exon', bool escape=True, list excludes=None)

Format self as a GTF2 block. GenomicSegments are formatted as GTF2 ‘exon’ features. Coding regions, if present, are formatted as GTF2 ‘CDS’ features. Stop codons are excluded from the ‘CDS’ features, per the GTF2 specification, and exported separately.

All attributes from self.attr are propagated to the exon and CDS features that are generated.

Parameters
feature_type : str

If not None, overrides the ‘type’ attribute of self.attr

escape : bool, optional

URL escape tokens in column 9 of GTF2 output (Default: True)

Returns
str

Block of GTF2-formatted text

Notes

gene_id and transcript_id are required

The GTF2 specification requires that attributes gene_id and transcript_id be defined. If these are not present in self.attr, their values will be guessed following the rules in SegmentChain.get_gene() and SegmentChain.get_name(), respectively.

Beware of attribute loss

To save memory, only the attributes shared by all of the individual sub-features (e.g. exons) that were used to assemble this Transcript have been stored in self.attr. This means that upon re-export to GTF2, these sub-features will be lacking any attributes that were specific to them individually. Formally, this is compliant with the GTF2 specification, which states explicitly that only the attributes gene_id and transcript_id are supported.

Columns of GTF2 are as follows:

Column    Contains
1         Contig or chromosome
2         Source of annotation
3         Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4         Start (1-indexed)
5         End (fully-closed)
6         Score
7         Strand
8         Frame. Number of bases within feature before first in-frame codon (if coding)
9         Attributes. “gene_id” and “transcript_id” are required

For more info
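GTF2 and GFF3 coordinates are 1-indexed and fully-closed, whereas the string representation of a SegmentChain uses 0-indexed, half-open coordinates. A minimal sketch of the conversion follows; both function names are hypothetical and are not part of plastid.

```python
def to_one_based_closed(start, end):
    """Convert a 0-indexed, half-open interval to 1-indexed, fully-closed."""
    # Only the start shifts: the half-open end already equals the closed end
    return start + 1, end


def to_zero_based_half_open(start, end):
    """Convert a 1-indexed, fully-closed interval to 0-indexed, half-open."""
    return start - 1, end


print(to_one_based_closed(5, 200))
```

The length of the interval is end - start in the half-open convention and end - start + 1 in the closed convention, so both forms describe the same positions.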
as_psl(self)

Formats SegmentChain as PSL (blat) output.

Returns
str

PSL-representation of BLAT alignment

Raises
AttributeError

If not all of the attributes listed in the Notes below are defined

Notes

This will raise an AttributeError unless the following keys are present and defined in self.attr, corresponding to the columns of a PSL file:

Column    Key
1         match_length
2         mismatches
3         rep_matches
4         N
5         query_gap_count
6         query_gap_bases
7         target_gap_count
8         target_gap_bases
9         strand
10        query_name
11        query_length
12        query_start
13        query_end
14        target_name
15        target_length
16        target_start
17        target_end
19        q_starts : list of integers
20        l_starts : list of integers

These keys are defined only if the SegmentChain was created by SegmentChain.from_psl(), or if the user has defined them.

See the PSL spec for more information.

covers(self, other)

Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self. By convention, zero-length SegmentChains are not covered by other chains.

Parameters
other : SegmentChain or GenomicSegment

Query feature

Returns
bool

True if self and other share a chromosome and strand, and all genomic positions in other are present in self. Otherwise False

Raises
TypeError

if other is not a GenomicSegment or SegmentChain
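The relationship between covers(), overlaps(), antisense_overlaps(), and unstranded_overlaps() can be sketched with plain position sets. This is an illustration of the definitions given in this section, not plastid's implementation: chains are represented here as hypothetical dicts with chrom, strand, and half-open (start, end) segments, and the TypeError checks are omitted.

```python
def _positions(segments):
    """Expand half-open (start, end) tuples into a set of genomic positions."""
    return {pos for start, end in segments for pos in range(start, end)}


def unstranded_overlaps(a, b):
    """Shared positions on the same chromosome, ignoring strand."""
    return a["chrom"] == b["chrom"] and bool(
        _positions(a["segments"]) & _positions(b["segments"]))


def overlaps(a, b):
    """Shared positions on the same chromosome and strand."""
    return a["strand"] == b["strand"] and unstranded_overlaps(a, b)


def antisense_overlaps(a, b):
    """Shared positions on the same chromosome but opposite strands."""
    return {a["strand"], b["strand"]} == {"+", "-"} and unstranded_overlaps(a, b)


def covers(a, b):
    """Every position of b is in a; zero-length chains are never covered."""
    b_positions = _positions(b["segments"])
    return (bool(b_positions) and overlaps(a, b)
            and b_positions <= _positions(a["segments"]))


plus = {"chrom": "chrA", "strand": "+", "segments": [(5, 200), (250, 300)]}
minus = {"chrom": "chrA", "strand": "-", "segments": [(190, 260)]}
inner = {"chrom": "chrA", "strand": "+", "segments": [(10, 20)]}
print(overlaps(plus, minus), antisense_overlaps(plus, minus), covers(plus, inner))
```

Note that covers() is not symmetric: in the example, plus covers inner, but inner does not cover plus.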

static from_bed(unicode line, extra_columns=0)

Create a Transcript from a BED line with 4 or more columns. thickstart and thickend columns, if present, are assumed to specify CDS boundaries, a convention that, while common, is formally outside the BED specification.

See the UCSC file format faq for more details.

Parameters
line : str

Line from a BED file with at least 4 columns

extra_columns : int or list, optional

Extra, non-BED columns in BED file corresponding to feature attributes. This is common in ENCODE-specific BED variants.

If extra_columns is:

  • an int, it is taken to be the number of attribute columns. Attributes will be stored in the attr dictionary of the SegmentChain, under names like custom0, custom1, … , customN.

  • a list of str, it is taken to be the names of the attribute columns, in order, from left to right in the file. In this case, attributes in extra columns will be stored under their respective names in the attr dict.

  • a list of tuple, each tuple is taken to be a pair of (attribute_name, formatter_func). In this case, the value of attribute_name in the attr dict of the SegmentChain will be set to formatter_func(column_value).

(Default: 0)
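The three accepted forms of extra_columns can be sketched as follows. This is a standalone illustration of the rules listed above, not plastid's parsing code; parse_extra_columns is a hypothetical name, and values stands for the raw strings found in the non-BED columns of one line.

```python
def parse_extra_columns(values, extra_columns=0):
    """Turn the non-BED column values of one line into an attribute dict."""
    if isinstance(extra_columns, int):
        # An int gives only a count: names are generated as custom0, custom1, ...
        spec = [("custom%d" % i, str) for i in range(extra_columns)]
    else:
        # Bare names keep the raw string; (name, func) pairs convert it
        spec = [(item, str) if isinstance(item, str) else tuple(item)
                for item in extra_columns]
    return {name: func(value) for (name, func), value in zip(spec, values)}


print(parse_extra_columns(["3.5", "x"], 2))
print(parse_extra_columns(["3.5", "x"], [("signal", float), ("label", str)]))
```

Passing formatter functions is the only form in which values are converted; in the other two forms they remain strings in this sketch.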

Returns
Transcript
static from_psl(unicode psl_line)
static from_str(unicode inp)

Create a SegmentChain from a string formatted by SegmentChain.__str__():

chrom:start-end^start-end(strand)

where ‘^’ indicates a splice junction between regions specified by start and end and strand is ‘+’, ‘-’, or ‘.’. Coordinates are 0-indexed and half-open.

Parameters
inp : str

String formatted in manner of SegmentChain.__str__()

Returns
SegmentChain
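As an illustration of the string format described above, the following pure-Python sketch parses chrom:start-end^start-end(strand) into its parts. It does not use plastid; the regular expression and the name parse_chain_string are assumptions, and the return value is a plain tuple rather than a SegmentChain.

```python
import re

# Assumed pattern for "chrom:start-end^start-end(strand)"
_CHAIN_PATTERN = re.compile(
    r"^(?P<chrom>.+):(?P<spans>[0-9^-]+)\((?P<strand>[+.-])\)$")


def parse_chain_string(text):
    """Split a chain string into (chrom, [(start, end), ...], strand)."""
    match = _CHAIN_PATTERN.match(text)
    if match is None:
        raise ValueError("not a chain string: %r" % text)
    segments = []
    # '^' separates the spliced regions; each region is 'start-end'
    for span in match.group("spans").split("^"):
        start, end = span.split("-")
        segments.append((int(start), int(end)))
    return match.group("chrom"), segments, match.group("strand")


print(parse_chain_string("chrA:5-200^250-300(-)"))
```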
get_antisense(self) → SegmentChain

Returns a SegmentChain antisense to self, with empty attr dict.

Returns
SegmentChain

SegmentChain antisense to self

get_cds(self, **extra_attr)

Retrieve SegmentChain covering the coding region of self, including the stop codon. If no coding region is present, returns an empty SegmentChain.

The following attributes are passed from self.attr to the new SegmentChain

  1. transcript_id, taken from SegmentChain.get_name()

  2. gene_id, taken from SegmentChain.get_gene()

  3. ID, generated as “%s_CDS” % self.get_name()

Parameters
extra_attr : keyword arguments

Values that will be included in the CDS subchain’s attr dict. These can be used to overwrite values already present.

Returns
SegmentChain

CDS region of self if present, otherwise empty SegmentChain

get_counts(self, ga, stranded=True)

Return list of counts or values drawn from ga at each position in self

Parameters
ga : GenomeArray

GenomeArray from which to fetch counts

stranded : bool, optional

If True and self is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)

Returns
numpy.ndarray

Array of counts from ga covering self

get_fasta(self, genome, stranded=True)

Formats sequence of SegmentChain as FASTA output

Parameters
genome : dict or twobitreader.TwoBitFile

Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects

stranded : bool

If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)

Returns
str

FASTA-formatted sequence of SegmentChain extracted from genome

get_gene(self)

Return name of gene associated with SegmentChain, if any, by searching through self.attr for the keys gene_id and Parent. If one is not found, a generated gene name for the SegmentChain is made from get_name().

Returns
str

Returns, in order of preference, gene_id from self.attr, Parent from self.attr, or 'gene_%s' % self.get_name()

get_genomic_coordinate(self, x, stranded=True)

Finds genomic coordinate corresponding to position x in self

Parameters
x : int

position of interest, relative to SegmentChain

stranded : bool, optional

If True, x is assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, coordinates are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)

Returns
str

Chromosome name

long

Genomic coordinate corresponding to position x

str

Chromosome strand (‘+’, ‘-’, or ‘.’)

Raises
IndexError

if x is outside the bounds of the SegmentChain
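The effect of the stranded parameter can be sketched in pure Python. This is an illustration under stated assumptions, not plastid's implementation: chain_to_genomic is a hypothetical name, and sorted half-open (start, end) tuples stand in for GenomicSegments. The example reuses the two-segment minus-strand chain from the module examples.

```python
def chain_to_genomic(segments, strand, x, stranded=True):
    """Map chain position x to a genomic coordinate.

    segments: sorted, half-open (start, end) tuples.
    """
    length = sum(end - start for start, end in segments)
    if not 0 <= x < length:
        raise IndexError("position %d is outside the chain" % x)
    if stranded and strand == "-":
        # The 5' end of a minus-strand chain is its right end on the genome
        x = length - 1 - x
    # Walk left to right, skipping the gaps between segments
    for start, end in segments:
        if x < end - start:
            return start + x
        x -= end - start


segments = [(5, 200), (250, 300)]
print(chain_to_genomic(segments, "-", 50))  # 199
print(chain_to_genomic(segments, "-", 49))  # 250, on the far side of the gap
```

With stranded=False the same chain is counted from its left end, so position 0 maps to genomic coordinate 5 regardless of strand.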

get_junctions(self)

Returns a list of GenomicSegments representing spaces between the GenomicSegments in self. In the case of a transcript, these would represent introns. In the case of an alignment, these would represent gaps in the query compared to the reference.

Returns
list

List of GenomicSegments covering spaces between the intervals in self (e.g. introns in the case of a transcript, or gaps in the case of an alignment)
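A minimal sketch of the idea, using half-open (start, end) tuples in place of GenomicSegments; the function name junctions is hypothetical and the code is not taken from plastid.

```python
def junctions(segments):
    """Return the half-open gaps between consecutive (start, end) segments."""
    ordered = sorted(segments)
    return [(left_end, right_start)
            for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
            if left_end < right_start]  # skip segments that directly abut


print(junctions([(5, 200), (250, 300)]))
```

For the two-exon chain in the example, the single junction spans the positions between the end of the first segment and the start of the second.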

get_masked_counts(self, ga, stranded=True, copy=False)

Return counts covering self in dataset ga as a masked array, in transcript coordinates. Positions masked by SegmentChain.add_masks() will be masked in the array

Parameters
ga : non-abstract subclass of AbstractGenomeArray

GenomeArray from which to fetch counts

stranded : bool, optional

If True and the SegmentChain is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)

copy : bool, optional

If False (default), returns a view of the data, so changing values in the view changes the values in the GenomeArray if it is mutable. If True, a copy is returned instead.

Returns
numpy.ma.masked_array
get_masked_position_set(self) → set

Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using SegmentChain.add_masks()

Returns
set

Set of genomic coordinates, as integers
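In set terms, the result is the positions of the chain minus the positions of its masks. The sketch below shows this with half-open (start, end) tuples; unmasked_positions is a hypothetical name, and the code is an illustration rather than plastid's implementation.

```python
def unmasked_positions(segments, masks):
    """Positions covered by segments but not by any mask (half-open tuples)."""
    chain = {pos for start, end in segments for pos in range(start, end)}
    masked = {pos for start, end in masks for pos in range(start, end)}
    # Mask positions outside the chain have no effect on the result
    return chain - masked


print(sorted(unmasked_positions([(5, 10)], [(7, 20)])))
```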

get_masks(self)

Return masked positions as a list of GenomicSegments

Returns
list

list of GenomicSegments representing masked positions

get_masks_as_segmentchain(self)

Return masked positions as a SegmentChain

Returns
SegmentChain

Masked positions

get_name(self)

Return the name of self, first searching through self.attr for the keys transcript_id, ID, Name, and name. If no value is found, Transcript.__str__() is used.

Returns
str

Returns in order of preference, transcript_id, ID, Name, or name from self.attr. If not found, returns str(self)
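The lookup order described for get_name() and get_gene() can be sketched with a plain dict. The function names pick_name and pick_gene are hypothetical, and fallback stands in for str(self).

```python
def pick_name(attr, fallback):
    """Return the first of transcript_id, ID, Name, name found in attr."""
    for key in ("transcript_id", "ID", "Name", "name"):
        if key in attr:
            return attr[key]
    return fallback


def pick_gene(attr, fallback):
    """Return gene_id or Parent from attr, else a name built from pick_name."""
    for key in ("gene_id", "Parent"):
        if key in attr:
            return attr[key]
    return "gene_%s" % pick_name(attr, fallback)


print(pick_name({"ID": "tx1", "Name": "my transcript"}, "chrA:5-200(-)"))
print(pick_gene({"ID": "tx1"}, "chrA:5-200(-)"))
```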

get_position_list(self)

Retrieve a sorted end-inclusive numpy array of genomic coordinates in this SegmentChain

Returns
list

Genomic coordinates in self, as integers, in genomic order

get_position_set(self)

Retrieve an end-inclusive set of genomic coordinates included in this SegmentChain

Returns
set

Set of genomic coordinates, as integers

get_segmentchain_coordinate(self, unicode chrom, long genomic_x, unicode strand, bool stranded=True)

Finds the SegmentChain coordinate corresponding to a genomic position

Parameters
chrom : str

Chromosome name

genomic_x : int

coordinate, in genomic space

strand : str

Chromosome strand (‘+’, ‘-’, or ‘.’)

stranded : bool, optional

If True, coordinates are given in stranded space (i.e. from 5’ end of chain, as one might expect for a transcript). If False, coordinates are given from the left end of self, regardless of strand. (Default: True)

Returns
int

Position in SegmentChain

Raises
KeyError

if position outside bounds of SegmentChain

get_sequence(self, genome, stranded=True)

Return spliced genomic sequence of SegmentChain as a string

Parameters
genome : dict or twobitreader.TwoBitFile

Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects

stranded : bool

If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)

Returns
str

Nucleotide sequence of the SegmentChain extracted from genome

get_subchain(self, long start, long end, bool stranded=True, **extra_attr)

Retrieves a sub-SegmentChain corresponding to a range of positions specified in coordinates relative to this SegmentChain. Attributes in self.attr are copied to the child SegmentChain, with the exception of ID, to which the suffix ‘subchain’ is appended.

Parameters
start : int

position of interest in SegmentChain coordinates, 0-indexed

end : int

position of interest in SegmentChain coordinates, 0-indexed and half-open

stranded : bool, optional

If True, start and end are assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, they are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)

extra_attr : keyword arguments

Values that will be included in the subchain’s attr dict. These can be used to overwrite values already present.

Returns
SegmentChain

covering parent chain positions start to end of self

Raises
IndexError

if start or end is outside the bounds of the SegmentChain

TypeError

if start or end is None
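How a range in chain coordinates maps back onto genomic segments can be sketched as follows. This is a pure-Python illustration, not plastid's implementation: subchain_segments is a hypothetical name, sorted half-open (start, end) tuples stand in for GenomicSegments, attribute handling is omitted, and the exact bounds that trigger IndexError are an assumption.

```python
def subchain_segments(segments, strand, start, end, stranded=True):
    """Return genomic (start, end) tuples for chain positions [start, end)."""
    length = sum(seg_end - seg_start for seg_start, seg_end in segments)
    if not 0 <= start <= end <= length:
        raise IndexError("start or end is outside the chain")
    if stranded and strand == "-":
        # Flip 5'-based coordinates into left-to-right chain coordinates
        start, end = length - end, length - start
    pieces, offset = [], 0
    for seg_start, seg_end in segments:
        seg_len = seg_end - seg_start
        # Intersect the requested range with this segment, in chain space
        lo, hi = max(start, offset), min(end, offset + seg_len)
        if lo < hi:
            pieces.append((seg_start + lo - offset, seg_start + hi - offset))
        offset += seg_len
    return pieces


print(subchain_segments([(5, 200), (250, 300)], "-", 40, 60))
```

In the example, chain positions 40 to 59 of a minus-strand chain straddle the gap between the two segments, so the result contains one piece from each.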

get_unstranded(self) → SegmentChain

Returns an unstranded SegmentChain (strand ‘.’) covering the same positions as self, with empty attr dict.

Returns
SegmentChain

Unstranded SegmentChain covering the same positions as self

get_utr3(self, **extra_attr)

Retrieve sub-SegmentChain covering 3’UTR of self, excluding the stop codon. If no coding region, returns an empty SegmentChain

The following attributes are passed from self.attr to the new SegmentChain

  1. transcript_id, taken from SegmentChain.get_name()

  2. gene_id, taken from SegmentChain.get_gene()

  3. ID, generated as “%s_3UTR” % self.get_name()

Parameters
extra_attr : keyword arguments

Values that will be included in the 3’ UTR subchain’s attr dict. These can be used to overwrite values already present.

Returns
SegmentChain

3’ UTR region of self if present, otherwise empty SegmentChain

get_utr5(self, **extra_attr)

Retrieve sub-SegmentChain covering 5’UTR of self. If no coding region, returns an empty SegmentChain

The following attributes are passed from self.attr to the new SegmentChain

  1. transcript_id, taken from SegmentChain.get_name()

  2. gene_id, taken from SegmentChain.get_gene()

  3. ID, generated as “%s_5UTR” % self.get_name()

Parameters
extra_attr : keyword arguments

Values that will be included in the 5’UTR subchain’s attr dict. These can be used to overwrite values already present.

Returns
SegmentChain

5’ UTR region of self if present, otherwise empty SegmentChain

next(self)

Return next GenomicSegment in the SegmentChain, from left to right on the chromosome

overlaps(self, other)

Return True if self and other share genomic positions on the same strand

Parameters
other : SegmentChain or GenomicSegment

Query feature

Returns
bool

True if self and other share genomic positions on the same chromosome and strand; False otherwise.

Raises
TypeError

if other is not a GenomicSegment or SegmentChain

reset_masks(self)

Removes masks added by add_masks()

shares_segments_with(self, other)

Returns a list of GenomicSegments that are shared between self and other

Parameters
other : SegmentChain or GenomicSegment

Query feature

Returns
list

List of GenomicSegments common to self and other

Raises
TypeError

if other is not a GenomicSegment or SegmentChain

sort(self)
unstranded_overlaps(self, other)

Return True if self and other share genomic positions on the same chromosome, regardless of their strands

Parameters
other : SegmentChain or GenomicSegment

Query feature

Returns
bool

True if self and other share genomic positions on the same chromosome, False otherwise. Strands of self and other need not match

Raises
TypeError

if other is not a GenomicSegment or SegmentChain

attr

attr: dict

c_strand
cds_end

End of coding region relative to 5’ end of transcript, in direction of transcript. Setting to None also sets self.cds_start, self.cds_genome_start and self.cds_genome_end to None

cds_genome_end

Ending coordinate of coding region, relative to genome (i.e. rightmost; is stop codon for forward-strand features, start codon for reverse-strand features). Setting to None also sets self.cds_start, self.cds_end, and self.cds_genome_start to None

cds_genome_start

Starting coordinate of coding region, relative to genome (i.e. leftmost; is start codon for forward-strand features, stop codon for reverse-strand features). Setting to None also sets self.cds_start, self.cds_end, and self.cds_genome_end to None

cds_start

Start of coding region relative to 5’ end of transcript, in direction of transcript. Setting to None also sets self.cds_end, self.cds_genome_start and self.cds_genome_end to None

chrom

Chromosome the SegmentChain resides on

length
mask_segments

Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.

masked_length
segments

Copy of list of GenomicSegments that comprise self. Changing this list will do nothing to self.

spanning_segment
strand

Strand of the SegmentChain

plastid.genomics.roitools.add_three_for_stop_codon(Transcript tx) → Transcript

Extend an annotated CDS region, if present, by three nucleotides at the three-prime end. Use in cases when annotation files exclude the stop codon from the annotated CDS.

Parameters
tx : Transcript

query transcript

Returns
Transcript

Transcript with same attributes as tx, but with CDS extended by one codon

Raises
IndexError

if a three prime UTR is defined that terminates before the complete stop codon
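In transcript coordinates, the operation amounts to moving the CDS end three nucleotides toward the 3’ end. The sketch below illustrates this on plain integers; extend_cds_by_stop_codon is a hypothetical name, and the real function operates on Transcript objects and also updates the genomic CDS coordinates.

```python
def extend_cds_by_stop_codon(length, cds_start, cds_end):
    """Return (cds_start, cds_end) with cds_end moved 3 nt toward the 3' end.

    length, cds_start, and cds_end are in transcript coordinates
    (0-indexed, half-open); None marks a noncoding transcript.
    """
    if cds_start is None or cds_end is None:
        return cds_start, cds_end  # noncoding: nothing to extend
    if cds_end + 3 > length:
        raise IndexError("transcript ends before a complete stop codon")
    return cds_start, cds_end + 3


print(extend_cds_by_stop_codon(100, 10, 40))
```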

plastid.genomics.roitools.merge_segments(list segments) → list

Merge all overlapping GenomicSegments in segments, so that all segments returned are guaranteed to be sorted and non-overlapping.

Note

All segments are assumed to be on the same strand and chromosome.

Parameters
segments : list

List of GenomicSegments, all on the same strand and chromosome

Returns
list

List of sorted, non-overlapping GenomicSegments
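The merging step can be sketched on half-open (start, end) tuples. This is a standard interval-merge illustration, not plastid's implementation; merge_intervals is a hypothetical name, and treating intervals that merely abut as mergeable is an assumption that may differ from merge_segments().

```python
def merge_intervals(intervals):
    """Merge overlapping half-open (start, end) intervals; return them sorted."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or abuts the previous interval: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(merge_intervals([(250, 300), (5, 200), (150, 260)]))
```

Sorting first means each interval only needs to be compared with the most recently merged one.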

plastid.genomics.roitools.positionlist_to_segments(unicode chrom, unicode strand, list positions) → list

Construct GenomicSegments from a chromosome name, a strand, and a list of chromosomal positions.

Parameters
chrom : str

Chromosome name

strand : str

Chromosome strand (‘+’, ‘-’, or ‘.’)

positions : list of unique integers

Sorted, end-inclusive list of positions to include in final GenomicSegments

Returns
list

List of GenomicSegments covering positions

Warning

This function is meant to run quickly, without excessive type conversions. The elements of positions must therefore be UNIQUE and SORTED. If they are not, use positions_to_segments() instead.

plastid.genomics.roitools.positions_to_segments(unicode chrom, unicode strand, positions) → list

Construct GenomicSegments from a chromosome name, a strand, and a list of chromosomal positions.

Parameters
chrom : str

Chromosome name

strand : str

Chromosome strand (‘+’, ‘-’, or ‘.’)

positions : list of integers

End-inclusive list, tuple, or set of positions to include in final GenomicSegments

Returns
list

List of GenomicSegments covering positions
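Grouping positions into contiguous segments can be sketched as follows. This is a pure-Python illustration rather than plastid's implementation: positions_to_intervals is a hypothetical name, and the result is a list of half-open (start, end) tuples instead of GenomicSegments. Sorting and de-duplicating the input first mirrors the difference between positions_to_segments() and positionlist_to_segments() described above.

```python
def positions_to_intervals(positions):
    """Group integer positions into sorted, half-open (start, end) intervals."""
    intervals = []
    for pos in sorted(set(positions)):
        if intervals and pos == intervals[-1][1]:
            # pos directly follows the current run: extend its half-open end
            intervals[-1] = (intervals[-1][0], pos + 1)
        else:
            intervals.append((pos, pos + 1))
    return intervals


print(positions_to_intervals({7, 5, 6, 10, 11}))
```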