plastid.genomics.roitools module

This module contains classes for representing and manipulating genomic features.

Summary

Genomic features are represented as SegmentChains, which can contain zero or more continuous spans of the genome (GenomicSegments), as well as rich annotation data. For the specific case of RNA transcripts, a subclass of SegmentChain, called Transcript, is provided.

Module contents

`GenomicSegment`(chrom, start, end, strand)	Building block for `SegmentChain`: a continuous segment of the genome defined by a chromosome name, start coordinate, end coordinate, and strand.
`SegmentChain`(segments, *attributes)	Base class for genomic features, composed of zero or more `GenomicSegments`.
`Transcript`(segments, *attributes)	Subclass of `SegmentChain` specifically for RNA transcripts.
`positions_to_segments`(unicode chrom, ...)	Construct `GenomicSegments` from a chromosome name, a strand, and a list of chromosomal positions.
`add_three_for_stop_codon`(Transcript tx)	Extend an annotated CDS region, if present, by three nucleotides at the 3' end.

Examples

SegmentChains may be read directly from annotation files using the readers in plastid.readers:

>>> from plastid import *
>>> chains = list(BED_Reader(open("some_file.bed")))

or constructed from GenomicSegments:

>>> seg1 = GenomicSegment("chrA", 5, 200, "-")
>>> seg2 = GenomicSegment("chrA", 250, 300, "-")
>>> my_chain = SegmentChain(seg1, seg2, ID="some_chain", ... , some_attribute="some_value")

SegmentChains contain convenience methods for a number of common tasks, for example:

converting coordinates between the spliced space of the chain, and the genome:

>>> # get genomic coordinate of position 50 (0-indexed) from the 5' end
>>> my_chain.get_genomic_coordinate(50)
('chrA', 199, '-')

>>> # get genomic coordinate of position 49. splicing is taken care of!
>>> my_chain.get_genomic_coordinate(49)
('chrA', 250, '-')

>>> # get coordinate in chain corresponding to genomic coordinate 118
>>> my_chain.get_segmentchain_coordinate("chrA", 118, "-")
131

>>> # get a subchain containing positions 45-70 (end-exclusive)
>>> subchain = my_chain.get_subchain(45, 70)
>>> subchain
<SegmentChain segments=2 bounds=chrA:180-255(-) name=some_chain_subchain>

>>> # the subchain preserves the discontinuity found in `my_chain`
>>> subchain.segments
[<GenomicSegment chrA:180-200 strand='-'>,
 <GenomicSegment chrA:250-255 strand='-'>]
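The arithmetic behind these coordinate conversions can be sketched in plain Python. This is only an illustration that uses (start, end) tuples in place of GenomicSegments; `chain_to_genomic` is a hypothetical helper, not part of plastid, and it makes no claim about how plastid implements the lookup.

```python
def chain_to_genomic(segments, strand, x):
    """Map chain position x (0-indexed, counted from the 5' end) to a genomic coordinate.

    segments: list of (start, end) tuples, 0-indexed and half-open.
    """
    # Enumerate every genomic position covered by the chain, left to right
    positions = [p for start, end in sorted(segments) for p in range(start, end)]
    if strand == "-":
        # On the minus strand the 5' end is the rightmost genomic position
        positions.reverse()
    return positions[x]


segments = [(5, 200), (250, 300)]           # same layout as my_chain above
print(chain_to_genomic(segments, "-", 50))  # 199
print(chain_to_genomic(segments, "-", 49))  # 250
```

Positions 0 through 49 fall in the 250-300 segment, so position 50 is the first one on the far side of the gap.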

fetching numpy arrays of data at each position in the chain. The data is assumed to be kept in a GenomeArray:

>>> ga = BAMGenomeArray(["some_file.bam"], mapping=ThreePrimeMapFactory(offset=15))
>>> my_chain.get_counts(ga)
array([843, 854, 153,  86, 462, 359, 290,  38,  38, 758, 342, 299, 430,
       628, 324, 437, 231, 417, 536, 673, 243, 981, 661, 415, 207, 446,
       197, 520, 653, 468, 863,   3, 272, 754, 352, 960, 966, 913, 367,
       ...
       ])

similarly, fetching spliced sequence, reverse-complemented if necessary for minus-strand features. As input, the SegmentChain expects a dictionary-like object mapping chromosome names to string-like sequences (e.g. as in BioPython or twobitreader):
```
>>> seqdict = { "chrA" : "TCTACATA ..." } # some string of chrA sequence
>>> my_chain.get_sequence(seqdict)
"ACTGTGTACTGTACGATCGATCGTACGTACGATCGATCGTACGTAGCTAGTCAGCTAGCTAGCTAGCTGA..."
```

testing for overlap, containment, equality with other SegmentChains:

>>> other_chain = SegmentChain(GenomicSegment("chrA", 200, 300, "-"),
...                            GenomicSegment("chrA", 800, 900, "-"))

>>> my_chain.overlaps(other_chain)
True

>>> other_chain in my_chain
False

>>> my_chain in my_chain
True

>>> my_chain.covers(other_chain)
False

>>> my_chain == other_chain
False

>>> my_chain == my_chain
True

export to BED, GTF2, or GFF3:

>>> my_chain.as_bed()
chrA    5    300    some_chain    0    -    5    5    0,0,0    2    195,50,    0,245,

>>> my_chain.as_gtf()
chrA    .    exon    6    200    .    -    .    gene_id "gene_some_chain"; transcript_id "some_chain"; some_attribute "some_value"; ID "some_chain";
chrA    .    exon    251  300    .    -    .    gene_id "gene_some_chain"; transcript_id "some_chain"; some_attribute "some_value"; ID "some_chain";
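The GTF2 start coordinates above (6 and 251) differ from the segment starts (5 and 250) because GenomicSegment coordinates are 0-indexed and half-open, while GTF2 coordinates are 1-indexed and fully closed. A minimal sketch of that conversion, using a hypothetical helper `to_gtf_coords` rather than any plastid function:

```python
def to_gtf_coords(start, end):
    """Convert 0-indexed, half-open (start, end) to 1-indexed, fully-closed."""
    # The start shifts by one; the half-open end already equals the closed end
    return start + 1, end


for start, end in [(5, 200), (250, 300)]:
    print(to_gtf_coords(start, end))
# (6, 200)
# (251, 300)
```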

class plastid.genomics.roitools.GenomicSegment(chrom, start, end, strand)

Bases: object

Building block for SegmentChain: a continuous segment of the genome defined by a chromosome name, start coordinate, end coordinate, and strand.

Examples

GenomicSegments sort lexically by chromosome, start position, end position, and finally strand:

>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrB", 0, 10, "+")
True

>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrA", 75, 100, "+")
True

>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrA", 55, 75, "+")
True

>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrA", 50, 150, "+")
True

>>> GenomicSegment("chrA", 50, 100, "+") < GenomicSegment("chrA", 50, 100, "-")
True
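The ordering shown above is consistent with comparing (chrom, start, end, strand) tuples. The sketch below assumes that equivalence for illustration; `sort_key` is a hypothetical helper and plain tuples stand in for GenomicSegments.

```python
def sort_key(chrom, start, end, strand):
    """Tuple whose natural ordering mirrors the comparisons shown above."""
    return (chrom, start, end, strand)


a = sort_key("chrA", 50, 100, "+")
print(a < sort_key("chrB", 0, 10, "+"))    # True: chromosome compared first
print(a < sort_key("chrA", 55, 75, "+"))   # True: start compared before end
print(a < sort_key("chrA", 50, 150, "+"))  # True: end breaks ties on start
print(a < sort_key("chrA", 50, 100, "-"))  # True: "+" sorts before "-"
```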

They also provide a few convenience methods for containment or overlap. To be contained, a segment must be on the same chromosome and strand as its container, and its coordinates must be within or equal to its endpoints:

>>> GenomicSegment("chrA", 50, 100, "+") in GenomicSegment("chrA", 25, 100, "+")
True

>>> GenomicSegment("chrA", 50, 100, "+") in GenomicSegment("chrA", 50, 100, "+")
True

>>> GenomicSegment("chrA", 50, 100, "+") in GenomicSegment("chrA", 25, 100, "-")
False

>>> GenomicSegment("chrA", 50, 100, "+") in GenomicSegment("chrA", 75, 200, "+")
False

Similarly, to overlap, GenomicSegments must be on the same strand and chromosome.
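For half-open coordinates, both tests reduce to a few comparisons. The following sketch uses a hypothetical namedtuple `Seg` as a stand-in for GenomicSegment; it illustrates the rules described above and is not plastid's code.

```python
from collections import namedtuple

# Hypothetical stand-in for GenomicSegment (0-indexed, half-open coordinates)
Seg = namedtuple("Seg", "chrom start end strand")


def contains(outer, inner):
    """True if every position of inner lies within outer on the same chromosome and strand."""
    return (outer.chrom == inner.chrom and outer.strand == inner.strand
            and outer.start <= inner.start and inner.end <= outer.end)


def overlaps(a, b):
    """True if a and b share a chromosome, a strand, and at least one position."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start < b.end and b.start < a.end)


query = Seg("chrA", 50, 100, "+")
print(contains(Seg("chrA", 25, 100, "+"), query))  # True
print(contains(Seg("chrA", 75, 200, "+"), query))  # False
print(overlaps(Seg("chrA", 75, 200, "+"), query))  # True
```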

Attributes

chrom (str): Chromosome where GenomicSegment resides
start (int): Zero-indexed (Pythonic) start coordinate of GenomicSegment
end (int): Zero-indexed, half-open (Pythonic) end coordinate of GenomicSegment
strand (str): Strand of GenomicSegment

Methods

`as_igv_str`(self)	Format as an IGV location string
`contains`(self, GenomicSegment other)	Test whether this segment contains other, where containment is defined as all positions in other being present in self, when both self and other share the same chromosome and strand.
`from_igv_str`(unicode loc_str, unicode strand=u'.')	Construct `GenomicSegment` from IGV location string
`from_str`(unicode inp)	Construct a `GenomicSegment` from its `str()` representation
`overlaps`(self, GenomicSegment other)	Test whether this segment overlaps other, where overlap is defined as sharing: a chromosome, a strand, and a subset of coordinates.

as_igv_str(self) → unicode: Format as an IGV location string

contains(self, GenomicSegment other) → bool

Test whether this segment contains other, where containment is defined as all positions in other being present in self, when both self and other share the same chromosome and strand.

Parameters

other (GenomicSegment): Query segment

Returns

bool

static from_igv_str(unicode loc_str, unicode strand=u'.')

Construct GenomicSegment from IGV location string

Parameters

loc_str (str): IGV location string, in format ‘chromosome:start-end’, where start and end are 1-indexed and half-open
strand (str): The chromosome strand (‘+’, ‘-’, or ‘.’)

Returns

GenomicSegment

static from_str(unicode inp)

Construct a GenomicSegment from its str() representation

Parameters

inp (str): String representation of GenomicSegment as chrom:start-end(strand) where start and end are in 0-indexed, half-open coordinates

Returns

GenomicSegment

overlaps(self, GenomicSegment other) → bool

Test whether this segment overlaps other, where overlap is defined as sharing: a chromosome, a strand, and a subset of coordinates.

Parameters

other (GenomicSegment): Query segment

Returns

bool

c_strand

chrom: Chromosome where GenomicSegment resides

end: Zero-indexed, half-open (Pythonic) end coordinate of GenomicSegment

start: Zero-indexed (Pythonic) start coordinate of GenomicSegment

strand

Strand of GenomicSegment

‘+’ for forward / Watson strand
‘-’ for reverse / Crick strand
‘.’ for unstranded / both strands

class plastid.genomics.roitools.SegmentChain(*segments, **attributes)

Bases: object

Base class for genomic features, composed of zero or more GenomicSegments. SegmentChains can therefore model discontinuous features, such as multi-exon transcripts or gapped alignments, in addition to continuous features.

Numerous convenience functions are supplied for:

- converting between coordinates relative to the genome and relative to the internal coordinates of a spliced SegmentChain

- fetching genomic sequence, read alignments, or count data, accounting for splicing of the segments, and, in the case of reverse-strand features, reverse-complementing

- slicing or fetching sub-regions of a SegmentChain

- testing equality, inequality, overlap, containment, coverage of, or sharing of segments with other SegmentChain or GenomicSegment objects

- import/export to BED, PSL, GTF2, and GFF3 formats, for use in other software packages or in a genome browser.

Intervals are sorted from lowest to greatest starting coordinate on their reference sequence, regardless of strand. Iteration over the SegmentChain will yield intervals from left-to-right in the genome.

Parameters

*segments (GenomicSegment)

0 or more GenomicSegments on the same strand

**attr (keyword arguments)

Arbitrary attributes, including, for example:

Attribute	Description
`type`	A feature type used for GTF2/GFF3 export of each interval in the `SegmentChain`. (Default: ‘exon’)
`ID`	A unique ID for the `SegmentChain`.
`transcript_id`	A transcript ID used for GTF2 export
`gene_id`	A gene ID used for GTF2 export

See also

Transcript: Transcript subclass, additionally providing richer GTF2, GFF3, and BED export, as well as methods for fetching coding regions and UTRs as subsegments

Attributes

spanning_segment (GenomicSegment): A GenomicSegment spanning the endpoints of the SegmentChain
strand (str): Strand of the SegmentChain
chrom (str): Chromosome the SegmentChain resides on
attr (dict): Dictionary of attributes of the SegmentChain
segments (list): Copy of list of GenomicSegments that comprise self.
mask_segments (list): Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.

Methods

`add_masks`(self, *mask_segments)	Adds one or more `GenomicSegment` to the collection of masks.
`add_segments`(self, *segments)	Add 1 or more `GenomicSegments` to the `SegmentChain`.
`antisense_overlaps`(self, other)	Returns True if self and other share genomic positions on opposite strands
`as_bed`(self[, thickstart, thickend, as_int, ...])	Format `SegmentChain` as a string of BED12[+X] output.
`as_gff3`(self, unicode feature_type=None, ...)	Format self as a line of GFF3 output.
`as_gtf`(self, unicode feature_type=None, ...)	Format `SegmentChain` as a block of GTF2 output.
`as_psl`(self)	Formats `SegmentChain` as PSL (blat) output.
`covers`(self, other)	Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self.
`from_bed`(unicode line[, extra_columns])	Create a `SegmentChain` from a line from a BED file.
`from_psl`(psl_line)	Create a `SegmentChain` from a line from a PSL (BLAT) file
`from_str`(unicode inp)	Create a `SegmentChain` from a string formatted by `SegmentChain.__str__()`:
`get_antisense`(self)	Returns a `SegmentChain` antisense to self, with empty attr dict.
`get_counts`(self, ga[, stranded])	Return array of counts or values drawn from ga at each position in self
`get_fasta`(self, genome[, stranded])	Formats sequence of SegmentChain as FASTA output
`get_gene`(self)	Return name of gene associated with `SegmentChain`, if any, by searching through self.attr for the keys gene_id and Parent.
`get_genomic_coordinate`(self, x[, stranded])	Finds genomic coordinate corresponding to position x in self
`get_junctions`(self)	Returns a list of `GenomicSegments` representing spaces between the `GenomicSegments` in self. In the case of a transcript, these would represent introns.
`get_masked_counts`(self, ga[, stranded, copy])	Return counts covering self in dataset ga as a masked array, in transcript coordinates.
`get_masked_position_set`(self)	Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using `SegmentChain.add_masks()`
`get_masks`(self)	Return masked positions as a list of `GenomicSegments`
`get_masks_as_segmentchain`(self)	Return masked positions as a `SegmentChain`
`get_name`(self)	Returns the name of this `SegmentChain`, first searching through self.attr for the keys ID, Name, and name.
`get_position_list`(self)	Retrieve a sorted end-inclusive numpy array of genomic coordinates in this `SegmentChain`
`get_position_set`(self)	Retrieve an end-inclusive set of genomic coordinates included in this `SegmentChain`
`get_segmentchain_coordinate`(self, ...)	Finds the `SegmentChain` coordinate corresponding to a genomic position
`get_sequence`(self, genome[, stranded])	Return spliced genomic sequence of `SegmentChain` as a string
`get_subchain`(self, long start, long end, ...)	Retrieves a sub-`SegmentChain` corresponding to a range of positions specified in coordinates relative to this `SegmentChain`.
`get_unstranded`(self)	Returns an unstranded `SegmentChain` covering the same positions as self, with empty attr dict.
`next`(self)	Return next `GenomicSegment` in the `SegmentChain`, from left to right on the chromosome
`overlaps`(self, other)	Return True if self and other share genomic positions on the same strand
`reset_masks`(self)	Removes masks added by `add_masks()`
`shares_segments_with`(self, other)	Returns a list of `GenomicSegment` that are shared between self and other
`sort`(self)
`unstranded_overlaps`(self, other)	Return True if self and other share genomic positions on the same chromosome, regardless of their strands

add_masks(self, *mask_segments)

Adds one or more GenomicSegments to the collection of masks. Masks will be trimmed to the positions of the SegmentChain during addition.

Parameters

mask_segments (GenomicSegment): One or more segments, in genomic coordinates, covering positions to exclude from return values of get_masked_position_set(), get_masked_counts(), or get_masked_length()

See also

SegmentChain.get_masks
SegmentChain.get_masks_as_segmentchain
SegmentChain.reset_masks

add_segments(self, *segments)

Add 1 or more GenomicSegments to the SegmentChain. If there are already segments in the chain, the incoming segments must be on the same strand and chromosome as all others present.

Parameters

segments (GenomicSegment): One or more GenomicSegments to add to the SegmentChain

antisense_overlaps(self, other)

Returns True if self and other share genomic positions on opposite strands

Parameters

other (SegmentChain or GenomicSegment): Query feature

Returns

bool: True if self and other share genomic positions on the same chromosome but opposite strand; False otherwise.

Raises

TypeError: if other is not a GenomicSegment or SegmentChain

as_bed(self, thickstart=None, thickend=None, as_int=True, color=None, extra_columns=None, empty_value='')

Format SegmentChain as a string of BED12[+X] output.

If the SegmentChain was imported as a BED file with extra columns, these will be output in the same order, after the BED columns.

Parameters

thickstart (int or None, optional)

If not None, overrides the genome coordinate that starts thick plotting in genome browser found in self.attr[‘thickstart’]

thickend (int or None, optional)

If not None, overrides the genome coordinate that stops thick plotting in genome browser found in self.attr[‘thickend’]

as_int (bool, optional)

Force score to integer (Default: True)

color (str or None, optional)

Color represented as RGB hex string. If not None, overrides the color in self.attr[‘color’]

extra_columns (None or list-like, optional)

If None, and the SegmentChain was imported using the extra_columns keyword of from_bed(), the SegmentChain will be exported in BED 12+X format, in which extra columns are in the same order as they were upon import. If no extra columns were present, the SegmentChain will be exported as a BED12 line.

If a list of attribute names, these attributes will be exported as extra columns in order, overriding whatever happened upon import. If an attribute name is not in the attr dict of the SegmentChain, it will be exported with the value of empty_value

If an empty list, no extra columns will be exported; the SegmentChain will be formatted as a BED12 line.

empty_value (str, optional)

Value to export for extra_columns that are not defined (Default: “”)

Returns

str: Line of BED12[+X]-formatted text

Notes

BED12 columns are as follows:

Column	Contains
1	Contig or chromosome
2	Start of first block in feature (0-indexed)
3	End of last block in feature (half-open)
4	Feature name
5	Feature score
6	Strand
7	thickstart (in chromosomal coordinates)
8	thickend (in chromosomal coordinates)
9	Feature color as RGB tuple
10	Number of blocks in feature
11	Block lengths
12	Block starts, relative to start of first block

For more details

See the UCSC file format faq
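Columns 2, 3, and 10 through 12 follow directly from the segment coordinates. The sketch below shows one way to derive them from (start, end) tuples; `bed12_blocks` is a hypothetical helper for illustration, not a plastid function.

```python
def bed12_blocks(segments):
    """Return (chromStart, chromEnd, blockCount, blockSizes, blockStarts).

    segments: list of (start, end) tuples, 0-indexed and half-open.
    """
    segments = sorted(segments)
    chrom_start = segments[0][0]   # column 2: start of first block
    chrom_end = segments[-1][1]    # column 3: end of last block
    # Columns 11 and 12 are comma-separated lists with a trailing comma
    sizes = ",".join(str(end - start) for start, end in segments) + ","
    starts = ",".join(str(start - chrom_start) for start, _ in segments) + ","
    return chrom_start, chrom_end, len(segments), sizes, starts


print(bed12_blocks([(5, 200), (250, 300)]))
# (5, 300, 2, '195,50,', '0,245,')
```

The result agrees with the `as_bed()` output shown in the module examples for a chain with segments 5-200 and 250-300.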

as_gff3(self, unicode feature_type=None, bool escape=True, list excludes=None)

Format self as a line of GFF3 output.

Because GFF3 files permit many schemas of parent-child hierarchy, and in order to reduce confusion and overhead, attempts to export a multi-interval SegmentChain will raise an AttributeError.

Instead, users may export the individual features from which the multi-interval SegmentChain was constructed, or construct features for them, setting ID, Parent, and type attributes following their own conventions.

Parameters

feature_type (str): If not None, overrides the type attribute of self.attr
escape (bool, optional): Escape tokens in column 9 of GFF3 output (Default: True)
excludes (list, optional): List of attribute key names to exclude from column 9 (Default: [])

Returns

str: Line of GFF3-formatted text

Raises

AttributeError: if the SegmentChain has multiple intervals

Notes

Columns of GFF3 are as follows

Column	Contains
1	Contig or chromosome
2	Source of annotation
3	Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4	Start (1-indexed)
5	End (fully-closed)
6	Score
7	Strand
8	Frame. Number of bases within feature before first in-frame codon (if coding)
9	Attributes

as_gtf(self, unicode feature_type=None, bool escape=True, list excludes=None)

Format SegmentChain as a block of GTF2 output.

The frame or phase attribute (GTF2 column 8) is valid only for ‘CDS’ features, and, if not present in self.attr, is calculated assuming the SegmentChain contains the entire coding region. If the SegmentChain contains multiple intervals, the frame or phase attribute will always be recalculated.

All attributes in self.attr, except those created upon import, will be propagated to all of the features that are generated.

Parameters

feature_type (str): If not None, overrides the “type” attribute of self.attr
escape (bool, optional): Escape tokens in column 9 of GTF output (Default: True)
excludes (list, optional): List of attribute key names to exclude from column 9 (Default: [])

Returns

str: Block of GTF2-formatted text

Notes

gene_id and transcript_id are required

The GTF2 specification requires that attributes gene_id and transcript_id be defined. If these are not present in self.attr, their values will be guessed following the rules in SegmentChain.get_gene() and SegmentChain.get_name(), respectively.

Beware of attribute loss

To save memory, only the attributes shared by all of the individual sub-features (e.g. exons) that were used to assemble this Transcript have been stored in self.attr. This means that upon re-export to GTF2, these sub-features will be lacking any attributes that were specific to them individually. Formally, this is compliant with the GTF2 specification, which states explicitly that only the attributes gene_id and transcript_id are supported.

Columns of GTF2 are as follows

Column	Contains
1	Contig or chromosome
2	Source of annotation
3	Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4	Start (1-indexed)
5	End (fully-closed)
6	Score
7	Strand
8	Frame. Number of bases within feature before first in-frame codon (if coding)
9	Attributes. “gene_id” and “transcript_id” are required

as_psl(self)

Formats SegmentChain as PSL (blat) output.

Returns

str: PSL-representation of BLAT alignment

Raises

AttributeError: If not all of the attributes listed in the Notes below are defined

Notes

This will raise an AttributeError unless the following keys are present and defined in self.attr, corresponding to the columns of a PSL file:

Column	Key
1	match_length
2	mismatches
3	rep_matches
4	N
5	query_gap_count
6	query_gap_bases
7	target_gap_count
8	target_gap_bases
9	strand
10	query_name
11	query_length
12	query_start
13	query_end
14	target_name
15	target_length
16	target_start
17	target_end
19	q_starts : list of integers
20	l_starts : list of integers

These keys are defined only if the SegmentChain was created by SegmentChain.from_psl(), or if the user has defined them.

See the PSL spec for more information.

covers(self, other)

Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self. By convention, zero-length SegmentChains are not covered by other chains.

Parameters

other (SegmentChain or GenomicSegment): Query feature

Returns

bool: True if self and other share a chromosome and strand, and all genomic positions in other are present in self. Otherwise False

Raises

TypeError: if other is not a GenomicSegment or SegmentChain
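Coverage can be thought of as a subset test on genomic positions. The sketch below assumes the chromosome and strand have already been checked and works on (start, end) tuples; `covers_positions` is a hypothetical helper, not plastid's implementation.

```python
def covers_positions(self_segments, other_segments):
    """True if every genomic position in other_segments is present in self_segments."""
    other = {p for start, end in other_segments for p in range(start, end)}
    if not other:
        # By convention, zero-length chains are not covered by other chains
        return False
    mine = {p for start, end in self_segments for p in range(start, end)}
    return other <= mine


my_segments = [(5, 200), (250, 300)]
print(covers_positions(my_segments, [(200, 300), (800, 900)]))  # False
print(covers_positions(my_segments, [(180, 200), (250, 255)]))  # True
print(covers_positions(my_segments, []))                        # False
```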

static from_bed(unicode line, extra_columns=0)

Create a SegmentChain from a line from a BED file. The BED line may contain 4 to 12 columns, per the specification. These will be auto-detected and parsed appropriately.

See the UCSC file format faq for more details.

Parameters

line (str)

Line from a BED file, containing 4 or more columns

extra_columns (int or list, optional)

Extra, non-BED columns in an extended BED format file, corresponding to feature attributes. This is common in ENCODE-specific BED variants.

If extra_columns is:

- an int, it is taken to be the number of attribute columns. Attributes will be stored in the attr dictionary of the SegmentChain, under names like custom0, custom1, … , customN.

- a list of str, it is taken to be the names of the attribute columns, in order, from left to right in the file. In this case, attributes in extra columns will be stored under their respective names in the attr dict.

- a list of tuple, each tuple is taken to be a pair of (attribute_name, formatter_func). In this case, the value of attribute_name in the attr dict of the SegmentChain will be set to formatter_func(column_value).

(Default: 0)

Returns

SegmentChain
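The three accepted forms of extra_columns can be illustrated with a small stand-alone function. This is a sketch of the rules described above, not plastid's parser; `parse_extra_columns` and its `values` argument (the raw strings from the extra columns) are hypothetical.

```python
def parse_extra_columns(values, extra_columns):
    """Turn raw extra-column strings into an attribute dict.

    values: list of str, the extra column values from left to right.
    extra_columns: an int, a list of names, or a list of (name, formatter) tuples.
    """
    if isinstance(extra_columns, int):
        # An int gives the number of columns; names are generated
        names = ["custom%s" % i for i in range(extra_columns)]
        return dict(zip(names, values))
    attr = {}
    for spec, value in zip(extra_columns, values):
        if isinstance(spec, tuple):
            name, formatter = spec
            attr[name] = formatter(value)   # apply the supplied formatter
        else:
            attr[spec] = value              # a plain name keeps the raw string
    return attr


raw = ["3.5", "abc"]
print(parse_extra_columns(raw, 2))
# {'custom0': '3.5', 'custom1': 'abc'}
print(parse_extra_columns(raw, ["signal", "label"]))
# {'signal': '3.5', 'label': 'abc'}
print(parse_extra_columns(raw, [("signal", float), ("label", str)]))
# {'signal': 3.5, 'label': 'abc'}
```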

static from_psl(psl_line)

Create a SegmentChain from a line from a PSL (BLAT) file

See the PSL spec

Parameters

psl_line (str): Line from a PSL file

Returns

SegmentChain

static from_str(unicode inp)

Create a SegmentChain from a string formatted by SegmentChain.__str__():

chrom:start-end^start-end(strand)

where ‘^’ indicates a splice junction between regions specified by start and end and strand is ‘+’, ‘-’, or ‘.’. Coordinates are 0-indexed and half-open.

Parameters

inp (str): String formatted in manner of SegmentChain.__str__()

Returns

SegmentChain
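To make the string format concrete, the following sketch parses it into a chromosome name, a list of (start, end) tuples, and a strand. The regular expression and the helper `parse_chain_str` are assumptions made for illustration; plastid's own parser may differ.

```python
import re

# chrom:start-end^start-end(strand), with '^' separating spliced regions
CHAIN_PATTERN = re.compile(r"^(?P<chrom>.+):(?P<spans>[0-9^-]+)\((?P<strand>[+.-])\)$")


def parse_chain_str(inp):
    """Parse 'chrom:start-end^start-end(strand)' into (chrom, segments, strand)."""
    match = CHAIN_PATTERN.match(inp)
    if match is None:
        raise ValueError("not a chain string: %r" % inp)
    segments = []
    for span in match.group("spans").split("^"):
        start, end = span.split("-")
        segments.append((int(start), int(end)))  # 0-indexed, half-open
    return match.group("chrom"), segments, match.group("strand")


print(parse_chain_str("chrA:5-200^250-300(-)"))
# ('chrA', [(5, 200), (250, 300)], '-')
```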

get_antisense(self) → SegmentChain

Returns a SegmentChain antisense to self, with empty attr dict.

Returns

SegmentChain: SegmentChain antisense to self

get_counts(self, ga, stranded=True)

Return array of counts or values drawn from ga at each position in self

Parameters

ga (GenomeArray): GenomeArray from which to fetch counts
stranded (bool, optional): If True and self is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)

Returns

numpy.ndarray: Array of counts from ga covering self

get_fasta(self, genome, stranded=True)

Formats sequence of SegmentChain as FASTA output

Parameters

genome (dict or twobitreader.TwoBitFile): Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects
stranded (bool): If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)

Returns

str: FASTA-formatted sequence of SegmentChain extracted from genome

get_gene(self)

Return name of gene associated with SegmentChain, if any, by searching through self.attr for the keys gene_id and Parent. If one is not found, a generated gene name for the SegmentChain is made from get_name().

Returns

str: Returns in order of preference, gene_id from self.attr, Parent from self.attr or 'gene_%s' % self.get_name()

get_genomic_coordinate(self, x, stranded=True)

Finds genomic coordinate corresponding to position x in self

Parameters

x (int): position of interest, relative to SegmentChain
stranded (bool, optional): If True, x is assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, coordinates are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)

Returns

str: Chromosome name
long: Genomic coordinate corresponding to position x
str: Chromosome strand (‘+’, ‘-’, or ‘.’)

Raises

IndexError: if x is outside the bounds of the SegmentChain

get_junctions(self)

Returns a list of GenomicSegments representing spaces between the GenomicSegments in self. In the case of a transcript, these would represent introns. In the case of an alignment, these would represent gaps in the query compared to the reference.

Returns

list: List of GenomicSegments covering spaces between the intervals in self (e.g. introns in the case of a transcript, or gaps in the case of an alignment)
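Junctions are the gaps between consecutive segments. A minimal sketch, assuming the segments do not overlap and representing each as a (start, end) tuple; `junctions` is a hypothetical helper rather than plastid code.

```python
def junctions(segments):
    """Return (start, end) tuples for the gaps between consecutive segments."""
    segments = sorted(segments)
    # Pair each segment with its right-hand neighbor; the gap runs from
    # the end of the left segment to the start of the right one
    return [(left_end, right_start)
            for (_, left_end), (right_start, _) in zip(segments, segments[1:])]


print(junctions([(5, 200), (250, 300)]))  # [(200, 250)]
print(junctions([(5, 200)]))              # []
```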

get_masked_counts(self, ga, stranded=True, copy=False)

Return counts covering self in dataset ga as a masked array, in transcript coordinates. Positions masked by SegmentChain.add_masks() will be masked in the array

Parameters

ga (non-abstract subclass of AbstractGenomeArray): GenomeArray from which to fetch counts
stranded (bool, optional): If True and the SegmentChain is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)
copy (bool, optional): If False (default) returns a view of the data, so changing values in the view changes the values in the GenomeArray if it is mutable. If True, a copy is returned instead.

Returns

numpy.ma.masked_array

get_masked_position_set(self) → set

Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using SegmentChain.add_masks()

Returns

set: Set of genomic coordinates, as integers
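Despite its name, this method returns the positions that remain after masking. The idea can be sketched as a set difference over (start, end) tuples; `unmasked_positions` is a hypothetical helper for illustration only.

```python
def unmasked_positions(segments, masks):
    """Return the genomic positions in segments that are not covered by any mask."""
    positions = {p for start, end in segments for p in range(start, end)}
    masked = {p for start, end in masks for p in range(start, end)}
    # Mask positions outside the chain have no effect on the difference,
    # which mirrors masks being trimmed to the chain
    return positions - masked


print(sorted(unmasked_positions([(5, 10)], [(8, 20)])))  # [5, 6, 7]
```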

get_masks(self)

Return masked positions as a list of GenomicSegments

Returns

list: list of GenomicSegments representing masked positions

See also

SegmentChain.get_masks_as_segmentchain
SegmentChain.add_masks
SegmentChain.reset_masks

get_masks_as_segmentchain(self)

Return masked positions as a SegmentChain

Returns

SegmentChain: Masked positions

See also

SegmentChain.get_masks
SegmentChain.add_masks
SegmentChain.reset_masks

get_name(self)

Returns the name of this SegmentChain, first searching through self.attr for the keys ID, Name, and name. If no value is found for any of those keys, a name is generated using SegmentChain.__str__()

Returns

str: In order of preference, ID from self.attr, Name from self.attr, name from self.attr or str(self)
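The lookup order described for get_name(), and the related order described for get_gene(), can be sketched with plain dictionaries. The helpers `lookup_name` and `lookup_gene` are hypothetical; `fallback` stands in for the value of str(self).

```python
def lookup_name(attr, fallback):
    """Return the first of ID, Name, name found in attr, else fallback."""
    for key in ("ID", "Name", "name"):
        if key in attr:
            return attr[key]
    return fallback


def lookup_gene(attr, fallback_name):
    """Return gene_id, then Parent, else a generated 'gene_<name>' string."""
    for key in ("gene_id", "Parent"):
        if key in attr:
            return attr[key]
    return "gene_%s" % lookup_name(attr, fallback_name)


print(lookup_name({"Name": "b", "ID": "a"}, "chrA:5-200(-)"))  # a
print(lookup_gene({"ID": "some_chain"}, "chrA:5-200(-)"))      # gene_some_chain
```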

get_position_list(self)

Retrieve a sorted end-inclusive numpy array of genomic coordinates in this SegmentChain

Returns

list: Genomic coordinates in self, as integers, in genomic order

get_position_set(self)

Retrieve an end-inclusive set of genomic coordinates included in this SegmentChain

Returns

set: Set of genomic coordinates, as integers

get_segmentchain_coordinate(self, unicode chrom, long genomic_x, unicode strand, bool stranded=True)

Finds the SegmentChain coordinate corresponding to a genomic position

Parameters

chrom (str): Chromosome name
genomic_x (int): coordinate, in genomic space
strand (str): Chromosome strand (‘+’, ‘-’, or ‘.’)
stranded (bool, optional): If True, coordinates are given in stranded space (i.e. from 5’ end of chain, as one might expect for a transcript). If False, coordinates are given from the left end of self, regardless of strand. (Default: True)

Returns

int: Position in SegmentChain

Raises

KeyError: if position outside bounds of SegmentChain

get_sequence(self, genome, stranded=True)

Return spliced genomic sequence of SegmentChain as a string

Parameters

genome (dict or twobitreader.TwoBitFile): Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects
stranded (bool): If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)

Returns

str: Nucleotide sequence of the SegmentChain extracted from genome
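Splicing and reverse-complementing can be sketched with plain strings. The example below is an illustration only: `spliced_sequence` is a hypothetical helper, segments are (start, end) tuples, and only unambiguous A, C, G, T bases are handled.

```python
# Translation table for complementing unambiguous DNA bases
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def spliced_sequence(segments, strand, genome, chrom, stranded=True):
    """Join the sequence of each segment, reverse-complementing minus-strand chains."""
    seq = "".join(genome[chrom][start:end] for start, end in sorted(segments))
    if stranded and strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


genome = {"chrA": "AACCGGTTAC"}
print(spliced_sequence([(0, 3), (6, 9)], "+", genome, "chrA"))  # AACTTA
print(spliced_sequence([(0, 3), (6, 9)], "-", genome, "chrA"))  # TAAGTT
```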

get_subchain(self, long start, long end, bool stranded=True, **extra_attr)

Retrieves a sub-SegmentChain corresponding to a range of positions specified in coordinates relative to this SegmentChain. Attributes in self.attr are copied to the child SegmentChain, with the exception of ID, to which the suffix ‘subchain’ is appended.

Parameters

start (int): position of interest in SegmentChain coordinates, 0-indexed
end (int): position of interest in SegmentChain coordinates, 0-indexed and half-open
stranded (bool, optional): If True, start and end are assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, they are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)
extra_attr (keyword arguments): Values that will be included in the subchain’s attr dict. These can be used to overwrite values already present.

Returns

SegmentChain: covering parent chain positions start to end of self

Raises

IndexError: if start or end is outside the bounds of the SegmentChain
TypeError: if start or end is None
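How a range of chain positions maps back onto genomic segments can be sketched as follows. This is an illustration that assumes stranded coordinates and (start, end) tuples; `subchain_segments` is a hypothetical helper and does no bounds checking.

```python
def subchain_segments(segments, strand, start, end):
    """Return genomic (start, end) tuples covering chain positions start..end (half-open)."""
    positions = [p for s, e in sorted(segments) for p in range(s, e)]
    if strand == "-":
        positions.reverse()            # count from the 5' end on the minus strand
    selected = sorted(positions[start:end])
    merged = []
    for p in selected:
        if merged and merged[-1][1] == p:
            merged[-1][1] = p + 1      # extend the current run of adjacent positions
        else:
            merged.append([p, p + 1])  # a gap starts a new segment
    return [tuple(m) for m in merged]


print(subchain_segments([(5, 200), (250, 300)], "-", 45, 70))
# [(180, 200), (250, 255)]
```

The result matches the segments of the subchain shown in the module examples.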

get_unstranded(self) → SegmentChain

Returns an unstranded SegmentChain covering the same positions as self, with empty attr dict.

Returns

SegmentChain: Unstranded version of self

next(self): Return next GenomicSegment in the SegmentChain, from left to right on the chromosome

overlaps(self, other)

Return True if self and other share genomic positions on the same strand

Parameters

other (SegmentChain or GenomicSegment): Query feature

Returns

bool: True if self and other share genomic positions on the same chromosome and strand; False otherwise.

Raises

TypeError: if other is not a GenomicSegment or SegmentChain

reset_masks(self)

Removes masks added by add_masks()

See also

SegmentChain.add_masks

shares_segments_with(self, other)

Returns a list of GenomicSegments that are shared between self and other

Parameters

other (SegmentChain or GenomicSegment): Query feature

Returns

list: List of GenomicSegments common to self and other

Raises

TypeError: if other is not a GenomicSegment or SegmentChain

sort(self)

unstranded_overlaps(self, other)

Return True if self and other share genomic positions on the same chromosome, regardless of their strands

Parameters

other (SegmentChain or GenomicSegment): Query feature

Returns

bool: True if self and other share genomic positions on the same chromosome, False otherwise. Strands of self and other need not match

Raises

TypeError: if other is not a GenomicSegment or SegmentChain

attr: Dictionary of attributes (dict)

c_strand

chrom: Chromosome the SegmentChain resides on

length

mask_segments: Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.

masked_length

segments: Copy of list of GenomicSegments that comprise self. Changing this list will do nothing to self.

spanning_segment

strand: Strand of the SegmentChain

class plastid.genomics.roitools.Transcript(*segments, **attributes)

Bases: plastid.genomics.roitools.SegmentChain

Subclass of SegmentChain specifically for RNA transcripts. In addition to coordinate-conversion, count fetching, sequence fetching, and various other methods inherited from SegmentChain, Transcript provides convenience methods for fetching sub-chains corresponding to CDS features, 5’ UTRs, and 3’ UTRs.

Parameters

*segments (GenomicSegment)

0 or more GenomicSegments on the same strand

**attr (keyword arguments)

Arbitrary attributes, including, for example:

Attribute	Description
`cds_genome_start`	Location of CDS start, in genomic coordinates
`cds_genome_end`	Location of CDS end, in genomic coordinates
`ID`	A unique ID for the `SegmentChain`.
`transcript_id`	A transcript ID used for GTF2 export
`gene_id`	A gene ID used for GTF2 export

Attributes

cds_genome_start (int or None): Starting coordinate of coding region, relative to genome (i.e. leftmost; is start codon for forward-strand features, stop codon for reverse-strand features).
cds_genome_end (int or None): Ending coordinate of coding region, relative to genome (i.e. rightmost; is stop codon for forward-strand features, start codon for reverse-strand features).
cds_start (int or None): Start of coding region relative to 5’ end of transcript, in direction of transcript.
cds_end (int or None): End of coding region relative to 5’ end of transcript, in direction of transcript.
spanning_segment (GenomicSegment): A GenomicSegment spanning the endpoints of the Transcript
strand (str): Strand of the SegmentChain
chrom (str): Chromosome the SegmentChain resides on
segments (list): Copy of list of GenomicSegments that comprise self. Changing this list will do nothing to self.
mask_segments (list): Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.
attr (dict): Dictionary of arbitrary attributes

Methods

`add_masks`(self, *mask_segments)	Adds one or more `GenomicSegments` to the collection of masks.
`add_segments`(self, *segments)	Add 1 or more `GenomicSegments` to the `SegmentChain`.
`antisense_overlaps`(self, other)	Returns True if self and other share genomic positions on opposite strands
`as_bed`(self[, as_int, color, extra_columns, ...])	Format self as a BED12[+X] line, assigning CDS boundaries to the thickstart and thickend columns from self.attr
`as_gff3`(self, bool escape=True, ...)	Format a `Transcript` as a block of GFF3 output, following the schema set out in the Sequence Ontology (SO) v2.53
`as_gtf`(self, unicode feature_type=u, ...)	Format self as a GTF2 block.
`as_psl`(self)	Formats `SegmentChain` as PSL (blat) output.
`covers`(self, other)	Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self.
`from_bed`(unicode line[, extra_columns])	Create a `Transcript` from a BED line with 4 or more columns.
`from_psl`(unicode psl_line)
`from_str`(unicode inp)	Create a `SegmentChain` from a string formatted by `SegmentChain.__str__()`:
`get_antisense`(self)	Returns a `SegmentChain` antisense to self, with empty attr dict.
`get_cds`(self, **extra_attr)	Retrieve `SegmentChain` covering the coding region of self, including the stop codon.
`get_counts`(self, ga[, stranded])	Return list of counts or values drawn from ga at each position in self
`get_fasta`(self, genome[, stranded])	Formats sequence of SegmentChain as FASTA output
`get_gene`(self)	Return name of gene associated with `SegmentChain`, if any, by searching through self.attr for the keys gene_id and Parent.
`get_genomic_coordinate`(self, x[, stranded])	Finds genomic coordinate corresponding to position x in self
`get_junctions`(self)	Returns a list of `GenomicSegments` representing spaces between the `GenomicSegments` in self. In the case of a transcript, these would represent introns.
`get_masked_counts`(self, ga[, stranded, copy])	Return counts covering self in dataset ga as a masked array, in transcript coordinates.
`get_masked_position_set`(self)	Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using `SegmentChain.add_masks()`
`get_masks`(self)	Return masked positions as a list of `GenomicSegments`
`get_masks_as_segmentchain`(self)	Return masked positions as a `SegmentChain`
`get_name`(self)	Return the name of self, first searching through self.attr for the keys transcript_id, ID, Name, and name.
`get_position_list`(self)	Retrieve a sorted end-inclusive numpy array of genomic coordinates in this `SegmentChain`
`get_position_set`(self)	Retrieve an end-inclusive set of genomic coordinates included in this `SegmentChain`
`get_segmentchain_coordinate`(self, ...)	Finds the `SegmentChain` coordinate corresponding to a genomic position
`get_sequence`(self, genome[, stranded])	Return spliced genomic sequence of `SegmentChain` as a string
`get_subchain`(self, long start, long end, ...)	Retrieves a sub-`SegmentChain` corresponding to a range of positions specified in coordinates relative to this `SegmentChain`.
`get_unstranded`(self)	Returns an unstranded copy of self (strand ‘.’), with empty attr dict.
`get_utr3`(self, **extra_attr)	Retrieve sub-`SegmentChain` covering 3'UTR of self, excluding the stop codon.
`get_utr5`(self, **extra_attr)	Retrieve sub-`SegmentChain` covering 5'UTR of self.
`next`(self)	Return next `GenomicSegment` in the `SegmentChain`, from left to right on the chromosome
`overlaps`(self, other)	Return True if self and other share genomic positions on the same strand
`reset_masks`(self)	Removes masks added by `add_masks()`
`shares_segments_with`(self, other)	Returns a list of `GenomicSegments` that are shared between self and other
`sort`(self)
`unstranded_overlaps`(self, other)	Return True if self and other share genomic positions on the same chromosome, regardless of their strands

add_masks(self, *mask_segments)¶

Adds one or more GenomicSegments to the collection of masks. Masks will be trimmed to the positions of the SegmentChain during addition.

Parameters

mask_segments (GenomicSegment): One or more segments, in genomic coordinates, covering positions to exclude from return values of get_masked_position_set(), get_masked_counts(), or get_masked_length()

See also

SegmentChain.get_masks
SegmentChain.get_masks_as_segmentchain
SegmentChain.reset_masks

add_segments(self, *segments)¶

Add 1 or more GenomicSegments to the SegmentChain. If there are already segments in the chain, the incoming segments must be on the same strand and chromosome as all others present.

Parameters

segments (GenomicSegment): One or more GenomicSegments to add to the SegmentChain

antisense_overlaps(self, other)¶

Returns True if self and other share genomic positions on opposite strands

Parameters

other (SegmentChain or GenomicSegment): Query feature

Returns

bool: True if self and other share genomic positions on the same chromosome but opposite strand; False otherwise.

Raises

TypeError: if other is not a GenomicSegment or SegmentChain

as_bed(self, as_int=True, color=None, extra_columns=None, empty_value='')¶

Format self as a BED12[+X] line, assigning CDS boundaries to the thickstart and thickend columns from self.attr

If the SegmentChain was imported as a BED file with extra columns, these will be output in the same order, after the BED columns.

Parameters

as_int (bool, optional)

Force “score” to integer (Default: True)

color (str or None, optional)

Color represented as RGB hex string. If not None, overrides the color in self.attr[“color”]

extra_columns (None or list-like, optional)

If None, and the SegmentChain was imported using the extra_columns keyword of from_bed(), the SegmentChain will be exported in BED 12+X format, in which extra columns are in the same order as they were upon import. If no extra columns were present, the SegmentChain will be exported as a BED12 line.

If a list of attribute names, these attributes will be exported as extra columns in order, overriding whatever happened upon import. If an attribute name is not in the attr dict of the SegmentChain, it will be exported with the value of empty_value

If an empty list, no extra columns will be exported; the SegmentChain will be formatted as a BED12 line.

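empty_value (str, optional)

Value exported for any attribute named in extra_columns that is missing from the attr dict of the SegmentChain (Default: ‘’)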

Returns

str: Line of BED12-formatted text

Notes

BED12 columns are as follows

Column	Contains
0	Contig or chromosome
1	Start of first block in feature (0-indexed)
2	End of last block in feature (half-open)
3	Feature name
4	Feature score
5	Strand
6	thickstart
7	thickend
8	Feature color as RGB tuple
9	Number of blocks in feature
10	Block lengths
11	Block starts, relative to start of first block

For more information

See the UCSC file format FAQ
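
To make the column layout concrete, the following rough sketch assembles a BED12 line by hand from a list of blocks. It is not plastid's implementation: `bed12_line` and its parameters are hypothetical, blocks are assumed to be 0-indexed, half-open (start, end) pairs, and thickstart and thickend are assumed to default to the feature start when no CDS is given.

```python
def bed12_line(chrom, strand, blocks, name="feature", score=0,
               thickstart=None, thickend=None, color="0,0,0"):
    """Format blocks as a single tab-delimited BED12 line (illustrative only)."""
    blocks = sorted(blocks)
    start, end = blocks[0][0], blocks[-1][1]
    # Assumption: features without a CDS get thickstart == thickend == start
    if thickstart is None or thickend is None:
        thickstart = thickend = start
    # Columns 10 and 11: block lengths and block starts relative to `start`
    block_sizes = ",".join(str(e - s) for s, e in blocks) + ","
    block_starts = ",".join(str(s - start) for s, _ in blocks) + ","
    columns = [chrom, start, end, name, score, strand,
               thickstart, thickend, color, len(blocks),
               block_sizes, block_starts]
    return "\t".join(str(c) for c in columns)


line = bed12_line("chrA", "-", [(5, 200), (250, 300)], name="some_chain")
fields = line.split("\t")
```

Here `fields[1]` and `fields[2]` hold the start of the first block and the end of the last block, while `fields[10]` and `fields[11]` describe the individual blocks.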

as_gff3(self, bool escape=True, list excludes=None, unicode rna_type=u'mRNA')¶

Format a Transcript as a block of GFF3 output, following the schema set out in the Sequence Ontology (SO) v2.53

The Transcript will be formatted according to the following rules:

A feature of type rna_type will be created, with Parent attribute set to the value of self.get_gene(), and ID attribute set to self.get_name()

For each GenomicSegment in self, a child feature of type exon will be created. The Parent attribute of these features will be set to the value of self.get_name(). These will have unique IDs generated from self.get_name().

If self is coding (i.e. has non-None values for self.cds_genome_start and self.cds_genome_end), child features of type ‘five_prime_UTR’, ‘CDS’, and ‘three_prime_UTR’ will be created, with Parent attributes set to self.get_name(). These will have unique IDs generated from self.get_name().

Parameters

escape (bool, optional): Escape tokens in column 9 of GFF3 output (Default: True)
excludes (list, optional): List of attribute key names to exclude from column 9 (Default: [])
rna_type (str, optional): Feature type to export RNA as (e.g. ‘tRNA’, ‘noncoding_RNA’, etc.) (Default: ‘mRNA’)

Returns

str: Multiline block of GFF3-formatted text

Notes

Beware of attribute loss

This Transcript was assembled from multiple individual component features (e.g. single exons), which may or may not have had their own unique attributes in their original annotation. To reduce overhead, these individual attributes (if they were present) have not been (entirely) stored, and consequently will not (all) be exported. If this poses problems, consider instead importing, modifying, and exporting the component features

GFF3 schemas vary

Different GFF3s have different schemas (parent-child relationships between features). Here we adopt the commonly-used schema set by Sequence Ontology (SO) v2.53, which may or may not match your schema.

Columns of GFF3 are as follows

Column	Contains
1	Contig or chromosome
2	Source of annotation
3	Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4	Start (1-indexed)
5	End (fully-closed)
6	Score
7	Strand
8	Frame. Number of bases within feature before first in-frame codon (if coding)
9	Attributes

For further information, see

GFF3 file format specification
Sequence Ontology (SO) v2.53 <http://www.sequenceontology.org/browser/>
SO releases
UCSC file format FAQ
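
GFF3 columns 4 and 5 are 1-indexed and fully closed, whereas BED coordinates are 0-indexed and half-open. The conversion is small but easy to get wrong, so here is a minimal sketch. The helper names are hypothetical and not part of plastid.

```python
def to_one_based_closed(start, end):
    """Convert a 0-indexed, half-open interval to 1-indexed, fully-closed."""
    return start + 1, end


def to_zero_based_half_open(start, end):
    """Convert a 1-indexed, fully-closed interval to 0-indexed, half-open."""
    return start - 1, end


gff_start, gff_end = to_one_based_closed(5, 200)
round_trip = to_zero_based_half_open(gff_start, gff_end)

# The number of covered positions is the same in both conventions
half_open_length = 200 - 5
closed_length = gff_end - gff_start + 1
```

Only the start coordinate shifts; the end coordinate has the same numeric value in both conventions.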

as_gtf(self, unicode feature_type=u'exon', bool escape=True, list excludes=None)¶

Format self as a GTF2 block. GenomicSegments are formatted as GTF2 ‘exon’ features. Coding regions, if present, are formatted as GTF2 ‘CDS’ features. Stop codons are excluded in the ‘CDS’ features, per the GTF2 specification, and exported separately.

All attributes from self.attr are propagated to the exon and CDS features that are generated.

Parameters

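feature_type (str): If not None, overrides the ‘type’ attribute of self.attr
escape (bool, optional): URL escape tokens in column 9 of GTF2 output (Default: True)
excludes (list, optional): List of attribute key names to exclude from column 9 (Default: [])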

Returns

str: Block of GTF2-formatted text

Notes

gene_id and transcript_id are required: The GTF2 specification requires that attributes gene_id and transcript_id be defined. If these are not present in self.attr, their values will be guessed following the rules in SegmentChain.get_gene() and SegmentChain.get_name(), respectively.
Beware of attribute loss: To save memory, only the attributes shared by all of the individual sub-features (e.g. exons) that were used to assemble this Transcript have been stored in self.attr. This means that upon re-export to GTF2, these sub-features will be lacking any attributes that were specific to them individually. Formally, this is compliant with the GTF2 specification, which states explicitly that only the attributes gene_id and transcript_id are supported.

Columns of GTF2 are as follows:

Column	Contains
1	Contig or chromosome
2	Source of annotation
3	Type of feature (“exon”, “CDS”, “start_codon”, “stop_codon”)
4	Start (1-indexed)
5	End (fully-closed)
6	Score
7	Strand
8	Frame. Number of bases within feature before first in-frame codon (if coding)
9	Attributes. “gene_id” and “transcript_id” are required

For more info

as_psl(self)¶

Formats SegmentChain as PSL (blat) output.

Returns

str: PSL-representation of BLAT alignment

Raises

AttributeError: If not all of the attributes listed in the Notes below are defined

Notes

This will raise an AttributeError unless the following keys are present and defined in self.attr, corresponding to the columns of a PSL file:

Column	Key
1	match_length
2	mismatches
3	rep_matches
4	N
5	query_gap_count
6	query_gap_bases
7	target_gap_count
8	target_gap_bases
9	strand
10	query_name
11	query_length
12	query_start
13	query_end
14	target_name
15	target_length
16	target_start
17	target_end
19	q_starts : list of integers
20	l_starts : list of integers

These keys are defined only if the SegmentChain was created by SegmentChain.from_psl(), or if the user has defined them.

See the PSL spec for more information.

covers(self, other)¶

Return True if self and other share a chromosome and strand, and all genomic positions in other are present in self. By convention, zero-length SegmentChains are not covered by other chains.

Parameters

other (SegmentChain or GenomicSegment): Query feature

Returns

bool: True if self and other share a chromosome and strand, and all genomic positions in other are present in self. Otherwise False

Raises

TypeError: if other is not a GenomicSegment or SegmentChain

static from_bed(unicode line, extra_columns=0)¶

Create a Transcript from a BED line with 4 or more columns. thickstart and thickend columns, if present, are assumed to specify CDS boundaries, a convention that, while common, is formally outside the BED specification.

See the UCSC file format FAQ for more details.

Parameters

line (str)

Line from a BED file with at least 4 columns

extra_columns (int or list, optional)

Extra, non-BED columns in BED file corresponding to feature attributes. This is common in ENCODE-specific BED variants.

If extra_columns is:

an int: it is taken to be the number of attribute columns. Attributes will be stored in the attr dictionary of the SegmentChain, under names like custom0, custom1, … , customN.

a list of str: it is taken to be the names of the attribute columns, in order, from left to right in the file. In this case, attributes in extra columns will be stored under these names in the attr dict.

a list of tuple: each tuple is taken to be a pair of (attribute_name, formatter_func). In this case, the value of attribute_name in the attr dict of the SegmentChain will be set to formatter_func(column_value).

(Default: 0)

Returns

Transcript
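
The three accepted forms of extra_columns can be sketched in plain Python as follows. This is only an illustration of the rules described above, not plastid's parser: `parse_extra_columns` is a hypothetical helper, and it assumes the extra column values have already been split out of the BED line as strings.

```python
def parse_extra_columns(values, extra_columns=0):
    """Map extra BED column values to attribute names (illustrative only)."""
    if isinstance(extra_columns, int):
        # An int gives the number of columns; names are generated
        specs = [("custom%d" % i, str) for i in range(extra_columns)]
    else:
        specs = []
        for item in extra_columns:
            if isinstance(item, tuple):
                # (attribute_name, formatter_func) pair
                specs.append(item)
            else:
                # Bare name: keep the column value as a string
                specs.append((item, str))
    return {name: func(value) for (name, func), value in zip(specs, values)}


raw = ["0.5", "peak"]
by_count = parse_extra_columns(raw, 2)
by_name = parse_extra_columns(raw, ["signal", "label"])
by_formatter = parse_extra_columns(raw, [("signal", float), ("label", str)])
```

The tuple form is the only one that changes the type of the stored value; the other two forms keep the column text as-is in this sketch.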

static from_psl(unicode psl_line)¶

static from_str(unicode inp)¶

Create a SegmentChain from a string formatted by SegmentChain.__str__():

chrom:start-end^start-end(strand)

where ‘^’ indicates a splice junction between regions specified by start and end and strand is ‘+’, ‘-’, or ‘.’. Coordinates are 0-indexed and half-open.

Parameters

inp (str): String formatted in manner of SegmentChain.__str__()

Returns

SegmentChain
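
As an illustration of the string format, the sketch below parses it with a regular expression into plain Python values. It is not plastid's parser: `parse_chain_string` and `CHAIN_PATTERN` are hypothetical names, and the sketch assumes at least one start-end pair is present.

```python
import re

# chrom:start-end^start-end(strand), with one or more start-end pairs
CHAIN_PATTERN = re.compile(
    r"^(?P<chrom>.+):(?P<spans>\d+-\d+(?:\^\d+-\d+)*)\((?P<strand>[+\-.])\)$"
)


def parse_chain_string(text):
    """Return (chrom, [(start, end), ...], strand) parsed from text."""
    match = CHAIN_PATTERN.match(text)
    if match is None:
        raise ValueError("not a chain string: %r" % text)
    segments = []
    for span in match.group("spans").split("^"):
        start, end = span.split("-")
        segments.append((int(start), int(end)))
    return match.group("chrom"), segments, match.group("strand")


chrom, segments, strand = parse_chain_string("chrA:5-200^250-300(-)")
```

Each `^` in the input becomes a boundary between two (start, end) pairs, which remain 0-indexed and half-open.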

get_antisense(self) → SegmentChain¶

Returns a SegmentChain antisense to self, with empty attr dict.

Returns

SegmentChain: SegmentChain antisense to self

get_cds(self, **extra_attr)¶

Retrieve SegmentChain covering the coding region of self, including the stop codon. If no coding region is present, returns an empty SegmentChain.

The following attributes are passed from self.attr to the new SegmentChain

transcript_id, taken from SegmentChain.get_name()

gene_id, taken from SegmentChain.get_gene()

ID, generated as “%s_CDS” % self.get_name()

Parameters

extra_attr (keyword arguments): Values that will be included in the CDS subchain’s attr dict. These can be used to overwrite values already present.

Returns

SegmentChain: CDS region of self if present, otherwise empty SegmentChain

get_counts(self, ga, stranded=True)¶

Return array of counts or values drawn from ga at each position in self

Parameters

ga (GenomeArray): GenomeArray from which to fetch counts
stranded (bool, optional): If True and self is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)

Returns

numpy.ndarray: Array of counts from ga covering self

get_fasta(self, genome, stranded=True)¶

Formats sequence of SegmentChain as FASTA output

Parameters

genome (dict or twobitreader.TwoBitFile): Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects
stranded (bool): If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)

Returns

str: FASTA-formatted sequence of SegmentChain extracted from genome

get_gene(self)¶

Return name of gene associated with SegmentChain, if any, by searching through self.attr for the keys gene_id and Parent. If one is not found, a generated gene name for the SegmentChain is made from get_name().

Returns

str: Returns in order of preference, gene_id from self.attr, Parent from self.attr or 'gene_%s' % self.get_name()
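
The order of preference described above amounts to a simple fallback lookup. A minimal sketch, assuming a plain dict in place of self.attr and a hypothetical helper name `guess_gene_name`:

```python
def guess_gene_name(attr, chain_name):
    """Pick a gene name from attr, falling back to a generated one."""
    for key in ("gene_id", "Parent"):
        if key in attr:
            return attr[key]
    # Neither key is present: build a name from the chain's own name
    return "gene_%s" % chain_name


from_gene_id = guess_gene_name({"gene_id": "G1", "Parent": "P1"}, "tx1")
from_parent = guess_gene_name({"Parent": "P1"}, "tx1")
generated = guess_gene_name({}, "tx1")
```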

get_genomic_coordinate(self, x, stranded=True)¶

Finds genomic coordinate corresponding to position x in self

Parameters

x (int): position of interest, relative to SegmentChain
stranded (bool, optional): If True, x is assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, coordinates are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)

Returns

str: Chromosome name
long: Genomic coordinate corresponding to position x
str: Chromosome strand (‘+’, ‘-’, or ‘.’)
str: Chromosome strand (‘+’, ‘-’, or ‘.’)

Raises

IndexError: if x is outside the bounds of the SegmentChain
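
The mapping from chain coordinates to genomic coordinates can be sketched in a few lines of plain Python. This is a deliberately naive illustration, not plastid's implementation: `genomic_coordinate` is a hypothetical name, segments are 0-indexed, half-open (start, end) pairs, and every position is enumerated, which would be slow for long chains.

```python
def genomic_coordinate(segments, strand, x, stranded=True):
    """Return the genomic position at chain coordinate x (illustrative only)."""
    # All genomic positions in the chain, left to right
    positions = [p for start, end in sorted(segments) for p in range(start, end)]
    if x < 0 or x >= len(positions):
        raise IndexError("position %s is outside the chain" % x)
    if stranded and strand == "-":
        # On the minus strand the 5' end is the rightmost position
        return positions[len(positions) - 1 - x]
    return positions[x]


segments = [(5, 200), (250, 300)]
pos_50 = genomic_coordinate(segments, "-", 50)
pos_49 = genomic_coordinate(segments, "-", 49)
leftmost = genomic_coordinate(segments, "-", 0, stranded=False)
```

With the two segments from the module examples, chain positions 49 and 50 sit on either side of the splice junction, so their genomic coordinates are 250 and 199.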

get_junctions(self)¶

Returns a list of GenomicSegments representing spaces between the GenomicSegments in self. In the case of a transcript, these would represent introns. In the case of an alignment, these would represent gaps in the query compared to the reference.

Returns

list: List of GenomicSegments covering spaces between the intervals in self (e.g. introns in the case of a transcript, or gaps in the case of an alignment)
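
Conceptually, each junction is the gap between one segment's end and the next segment's start. A minimal sketch, assuming sorted-able, non-overlapping, half-open (start, end) pairs and a hypothetical helper name `junctions`:

```python
def junctions(segments):
    """Return the gaps between consecutive segments as (start, end) pairs."""
    ordered = sorted(segments)
    return [
        (left_end, right_start)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    ]


introns = junctions([(250, 300), (5, 200)])
no_introns = junctions([(5, 200)])
```

A chain with N segments yields N - 1 junctions, so a single-segment chain yields an empty list.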

get_masked_counts(self, ga, stranded=True, copy=False)¶

Return counts covering self in dataset ga as a masked array, in transcript coordinates. Positions masked by SegmentChain.add_masks() will be masked in the array.

Parameters

ga (non-abstract subclass of AbstractGenomeArray): GenomeArray from which to fetch counts
stranded (bool, optional): If True and the SegmentChain is on the minus strand, count order will be reversed relative to genome so that the array positions march from the 5’ to 3’ end of the chain. (Default: True)
copy (bool, optional): If False (default), returns a view of the data, so changing values in the view changes the values in the GenomeArray if it is mutable. If True, a copy is returned instead.

Returns

numpy.ma.masked_array

get_masked_position_set(self) → set¶

Returns a set of genomic coordinates corresponding to positions in self that HAVE NOT been masked using SegmentChain.add_masks()

Returns

set: Set of genomic coordinates, as integers
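
The result can be thought of as a set difference between the chain's positions and the masked positions. A minimal sketch in plain Python, with a hypothetical helper name `unmasked_positions` and half-open (start, end) pairs standing in for GenomicSegments:

```python
def unmasked_positions(segments, masks):
    """Return positions covered by segments but not by any mask."""
    chain = {p for start, end in segments for p in range(start, end)}
    masked = {p for start, end in masks for p in range(start, end)}
    # Mask positions outside the chain have no effect on the result
    return chain - masked


remaining = unmasked_positions([(10, 20)], [(15, 30)])
```

The mask in this example extends past the end of the chain; only the part inside the chain removes positions, which mirrors the trimming of masks described for add_masks().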

get_masks(self)¶

Return masked positions as a list of GenomicSegments

Returns

list: list of GenomicSegments representing masked positions

See also

SegmentChain.get_masks_as_segmentchain
SegmentChain.add_masks
SegmentChain.reset_masks

get_masks_as_segmentchain(self)¶

Return masked positions as a SegmentChain

Returns

SegmentChain: Masked positions

See also

SegmentChain.get_masks
SegmentChain.add_masks
SegmentChain.reset_masks

get_name(self)¶

Return the name of self, first searching through self.attr for the keys transcript_id, ID, Name, and name. If no value is found, Transcript.__str__() is used.

Returns

str: Returns in order of preference, transcript_id, ID, Name, or name from self.attr. If not found, returns str(self)

get_position_list(self)¶

Retrieve a sorted end-inclusive numpy array of genomic coordinates in this SegmentChain

Returns

list: Genomic coordinates in self, as integers, in genomic order

get_position_set(self)¶

Retrieve an end-inclusive set of genomic coordinates included in this SegmentChain

Returns

set: Set of genomic coordinates, as integers

get_segmentchain_coordinate(self, unicode chrom, long genomic_x, unicode strand, bool stranded=True)¶

Finds the SegmentChain coordinate corresponding to a genomic position

Parameters

chrom (str): Chromosome name
genomic_x (int): coordinate, in genomic space
strand (str): Chromosome strand (‘+’, ‘-’, or ‘.’)
stranded (bool, optional): If True, coordinates are given in stranded space (i.e. from 5’ end of chain, as one might expect for a transcript). If False, coordinates are given from the left end of self, regardless of strand. (Default: True)

Returns

int: Position in SegmentChain

Raises

KeyError: if position outside bounds of SegmentChain
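
The reverse mapping, from a genomic position back to a chain coordinate, can be sketched the same naive way. This is not plastid's implementation: `chain_coordinate` is a hypothetical name, chromosome and strand matching are omitted for brevity, and segments are half-open (start, end) pairs.

```python
def chain_coordinate(segments, strand, genomic_x, stranded=True):
    """Return the chain coordinate of genomic position genomic_x."""
    positions = [p for start, end in sorted(segments) for p in range(start, end)]
    if genomic_x not in positions:
        # Mirrors the documented KeyError for positions outside the chain
        raise KeyError(genomic_x)
    index = positions.index(genomic_x)
    if stranded and strand == "-":
        # Count from the rightmost position, which is the 5' end
        return len(positions) - 1 - index
    return index


segments = [(5, 200), (250, 300)]
coord_199 = chain_coordinate(segments, "-", 199)
coord_250 = chain_coordinate(segments, "-", 250)
```

Positions that fall in the gap between segments, such as 225 here, are not part of the chain and raise KeyError in this sketch.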

get_sequence(self, genome, stranded=True)¶

Return spliced genomic sequence of SegmentChain as a string

Parameters

genome (dict or twobitreader.TwoBitFile): Dictionary mapping chromosome names to sequences. Sequences may be strings, string-like, or Bio.Seq.SeqRecord objects
stranded (bool): If True and the SegmentChain is on the minus strand, sequence will be reverse-complemented (Default: True)

Returns

str: Nucleotide sequence of the SegmentChain extracted from genome

get_subchain(self, long start, long end, bool stranded=True, **extra_attr)¶

Retrieves a sub-SegmentChain corresponding to a range of positions specified in coordinates relative to this SegmentChain. Attributes in self.attr are copied to the child SegmentChain, with the exception of ID, to which the suffix ‘subchain’ is appended.

Parameters

start (int): position of interest in SegmentChain coordinates, 0-indexed
end (int): position of interest in SegmentChain coordinates, 0-indexed and half-open
stranded (bool, optional): If True, start and end are assumed to be in stranded space (i.e. counted from 5’ end of chain, as one might expect for a transcript). If False, they are assumed to be counted from the left end of self, regardless of the strand of self. (Default: True)
extra_attr (keyword arguments): Values that will be included in the subchain’s attr dict. These can be used to overwrite values already present.

Returns

SegmentChain: covering parent chain positions start to end of self

Raises

IndexError: if start or end is outside the bounds of the SegmentChain
TypeError: if start or end is None

get_unstranded(self) → SegmentChain¶

Returns an unstranded copy of self (strand ‘.’), with empty attr dict.

Returns

SegmentChain: Unstranded SegmentChain covering the same positions as self

get_utr3(self, **extra_attr)¶

Retrieve sub-SegmentChain covering 3’ UTR of self, excluding the stop codon. If no coding region is present, returns an empty SegmentChain.

The following attributes are passed from self.attr to the new SegmentChain

transcript_id, taken from SegmentChain.get_name()

gene_id, taken from SegmentChain.get_gene()

ID, generated as “%s_3UTR” % self.get_name()

Parameters

extra_attr (keyword arguments): Values that will be included in the 3’ UTR subchain’s attr dict. These can be used to overwrite values already present.

Returns

SegmentChain: 3’ UTR region of self if present, otherwise empty SegmentChain

get_utr5(self, **extra_attr)¶

Retrieve sub-SegmentChain covering 5’ UTR of self. If no coding region is present, returns an empty SegmentChain.

The following attributes are passed from self.attr to the new SegmentChain

transcript_id, taken from SegmentChain.get_name()

gene_id, taken from SegmentChain.get_gene()

ID, generated as “%s_5UTR” % self.get_name()

Parameters

extra_attr (keyword arguments): Values that will be included in the 5’ UTR subchain’s attr dict. These can be used to overwrite values already present.

Returns

SegmentChain: 5’ UTR region of self if present, otherwise empty SegmentChain
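
In transcript coordinates, the 5’ UTR, CDS, and 3’ UTR partition the transcript at cds_start and cds_end. The sketch below illustrates that relationship with plain half-open ranges. It is an assumption-laden illustration rather than plastid code: `split_transcript` is a hypothetical helper, and cds_end is assumed to already include the stop codon, as described for get_cds().

```python
def split_transcript(length, cds_start, cds_end):
    """Return (utr5, cds, utr3) as half-open ranges in transcript coordinates.

    For a non-coding transcript (cds_start or cds_end is None), all three
    regions are returned as None, standing in for empty chains.
    """
    if cds_start is None or cds_end is None:
        return None, None, None
    return (0, cds_start), (cds_start, cds_end), (cds_end, length)


utr5, cds, utr3 = split_transcript(1000, 100, 400)
noncoding = split_transcript(1000, None, None)

# The three regions tile the transcript without gaps or overlaps
total = sum(end - start for start, end in (utr5, cds, utr3))
```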

next(self)¶: Return next GenomicSegment in the SegmentChain, from left to right on the chromosome

overlaps(self, other)¶

Return True if self and other share genomic positions on the same chromosome and strand

Parameters

other (SegmentChain or GenomicSegment): Query feature

Returns

bool: True if self and other share genomic positions on the same chromosome and strand; False otherwise.

Raises

TypeError: if other is not a GenomicSegment or SegmentChain

reset_masks(self)¶

Removes masks added by add_masks()

See also

SegmentChain.add_masks

shares_segments_with(self, other)¶

Returns a list of GenomicSegments that are shared between self and other

Parameters

other (SegmentChain or GenomicSegment): Query feature

Returns

list: List of GenomicSegments common to self and other

Raises

TypeError: if other is not a GenomicSegment or SegmentChain

sort(self)¶

unstranded_overlaps(self, other)¶

Return True if self and other share genomic positions on the same chromosome, regardless of their strands

Parameters

other (SegmentChain or GenomicSegment): Query feature

Returns

bool: True if self and other share genomic positions on the same chromosome, False otherwise. Strands of self and other need not match

Raises

TypeError: if other is not a GenomicSegment or SegmentChain

attr¶: dict

c_strand¶

cds_end¶: End of coding region relative to 5’ end of transcript, in direction of transcript. Setting to None also sets self.cds_start, self.cds_genome_start and self.cds_genome_end to None

cds_genome_end¶: Ending coordinate of coding region, relative to genome (i.e. rightmost; is stop codon for forward-strand features, start codon for reverse-strand features). Setting to None also sets self.cds_start, self.cds_end, and self.cds_genome_start to None

cds_genome_start¶: Starting coordinate of coding region, relative to genome (i.e. leftmost; is start codon for forward-strand features, stop codon for reverse-strand features). Setting to None also sets self.cds_start, self.cds_end, and self.cds_genome_end to None

cds_start¶: Start of coding region relative to 5’ end of transcript, in direction of transcript. Setting to None also sets self.cds_end, self.cds_genome_start and self.cds_genome_end to None

chrom¶: Chromosome the SegmentChain resides on

length¶

mask_segments¶: Copy of list of GenomicSegments representing regions masked in self. Changing this list will do nothing to the masks in self.

masked_length¶

segments¶: Copy of list of GenomicSegments that comprise self. Changing this list will do nothing to self.

spanning_segment¶

strand¶: Strand of the SegmentChain

plastid.genomics.roitools.add_three_for_stop_codon(Transcript tx) → Transcript¶

Extend an annotated CDS region, if present, by three nucleotides at the 3’ end. Use in cases when annotation files exclude the stop codon from the annotated CDS.

Parameters

tx (Transcript): query transcript

Returns

Transcript: Transcript with same attributes as tx, but with CDS extended by one codon

Raises

IndexError: if a three prime UTR is defined that terminates before the complete stop codon
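
In transcript coordinates the operation is a three-nucleotide shift of the CDS end, with a bounds check. The following is a simplified sketch under that assumption, not the library's implementation. `extend_cds_for_stop_codon` is a hypothetical helper that works on plain integers rather than on a Transcript.

```python
def extend_cds_for_stop_codon(length, cds_start, cds_end):
    """Return (cds_start, cds_end) with the CDS extended by one codon.

    length, cds_start, and cds_end are in transcript coordinates, with
    cds_end half-open. Non-coding input is returned unchanged.
    """
    if cds_start is None or cds_end is None:
        return cds_start, cds_end
    new_end = cds_end + 3
    if new_end > length:
        # The transcript ends before a complete stop codon
        raise IndexError("transcript too short to hold the stop codon")
    return cds_start, new_end


extended = extend_cds_for_stop_codon(1000, 100, 397)
unchanged = extend_cds_for_stop_codon(1000, None, None)
```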

plastid.genomics.roitools.merge_segments(list segments) → list¶

Merge all overlapping GenomicSegments in segments, so that all segments returned are guaranteed to be sorted and non-overlapping.

Note

All segments are assumed to be on the same strand and chromosome.

Parameters

segments (list): List of GenomicSegments, all on the same strand and chromosome

Returns

list: List of sorted, non-overlapping GenomicSegments
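
The merge can be done in a single pass over the sorted segments. The sketch below uses half-open (start, end) pairs and a hypothetical helper name `merge_intervals`. It merges only segments that truly overlap and leaves directly adjacent ones separate, which is an assumption and may differ from what the library does.

```python
def merge_intervals(segments):
    """Merge overlapping half-open (start, end) pairs into a sorted list."""
    merged = []
    for start, end in sorted(segments):
        if merged and start < merged[-1][1]:
            # Overlaps the previous interval: extend it if needed
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged


result = merge_intervals([(250, 300), (5, 200), (150, 220)])
```

Sorting first means each incoming segment only has to be compared with the most recently merged interval.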

plastid.genomics.roitools.positionlist_to_segments(unicode chrom, unicode strand, list positions) → list¶

Construct GenomicSegments from a chromosome name, a strand, and a list of chromosomal positions.

Parameters

chrom (str): Chromosome name
strand (str): Chromosome strand (‘+’, ‘-’, or ‘.’)
positions (list of unique integers): Sorted, end-inclusive list of positions to include in final GenomicSegments

Returns

list: List of GenomicSegments covering positions

Warning

This function is meant to run quickly, without excessive type conversions, so the elements of positions must be UNIQUE and SORTED. If they are not, use positions_to_segments() instead.

plastid.genomics.roitools.positions_to_segments(unicode chrom, unicode strand, positions) → list¶

Construct GenomicSegments from a chromosome name, a strand, and a list of chromosomal positions.

Parameters

chrom (str): Chromosome name
strand (str): Chromosome strand (‘+’, ‘-’, or ‘.’)
positions (list of integers): End-inclusive list, tuple, or set of positions to include in final GenomicSegments

Returns

list: List of GenomicSegments covering positions
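
The grouping of positions into contiguous segments can be sketched as follows. This is a plain-Python illustration, not plastid's implementation: `positions_to_intervals` is a hypothetical name, and the output uses half-open (start, end) pairs in place of GenomicSegments. Duplicates and unsorted input are tolerated here by sorting a set first.

```python
def positions_to_intervals(positions):
    """Group integer positions into sorted half-open (start, end) pairs."""
    intervals = []
    for p in sorted(set(positions)):
        if intervals and p == intervals[-1][1]:
            # p directly follows the current interval: extend it by one
            intervals[-1] = (intervals[-1][0], p + 1)
        else:
            intervals.append((p, p + 1))
    return intervals


grouped = positions_to_intervals([3, 1, 2, 7, 8, 2])
```

A function that skips the `sorted(set(...))` step would be faster, but it would then require unique, sorted input, which is the trade-off described for positionlist_to_segments().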
