plastid.bin.slidejuncs module

Compare splice junctions discovered in a dataset to those in an annotation of known splice junctions, amending misplaced junctions, and identifying junctions that fall within repetitive areas of the genome.

Known splice junctions can be misidentified as novel or non-canonical junctions when the intronic sequence immediately downstream of the 5’ splice site exactly matches the exonic sequence immediately downstream of the 3’ splice site (or, equivalently, when the exonic sequence immediately upstream of the 5’ splice site matches the intronic sequence immediately upstream of the 3’ splice site). In that case, the junction point could be placed anywhere in this locally repeated region with equal support from sequencing data. For example, suppose we have a splice junction as follows:

            Exon 1 [0,6)            Intron                                  Exon 2 [16,24)
            ---------------------   --------------------------------------  ------------------------------
Sequence    G   C   T   C   T   A   C   T   A   G   N   N   N   C   T   A   C   T   A   G   A   T   G   G
Position    0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23
Repeated                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this case, the splice junction could be moved three bases to the left or four bases to the right without losing consistency with the sequence of any cDNA or read alignment covering the junction.
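The shift limits quoted above can be checked with a short sketch. This is not plastid's implementation: `equal_support_range` is a hypothetical helper that works on a plain string, uses the 0-based, half-open coordinates of the diagram, and treats `N` as an ordinary character.

```python
def equal_support_range(seq, intron_start, intron_end, maxslide=10):
    """Return (minus_range, plus_range) for the intron seq[intron_start:intron_end].

    Hypothetical helper for illustration; coordinates are 0-based, half-open.
    """
    # Sliding left by one base moves the last exon-1 base into the intron and
    # the last intron base into exon 2. The spliced sequence is unchanged only
    # if those two bases are identical.
    minus = 0
    while (minus < maxslide
           and intron_start - minus - 1 >= 0
           and seq[intron_start - minus - 1] == seq[intron_end - minus - 1]):
        minus += 1

    # Sliding right by one base moves the first intron base into exon 1 and
    # the first exon-2 base into the intron; again the two bases must match.
    plus = 0
    while (plus < maxslide
           and intron_end + plus < len(seq)
           and seq[intron_start + plus] == seq[intron_end + plus]):
        plus += 1

    return -minus, plus


# Sequence from the diagram: exon 1 [0,6), intron [6,16), exon 2 [16,24)
seq = "GCTCTACTAGNNNCTACTAGATGG"
print(equal_support_range(seq, 6, 16))  # (-3, 4)
```

Every shift in that window yields the same spliced sequence, `seq[:start] + seq[end:]`, which is why read alignments cannot distinguish between the placements.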

To identify this and other causes of false positive splice junction calls, the following operations are performed on each query junction:

  1. If a mask file from crossmap is provided, junctions in which the 5’ splice site, the 3’ splice site, or both fall in a repetitive region of the genome are flagged as non-informative and written to a separate file.

  2. For the remaining splice junctions, the extent of locally repeated nucleotide sequence, if any, surrounding the query junction’s splice donor and acceptor sites is determined in both the 5’ and 3’ directions.

    This is the maximum window (equal-support region) across which the actual splice junction could be moved without reducing sequence support.

  3. If one or more known splice junctions fall within this region, the query junction is assumed to match them, and the known junctions are reported rather than the query.

  4. If (3) is not satisfied, and the query junction is a canonical splice junction, it is reported as is.

  5. If (3) is not satisfied, and the query junction represents a non-canonical splice junction, the program determines whether one or more canonical splice junctions are present in the equal-support region. If so, these canonical splice junctions are reported rather than the query junction.

  6. If (5) is not satisfied, the non-canonical query junction is reported as-is.
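The order of the checks above can be summarized in a small sketch. It is an illustration of the decision cascade only, not the script's code: `classify_junction` and its arguments are hypothetical, and the inputs (whether the junction is repetitive, which known or canonical junctions lie in the equal-support region) are assumed to have been computed beforehand. The returned labels mirror the suffixes of the output files described below.

```python
def classify_junction(query, in_repetitive, known_in_range,
                      is_canonical, canonicals_in_range):
    """Return (category, junctions_to_report) following steps 1-6.

    Hypothetical helper; `query` can be any object describing the junction.
    """
    if in_repetitive:                 # step 1: splice site in a masked region
        return "repetitive", [query]
    if known_in_range:                # step 3: known junction(s) in the window
        return "shifted_known", list(known_in_range)
    if is_canonical:                  # step 4: canonical query, keep as-is
        return "untouched", [query]
    if canonicals_in_range:           # step 5: canonical junction(s) in window
        return "shifted_canonical", list(canonicals_in_range)
    return "untouched", [query]       # step 6: non-canonical, nothing better


# A non-canonical query with one canonical junction in its equal-support region
print(classify_junction("query", False, [], False, ["canonical_1"]))
# ('shifted_canonical', ['canonical_1'])
```

Because the known-junction check comes before the canonical checks, a known junction in the window takes precedence even when the query is itself canonical.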

Output files

The following files are written, where OUTBASE is a string supplied by the user. Scores of splice junctions, if present in the input, are ignored. Each record in each BED file represents a single exon-exon junction, rather than a transcript:

OUTBASE_repetitive.bed
    Splice junctions in which one or more of the splice sites lands in a repetitive/degenerate region of the genome, which gives rise to mapping ambiguities (step 1 above)

OUTBASE_shifted_known.bed
    The result of shifting query splice junctions to known splice junctions with equal sequence support (step 3 above)

OUTBASE_shifted_canonical.bed
    The result of shifting non-canonical query splice junctions to canonical splice junctions with equal sequence support (step 5 above)

OUTBASE_untouched.bed
    Query junctions reported without changes (steps 4 and 6 above)

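Since each record describes one exon-exon junction, the intron coordinates can be recovered from the block columns of a record. The sketch below assumes the records are standard 12-column BED lines with exactly two blocks; `junction_from_bed12` is a hypothetical helper, not part of plastid.

```python
def junction_from_bed12(line):
    """Return (chrom, strand, intron_start, intron_end) from a two-block BED12 line.

    Hypothetical helper; coordinates are 0-based, half-open, as in BED.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, chrom_start, strand = fields[0], int(fields[1]), fields[5]
    block_count = int(fields[9])
    # blockSizes and blockStarts may carry a trailing comma in BED files
    block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    if block_count != 2:
        raise ValueError("expected a two-block (single-junction) record")
    # The intron runs from the end of the first block to the start of the second
    intron_start = chrom_start + block_starts[0] + block_sizes[0]
    intron_end = chrom_start + block_starts[1]
    return chrom, strand, intron_start, intron_end


# The junction from the diagram above, on a made-up chromosome name
record = "chrI\t0\t24\tjunc1\t0\t+\t0\t24\t0,0,0\t2\t6,8\t0,16"
print(junction_from_bed12(record))  # ('chrI', '+', 6, 16)
```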


Command-line arguments

Positional arguments

Argument    Description
input.bed   BED file describing discovered junctions
outbase     Basename for output files

Optional arguments

Argument             Description
-h, --help           Show this help message and exit
--maxslide MAXSLIDE  Maximum number of nt to search 5’ and 3’ of intron boundaries (Default: 10)
--ref ref.bed        Reference file describing known splice junctions
--slide_canonical    Slide junctions to canonical junctions if present within equal support region

Warning/error options

Argument       Description
-q, --quiet    Suppress all warning messages. Cannot use with ‘-v’.
-v, --verbose  Increase verbosity. With ‘-v’, show every warning. With ‘-vv’, turn warnings into exceptions. Cannot use with ‘-q’. (Default: show each type of warning once)

Sequence options

Argument                                                          Description
--sequence_file infile.[fasta | fastq | twobit | genbank | embl]  A file of DNA sequence
--sequence_format {fasta,fastq,twobit,genbank,embl}               Format of sequence_file (Default: fasta).

Mask file options (optional)

Add mask file(s) that annotate regions that should be excluded from analyses (e.g. repetitive genomic regions).

Argument Description
--mask_annotation_files  infile.[BED | BigBed | GTF2 | GFF3 | PSL] [infile.[BED | BigBed | GTF2 | GFF3 | PSL] ...] Zero or more annotation files (max 1 file if BigBed)
--mask_annotation_format  {BED,BigBed,GTF2,GFF3,PSL} Format of mask_annotation_files (Default: GTF2). Note: GFF3 assembly assumes SO v.2.5.2 feature ontologies, which may or may not match your specific file.
--mask_add_three If supplied, coding regions will be extended by 3 nucleotides at their 3’ ends (except for GTF2 files that explicitly include stop_codon features). Use if your annotation file excludes stop codons from CDS.
--mask_tabix mask_annotation_files are tabix-compressed and indexed (Default: False). Ignored for BigBed files.
--mask_sorted mask_annotation_files are sorted by chromosomal position (Default: False)

Bed-specific options

Argument Description
--mask_bed_extra_columns  MASK_BED_EXTRA_COLUMNS [MASK_BED_EXTRA_COLUMNS ...] Number of extra columns in BED file (e.g. in custom ENCODE formats) or list of names for those columns. (Default: 0).

Bigbed-specific options

Argument Description
--mask_maxmem  MASK_MAXMEM Maximum desired memory footprint in MB to devote to BigBed/BigWig files. May be exceeded by large queries. (Default: 0, No maximum)

Gff3-specific options

Argument Description
--mask_gff_transcript_types  MASK_GFF_TRANSCRIPT_TYPES [MASK_GFF_TRANSCRIPT_TYPES ...] GFF3 feature types to include as transcripts, even if no exons are present (for GFF3 only; default: use SO v2.5.3 specification)
--mask_gff_exon_types  MASK_GFF_EXON_TYPES [MASK_GFF_EXON_TYPES ...] GFF3 feature types to include as exons (for GFF3 only; default: use SO v2.5.3 specification)
--mask_gff_cds_types  MASK_GFF_CDS_TYPES [MASK_GFF_CDS_TYPES ...] GFF3 feature types to include as CDS (for GFF3 only; default: use SO v2.5.3 specification)

Script contents

plastid.bin.slidejuncs.covered_by_repetitive(query_junc, minus_range, plus_range, cross_hash)

Determine whether one or both ends of a splice junction overlap with a repetitive area of the genome.

Parameters:
query_junc : SegmentChain

A two-exon fragment representing a query splice junction

minus_range : int <= 0

Maximum number of nucleotides the splice junction could be moved to the left without reducing sequence support for the junction (see find_match_range())

plus_range : int >= 0

Maximum number of nucleotides the splice junction could be moved to the right without reducing sequence support for the junction (see find_match_range())

cross_hash : GenomeHash

GenomeHash of 1-length features denoting repetitive regions of the genome

Returns:
bool

True if any of the genomic positions within minus_range…plus_range of the 5’ or 3’ splice sites of query_junc overlap a repetitive region of the genome as annotated by cross_hash. Otherwise, False
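The test described here can be pictured with a simplified stand-in. The sketch below is not the plastid function: it replaces the SegmentChain with two intron coordinates and the GenomeHash with a plain set of repetitive positions on a single chromosome, and which exact base counts as each "splice site" is an assumption.

```python
def covered_by_repetitive_sketch(intron_start, intron_end,
                                 minus_range, plus_range,
                                 repetitive_positions):
    """Return True if either intron boundary can land on a repetitive position.

    Hypothetical stand-in: `repetitive_positions` is a set of 0-based
    positions on one chromosome, used in place of a GenomeHash.
    """
    # Walk every offset in the equal-support window, inclusive at both ends
    for offset in range(minus_range, plus_range + 1):
        if (intron_start + offset in repetitive_positions
                or intron_end + offset in repetitive_positions):
            return True
    return False


# Junction from the module example: intron [6,16), window -3..+4
print(covered_by_repetitive_sketch(6, 16, -3, 4, {20}))  # True
print(covered_by_repetitive_sketch(6, 16, -3, 4, {21}))  # False
```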

plastid.bin.slidejuncs.find_canonicals_in_range(query_junc, minus_range, plus_range, genome, canonicals)

Find any canonical splice junctions within minus_range…plus_range of query_junc

To be classified as within the range, the boundaries of the canonical junction must be:

  1. within minus_range…plus_range of the boundaries of the discovered junction.
  2. separated by a nucleotide distance equal to the distance separating the junction in query_junc.
  3. on the same chromosome and strand.
Parameters:
query_junc : SegmentChain

A two-exon fragment representing a query splice junction

minus_range : int <= 0

Maximum number of nucleotides the splice junction could be moved to the left without reducing sequence support for the junction (see find_match_range())

plus_range : int >= 0

Maximum number of nucleotides the splice junction could be moved to the right without reducing sequence support for the junction (see find_match_range())

genome : dict

dict mapping chromosome names to Bio.SeqRecord.SeqRecord objects

canonicals : list

Dinucleotide sequences to consider as canonical splice sites, as a list of tuples, e.g. [("GT","AG"), ("GC","AG")]

Returns:
list

List of SegmentChains representing canonical splice junctions in minus_range…plus_range of query_junc
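The scan described above can be sketched on a plain string. This is a simplified illustration rather than the plastid function: `canonicals_in_range_sketch` is a hypothetical name, the genome is a single plus-strand string instead of a dict of SeqRecords, and junctions are returned as (intron_start, intron_end) tuples instead of SegmentChains. Shifting both boundaries by the same offset keeps the intron length constant, which satisfies criterion 2.

```python
def canonicals_in_range_sketch(seq, intron_start, intron_end,
                               minus_range, plus_range,
                               canonicals=(("GT", "AG"), ("GC", "AG"))):
    """Return intron coordinates in the window whose ends are canonical.

    Hypothetical, plus-strand-only helper; coordinates are 0-based, half-open.
    """
    hits = []
    for offset in range(minus_range, plus_range + 1):
        start, end = intron_start + offset, intron_end + offset
        if start < 0 or end > len(seq):
            continue  # shifted intron would fall off the sequence
        donor = seq[start:start + 2]   # first two intronic bases
        acceptor = seq[end - 2:end]    # last two intronic bases
        if (donor, acceptor) in canonicals:
            hits.append((start, end))
    return hits


# Made-up sequence: the true intron GTAATTTCAG spans [5,15), but the query
# was called at [3,13), whose equal-support window is -1..+5.
seq = "ACCAGGTAATTTCAGGTACC"
print(canonicals_in_range_sketch(seq, 3, 13, -1, 5))  # [(5, 15)]
```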

plastid.bin.slidejuncs.find_known_in_range(query_junc, minus_range, plus_range, knownjunctions)

Find any known splice junctions within minus_range…plus_range of query_junc

To be classified as within the range, the boundaries of a known junction must be:

  1. within minus_range…plus_range of the boundaries of the discovered junction.
  2. separated by a nucleotide distance equal to the distance separating the junction in query_junc.
  3. on the same chromosome and strand.
Parameters:
query_junc : SegmentChain

A two-exon fragment representing a query splice junction

minus_range : int <= 0

Maximum number of nucleotides the splice junction could be moved to the left without reducing sequence support for the junction (see find_match_range())

plus_range : int >= 0

Maximum number of nucleotides the splice junction could be moved to the right without reducing sequence support for the junction (see find_match_range())

knownjunctions : list of SegmentChains

known splice junctions

Returns:
list

List of SegmentChains representing known splice junctions in minus_range…plus_range of query_junc
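The three criteria listed above translate directly into a filter. The sketch below is a simplified illustration, not the plastid function: `known_in_range_sketch` is a hypothetical name, and each junction is represented as a (chrom, strand, intron_start, intron_end) tuple rather than a SegmentChain.

```python
def known_in_range_sketch(query, minus_range, plus_range, known_junctions):
    """Return known junctions that lie in the equal-support window of `query`.

    Hypothetical helper; junctions are (chrom, strand, start, end) tuples
    with 0-based, half-open intron coordinates.
    """
    chrom, strand, q_start, q_end = query
    matches = []
    for known in known_junctions:
        k_chrom, k_strand, k_start, k_end = known
        same_locus = (k_chrom, k_strand) == (chrom, strand)    # criterion 3
        same_length = (k_end - k_start) == (q_end - q_start)   # criterion 2
        offset = k_start - q_start
        in_window = minus_range <= offset <= plus_range        # criterion 1
        if same_locus and same_length and in_window:
            matches.append(known)
    return matches


query = ("chrI", "+", 6, 16)
known = [
    ("chrI", "+", 3, 13),   # shifted 3 nt left, same length: match
    ("chrI", "+", 10, 21),  # different intron length
    ("chrI", "-", 6, 16),   # wrong strand
    ("chrI", "+", 11, 21),  # shifted 5 nt right, outside the window
]
print(known_in_range_sketch(query, -3, 4, known))  # [('chrI', '+', 3, 13)]
```

When the intron lengths are equal, checking the offset of one boundary is enough, because the other boundary is then shifted by the same amount.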

plastid.bin.slidejuncs.find_match_range(seg, genome, maxslide)

Find maximum distance over which a splice junction can be moved up- or down-stream without reducing sequencing support for that junction.

In other words, find locally repeated sequences surrounding exon-intron or intron-exon boundaries that can cause splice junction mapping to be ambiguous, due to identical and repeated sequence.

Parameters:
seg : SegmentChain

A two-exon fragment representing a query splice junction

genome : dict

dict mapping chromosome names to Bio.SeqRecord.SeqRecord objects

maxslide : int

Maximum number of nucleotides from the boundary over which to check for extent of repeated sequence

Returns:
minus_range : int

Maximum number of nucleotides splice junction point could be moved to the left without reducing sequence support for the junction

plus_range : int

Maximum number of nucleotides splice junction point could be moved to the right without reducing sequence support for the junction

plastid.bin.slidejuncs.main(argv=sys.argv[1:])

Command-line program

Parameters:
argv : list, optional

A list of command-line arguments, which will be processed as if the script were called from the command line if main() is called directly.

Default: sys.argv[1:] (the command-line arguments, if the script is invoked from the command line)
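Because main() accepts an argument list, a run can be assembled programmatically. The sketch below only builds such a list from the options documented above; `build_slidejuncs_argv` is a hypothetical helper and all file names are placeholders. The call to main() is left as a comment because it requires plastid and real input files.

```python
def build_slidejuncs_argv(input_bed, outbase, sequence_file,
                          ref_bed=None, maxslide=10,
                          slide_canonical=False, sequence_format="fasta"):
    """Assemble an argument list using only options listed in this document.

    Hypothetical helper for illustration.
    """
    argv = [
        input_bed, outbase,
        "--sequence_file", sequence_file,
        "--sequence_format", sequence_format,
        "--maxslide", str(maxslide),
    ]
    if ref_bed is not None:
        argv += ["--ref", ref_bed]         # annotation of known junctions
    if slide_canonical:
        argv.append("--slide_canonical")   # flag takes no value
    return argv


argv = build_slidejuncs_argv("found.bed", "out", "genome.fa",
                             ref_bed="known.bed", slide_canonical=True)
print(argv)

# With plastid installed and the placeholder files replaced by real ones,
# the list could then be handed to the script's entry point:
#     from plastid.bin.slidejuncs import main
#     main(argv)
```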