plastid.readers.gff_tokens module

This module contains functions for escaping, unescaping, and parsing tokens from the ninth column of GTF2 and GFF3 files.

Important functions

make_GTF2_tokens()
Format a dictionary of attributes as GTF2 column 9 attributes
make_GFF3_tokens()
Format a dictionary of attributes as GFF3 column 9 attributes
parse_GTF2_tokens()
Parse GTF2 column 9 tokens into a dictionary of key-value pairs
parse_GFF3_tokens()
Parse GFF3 column 9 tokens into a dictionary of key-value pairs

plastid.readers.gff_tokens.escape(inp, char_pairs)[source]

Escape reserved characters specified in the list of tuples char_pairs

Parameters:
inp : str

Input string

char_pairs : list

List of tuples of (character, escape sequence for character)

Returns:
str

Escaped output

See also

unescape_GFF3
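As a rough illustration of how a char_pairs list can drive escaping, the sketch below applies each (character, escape sequence) pair in list order using str.replace. The name escape_sketch and the example pairs are hypothetical and are not taken from plastid's source. The percent sign is listed first on the assumption that every escape sequence itself begins with '%'.

```python
def escape_sketch(inp, char_pairs):
    """Replace each reserved character with its escape sequence, in list order."""
    out = inp
    for char, esc in char_pairs:
        out = out.replace(char, esc)
    return out


# Hypothetical pairs. '%' goes first so that the '%' introduced by later
# replacements is not escaped a second time.
example_pairs = [("%", "%25"), (";", "%3B"), ("=", "%3D")]

escaped = escape_sketch("a=b;100%", example_pairs)
print(escaped)  # a%3Db%3B100%25
```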

plastid.readers.gff_tokens.escape_GFF3(inp)[source]

Escape reserved characters in GFF3 tokens using percentage notation.

In the GFF3 spec, reserved characters include:

  • control characters (ASCII 0-32, 127, and 128-159)
  • tab, newline, & carriage return
  • semicolons & commas
  • the percent sign
  • the equals sign
  • the ampersand

Parameters:
inp : str

Input string

Returns:
str

Escaped output

See also

unescape_GFF3
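To make "percentage notation" concrete, the sketch below derives an escape sequence for a few of the reserved characters listed above by formatting each character's code point as '%' followed by two hexadecimal digits. The names RESERVED_SKETCH, GFF3_PAIRS_SKETCH, and escape_gff3_sketch are hypothetical, and the character set is deliberately incomplete; this is not the table plastid uses internally.

```python
# Subset of the reserved characters listed above. Control characters other
# than tab, newline, and carriage return are left out for brevity.
# '%' comes first so that already-escaped text is not escaped twice.
RESERVED_SKETCH = "%\t\n\r;,=&"

# "%%%02X" renders a code point as '%' plus two uppercase hex digits.
GFF3_PAIRS_SKETCH = [(c, "%%%02X" % ord(c)) for c in RESERVED_SKETCH]


def escape_gff3_sketch(inp):
    """Escape the characters in RESERVED_SKETCH using percent notation."""
    out = inp
    for char, esc in GFF3_PAIRS_SKETCH:
        out = out.replace(char, esc)
    return out


print(escape_gff3_sketch("note=a,b;c&d"))  # note%3Da%2Cb%3Bc%26d
```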

plastid.readers.gff_tokens.escape_GTF2(inp)[source]

Escape reserved characters in GTF2 tokens using percentage notation. While the GTF2 spec is agnostic about escaping, it is useful when adding extra attributes to files. As a convention, we escape the characters specified in the GFF3 spec, as well as double quotation marks.

In the GFF3 spec, reserved characters include:

  • control characters (ASCII 0-32, 127, and 128-159)
  • tab, newline, & carriage return
  • semicolons & commas
  • the percent sign
  • the equals sign
  • the ampersand

Parameters:
inp : str

Input string

Returns:
str

Escaped output

See also

unescape_GTF2

plastid.readers.gff_tokens.make_GFF3_tokens(attr, excludes=None, escape=True)[source]

Helper function to convert the attr dict of a SegmentChain into the string representation used in GFF3 files. This includes URL-escaping special characters and joining list values with ‘,’ before string conversion.

Parameters:
attr : dict

Dictionary of key-value pairs to export

excludes : list, optional

List of keys to exclude from string

escape : bool, optional

If True, special characters in output are GFF3-escaped (Default: True)

Returns:
str

Data formatted for attributes column of GFF3 (column 9)
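The behavior described above (skip excluded keys, join list values with commas, optionally escape, then join key=value pairs with semicolons) can be approximated as follows. This is a minimal sketch under stated assumptions: make_gff3_tokens_sketch and its inner esc helper are hypothetical, the helper covers only a handful of reserved characters, keys are not escaped, and the output follows dict insertion order, which may differ from the ordering plastid produces.

```python
def make_gff3_tokens_sketch(attr, excludes=None, escape=True):
    """Format a dict as a GFF3-style column 9 string (illustrative only)."""
    excludes = set(excludes or [])

    def esc(text):
        # Stand-in escaper covering only a few reserved characters.
        for char, code in [("%", "%25"), (";", "%3B"), (",", "%2C"), ("=", "%3D")]:
            text = text.replace(char, code)
        return text

    tokens = []
    for key, val in attr.items():
        if key in excludes:
            continue
        # Treat scalars as one-element lists so both cases share one code path.
        vals = val if isinstance(val, list) else [val]
        strs = [esc(str(v)) if escape else str(v) for v in vals]
        tokens.append("%s=%s" % (key, ",".join(strs)))
    return ";".join(tokens)


d = {"a": 1, "c": [3, 7], "text": "something; with escape sequences"}
print(make_gff3_tokens_sketch(d))
# a=1;c=3,7;text=something%3B with escape sequences
```

Escaping each list element before joining keeps the separating commas literal while any commas inside a value are still encoded.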

Examples

>>> d = {'a':1,'b':2,'c':3,'d':4,'e':5,'z':26,'text':"something; with escape sequences"}
>>> make_GFF3_tokens(d)
'a=1;c=3;b=2;e=5;d=4;z=26;text=something%3B with escape sequences'
>>> excludes=['a','b','c']
>>> make_GFF3_tokens(d,excludes)
'e=5;d=4;z=26;text=something%3B with escape sequences'
>>> d = {'a':1,'b':2,'c':[3,7],'d':4,'e':5,'z':26}
>>> make_GFF3_tokens(d)
'a=1;c=3,7;b=2;e=5;d=4;z=26'

plastid.readers.gff_tokens.make_GTF2_tokens(attr, excludes=None, escape=True)[source]

Helper function to convert the attr dict of a SegmentChain into the string representation used in GTF2 files. By default, special characters defined in the GFF3 spec will be URL-escaped.

Parameters:
attr : dict

Dictionary of key-value pairs to export

excludes : list, optional

List of keys to exclude from string

escape : bool, optional

If True, special characters in output are GTF2-escaped (Default: True)

Returns:
str

Data formatted for attributes column of GTF2 (column 9)
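A GTF2 attribute string differs from the GFF3 form in that each value is wrapped in double quotes and each pair ends with a semicolon. The sketch below shows one way to produce that layout. The name make_gtf2_tokens_sketch is hypothetical, the inner esc helper handles only three characters, and placing transcript_id and gene_id first is an assumption chosen to mirror the example output documented for this function.

```python
def make_gtf2_tokens_sketch(attr, excludes=None, escape=True):
    """Format a dict as a GTF2-style column 9 string (illustrative only)."""
    excludes = set(excludes or [])

    def esc(text):
        # Stand-in escaper: percent, semicolon, and double quote only.
        for char, code in [("%", "%25"), (";", "%3B"), ('"', "%22")]:
            text = text.replace(char, code)
        return text

    # Assumption: the two mandatory GTF2 attributes are written first.
    first = [k for k in ("transcript_id", "gene_id") if k in attr]
    rest = [k for k in attr if k not in first]

    tokens = []
    for key in first + rest:
        if key in excludes:
            continue
        val = str(attr[key])
        if escape:
            val = esc(val)
        tokens.append('%s "%s";' % (key, val))
    return " ".join(tokens)


d = {"transcript_id": "t;id", "a": 1, "gene_id": "gid"}
print(make_gtf2_tokens_sketch(d))
# transcript_id "t%3Bid"; gene_id "gid"; a "1";
```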

Examples

>>> d = {'transcript_id' : 't;id', 'a':1,'b':2,'c':3,'d':4,'e':5,'z':26,
...      'gene_id' : 'gid'}
>>> make_GTF2_tokens(d)
'transcript_id "t%3Bid"; gene_id "gid"; a "1"; c "3"; b "2"; e "5"; d "4"; z "26";'
>>> excludes=['a','b','c']
>>> make_GTF2_tokens(d,excludes)
'transcript_id "t%3Bid"; gene_id "gid"; e "5"; d "4"; z "26";'

plastid.readers.gff_tokens.parse_GFF3_tokens(inp, list_types=None)[source]

Parse GFF3 column 9 tokens into a dictionary of key-value pairs.

plastid.readers.gff_tokens.parse_GTF2_tokens(inp)[source]

Helper function to parse tokens in the final column of a GTF2 file into a dictionary of attributes. All attributes are returned as strings, and are unescaped if GFF escape sequences (e.g. ‘%2B’) are present.

If duplicate keys are present (e.g. as in GENCODE GTF2 files), their values are catenated, separated by a comma.
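The parsing rules just described (quoted values, unescaping, and comma-joining of duplicate keys) can be sketched with a regular expression. Everything here is an assumption for illustration: parse_gtf2_tokens_sketch and _PAIR are hypothetical names, urllib.parse.unquote stands in for the module's own unescaping, and unquoted values such as bare numbers are ignored by this pattern.

```python
import re
from urllib.parse import unquote

# A key is a run of characters other than whitespace, ';', or '"';
# a value is anything between a pair of double quotes, semicolons included.
_PAIR = re.compile(r'([^\s;"]+)\s+"([^"]*)"')


def parse_gtf2_tokens_sketch(inp):
    """Parse a GTF2-style column 9 string into a dict (illustrative only)."""
    attrs = {}
    for key, raw in _PAIR.findall(inp):
        val = unquote(raw)  # stand-in for GFF-style unescaping
        if key in attrs:
            # Duplicate keys: append to the existing value with a comma.
            attrs[key] = attrs[key] + "," + val
        else:
            attrs[key] = val
    return attrs


tokens = 'gene_id "mygene"; tag "tag value"; tag "tag value 2";'
print(parse_gtf2_tokens_sketch(tokens))
# {'gene_id': 'mygene', 'tag': 'tag value,tag value 2'}
```

Matching on the quotes rather than splitting on ';' is what lets semicolons inside a quoted value survive.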

Parameters:
inp : str

Ninth column of GTF2 entry

Returns:
dict

Key-value pairs

Examples

>>> tokens = 'gene_id "mygene"; transcript_id "mytranscript";'
>>> parse_GTF2_tokens(tokens)
{'gene_id' : 'mygene', 'transcript_id' : 'mytranscript'}
>>> tokens = 'gene_id "mygene"; transcript_id "mytranscript"'
>>> parse_GTF2_tokens(tokens)
{'gene_id' : 'mygene', 'transcript_id' : 'mytranscript'}
>>> tokens = 'gene_id "mygene;"; transcript_id "myt;ranscript"'
>>> parse_GTF2_tokens(tokens)
{'gene_id' : 'mygene;', 'transcript_id' : 'myt;ranscript'}
>>> tokens = 'gene_id "mygene"; transcript_id "mytranscript"; tag "tag value";'
>>> parse_GTF2_tokens(tokens)
{'gene_id' : 'mygene', 'tag' : 'tag value', 'transcript_id' : 'mytranscript'}
>>> tokens = 'gene_id "mygene"; transcript_id "mytranscript"; tag "tag value"; tag "tag value 2";'
>>> parse_GTF2_tokens(tokens)
{'gene_id' : 'mygene', 'tag' : 'tag value,tag value 2', 'transcript_id' : 'mytranscript'}

plastid.readers.gff_tokens.unescape(inp, char_pairs)[source]

Unescape reserved characters specified in the list of tuples char_pairs

Parameters:
inp : str

Input string

char_pairs : list

List of tuples of (character, escape sequence for character)

Returns:
str

Unescaped output

See also

escape_GFF3
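When the percent sign is itself one of the escaped characters, the order in which pairs are undone matters. The sketch below walks the list in reverse so that '%25' is restored last. The name unescape_sketch and the example pairs are hypothetical, and whether plastid reverses the list in the same way is an assumption.

```python
def unescape_sketch(inp, char_pairs):
    """Undo escaping by applying the pairs in reverse list order."""
    out = inp
    for char, esc in reversed(char_pairs):
        out = out.replace(esc, char)
    return out


# Hypothetical pairs, with '%' first as it would be for escaping.
example_pairs = [("%", "%25"), (";", "%3B"), ("=", "%3D")]

print(unescape_sketch("a%3Db%3B100%25", example_pairs))  # a=b;100%

# A literal "%3B" in the original text is stored as "%253B". Restoring
# '%25' last keeps it from being misread as an escaped semicolon.
print(unescape_sketch("%253B", example_pairs))  # %3B
```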

plastid.readers.gff_tokens.unescape_GFF3(inp)[source]

Unescape reserved characters in GFF3 tokens using percentage notation.

In the GFF3 spec, reserved characters include:

  • control characters (ASCII 0-32, 127, and 128-159)
  • tab, newline, & carriage return
  • semicolons & commas
  • the percent sign
  • the equals sign
  • the ampersand

Parameters:
inp : str

Input string

Returns:
str

Unescaped output

See also

escape_GFF3

plastid.readers.gff_tokens.unescape_GTF2(inp)[source]

Unescape reserved characters in GTF2 tokens that were escaped using percentage notation. While the GTF2 spec is agnostic about escaping, it is useful when adding extra attributes to files. As a convention, we escape the characters specified in the GFF3 spec, as well as double quotation marks.

In the GFF3 spec, reserved characters include:

  • control characters (ASCII 0-32, 127, and 128-159)
  • tab, newline, & carriage return
  • semicolons & commas
  • the percent sign
  • the equals sign
  • the ampersand

Parameters:
inp : str

Input string

Returns:
str

Unescaped output

See also

escape_GTF2