plastid.readers.gff_tokens module
This module contains functions for escaping, unescaping, and parsing tokens from the ninth column of GTF2 and GFF3 files.
Important methods
make_GTF2_tokens()
Format a dictionary of attributes as GTF2 column 9 attributes
make_GFF3_tokens()
Format a dictionary of attributes as GFF3 column 9 attributes
parse_GTF2_tokens()
Parse GTF2 column 9 tokens into a dictionary of key-value pairs
parse_GFF3_tokens()
Parse GFF3 column 9 tokens into a dictionary of key-value pairs
- plastid.readers.gff_tokens.escape(inp, char_pairs)
Escape reserved characters specified in the list of tuples char_pairs.
- Parameters
- inp : str
Input string
- char_pairs : list
List of tuples of (character, escape sequence for character)
- Returns
- str
Escaped output
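As an illustration of escaping driven by a list of (character, escape sequence) tuples, here is a minimal sketch. It is not the plastid implementation; the function name escape_sketch and the contents of pairs are hypothetical, and the sketch assumes the pairs are applied in list order.

```python
def escape_sketch(inp, char_pairs):
    """Replace each reserved character with its escape sequence, in list order."""
    out = inp
    for char, escape_seq in char_pairs:
        out = out.replace(char, escape_seq)
    return out


# Hypothetical pairs. "%" comes first so that the "%" introduced by the
# later replacements is not escaped a second time.
pairs = [("%", "%25"), (";", "%3B"), ("=", "%3D")]
escaped = escape_sketch("a=1;b=50%", pairs)
print(escaped)
```

With sequential replacement, the order of the pairs matters: if the percent sign were handled last, every escape sequence produced earlier would itself be re-escaped.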
- plastid.readers.gff_tokens.escape_GFF3(inp)
Escape reserved characters in GFF3 tokens using percentage notation.
In the GFF3 spec, reserved characters include:
- control characters (ASCII 0-32, 127, and 128-159)
- tab, newline, and carriage return
- semicolons and commas
- the percent sign
- the equals sign
- the ampersand
- Parameters
- inp : str
Input string
- Returns
- str
Escaped output
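The percentage notation described above can be sketched with a single regular-expression pass. This is an illustration under stated assumptions, not the library's code: escape_gff3_sketch is a hypothetical name, and the control-character range is taken here as 0x00-0x1F so that spaces are left untouched, which matches the make_GFF3_tokens example output on this page.

```python
import re

# Assumed reserved set: "%", ";", ",", "=", "&", plus control characters
# 0x00-0x1F, 0x7F, and 0x80-0x9F. The space character is not escaped here.
_RESERVED = re.compile(r"[%;,=&\x00-\x1f\x7f\x80-\x9f]")


def escape_gff3_sketch(inp):
    """Percent-encode every reserved character as %XX (uppercase hex)."""
    return _RESERVED.sub(lambda m: "%%%02X" % ord(m.group(0)), inp)


escaped = escape_gff3_sketch("note=5% GC; exon,intron\t")
print(escaped)
```

Because the substitution happens in one pass, a percent sign produced by an escape is never escaped again.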
- plastid.readers.gff_tokens.escape_GTF2(inp)
Escape reserved characters in GTF2 tokens using percentage notation. Although the GTF2 spec does not define an escaping scheme, escaping is useful when adding extra attributes to files. By convention, this function escapes the characters specified in the GFF3 spec, as well as double quotation marks.
In the GFF3 spec, reserved characters include:
- control characters (ASCII 0-32, 127, and 128-159)
- tab, newline, and carriage return
- semicolons and commas
- the percent sign
- the equals sign
- the ampersand
- Parameters
- inp : str
Input string
- Returns
- str
Escaped output
- plastid.readers.gff_tokens.make_GFF3_tokens(attr, excludes=None, escape=True)
Helper function to convert the attr dict of a SegmentChain into the string representation used in GFF3 files. This includes URL escaping of special characters and joining lists with ',' before string conversion.
- Parameters
- attr : dict
Dictionary of key-value pairs to export
- excludes : list, optional
List of keys to exclude from string
- escape : bool, optional
If True, special characters in output are GFF3-escaped (Default: True)
- Returns
- str
Data formatted for attributes column of GFF3 (column 9)
Examples
>>> d = {'a':1,'b':2,'c':3,'d':4,'e':5,'z':26,'text':"something; with escape sequences"}
>>> make_GFF3_tokens(d)
'a=1;c=3;b=2;e=5;d=4;z=26;text=something%3B with escape sequences'
>>> excludes=['a','b','c']
>>> make_GFF3_tokens(d,excludes)
'e=5;d=4;z=26;text=something%3B with escape sequences'
>>> d = {'a':1,'b':2,'c':[3,7],'d':4,'e':5,'z':26}
>>> make_GFF3_tokens(d)
'a=1;c=3,7;b=2;e=5;d=4;z=26'
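The formatting behavior in these examples (key=value pairs joined by ';', list values joined by ',', optional exclusion and escaping) can be approximated by the sketch below. It is a standalone illustration rather than plastid's code: make_gff3_tokens_sketch and _esc are hypothetical names, the reserved set is an assumption, and keys are emitted in dictionary insertion order, which may differ from the ordering the library produces.

```python
import re

# Assumed reserved set for GFF3 column 9 (spaces are left untouched).
_RESERVED = re.compile(r"[%;,=&\x00-\x1f\x7f\x80-\x9f]")


def _esc(text):
    """Percent-encode reserved characters in a single pass."""
    return _RESERVED.sub(lambda m: "%%%02X" % ord(m.group(0)), text)


def make_gff3_tokens_sketch(attr, excludes=None, escape=True):
    """Format a dict as 'key=value;key=value', skipping excluded keys."""
    excludes = set(excludes or [])
    esc = _esc if escape else (lambda text: text)
    parts = []
    for key, val in attr.items():
        if key in excludes:
            continue
        if isinstance(val, (list, tuple)):
            # Escape each item separately so the joining commas survive.
            val_str = ",".join(esc(str(item)) for item in val)
        else:
            val_str = esc(str(val))
        parts.append("%s=%s" % (esc(str(key)), val_str))
    return ";".join(parts)


attrs = {"ID": "tx1", "Parent": ["g1", "g2"], "note": "a; b", "skip": 0}
line = make_gff3_tokens_sketch(attrs, excludes=["skip"])
print(line)
```

Escaping list items before joining them is the design point worth noting: escaping the joined string instead would turn the separating commas into %2C.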
- plastid.readers.gff_tokens.make_GTF2_tokens(attr, excludes=None, escape=True)
Helper function to convert the attr dict of a SegmentChain into the string representation used in GTF2 files. By default, special characters defined in the GFF3 spec will be URL-escaped.
- Parameters
- attr : dict
Dictionary of key-value pairs to export
- excludes : list, optional
List of keys to exclude from string
- escape : bool, optional
If True, special characters in output are GTF2-escaped (Default: True)
- Returns
- str
Data formatted for attributes column of GTF2 (column 9)
Examples
>>> d = {'transcript_id' : 't;id', 'a':1,'b':2,'c':3,'d':4,'e':5,'z':26, 'gene_id' : 'gid'}
>>> make_GTF2_tokens(d)
'transcript_id "t%3Bid"; gene_id "gid"; a "1"; c "3"; b "2"; e "5"; d "4"; z "26";'
>>> excludes=['a','b','c']
>>> make_GTF2_tokens(d,excludes)
'transcript_id "t%3Bid"; gene_id "gid"; e "5"; d "4"; z "26";'
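A rough sketch of the GTF2 layout shown in these examples follows: each attribute is written as key "value"; with gene_id and transcript_id placed ahead of the other keys. The names make_gtf2_tokens_sketch and _esc are hypothetical, the reserved set (GFF3 characters plus the double quote) is an assumption, and the relative order of the two leading keys is an arbitrary choice in this sketch.

```python
import re

# Assumed reserved set: the GFF3 characters plus the double quotation mark.
_RESERVED = re.compile(r'[%;,=&"\x00-\x1f\x7f\x80-\x9f]')


def _esc(text):
    """Percent-encode reserved characters in a single pass."""
    return _RESERVED.sub(lambda m: "%%%02X" % ord(m.group(0)), text)


def make_gtf2_tokens_sketch(attr, excludes=None, escape=True):
    """Format a dict as 'key "value"; key "value";' with ID keys first."""
    excludes = set(excludes or [])
    esc = _esc if escape else (lambda text: text)
    # Leading keys first, then the remaining keys in insertion order.
    keys = [k for k in ("gene_id", "transcript_id") if k in attr]
    keys += [k for k in attr if k not in keys]
    parts = ['%s "%s";' % (k, esc(str(attr[k]))) for k in keys if k not in excludes]
    return " ".join(parts)


line = make_gtf2_tokens_sketch({"a": 1, "transcript_id": "t;id", "gene_id": "gid"})
print(line)
```

Escaping the double quote matters in this format because values are delimited by double quotes; an unescaped quote inside a value would end the value early when the line is parsed.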
- plastid.readers.gff_tokens.parse_GTF2_tokens(inp)
Helper function to parse tokens in the final column of a GTF2 file into a dictionary of attributes. All attributes are returned as strings, and are unescaped if GFF escape sequences (e.g. '%2B') are present.
If duplicate keys are present (e.g. as in GENCODE GTF2 files), their values are concatenated, separated by commas.
- Parameters
- inp : str
Ninth column of GTF2 entry
- Returns
- dict
Key-value pairs
Examples
>>> tokens = 'gene_id "mygene"; transcript_id "mytranscript";'
>>> parse_GTF2_tokens(tokens)
{'gene_id' : 'mygene', 'transcript_id' : 'mytranscript'}
>>> tokens = 'gene_id "mygene"; transcript_id "mytranscript"'
>>> parse_GTF2_tokens(tokens)
{'gene_id' : 'mygene', 'transcript_id' : 'mytranscript'}
>>> tokens = 'gene_id "mygene;"; transcript_id "myt;ranscript"'
>>> parse_GTF2_tokens(tokens)
{'gene_id' : 'mygene;', 'transcript_id' : 'myt;ranscript'}
>>> tokens = 'gene_id "mygene"; transcript_id "mytranscript"; tag "tag value";'
>>> parse_GTF2_tokens(tokens)
{'gene_id' : 'mygene', 'tag' : 'tag value', 'transcript_id' : 'mytranscript'}
>>> tokens = 'gene_id "mygene"; transcript_id "mytranscript"; tag "tag value"; tag "tag value 2";'
>>> parse_GTF2_tokens(tokens)
{'gene_id' : 'mygene', 'tag' : 'tag value,tag value 2', 'transcript_id' : 'mytranscript'}
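The parsing behavior illustrated above (semicolons tolerated inside quoted values, an optional trailing semicolon, duplicate keys joined by commas) can be sketched with a regular expression. This is not the library's parser: parse_gtf2_tokens_sketch is a hypothetical name, urllib.parse.unquote from the standard library is swapped in for the unescaping step, and unquoted values are not handled.

```python
import re
from urllib.parse import unquote

# One token: a key, whitespace, a double-quoted value, an optional ";".
# Matching the whole quoted value keeps semicolons inside it intact.
_TOKEN = re.compile(r'\s*(\S+)\s+"([^"]*)"\s*;?')


def parse_gtf2_tokens_sketch(inp):
    """Parse 'key "value";' tokens into a dict of strings."""
    attrs = {}
    for key, val in _TOKEN.findall(inp):
        val = unquote(val)  # decode %XX escape sequences
        if key in attrs:
            # Duplicate key: append to the existing value with a comma.
            attrs[key] = attrs[key] + "," + val
        else:
            attrs[key] = val
    return attrs


tokens = 'gene_id "mygene;"; transcript_id "myt;ranscript"; tag "a"; tag "b%3Bc"'
parsed = parse_gtf2_tokens_sketch(tokens)
print(parsed)
```

Splitting the column on ';' would be simpler but would break on the third example above, where semicolons appear inside quoted values.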
- plastid.readers.gff_tokens.unescape(inp, char_pairs)
Unescape reserved characters specified in the list of tuples char_pairs.
- Parameters
- inp : str
Input string
- char_pairs : list
List of tuples of (character, escape sequence for character)
- Returns
- str
Unescaped output
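A minimal sketch of the reverse operation is shown below, assuming the same list of (character, escape sequence) tuples that was used for escaping. The name unescape_sketch and the pairs are hypothetical; the sketch walks the list in reverse so that the percent sign, escaped first, is restored last.

```python
def unescape_sketch(inp, char_pairs):
    """Replace each escape sequence with its character, in reverse list order."""
    out = inp
    for char, escape_seq in reversed(char_pairs):
        out = out.replace(escape_seq, char)
    return out


# Hypothetical pairs, listed in the order used for escaping ("%" first).
pairs = [("%", "%25"), (";", "%3B"), ("=", "%3D")]
restored = unescape_sketch("a%3D1%3Bb%3D50%25", pairs)
print(restored)
```

Restoring the percent sign last keeps text such as "%253B" (an escaped literal "%3B") from being decoded twice into a semicolon.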
- plastid.readers.gff_tokens.unescape_GFF3(inp)
Unescape reserved characters in GFF3 tokens using percentage notation.
In the GFF3 spec, reserved characters include:
- control characters (ASCII 0-32, 127, and 128-159)
- tab, newline, and carriage return
- semicolons and commas
- the percent sign
- the equals sign
- the ampersand
- Parameters
- inp : str
Input string
- Returns
- str
Unescaped output
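Decoding percentage notation can also be sketched as a single regular-expression pass. Note the assumption: this hypothetical unescape_percent_sketch decodes every %XX sequence it finds, not only those for the reserved characters listed above, so it may be more permissive than the library function.

```python
import re

# Any "%" followed by exactly two hexadecimal digits.
_PERCENT = re.compile(r"%([0-9A-Fa-f]{2})")


def unescape_percent_sketch(inp):
    """Decode each %XX sequence to the character with that code point."""
    return _PERCENT.sub(lambda m: chr(int(m.group(1), 16)), inp)


decoded = unescape_percent_sketch("note%3D5%25 GC%3B exon%2Cintron%09")
print(repr(decoded))
```

Since matches are consumed left to right without rescanning, "%253B" decodes to the literal text "%3B" rather than to a semicolon.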
- plastid.readers.gff_tokens.unescape_GTF2(inp)
Unescape reserved characters in GTF2 tokens using percentage notation. Although the GTF2 spec does not define an escaping scheme, escaping is useful when adding extra attributes to files. By convention, the characters specified in the GFF3 spec, as well as double quotation marks, are escaped; this function reverses that escaping.
In the GFF3 spec, reserved characters include:
- control characters (ASCII 0-32, 127, and 128-159)
- tab, newline, and carriage return
- semicolons and commas
- the percent sign
- the equals sign
- the ampersand
- Parameters
- inp : str
Input string
- Returns
- str
Unescaped output