plastid.bin.reformat_transcripts module

Convert transcripts from BED, BigBed, GTF2, GFF3, or PSL format to BED, extended BED, or GTF2 format.


GFF3 schemas vary
Different GFF3 files organize their feature hierarchies differently. By default, we assume the ontology used by the Sequence Ontology consortium. Users who require a different schema may supply --gff_transcript_types and --gff_exon_types to indicate which feature types should be included.
Identity relationships between elements vary between GFF3 files
GFF3 files can represent discontiguous features using two strategies. In one strategy, the exons of a transcript have unique IDs but share the same parent ID in the Parent attribute in column 9 of the GFF3. In the other strategy, different exons of the same transcript simply share the same ID and don’t define a Parent. Here, both schemes are accepted, although the behavior is undefined if they conflict within a single transcript.
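The two conventions can be illustrated with a minimal sketch. This is not plastid's assembly code: the `group_exons` helper and the record dictionaries are hypothetical, and the sketch simply keys each exon on its Parent attribute when one is present, falling back to the exon's own ID otherwise.

```python
from collections import defaultdict


def group_exons(records):
    """Group exon records by transcript (hypothetical helper, not plastid API).

    Each record is a dict with 'start', 'end', and an 'attributes' dict
    holding the parsed column 9 of a GFF3 line.
    """
    transcripts = defaultdict(list)
    for rec in records:
        attrs = rec["attributes"]
        # Strategy 1: exons point at their transcript through Parent.
        # Strategy 2: exons of one transcript share the same ID.
        key = attrs.get("Parent", attrs.get("ID"))
        transcripts[key].append((rec["start"], rec["end"]))
    # Sort each transcript's exons by position
    return {name: sorted(exons) for name, exons in transcripts.items()}


# Strategy 1: unique exon IDs, shared Parent
parent_style = [
    {"start": 100, "end": 200, "attributes": {"ID": "exon1", "Parent": "txA"}},
    {"start": 300, "end": 400, "attributes": {"ID": "exon2", "Parent": "txA"}},
]

# Strategy 2: shared ID, no Parent
shared_id_style = [
    {"start": 100, "end": 200, "attributes": {"ID": "txB"}},
    {"start": 300, "end": 400, "attributes": {"ID": "txB"}},
]
```

Both inputs yield one transcript with two exons. The sketch makes no attempt to handle a file that mixes the two schemes within one transcript.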

Command-line arguments

Optional arguments

Argument Description
outfile.[ bed | gtf ] Output file
-h, --help show this help message and exit
--no_escape If specified and output format is GTF2, special characters in column 9 will not be escaped (Default: special characters are escaped)
--output_format  {BED,GTF2} Format of output file. (default: GTF2)
--extra_columns  EXTRA_COLUMNS [EXTRA_COLUMNS ...] Attributes (e.g. ‘gene_id’) to output as extra columns in extended BED format (BED output only).
--empty_value  EMPTY_VALUE Value to use if an attribute in extra_columns is not defined for a particular record (Default: ‘na’)
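How --extra_columns and --empty_value interact can be sketched as follows. The `extended_bed_fields` function is a hypothetical illustration, not part of plastid: for each requested attribute name it emits the record's value, or the placeholder when the attribute is missing.

```python
def extended_bed_fields(attributes, extra_columns, empty_value="na"):
    """Return the extra column values for one record (hypothetical helper).

    attributes    : dict of attribute name -> value for the record
    extra_columns : attribute names requested as extra BED columns
    empty_value   : placeholder for attributes the record lacks
    """
    return [str(attributes.get(name, empty_value)) for name in extra_columns]


record_attributes = {"gene_id": "YFG1"}
fields = extended_bed_fields(record_attributes, ["gene_id", "gene_name"])
# 'gene_name' is absent from the record, so the placeholder fills its column
```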

Warning/error options

Argument Description
-q, --quiet Suppress all warning messages. Cannot use with ‘-v’.
-v, --verbose Increase verbosity. With ‘-v’, show every warning. With ‘-vv’, turn warnings into exceptions. Cannot use with ‘-q’. (Default: show each type of warning once)

Annotation file options (one or more annotation files required)

Open one or more genome annotation files

Argument Description
--annotation_files  infile.[BED | BigBed | GTF2 | GFF3] [infile.[BED | BigBed | GTF2 | GFF3] ...] Zero or more annotation files (max 1 file if BigBed)
--annotation_format  {BED,BigBed,GTF2,GFF3} Format of annotation_files (Default: GTF2). Note: GFF3 assembly assumes SO v.2.5.2 feature ontologies, which may or may not match your specific file.
--add_three If supplied, coding regions will be extended by 3 nucleotides at their 3’ ends (except for GTF2 files that explicitly include stop_codon features). Use if your annotation file excludes stop codons from CDS.
--tabix annotation_files are tabix-compressed and indexed (Default: False). Ignored for BigBed files.
--sorted annotation_files are sorted by chromosomal position (Default: False)
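The effect of --add_three can be pictured with a small strand-aware sketch. It assumes 0-based, half-open coordinates and an unspliced coding region, and the `add_three` function here is a hypothetical stand-in rather than plastid's implementation, which also has to account for splicing and transcript boundaries.

```python
def add_three(cds_start, cds_end, strand):
    """Extend a coding region by 3 nt at its 3' end (illustrative only).

    Coordinates are assumed to be 0-based and half-open.
    """
    if strand == "+":
        # 3' end is the right-hand coordinate on the plus strand
        return cds_start, cds_end + 3
    if strand == "-":
        # 3' end is the left-hand coordinate on the minus strand
        return max(cds_start - 3, 0), cds_end
    raise ValueError("strand must be '+' or '-'")
```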

BED-specific options

Argument Description
--bed_extra_columns  BED_EXTRA_COLUMNS [BED_EXTRA_COLUMNS ...] Number of extra columns in BED file (e.g. in custom ENCODE formats) or list of names for those columns. (Default: 0).

BigBed-specific options

Argument Description
--maxmem  MAXMEM Maximum desired memory footprint in MB to devote to BigBed/BigWig files. May be exceeded by large queries. (Default: 0, No maximum)

GFF3-specific options

Argument Description
--gff_transcript_types  GFF_TRANSCRIPT_TYPES [GFF_TRANSCRIPT_TYPES ...] GFF3 feature types to include as transcripts, even if no exons are present (for GFF3 only; default: use SO v2.5.3 specification)
--gff_exon_types  GFF_EXON_TYPES [GFF_EXON_TYPES ...] GFF3 feature types to include as exons (for GFF3 only; default: use SO v2.5.3 specification)
--gff_cds_types  GFF_CDS_TYPES [GFF_CDS_TYPES ...] GFF3 feature types to include as CDS (for GFF3 only; default: use SO v2.5.3 specification)

Script contents

plastid.bin.reformat_transcripts.fix_name(inp, names_used)[source]

Append a number if an autoSql field name is duplicated.
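The behavior described above might look roughly like the sketch below. It is an assumption about the general technique, not a copy of the plastid source: the `dedupe_name` function is hypothetical, and the exact suffix format plastid uses may differ.

```python
def dedupe_name(name, names_used):
    """Return `name`, or `name` plus a number if it is already taken.

    Hypothetical illustration; does not modify `names_used`.
    """
    candidate = name
    counter = 1
    while candidate in names_used:
        candidate = "%s%d" % (name, counter)
        counter += 1
    return candidate


names_used = set()
unique_names = []
for raw in ["gene_id", "score", "gene_id", "gene_id"]:
    fixed = dedupe_name(raw, names_used)
    names_used.add(fixed)  # the caller records each name once it is assigned
    unique_names.append(fixed)
```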

plastid.bin.reformat_transcripts.main(argv=sys.argv[1:])[source]

Command-line program

argv : list, optional

A list of command-line arguments, which will be processed as if the script were called from the command line if main() is called directly.

Default: sys.argv[1:] (the actual command-line arguments)
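Calling main() from Python with an explicit argument list could look like the sketch below. The file names are placeholders, the `run` wrapper is hypothetical, and only flags documented on this page are used. The import is deferred so the argument list can be inspected without plastid installed.

```python
# Placeholder file names; substitute real paths.
argv = [
    "transcripts.bed",                      # positional output file
    "--annotation_files", "annotation.gff3",
    "--annotation_format", "GFF3",
    "--output_format", "BED",
    "--extra_columns", "gene_id",
]


def run(arguments):
    """Invoke the script's entry point with `arguments` (requires plastid)."""
    from plastid.bin.reformat_transcripts import main
    main(argv=arguments)
```

Calling `run(argv)` should then behave as if the same arguments had been typed on the command line.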