plastid.bin.cs module

Count the number of read alignments and calculate read densities (in RPKM) over genes, correcting gene boundaries for overlap with other genes or regions specified in a mask file.

Counts and densities are calculated separately per gene for exons, 5’ UTRs, coding regions, and 3’ UTRs. In addition, positions overlapped by multiple genes are excluded, as are positions annotated in mask annotation files, if one is provided.

The script’s operation is divided into three subprograms:

Generate

The generate mode pre-process a genome annotation as follows:

  1. All genes whose transcripts share exact exons are collapsed to “merged” genes.

  2. Positions covered by more than one merged gene on the same strand are excluded from analysis, and subtracted from each merged genes.

  3. Remaining positions in each merged gene are then divided into the following groups:

    exon

    all positions in all transcripts mapping to the merged gene

    CDS

    positions which appear in coding regions in all transcript isoforms mapping to the merged gene. i.e. These positions are never part of a fiveprime or threeprime UTR in any transcript mapping to the merged gene

    UTR5

    positions which are annotated only as 5’ UTR in all transcript isoforms mapping to the merged gene

    UTR3

    positions which are annotated only as 3 UTR in all transcript isoforms mapping to the merged gene

    masked

    positions excluded from analyses as directed in an optional mask file

The following files are output, where OUTBASE is a name supplied by the user:

OUTBASE_gene.positions
Tab-delimited text file. Each line is a merged gene, and columns indicate the genomic coordinates and lengths of each of the position sets above.
OUTBASE_transcript.positions
Tab-delimited text file. Each line is a transcript, and columns indicate the genomic coordinates and lengths of each of the position sets above.
OUTBASE_gene_REGION.bed
BED files showing position sets for REGION, where REGION is one of exon, CDS, UTR5, and UTR3 or masked. These contain the same information in OUTBASE_gene.positions, but can be visualized easily in a genome browser
Count
The count mode counts the number and density of read alignments in each sub-region (exon, CDS, UTR5, and UTR3) of each gene.
Chart
The chart mode takes output from one or more samples run under count mode and generates several tables and charts that provide broad overviews of the data.

See command-line help for each subprogram for details on each mode

See also

counts_in_region script
Calculate the number and density of read alignments covering any set of regions of interest, making no corrections for gene boundaries.

Command-line arguments

Optional arguments

Argument Description
-h, --help show this help message and exit

Subcommands

choose one of the following

Argument Description
generate Create unambiguous position file from GFF3 annotation
count Count reads in unambiguous gene positions
chart Produce charts comparing reads between samples

generate subcommand


Generate gene position files from gene annotations.

  1. Genes whose transcripts share exons are first collapsed into merged genes.
  2. Within merged genes, all positions are classified. All positions are included in a set called exon. All positions that appear as coding regions in all transcripts (i.e. are never part of a 5’UTR or 3’UTR) included in a set called CDS. Similarly, all positions that appear as 5’ UTR or 3’ UTR in all transcripts are included in sets called UTR5 or UTR3, respectively.
  3. Genomic positions that are overlapped by multiple merged genes are excluded from the position sets for those genes.
  4. If a mask file is supplied, positions annotated in the mask file are also excluded
  5. Output is given as a series of BED files and a positions file containing the same data.

Positional arguments
Argument Description
outbase Basename for output files
Optional arguments
Argument Description
-h, --help show this help message and exit
Warning/error options
Argument Description
-q, --quiet Suppress all warning messages. Cannot use with ‘-v’.
-v, --verbose Increase verbosity. With ‘-v’, show every warning. With ‘-vv’, turn warnings into exceptions. Cannot use with ‘-q’. (Default: show each type of warning once)
Annotation file options (one or more annotation files required)

Open one or more genome annotation files

Argument Description
--annotation_files  infile.[BED | BigBed | GTF2 | GFF3] [infile.[BED | BigBed | GTF2 | GFF3] ...] Zero or more annotation files (max 1 file if BigBed)
--annotation_format  {BED,BigBed,GTF2,GFF3} Format of annotation_files (Default: GTF2). Note: GFF3 assembly assumes SO v.2.5.2 feature ontologies, which may or may not match your specific file.
--add_three If supplied, coding regions will be extended by 3 nucleotides at their 3’ ends (except for GTF2 files that explicitly include stop_codon features). Use if your annotation file excludes stop codons from CDS.
--tabix annotation_files are tabix-compressed and indexed (Default: False). Ignored for BigBed files.
--sorted annotation_files are sorted by chromosomal position (Default: False)
Bed-specific options
Argument Description
--bed_extra_columns  BED_EXTRA_COLUMNS [BED_EXTRA_COLUMNS ...] Number of extra columns in BED file (e.g. in custom ENCODE formats) or list of names for those columns. (Default: 0).
--mask_bed_extra_columns  MASK_BED_EXTRA_COLUMNS [MASK_BED_EXTRA_COLUMNS ...] Number of extra columns in BED file (e.g. in custom ENCODE formats) or list of names for those columns. (Default: 0).
Bigbed-specific options
Argument Description
--maxmem  MAXMEM Maximum desired memory footprint in MB to devote to BigBed/BigWig files. May be exceeded by large queries. (Default: 0, No maximum)
--mask_maxmem  MASK_MAXMEM Maximum desired memory footprint in MB to devote to BigBed/BigWig files. May be exceeded by large queries. (Default: 0, No maximum)
Gff3-specific options
Argument Description
--gff_transcript_types  GFF_TRANSCRIPT_TYPES [GFF_TRANSCRIPT_TYPES ...] GFF3 feature types to include as transcripts, even if no exons are present (for GFF3 only; default: use SO v2.5.3 specification)
--gff_exon_types  GFF_EXON_TYPES [GFF_EXON_TYPES ...] GFF3 feature types to include as exons (for GFF3 only; default: use SO v2.5.3 specification)
--gff_cds_types  GFF_CDS_TYPES [GFF_CDS_TYPES ...] GFF3 feature types to include as CDS (for GFF3 only; default: use SO v2.5.3 specification)
--mask_gff_transcript_types  MASK_GFF_TRANSCRIPT_TYPES [MASK_GFF_TRANSCRIPT_TYPES ...] GFF3 feature types to include as transcripts, even if no exons are present (for GFF3 only; default: use SO v2.5.3 specification)
--mask_gff_exon_types  MASK_GFF_EXON_TYPES [MASK_GFF_EXON_TYPES ...] GFF3 feature types to include as exons (for GFF3 only; default: use SO v2.5.3 specification)
--mask_gff_cds_types  MASK_GFF_CDS_TYPES [MASK_GFF_CDS_TYPES ...] GFF3 feature types to include as CDS (for GFF3 only; default: use SO v2.5.3 specification)
Mask file options (optional)

Add mask file(s) that annotate regions that should be excluded from analyses (e.g. repetitive genomic regions).

Argument Description
--mask_annotation_files  infile.[BED | BigBed | GTF2 | GFF3 | PSL] [infile.[BED | BigBed | GTF2 | GFF3 | PSL] ...] Zero or more annotation files (max 1 file if BigBed)
--mask_annotation_format  {BED,BigBed,GTF2,GFF3,PSL} Format of mask_annotation_files (Default: GTF2). Note: GFF3 assembly assumes SO v.2.5.2 feature ontologies, which may or may not match your specific file.
--mask_add_three If supplied, coding regions will be extended by 3 nucleotides at their 3’ ends (except for GTF2 files that explicitly include stop_codon features). Use if your annotation file excludes stop codons from CDS.
--mask_tabix mask_annotation_files are tabix-compressed and indexed (Default: False). Ignored for BigBed files.
--mask_sorted mask_annotation_files are sorted by chromosomal position (Default: False)

count subcommand


Count the number and density covering each merged gene in an annotation made made using the generate subcommand).


Positional arguments
Argument Description
file.positions File assigning positions to genes or transcripts (made using ‘generate’ subcommand)
outbase Basename for output files
Optional arguments
Argument Description
-h, --help show this help message and exit
Warning/error options
Argument Description
-q, --quiet Suppress all warning messages. Cannot use with ‘-v’.
-v, --verbose Increase verbosity. With ‘-v’, show every warning. With ‘-vv’, turn warnings into exceptions. Cannot use with ‘-q’. (Default: show each type of warning once)
Count & alignment file options

Open alignment or count files and optionally set mapping rules

Argument Description
--count_files  COUNT_FILES [COUNT_FILES ...] One or more count or alignment file(s) from a single sample or set of samples to be pooled.
--countfile_format  {BAM,bigwig,bowtie,wiggle} Format of file containing alignments or counts (Default: BAM)
--sum  SUM Sum used in normalization of counts and RPKM/RPNT calculations (Default: total mapped reads/counts in dataset)
--min_length  N Minimum read length required to be included (BAM & bowtie files only. Default: 25)
--max_length  N Maximum read length permitted to be included (BAM & bowtie files only. Default: 100)
--maxmem  MAXMEM Maximum desired memory footprint in MB to devote to BigBed/BigWig files. May be exceeded by large queries. (Default: 0, No maximum)
--big_genome Use slower but memory-efficient implementation for big genomes or for memory-limited computers. For wiggle & bowtie files only.
Alignment mapping functions (bam & bowtie files only)

For BAM or bowtie files, one of the mutually exclusive read mapping functions is required:

Argument Description
--fiveprime_variable Map read alignment to a variable offset from 5’ position of read, with offset determined by read length. Requires –offset below
--fiveprime Map read alignment to 5’ position.
--threeprime Map read alignment to 3’ position
--center Subtract N positions from each end of read, and add 1/(length-N), to each remaining position, where N is specified by –nibble
Filtering and alignment mapping options

The remaining arguments are optional and affect the behavior of specific mapping functions:

Argument Description
--offset  OFFSET For –fiveprime or –threeprime, provide an integer representing the offset into the read, starting from either the 5’ or 3’ end, at which data should be plotted. For –fiveprime_variable, provide the filename of a two-column tab-delimited text file, in which first column represents read length or the special keyword ‘default’, and the second column represents the offset from the five prime end of that read length at which the read should be mapped. (Default: 0)
--nibble  N For use with –center only. nt to remove from each end of read before mapping (Default: 0)

chart subcommand


Produce a set of charts comparing multiple samples pairwise.

Charts include histograms of log2 fold changes and scatter plots with correlation coefficients, both generated for raw count and RPKM data.

Positional arguments
Argument Description
gene_list.txt Optional. File listing regions (genes or transcripts), one per line, to include in comparisons. If not given, all genes are included.
outbase Basename for output files
Optional arguments
Argument Description
-h, --help show this help message and exit
-i  INFILES [INFILES ...], --in  INFILES [INFILES ...] input files, made by ‘count’ subprogram
--bins  BINS [BINS ...] Bins into which features are partitioned based on counts
--regions  REGIONS [REGIONS ...] Regions to compare (default: exon, utr5, cds, utr3)
--metrics  METRICS [METRICS ...] Metrics to compare (default: rpkm, reads)
Warning/error options
Argument Description
-q, --quiet Suppress all warning messages. Cannot use with ‘-v’.
-v, --verbose Increase verbosity. With ‘-v’, show every warning. With ‘-vv’, turn warnings into exceptions. Cannot use with ‘-q’. (Default: show each type of warning once)

—stylesheet {seaborn-darkgrid,seaborn-notebook,classic,seaborn-ticks,grayscale,bmh,seaborn-talk,dark_background,ggplot,fivethirtyeight,seaborn-colorblind,seaborn-deep,seaborn-whitegrid,seaborn-bright,seaborn-poster,seaborn-muted,seaborn-paper,seaborn-white,seaborn-pastel,seaborn-dark,seaborn-dark-palette}

Plotting options
Argument Description
--figformat  {eps,jpeg,jpg,pdf,pgf,png,ps,raw,rgba,svg,svgz,tif,tiff} File format for figure(s); Default: png)
--figsize  N N Figure width and height, in inches. (Default: use matplotlibrc params)
--title  TITLE Base title for plot(s).
--cmap  CMAP Matplotlib color map from which palette will be made (e.g. ‘Blues’,’autumn’,’Set1’; default: use colors from --stylesheet if given, or color cycle in matplotlibrc)
--dpi  DPI Figure resolution (Default: 150) Use this matplotlib stylesheet instead of matplotlibrc params

Script contents

plastid.bin.cs.do_chart(args, plot_parser)[source]

Produce a set of charts comparing multiple samples pairwise.

Charts include histograms of log2 fold changes and scatter plots with correlation coefficients, both generated for raw count and RPKM data.

Parameters:
args : argparse.Namespace

command-line arguments for chart subprogram

plastid.bin.cs.do_count(args, alignment_parser)[source]

Count the number and density covering each merged gene in an annotation made made using the generate subcommand).

Parameters:
args : argparse.Namespace

command-line arguments for count subprogram

plastid.bin.cs.do_generate(args, annotation_parser, mask_parser)[source]

Generate gene position files from gene annotations.

  1. Genes whose transcripts share exons are first collapsed into merged genes.
  2. Within merged genes, all positions are classified. All positions are included in a set called exon. All positions that appear as coding regions in all transcripts (i.e. are never part of a 5’UTR or 3’UTR) included in a set called CDS. Similarly, all positions that appear as 5’ UTR or 3’ UTR in all transcripts are included in sets called UTR5 or UTR3, respectively.
  3. Genomic positions that are overlapped by multiple merged genes are excluded from the position sets for those genes.
  4. If a mask file is supplied, positions annotated in the mask file are also excluded
  5. Output is given as a series of BED files and a positions file containing the same data.
Parameters:
args : argparse.Namespace

command-line arguments for generate subprogram

plastid.bin.cs.do_scatter(x, y, count_mask, plot_parser, args, pearsonr=None, xlabel=None, ylabel=None, title=None, min_x=0.001, min_y=0.001)[source]

Scatter plot helper for cs chart subprogram

Parameters:
x, y : numpy.ndarray

Data to plot

count_mask : numpy.ndarray

Threshold mask

args : Namespace

Command-line arguments

pearsonr : float

Pearson’s r of the two samples

xlabel : str or None, optional

If not None, an x-axis label

ylabel : str or None, optional

If not None, a y-axis label

min_x,min_y : float

value to which low x- or y-values will respectively be truncated

Returns:
:class:`matplotlib.figure.Figure`

Formatted figure

plastid.bin.cs.main(argv=['-T', '-E', '-b', 'readthedocs', '-d', '_build/doctrees-readthedocs', '-D', 'language=en', '.', '_build/html'])[source]

Command-line program

Parameters:
argv : list, optional

A list of command-line arguments, which will be processed as if the script were called from the command line if main() is called directly.

Default: sys.argv[1:]. The command-line arguments, if the script is invoked from the command line

plastid.bin.cs.merge_genes(tx_ivcs)[source]

Merge genes whose transcripts share exons into a combined, “merged” gene

Parameters:
tx_ivcs : dict

Dictionary mapping unique transcript IDs to Transcripts

Returns:
dict

Dictionary mapping raw gene names to the names of the merged genes

plastid.bin.cs.process_partial_group(transcripts, mask_hash, printer)[source]

Correct boundaries of merged genes, as described in do_generate()

Parameters:
transcripts : dict

Dictionary mapping unique transcript IDs to Transcripts. This set should be complete in the sense that it should contain all transcripts that have any chance of mutually overlapping each other (e.g. all on same chromosome and strand).

mask_hash : GenomeHash

GenomeHash of regions to exclude from analysis

Returns:
:class:`pandas.DataFrame`

Table of merged gene positions

:class:`pandas.DataFrame`

Table of adjusted transcript positions

:class:`dict`

Dictionary mapping raw gene names to merged gene names