plastid.bin.cs module
Count the number of read alignments and calculate read densities (in RPKM) over genes, correcting gene boundaries for overlap with other genes or regions specified in a mask file.
Counts and densities are calculated separately per gene for exons, 5’ UTRs, coding regions, and 3’ UTRs. In addition, positions overlapped by multiple genes are excluded, as are positions annotated in a mask annotation file, if one is provided.
The script’s operation is divided into three subprograms:
- Generate
The generate mode pre-processes a genome annotation as follows:
All genes whose transcripts share exact exons are collapsed to “merged” genes.
Positions covered by more than one merged gene on the same strand are excluded from analysis, and subtracted from each merged gene.
Remaining positions in each merged gene are then divided into the following groups:
- exon
all positions in all transcripts mapping to the merged gene
- CDS
positions which appear in coding regions in all transcript isoforms mapping to the merged gene, i.e. these positions are never part of a 5’ UTR or 3’ UTR in any transcript mapping to the merged gene
- UTR5
positions which are annotated only as 5’ UTR in all transcript isoforms mapping to the merged gene
- UTR3
positions which are annotated only as 3’ UTR in all transcript isoforms mapping to the merged gene
- masked
positions excluded from analyses as directed in an optional mask file
The following files are output, where OUTBASE is a name supplied by the user:
- OUTBASE_gene.positions
Tab-delimited text file. Each line is a merged gene, and columns indicate the genomic coordinates and lengths of each of the position sets above.
- OUTBASE_transcript.positions
Tab-delimited text file. Each line is a transcript, and columns indicate the genomic coordinates and lengths of each of the position sets above.
- OUTBASE_gene_REGION.bed
BED files showing position sets for REGION, where REGION is one of exon, CDS, UTR5, UTR3, or masked. These contain the same information as OUTBASE_gene.positions, but can be visualized easily in a genome browser.
- Count
The count mode counts the number and density of read alignments in each sub-region (exon, CDS, UTR5, and UTR3) of each gene.
- Chart
The chart mode takes output from one or more samples run under count mode and generates several tables and charts that provide broad overviews of the data.
See command-line help for each subprogram for details on each mode.
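The position groups assigned in generate mode can be pictured as set operations over per-transcript annotations. The sketch below is only an illustration of one reading of the definitions above, not the script's actual implementation: the function `classify_positions`, the dictionary layout of each transcript, and the use of plain integer positions are all assumptions made for the example.

```python
def classify_positions(transcripts, masked=frozenset()):
    """Split the positions of one merged gene into groups.

    `transcripts` is a list of dicts holding 'utr5', 'cds' and 'utr3'
    position sets (a hypothetical layout chosen for this sketch).
    """
    any_utr5, any_cds, any_utr3 = set(), set(), set()
    for tx in transcripts:
        any_utr5 |= tx["utr5"]
        any_cds |= tx["cds"]
        any_utr3 |= tx["utr3"]

    # exon: every position found in any transcript of the merged gene
    exon = any_utr5 | any_cds | any_utr3
    groups = {
        "exon": exon,
        # CDS: coding somewhere, never a UTR in any isoform
        "CDS": any_cds - any_utr5 - any_utr3,
        # UTR5 / UTR3: annotated only as that UTR type
        "UTR5": any_utr5 - any_cds - any_utr3,
        "UTR3": any_utr3 - any_cds - any_utr5,
    }
    # masked positions are removed from every group and reported separately
    groups = {name: pos - masked for name, pos in groups.items()}
    groups["masked"] = exon & masked
    return groups


# Two isoforms that disagree on where the 5' UTR ends
tx_a = {"utr5": set(range(0, 10)), "cds": set(range(10, 50)), "utr3": set(range(50, 60))}
tx_b = {"utr5": set(range(0, 20)), "cds": set(range(20, 50)), "utr3": set(range(50, 60))}
groups = classify_positions([tx_a, tx_b], masked={55, 56})
```

In this example positions 10 to 19 are coding in one isoform and 5’ UTR in the other, so they stay in exon but fall out of both CDS and UTR5.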
See also
counts_in_region script
Calculate the number and density of read alignments covering any set of regions of interest, making no corrections for gene boundaries.
- plastid.bin.cs.do_chart(args, plot_parser)
Produce a set of charts comparing multiple samples pairwise.
Charts include histograms of log2 fold changes and scatter plots with correlation coefficients, both generated for raw count and RPKM data.
- Parameters
- args
argparse.Namespace
command-line arguments for the chart subprogram
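The two quantities behind these charts, log2 fold changes and a correlation coefficient between a pair of samples, can be sketched in plain Python. This is an illustration only: the helper names and the `min_count` threshold are hypothetical, and how do_chart itself filters or plots the data is not described here.

```python
import math


def log2_fold_changes(sample_a, sample_b, min_count=1):
    """log2(b / a) for each gene, skipping pairs below a hypothetical threshold."""
    return [
        math.log2(b / a)
        for a, b in zip(sample_a, sample_b)
        if a >= min_count and b >= min_count
    ]


def pearson_r(sample_a, sample_b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(sample_a)
    mean_a = sum(sample_a) / n
    mean_b = sum(sample_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(sample_a, sample_b))
    var_a = sum((a - mean_a) ** 2 for a in sample_a)
    var_b = sum((b - mean_b) ** 2 for b in sample_b)
    return cov / math.sqrt(var_a * var_b)


# Toy counts: every gene has exactly twice as many reads in the second sample
sample_1 = [10, 20, 40, 80]
sample_2 = [20, 40, 80, 160]
changes = log2_fold_changes(sample_1, sample_2)
r = pearson_r(sample_1, sample_2)
```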
- plastid.bin.cs.do_count(args, alignment_parser)
Count the number and density of read alignments covering each merged gene in an annotation made using the generate subcommand.
- Parameters
- args
argparse.Namespace
command-line arguments for the count subprogram
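Densities are reported in RPKM (reads per kilobase per million mapped reads). As a reminder of the arithmetic, here is a minimal sketch using the standard RPKM definition. The function name and arguments are hypothetical, and which read total the count subprogram normalizes against is an assumption not stated here.

```python
def rpkm(count, length_nt, total_reads):
    """Reads per kilobase of region per million reads in the sample."""
    if length_nt == 0 or total_reads == 0:
        # a fully masked region or an empty sample has no defined density
        return float("nan")
    return count * 1e9 / (length_nt * total_reads)


# 500 reads over a 1000-nt region in a sample of 10 million reads
density = rpkm(count=500, length_nt=1000, total_reads=10_000_000)
```

Because positions shared between genes or covered by a mask are excluded, the length used here would be the corrected length of the region, not its annotated length.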
- plastid.bin.cs.do_generate(args, annotation_parser, mask_parser)
Generate gene position files from gene annotations.
Genes whose transcripts share exons are first collapsed into merged genes.
Within merged genes, all positions are classified. All positions are included in a set called exon. All positions that appear as coding regions in all transcripts (i.e. are never part of a 5’ UTR or 3’ UTR) are included in a set called CDS. Similarly, all positions that appear as 5’ UTR or 3’ UTR in all transcripts are included in sets called UTR5 or UTR3, respectively.
Genomic positions that are overlapped by multiple merged genes are excluded from the position sets for those genes.
If a mask file is supplied, positions annotated in the mask file are also excluded.
Output is given as a series of BED files and a positions file containing the same data.
- Parameters
- args
argparse.Namespace
command-line arguments for the generate subprogram
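The exclusion of positions shared between merged genes, and of masked positions, can be illustrated with a coverage count. This is a simplified sketch under stated assumptions: positions are plain integers on a single chromosome and strand, and `exclude_shared_positions` is a hypothetical helper, not part of plastid.

```python
from collections import Counter


def exclude_shared_positions(gene_positions, masked=frozenset()):
    """Drop positions covered by more than one merged gene, plus masked ones.

    `gene_positions` maps merged gene names to sets of positions, all
    assumed to lie on the same chromosome and strand.
    """
    coverage = Counter()
    for positions in gene_positions.values():
        coverage.update(positions)  # each gene adds 1 per position it covers
    shared = {pos for pos, n in coverage.items() if n > 1}
    return {
        gene: positions - shared - masked
        for gene, positions in gene_positions.items()
    }


# Two merged genes overlapping at positions 90-99; positions 0-2 are masked
genes = {"geneA": set(range(0, 100)), "geneB": set(range(90, 150))}
trimmed = exclude_shared_positions(genes, masked={0, 1, 2})
```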
- plastid.bin.cs.do_scatter(x, y, count_mask, plot_parser, args, pearsonr=None, xlabel=None, ylabel=None, title=None, min_x=0.001, min_y=0.001)
Scatter plot helper for the cs chart subprogram.
- Parameters
- x, y
numpy.ndarray
Data to plot
- count_mask
numpy.ndarray
Threshold mask
- args
argparse.Namespace
Command-line arguments
- pearsonr
float
Pearson’s r of the two samples
- xlabel
str or None, optional
If not None, an x-axis label
- ylabel
str or None, optional
If not None, a y-axis label
- min_x, min_y
float
Value to which low x- or y-values will respectively be truncated
- Returns
matplotlib.figure.Figure
Formatted figure
- plastid.bin.cs.main(argv=sys.argv[1:])
Command-line program
- Parameters
- argv
list, optional
A list of command-line arguments, which will be processed as if the script were called from the command line if main() is called directly. Default: sys.argv[1:], the command-line arguments if the script is invoked from the command line.
- plastid.bin.cs.merge_genes(tx_ivcs)
Merge genes whose transcripts share exons into a combined, “merged” gene
- Parameters
- tx_ivcs
dict
Dictionary mapping unique transcript IDs to Transcripts
- Returns
dict
Dictionary mapping raw gene names to the names of the merged genes
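Merging genes that share exons amounts to finding connected components, which a union-find structure handles compactly. The sketch below is one possible approach and not the module's own code: the input layout (transcript ID mapped to a gene name and a list of exon tuples), the name `merge_genes_sketch`, and the comma-joined naming of merged genes are all assumptions made for illustration.

```python
def merge_genes_sketch(transcripts):
    """Map each raw gene name to the name of its merged gene.

    `transcripts` maps transcript IDs to (gene_name, exons), where each
    exon is a hashable tuple such as (chrom, start, end, strand).
    """
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path compression
            gene = parent[gene]
        return gene

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    exon_owner = {}  # first gene seen for each exact exon
    for gene, exons in transcripts.values():
        find(gene)  # register genes even if they share nothing
        for exon in exons:
            if exon in exon_owner:
                union(exon_owner[exon], gene)
            else:
                exon_owner[exon] = gene

    groups = {}
    for gene in list(parent):
        groups.setdefault(find(gene), []).append(gene)
    # merged name: sorted raw names joined by commas (assumed convention)
    return {
        gene: ",".join(sorted(members))
        for members in groups.values()
        for gene in members
    }


txs = {
    "tx1": ("geneA", [("chrI", 100, 200, "+"), ("chrI", 300, 400, "+")]),
    "tx2": ("geneB", [("chrI", 300, 400, "+"), ("chrI", 500, 600, "+")]),
    "tx3": ("geneC", [("chrI", 900, 950, "+")]),
}
merged = merge_genes_sketch(txs)
```

Here geneA and geneB share the exon at 300-400 and collapse into one merged gene, while geneC stays on its own.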
- plastid.bin.cs.process_partial_group(transcripts, mask_hash, printer)
Correct boundaries of merged genes, as described in do_generate()
- Parameters
- transcripts
dict
Dictionary mapping unique transcript IDs to Transcripts. This set should be complete in the sense that it should contain all transcripts that have any chance of mutually overlapping each other (e.g. all on the same chromosome and strand).
- mask_hash
GenomeHash
GenomeHash of regions to exclude from analysis
- Returns
pandas.DataFrame
Table of merged gene positions
pandas.DataFrame
Table of adjusted transcript positions
dict
Dictionary mapping raw gene names to merged gene names