Major changes to
plastid are documented here. Version numbers for the
project follow the conventions described in PEP 440, along with the
guidelines in Semantic versioning, with the exception
that a 0 is prepended (i.e. our version scheme is era.major.minor).
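As a rough illustration of the era.major.minor scheme, the sketch below splits a release string into its three components so that releases can be compared. `PlastidVersion` and `parse_version` are hypothetical names used only for this example; they are not part of plastid.

```python
from typing import NamedTuple


class PlastidVersion(NamedTuple):
    """Hypothetical container for an era.major.minor version (example only)."""
    era: int
    major: int
    minor: int


def parse_version(text: str) -> PlastidVersion:
    """Split a string such as "0.4.7" into integer components."""
    era, major, minor = (int(part) for part in text.strip().split("."))
    return PlastidVersion(era, major, minor)


# Named tuples compare field by field, so sorting orders releases correctly.
releases = ["0.4.6", "0.4.7", "0.3.2", "0.2.2"]
newest_first = sorted(releases, key=parse_version, reverse=True)
```

Comparing parsed tuples rather than raw strings avoids mis-ordering releases once any component reaches two digits.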
plastid [0.4.7] = [2017-03-06]
This update is minor compared to release 0.4.6, and was mainly motivated by
updates, bugfixes, and changes required for compatibility with new versions of
- Support for
write_pl_table() added as a convenience function
--use_mean flag added to
- Warnings / better help text
- rounding error in
PSL_Reader() now capable of parsing strands from translated BLAT output
- Fixed bug in header parsing in
plastid [0.4.6] = [2016-05-20]
- Support for BigWig files.
BigWigReader reads BigWig files, and
BigWigGenomeArray handles them conveniently.
BigBedReader has been reimplemented using Jim Kent’s C library, making it far faster and more memory efficient.
BigBedReader.search() created to search indexed fields included in BigBed files, e.g. to find transcripts with a given gene_id (if gene_id is included as an extension column and indexed). To see which fields are searchable, use
Simplified file opening. All readers can now take filenames in addition to open filehandles. No need to wrap filenames in lists any more. For example:

    # old way to open GTF2 file
    >>> data = GTF2_TranscriptAssembler(open("some_file.gtf"))

    # new way. Also works with BED_Reader, GTF2_Reader, GFF3_TranscriptAssembler, and others
    >>> data = GTF2_TranscriptAssembler("some_file.gtf")

    # old way to get read alignments from BAM files
    >>> alignments = BAMGenomeArray(["some_file.bam","some_other_file.bam"])

    # new way
    >>> alignments = BAMGenomeArray("some_file.bam","some_other_file.bam")

    # old way to open a tabix-indexed file
    >>> data = BED_Reader(pysam.tabix_iterator(open("some_file.bed.gz"),pysam.asTuple()),tabix=True)

    # new way
    >>> data = BED_Reader("some_file.bed.gz",tabix=True)

To maintain backward compatibility, the old syntax still works.
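One way a reader can accept both filenames and open filehandles is sketched below. This is only an illustration of the idea, not plastid's implementation; `iter_lines` is a hypothetical helper introduced for this example.

```python
import io
from typing import IO, Iterator, Union

Source = Union[str, IO[str]]


def iter_lines(*sources: Source) -> Iterator[str]:
    """Yield lines from filenames and/or already-open filehandles."""
    for source in sources:
        if isinstance(source, str):
            # A filename: open it here and close it when finished.
            with open(source) as handle:
                yield from handle
        else:
            # Anything else is treated as an open, line-iterable handle.
            yield from source


# Old-style call with an open handle (an in-memory stand-in here).
lines = list(iter_lines(io.StringIO("chrI\t0\t100\nchrI\t200\t300\n")))
```

Taking sources as positional arguments mirrors the "no need to wrap filenames in lists" convention described above.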
BAMGenomeArray can now use mapping functions that return multidimensional arrays. As an example we added
StratifiedVariableFivePrimeMapFactory, which produces a 2D array of counts at each position in a region (columns), stratified by read length (rows).
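To make the shape of such a 2D array concrete, here is a toy sketch in plain Python. It assumes plus-strand reads given as (5' position, read length) pairs and a half-open region; `stratify_five_prime_counts` is a hypothetical function written for this example and does not reproduce the actual mapping function.

```python
from typing import List, Sequence, Tuple


def stratify_five_prime_counts(
    reads: Sequence[Tuple[int, int]],
    region_start: int,
    region_end: int,
    min_length: int,
    max_length: int,
) -> List[List[int]]:
    """Count read 5' ends per position (columns), split by read length (rows).

    `reads` holds (five_prime_position, read_length) pairs for plus-strand
    reads; the region is half-open: [region_start, region_end).
    """
    n_rows = max_length - min_length + 1
    n_cols = region_end - region_start
    counts = [[0] * n_cols for _ in range(n_rows)]
    for position, length in reads:
        in_region = region_start <= position < region_end
        in_range = min_length <= length <= max_length
        if in_region and in_range:
            counts[length - min_length][position - region_start] += 1
    return counts


# The last read falls outside the region and is ignored.
toy_reads = [(100, 28), (100, 28), (101, 29), (103, 30), (250, 28)]
table = stratify_five_prime_counts(toy_reads, 100, 105, 28, 30)
```

Row 0 of `table` corresponds to 28-nt reads, row 1 to 29-nt reads, and row 2 to 30-nt reads.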
Reformatted & colorized warning output to improve legibility
read_pl_table() convenience function for reading tables written by command-line scripts into DataFrames, preserving headers, formatting, etc.
All script output metadata now includes command as executed, for easier re-running and record keeping
Scripts using count files get
--sum flag, enabling users to set the effective sum of counts/reads used in normalization and RPKM calculations
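The effect of overriding the sum can be sketched with the standard RPKM formula (reads per kilobase per million reads). The helper names below are hypothetical, and the fallback to the sum of the supplied counts is an assumption made for the example rather than a description of what the scripts do by default.

```python
from typing import List, Optional, Sequence


def rpkm(region_counts: float, region_length: int, total_counts: float) -> float:
    """Reads per kilobase of region per million counted reads."""
    return region_counts * 1e9 / (region_length * total_counts)


def normalize(
    counts: Sequence[float],
    lengths: Sequence[int],
    effective_sum: Optional[float] = None,
) -> List[float]:
    # Assumption: without an override, use the sum of the supplied counts.
    total = effective_sum if effective_sum is not None else sum(counts)
    return [rpkm(c, n, total) for c, n in zip(counts, lengths)]


counts = [500.0, 1500.0]
lengths = [1000, 3000]
default_rpkm = normalize(counts, lengths)                   # total = 2000
fixed_rpkm = normalize(counts, lengths, effective_sum=1e6)  # total = 1,000,000
```

Fixing the sum makes values comparable across runs that count over different subsets of the same dataset.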
--constrain option added to
psite to improve performance on noisy or low count data.
- No longer saves intermediate count files.
--keep option added to take care of this.
- Fixed/improved color scaling in heatmap output. Color values are now capped at the 95th percentile of nonzero values, improving contrast
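The capping step can be sketched in plain Python as follows. The interpolation rule used for the percentile is an assumption (linear interpolation between closest ranks), and both function names are hypothetical; the scripts may compute the cap differently.

```python
from typing import List, Sequence


def percentile(sorted_values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation between closest ranks."""
    rank = (len(sorted_values) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def cap_at_percentile(values: Sequence[float], q: float = 95.0) -> List[float]:
    """Clip values at the q-th percentile of the nonzero entries."""
    nonzero = sorted(v for v in values if v != 0)
    if not nonzero:
        return list(values)  # nothing to scale against
    ceiling = percentile(nonzero, q)
    return [min(v, ceiling) for v in values]


# One extreme value would otherwise dominate the color scale.
row = [0, 0, 1, 2, 3, 4, 1000]
capped = cap_at_percentile(row)
```

Excluding zeros before taking the percentile keeps sparse rows from pulling the cap down to zero.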
- Added warnings for files that appear not to contain UTRs
psite, no longer saves intermediate count files.
--keep option added to take care of this.
phase_by_size can now optionally use an ROI file from the
metagene generate subprogram. This improves accuracy in higher eukaryotes by preventing double-counting of codons when more than one transcript is annotated per gene.
cs chart file containing the list of genes is now optional. If not given, all genes are included in comparisons.
reformat_transcripts is now able to export extended BED columns (e.g. gene_id) if the input data has useful attributes. This is particularly useful when working with large transcript annotations in GTF2/GFF3 format: they can now be exported to BED format and converted to BigBed format, allowing random access and low memory usage while preserving gene-transcript relationships.
- Version parsing bug in setup script.
@deprecated function decorator now gives
metagene has been deprecated and will be removed in
plastid v0.5. Instead, use
--normalize_over, which performs the same role, except coordinates are specified relative to the landmark of interest, rather than entire window. This change is more intuitive to many users, and saves them mental math. If both
--normalize_over will be used.
BigBedReader.custom_fields has been replaced with
BigBedReader.chrom_sizes has been replaced with
BigBedReader.chroms for consistency with other data structures
RTree classes, which will be removed in
plastid [0.4.5] = [2016-03-09]
Changes here are mostly under the hood, involving improvements in usability, speed, stability, compatibility, and error reporting. We also fixed up tools for developers and added entrypoints for custom mapping rules.
Users can now control verbosity/frequency of warnings via the ‘-v’ or ‘-q’ options! By default there should no longer be long screens of DataWarnings when processing Ensembl (or other) GTFs.
--aggregate option added to
psite script to improve sensitivity for low-count data.
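A minimal sketch of how ‘-v’ and ‘-q’ switches could drive Python's standard warning filters is shown below. The mapping of each flag to a filter action is an assumption made for illustration, `configure_warnings` is a hypothetical helper, and `DataWarning` is redefined locally as a stand-in so the example is self-contained.

```python
import argparse
import warnings


class DataWarning(Warning):
    """Local stand-in for a data-related warning category (example only)."""


def configure_warnings(argv):
    """Set the DataWarning filter from '-v' / '-q' and return the action used."""
    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-v", dest="verbose", action="store_true")
    group.add_argument("-q", dest="quiet", action="store_true")
    args = parser.parse_args(argv)
    if args.quiet:
        action = "ignore"   # suppress DataWarnings entirely
    elif args.verbose:
        action = "always"   # show every occurrence
    else:
        action = "once"     # assumed default: each distinct message once
    warnings.simplefilter(action, DataWarning)
    return action


default_action = configure_warnings([])
```

With the "once" action, a message repeated for thousands of GTF records is printed a single time instead of filling the screen.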
Created entrypoints for allowing users to use custom mapping rules in the command line scripts:
plastid.mapping_rules for specifying new mapping functions
plastid.mapping_options for specifying any other command-line arguments they consume
Detailed instructions for use in the developer info section of plastid.readthedocs.org.
Argument parsing classes that replace methods deprecated below:
- updated plotting tools to fetch color cycles from matplotlib versions >= 1.5
as well as >= 1.3. This corrected a plotting bug in cs.
AnnotationParser.get_genome_hash_from_args() now internally uses
GFF3_Reader and GTF2_Reader instead of GFF3_TranscriptAssembler and GTF2_TranscriptAssembler, allowing mask files in GTF2/GFF3 formats to be type-agnostic in command-line scripts
contig names no longer lost when using 2bit files in crossmap
- output header in metagene profiles. Sorry about that
- fix compatibility problem with new versions of matplotlib
- now catches a
ValueError that used to be an
IndexError in earlier versions of
Fixed loss-of-ID bug in
- now optionally takes parameters indicating the future version of plastid in which deprecated features will be removed, and what replacement to use instead
Argument parsing methods:
plastid [0.4.4] = [2015-11-16]
Although the list of changes is short, this release includes dramatic reductions in memory usage and speed improvements, as well as a few bug fixes. We recommend everybody upgrade.
- 10-100 fold reduction in memory consumed by
GFF3_TranscriptAssembler. All position & mask hashes now lazily evaluated
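Lazy evaluation of this kind can be sketched with a cached property: the hash is built on first access rather than when the object is created. `LazyTranscript` is a hypothetical toy class for illustration and does not mirror plastid's internals.

```python
class LazyTranscript:
    """Toy record whose position hash is built only on first access."""

    def __init__(self, exons):
        self.exons = list(exons)      # [(start, end), ...], half-open intervals
        self._position_hash = None    # not computed yet

    @property
    def position_hash(self):
        if self._position_hash is None:
            # Built on demand, then cached for later lookups.
            self._position_hash = frozenset(
                pos for start, end in self.exons for pos in range(start, end)
            )
        return self._position_hash


tx = LazyTranscript([(10, 13), (20, 22)])
built_before_access = tx._position_hash is not None  # False: nothing built yet
covered = 21 in tx.position_hash                      # triggers construction
```

Objects that are assembled but never queried by position then cost no extra memory for the hash.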
- 50-fold speed boosts in
SegmentChain.covers() and other methods for comparing
GenomicSegment is now hashable, e.g. can be used in sets or dict keys
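What hashability buys in practice can be shown with a small stand-in class. `Segment` below is a hypothetical frozen dataclass defined for this example; it is not the GenomicSegment class itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical immutable interval; frozen dataclasses are hashable."""
    chrom: str
    start: int
    end: int
    strand: str


a = Segment("chrI", 100, 200, "+")
b = Segment("chrI", 100, 200, "+")
c = Segment("chrI", 100, 200, "-")

unique = {a, b, c}            # a and b are equal, so they collapse into one entry
annotations = {a: "exon 1"}   # segments used as dict keys
label = annotations[b]        # b equals a, so the lookup succeeds
```

Equal objects must hash equally, which is why immutability and hashability usually go together.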
- Track naming bug in
- init bug in
plastid [0.4.2] = [2015-10-22]
No change in codebase vs 0.4.0. Updated required matplotlib version to 1.4.0. Made some changes in sphinx doc config for readthedocs.org, which is still at matplotlib 1.3.0.
plastid [0.4.0] = [2015-10-21]
This release primarily focuses on ease of use: mainly, it is a lot easier to do things with fewer lines of code. Imports have been shortened, plotting tools have been added, and scripts now produce more informative output.
Logical imports: the following commonly-used data structures can now be directly imported from the parent package
plastid, instead of subpackages/submodules:
- All GenomeHashes and GenomeArrays
- All file readers
VariableFivePrimeMapFactory can now be created from static method
from_file(), so no need to manually parse text files or create dictionaries
BAMGenomeArray can now be initialized with a list of paths to BAM files, in addition to or instead of a list of
plastid.plotting package, which includes tools for making MA plots, scatter plots with marginal histograms, metagene profiles, etc.
- more informative plots made in
- support for matplotlib stylesheets, colormaps, etc. in all command-line scripts
add_three_for_stop_codon() reimplemented in Cython, resulting in 2-fold speedup. Moved from
plastid.genomics.roitools (though previous import path still works)
- Fixed IndexError in
psite that arose when running with the latest release of numpy, when generating a read profile over an empty array
- Legends/text no longer get cut off in plots
- Removed deprecated functions
BED_to_SegmentChains, for which
BED_Reader serves as a drop-in replacement
plastid [0.3.2] = [2015-10-01]
- Important docstring updates: removed outdated warnings and descriptions
plastid [0.3.0] = [2015-10-01]
- Cython implementations of
Transcript provide massive speedups
cds_genome_end are now managed properties and update each other to maintain synchrony
SegmentChain._mask_segments are now read-only
SegmentChain.get_masked_length() are replaced by properties
SegmentChain now sort lexically without help
plastid [0.2.3] = [2015-09-23]
- Cython implementations of BAM mapping rules now default, are 2-10x faster than Python implementations
plastid [0.2.2] = [2015-09-15]
First release under official name!
- Major algorithmic improvements to internals & command-line scripts
- Reimplemented mapping rules and some internals in Cython, giving 2-10x speedup for some operations
GenomicSegment now sorts lexically. Properties are read-only
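Both behaviors can be sketched with a small stand-in class. The sort key used here, (chromosome, start, end, strand), is an assumption about what "lexically" means in this context, and `Segment` is a hypothetical class written for this example only.

```python
from functools import total_ordering


@total_ordering
class Segment:
    """Hypothetical read-only interval sorting by (chrom, start, end, strand)."""

    def __init__(self, chrom, start, end, strand):
        self._key = (chrom, start, end, strand)

    # Properties without setters make the attributes read-only.
    chrom = property(lambda self: self._key[0])
    start = property(lambda self: self._key[1])
    end = property(lambda self: self._key[2])
    strand = property(lambda self: self._key[3])

    def __eq__(self, other):
        if not isinstance(other, Segment):
            return NotImplemented
        return self._key == other._key

    def __lt__(self, other):
        if not isinstance(other, Segment):
            return NotImplemented
        return self._key < other._key

    def __hash__(self):
        return hash(self._key)


segments = [
    Segment("chrII", 5, 10, "+"),
    Segment("chrI", 50, 60, "-"),
    Segment("chrI", 50, 60, "+"),
    Segment("chrI", 7, 9, "+"),
]
ordered = sorted(segments)  # no key function needed
```

Attempting `ordered[0].start = 0` raises `AttributeError`, because the property defines no setter.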
This project was initially developed internally under the provisional name
genometools, and then later under the codename
yeti. The current
plastid will not change. Changelogs from earlier versions
yeti [0.2.1] = [2015-09-06]
- Support for extended BED formats now in both import & export, in command-line scripts and interactively
- BED Detail format and known ENCODE BED subtypes now automatically parsed from track definition lines
- Created warning classes DataWarning, FileFormatWarning, and ArgumentWarning
- parallelized crossmap script
- command line support for more sequence file formats; and a sequence
- speed & memory optimizations for cs generate script, resulting in 90% memory reduction on human genome annotation GRCh38.78
- ditto metagene generate script
- crossmap script does not save kmer files unless --save_kmers is given
- warnings now given at first (instead of every) occurrence
- lazy imports; giving speed improvements to command-line scripts
yeti [0.2.0] = [2015-08-26]
Big changes, including some that are backwards-incompatible. We really think these are for the best, because they improve compatibility with other packages (e.g. pandas) and make the package more consistent in design & behavior
- GenomeArray __getitem__ and __setitem__ now can take SegmentChains as arguments
- Mapping functions for bowtie files now issue warnings when reads are unmappable
- support for 2bit files (via twobitreader) and for dicts of strings in SegmentChain.get_sequence
- various warnings added
- pandas compatibility: header rows in all output files no longer have a starting ‘#’, meaning UPDATE YOUR OLD POSITIONS/ROI FILES
- __getitem__ from GenomeArrays now returns vectors 5’ to 3’ relative to GenomicSegment rather than to genome. This is more consistent with user expectations.
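The orientation change amounts to reversing the vector for minus-strand segments, as in this toy sketch. `counts_five_to_three` is a hypothetical function operating on a plain list of per-position counts; it is not the GenomeArray API.

```python
def counts_five_to_three(genome_counts, start, end, strand):
    """Return counts over [start, end) ordered 5' to 3' for the given strand."""
    window = list(genome_counts[start:end])
    # On the minus strand the 5' end is the rightmost genomic position.
    return window[::-1] if strand == "-" else window


genome_counts = [0, 1, 2, 3, 4, 5, 6, 7]
plus = counts_five_to_three(genome_counts, 2, 6, "+")   # [2, 3, 4, 5]
minus = counts_five_to_three(genome_counts, 2, 6, "-")  # [5, 4, 3, 2]
```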
- _get_valid_X methods of SegmentChain changed to _get_masked_X for consistency with documentation and with numpy notation
- ArrayTable class & tests
yeti [0.1.1] = [2015-07-23]
- Created & backpopulated changelog
- Docstrings re-written for user rather than developer focus
- Complete first draft of user manual documentation
- Readthedocs support for documentation
- GFF3_TranscriptAssembler now also handles features whose subfeatures share ID attributes instead of Parent attributes.
- import of scientific packages now simulated using mock during documentation builds by Sphinx
- duplicated attributes in GTF2 column 9 are now catenated & returned as a list in attr dict. This is outside GTF2 spec, but a behavior used by GENCODE
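The catenation behavior can be sketched as follows. This is a simplified parser for illustration only: it handles quoted values, ignores unquoted ones, and keeps single values as plain strings, all of which are assumptions; `parse_gtf2_attributes` is a hypothetical name.

```python
import re

# Matches one `key "value";` pair from GTF2 column 9.
_ATTR = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf2_attributes(column9):
    """Parse quoted attributes, collecting repeated keys into lists."""
    attrs = {}
    for key, value in _ATTR.findall(column9):
        if key not in attrs:
            attrs[key] = value                    # first occurrence: plain string
        elif isinstance(attrs[key], list):
            attrs[key].append(value)              # third or later occurrence
        else:
            attrs[key] = [attrs[key], value]      # second occurrence: promote to list
    return attrs


attrs = parse_gtf2_attributes(
    'gene_id "G1"; transcript_id "T1"; tag "basic"; tag "CCDS";'
)
```

Here `attrs["tag"]` holds both values in file order, while `gene_id` and `transcript_id` stay as single strings.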
- Removed bug from
yeti.bin.metagene.do_generate() that extended maximal spanning windows past equivalence points in 3’ directions. Added extra unit test cases to suit it.
- GenomeHash can again accept GenomicSegments as parameters to __getitem__. Added unit tests for this.
Removed deprecated functions, modules, & classes:
yeti [0.1.0] = [2015-06-06]
First internal release of project under new codename,
yeti. Reset version
- AssembledFeatureReader, GTF2_TranscriptAssembler, GFF3_TranscriptAssembler
- GTF2/GFF3 token parsers now issue warnings on repeated keys
- GFF3 token parsers now return ‘Parent’, ‘Alias’, ‘Dbxref’, ‘dbxref’, and ‘Note’ fields as lists
Package renamed from
genometools to its provisional codename
Reset version number to 0.1.0
Refactored existing readers to descend from AssembledFeatureReader
Migration from old SVN to Git repo
Renaming & moving of functions, classes, & modules for consistency and to avoid name clashes with other packages
Old name                          New name
GenomicInterval                   GenomicSegment
IVCollection                      SegmentChain
NibbleMapFactory                  CenterMapFactory
genometools.genomics.ivtools      yeti.genomics.roitools
genometools.genomics.readers      yeti.readers
genometools.genomics.scriptlib    yeti.util.scriptlib

genometools [0.9.1] - 2015-05-21
- renamed suppress_stdr -> capture_stderr
- capture_stdout decorator
genometools [0.9.0] - 2015-05-20
- All functions that used GenomicFeatures now use IVCollections instead
- GenomicFeature support from GenomeHash subclasses
- GenomicFeature support from IVCollection and GenomicInterval overlap and equality criteria
genometools [0.8.3] - 2015-05-19
- Included missing .positions and .sizes files into egg package
genometools [0.8.2] - 2015-05-19
- Test data now packaged in eggs
- updated documentation
- Bug in cleanup for test_crossmap
- Bug in setup.py
genometools [0.8.1] - 2015-05-18
- Python 3.0 support
- Support for tabix-compressed files. Creation of TabixGenomeHash
- Propagate various attributes to sub-features (utr_ivc, CDS) from Transcript
- Propagate all attributes to sub-features during GTF export from Transcript
- GTF2 export of Transcript objects now generates ‘start_codon’ and ‘stop_codon’ features
- Update of setup.py and Makefile to make dev vs distribution eggs
- ‘transcript_ids’ column of ‘cs generate’ position file now sorted before comma join.
genometools [0.8.2015-05-08] - 2015-05-08
- Merger of make_tophat_juncs, find_juncs, and merge_juncs into one script
- Standardization of column names among various output files: region, regions_counted, counts
- Standardized method names in IVCollection: get_valid_counts, get_valid_length, get_length, get_counts, etc.
- IVCollection/Transcript openers/assemblers all return generators and can take multiple input files
- IVCollection/Transcript openers/assemblers return lexically-sorted objects
- Update to GFF3 escaping conventions rather than URL escaping. Also applied to GTF2 files
- Refactors to cs script, plus garbage collection to reduce memory usage
- Implementation of test suites
- Lazy assembly of GFF3 and GTF2 files to save memory in GTF2_TranscriptAssembler and GFF3_TranscriptAssembler
- BigBed support, creation of BigBedReader and BigBedGenomeHash. AutoSQL support
- Support for truncated BED formats
- P-site offset script
- get_count_vectors script
- counts_in_region script
- UniqueFifo class
- Decorators: parallelize, suppress_stderr, in_separate_process
- variableStep export for BAMGenomeArray
- Support of GTF2 “frame” attribute for CDS features
- Bugfixes in minus strand offsets in crossmaps
- Fixed bug where minus strand crossmap features were ignored
- Bugfixes in CDS end export from Transcript when CDSes ended at the endpoint of internal but not terminal introns on plus-strand transcripts
- Ingolia file tagalign import
- Deprecation of GTF2_to_Transcripts and GFF3_to_Transcripts