Frequently asked questions¶
- Installation and runtime
- Installation fails in pip with no obvious error message
- Installer quits with an error message about numpy, pysam and Cython
- numpy, pysam and Cython are up to date, but install still fails
- Install fails inside a conda/Anaconda environment
- Locale error when installing or running
- Install fails on OSX with error code 1
command not found: I can’t run any of the command line scripts
- A script won’t run, reporting error: too few arguments
- I get an
- Can I use
plastidwith reverse-complemented sequencing data, like dUTP sequencing?
plastidbe used with paired-end data?
- The P-site and/or Metagene scripts show few or zero read in their output
counts_in_region, or some other part of
plastidreports zero counts for my gene, even though there are read alignments there
- Why do some scripts report fractional count numbers?
- Why does IGV report higher coverage at a given nucleotide than the file exported from
- What are the differences between
- Why does
SegmentChain.as_gff3()sometimes throw errors?
- How do I prepare data for differential gene expression analysis in DESeq?
- Can I use
Installation can fail for multiple reasons. To figure out what is responsible,
repeat installation passing the
--verbose flag to
$ pip install --no-cache-dir --verbose plastid | tee 2>&1 plastid_install_log.txt
Then find the corresponding error message below. If the error is not listed, let us know by filing a bug report at our issue tracker. Please attach plastid_install_log.txt to your report to help us figure out what is going on.
The most common cause of installation failure is that the setup script for plastid requires numpy, pysam, and Cython to be installed before it can be run. In this case, the following message should appear:
*** IMPORTANT INSTALLATION INFORMATION *** plastid setup requires numpy>=1.9.0, pysam>=0.8.4, and cython>=0.22 to be preinstalled. Please install these via pip, and retry: $ pip install --upgrade numpy pysam cython $ pip install plastid
$ pip install numpy pysam cython $ pip install plastid
Check if install worked:
$ pip list
If install fails when numpy, pysam and Cython are already up to date,
then there may be multiple versions of these libraries installed on your system,
plastid seeing an earlier version than pip. A few users have
come across this problem when installing
plastid in a conda/Anaconda
If install succeeds in a virtualenv, this suggests that there are in fact
multiple versions of one or more of plastid’s dependencies installed on your
system. In this case,
plastid can be used inside the virtualenv.
Two users have reported difficulties installing inside conda/Anaconda
environments. Despite having up-to-date versions of numpy, pysam, and
Cython installed in the conda environment, during the build process
pip found an incompatible version of Cython or of pysam.
This is known to occur on OSX. In this case, you should see a stack trace ending with something like:
from docutils.utils.error_reporting import locale_encoding, ErrorString, ErrorOutput File "/Applications/anaconda/lib/python2.7/site-packages/docutils/utils/error_reporting.py", line 47, in <module> locale_encoding = locale.getlocale() or locale.getdefaultlocale() File "/Applications/anaconda/lib/python2.7/locale.py", line 543, in getdefaultlocale return _parse_localename(localename) File "/Applications/anaconda/lib/python2.7/locale.py", line 475, in _parse_localename raise ValueError, 'unknown locale: %s' % localename ValueError: unknown locale: UTF-8
Please see the workaround here.
If installing on OSX and you find an error message that resembles the following:
Command "/usr/local/opt/python/bin/python2.7 -c "import setuptools, tokenize;\ __file__='/private/var/folders/8y/xm0qbq655f1d4v20kq5vvfgm0000gq/T/pip-build-0bVdPy/pysam/setup.py';\ exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))"\ install --record /var/folders/some-folder/install-record.txt --single-version-externally-managed \ --compile --user --prefix=" failed with error code 1 in /private/var/folders/some-folder/pysam
Before installing, type:
$ export CFLAGS=-Qunused-arguments $ export CPPFLAGS=-Qunused-arguments
and then retry.
If you receive a command not found error from the shell, the folder containing
the command-line scripts might not be in your environment’s
Command-line scripts will be installed wherever your system configuration dictates.
On OSX and many varities of linux, the install path for a single-user install is
~/.local/bin. For system-wide installs, the path is typically
/usr/local/bin. Make sure the appropriate location is in your
the following line adding to your
(depending on which your system uses):
If you see the following error:
<script name>: error: too few arguments
Try re-ordering the script arguments, so that all of the required arguments
(the ones that don’t start with
-) come first. For example, change:
$ cs count --fiveprime --offset 13 --min_length 23 --max_length 35 \ --count_files ../some_file.bam some_file.positions some_sample_name
$ cs count some_file.positions some_sample_name \ --fiveprime --offset 13 --min_length 23 --max_length 35 \ --count_files ../some_file.bam
Alternatively, put a
-- before the required options:
$ cs count --fiveprime --offset 13 --min_length 23 --max_length 35 \ --count_files ../some_file.bam \ -- some_file.positions some_sample_name
Things should then run.
If you get an error like the following:
Traceback (most recent call last): File "/home/user/Rib_prof/venv/bin/crossmap", line 5, in <module> from pkg_resources import load_entry_point File "/home/user/Rib_prof/venv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2970, in <module> working_set = WorkingSet._build_master() File "/home/user/Rib_prof/venv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 567, in _build_master ws.require(__requires__) File "/home/user/Rib_prof/venv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 876, in require needed = self.resolve(parse_requirements(requirements)) File "/home/user/Rib_prof/venv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 761, in resolve raise DistributionNotFound(req) pkg_resources.DistributionNotFound: scipy>=0.12.0
One or more dependencies (in this example, SciPy is not installed). Please see Installer quits with an error message about numpy, pysam and Cython, above.
Kits like Illumina’s Truseq Stranded mRNA Library Prep Kit, yield reads that are anti-sense to the mRNA from which they were generated, so the data coming off the sequencer will be reverse-complemented compared to the original strand that was cloned.
To use this data in
plastid, reverse-complement your FASTQ file using the
tool from the Hanon lab’s fastx toolkit. Then align the reverse-complemented
data using your favorite aligner.
Yes, but two points:
Because there are few nucleotide-resolution assays that used paried-end sequencing, it has been unclear what sorts of mapping functions might be useful.
If you have a suggestion for one, please submit your suggestion with a use case on our issue tracker. It would be helpful!
For simple gene expression counting, it is possible to use the
FivePrimeMapFactory) mapping function with zero offset. Accuracy can be improved by counting a single read from each pair in which both reads are mapped.
However, it is critical to retain the read that appears on the same strand as the gene from which it arose.
If the library was prepared using dUTP chemistry, as in many paired-end prep kits, select read2 from each pair using
samtools:$ samtools view -f 129 -b -o single_from_pair.bam paired_end_file.bam
Otherwise, select read1:$ samtools view -f 65 -b -o single_from_pair.bam paired_end_file.bamThen use single_from_pair.bam with
increase the minimum counts required to be included in the metagene / P-site estimate. Set
--min_countsargument to a high number (e.g. for a 100 nt normalization region, choose >= 100 counts)
the metagene profile or P-site can be estimated from aggregate counts (as opposed to median density) at each position using the
--aggregateargument, as shown here.
This might add some noise to the data, but it should still be interpretable if the gene models are good
counts_in_region, or some other part of
plastid reports zero counts for my gene, even though there are read alignments there¶
The default behavior for all of the scripts and tools in
plastid is to
exclude reads that are antisense to any given genomic feature when calculating
coverage over that feature.
Paired end libraries, and single-end libraries that have been prepared with dUTP sequencing or a number of other protocols will contain read alignments antisense to the original mRNAs, causing these reads to be considered antisense to genes, and therefore excluded from gene expression totals.
See Can I use plastid with reverse-complemented sequencing data, like dUTP sequencing? for instructions on how to reverse-complement your data
for single-end dUTP data; or Can plastid be used with paired-end data? for info on using
plastid with paired-end data.
Fractional counts for read alignments arise when using a alignment
mapping rule that maps reads fractionally over multiple positions (such
--center mapping). See the discussion of Read mapping functions,
where these are discussed in depth.
While it is possible to write any mapping rule in
plastid, the mapping rules
included by default count each read only once (e.g. at their 5’ end, 3’ end, et c). Even when using
center or entire mapping, each position covered by a read alignment is only incremented by \(1.0/\ell\),
where \(\ell\) is the length of the read. So, in this case, a 30-nucleotide read would only
contribute 1 count to a dataset. See Read mapping functions for more information.
counts_in_region very simply counts read coverage (or any data) over
regions of interest, and reports those numbers in terms of counts and RPKM. It can
optionally take a mask file, if there are genomic positions in the regions
of interest which should be excluded from analysis. Otherwise, it makes no corrections.
cs is more complex, and is principally designed to make rough estimates
of gene expression at the gene, rather than transcript, level. In so doing, it makes several
heuristic corrections to regions before tabulating their counts and RPKM. Specifically:
- Genes that have transcripts that share exons are merged into single entities
- Gene areas are defined for each merged geen by including all positions occupied by all transcripts from that merged gene
- Regions occupied by two or more merged genes on the same strand are excluded from the calculation of expression values for both genes
- Optionally, a mask file can be used to exclude any other positions from analysis.
- Expression values (in counts and RPKM) are tabulated for the entire gene area (reported as exon_counts and exon_rpkm) as well as for sub regions, if the gene is coding. Specifically, cds_counts and cds_rpkm are calculated from counts that cover positions in the gene area that are annotated as CDS in all transcripts in the merged gene. Ditto for 5’ and 3’ UTRs
The incredible fle flexibility of the GFF3 file format introduces ambiguities
for representation of discontinuous features: some sort of parent-child relationship
needs to exist, and, except in the case of transcripts,
know which one to use.
See this advice on how to handle this.