Frequently asked questions¶
FAQ contents
-
Can I use
plastid
with reverse-complemented sequencing data, like dUTP sequencing?The P-site and/or Metagene scripts show few or zero read in their output
Why does IGV report higher coverage at a given nucleotide than the file exported from
make_wiggle
?How do I prepare data for differential gene expression analysis in DESeq?
Installation and runtime¶
Installation fails in pip with no obvious error message¶
Installation can fail for multiple reasons. To figure out what is responsible,
repeat installation passing the --verbose
flag to pip
:
$ pip install --no-cache-dir --verbose plastid | tee 2>&1 plastid_install_log.txt
Then find the corresponding error message below. If the error is not listed, let us know by filing a bug report at our issue tracker. Please attach plastid_install_log.txt to your report to help us figure out what is going on.
If you need to get to work quickly, try installing in a vanilla environment in a fresh virtualenv . See instructions at Install inside a virtualenv.
Error messages about numpy or pysam arise during install or runtime¶
If odd installation or runtime bugs arise that mention numpy or pysam,
then there may be multiple versions of these libraries installed on your system,
with plastid
having been linked against versions different from those in
your runtime environment. A solution is to install inside a vanilla environment
in a fresh virtualenv, or to use BioConda. See instructions for either in
Installation
If install succeeds in a virtualenv, this suggests that there are in fact
multiple versions of one or more of plastid’s dependencies installed on your
system. In this case, plastid
can be used inside the virtualenv. or
BioConda environment.
Locale error when installing or running plastid
on OSX¶
This is known to occur on OSX. In this case, you should see a stack trace ending with something like:
from docutils.utils.error_reporting import locale_encoding, ErrorString, ErrorOutput
File "/Applications/anaconda/lib/python2.7/site-packages/docutils/utils/error_reporting.py", line 47, in <module>
locale_encoding = locale.getlocale()[1] or locale.getdefaultlocale()[1]
File "/Applications/anaconda/lib/python2.7/locale.py", line 543, in getdefaultlocale
return _parse_localename(localename)
File "/Applications/anaconda/lib/python2.7/locale.py", line 475, in _parse_localename
raise ValueError, 'unknown locale: %s' % localename
ValueError: unknown locale: UTF-8
Please see the workaround here.
Install fails on OSX with error code 1¶
If installing on OSX and you find an error message that resembles the following:
Command "/usr/local/opt/python/bin/python2.7 -c "import setuptools, tokenize;\
__file__='/private/var/folders/8y/xm0qbq655f1d4v20kq5vvfgm0000gq/T/pip-build-0bVdPy/pysam/setup.py';\
exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))"\
install --record /var/folders/some-folder/install-record.txt --single-version-externally-managed \
--compile --user --prefix=" failed with error code 1 in /private/var/folders/some-folder/pysam
Before installing, type:
$ export CFLAGS=-Qunused-arguments
$ export CPPFLAGS=-Qunused-arguments
and then retry.
command not found
: I can’t run any of the command line scripts¶
If you receive a command not found error from the shell, the folder containing
the command-line scripts might not be in your environment’s PATH
variable.
Command-line scripts will be installed wherever your system configuration
dictates. On OSX and many varities of linux, the install path for a single-user
install is ~/bin
or ~/.local/bin
. For system-wide installs, the path is
typically /usr/local/bin
. Make sure the appropriate location is in your
PATH
by the following line adding to your .bashrc
, .bash_profile
, or
.profile
(depending on which your system uses):
export PATH=~/bin:~/.local.bin:/usr/local/bin:$PATH
A script won’t run, reporting error: too few arguments¶
If you see the following error:
<script name>: error: too few arguments
Try re-ordering the script arguments, so that all of the required arguments
(the ones that don’t start with -
) come first. For example, change:
$ cs count --fiveprime --offset 13 --min_length 23 --max_length 35 \
--count_files ../some_file.bam some_file.positions some_sample_name
to
$ cs count some_file.positions some_sample_name \
--fiveprime --offset 13 --min_length 23 --max_length 35 \
--count_files ../some_file.bam
Alternatively, put a --
before the required options:
$ cs count --fiveprime --offset 13 --min_length 23 --max_length 35 \
--count_files ../some_file.bam \
-- some_file.positions some_sample_name
Things should then run.
I get an ImportError
or DistributionError
when using plastid
¶
If you get an error like the following:
Traceback (most recent call last):
File "/home/user/Rib_prof/venv/bin/crossmap", line 5, in <module>
from pkg_resources import load_entry_point
File "/home/user/Rib_prof/venv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2970, in <module>
working_set = WorkingSet._build_master()
File "/home/user/Rib_prof/venv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 567, in _build_master
ws.require(__requires__)
File "/home/user/Rib_prof/venv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 876, in require
needed = self.resolve(parse_requirements(requirements))
File "/home/user/Rib_prof/venv/lib/python2.7/site-packages/pkg_resources/__init__.py", line 761, in resolve
raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: scipy>=0.12.0
One or more dependencies (in this example, SciPy is not installed). Please reinstall.
Analysis¶
Can I use plastid
with reverse-complemented sequencing data, like dUTP sequencing?¶
Yes.
Kits like Illumina’s Truseq Stranded mRNA Library Prep Kit, yield reads that are anti-sense to the mRNA from which they were generated, so the data coming off the sequencer will be reverse-complemented compared to the original strand that was cloned.
To use this data in plastid
, reverse-complement your FASTQ file using the
fastx_reverse_complement
tool from the Hanon lab’s fastx toolkit. Then align the reverse-complemented
data using your favorite aligner.
Can plastid
be used with paired-end data?¶
Yes, but two points:
Because there are few nucleotide-resolution assays that used paried-end sequencing, it has been unclear what sorts of mapping functions might be useful.
If you have a suggestion for one, please submit your suggestion with a use case on our issue tracker. It would be helpful!
For simple gene expression counting, it is possible to use the
--fiveprime
(implemented inFivePrimeMapFactory
) mapping function with zero offset. Accuracy can be improved by counting a single read from each pair in which both reads are mapped.However, it is critical to retain the read that appears on the same strand as the gene from which it arose.
If the library was prepared using dUTP chemistry, as in many paired-end prep kits, select read2 from each pair using
samtools
:$ samtools view -f 129 -b -o single_from_pair.bam paired_end_file.bam
Otherwise, select read1:
$ samtools view -f 65 -b -o single_from_pair.bam paired_end_file.bam
Then use single_from_pair.bam with
plastid
as usual.
The P-site and/or Metagene scripts show few or zero read in their output¶
This occurs in datasets with few counts, because psite
and metagene
plots
the median density at each position. In this case, there are a few options:
increase the minimum counts required to be included in the metagene / P-site estimate. Set
--min_counts
argument to a high number (e.g. for a 100 nt normalization region, choose >= 100 counts)the metagene profile or P-site can be estimated from aggregate counts (as opposed to median density) at each position using the
--aggregate
argument, as shown here.This might add some noise to the data, but it should still be interpretable if the gene models are good
cs
, counts_in_region
, or some other part of plastid
reports zero counts for my gene, even though there are read alignments there¶
The default behavior for all of the scripts and tools in plastid
is to
exclude reads that are antisense to any given genomic feature when calculating
coverage over that feature.
Paired end libraries, and single-end libraries that have been prepared with dUTP sequencing or a number of other protocols will contain read alignments antisense to the original mRNAs, causing these reads to be considered antisense to genes, and therefore excluded from gene expression totals.
See Can I use plastid with reverse-complemented sequencing data, like dUTP sequencing? for instructions on how to reverse-complement your data
for single-end dUTP data; or Can plastid be used with paired-end data? for info on using
plastid
with paired-end data.
Why do some scripts report fractional count numbers?¶
Fractional counts for alignments arise when using a alignment
mapping rule that maps reads fractionally over multiple positions (such
as --center
mapping). See the discussion of Read mapping functions,
where these are discussed in depth.
Why does IGV report higher coverage at a given nucleotide than the file exported from make_wiggle
?¶
When IGV calculates coverage of a nucleotide, it counts the number of alignments covering that nucleotide. So, a 30-nucleotide read would contribute 30 counts to a dataset.
While it is possible to write any mapping rule in plastid
, the
mapping rules included by default count each read only
once (e.g. at their 5’ end, 3’ end, et c). Even when using center or entire
mapping, each position covered by a read alignment is only incremented by
\(1.0/\ell\), where \(\ell\) is the length of the read. So, in this
case, a 30-nucleotide read would only contribute 1 count to a
dataset. See Read mapping functions for more information.
What are the differences between counts_in_region
and cs
?¶
counts_in_region
very simply counts read coverage (or any
data) over regions of interest, and reports those numbers in terms of
counts and RPKM. It can optionally take a mask file, if
there are genomic positions in the regions of interest which should be excluded
from analysis. Otherwise, it makes no corrections.
cs
is more complex, and is principally designed to make
rough estimates of gene expression at the gene, rather than transcript, level.
In so doing, it makes several heuristic corrections to regions before tabulating
their counts and RPKM. Specifically:
Genes that have transcripts that share exons are merged into single entities
Gene areas are defined for each merged geen by including all positions occupied by all transcripts from that merged gene
Regions occupied by two or more merged genes on the same strand are excluded from the calculation of expression values for both genes
Optionally, a mask file can be used to exclude any other positions from analysis.
Expression values (in counts and RPKM) are tabulated for the entire gene area (reported as exon_counts and exon_rpkm) as well as for sub regions, if the gene is coding. Specifically, cds_counts and cds_rpkm are calculated from counts that cover positions in the gene area that are annotated as CDS in all transcripts in the merged gene. Ditto for 5’ and 3’ UTRs
Either one can be an appropriate starting place for a pipeline, depending upon
your needs. See the documentation and/or source code for cs
and
counts_in_region
for further discussion.
Why does SegmentChain.as_gff3()
sometimes throw errors?¶
The incredible flexibility of the GFF3 file format introduces ambiguities for
representation of discontinuous features: some sort of parent-child relationship
needs to exist, and, except in the case of transcripts, plastid
doesn’t
know which one to use.
See this advice on how to handle this.
How do I prepare data for differential gene expression analysis in DESeq?¶
See Gene expression analysis in the Tutorials section.
Tests¶
The tests won’t run¶
In order to run the tests, you need to download the test dataset and unpack
it into plastid/test/
.
We decided not to include the test data in the main package in order to keep the package download and the github repository small.