plastid.bin.crossmap module¶
This script empirically determines which positions in a genome yield multimapping reads under a given set of alignment parameters. These positions are saved in a BED-formatted mask file, so that they may be excluded as from further analyses.
To identify multimapping regions, a genome sequence is diced into k-mers and the k-mers aligned back to the genome. Positions in the genome that give rise to k-mers that align equally well to more than one genomic location are then marked as multimapping..
k is specified by the user, as are the alignment parameters.
Output files¶
The following files are made:
where:
OUTBASE is a name meaningful to the user
READLENGTH is the k-mer length chosen by the user
MISMATCHES is the number of mismatches permitted during alignment, also set by the user.
Considerations for large genomes¶
For large genomes (e.g. vertebrate, plant, or some very big amoebas):
crossmap
can require a ton of memory if genome sequence is stored in a fasta file. Ifcrossmap
maxes out your system’s memory, it may be terminated by your system before it completes.Consider converting the file to a `2bit`_ file to save memory and avoid this potential problem
Enabling mismatches with short read sizes will make
crossmap
take a lot longer, especially on large genomes. Consider using--mismatches 0
if you run into this problemUsing multiple processes (e.g. via
-p 2
) will speedcrossmap
’s execution time, but will increase its memory footprint, as each process will need its own memory space to create and align k-mers from chromosomal sequenceBy default,
crossmap
creates BED files. Consider converting these to BigBed files will save substantial amounts of time and memory in the future.This can be achieved using Jim Kent’s
bedToBigBed
utility as follows (from the terminal):$ bowtie-inspect --summary BOWTIE_INDEX | grep Sequence |\ cut -f2,3 | sed -e "s/\([^ ]\+\).*\t/\1\t/" >OUTFILE.sizes $ sort -k1,1 -k2,2n OUTBASE.bed > OUTBASE_sorted.bed $ bedToBigBed OUTBASE_sorted.bed OUTBASE.sizes OUTBASE_sorted.bbSee https://github.com/ENCODE-DCC/kentUtils/tree/master/src/product/scripts for download & documentation of Kent utilities
- class plastid.bin.crossmap.FastaNameReader(stream)[source]¶
Bases:
plastid.util.io.filters.AbstractReader
Returns names of sequences in a fasta file
- Attributes
- closed
Methods
close
()Close stream
fileno
()Returns underlying file descriptor if one exists.
filter
(line)Return next sequence name in a fasta file
flush
(/)Flush write buffers, if applicable.
isatty
()Return whether this is an 'interactive' stream.
read
()Similar to
file.read()
.readable
()Return whether object was opened for reading.
readline
()Process a single line of data, assuming it is string-like
next(self)
is more likely to behave as expected.Similar to
file.readlines()
.Change stream position.
seekable
()Return whether object supports random access.
tell
(/)Return current stream position.
Truncate file to size bytes.
writable
()Return whether object was opened for writing.
writelines
(lines, /)Write a list of lines to stream.
next
- close()¶
Close stream
- fileno()¶
Returns underlying file descriptor if one exists.
OSError is raised if the IO object does not use a file descriptor.
- filter(line)[source]¶
Return next sequence name in a fasta file
- Parameters
- linestr
Line of text
- Returns
- str
Name of next sequence, excluding prefix ‘>’ and line terminator
- flush(/)¶
Flush write buffers, if applicable.
This is not implemented for read-only and non-blocking streams.
- isatty()¶
Return whether this is an ‘interactive’ stream.
Return False if it can’t be determined.
- next()¶
- read()¶
Similar to
file.read()
. Process all units of data, assuming it is string-like- Returns
- str
- readable()¶
Return whether object was opened for reading.
If False, read() will raise OSError.
- readline()¶
Process a single line of data, assuming it is string-like
next(self)
is more likely to behave as expected.- Returns
- object
a unit of processed data
- readlines()¶
Similar to
file.readlines()
.- Returns
- list
processed data
- seek()¶
Change stream position.
Change the stream position to the given byte offset. The offset is interpreted relative to the position indicated by whence. Values for whence are:
0 – start of stream (the default); offset should be zero or positive
1 – current stream position; offset may be negative
2 – end of stream; offset is usually negative
Return the new absolute position.
- seekable()¶
Return whether object supports random access.
If False, seek(), tell() and truncate() will raise OSError. This method may need to do a test seek().
- tell(/)¶
Return current stream position.
- truncate()¶
Truncate file to size bytes.
File pointer is left unchanged. Size defaults to the current IO position as reported by tell(). Returns the new size.
- writable()¶
Return whether object was opened for writing.
If False, write() will raise OSError.
- writelines(lines, /)¶
Write a list of lines to stream.
Line separators are not added, so it is usual for each of the lines provided to have a line separator at the end.
- closed¶
- plastid.bin.crossmap.fa_to_bed(toomany_fh, k, offset=0)[source]¶
Create a BED file indicating genomic origins of reads in a bowtie
toomany
file- Parameters
- toomany_fhfile-like
Open filehandle to fasta-formatted
toomany
file from bowtie- kint
Length of k-mers
- offsetint, optional
Offset from 5’ end of read at which to map read, if any (Default: 0)
- Yields
SegmentChain
Plus-strand
SegmentChain
representing a repetitive regionSegmentChain
Minus-strand
SegmentChain
representing a repetitive region
- plastid.bin.crossmap.main(argv=['-T', '-E', '-b', 'html', '-d', '_build/doctrees', '-D', 'language=en', '.', '_build/html'])[source]¶
Command-line program
- Parameters
- argvlist, optional
A list of command-line arguments, which will be processed as if the script were called from the command line if
main()
is called directly.- Default: `sys.argv[1:]`. The command-line arguments, if the script is
- invoked from the command line
- plastid.bin.crossmap.revcomp_mask_chain(seg, k, offset=0)[source]¶
Reverse-complement a single-interval mask, correcting for offset.
- Parameters
- seg
SegmentChain
Plus-strand mask, including offset
- kint
Length of k-mers
- offsetint, optional
Offset from 5’ end of read at which to map mask (Default: 0)
- seg
- Returns
SegmentChain
Mask on minus strand corresponding to seg
- plastid.bin.crossmap.simulate_reads(seq_record, fh=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, k=30)[source]¶
Chops a DNA sequence into k-mers, mimicking a sequencing run. Output is delivered in fasta format. Sequences are named for position of origin using 0-based indices.
- Parameters
- seq_record
Bio.SeqRecord.SeqRecord
DNA sequence
- fhfile-like
filehandle to write output
- kint, optional
length of k-mers to generate (Default: 30)
- seq_record