plastid.bin.crossmap module¶
This script empirically determines which positions in a genome yield multimapping reads under a given set of alignment parameters. These positions are saved in a BED-formatted mask file, so that they may be excluded as from further analyses.
To identify multimapping regions, a genome sequence is diced into k-mers and the k-mers aligned back to the genome. Positions in the genome that give rise to k-mers that align equally well to more than one genomic location are then marked as multimapping..
k is specified by the user, as are the alignment parameters.
Output files¶
The following files are made:
where:
OUTBASE is a name meaningful to the user
READLENGTH is the k-mer length chosen by the user
MISMATCHES is the number of mismatches permitted during alignment, also set by the user.
Considerations for large genomes¶
For large genomes (e.g. vertebrate, plant, or some very big amoebas):
crossmapcan require a ton of memory if genome sequence is stored in a fasta file. Ifcrossmapmaxes out your system’s memory, it may be terminated by your system before it completes.Consider converting the file to a `2bit`_ file to save memory and avoid this potential problem
Enabling mismatches with short read sizes will make
crossmaptake a lot longer, especially on large genomes. Consider using--mismatches 0if you run into this problemUsing multiple processes (e.g. via
-p 2) will speedcrossmap’s execution time, but will increase its memory footprint, as each process will need its own memory space to create and align k-mers from chromosomal sequenceBy default,
crossmapcreates BED files. Consider converting these to BigBed files will save substantial amounts of time and memory in the future.This can be achieved using Jim Kent’s
bedToBigBedutility as follows (from the terminal):$ bowtie-inspect --summary BOWTIE_INDEX | grep Sequence |\ cut -f2,3 | sed -e "s/\([^ ]\+\).*\t/\1\t/" >OUTFILE.sizes $ sort -k1,1 -k2,2n OUTBASE.bed > OUTBASE_sorted.bed $ bedToBigBed OUTBASE_sorted.bed OUTBASE.sizes OUTBASE_sorted.bbSee https://github.com/ENCODE-DCC/kentUtils/tree/master/src/product/scripts for download & documentation of Kent utilities
- class plastid.bin.crossmap.FastaNameReader(stream)[source]¶
Bases:
plastid.util.io.filters.AbstractReaderReturns names of sequences in a fasta file
- Attributes
- closed
Methods
close()Close stream
fileno()Returns underlying file descriptor if one exists.
filter(line)Return next sequence name in a fasta file
flush(/)Flush write buffers, if applicable.
isatty()Return whether this is an 'interactive' stream.
read()Similar to
file.read().readable()Return whether object was opened for reading.
readline()Process a single line of data, assuming it is string-like
next(self)is more likely to behave as expected.Similar to
file.readlines().Change stream position.
seekable()Return whether object supports random access.
tell(/)Return current stream position.
Truncate file to size bytes.
writable()Return whether object was opened for writing.
writelines(lines, /)Write a list of lines to stream.
next
- close()¶
Close stream
- fileno()¶
Returns underlying file descriptor if one exists.
OSError is raised if the IO object does not use a file descriptor.
- filter(line)[source]¶
Return next sequence name in a fasta file
- Parameters
- linestr
Line of text
- Returns
- str
Name of next sequence, excluding prefix ‘>’ and line terminator
- flush(/)¶
Flush write buffers, if applicable.
This is not implemented for read-only and non-blocking streams.
- isatty()¶
Return whether this is an ‘interactive’ stream.
Return False if it can’t be determined.
- next()¶
- read()¶
Similar to
file.read(). Process all units of data, assuming it is string-like- Returns
- str
- readable()¶
Return whether object was opened for reading.
If False, read() will raise OSError.
- readline()¶
Process a single line of data, assuming it is string-like
next(self)is more likely to behave as expected.- Returns
- object
a unit of processed data
- readlines()¶
Similar to
file.readlines().- Returns
- list
processed data
- seek()¶
Change stream position.
Change the stream position to the given byte offset. The offset is interpreted relative to the position indicated by whence. Values for whence are:
0 – start of stream (the default); offset should be zero or positive
1 – current stream position; offset may be negative
2 – end of stream; offset is usually negative
Return the new absolute position.
- seekable()¶
Return whether object supports random access.
If False, seek(), tell() and truncate() will raise OSError. This method may need to do a test seek().
- tell(/)¶
Return current stream position.
- truncate()¶
Truncate file to size bytes.
File pointer is left unchanged. Size defaults to the current IO position as reported by tell(). Returns the new size.
- writable()¶
Return whether object was opened for writing.
If False, write() will raise OSError.
- writelines(lines, /)¶
Write a list of lines to stream.
Line separators are not added, so it is usual for each of the lines provided to have a line separator at the end.
- closed¶
- plastid.bin.crossmap.fa_to_bed(toomany_fh, k, offset=0)[source]¶
Create a BED file indicating genomic origins of reads in a bowtie
toomanyfile- Parameters
- toomany_fhfile-like
Open filehandle to fasta-formatted
toomanyfile from bowtie- kint
Length of k-mers
- offsetint, optional
Offset from 5’ end of read at which to map read, if any (Default: 0)
- Yields
SegmentChainPlus-strand
SegmentChainrepresenting a repetitive regionSegmentChainMinus-strand
SegmentChainrepresenting a repetitive region
- plastid.bin.crossmap.main(argv=['-T', '-E', '-b', 'html', '-d', '_build/doctrees', '-D', 'language=en', '.', '_build/html'])[source]¶
Command-line program
- Parameters
- argvlist, optional
A list of command-line arguments, which will be processed as if the script were called from the command line if
main()is called directly.- Default: `sys.argv[1:]`. The command-line arguments, if the script is
- invoked from the command line
- plastid.bin.crossmap.revcomp_mask_chain(seg, k, offset=0)[source]¶
Reverse-complement a single-interval mask, correcting for offset.
- Parameters
- seg
SegmentChain Plus-strand mask, including offset
- kint
Length of k-mers
- offsetint, optional
Offset from 5’ end of read at which to map mask (Default: 0)
- seg
- Returns
SegmentChainMask on minus strand corresponding to seg
- plastid.bin.crossmap.simulate_reads(seq_record, fh=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, k=30)[source]¶
Chops a DNA sequence into k-mers, mimicking a sequencing run. Output is delivered in fasta format. Sequences are named for position of origin using 0-based indices.
- Parameters
- seq_record
Bio.SeqRecord.SeqRecord DNA sequence
- fhfile-like
filehandle to write output
- kint, optional
length of k-mers to generate (Default: 30)
- seq_record