plastid.bin.crossmap module¶

This script empirically determines which positions in a genome yield multimapping reads under a given set of alignment parameters. These positions are saved in a BED-formatted mask file, so that they may be excluded as from further analyses.

To identify multimapping regions, a genome sequence is diced into k-mers and the k-mers aligned back to the genome. Positions in the genome that give rise to k-mers that align equally well to more than one genomic location are then marked as multimapping..

k is specified by the user, as are the alignment parameters.

Output files¶

The following files are made:

OUTBASE_READLENGTH_MISMATCHES_crossmap.bed
Final mask file annotation, in BED format

OUTBASE_READLENGTH_MISMATCHES_CHROMOSOME_kmers.fa
K-mers derived from chromosome CHROMOSOME. These files can be reused in subsequent runs allowing a different number of mismatches, using the --have_kmers option

where:

OUTBASE is a name meaningful to the user

READLENGTH is the k-mer length chosen by the user

MISMATCHES is the number of mismatches permitted during alignment, also set by the user.

Considerations for large genomes¶

For large genomes (e.g. vertebrate, plant, or some very big amoebas):

crossmap can require a ton of memory if genome sequence is stored in a fasta file. If crossmap maxes out your system’s memory, it may be terminated by your system before it completes.

Consider converting the file to a `2bit`_ file to save memory and avoid this potential problem

Enabling mismatches with short read sizes will make crossmap take a lot longer, especially on large genomes. Consider using --mismatches 0 if you run into this problem

Using multiple processes (e.g. via -p 2) will speed crossmap’s execution time, but will increase its memory footprint, as each process will need its own memory space to create and align k-mers from chromosomal sequence
By default, crossmap creates BED files. Consider converting these to BigBed files will save substantial amounts of time and memory in the future.

This can be achieved using Jim Kent’s bedToBigBed utility as follows (from the terminal):
$ bowtie-inspect --summary BOWTIE_INDEX | grep Sequence |\
                 cut -f2,3 | sed -e "s/$[^ ]\+$.*\t/\1\t/"  >OUTFILE.sizes
$ sort -k1,1 -k2,2n OUTBASE.bed > OUTBASE_sorted.bed
$ bedToBigBed OUTBASE_sorted.bed OUTBASE.sizes OUTBASE_sorted.bb
See https://github.com/ENCODE-DCC/kentUtils/tree/master/src/product/scripts for download & documentation of Kent utilities

class plastid.bin.crossmap.FastaNameReader(stream)[source]¶

Bases: plastid.util.io.filters.AbstractReader

Returns names of sequences in a fasta file

Attributes

closed

Methods

`close`()	Close stream
`fileno`()	Returns underlying file descriptor if one exists.
`filter`(line)	Return next sequence name in a fasta file
`flush`(/)	Flush write buffers, if applicable.
`isatty`()	Return whether this is an 'interactive' stream.
`read`()	Similar to `file.read()`.
`readable`()	Return whether object was opened for reading.
`readline`()	Process a single line of data, assuming it is string-like `next(self)` is more likely to behave as expected.
`readlines`()	Similar to `file.readlines()`.
`seek`	Change stream position.
`seekable`()	Return whether object supports random access.
`tell`(/)	Return current stream position.
`truncate`	Truncate file to size bytes.
`writable`()	Return whether object was opened for writing.
`writelines`(lines, /)	Write a list of lines to stream.

next

close()¶: Close stream

fileno()¶

Returns underlying file descriptor if one exists.

OSError is raised if the IO object does not use a file descriptor.

filter(line)[source]¶

Return next sequence name in a fasta file

Parameters

linestr: Line of text

Returns

str: Name of next sequence, excluding prefix ‘>’ and line terminator

flush(/)¶

Flush write buffers, if applicable.

This is not implemented for read-only and non-blocking streams.

isatty()¶

Return whether this is an ‘interactive’ stream.

Return False if it can’t be determined.

next()¶

read()¶

Similar to file.read(). Process all units of data, assuming it is string-like

Returns

str

readable()¶

Return whether object was opened for reading.

If False, read() will raise OSError.

readline()¶

Process a single line of data, assuming it is string-like next(self) is more likely to behave as expected.

Returns

object: a unit of processed data

readlines()¶

Similar to file.readlines().

Returns

list: processed data

seek()¶

Change stream position.

Change the stream position to the given byte offset. The offset is interpreted relative to the position indicated by whence. Values for whence are:

0 – start of stream (the default); offset should be zero or positive
1 – current stream position; offset may be negative
2 – end of stream; offset is usually negative

Return the new absolute position.

seekable()¶

Return whether object supports random access.

If False, seek(), tell() and truncate() will raise OSError. This method may need to do a test seek().

tell(/)¶: Return current stream position.

truncate()¶

Truncate file to size bytes.

File pointer is left unchanged. Size defaults to the current IO position as reported by tell(). Returns the new size.

writable()¶

Return whether object was opened for writing.

If False, write() will raise OSError.

writelines(lines, /)¶

Write a list of lines to stream.

Line separators are not added, so it is usual for each of the lines provided to have a line separator at the end.

closed¶

plastid.bin.crossmap.chrom_worker(chrom_seq, args=None)[source]¶

plastid.bin.crossmap.fa_to_bed(toomany_fh, k, offset=0)[source]¶

Create a BED file indicating genomic origins of reads in a bowtie toomany file

Parameters

toomany_fhfile-like: Open filehandle to fasta-formatted toomany file from bowtie
kint: Length of k-mers
offsetint, optional: Offset from 5’ end of read at which to map read, if any (Default: 0)

Yields

SegmentChain: Plus-strand SegmentChain representing a repetitive region
SegmentChain: Minus-strand SegmentChain representing a repetitive region

plastid.bin.crossmap.main(argv=['-T', '-E', '-b', 'html', '-d', '_build/doctrees', '-D', 'language=en', '.', '_build/html'])[source]¶

Command-line program

Parameters

argvlist, optional: A list of command-line arguments, which will be processed as if the script were called from the command line if main() is called directly.
Default: `sys.argv[1:]`. The command-line arguments, if the script is
invoked from the command line

plastid.bin.crossmap.revcomp_mask_chain(seg, k, offset=0)[source]¶

Reverse-complement a single-interval mask, correcting for offset.

Parameters

segSegmentChain: Plus-strand mask, including offset
kint: Length of k-mers
offsetint, optional: Offset from 5’ end of read at which to map mask (Default: 0)

Returns

SegmentChain: Mask on minus strand corresponding to seg

plastid.bin.crossmap.simulate_reads(seq_record, fh=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, k=30)[source]¶

Chops a DNA sequence into k-mers, mimicking a sequencing run. Output is delivered in fasta format. Sequences are named for position of origin using 0-based indices.

Parameters

seq_recordBio.SeqRecord.SeqRecord: DNA sequence
fhfile-like: filehandle to write output
kint, optional: length of k-mers to generate (Default: 30)