plastid.bin.crossmap module

This script empirically determines which positions in a genome yield multimapping reads under a given set of alignment parameters. These positions are saved in a BED-formatted mask file, so that they may be excluded as from further analyses.

To identify multimapping regions, a genome sequence is diced into k-mers and the k-mers aligned back to the genome. Positions in the genome that give rise to k-mers that align equally well to more than one genomic location are then marked as multimapping..

k is specified by the user, as are the alignment parameters.

Output files

The following files are made:

OUTBASE_READLENGTH_MISMATCHES_crossmap.bed

Final mask file annotation, in BED format

OUTBASE_READLENGTH_MISMATCHES_CHROMOSOME_kmers.fa

K-mers derived from chromosome CHROMOSOME. These files can be reused in subsequent runs allowing a different number of mismatches, using the --have_kmers option

where:

  • OUTBASE is a name meaningful to the user

  • READLENGTH is the k-mer length chosen by the user

  • MISMATCHES is the number of mismatches permitted during alignment, also set by the user.

Considerations for large genomes

For large genomes (e.g. vertebrate, plant, or some very big amoebas):

  • crossmap can require a ton of memory if genome sequence is stored in a fasta file. If crossmap maxes out your system’s memory, it may be terminated by your system before it completes.

    Consider converting the file to a `2bit`_ file to save memory and avoid this potential problem

  • Enabling mismatches with short read sizes will make crossmap take a lot longer, especially on large genomes. Consider using --mismatches 0 if you run into this problem

  • Using multiple processes (e.g. via -p 2) will speed crossmap’s execution time, but will increase its memory footprint, as each process will need its own memory space to create and align k-mers from chromosomal sequence

  • By default, crossmap creates BED files. Consider converting these to BigBed files will save substantial amounts of time and memory in the future.

    This can be achieved using Jim Kent’s bedToBigBed utility as follows (from the terminal):

    $ bowtie-inspect --summary BOWTIE_INDEX | grep Sequence |\
                     cut -f2,3 | sed -e "s/\([^ ]\+\).*\t/\1\t/"  >OUTFILE.sizes
    $ sort -k1,1 -k2,2n OUTBASE.bed > OUTBASE_sorted.bed
    $ bedToBigBed OUTBASE_sorted.bed OUTBASE.sizes OUTBASE_sorted.bb
    

    See https://github.com/ENCODE-DCC/kentUtils/tree/master/src/product/scripts for download & documentation of Kent utilities

class plastid.bin.crossmap.FastaNameReader(stream)[source]

Bases: plastid.util.io.filters.AbstractReader

Returns names of sequences in a fasta file

Attributes
closed

Methods

close()

Close stream

fileno()

Returns underlying file descriptor if one exists.

filter(line)

Return next sequence name in a fasta file

flush(/)

Flush write buffers, if applicable.

isatty()

Return whether this is an 'interactive' stream.

read()

Similar to file.read().

readable()

Return whether object was opened for reading.

readline()

Process a single line of data, assuming it is string-like next(self) is more likely to behave as expected.

readlines()

Similar to file.readlines().

seek

Change stream position.

seekable()

Return whether object supports random access.

tell(/)

Return current stream position.

truncate

Truncate file to size bytes.

writable()

Return whether object was opened for writing.

writelines(lines, /)

Write a list of lines to stream.

next

close()

Close stream

fileno()

Returns underlying file descriptor if one exists.

OSError is raised if the IO object does not use a file descriptor.

filter(line)[source]

Return next sequence name in a fasta file

Parameters
linestr

Line of text

Returns
str

Name of next sequence, excluding prefix ‘>’ and line terminator

flush(/)

Flush write buffers, if applicable.

This is not implemented for read-only and non-blocking streams.

isatty()

Return whether this is an ‘interactive’ stream.

Return False if it can’t be determined.

next()
read()

Similar to file.read(). Process all units of data, assuming it is string-like

Returns
str
readable()

Return whether object was opened for reading.

If False, read() will raise OSError.

readline()

Process a single line of data, assuming it is string-like next(self) is more likely to behave as expected.

Returns
object

a unit of processed data

readlines()

Similar to file.readlines().

Returns
list

processed data

seek()

Change stream position.

Change the stream position to the given byte offset. The offset is interpreted relative to the position indicated by whence. Values for whence are:

  • 0 – start of stream (the default); offset should be zero or positive

  • 1 – current stream position; offset may be negative

  • 2 – end of stream; offset is usually negative

Return the new absolute position.

seekable()

Return whether object supports random access.

If False, seek(), tell() and truncate() will raise OSError. This method may need to do a test seek().

tell(/)

Return current stream position.

truncate()

Truncate file to size bytes.

File pointer is left unchanged. Size defaults to the current IO position as reported by tell(). Returns the new size.

writable()

Return whether object was opened for writing.

If False, write() will raise OSError.

writelines(lines, /)

Write a list of lines to stream.

Line separators are not added, so it is usual for each of the lines provided to have a line separator at the end.

closed
plastid.bin.crossmap.chrom_worker(chrom_seq, args=None)[source]
plastid.bin.crossmap.fa_to_bed(toomany_fh, k, offset=0)[source]

Create a BED file indicating genomic origins of reads in a bowtie toomany file

Parameters
toomany_fhfile-like

Open filehandle to fasta-formatted toomany file from bowtie

kint

Length of k-mers

offsetint, optional

Offset from 5’ end of read at which to map read, if any (Default: 0)

Yields
SegmentChain

Plus-strand SegmentChain representing a repetitive region

SegmentChain

Minus-strand SegmentChain representing a repetitive region

plastid.bin.crossmap.main(argv=['-T', '-E', '-b', 'html', '-d', '_build/doctrees', '-D', 'language=en', '.', '_build/html'])[source]

Command-line program

Parameters
argvlist, optional

A list of command-line arguments, which will be processed as if the script were called from the command line if main() is called directly.

Default: `sys.argv[1:]`. The command-line arguments, if the script is
invoked from the command line
plastid.bin.crossmap.revcomp_mask_chain(seg, k, offset=0)[source]

Reverse-complement a single-interval mask, correcting for offset.

Parameters
segSegmentChain

Plus-strand mask, including offset

kint

Length of k-mers

offsetint, optional

Offset from 5’ end of read at which to map mask (Default: 0)

Returns
SegmentChain

Mask on minus strand corresponding to seg

plastid.bin.crossmap.simulate_reads(seq_record, fh=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>, k=30)[source]

Chops a DNA sequence into k-mers, mimicking a sequencing run. Output is delivered in fasta format. Sequences are named for position of origin using 0-based indices.

Parameters
seq_recordBio.SeqRecord.SeqRecord

DNA sequence

fhfile-like

filehandle to write output

kint, optional

length of k-mers to generate (Default: 30)