plastid.bin.crossmap module

This script empirically determines which positions in a genome yield multimapping reads under a given set of alignment parameters. These positions are saved in a BED-formatted mask file, so that they may be excluded as from further analyses.

To identify multimapping regions, a genome sequence is diced into k-mers and the k-mers aligned back to the genome. Positions in the genome that give rise to k-mers that align equally well to more than one genomic location are then marked as multimapping..

k is specified by the user, as are the alignment parameters.

Output files

The following files are made:

OUTBASE_READLENGTH_MISMATCHES_crossmap.bed
Final mask file annotation, in BED format
OUTBASE_READLENGTH_MISMATCHES_CHROMOSOME_kmers.fa
K-mers derived from chromosome CHROMOSOME. These files can be reused in subsequent runs allowing a different number of mismatches, using the --have_kmers option

where:

  • OUTBASE is a name meaningful to the user
  • READLENGTH is the k-mer length chosen by the user
  • MISMATCHES is the number of mismatches permitted during alignment, also set by the user.

Considerations for large genomes

For large genomes (e.g. vertebrate, plant, or some very big amoebas):

  • crossmap can require a ton of memory if genome sequence is stored in a fasta file. If crossmap maxes out your system’s memory, it may be terminated by your system before it completes.

    Consider converting the file to a `2bit`_ file to save memory and avoid this potential problem

  • Enabling mismatches with short read sizes will make crossmap take a lot longer, especially on large genomes. Consider using --mismatches 0 if you run into this problem

  • Using multiple processes (e.g. via -p 2) will speed crossmap’s execution time, but will increase its memory footprint, as each process will need its own memory space to create and align k-mers from chromosomal sequence

  • By default, crossmap creates BED files. Consider converting these to BigBed files will save substantial amounts of time and memory in the future.

    This can be achieved using Jim Kent’s bedToBigBed utility as follows (from the terminal):

    $ bowtie-inspect --summary BOWTIE_INDEX | grep Sequence |\
                     cut -f2,3 | sed -e "s/\([^ ]\+\).*\t/\1\t/"  >OUTFILE.sizes
    $ sort -k1,1 -k2,2n OUTBASE.bed > OUTBASE_sorted.bed
    $ bedToBigBed OUTBASE_sorted.bed OUTBASE.sizes OUTBASE_sorted.bb
    

    See https://github.com/ENCODE-DCC/kentUtils/tree/master/src/product/scripts for download & documentation of Kent utilities


Command-line arguments

Positional arguments

Argument Description
ebwt Bowtie index of genome against which crossmap will be made. In most cases, should be generated from the same sequences that are in sequence_file.
outbase Basename for output files

Optional arguments

Argument Description
-h, --help show this help message and exit
-k  READ_LENGTH K-mer length to generate from input file. (Default: 29)
--offset  OFFSET Offset from 5’ end of plus-strand read at which to attribute score (Default: 14)
--mismatches  N Number of mismatches tolerated in alignment. (Default: 0)
--bowtie  BOWTIE Location of bowtie binary (Default: /usr/local/bin/bowtie)
--have_kmers If specified, use k-mer files from previous run. In this case ‘sequence_file’ should be the value ‘outbase’ from the k-mer files you want to use.
--save_kmers Save k-mer files for reuse in a subsequent run.
-p  N, --processes  N Number of processes to use (should be <= number of chromosomes

Warning/error options

Argument Description
-q, --quiet Suppress all warning messages. Cannot use with ‘-v’.
-v, --verbose Increase verbosity. With ‘-v’, show every warning. With ‘-vv’, turn warnings into exceptions. Cannot use with ‘-q’. (Default: show each type of warning once)

Sequence options

Argument Description
--sequence_file  infile.[fasta | fastq | twobit | genbank | embl] A file of DNA sequence
--sequence_format  {fasta,fastq,twobit,genbank,embl} Format of sequence_file (Default: fasta).

Script contents

class plastid.bin.crossmap.FastaNameReader(stream)[source]

Bases: plastid.util.io.filters.AbstractReader

Returns names of sequences in a fasta file

Attributes:
closed

Methods

close() Close stream
filter(line) Return next sequence name in a fasta file
flush Flush write buffers, if applicable.
read() Similar to file.read().
readline() Process a single line of data, assuming it is string-like next(self) is more likely to behave as expected.
readlines() Similar to file.readlines().
seek Change stream position.
tell Return current stream position.
truncate Truncate file to size bytes.
fileno  
isatty  
next  
readable  
seekable  
writable  
writelines  
close()

Close stream

fileno()

Returns underlying file descriptor if one exists.

An IOError is raised if the IO object does not use a file descriptor.

filter(line)[source]

Return next sequence name in a fasta file

Parameters:
line : str

Line of text

Returns:
str

Name of next sequence, excluding prefix ‘>’ and line terminator

flush()

Flush write buffers, if applicable.

This is not implemented for read-only and non-blocking streams.

isatty()

Return whether this is an ‘interactive’ stream.

Return False if it can’t be determined.

next() → the next value, or raise StopIteration
read()

Similar to file.read(). Process all units of data, assuming it is string-like

Returns:
str
readable()

Return whether object was opened for reading.

If False, read() will raise IOError.

readline()

Process a single line of data, assuming it is string-like next(self) is more likely to behave as expected.

Returns:
object

a unit of processed data

readlines()

Similar to file.readlines().

Returns:
list

processed data

seek()

Change stream position.

Change the stream position to the given byte offset. The offset is interpreted relative to the position indicated by whence. Values for whence are:

  • 0 – start of stream (the default); offset should be zero or positive
  • 1 – current stream position; offset may be negative
  • 2 – end of stream; offset is usually negative

Return the new absolute position.

seekable()

Return whether object supports random access.

If False, seek(), tell() and truncate() will raise IOError. This method may need to do a test seek().

tell()

Return current stream position.

truncate()

Truncate file to size bytes.

File pointer is left unchanged. Size defaults to the current IO position as reported by tell(). Returns the new size.

writable()

Return whether object was opened for writing.

If False, read() will raise IOError.

writelines()
closed
plastid.bin.crossmap.chrom_worker(chrom_seq, args=None)[source]
plastid.bin.crossmap.fa_to_bed(toomany_fh, k, offset=0)[source]

Create a BED file indicating genomic origins of reads in a bowtie toomany file

Parameters:
toomany_fh : file-like

Open filehandle to fasta-formatted toomany file from bowtie

k : int

Length of k-mers

offset : int, optional

Offset from 5’ end of read at which to map read, if any (Default: 0)

Yields:
|SegmentChain|

Plus-strand SegmentChain representing a repetitive region

|SegmentChain|

Minus-strand SegmentChain representing a repetitive region

plastid.bin.crossmap.main(argv=['-T', '-E', '-b', 'readthedocs', '-d', '_build/doctrees-readthedocs', '-D', 'language=en', '.', '_build/html'])[source]

Command-line program

Parameters:
argv : list, optional

A list of command-line arguments, which will be processed as if the script were called from the command line if main() is called directly.

Default: sys.argv[1:]. The command-line arguments, if the script is invoked from the command line

plastid.bin.crossmap.revcomp_mask_chain(seg, k, offset=0)[source]

Reverse-complement a single-interval mask, correcting for offset.

Parameters:
seg : SegmentChain

Plus-strand mask, including offset

k : int

Length of k-mers

offset : int, optional

Offset from 5’ end of read at which to map mask (Default: 0)

Returns:
|SegmentChain|

Mask on minus strand corresponding to seg

plastid.bin.crossmap.simulate_reads(seq_record, fh=<open file '<stdout>', mode 'w'>, k=30)[source]

Chops a DNA sequence into k-mers, mimicking a sequencing run. Output is delivered in fasta format. Sequences are named for position of origin using 0-based indices.

Parameters:
seq_record : Bio.SeqRecord.SeqRecord

DNA sequence

fh : file-like

filehandle to write output

k : int, optional

length of k-mers to generate (Default: 30)