Excluding (masking) regions of the genome

It is often desirable to exclude missing or unreliable data from analysis. For this purpose, numpy and SciPy offer masking tools for numerical data in the modules numpy.ma and scipy.stats.mstats.

In genomics, specific regions of the genome – e.g. paralogs, transposons and repetitive sequences – are prone to yield missing or unreliable data, because reads from high-throughput sequencing experiments will multimap equally well to each instance of the repeated sequence, creating ambiguity in the data.

In this tutorial we discuss how to exclude or mask regions of the genome from analysis using the Demo dataset. Masking is discussed in the following sections:

Manual masking

Creating & adding masks

Like all genomic features in Plastid, masks are represented as SegmentChains, which can be used to mask other SegmentChain or Transcript objects.

First, let’s load a transcripts, get its CDS, and create masks that cover the first and last 5 codons:

>>> import numpy
>>> from plastid import *

# load transcripts and count data
>>> alignments = BAMGenomeArray("SRR609197_riboprofile_5hr_rep1.bam",mapping=FivePrimeMapFactory(offset=14))
>>> transcripts = list(BED_Reader("merlin_orfs.bed",return_type=Transcript))

# Get coding region using get_cds()
>>> demo_cds = transcripts[39].get_cds()

# Create SegmentChains covering the first and last 5 codons.
# We'll use these as masks
>>> start_codon_mask = list(demo_cds.get_subchain(0,15))
>>> stop_codon_mask  = list(demo_cds.get_subchain(demo_cds.length-15,demo_cds.length))

The SegmentChains we just created can be applied as masks using add_masks():

# Apply the masks to the demo_cds
>>> demo_cds.add_masks(*start_codon_mask)
>>> demo_cds.add_masks(*stop_codon_mask)

# save masks to a BED file
>>> fout = open("merlin_start_codon_masks.bed","w")
>>> for mask in start_codon_mask:
>>>     fout.write(SegmentChain(mask).as_bed())
>>>
>>> fout.close()

Uses of masks

After masks are added, we can get a masked count vector by calling get_masked_counts(). This method returns a numpy.ma.MaskedArray, rather than a numpy.ndarray. MaskedArray objects because they contain all the values, but ignore masked values when performing operations:

# count reads, excluding those mapping to masked positions
>>> demo_cds.get_masked_counts(alignments).sum()
53.0

Calling get_counts() after adding masks will still return an unmasked numpy.ndarray:

# count all reads
>>> demo_cds.get_counts(alignments).sum()
67.0

Masked positions are also excluded from length measurements, if and only if get_masked_length() is called:

>>> demo_cds.masked_length # length, excluding masked nucleotides
213

>>> demo_cds.length # total length
243

We can also retrieve masks that have been added to a SegmentChain, either as a list of GenomicSegments or as a SegmentChain:

>>> demo_cds.mask_segments
[<GenomicSegment merlin:14615-14630 strand='+'>,
 <GenomicSegment merlin:14843-14858 strand='+'>]

>>> demo_cds.get_masks_as_segmentchain()
<SegmentChain segments=2 bounds=merlin:14615-14858(+) name=merlin:14615-14630^14843-14858(+)>

Using mask files

Mask files annotate genomic regions that should be masked from analysis. As any annotation file, these can be in any of many formats (e.g. BED, BigBed, GFF3, or others).

In interactive Python sessions

Mask files can be loaded into a GenomeHash, a dictionary-like object that indexes features by their locations in the genome. To create a GenomeHash:

# get list of masks
>>> mask_features = list(BED_Reader("merlin_start_codon_masks.bed"))

# use GenomeHash to index masks
>>> mask_hash = GenomeHash(mask_features)

We’ll retrieve all the masks in mask_hash that overlap demo_cds by using it as a dictionary key:

# find masks
>>> demo_masks = mask_hash[demo_cds]
>>> demo_masks
[<SegmentChain segments=1 bounds=merlin:14615-14630(+) name=merlin:14615-14630(+)>]

# add to demo_cds
>>> for mask_chain in demo_masks:
>>>    demo_cds.add_masks(*mask_chain)

If the mask file is very large, it should be converted to an indexed file format such as BigBed to save memory.

Indexed annotation files can instead be loaded into BigBedGenomeHash and TabixGenomeHash, which take advantage of the indexes present in BigBed and tabix-compressed files.

In command-line scripts

Mask files can be used by command-line scripts using the argument --mask_annotation_files. For example:

# create metagene file that excludes regions in mask_file.bed
$ metagene generate outbase
                    --landmark cds_start \
                    --annotation_files annotation_file.gtf \
                    --mask_annotation_files mask_file.bed \
                    --mask_annotation_format BED

Creating a mask file using the crossmap script

The crossmap script empirically annotates genomic regions that multimap under various alignment criteria, and saves these as a mask file.

Algorithm

crossmap uses the following approach (adapted from [IGNW09]):

  1. A genome is diced into pseudo-reads (k-mers) of a given length. The length of the pseudo-read is chosen to conservatively approximate the expected read length from a high-throughput sequencing experiment. So, for a ribosome profiling experiment that typically produces 27- to 32-mers, one might choose k to be 25 or 30.

  2. The pseudo-reads are realigned to the genome sequence, permitting a user-configurable number of mismatches. Again, the number of mismatches should be chosen to conservatively reflect the number of mismatches that will be permitted when the sequencing data is aligned.

  3. The number of times each pseudo-read aligns is counted. When a pseudo-read multimaps equally well to more than a single location, the genomic position that gave rise to that pseudo-read is annotated as repetitive under the given value for k and number of mismatches.

  4. Repetitive regions are saved in BED format.

Running

Because crossmap internally uses bowtie for alignments, bowtie must be installed on your system. Once it is, use bowtie-build to build an index of your genome. From the terminal:

$ bowtie-build merlin_NC006273-2.fa merlin_NC006273-2

Then, run the script. We’ll use 26-mers and a 12-nucleotide P-site offset, allowing 2 mismatches during alignment:

$ crossmap -k 26 --offset 12 --mismatches 2 \
           merlin_NC006273-2.fa \
           merlin_NC006273-2 \
           merlin_NC006273-2

Considerations for large genomes

For large genomes (e.g. vertebrate, plant, or some very big amoebas):

  • crossmap can require a ton of memory if genome sequence is stored in a fasta file. If crossmap maxes out your system’s memory, it may be terminated by your system before it completes.

    Consider converting the file to a `2bit`_ file to save memory and avoid this potential problem

  • crossmap can take several days to run, especially if mismatches are allowed. Consider using --mismatches 0 if you run into this problem

  • Using more processes (e.g. via -p 2) will speed crossmap’s runtime, but will increase its memory footprint, as each process will need its own memory space to create and align k-mers from chromosomal sequence

  • By default, crossmap creates BED files. Consider converting these to BigBed files will save substantial amounts of time and memory in the future. For instructions, see the documentation for Jim Kent’s utilities


See also