Excluding (masking) regions of the genome¶
In genomics, specific regions of the genome – e.g. paralogs, transposons and repetitive sequences – are prone to yield missing or unreliable data, because reads from high-throughput sequencing experiments will multimap equally well to each instance of the repeated sequence, creating ambiguity in the data.
In this tutorial we discuss how to exclude or mask regions of the genome from analysis using the Demo dataset. Masking is discussed in the following sections:
First, let’s load a transcripts, get its CDS, and create masks that cover the first and last 5 codons:
>>> import numpy >>> from plastid import * # load transcripts and count data >>> alignments = BAMGenomeArray("SRR609197_riboprofile_5hr_rep1.bam",mapping=FivePrimeMapFactory(offset=14)) >>> transcripts = list(BED_Reader("merlin_orfs.bed",return_type=Transcript)) # Get coding region using get_cds() >>> demo_cds = transcripts.get_cds() # Create SegmentChains covering the first and last 5 codons. # We'll use these as masks >>> start_codon_mask = list(demo_cds.get_subchain(0,15)) >>> stop_codon_mask = list(demo_cds.get_subchain(demo_cds.length-15,demo_cds.length))
# Apply the masks to the demo_cds >>> demo_cds.add_masks(*start_codon_mask) >>> demo_cds.add_masks(*stop_codon_mask) # save masks to a BED file >>> fout = open("merlin_start_codon_masks.bed","w") >>> for mask in start_codon_mask: >>> fout.write(SegmentChain(mask).as_bed()) >>> >>> fout.close()
After masks are added, we can get a masked count vector by calling
get_masked_counts(). This method
numpy.ma.MaskedArray, rather than a
MaskedArray objects because they contain all the values,
but ignore masked values when performing operations:
# count reads, excluding those mapping to masked positions >>> demo_cds.get_masked_counts(alignments).sum() 53.0
# count all reads >>> demo_cds.get_counts(alignments).sum() 67.0
Masked positions are also excluded from length measurements, if and only if
get_masked_length() is called:
>>> demo_cds.masked_length # length, excluding masked nucleotides 213 >>> demo_cds.length # total length 243
>>> demo_cds.mask_segments [<GenomicSegment merlin:14615-14630 strand='+'>, <GenomicSegment merlin:14843-14858 strand='+'>] >>> demo_cds.get_masks_as_segmentchain() <SegmentChain segments=2 bounds=merlin:14615-14858(+) name=merlin:14615-14630^14843-14858(+)>
# get list of masks >>> mask_features = list(BED_Reader("merlin_start_codon_masks.bed")) # use GenomeHash to index masks >>> mask_hash = GenomeHash(mask_features)
We’ll retrieve all the masks in mask_hash that overlap demo_cds by using it as a dictionary key:
# find masks >>> demo_masks = mask_hash[demo_cds] >>> demo_masks [<SegmentChain segments=1 bounds=merlin:14615-14630(+) name=merlin:14615-14630(+)>] # add to demo_cds >>> for mask_chain in demo_masks: >>> demo_cds.add_masks(*mask_chain)
# create metagene file that excludes regions in mask_file.bed $ metagene generate outbase --landmark cds_start \ --annotation_files annotation_file.gtf \ --mask_annotation_files mask_file.bed \ --mask_annotation_format BED
- A genome is diced into pseudo-reads (k-mers) of a given length. The length of the pseudo-read is chosen to conservatively approximate the expected read length from a high-throughput sequencing experiment. So, for a ribosome profiling experiment that typically produces 27- to 32-mers, one might choose k to be 25 or 30.
- The pseudo-reads are realigned to the genome sequence, permitting a user-configurable number of mismatches. Again, the number of mismatches should be chosen to conservatively reflect the number of mismatches that will be permitted when the sequencing data is aligned.
- The number of times each pseudo-read aligns is counted. When a pseudo-read multimaps equally well to more than a single location, the genomic position that gave rise to that pseudo-read is annotated as repetitive under the given value for k and number of mismatches.
- Repetitive regions are saved in BED format.
$ bowtie-build merlin_NC006273-2.fa merlin_NC006273-2
Then, run the script. We’ll use 26-mers and a 12-nucleotide P-site offset, allowing 2 mismatches during alignment:
$ crossmap -k 26 --offset 12 --mismatches 2 \ merlin_NC006273-2.fa \ merlin_NC006273-2 \ merlin_NC006273-2
For large genomes (e.g. vertebrate, plant, or some very big amoebas):
Consider converting the file to a `2bit`_ file to save memory and avoid this potential problem
crossmapcan take several days to run, especially if mismatches are allowed. Consider using
--mismatches 0if you run into this problem
Using more processes (e.g. via
-p 2) will speed
crossmap’s runtime, but will increase its memory footprint, as each process will need its own memory space to create and align k-mers from chromosomal sequence
crossmapcreates BED files. Consider converting these to BigBed files will save substantial amounts of time and memory in the future. For instructions, see the documentation for Jim Kent’s utilities