Ambiguous read alignments

Some high-throughput sequencing reads align equally well to multiple parts of a genome or transcriptome; such reads are said to multimap. This can occur when a read derives from repeated sequence, such as a duplicated gene, transposon, or pseudogene, or from repetitive sequence, such as telomeres or heterochromatin.

In the absence of other information, multimapping reads cannot unambiguously be assigned to a single position of origin. Various approaches have been developed to handle this:

  • discarding multimappers from alignment, and excluding duplicated genomic regions from analysis using a mask file (as in [IGNW09] and [DFB+13])
  • counting each copy of repetitive sequence (e.g. each copy of a transposon) as a single entity and summing all read alignments across all copies before calculating read density
  • randomly assigning each multimapper to one of the possible places in a genome or transcriptome from which it could have arisen (the default behavior of the TopHat aligner)
  • using uniquely mapping reads surrounding each copy of repeated sequence to determine the proportions of multimapping reads that should be assigned to each copy
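
The last two strategies in the list above can be illustrated with a minimal sketch. It is not part of plastid and does not reflect how any particular aligner implements these ideas. The function names, the locus names, and the toy counts are all hypothetical, and each multimapper is assumed to be represented simply as a tuple of the loci to which it aligns equally well.

```python
import random
from collections import Counter


def assign_randomly(multimappers, rng):
    """Assign each multimapping read to one candidate locus, chosen uniformly at random."""
    counts = Counter()
    for candidate_loci in multimappers:
        counts[rng.choice(candidate_loci)] += 1
    return counts


def assign_proportionally(multimappers, unique_counts):
    """Split each multimapping read among its candidate loci, weighted by the
    number of uniquely mapping reads observed near each locus."""
    counts = Counter()
    for candidate_loci in multimappers:
        total = sum(unique_counts.get(locus, 0) for locus in candidate_loci)
        for locus in candidate_loci:
            if total > 0:
                weight = unique_counts.get(locus, 0) / total
            else:
                # No unique reads near any candidate: fall back to an even split
                weight = 1 / len(candidate_loci)
            counts[locus] += weight
    return counts


# Hypothetical toy data: unique read counts flanking a gene and its paralog,
# plus eight reads that align equally well to both
unique_counts = {"geneA": 30, "geneA_paralog": 10}
multimappers = [("geneA", "geneA_paralog")] * 8

random_counts = assign_randomly(multimappers, random.Random(0))
proportional_counts = assign_proportionally(multimappers, unique_counts)
```

With these toy numbers, the proportional strategy gives three quarters of the multimapping reads to geneA, because geneA has three times as many uniquely mapping reads nearby. The random strategy conserves the total number of reads, but the split between the two loci varies with the random seed.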

plastid is compatible with any of these approaches, but provides tools specifically for masking repetitive genomic regions from analysis. For further information on this strategy, see Excluding (masking) regions of the genome.
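
The effect of masking on a read density calculation can be sketched in a few lines of plain Python. This is an illustration of the idea only and deliberately does not use plastid's own API. The function name and the per-position counts are hypothetical, and the mask is assumed to be a list of booleans in which True marks a position that falls in a repetitive region.

```python
def masked_read_density(counts, mask):
    """Return mean reads per position, ignoring positions flagged in the mask."""
    kept = [count for count, is_masked in zip(counts, mask) if not is_masked]
    if not kept:
        # Every position is masked, so no density can be estimated
        return float("nan")
    return sum(kept) / len(kept)


# Hypothetical counts across six positions; the two middle positions lie in a
# repeated region and have inflated counts from multimapping reads
counts = [4, 5, 40, 42, 6, 5]
mask = [False, False, True, True, False, False]

unmasked_density = masked_read_density(counts, [False] * len(counts))
density = masked_read_density(counts, mask)
```

Here the masked estimate uses only the four positions outside the repeat, so the inflated counts in the middle no longer dominate the average.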

See also