Coordinate systems used in genomics

plastid's readers automatically convert coordinates from any of the supported file formats into a 0-indexed and half-open space (i.e. following typical Python convention), so users don’t need to worry about off-by-one errors in their annotations.

Nonetheless, this tutorial describes various coordinate representations used in genomics:

Coordinates

Genomic coordinates are typically specified as a set of:

  • a chromosome name

  • a start position

  • an end position

  • a chromosome strand:

    • ‘+’ for the forward strand

    • ‘-’ for the reverse strand

    • ‘.’ for both strands / unstranded features

This gives rise to several non-obvious considerations:

start ≤ end

In the vast majority of annotation formats, the start coordinate refers to the lowest-numbered (i.e. leftmost, chromosome-wise) coordinate relative to the genome rather than the feature. So, for reverse-strand features, the start coordinate actually denotes the 3’ end of the feature, while the end coordinate denotes the 5’ end.
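As a minimal sketch of this rule, the function below returns the position of a feature's 5’ end. It assumes 0-indexed, half-open coordinates (the convention plastid reports), and the function name is hypothetical, not part of plastid's API.

```python
def five_prime_position(start: int, end: int, strand: str) -> int:
    """Return the 0-indexed position of a feature's 5' end.

    Assumes 0-indexed, half-open coordinates with start < end.
    Hypothetical helper for illustration only.
    """
    if start >= end:
        raise ValueError("expected start < end")
    if strand == "+":
        # Forward strand: the 5' end is the leftmost position.
        return start
    if strand == "-":
        # Reverse strand: the 5' end is the rightmost position,
        # i.e. the last position inside the half-open interval.
        return end - 1
    raise ValueError("5' end is undefined for unstranded features")


print(five_prime_position(11, 17, "+"))  # 11
print(five_prime_position(11, 17, "-"))  # 16
```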

Counting from 0 vs 1

Coordinate systems can start counting from 0 (i.e. are 0-indexed) or from 1 (1-indexed). Suppose we have an XbaI restriction site on chromosome ChrI:

                           XbaI
                          ______
ChrI:         ACCGATGCTAGCTCTAGACTACATCTACTCCGTCGTCTAGCATGATGCTAGCTGAC
              |          |^^^^^^     |          |          |
0-index:      0          10          20         30         40
1-index:      1          11          21         31         41

In 0-indexed representation, the restriction site begins at position 11. In 1-indexed representation, it begins at position 12.

In the context of genomics, both 0-indexed and 1-indexed systems are used, depending upon file format. plastid knows which file formats use which representation, and automatically converts all coordinates to a 0-indexed representation, following Python idioms.
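Converting a single position between the two indexing schemes is a shift by one. The sketch below uses hypothetical helper names and is not how plastid implements its readers.

```python
def to_one_indexed(pos0: int) -> int:
    """Convert a 0-indexed position to a 1-indexed position."""
    if pos0 < 0:
        raise ValueError("0-indexed positions start at 0")
    return pos0 + 1


def to_zero_indexed(pos1: int) -> int:
    """Convert a 1-indexed position to a 0-indexed position."""
    if pos1 < 1:
        raise ValueError("1-indexed positions start at 1")
    return pos1 - 1


# Start of the XbaI site, using the values given in the text.
print(to_one_indexed(11))   # 12
print(to_zero_indexed(12))  # 11
```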

Half-open vs fully-closed coordinates

Similarly, coordinate systems can represent end coordinates in two ways:

  1. In a fully-closed or end-inclusive coordinate system, positions are inclusive: the end coordinate corresponds to the last position IN the feature.

    So, in 0-indexed, fully-closed representation, the XbaI site would start at position 11, and end at position 16:

                               XbaI
                              ______
    ChrI:         ACCGATGCTAGCTCTAGACTACATCTACTCCGTCGTCTAGCATGATGCTAGCTGAC
                  |           ^^^^^^     |          |          |
    0-index:      0           |    |     20         30         40
                              |    |
    Start & end:              11   16
    

    And the length of the feature equals:

    \[\ell = \text{end} - \text{start} + 1 = 16 - 11 + 1 = 6\]

  2. In contrast, in a half-open coordinate system, the end coordinate is defined as the first position NOT included in the feature. In a 0-indexed, half-open representation, the XbaI site starts at position 11 and ends at position 17. In this case, the length of the feature equals:

    \[\ell = \text{end} - \text{start} = 17 - 11 = 6\]
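The two length formulas, and the conversion between the two end conventions, can be sketched as follows. The function names are illustrative assumptions; only the arithmetic comes from the text.

```python
def length_fully_closed(start: int, end: int) -> int:
    """Length when `end` is the last position IN the feature."""
    return end - start + 1


def length_half_open(start: int, end: int) -> int:
    """Length when `end` is the first position NOT in the feature."""
    return end - start


def closed_to_half_open(start: int, end: int) -> tuple[int, int]:
    """Shift only the end coordinate; the start is unaffected."""
    return start, end + 1


def half_open_to_closed(start: int, end: int) -> tuple[int, int]:
    """Inverse of closed_to_half_open."""
    return start, end - 1


print(length_fully_closed(11, 16))  # 6
print(length_half_open(11, 17))     # 6
print(closed_to_half_open(11, 16))  # (11, 17)
```

Both conventions describe the same six positions; only the meaning of the end coordinate changes.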

Four possible coordinate representations

Because coordinate systems can be 0-indexed or 1-indexed, and half-open or fully-closed, genomic features can be represented in four possible ways. For the XbaI site in this example:

               Half-open            Fully-closed
0-indexed      start: 11 end: 17    start: 11 end: 16
1-indexed      start: 12 end: 18    start: 12 end: 17
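The four representations differ only by fixed offsets, so all of them can be derived from any one. The sketch below starts from 0-indexed, half-open coordinates; the function name and dictionary keys are hypothetical.

```python
def all_representations(start0: int, end0: int) -> dict[tuple[str, str], tuple[int, int]]:
    """Derive all four (start, end) pairs from 0-indexed, half-open input."""
    return {
        ("0-indexed", "half-open"): (start0, end0),
        # Fully-closed: end moves back to the last included position.
        ("0-indexed", "fully-closed"): (start0, end0 - 1),
        # 1-indexed: both coordinates shift up by one.
        ("1-indexed", "half-open"): (start0 + 1, end0 + 1),
        # 1-indexed and fully-closed: the two end shifts cancel.
        ("1-indexed", "fully-closed"): (start0 + 1, end0),
    }


for key, coords in all_representations(11, 17).items():
    print(key, coords)
```

Note that in the 1-indexed, fully-closed case the end coordinate has the same numeric value as in the 0-indexed, half-open case, which is why only the start needs adjusting when converting between those two.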

Coordinate systems of some common file formats

Format        Index     End coordinates
BED           0         Half-open
BigBed        0         Half-open
GTF2          1         Fully-closed
GFF3          1         Fully-closed
Other GFFs    Either    Either
PSL           0         Half-open
SAM           1         n/a
BAM           0         n/a
bowtie        0         n/a
bedGraph      0         Half-open
BigWig*       0 or 1    Half-open or n/a
Wiggle        1         n/a

*The coordinate representation used in BigWig files depends upon the format of the data blocks inside the file, which can be represented as wiggle or bedGraph blocks.
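To illustrate what the table above implies for two of the formats, the sketch below parses one tab-delimited BED line and one GFF3 line into 0-indexed, half-open coordinates. It is a simplified illustration, not plastid's reader code: the function names and the example lines are made up, and comment lines, track headers, and malformed input are not handled.

```python
def bed_to_interval(line: str) -> tuple[str, int, int, str]:
    """Parse a BED line. BED is already 0-indexed and half-open."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # Strand is the optional sixth column; default to unstranded.
    strand = fields[5] if len(fields) > 5 else "."
    return chrom, start, end, strand


def gff3_to_interval(line: str) -> tuple[str, int, int, str]:
    """Parse a GFF3 line. GFF3 is 1-indexed and fully-closed."""
    fields = line.rstrip("\n").split("\t")
    chrom = fields[0]
    start = int(fields[3]) - 1  # 1-indexed -> 0-indexed
    end = int(fields[4])        # closed 1-indexed end == half-open 0-indexed end
    strand = fields[6]
    return chrom, start, end, strand


# Hypothetical example lines describing the same feature.
bed_line = "ChrI\t11\t17\tXbaI\t0\t."
gff3_line = "ChrI\texample\trestriction_site\t12\t17\t.\t.\t.\tID=XbaI"

print(bed_to_interval(bed_line))    # ('ChrI', 11, 17, '.')
print(gff3_to_interval(gff3_line))  # ('ChrI', 11, 17, '.')
```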

Conventions used in plastid

Following Python conventions, plastid reports all coordinates in 0-indexed and half-open representation. For the XbaI site in the example above, the coordinates would be:

chromosome/contig:  'ChrI'
start:              11
end:                17
strand:             '.'
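The Python convention referred to here is the one used by `range` and by sequence slicing, both of which are 0-indexed and half-open. A small sketch, using a placeholder string rather than the actual ChrI sequence:

```python
start, end = 11, 17

# range() includes start and excludes end, like a half-open interval.
positions = list(range(start, end))
print(positions)  # [11, 12, 13, 14, 15, 16]

# Placeholder sequence (an assumption for illustration, not ChrI):
# 11 filler bases, then the XbaI recognition sequence, then 5 more filler bases.
toy_sequence = "N" * start + "TCTAGA" + "N" * 5

# Slicing uses the same convention, so the coordinates can be used directly.
site = toy_sequence[start:end]
print(site)                      # TCTAGA
print(len(site) == end - start)  # True
```

Because half-open coordinates map directly onto slices, no `+ 1` or `- 1` adjustments are needed when extracting a feature's sequence.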

See also