# Coordinate systems used in genomics

plastid's readers automatically convert coordinates from any of the supported file formats into a 0-indexed and half-open space (i.e. following typical Python convention), so users don’t need to worry about off-by-one errors in their annotations.

Nonetheless, this tutorial describes various coordinate representations used in genomics:

## Coordinates

Genomic coordinates are typically specified as a set of:

• a chromosome name

• a start position

• an end position

• a chromosome strand:

  • ‘+’ for the forward strand
  • ‘-’ for the reverse strand
  • ‘.’ for both strands / unstranded features

This gives rise to several non-obvious considerations:

## start ≤ end

In the vast majority of annotation formats, the start coordinate refers to the lowest-numbered (i.e. leftmost, chromosome-wise) coordinate of the feature relative to the genome, regardless of the feature’s own orientation. So, for reverse-strand features, the start coordinate actually denotes the 3’ end of the feature, while the end coordinate denotes the 5’ end.
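The effect of strand on which coordinate marks the 5’ end can be sketched in a few lines of Python. The helper `five_and_three_prime` is hypothetical (it is not part of plastid), and it assumes that `start` and `end` both name positions inside the feature and that unstranded features are read left to right:

```python
def five_and_three_prime(start, end, strand):
    """Return (5' position, 3' position) of a feature.

    Hypothetical helper: `start` and `end` are the lowest- and
    highest-numbered positions of the feature on the chromosome.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    if strand == "-":
        # Reverse strand: the feature reads right-to-left on the chromosome,
        # so its 5' end is the highest-numbered position.
        return end, start
    # Forward strand ('+'); unstranded ('.') is treated the same way here,
    # which is an assumption of this sketch.
    return start, end


print(five_and_three_prime(100, 200, "+"))  # (100, 200)
print(five_and_three_prime(100, 200, "-"))  # (200, 100)
```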

## Counting from 0 vs 1

Coordinate systems can start counting from 0 (i.e. are 0-indexed) or from 1 (1-indexed). Suppose we have an XbaI restriction site on chromosome chrI:

                           XbaI
                          ______
ChrI:         ACCGATGCTAGCTCTAGACTACATCTACTCCGTCGTCTAGCATGATGCTAGCTGAC
              |          |^^^^^^     |          |          |
0-index:      0          10          20         30         40
1-index:      1          11          21         31         41


In 0-indexed representation, the restriction site begins at position 11. In 1-indexed representation, it begins at position 12.

In the context of genomics, both 0-indexed and 1-indexed systems are used, depending upon file format. plastid knows which file formats use which representation, and automatically converts all coordinates to a 0-indexed representation, following Python idioms.
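Converting a single position between the two systems is a fixed offset of one. The functions below are an illustrative sketch with hypothetical names, not plastid's own API:

```python
def to_zero_indexed(position, index_base):
    """Convert a position counted from `index_base` (0 or 1) to 0-indexed."""
    if index_base not in (0, 1):
        raise ValueError("index_base must be 0 or 1")
    return position - index_base


def to_one_indexed(position, index_base):
    """Convert a position counted from `index_base` (0 or 1) to 1-indexed."""
    if index_base not in (0, 1):
        raise ValueError("index_base must be 0 or 1")
    return position - index_base + 1


# The XbaI site from the example: position 12 when counting from 1
print(to_zero_indexed(12, index_base=1))  # 11
print(to_one_indexed(11, index_base=0))   # 12
```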

## Half-open vs fully-closed coordinates

Similarly, coordinate systems can represent end coordinates in two ways:

1. In a fully-closed or end-inclusive coordinate system, positions are inclusive: the end coordinate corresponds to the last position IN the feature.

So, in 0-indexed, fully-closed representation, the XbaI site would start at position 11, and end at position 16:

                           XbaI
                          ______
ChrI:         ACCGATGCTAGCTCTAGACTACATCTACTCCGTCGTCTAGCATGATGCTAGCTGAC
              |           ^^^^^^     |          |          |
0-index:      0           |    |     20         30         40
                          |    |
Start & end:              11   16


And the length of the feature equals:

$\ell = end - start + 1 = 16 - 11 + 1 = 6$
2. In contrast, in a half-open coordinate system, the end coordinate is defined as the first position NOT included in the feature. In a 0-indexed, half-open representation, the XbaI site starts at position 11, and ends at position 17. In this case, the length of the feature equals:

$\ell = end - start = 17 - 11 = 6$
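The two length formulas above can be checked with a short Python sketch. The function name `feature_length` and its `half_open` flag are chosen here for illustration only:

```python
def feature_length(start, end, half_open):
    """Length of a feature under either end-coordinate convention."""
    if half_open:
        # `end` is the first position NOT in the feature
        return end - start
    # `end` is the last position IN the feature
    return end - start + 1


print(feature_length(11, 16, half_open=False))  # 6
print(feature_length(11, 17, half_open=True))   # 6
```

Both calls describe the same six-nucleotide site; only the meaning of `end` differs.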

## Four possible coordinate representations

Because coordinate systems can be 0-indexed or 1-indexed, and half-open or fully-closed, genomic features can be represented in four possible ways. For the XbaI site in this example:

|           | Half-open          | Fully-closed       |
|-----------|--------------------|--------------------|
| 0-indexed | start: 11, end: 17 | start: 11, end: 16 |
| 1-indexed | start: 12, end: 18 | start: 12, end: 17 |
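The four representations differ only by fixed offsets, so any one can be derived from any other. The following is a minimal sketch, assuming a hypothetical helper `convert_interval` that passes through 0-indexed, half-open coordinates as an intermediate form:

```python
def convert_interval(start, end, from_base, from_half_open, to_base, to_half_open):
    """Re-express (start, end) in another coordinate representation.

    `from_base` / `to_base` are 0 or 1; the `*_half_open` flags are True for
    half-open and False for fully-closed end coordinates.
    """
    # Step 1: normalize to 0-indexed, half-open
    start0 = start - from_base
    end0 = end - from_base + (0 if from_half_open else 1)
    # Step 2: shift into the target representation
    new_start = start0 + to_base
    new_end = end0 + to_base - (0 if to_half_open else 1)
    return new_start, new_end


# Reproduce all four representations of the XbaI site from (11, 17)
for base in (0, 1):
    for half_open in (True, False):
        label = "half-open" if half_open else "fully-closed"
        print(f"{base}-indexed, {label}:",
              convert_interval(11, 17, 0, True, base, half_open))
# 0-indexed, half-open: (11, 17)
# 0-indexed, fully-closed: (11, 16)
# 1-indexed, half-open: (12, 18)
# 1-indexed, fully-closed: (12, 17)
```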

## Coordinate systems of some common file formats

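| Format     | Index  | End coordinates  |
|------------|--------|------------------|
| BED        | 0      | Half-open        |
| BigBed     | 0      | Half-open        |
| GTF2       | 1      | Fully-closed     |
| GFF3       | 1      | Fully-closed     |
| Other GFFs | Either | Either           |
| PSL        | 0      | Half-open        |
| SAM        | 1      | n/a              |
| BAM        | 0      | n/a              |
| bowtie     | 0      | n/a              |
| bedGraph   | 0      | Half-open        |
| BigWig*    | 0 or 1 | Half-open or n/a |
| Wiggle     | 1      | n/a              |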

*The coordinate representation used in BigWig files depends upon the format of the data blocks inside the file, which can be represented as wiggle or bedGraph blocks.
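To illustrate what a format-aware reader has to do, the sketch below encodes the unambiguous interval formats from the table and normalizes their coordinates to 0-indexed, half-open form. The names `FORMAT_CONVENTIONS` and `normalize` are hypothetical and do not reflect how plastid implements this internally:

```python
# (index base, end convention) as listed in the table above. Formats whose
# entries are "n/a", "Either", or "0 or 1" are deliberately left out.
FORMAT_CONVENTIONS = {
    "BED": (0, "half-open"),
    "BigBed": (0, "half-open"),
    "GTF2": (1, "fully-closed"),
    "GFF3": (1, "fully-closed"),
    "PSL": (0, "half-open"),
    "bedGraph": (0, "half-open"),
}


def normalize(fmt, start, end):
    """Return (start, end) as 0-indexed, half-open coordinates."""
    try:
        base, end_style = FORMAT_CONVENTIONS[fmt]
    except KeyError:
        raise ValueError(f"no interval convention recorded for {fmt!r}") from None
    start0 = start - base
    # A fully-closed end names the last included position, so add 1
    end0 = end - base + (1 if end_style == "fully-closed" else 0)
    return start0, end0


# The same XbaI site as it would be written in BED and in GFF3
print(normalize("BED", 11, 17))   # (11, 17)
print(normalize("GFF3", 12, 17))  # (11, 17)
```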

## Conventions used in plastid

Following Python conventions, plastid reports all coordinates in 0-indexed and half-open representation. For the XbaI site in this example, the coordinates would be:

chromosome/contig:  'ChrI'
start:              11
end:                17
strand:             '.'
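This representation lines up with Python's built-in `range` and slice semantics, which is one reason it is convenient. A brief sketch using only the standard library (no plastid objects are involved):

```python
start, end = 11, 17

# range(start, end) enumerates exactly the positions inside the feature
positions = range(start, end)
print(len(positions))   # 6
print(list(positions))  # [11, 12, 13, 14, 15, 16]

# Adjacent half-open intervals share a boundary value without overlapping
upstream = range(0, start)
print(upstream[-1], positions[0])  # 10 11
```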