Annotated Amino Acid (A3) Specification

The Annotated Amino Acid (A3) format stores an amino acid sequence together with position-based annotations and provenance metadata.

This document defines the canonical A3 schema and the required cross-language methods for R, Python, Julia, TypeScript, and Rust.

Scope

A3 is a single-protein, amino-acid level data structure.
The canonical interchange format is JSON.
TOML is also supported as an alternate source/target syntax for the same data model.
All implementations must validate and normalize input into the same canonical representation before writing output.

Canonical Schema

An A3 document is a JSON object with the following top-level fields:

sequence Required. String. Amino acid sequence as a single uppercase string.
annotations Required. Object.
uniprotId Optional. String.
description Optional. String.
reference Optional. String, typically a URL, DOI, or citation string.

The annotations object must contain these fields in canonical output:

site Object mapping annotation type to an object mapping feature name to an array of residue positions.
region Object mapping annotation type to an object mapping feature name to an array of inclusive ranges.
ptm Object mapping annotation type to an object mapping feature name to an array of residue positions.
processing Object mapping annotation type to an object mapping feature name to an array of residue positions or an array of inclusive ranges.
variant Array of variant objects.

This is the only serialized A3 shape. Alternate spellings or alternate nesting are not part of the schema.

Writers must always emit all five annotation fields, using empty objects or an empty array when no data are present.

Annotation Types

`site`

Typed named tracks of single-residue annotations.

Structure:

site[type][name] = [positions]
type is an open-ended annotation category such as activeSite or lipidBinding.
name is a specific named feature within that category.

Canonical form:

{
  "activeSite": {
    "catalyticResidues": [57, 102]
  },
  "lipidBinding": {
    "membranePatchA": [15, 18, 21]
  }
}

Rules:

Positions are 1-based integers.
Positions must be unique and sorted ascending in canonical output.
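
As an illustration of the rules above, position normalization could be sketched in Python as follows. The function name normalize_positions is hypothetical and is not defined by this specification.

```python
def normalize_positions(positions):
    """Return unique positions sorted ascending; reject anything that is not a 1-based integer."""
    for p in positions:
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(p, bool) or not isinstance(p, int) or p < 1:
            raise ValueError(f"invalid position: {p!r}")
    return sorted(set(positions))


print(normalize_positions([102, 57, 57]))  # [57, 102]
```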

`region`

Typed named tracks of inclusive residue ranges.

Structure:

region[type][name] = [[start, end], ...]
type is an open-ended annotation category such as domain or repeat.
name is a specific named feature within that category.

Canonical form:

{
  "domain": {
    "peptidaseCore": [[25, 88], [120, 141]]
  },
  "repeat": {
    "ankyrinRepeat1": [[201, 216]]
  }
}

Rules:

Each range is a two-element array: [start, end].
start and end are 1-based integers.
start <= end.
Ranges must be sorted by start, then end.
Overlapping or directly adjacent ranges must be merged in canonical output.
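
The sorting and merging rules above can be sketched in Python as shown here. The name normalize_ranges is hypothetical, and the sketch assumes that "directly adjacent" means one range ends at the residue immediately before the next range starts (for example [25, 60] and [61, 88]).

```python
def normalize_ranges(ranges):
    """Validate, sort, and merge inclusive [start, end] ranges."""
    checked = []
    for r in ranges:
        if not isinstance(r, (list, tuple)) or len(r) != 2:
            raise ValueError(f"range must be a two-element array: {r!r}")
        start, end = r
        for value in (start, end):
            # bool is a subclass of int in Python, so exclude it explicitly
            if isinstance(value, bool) or not isinstance(value, int) or value < 1:
                raise ValueError(f"range bounds must be 1-based integers: {r!r}")
        if start > end:
            raise ValueError(f"start must be <= end: {r!r}")
        checked.append((start, end))

    merged = []
    for start, end in sorted(checked):  # sorts by start, then end
        if merged and start <= merged[-1][1] + 1:
            # overlaps or touches the previous range: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


print(normalize_ranges([[120, 141], [25, 60], [61, 88], [30, 40]]))  # [[25, 88], [120, 141]]
```

Under this reading, [[2, 6], [8, 9]] is left unchanged because residue 7 separates the two ranges.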

`processing`

Typed named tracks of sequence processing and maturation events.

Structure:

processing[type][name] = [positions]
processing[type][name] = [[start, end], ...]
type is an open-ended annotation category such as proteolyticCleavage, signalPeptide, propeptide, or matureChain.
name is a specific named feature within that category.

Across tracks, processing values may include both:

- single-residue positions (e.g. cleavage sites)
- residue ranges (e.g. signal peptides, propeptides, mature chains)

Processing annotation types and names are not restricted. Implementations must accept arbitrary strings for both.

Canonical form:

{
  "proteolyticCleavage": {
    "furinSite": [24, 57]
  },
  "signalPeptide": {
    "signalPeptide1": [[1, 24]]
  },
  "matureChain": {
    "chainA": [[25, 390]]
  }
}

Rules:

For any given processing[type][name], the value may be either:
- a position array such as [24, 57]
- a range array such as [[1, 24], [90, 110]]
String ranges such as "1:24" are not part of the schema and must not be accepted.

`variant`

An ordered array of variant records.

Canonical form:

[
  {
    "position": 301,
    "from": "P",
    "to": "L",
    "source": "ClinVar"
  }
]

Rules:

Each variant must be an object.
Each variant must contain position.
position must be a 1-based integer.
Additional fields are allowed.
Additional fields must be JSON-compatible values.
Variant order should be preserved.
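
A minimal Python sketch of these variant rules follows. The name validate_variants is hypothetical. Using json.dumps as the test for "JSON-compatible" is an approximation chosen for brevity (for example, it also accepts Python tuples), not something this specification prescribes.

```python
import json


def validate_variants(variants):
    """Check each variant record and return the list in its original order."""
    if not isinstance(variants, list):
        raise ValueError("variant must be an array")
    for i, variant in enumerate(variants):
        if not isinstance(variant, dict):
            raise ValueError(f"variant {i} is not an object")
        position = variant.get("position")
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(position, bool) or not isinstance(position, int) or position < 1:
            raise ValueError(f"variant {i} needs a 1-based integer position")
        try:
            # allow_nan=False rejects NaN and Infinity, which JSON cannot represent
            json.dumps(variant, allow_nan=False)
        except (TypeError, ValueError) as exc:
            raise ValueError(f"variant {i} has a non-JSON value") from exc
    return variants
```

The list is returned as given, without sorting or deduplication, so variant order is preserved.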

Validation Rules

All implementations must enforce the following rules before storing or serializing an A3 object.

Sequence

sequence is required.
Canonical sequence is a single uppercase string.
Empty sequences are invalid.
Sequence characters must be uppercase single-letter residue codes. To avoid over-constraining early implementations, A-Z and * are allowed.
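
The sequence rules above might be checked as in the following Python sketch. The name validate_sequence is hypothetical. The sketch rejects lowercase input rather than uppercasing it, which is an assumption; this specification does not say which behavior normalization should use.

```python
import re

# One or more characters from A-Z or '*'; the '+' makes the empty string fail
_SEQUENCE_RE = re.compile(r"[A-Z*]+")


def validate_sequence(sequence):
    """Return the sequence unchanged if it satisfies the A3 sequence rules."""
    if not isinstance(sequence, str):
        raise ValueError("sequence must be a string")
    if not _SEQUENCE_RE.fullmatch(sequence):
        raise ValueError("sequence must be non-empty and contain only A-Z or '*'")
    return sequence
```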

Positions

All annotation positions are 1-based.
Every position must satisfy 1 <= position <= sequence.length.
Duplicate positions must be removed in canonical output.
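
A bounds check along these lines could be sketched in Python as follows. The name check_bounds is hypothetical, and applying the same bound to both endpoints of a range is an assumption based on the rule covering all annotation positions.

```python
def check_bounds(values, sequence):
    """Raise ValueError if any position or range endpoint falls outside the sequence."""
    n = len(sequence)
    for value in values:
        # a range contributes both endpoints; a plain position contributes itself
        endpoints = value if isinstance(value, list) else [value]
        for p in endpoints:
            if not 1 <= p <= n:
                raise ValueError(f"position {p} is outside 1..{n}")


check_bounds([3, 5, 7], "MSTNPKPQR")         # passes: sequence length is 9
check_bounds([[2, 6], [8, 9]], "MSTNPKPQR")  # passes
```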

Annotation Names

Annotation types are strings.
Annotation feature names are strings.
Empty annotation types are invalid.
Empty annotation feature names are invalid.

Annotation Family Shapes

site[type][name] values must be arrays of positions.
ptm[type][name] values must be arrays of positions.
region[type][name] values must be arrays of ranges.
processing[type][name] values must be either arrays of positions or arrays of ranges.
Mixed point and range values within a single processing[type][name] entry are invalid.
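
For processing entries, the shape rules above (together with the rule that string ranges such as "1:24" are not accepted) could be enforced with a small classifier like this Python sketch. The name classify_processing_value is hypothetical, and treating an empty array as a position array is an assumption this specification does not settle.

```python
def _is_int(x):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(x, int) and not isinstance(x, bool)


def classify_processing_value(value):
    """Return 'positions' or 'ranges' for a processing[type][name] value."""
    if not isinstance(value, list):
        raise ValueError("processing value must be an array")
    if all(_is_int(v) for v in value):
        return "positions"  # note: an empty array lands here (assumption)
    if all(isinstance(v, list) and len(v) == 2 and all(_is_int(x) for x in v) for v in value):
        return "ranges"
    raise ValueError("processing value mixes positions and ranges or has malformed items")


print(classify_processing_value([24, 57]))               # positions
print(classify_processing_value([[1, 24], [90, 110]]))   # ranges
```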

Empty Structures

annotations must always exist.
Canonical empty values are:
- site = {}
- region = {}
- ptm = {}
- processing = {}
- variant = []
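
One way to guarantee these empty values is sketched below in Python. The name with_empty_defaults is hypothetical. Filling in missing families on input, and rejecting unknown family names, are assumptions made for this sketch.

```python
def with_empty_defaults(annotations):
    """Return an annotations object that always has all five families."""
    defaults = {"site": {}, "region": {}, "ptm": {}, "processing": {}, "variant": []}
    unknown = set(annotations) - set(defaults)
    if unknown:
        raise ValueError(f"unknown annotation families: {sorted(unknown)}")
    # keep what was supplied, fall back to the canonical empty value otherwise
    return {key: annotations.get(key, empty) for key, empty in defaults.items()}


print(with_empty_defaults({"variant": [{"position": 4}]}))
```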

Canonical Field Names

The serialized field names are exactly:

sequence
annotations
uniprotId
description
reference
site
region
ptm
processing
variant

Names such as uniprotid and cleavage_site are not part of the A3 schema.

JSON Representation

JSON is the normative interchange format.

Required behavior:

to_json must emit canonical field names.
from_json must parse, validate, and normalize input.
read_json and write_json must operate on files using UTF-8.
Writers should support pretty output, but compact output is also acceptable.
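
The file-level behavior might be sketched in Python as follows, using only the standard json module. The pretty parameter is a hypothetical option name, and validation and normalization are omitted here for brevity; a real from_json or read_json would apply them before returning.

```python
import json


def to_json(data, pretty=True):
    """Serialize canonical A3 data as pretty or compact JSON text."""
    if pretty:
        return json.dumps(data, indent=2, ensure_ascii=False)
    return json.dumps(data, separators=(",", ":"), ensure_ascii=False)


def write_json(path, data, pretty=True):
    """Write A3 data to a UTF-8 encoded file."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(to_json(data, pretty))


def read_json(path):
    """Read a UTF-8 encoded JSON file (validation not shown in this sketch)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```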

TOML Representation

TOML support is defined as an alternate syntax for the same A3 data model, not as a separate schema.

Required behavior:

from_toml parses TOML, converts the parsed document to plain JSON-compatible values, then applies normal A3 validation and normalization.
to_toml serializes the canonical A3 object without changing semantics.
read_toml and write_toml must operate on files using UTF-8.

Restrictions:

TOML writers should emit only values that map losslessly to JSON: strings, integers, floats, booleans, arrays, and tables.
If an implementation encounters TOML-native types without a clean JSON equivalent, it should either convert them to strings or reject them.
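
As a sketch of the convert-or-reject choice above, the following Python function walks a parsed TOML document and handles values with no clean JSON equivalent. Python's standard tomllib parser (3.11+) returns datetime objects for TOML date and time values, and TOML floats may be inf or nan, which JSON cannot represent. The function name to_json_compatible and its reject flag are hypothetical.

```python
import datetime
import math


def to_json_compatible(value, reject=False):
    """Recursively convert TOML-native values to strings, or reject them."""
    if isinstance(value, dict):
        return {key: to_json_compatible(item, reject) for key, item in value.items()}
    if isinstance(value, list):
        return [to_json_compatible(item, reject) for item in value]
    is_temporal = isinstance(value, (datetime.datetime, datetime.date, datetime.time))
    is_non_finite = isinstance(value, float) and not math.isfinite(value)
    if is_temporal or is_non_finite:
        if reject:
            raise ValueError(f"value has no JSON equivalent: {value!r}")
        # ISO 8601 text for dates and times; 'inf', '-inf', or 'nan' for floats
        return value.isoformat() if is_temporal else str(value)
    return value


print(to_json_compatible({"released": datetime.date(2024, 1, 2), "scores": [1, 2.5]}))
```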

Suggested TOML shape:

sequence = "MSTNPKPQR"
uniprotId = "P10636"
description = "Example A3 document"
reference = "https://example.org/paper"

[annotations.site.activeSite]
catalyticResidues = [3, 5, 7]

[annotations.ptm.phosphorylation]
activationLoopCluster = [2, 6]

[annotations.processing.proteolyticCleavage]
ctslSite = [4]

[annotations.processing.signalPeptide]
signalPeptide1 = [[1, 3]]

[[annotations.variant]]
position = 4
from = "N"
to = "D"
label = "N4D"

[annotations.region.domain]
peptidaseCore = [[2, 6], [8, 9]]

Required Cross-Language Methods

Method names may follow language conventions, but every implementation should provide the following capabilities.

Construction And Validation

new or constructor Create from in-memory values.
from_data(data) Create from a plain map/object/dict/hash.
to_data() Return the canonical plain data structure.
validate() Validate the current object or candidate data.
normalize(data) Normalize input into canonical form before object creation.

Serialization

from_json(text)
to_json(...)
read_json(path)
write_json(path, ...)
from_toml(text)
to_toml(...)
read_toml(path)
write_toml(path, ...)

Core Query Helpers

length Return sequence length.
residue_at(position) Return the residue at a 1-based position.
variants_at(position) Return all variants at a 1-based position.

These query helpers should behave consistently across languages even if exposed as a property, method, or free function.
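
For illustration, the three helpers could be written as free functions over canonical plain data, as in this Python sketch. Raising IndexError for an out-of-range position is an assumption; this specification does not define the error behavior.

```python
def length(data):
    """Return the sequence length."""
    return len(data["sequence"])


def residue_at(data, position):
    """Return the residue at a 1-based position."""
    if not 1 <= position <= len(data["sequence"]):
        raise IndexError(f"position {position} is outside the sequence")
    return data["sequence"][position - 1]  # convert 1-based to 0-based


def variants_at(data, position):
    """Return all variant records at a 1-based position, in stored order."""
    return [v for v in data["annotations"]["variant"] if v["position"] == position]


doc = {
    "sequence": "MSTNPKPQR",
    "annotations": {"variant": [{"position": 4, "from": "N", "to": "D"}]},
}
print(length(doc), residue_at(doc, 4), variants_at(doc, 4))
```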

Convenience Inputs Are Outside The Schema

Language implementations may optionally accept convenience inputs such as:

residue arrays instead of a sequence string
expanded region position vectors that can be collapsed into ranges

If supported, these are constructor-level conveniences only. They are not part of the serialized A3 schema, and they must normalize to the canonical shape defined above.
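
Both conveniences can be reduced to the canonical shape with very little code. The Python sketch below uses the hypothetical names join_residues and collapse_positions; neither is required by this specification.

```python
def join_residues(residues):
    """Turn a residue array such as ['M', 'S', 'T'] into a sequence string."""
    return "".join(residues)


def collapse_positions(positions):
    """Collapse an expanded position vector into inclusive [start, end] ranges."""
    ranges = []
    for p in sorted(set(positions)):
        if ranges and p == ranges[-1][1] + 1:
            ranges[-1][1] = p       # extends the current run
        else:
            ranges.append([p, p])   # starts a new run
    return ranges


print(collapse_positions([2, 3, 4, 5, 6, 8, 9]))  # [[2, 6], [8, 9]]
```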

Internal object models may also differ from the canonical wire shape. For example, an implementation may use explicit annotation objects internally, but it must still read and write the canonical nested-map schema defined above.

Rejected Alternative: Object-Array Annotation Families

The following object-array shape was considered and explicitly not chosen as the canonical wire format:

{
  "site": [
    {
      "type": "activeSite",
      "name": "catalyticResidues",
      "index": [57, 102]
    }
  ]
}

This shape maps naturally to explicit internal annotation objects, and language implementations may still use that style internally.

It was not chosen as the canonical serialized form because nested maps are easier to summarize, group, count, and present by annotation type, while also keeping the JSON and TOML structure compact and direct.
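
An implementation that keeps the object-array style internally needs a conversion at the serialization boundary. The following Python sketch shows one possible pair of converters for a single annotation family; the function names are hypothetical, and the type, name, and index keys follow the rejected shape shown above.

```python
def records_to_nested(records):
    """Convert [{'type', 'name', 'index'}, ...] into the canonical type -> name map."""
    nested = {}
    for record in records:
        # if the same (type, name) pair appears twice, the later record wins
        nested.setdefault(record["type"], {})[record["name"]] = record["index"]
    return nested


def nested_to_records(nested):
    """Convert the canonical type -> name map back into a list of records."""
    return [
        {"type": type_, "name": name, "index": index}
        for type_, names in nested.items()
        for name, index in names.items()
    ]


records = [{"type": "activeSite", "name": "catalyticResidues", "index": [57, 102]}]
print(records_to_nested(records))  # {'activeSite': {'catalyticResidues': [57, 102]}}
```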

Canonical Example JSON

{
  "sequence": "MSTNPKPQR",
  "annotations": {
    "site": {
      "activeSite": {
        "catalyticResidues": [3, 5, 7]
      }
    },
    "region": {
      "domain": {
        "peptidaseCore": [[2, 6], [8, 9]]
      }
    },
    "ptm": {
      "phosphorylation": {
        "activationLoopCluster": [2, 6]
      }
    },
    "processing": {
      "proteolyticCleavage": {
        "ctslSite": [4]
      },
      "signalPeptide": {
        "signalPeptide1": [[1, 3]]
      }
    },
    "variant": [
      {
        "position": 4,
        "from": "N",
        "to": "D",
        "label": "N4D"
      }
    ]
  },
  "uniprotId": "P10636",
  "description": "Example A3 document",
  "reference": "https://example.org/paper"
}

Notes For Language Implementations

Disk format must be canonical even if the host language uses different field names internally.
Implementations should keep the public API as close as practical to the schema names to minimize unnecessary translation layers.
Canonical serialized region output must always use inclusive range arrays.
The canonical wire shape for named annotation families is nested maps: family -> type -> name -> array of positions or ranges.
