A3 Schema
Annotated Amino Acid (A3) Specification
The Annotated Amino Acid (A3) format stores an amino acid sequence together with position-based annotations and provenance metadata.
This document defines the canonical A3 schema and the required cross-language methods for R, Python, Julia, TypeScript, and Rust.
Scope
- A3 is a single-protein, amino-acid level data structure.
- The canonical interchange format is JSON.
- TOML is also supported as an alternate source/target syntax for the same data model.
- All implementations must validate and normalize input into the same canonical representation before writing output.
Canonical Schema
An A3 document is a JSON object with the following top-level fields:
- sequence: Required. String. Amino acid sequence as a single uppercase string.
- annotations: Required. Object.
- uniprotId: Optional. String.
- description: Optional. String.
- reference: Optional. String, typically a URL, DOI, or citation string.
The annotations object must contain these fields in canonical output:
- site: Object mapping annotation type to an object mapping feature name to an array of residue positions.
- region: Object mapping annotation type to an object mapping feature name to an array of inclusive ranges.
- ptm: Object mapping annotation type to an object mapping feature name to an array of residue positions.
- processing: Object mapping annotation type to an object mapping feature name to an array of residue positions or an array of inclusive ranges.
- variant: Array of variant objects.
This is the only serialized A3 shape. Alternate spellings or alternate nesting are not part of the schema.
Writers must always emit all five annotation fields, using empty objects or an empty array when no data are present.
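A non-normative Python sketch of the rule above: missing annotation families are filled with their canonical empty values before writing. The helper name with_all_families is hypothetical, and rejecting unknown family names is an assumption drawn from the statement that alternate spellings are not part of the schema.

```python
import copy

# Canonical empty values for the five annotation families, in schema order.
EMPTY_ANNOTATIONS = {"site": {}, "region": {}, "ptm": {}, "processing": {}, "variant": []}


def with_all_families(annotations):
    """Return a copy of `annotations` with every missing family filled in."""
    unknown = set(annotations) - set(EMPTY_ANNOTATIONS)
    if unknown:
        # Assumption: names outside the five families are rejected, not ignored.
        raise ValueError(f"unknown annotation families: {sorted(unknown)}")
    filled = copy.deepcopy(EMPTY_ANNOTATIONS)
    filled.update(copy.deepcopy(annotations))
    return filled
```

Copying the defaults first keeps the five families in a stable order and avoids sharing mutable empty containers between documents.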
Annotation Types
site
Typed named tracks of single-residue annotations.
Structure:
site[type][name] = [positions]
- type is an open-ended annotation category such as activeSite or lipidBinding.
- name is a specific named feature within that category.
Canonical form:
{
  "activeSite": {
    "catalyticResidues": [57, 102]
  },
  "lipidBinding": {
    "membranePatchA": [15, 18, 21]
  }
}
Rules:
- Positions are 1-based integers.
- Positions must be unique and sorted ascending in canonical output.
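A non-normative Python sketch of these position rules. The function name normalize_positions is hypothetical; the upper bound against the sequence length is a separate check and is not shown here.

```python
def normalize_positions(positions):
    """Return unique positions sorted ascending; reject non-integers and values below 1."""
    for p in positions:
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if isinstance(p, bool) or not isinstance(p, int):
            raise ValueError(f"position must be an integer, got {p!r}")
        if p < 1:
            raise ValueError(f"position must be 1-based, got {p}")
    return sorted(set(positions))
```

For example, normalize_positions([102, 57, 57]) yields [57, 102].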
region
Typed named tracks of inclusive residue ranges.
Structure:
region[type][name] = [[start, end], ...]
- type is an open-ended annotation category such as domain or repeat.
- name is a specific named feature within that category.
Canonical form:
{
  "domain": {
    "peptidaseCore": [[25, 88], [120, 141]]
  },
  "repeat": {
    "ankyrinRepeat1": [[201, 216]]
  }
}
Rules:
- Each range is a two-element array: [start, end].
- start and end are 1-based integers.
- start <= end.
- Ranges must be sorted by start, then end.
- Overlapping or directly adjacent ranges must be merged in canonical output.
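A non-normative Python sketch of range normalization: sort the ranges, then merge any that overlap or are directly adjacent. The function name normalize_ranges is hypothetical, and "directly adjacent" is read here as one range starting at the residue immediately after the previous range ends.

```python
def normalize_ranges(ranges):
    """Sort [start, end] ranges and merge overlapping or directly adjacent ones."""
    checked = []
    for r in ranges:
        if len(r) != 2:
            raise ValueError(f"range must have two elements, got {r!r}")
        start, end = r
        if start < 1 or start > end:
            raise ValueError(f"invalid range {r!r}")
        checked.append((start, end))

    merged = []
    for start, end in sorted(checked):  # sorted by start, then end
        # Overlap or adjacency: this range starts no later than one past the previous end.
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged
```

With this reading, [[25, 60], [61, 88]] collapses to [[25, 88]], while [[2, 6], [8, 9]] stays as two ranges because residue 7 is not covered.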
ptm
Typed named tracks of residue-level post-translational modification sites.
Structure:
ptm[type][name] = [positions]
- type is an open-ended annotation category such as phosphorylation or acetylation.
- name is a specific named feature within that category.
Canonical form:
{
  "phosphorylation": {
    "activationLoopCluster": [46, 51, 199]
  },
  "acetylation": {
    "nTerminus": [3]
  }
}
The same position rules as site apply.
processing
Typed named tracks of sequence processing and maturation events.
Structure:
processing[type][name] = [positions]
processing[type][name] = [[start, end], ...]
- type is an open-ended annotation category such as proteolyticCleavage, signalPeptide, propeptide, or matureChain.
- name is a specific named feature within that category.
These may include both:
- single-residue positions (e.g. cleavage sites)
- residue ranges (e.g. signal peptides, propeptides, mature chains)
Processing annotation types and names are not restricted. Implementations must accept arbitrary strings for both.
Canonical form:
{
  "proteolyticCleavage": {
    "furinSite": [24, 57]
  },
  "signalPeptide": {
    "signalPeptide1": [[1, 24]]
  },
  "matureChain": {
    "chainA": [[25, 390]]
  }
}
Rules:
- For any given processing[type][name], the value may be either:
  - a position array such as [24, 57]
  - a range array such as [[1, 24], [90, 110]]
- String ranges such as "1:24" are not part of the schema and must not be accepted.
variant
An ordered array of variant records.
Canonical form:
[
  {
    "position": 301,
    "from": "P",
    "to": "L",
    "source": "ClinVar"
  }
]
Rules:
- Each variant must be an object.
- Each variant must contain position.
- position must be a 1-based integer.
- Additional fields are allowed.
- Additional fields must be JSON-compatible values.
- Variant order should be preserved.
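A non-normative Python sketch of the variant rules. The function name validate_variants is hypothetical, and it assumes the general position bound (1 to sequence length) also applies to variant positions. JSON compatibility is tested by attempting to serialize each record with the standard json module.

```python
import json


def validate_variants(variants, sequence_length):
    """Check each variant record and return the list in its original order."""
    if not isinstance(variants, list):
        raise ValueError("variant must be an array")
    for i, variant in enumerate(variants):
        if not isinstance(variant, dict):
            raise ValueError(f"variant {i} must be an object")
        position = variant.get("position")
        if isinstance(position, bool) or not isinstance(position, int):
            raise ValueError(f"variant {i} needs an integer position")
        if not 1 <= position <= sequence_length:
            raise ValueError(f"variant {i} position {position} is out of range")
        # Raises TypeError for values such as sets, ValueError for NaN or infinity.
        json.dumps(variant, allow_nan=False)
    return list(variants)  # order is preserved, nothing is sorted or deduplicated
```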
Validation Rules
All implementations must enforce the following rules before storing or serializing an A3 object.
Sequence
- sequence is required.
- Canonical sequence is a single uppercase string.
- Empty sequences are invalid.
- Sequence characters must be uppercase single-letter residue codes. To avoid over-constraining early implementations, A-Z and * are allowed.
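A non-normative Python sketch of the sequence rules using a regular expression. The function name validate_sequence is hypothetical. The sketch only validates; whether an implementation uppercases lowercase input during normalization is left open here.

```python
import re

# One or more characters from A-Z or '*'; the '+' makes the empty string fail.
_SEQUENCE_RE = re.compile(r"[A-Z*]+")


def validate_sequence(sequence):
    """Return `sequence` unchanged if it is a valid canonical A3 sequence."""
    if not isinstance(sequence, str):
        raise ValueError("sequence must be a string")
    if not _SEQUENCE_RE.fullmatch(sequence):
        raise ValueError("sequence must be non-empty and contain only A-Z or '*'")
    return sequence
```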
Positions
- All annotation positions are 1-based.
- Every position must satisfy 1 <= position <= sequence.length.
- Duplicate positions must be removed in canonical output.
Annotation Names
- Annotation types are strings.
- Annotation feature names are strings.
- Empty annotation types are invalid.
- Empty annotation feature names are invalid.
Annotation Family Shapes
- site[type][name] values must be arrays of positions.
- ptm[type][name] values must be arrays of positions.
- region[type][name] values must be arrays of ranges.
- processing[type][name] values must be either arrays of positions or arrays of ranges.
- Mixed point and range values within a single processing[type][name] entry are invalid.
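A non-normative Python sketch of how a processing value might be classified before normalization. The function name processing_kind is hypothetical. Treating an empty array as a position array is an assumption, since the specification does not say which kind an empty processing track has.

```python
def _is_position(x):
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(x, int) and not isinstance(x, bool)


def _is_range(x):
    return isinstance(x, list) and len(x) == 2 and all(_is_position(v) for v in x)


def processing_kind(value):
    """Classify a processing[type][name] value as 'positions' or 'ranges'."""
    if not isinstance(value, list):
        # Covers string ranges such as "1:24", which must not be accepted.
        raise ValueError("processing value must be an array")
    if all(_is_position(x) for x in value):
        return "positions"  # assumption: an empty array lands here
    if all(_is_range(x) for x in value):
        return "ranges"
    raise ValueError("mixed or malformed processing value")
```

The same two predicates can serve site, ptm, and region, which each accept only one of the two kinds.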
Empty Structures
- annotations must always exist.
- Canonical empty values are:
  - site = {}
  - region = {}
  - ptm = {}
  - processing = {}
  - variant = []
Canonical Field Names
The serialized field names are exactly:
- sequence
- annotations
- uniprotId
- description
- reference
- site
- region
- ptm
- processing
- variant
Names such as uniprotid and cleavage_site are not part of the A3 schema.
JSON Representation
JSON is the normative interchange format.
Required behavior:
- to_json must emit canonical field names.
- from_json must parse, validate, and normalize input.
- read_json and write_json must operate on files using UTF-8.
- Writers should support pretty output, but compact output is also acceptable.
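A non-normative Python sketch of the JSON entry points using the standard json module. It shows only the serialization mechanics (pretty versus compact output, explicit UTF-8 file handling). The validation and normalization that from_json and read_json must perform are deliberately left out, and the pretty parameter is an assumed name.

```python
import json


def to_json(data, pretty=True):
    """Serialize canonical A3 data; pretty and compact output carry the same content."""
    if pretty:
        return json.dumps(data, indent=2, ensure_ascii=False)
    return json.dumps(data, separators=(",", ":"), ensure_ascii=False)


def from_json(text):
    """Parse JSON text. A real implementation would validate and normalize here."""
    return json.loads(text)


def write_json(path, data, pretty=True):
    # The encoding is stated explicitly so the platform default is never used.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(to_json(data, pretty) + "\n")


def read_json(path):
    with open(path, encoding="utf-8") as handle:
        return from_json(handle.read())
```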
TOML Representation
TOML support is defined as an alternate syntax for the same A3 data model, not as a separate schema.
Required behavior:
- from_toml parses TOML, converts the parsed document to plain JSON-compatible values, then applies normal A3 validation and normalization.
- to_toml serializes the canonical A3 object without changing semantics.
- read_toml and write_toml must operate on files using UTF-8.
Restrictions:
- TOML writers should emit only values that map losslessly to JSON: strings, integers, floats, booleans, arrays, and tables.
- If an implementation encounters TOML-native types without a clean JSON equivalent, it should either convert them to strings or reject them.
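A non-normative Python sketch of the conversion step for parsed TOML data. TOML dates and times have no JSON equivalent (Python's standard tomllib parser returns them as datetime objects), and TOML floats may be inf or nan. This sketch converts date and time values to ISO 8601 strings and rejects non-finite floats; both choices are assumptions within the latitude the restriction allows, and the function name to_json_compatible is hypothetical.

```python
import datetime
import math


def to_json_compatible(value):
    """Recursively convert parsed TOML data into plain JSON-compatible values."""
    if isinstance(value, dict):
        return {str(key): to_json_compatible(item) for key, item in value.items()}
    if isinstance(value, list):
        return [to_json_compatible(item) for item in value]
    if isinstance(value, (datetime.datetime, datetime.date, datetime.time)):
        return value.isoformat()  # convert TOML date/time values to strings
    if isinstance(value, float) and not math.isfinite(value):
        raise ValueError("inf and nan have no JSON equivalent")
    if isinstance(value, (str, int, float, bool)):
        return value
    raise ValueError(f"unsupported value type: {type(value).__name__}")
```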
Suggested TOML shape:
sequence = "MSTNPKPQR"
uniprotId = "P10636"
description = "Example A3 document"
reference = "https://example.org/paper"

[annotations.site.activeSite]
catalyticResidues = [3, 5, 7]

[annotations.ptm.phosphorylation]
activationLoopCluster = [2, 6]

[annotations.processing.proteolyticCleavage]
ctslSite = [4]

[annotations.processing.signalPeptide]
signalPeptide1 = [[1, 3]]

[[annotations.variant]]
position = 4
from = "N"
to = "D"
label = "N4D"

[annotations.region.domain]
peptidaseCore = [[2, 6], [8, 9]]
Required Cross-Language Methods
Method names may follow language conventions, but every implementation should provide the following capabilities.
Construction And Validation
- new or constructor from an in-memory object.
- from_data(data): Create from a plain map/object/dict/hash.
- to_data(): Return the canonical plain data structure.
- validate(): Validate the current object or candidate data.
- normalize(data): Normalize input into canonical form before object creation.
Serialization
- from_json(text)
- to_json(...)
- read_json(path)
- write_json(path, ...)
- from_toml(text)
- to_toml(...)
- read_toml(path)
- write_toml(path, ...)
Core Query Helpers
- length: Return sequence length.
- residue_at(position): Return the residue at a 1-based position.
- variants_at(position): Return all variants at a 1-based position.
These query helpers should behave consistently across languages even if exposed as a property, method, or free function.
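A non-normative Python sketch of the three query helpers, written as free functions over the canonical plain data structure. Taking the document as a first argument and raising IndexError for an out-of-range position are assumptions; the specification names the helpers but does not fix their error behavior.

```python
def length(doc):
    """Return the sequence length."""
    return len(doc["sequence"])


def residue_at(doc, position):
    """Return the residue at a 1-based position."""
    if isinstance(position, bool) or not isinstance(position, int):
        raise TypeError("position must be an integer")
    if not 1 <= position <= len(doc["sequence"]):
        raise IndexError(f"position {position} is outside the sequence")
    return doc["sequence"][position - 1]  # shift to Python's 0-based indexing


def variants_at(doc, position):
    """Return all variants at a 1-based position, in stored order."""
    return [v for v in doc["annotations"]["variant"] if v["position"] == position]
```

The explicit bounds check matters in Python because a position of 0 would otherwise index the last residue silently.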
Convenience Inputs Are Outside The Schema
Language implementations may optionally accept convenience inputs such as:
- residue arrays instead of a sequence string
- expanded region position vectors that can be collapsed into ranges
If supported, these are constructor-level conveniences only. They are not part of the serialized A3 schema, and they must normalize to the canonical shape defined above.
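A non-normative Python sketch of the second convenience: collapsing an expanded vector of region positions into canonical inclusive ranges. The function name positions_to_ranges is hypothetical.

```python
def positions_to_ranges(positions):
    """Collapse residue positions into sorted, inclusive [start, end] ranges."""
    ranges = []
    for p in sorted(set(positions)):
        if ranges and p == ranges[-1][1] + 1:
            ranges[-1][1] = p  # extend the current run of consecutive residues
        else:
            ranges.append([p, p])  # start a new run
    return ranges
```

For example, the expanded vector [2, 3, 4, 5, 6, 8, 9] collapses to [[2, 6], [8, 9]].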
Internal object models may also differ from the canonical wire shape. For example, an implementation may use explicit annotation objects internally, but it must still read and write the canonical nested-map schema defined above.
Rejected Alternative: Object-Array Annotation Families
The following object-array shape was considered and explicitly not chosen as the canonical wire format:
{
  "site": [
    {
      "type": "activeSite",
      "name": "catalyticResidues",
      "index": [57, 102]
    }
  ]
}
This shape maps naturally to explicit internal annotation objects, and language implementations may still use that style internally.
It was not chosen as the canonical serialized form because nested maps are easier to summarize, group, count, and present by annotation type, while also keeping the JSON and TOML structure compact and direct.
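For implementations that keep object-array annotations internally, a non-normative Python sketch of the conversion to the canonical nested-map shape for one family. The function name object_array_to_nested is hypothetical, and concatenating the index arrays of entries that share a type and name is an assumption; normal normalization would still need to run on the result.

```python
def object_array_to_nested(entries):
    """Convert [{'type', 'name', 'index'}, ...] into {type: {name: index}}."""
    nested = {}
    for entry in entries:
        track = nested.setdefault(entry["type"], {}).setdefault(entry["name"], [])
        # Assumption: repeated (type, name) pairs are concatenated, not overwritten.
        track.extend(entry["index"])
    return nested
```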
Canonical Example JSON
{
  "sequence": "MSTNPKPQR",
  "annotations": {
    "site": {
      "activeSite": {
        "catalyticResidues": [3, 5, 7]
      }
    },
    "region": {
      "domain": {
        "peptidaseCore": [[2, 6], [8, 9]]
      }
    },
    "ptm": {
      "phosphorylation": {
        "activationLoopCluster": [2, 6]
      }
    },
    "processing": {
      "proteolyticCleavage": {
        "ctslSite": [4]
      },
      "signalPeptide": {
        "signalPeptide1": [[1, 3]]
      }
    },
    "variant": [
      {
        "position": 4,
        "from": "N",
        "to": "D",
        "label": "N4D"
      }
    ]
  },
  "uniprotId": "P10636",
  "description": "Example A3 document",
  "reference": "https://example.org/paper"
}
Notes For Language Implementations
- Disk format must be canonical even if the host language uses different field names internally.
- Implementations should keep the public API as close as practical to the schema names to minimize unnecessary translation layers.
- Canonical serialized region output must always use inclusive range arrays.
- The canonical wire shape for named annotation families is nested maps: family -> type -> name -> index.