Hasura Schema Registry

Hasura | Software Engineer

Nov 2022 - Aug 2023

graphql · distributed-systems · async-processing · observability · platform

Context and business stakes

As Hasura adoption grew, schema change management became a recurring risk, especially for larger teams running multiple environments and fast release cycles. Teams could ship schema changes without a strong audit trail, and the resulting API breakages often surfaced downstream, in production.

Customers were asking for confidence in three areas:

  • what exactly changed in the schema
  • whether the change was safe, dangerous, or breaking
  • how to recover or reason about version history during incidents

This was not a cosmetic feature request. For enterprise users, schema reliability is deployment reliability. Without change intelligence, teams slow down or accept avoidable outage risk.

Constraints and non-goals

The first and most important constraint was architectural: Hasura GraphQL Engine runs a schema sync cycle as part of core behavior, and this path could not absorb heavy diffing or persistence work.

Hard constraints:

  • no material latency added to the GraphQL Engine hot path
  • some projects had very large schema payloads
  • event delivery was distributed, so the registry could only be eventually consistent
  • alerting and auditability had to be production-grade

Non-goals for the first release:

  • strict linear ordering guarantees across every event edge case
  • full payload optimization for very large schemas through object-store pointers in V1
  • complete rollback automation from registry UI in the first milestone

The release strategy was to maximize practical safety improvements while keeping the engine responsive.

Architecture overview

I designed Schema Registry as an asynchronous sidecar-style system with event-driven ingestion and deterministic version comparison logic.

[Diagram: Schema Registry async flow]

Processing steps:

  1. The engine emitted an SDL payload after each schema sync.
  2. The registry hashed the incoming schema and compared it against the last known version.
  3. If the schema was unchanged, the registry exited quickly.
  4. If it had changed, the registry computed a diff using the Myers algorithm and classified each change as breaking, dangerous, or safe.
  5. The registry persisted schema snapshots and diffs for history, querying, and alerting workflows.

This design kept expensive work off the request-sensitive path while still providing teams with actionable risk signals.
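The processing steps above can be sketched roughly as follows. This is a minimal in-memory illustration, not the registry's actual code: the `Registry` class and its `ingest` method are hypothetical names, the store is a plain list rather than a database, and Python's `difflib` stands in for the Myers diff mentioned above (it uses a different matching algorithm). Classification is left out to keep the sketch short.

```python
import difflib
import hashlib
from dataclasses import dataclass, field


@dataclass
class Registry:
    """In-memory stand-in for the registry's persistent store (hypothetical)."""
    versions: list = field(default_factory=list)  # entries: {"hash", "sdl", "diff"}

    def ingest(self, sdl: str) -> bool:
        """Process one SDL payload. Returns True if a new version was stored."""
        digest = hashlib.sha256(sdl.encode("utf-8")).hexdigest()
        last = self.versions[-1] if self.versions else None

        # Steps 2-3: hash gate, exit quickly when nothing changed.
        if last is not None and last["hash"] == digest:
            return False

        # Step 4 (diff only): difflib is a stand-in, not a Myers implementation.
        previous_sdl = last["sdl"] if last else ""
        diff = list(difflib.unified_diff(
            previous_sdl.splitlines(), sdl.splitlines(), lineterm="", n=0))

        # Step 5: persist the snapshot together with its diff.
        self.versions.append({"hash": digest, "sdl": sdl, "diff": diff})
        return True


registry = Registry()
v1 = "type Query {\n  user: String\n}"
v2 = "type Query {\n  user: String\n  email: String\n}"
registry.ingest(v1)  # stored as the first version
registry.ingest(v1)  # skipped by the hash gate
registry.ingest(v2)  # stored with a diff against v1
```

Because an unchanged payload is a no-op, re-sending the same schema is harmless in this sketch.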

Critical design decisions and tradeoffs

1) Async by default to protect engine performance

We explicitly avoided in-engine diffing or heavy state logic. The engine emitted events, and the registry handled the rest.

Tradeoff: eventual consistency in visibility, but strong isolation from query path regressions.

2) Hash-first diff strategy

We used SHA-256 hash comparisons before running a diff. That prevented unnecessary compute on unchanged schemas and reduced storage churn.

Tradeoff: hashing is only a gate; semantic meaning still requires full diff and classification for changed payloads.
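A small illustration of why the hash is only a gate. It assumes the hash is taken over the raw SDL bytes; whether the engine emitted a canonical SDL form is not stated here, and `schema_hash` is a hypothetical helper name.

```python
import hashlib


def schema_hash(sdl: str) -> str:
    """SHA-256 hex digest of the raw SDL bytes."""
    return hashlib.sha256(sdl.encode("utf-8")).hexdigest()


a = "type Query { user: String }"
b = "type Query { user: String }"      # byte-identical to a
c = "type Query {\n  user: String\n}"  # same schema, different whitespace

same = schema_hash(a) == schema_hash(b)         # True: the diff can be skipped
reformatted = schema_hash(a) == schema_hash(c)  # False: the diff must run
```

The hash can only say that the bytes differ. Deciding that `c` carries no meaningful change is the job of the diff and classification stage.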

3) Change classification as first-class output

Many tools stop at raw diffs. We added explicit breaking/dangerous/safe classification so teams could plug this directly into CI and deployment controls.

Tradeoff: classification logic needed careful schema-rule curation to avoid noisy false positives.
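To make the three severity levels concrete, here is a minimal sketch of rule-based classification. The rules and the simplified schema model are illustrative assumptions, not the registry's actual rule set: removals are treated as breaking, new enum values as dangerous, and other additions as safe. `classify` and `overall` are hypothetical names.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical simplified schema model: type name -> (kind, member names).
Schema = Dict[str, Tuple[str, Set[str]]]


def classify(old: Schema, new: Schema) -> List[Tuple[str, str]]:
    """Return (severity, description) pairs for changes from old to new.

    Changes to a type's kind are ignored in this sketch.
    """
    changes: List[Tuple[str, str]] = []
    for name in sorted(old.keys() | new.keys()):
        if name not in new:
            changes.append(("breaking", f"type {name} removed"))
            continue
        if name not in old:
            changes.append(("safe", f"type {name} added"))
            continue
        kind, old_members = old[name]
        _, new_members = new[name]
        for member in sorted(old_members - new_members):
            changes.append(("breaking", f"{name}.{member} removed"))
        for member in sorted(new_members - old_members):
            # New enum values can surprise clients that handle a fixed set.
            severity = "dangerous" if kind == "enum" else "safe"
            changes.append((severity, f"{name}.{member} added"))
    return changes


def overall(changes: List[Tuple[str, str]]) -> str:
    """Worst severity wins; an empty change list counts as safe."""
    rank = {"safe": 0, "dangerous": 1, "breaking": 2}
    return max((s for s, _ in changes), key=rank.__getitem__, default="safe")
```

A CI job could call `overall` on the result and fail the build when it returns `"breaking"`. Keeping the rules as a small explicit table is one way to keep false positives reviewable.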

4) Timestamp-based ordering under distributed uncertainty

Out-of-order event arrival was possible. We stored updates with engine-issued generation timestamps and ordered versions accordingly.

Tradeoff: no guarantee of perfect linear history in pathological delivery patterns, but predictable behavior with low operational complexity.
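A minimal sketch of timestamp-based ordering, assuming each event carries an integer generation timestamp from the engine. The `record` function and the tuple layout are hypothetical; the real storage layer is not described here.

```python
import bisect
from typing import List, Tuple

# (engine generation timestamp, version id), kept sorted by timestamp.
History = List[Tuple[int, str]]


def record(history: History, generated_at: int, version_id: str) -> bool:
    """Insert a version in timestamp order.

    Returns True when the event arrived late, meaning a version with a
    newer generation timestamp had already been stored.
    """
    late = bool(history) and generated_at < history[-1][0]
    bisect.insort(history, (generated_at, version_id))
    return late


history: History = []
record(history, 100, "A")
record(history, 300, "C")         # C arrives before B
late = record(history, 200, "B")  # B is slotted between A and C
```

Two events with the same timestamp fall back to ordering by version id in this sketch, which is the kind of edge case where a perfectly linear history is not guaranteed.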

Failure modes and mitigations

Out-of-order events

Risk: schema version C can arrive before B. Mitigation: order records by engine generation timestamp and preserve raw history so teams can inspect sequence anomalies.

Event loss or delay

Risk: registry can temporarily miss an update. Mitigation: operational guidance to re-trigger schema sync; system designed for refreshable eventual consistency rather than silent corruption.

Very large schema payload pressure

Risk: huge SDL payloads increase memory and transfer overhead. Mitigation in V1: gzip compression for emitted payloads. As a follow-up, we investigated sending object storage pointers (S3/GCS) with asynchronous fetch, which would decouple metadata from bulk payload transfer.
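The V1 mitigation amounts to a compress and decompress round trip around the event payload. A minimal sketch using Python's standard `gzip` module follows; `pack` and `unpack` are hypothetical names, and the sample SDL is synthetic and far more repetitive than a real schema.

```python
import gzip


def pack(sdl: str) -> bytes:
    """Compress an SDL payload before it is emitted."""
    return gzip.compress(sdl.encode("utf-8"))


def unpack(payload: bytes) -> str:
    """Restore the SDL on the registry side."""
    return gzip.decompress(payload).decode("utf-8")


# Synthetic, highly repetitive SDL; real schemas will compress differently.
sdl = "\n".join(f"type T{i} {{ id: ID! name: String }}" for i in range(2000))
packed = pack(sdl)
```

Compression shrinks the transfer but the full payload still travels with every event, which is why pointer-based indirection was the next step considered.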

Alert fatigue

Risk: teams ignore alerts if signal quality drops. Mitigation: severity-aware classification and structured alert outputs mapped to practical actions.

The general rule here was that reliability is not just storing versions. It is making sure teams can trust the signal enough to act on it.

Outcomes with concrete metrics

The delivered system created measurable impact across both internal and external users:

  • shipped within roughly 1.5 quarters
  • adopted internally across Hasura instances from day one
  • enabled CI workflows to catch schema regressions before production rollout
  • strong cloud adoption, including broad usage by new projects
  • notable enterprise usage for formal schema governance workflows
  • observed reduction in schema-related production incidents after launch

I built the initial version end-to-end and then mentored two engineers as we scaled and operationalized it.

What I'd change now

If I were iterating now, I would prioritize three improvements earlier:

  1. Payload indirection by default for large schemas: upload SDL to object storage and send lightweight event references.
  2. Stronger replay tooling: deterministic replay windows for incident forensics and point-in-time reconstruction.
  3. Policy packs for CI integration: opinionated deployment gates by environment, instead of requiring each team to hand-roll rule thresholds.
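As an illustration of the third item, a policy pack could be as small as a per-environment table of blocking severities. Everything in this sketch is hypothetical: the environment names, the `POLICY` table, and the `gate` function are assumptions made for the example, using the breaking/dangerous/safe labels from the classification output.

```python
from typing import Dict, FrozenSet, Iterable

# Hypothetical policy pack: severities that block a deploy, per environment.
POLICY: Dict[str, FrozenSet[str]] = {
    "dev": frozenset(),
    "staging": frozenset({"breaking"}),
    "production": frozenset({"breaking", "dangerous"}),
}


def gate(environment: str, severities: Iterable[str]) -> bool:
    """Return True if the deploy may proceed under the environment's policy."""
    # Unknown environments fall back to the strictest policy.
    blocked = POLICY.get(environment, POLICY["production"])
    return not any(s in blocked for s in severities)
```

For example, `gate("staging", ["dangerous", "safe"])` allows the deploy, while `gate("production", ["dangerous"])` blocks it. Shipping defaults like these would spare each team from choosing thresholds from scratch.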

Even with those upgrades, I would keep the same core choice: protect the hot path first. The engine should stay fast, and the intelligence layer should evolve asynchronously around it.