Concord Conformance Suite

Repo location: /docs/internal/conformance-suite.md
Version: v0.1 (working draft)
Status: Initial design; finalized scope required before Phase 1 freeze
Owners: Workstream A (Specs), with co-ownership across Workstreams B, C, D, E (each responsible for their interface's sub-suite)

Purpose

The conformance suite is the artifact that lets multiple implementations of Concord — different backends, different transports, different attestation layers, eventually different control planes — coexist as part of one ecosystem.

It exists because pluggable backends without a conformance suite are just marketing. With a conformance suite, the equal-access goal (PRD Section 3.2) becomes structural: anyone can ship a backend that passes the suite, and operators can swap backends without changing observable behavior.

It also exists to make the platform-vendor-capture defense (PRD Section 21.1) credible. A vendor-hosted version of Concord that doesn't pass the same conformance suite as the open-source reference is, definitionally, not Concord. The suite is what gives that statement teeth.

What conformance means

A backend or component is conformant at a stated level if it:

  • Implements every required operation in the relevant spec
  • Produces the specified behavior for every test case at that level
  • Meets the performance and security SLAs declared at that level
  • Documents which optional capabilities it does and does not implement

Conformance is observable equivalence, not implementation equivalence. Two memory backends can use entirely different storage technology and conform if they produce the same observable behavior to the swarm.

Audiences

| Audience | Question they're answering |
| --- | --- |
| Backend implementers | "What do I need to test to claim compliance?" |
| Operators choosing backends | "Which backend stacks are guaranteed to work together?" |
| Security reviewers | "What's actually verified about this implementation?" |
| The foundation | "Is this implementation eligible for the conformance mark?" |

Conformance levels

Each pluggable interface has tiered conformance. A backend declares which level(s) it claims; the suite verifies the claim.

| Level | What it covers |
| --- | --- |
| Core | Minimum viable. Required operations, basic correctness, default failure modes. The floor for any backend that wants to call itself Concord-compliant. |
| Full | Core + all optional operations + concurrency + full failure-mode coverage. The level expected of production-grade backends. |
| Verifiable | Full + cryptographic verifiability properties. Required for backends claiming the verifiable stack. |
| Constrained-transport | Full + intermittent-connectivity properties. Required for transports claiming field-deployment support. |

A backend can claim any combination consistent with what it implements. A registry might claim Core + Verifiable; a transport might claim Full + Constrained-transport; a memory backend might claim Full only.
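As a rough illustration of how a runner might turn a declared claim into the sub-suites to execute, here is a minimal sketch. The level and interface names come from this document; the `Claim` shape, the validation rules, and the `interface/<name>/<level>` paths are hypothetical, and the sketch deliberately does not decide whether claiming one level implies another.

```python
from dataclasses import dataclass

# Level and interface names are taken from this document; the claim format,
# the validation rules, and the sub-suite paths are hypothetical.
KNOWN_LEVELS = ("core", "full", "verifiable", "constrained-transport")
INTERFACES = ("registry", "memory", "receipts", "transport",
              "access-control", "attestation")


@dataclass(frozen=True)
class Claim:
    interface: str
    levels: frozenset


def validate(claim):
    """Return a list of problems with a claim; an empty list means well-formed."""
    errors = []
    if claim.interface not in INTERFACES:
        errors.append(f"unknown interface: {claim.interface}")
    if not claim.levels:
        errors.append("no level claimed")
    for level in sorted(claim.levels):
        if level not in KNOWN_LEVELS:
            errors.append(f"unknown level: {level}")
    # Assumption: the constrained-transport level only applies to transports.
    if "constrained-transport" in claim.levels and claim.interface != "transport":
        errors.append("constrained-transport applies to transports only")
    return errors


def select_subsuites(claim):
    """Map a claim to the sub-suites a runner would execute, in level order."""
    return [f"interface/{claim.interface}/{level}"
            for level in KNOWN_LEVELS if level in claim.levels]
```

For example, `select_subsuites(Claim("transport", frozenset({"full", "constrained-transport"})))` would select the two transport sub-suites that match the claim and nothing else.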

Suite structure

The suite is organized into five top-level categories:

/conformance
├── /interface              # Per-interface sub-suites (isolation tests)
│   ├── registry/
│   ├── memory/
│   ├── receipts/
│   ├── transport/
│   ├── access-control/
│   └── attestation/
├── /language               # Constitution language conformance
│   ├── cedar-plus-semantics/
│   ├── static-analyzer/
│   └── authoring-pipeline/
├── /behavioral             # Cross-component integration scenarios
│   ├── queue-mode/
│   ├── campaign-mode/
│   ├── federation/
│   └── constrained-network/
├── /performance            # SLA tests
└── /security               # Threat-model adversary tests

Each category is described below.


Interface conformance

Each pluggable interface has an isolated sub-suite. Tests run against the candidate implementation with mocked peers and a controlled environment, so failures unambiguously attribute to the implementation under test.

Registry conformance

Core (required for any registry implementation):

  • Register agent with valid passport → succeeds, returns agent ID and registration receipt
  • Register with invalid signature → fails closed; no partial state
  • Register duplicate agent ID → fails per spec (no silent overwrites)
  • Lookup by ID returns the registered passport with cryptographic integrity preserved
  • Lookup of non-existent ID returns null per spec (not an exception)
  • Capability assertion returns true for declared capabilities, false for undeclared
  • Revocation of an agent → subsequent lookups reflect the revocation atomically
  • Concurrent registrations of distinct agents all succeed
  • Simultaneous registration of the same ID → exactly one succeeds; the other receives the spec'd error
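The simultaneous-registration case is the easiest of these to get subtly wrong, so a sketch of how a test might exercise it may help. Everything below is an assumption for illustration: `InMemoryRegistry`, `DuplicateAgentError`, and `race_same_id` are hypothetical stand-ins, not the registry API from the spec.

```python
import threading


class DuplicateAgentError(Exception):
    """Hypothetical stand-in for the spec'd duplicate-ID error."""


class InMemoryRegistry:
    """Toy registry used only to show the shape of the test."""

    def __init__(self):
        self._lock = threading.Lock()
        self._passports = {}

    def register(self, agent_id, passport):
        # Check and insert under one lock so the pair is atomic.
        with self._lock:
            if agent_id in self._passports:
                raise DuplicateAgentError(agent_id)
            self._passports[agent_id] = passport

    def lookup(self, agent_id):
        # Non-existent IDs return None rather than raising.
        return self._passports.get(agent_id)


def race_same_id(registry, agent_id, contenders=8):
    """Start `contenders` threads that all try to register the same ID."""
    outcomes = []
    barrier = threading.Barrier(contenders)

    def attempt(n):
        barrier.wait()  # release all threads at roughly the same moment
        try:
            registry.register(agent_id, {"owner": n})
            outcomes.append("ok")
        except DuplicateAgentError:
            outcomes.append("duplicate")

    threads = [threading.Thread(target=attempt, args=(n,)) for n in range(contenders)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes
```

A conformance test would then assert that exactly one outcome is `"ok"` and every other contender saw the duplicate error. Against a real backend the contenders would be separate clients rather than threads in one process.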

Full (additional):

  • Bulk operations are transactional (all-or-nothing)
  • Listing supports pagination, filtering, ordering per spec
  • Capability updates produce receipts and are observable in subscribers
  • Heartbeat and liveness semantics work end-to-end

Verifiable (additional):

  • Passport signatures are verifiable against published public keys without trusting the registry operator
  • Registration produces a verifiable claim retrievable by third parties
  • Revocation is mutually recognizable across federation peers
  • Registry state is auditable end-to-end without operator cooperation
  • Cryptographic chain allows external reconstruction of registration history

Memory conformance

For both per-agent and shared tiers:

Core:

  • put(key, value, scope, ttl, tags) followed by get(key) returns the value with metadata intact
  • put with TTL → get after TTL returns null
  • put with scope tag → access from outside scope is denied at the backend, not just the wrapper
  • search(query, filter, limit) returns items matching filter, respecting access control
  • forget(key) → subsequent get returns null; the deletion is durable
  • export(scope) returns all items in scope with manifest
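A sketch of how a few of the Core checks above could be written against a backend, assuming an injectable clock so TTL expiry can be tested without sleeping. The `put`, `get`, and `forget` names and the `put` argument list follow the bullets above; the returned item shape, `FakeClock`, and `InMemoryBackend` are hypothetical.

```python
class FakeClock:
    """Controllable time source so TTL tests do not need real delays."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class InMemoryBackend:
    """Toy backend; the item dictionary shape is an assumption."""

    def __init__(self, clock):
        self._clock = clock
        self._items = {}

    def put(self, key, value, scope, ttl, tags):
        expires_at = None if ttl is None else self._clock() + ttl
        self._items[key] = {"value": value, "scope": scope,
                            "tags": list(tags), "expires_at": expires_at}

    def get(self, key):
        item = self._items.get(key)
        if item is None:
            return None
        if item["expires_at"] is not None and self._clock() >= item["expires_at"]:
            del self._items[key]  # expired items behave as if absent
            return None
        return item

    def forget(self, key):
        self._items.pop(key, None)


def check_core_memory(backend, clock):
    """Run three Core checks and return a list of failure descriptions."""
    failures = []
    backend.put("k1", "hello", scope="team-a", ttl=30, tags=["greeting"])
    item = backend.get("k1")
    if (item is None or item["value"] != "hello"
            or item["scope"] != "team-a" or item["tags"] != ["greeting"]):
        failures.append("get after put lost the value or its metadata")
    clock.advance(31)
    if backend.get("k1") is not None:
        failures.append("item survived its TTL")
    backend.put("k2", "bye", scope="team-a", ttl=None, tags=[])
    backend.forget("k2")
    if backend.get("k2") is not None:
        failures.append("item survived forget")
    return failures
```

Scope enforcement, search, export, and durability of `forget` are left out here; they need a real access-control context and a restartable backend.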

Concurrency:

  • N parallel puts of distinct keys all succeed
  • Last-writer-wins on conflicting puts (default consistency model)
  • Subscribers receive notifications for matching writes within the spec's latency bound
  • No torn reads under concurrent put + get on the same key

Privacy:

  • Per-agent memory of agent A is not visible to agent B
  • Shared memory respects scope tags from the active constitution version
  • forget cascades to subscribers and updates indices atomically

Vector/semantic capabilities (optional, claimed separately):

  • Semantic search returns items in score order
  • Filter combined with semantic search applies both correctly
  • Embedding model identification is recorded in receipts so retrieval is reproducible

Verifiable (additional):

  • All writes produce attestable claims with cryptographic provenance
  • forget on immutable backends produces a redaction event with selective-disclosure semantics (the platform is honest about what is and is not deletable)
  • Cross-org memory federation respects declared export scope and revocation

Receipt store conformance

Core:

  • Append a receipt → it's retrievable by ID and by causal predecessor
  • Receipts are content-addressed (identical content → identical ID)
  • Tamper detection: any modified receipt fails signature check
  • Sequential append is durable across process restart
  • Concurrent appends preserve causal ordering where causal metadata is present
  • Query by causal predecessor returns the consistent set of children
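Content addressing and lookup by causal predecessor can be sketched as follows. This assumes receipts are JSON-serializable dictionaries with a `parent` field and that the ID is SHA-256 over a canonical JSON encoding; both are illustrative choices, not the receipt format from the spec, and signature checking is omitted.

```python
import hashlib
import json


def receipt_id(receipt):
    """Content address: SHA-256 over a canonical (sorted-key) JSON encoding."""
    canonical = json.dumps(receipt, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ReceiptStore:
    """Toy append-only store indexed by ID and by causal predecessor."""

    def __init__(self):
        self._by_id = {}
        self._children = {}

    def append(self, receipt):
        rid = receipt_id(receipt)
        if rid not in self._by_id:  # identical content is stored once
            self._by_id[rid] = receipt
            self._children.setdefault(receipt.get("parent"), []).append(rid)
        return rid

    def get(self, rid):
        return self._by_id.get(rid)

    def children_of(self, parent_id):
        return list(self._children.get(parent_id, []))

    def verify(self, rid):
        """Recompute the content address; a modified receipt no longer matches."""
        receipt = self._by_id.get(rid)
        return receipt is not None and receipt_id(receipt) == rid
```

Because the ID is derived from content, appending the same receipt twice returns the same ID, and any in-place modification makes `verify` return `False`.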

Full (additional):

  • Range queries by time, agent, swarm work per spec
  • Bulk export produces a verifiable manifest
  • Retention policies are enforced (with configurability per deployment)

Verifiable (additional):

  • Receipts are mutually recognizable across organizations using only public keys
  • Cryptographic chain enables cross-store verification without trusting either operator
  • Receipt batches can be sealed (Merkle-rooted) for efficient verification at scale
  • Selective-disclosure proofs allow revealing a single receipt without revealing the rest of the chain
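To make "sealed (Merkle-rooted)" and single-receipt disclosure concrete, here is a small sketch of a Merkle root with inclusion proofs over receipt IDs. The hash function, the leaf/node prefixes, and the duplicate-last-node padding are assumptions chosen for brevity; a production scheme would need to pick these carefully, since naive duplication lets some different batches share a root.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).digest()


def _leaf(receipt_id):
    return _h(b"\x00" + receipt_id.encode("utf-8"))  # prefix separates leaves...


def _node(left, right):
    return _h(b"\x01" + left + right)  # ...from interior nodes


def _levels(receipt_ids):
    """Return every tree level, leaves first, root last (levels are unpadded)."""
    if not receipt_ids:
        raise ValueError("cannot seal an empty batch")
    level = [_leaf(r) for r in receipt_ids]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]  # pad odd levels by repeating the last node
        level = [_node(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def seal(receipt_ids):
    """Merkle root for a batch of receipt IDs."""
    return _levels(receipt_ids)[-1][0]


def prove(receipt_ids, index):
    """Inclusion proof: a list of (sibling_hash, sibling_is_left) pairs."""
    proof = []
    for level in _levels(receipt_ids)[:-1]:
        if len(level) % 2 == 1:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(receipt_id, proof, root):
    """Check one receipt against a sealed root without seeing the others."""
    acc = _leaf(receipt_id)
    for sibling, sibling_is_left in proof:
        acc = _node(sibling, acc) if sibling_is_left else _node(acc, sibling)
    return acc == root
```

A verifier holding only the root, one receipt ID, and its proof can confirm membership; the proof reveals sibling hashes, not the other receipts.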

Transport conformance

Tests are organized per profile (datacenter, WAN, constrained), on top of the requirements common to all profiles.

Common (all profiles):

  • send(message) → receive(message) at recipient with envelope intact
  • Causal metadata preserved through transport
  • Encryption applied per profile spec; channels are identity-bound where the registry supports it
  • Replay attempts are detected and rejected
  • Dropped messages are detected (at-least-once delivery semantics by default)
  • Backpressure is signaled, not silently dropped
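One common way to satisfy the replay-rejection requirement is a per-sender sequence number with a sliding acceptance window. The sketch below shows that technique under assumed names (`ReplayGuard`, `accept`); the transport spec may mandate a different mechanism, such as nonces or timestamps.

```python
class ReplayGuard:
    """Per-sender sliding-window replay filter (illustrative only)."""

    def __init__(self, window=64):
        self._window = window
        self._highest = {}  # sender -> highest sequence number accepted
        self._seen = {}     # sender -> sequence numbers accepted inside the window

    def accept(self, sender, seq):
        """Return True the first time (sender, seq) is seen inside the window."""
        highest = self._highest.get(sender, -1)
        seen = self._seen.setdefault(sender, set())
        if seq <= highest - self._window:
            return False  # older than the window: cannot rule out a replay
        if seq in seen:
            return False  # exact duplicate
        seen.add(seq)
        if seq > highest:
            self._highest[sender] = seq
            floor = seq - self._window
            # Forget numbers that fell out of the window to bound memory.
            self._seen[sender] = {s for s in seen if s > floor}
        return True
```

With at-least-once delivery, legitimate retransmissions look like replays to this filter, which is the desired behavior: the duplicate is dropped after the first copy has been delivered.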

Datacenter profile:

  • p99 routing latency under 100ms in-region for swarms of up to 100 agents
  • Throughput supports the spec'd ops-per-second floor
  • TLS configuration meets the platform crypto baseline

WAN profile:

  • Retry / backoff under variable latency converges
  • Behavior under brief partitions (< 30s) is graceful
  • p99 routing latency under 500ms across regions

Constrained-transport (additional, claimed separately):

The hardest sub-suite, because the failure modes are subtle.

  • 50% packet loss sustained for 5 minutes → all writes eventually sync; no permanent loss
  • 80% partition windows → bounded local autonomy is respected (agents act only within declared scope)
  • 10 kbps sustained bandwidth cap → no operation starves; cost-aware routing prevents pathological behavior
  • One-hour partition → reconvergence produces deterministic, consistent state
  • Conflicting writes during partition → conflict-free merge produces a deterministic, observable outcome
  • No silent receipt loss under any failure pattern in the test set
  • Sync failures are surfaced to operators within the spec'd time bound
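The "deterministic, observable outcome" requirement for conflicting writes can be illustrated with a last-writer-wins merge, which matches the default consistency model named under memory conformance. The entry shape (`ts`, `replica`, `value`) and the tie-break on replica ID are assumptions; the sketch also assumes a replica never issues two writes to a key with the same timestamp.

```python
def _rank(entry):
    # Later timestamp wins; replica ID breaks ties so the result is deterministic.
    return (entry["ts"], entry["replica"])


def merge(a, b):
    """Merge two replica states (key -> entry) with last-writer-wins per key."""
    merged = dict(a)
    for key, entry in b.items():
        current = merged.get(key)
        if current is None or _rank(entry) > _rank(current):
            merged[key] = entry
    return merged


def reconverges(states):
    """Check that merging in forward and reverse order yields the same state."""
    forward = {}
    for state in states:
        forward = merge(forward, state)
    backward = {}
    for state in reversed(states):
        backward = merge(backward, state)
    return forward == backward
```

A conformance test in this style would partition replicas, apply conflicting writes on each side, heal the partition, and assert that every replica reports the same merged state regardless of sync order.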

Access control conformance

Core:

  • Capability tokens are unforgeable (signature verification rejects modifications)
  • Token expiry is enforced
  • Token attenuation (delegation with reduced scope) is enforced
  • Revocation is observable within the spec'd propagation bound
  • No ambient authority; every operation requires an explicit capability check
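Attenuation and tamper rejection can be sketched with a macaroon-style HMAC chain: each added restriction re-keys the tag, so restrictions can be appended by a holder but not removed. This is one possible construction, not the token format from the spec; it uses a shared secret rather than public-key signatures, and the caveat syntax (`action=...`, `expires=...`) is hypothetical.

```python
import hashlib
import hmac


def _mac(key, message):
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def mint(root_key, token_id):
    """Issue an unrestricted token bound to the issuer's root key."""
    return {"id": token_id, "caveats": [], "sig": _mac(root_key, token_id)}


def attenuate(token, caveat):
    """Derive a narrower token; the holder does not need the root key."""
    return {"id": token["id"],
            "caveats": token["caveats"] + [caveat],
            "sig": _mac(token["sig"], caveat)}


def _caveat_holds(caveat, request):
    kind, _, value = caveat.partition("=")
    if kind == "action":
        return request["action"] == value
    if kind == "expires":
        return request["now"] < float(value)
    return False  # unknown caveats fail closed


def verify(root_key, token, request):
    """Recompute the chain from the root key, then check every caveat."""
    sig = _mac(root_key, token["id"])
    for caveat in token["caveats"]:
        sig = _mac(sig, caveat)
    if not hmac.compare_digest(sig, token["sig"]):
        return False  # a caveat was altered, removed, or reordered
    return all(_caveat_holds(c, request) for c in token["caveats"])
```

Dropping a caveat from a derived token while keeping its tag fails verification, which is the property the attenuation and unforgeability tests are meant to probe.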

Verifiable (additional):

  • Capability chains are auditable end-to-end
  • Cross-org delegation produces verifiable attestation
  • Revocation is cryptographically detectable across federation peers

Attestation conformance (optional layer)

For backends claiming attestation support:

  • Attestations bind a decision to the inputs and the agent making it
  • Attestation verification is possible without trusting the producer
  • Attestation chains support reconstruction of decision provenance
  • Attestation forgery is cryptographically detectable

Constitution language conformance

The Cedar+ evaluator and authoring pipeline have their own conformance sub-suite. This is large because policy semantics is where logic bugs hide.

Cedar+ semantic tests

Hundreds to thousands of test cases, organized by feature:

  • Hard constraints (forbid). Empty constitutions; single forbid; conflicting forbids; nested unless; principal-action-resource matrix coverage.
  • Permits. Basic permit; permit with conditions; permit with unless; default-deny when no permit matches.
  • Soft preferences (prefer). Single prefer; multiple prefers ordered by score; prefer interaction with permit / forbid.
  • Memory norms. Per-tier access; scope enforcement; tag-based norms; cascade on forget.
  • Resource budgets. Budget-conditional permits; budget exhaustion; budget restoration semantics.
  • Procedures. State machine transitions; timeout handling; escalation paths; nested procedures.
  • Temporal expressions. Comparisons against passed-in clock; quarantine windows; TTL-conditional rules.
  • Edge cases. Empty constitution; circular references rejected; self-modifying rules rejected; deeply nested unless.
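A table-driven harness is a natural fit for this volume of cases. The sketch below models only permit, forbid, and unless, and assumes that a matching forbid overrides any permit and that the absence of a matching permit means deny. The policy representation as Python dictionaries is purely illustrative and is not Cedar+ syntax.

```python
def evaluate(policies, request):
    """Return "allow" or "deny" for a request under a list of policies."""
    permitted = False
    for policy in policies:
        if not policy["when"](request):
            continue
        unless = policy.get("unless")
        if unless is not None and unless(request):
            continue  # the exception clause lifts this policy
        if policy["effect"] == "forbid":
            return "deny"  # assumed: a hard constraint always wins
        if policy["effect"] == "permit":
            permitted = True
    return "allow" if permitted else "deny"  # default-deny


POLICIES = [
    {"effect": "permit", "when": lambda r: r["action"] == "read"},
    {"effect": "forbid", "when": lambda r: r["resource"] == "pii",
     "unless": lambda r: r["principal"] == "supervisor"},
]

# Each case pairs a request with the expected decision.
CASES = [
    ({"principal": "agent", "action": "read", "resource": "ticket"}, "allow"),
    ({"principal": "agent", "action": "read", "resource": "pii"}, "deny"),
    ({"principal": "supervisor", "action": "read", "resource": "pii"}, "allow"),
    ({"principal": "agent", "action": "write", "resource": "ticket"}, "deny"),
]


def run_cases(policies, cases):
    """Return (request, expected, actual) for every case that disagrees."""
    failures = []
    for request, expected in cases:
        actual = evaluate(policies, request)
        if actual != expected:
            failures.append((request, expected, actual))
    return failures
```

The same case table can be replayed against any evaluator, which is what makes the format useful for comparing implementations.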

Static analyzer tests

The analyzer is the security boundary; its conformance matters more than most.

  • Rejects programs with loops or recursion
  • Rejects programs with side effects
  • Rejects programs exceeding statically-bounded depth
  • Rejects programs with unbounded data structures
  • Accepts every valid program in the canonical examples library
  • Identifies ambiguity (multiple valid interpretations) and surfaces it
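Two of the rejections above, recursion and excessive depth, reduce to graph checks over rule references. The sketch below assumes a hypothetical intermediate form in which each rule name maps to the list of rule names it references; the real analyzer input and error format are not defined here.

```python
def analyze(rules, max_depth):
    """Return error strings for reference cycles and over-deep rule chains.

    `rules` maps a rule name to the names it references (hypothetical form).
    """
    errors = []
    depth_cache = {}

    def depth(name, stack):
        if name in stack:
            cycle = stack[stack.index(name):] + [name]
            errors.append("cycle: " + " -> ".join(cycle))
            return 0
        if name in depth_cache:
            return depth_cache[name]
        children = rules.get(name, [])
        deepest = max((depth(ref, stack + [name]) for ref in children), default=0)
        depth_cache[name] = 1 + deepest
        return depth_cache[name]

    for name in sorted(rules):
        if depth(name, []) > max_depth:
            errors.append(f"depth exceeded: {name}")
    return errors
```

An empty result means the program passed these two checks only; side effects and unbounded data structures need separate analyses.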

Authoring pipeline tests

For the LLM-assisted plain-English layer:

  • Plain English → Cedar+ produces structurally valid output for the canonical example set
  • Generated test cases cover the scenarios claimed in the English source
  • Diagnostic mode correctly flags ambiguous English
  • Round-trip: hand-edited Cedar+ → re-display in English layer is consistent
  • Failure mode: when English cannot be cleanly compiled, the tool fails loud rather than producing an unsafe predicate

The LLM authoring layer is not a security boundary (the static analyzer is), so its conformance is about UX consistency, not safety. A backend that ships a worse LLM authoring tool can still be conformant; it just has worse DX.


Behavioral conformance

Cross-component integration tests. These prove that a particular backend stack produces the right swarm-level behavior, not just that each component passes its interface suite.

This is the most expensive part of the suite to run, but the most important for proving observable equivalence across stacks.

Reference scenarios

Each scenario runs a complete swarm against the candidate backend combination and verifies swarm-level invariants.

S1. Customer-support queue mode (Phase 1 anchor scenario). Five agents from two frameworks; queue of 1,000 tickets; supervisor; full memory access; norms over PII. Pass criteria: all tickets resolved or escalated correctly; all consequential actions produced verifiable receipts; no unauthorized memory access; behavior identical (within tolerance) to the reference stack.

S2. Insurance due-diligence pipeline (Phase 2 queue-mode anchor). Ten agents; 500 claims; affinity rules; peer consultation; role-scoped memory; SLA tracking. Pass criteria: SLA met or accurately reported as missed; affinity rules respected; medical-info access norms enforced.

S3. Security incident response (Phase 2 campaign-mode anchor). Eight investigator agents; injected adversary scenarios from the threat model; supervisor approvals required for production actions. Pass criteria: hard-rule violations detected within one tick; coaching reduces repeat violations; no production action without supervisor approval.

S4. Disaster response with constrained transport (Phase 3 anchor). Twenty-five agents (mix of software and drone-mounted); simulated network with packet loss and partition windows; federated across two organizations. Pass criteria: bounded local autonomy respected; reconvergence produces consistent state; no receipt loss; cross-org receipts mutually recognized.

S5. Federated multi-org delivery network (Phase 4 anchor). Three swarms with different constitutions; shared subset of norms; bad-actor agent in one swarm. Pass criteria: federation handshake succeeds; bad-actor agent is detected and quarantined; clean detachment after the joint operation.

Differential conformance

A particularly important pattern: run S1–S5 on both the candidate stack and the reference stack with identical inputs, and verify that observable swarm behavior is equivalent within stated tolerances. This is the single strongest test that a candidate implementation is genuinely substitutable.

Tolerances must be specified explicitly. Some are exact (every receipt must exist, every signature must verify); some are bounded (latency within X percent of reference); some are statistical (decision distribution within Y standard deviations).
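The three tolerance kinds can be sketched as three small checks. The function names and the choice to treat the bounded check as one-sided (a candidate faster than the reference passes) are assumptions; the tolerance specification language itself is still an open question.

```python
import statistics


def check_exact(reference_ids, candidate_ids):
    """Exact tolerance: the candidate must produce exactly the same set."""
    return set(reference_ids) == set(candidate_ids)


def check_bounded(reference_value, candidate_value, max_pct):
    """Bounded tolerance: candidate may exceed the reference by at most max_pct percent."""
    return candidate_value <= reference_value * (1 + max_pct / 100)


def check_statistical(reference_samples, candidate_mean, max_sigmas):
    """Statistical tolerance: candidate mean within max_sigmas sample
    standard deviations of the reference mean."""
    mean = statistics.mean(reference_samples)
    stdev = statistics.stdev(reference_samples)
    return abs(candidate_mean - mean) <= max_sigmas * stdev
```

A differential run would apply `check_exact` to receipt IDs, `check_bounded` to latency figures, and `check_statistical` to aggregated decision metrics collected from repeated reference runs.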


Performance conformance

SLA tests with stated load profiles. These are not the place to set ambitious performance goals; they are the floor below which a backend cannot fall.

| Operation | p99 target | Notes |
| --- | --- | --- |
| Message routing (datacenter) | < 100 ms | Up to 100 agents, in-region |
| Message routing (WAN) | < 500 ms | Cross-region |
| Receipt write | < 250 ms | Default backend; verifiable backends may have a separate target |
| Memory get / put (KV) | < 50 ms | Default backend |
| Memory search (semantic) | < 200 ms | Vector-capable backend |
| Constitution evaluation | < 10 ms | Per decision; bounded by static depth |

Targets are revisited as the suite matures and real backends generate benchmark data. Phase 1 ships with placeholder targets; Phase 2 finalizes them.
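How a p99 is computed affects whether a backend passes, so the method should be pinned down along with the targets. The sketch below uses the nearest-rank method as one possible choice; the operation keys are hypothetical names for the rows of the table above, and the values are the placeholder targets from that table.

```python
# Placeholder p99 floors in milliseconds, copied from the table above.
# The dictionary keys are hypothetical identifiers.
P99_TARGETS_MS = {
    "message_routing_datacenter": 100,
    "message_routing_wan": 500,
    "receipt_write": 250,
    "memory_get_put": 50,
    "memory_search_semantic": 200,
    "constitution_evaluation": 10,
}


def p99(samples):
    """99th percentile by the nearest-rank method (assumed; the spec may differ)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def meets_target(operation, samples_ms):
    """True when the observed p99 is strictly under the floor for the operation."""
    return p99(samples_ms) < P99_TARGETS_MS[operation]
```

With 100 samples, nearest-rank p99 is the 99th smallest value, so a single outlier does not fail the check but two do.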

Security conformance

Targeted tests per adversary in the threat model (/docs/internal/threat-model.md).

For each of A1 through A9:

  • A scripted attack exercising the adversary's claimed capabilities
  • Verification that the documented mitigations fire
  • Verification that residual risk is no worse than claimed
  • A red-team test where the adversary attempts to escape the documented capability bounds

These tests are co-developed with Workstream L. Results are reviewed by the foundation's security working group before any conformance mark is issued.

How conformance is run

  • Reference test runner. A canonical runner ships in the project repo. Backends declare claimed levels; runner executes the appropriate subsets; produces a signed report.
  • Self-attestation. Vendors may run the suite and publish results themselves. Self-attested results carry a different visual treatment than foundation-verified results.
  • Foundation verification. For the conformance mark, the foundation runs the suite against the vendor's submitted artifact in a controlled environment. Results are published in a public registry.
  • Continuous conformance. Conformant backends commit to running the latest suite against every release. Falling out of conformance triggers loss of the mark.

The conformance mark

The conformance mark is what users see. It indicates:

  • The level(s) at which the backend is conformant
  • The suite version it was last tested against
  • The date of last verification
  • Whether verification was self-attested or foundation-verified

The mark is revocable. Misrepresentation of conformance is grounds for revocation and (for foundation-verified marks) public advisory.

How the suite evolves

  • Test additions follow the RFC process. New tests must be backwards-compatible with existing claims at older spec versions: a backend conformant at suite v1.0 must remain conformant at v1.1. A change that would break existing claims must be marked explicitly as breaking and ship under a new major version.
  • Test deprecation requires explicit replacement. No test is removed without a successor that covers the same property.
  • Suite versioning follows semver: major version bumps signal breaking changes; minor versions add tests; patch versions fix bugs in existing tests.
  • Backwards compatibility window. Backends have at least 12 months to retest against a new major suite version before the older suite version is retired.
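The versioning rules above can be written down as a small sketch. The change-kind labels (`breaking`, `add-tests`, `fix-test`) and the `claim_carries_over` helper are hypothetical; versions are modeled as `(major, minor, patch)` tuples.

```python
def next_version(current, change):
    """Apply the suite's semver rules to a (major, minor, patch) tuple."""
    major, minor, patch = current
    if change == "breaking":
        return (major + 1, 0, 0)
    if change == "add-tests":
        return (major, minor + 1, 0)
    if change == "fix-test":
        return (major, minor, patch + 1)
    raise ValueError(f"unknown change kind: {change}")


def claim_carries_over(claimed, current):
    """A claim made at `claimed` is expected to hold at `current` when the
    major version is unchanged and `current` is not older than `claimed`."""
    return claimed[0] == current[0] and current >= claimed
```

Under these rules a backend verified at `(1, 0, 0)` is expected to pass at `(1, 1, 0)`, while a move to `(2, 0, 0)` starts the retest window described above.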

Open design questions

These are resolved during Phase 1, before the suite is published as a stable artifact.

  • Test runner language. Rust (matches core) vs. polyglot (lower contributor barrier). Leaning Rust for the runner core with adapters for vendor-provided backends in any language.
  • How are vendor backends submitted? Container images, source builds, or running endpoints? Probably all three with different verification depth per submission type.
  • Tolerance specification language. How are statistical-equivalence tolerances declared? Bespoke vs. existing property-test framework. Defer.
  • Performance test environment normalization. Different hardware produces different results. Either we standardize the test environment (cloud instance type), provide a normalization model, or report results per-environment. Probably a hybrid.
  • Cost. Running the full behavioral suite is expensive (long-running, requires multiple agents, network simulation). The foundation needs a sustainable funding model for verification-as-a-service.
  • Conformance for the control plane itself. This document covers backends and language. A separate sub-suite for alternative control-plane implementations (if any ever emerge) is a Phase 4 question.

How this document evolves

Major changes to suite design follow the RFC process. The suite itself is a living artifact in /conformance/; this document explains why it is structured the way it is, and is updated when that rationale changes.