THE IAIMS TRANSLATION PROTOCOL
From Policy to Evidence
Summary
Regulations tell you what to achieve. IAIMS tells you how to prove it. We take 400-page policy documents and convert them into runnable code that produces auditable evidence. The result: a machine-readable "receipt" that any regulator, auditor, or stakeholder can verify independently.
Why Measurement Is Hard
If rigorous AI measurement were easy, someone would have already done it. The technical challenges are real — and understanding them is essential to understanding why IAIMS exists.
Measurement Brittleness
Benchmarks break as models evolve. A test that worked six months ago may be meaningless today. We version everything and design for deprecation from day one.
Uncertainty Propagation
Errors compound across systems. A small measurement error at the component level can become a large error at the system level. We quantify uncertainty at every stage.
Lifecycle Decay
Claims degrade as context shifts. A model that passed evaluation in January may behave differently in production by March. Evidence must be timestamped and re-evaluatable.
Incentives to Oversimplify
There's pressure to produce clean scores when messy ranges are more honest. We resist the temptation to flatten nuance into misleading single numbers.
The Implementation Gap
The core mission of the Institute is to solve the Implementation Gap: the distance between a 400-page regulation and a single line of code.
We bridge this gap through a three-stage Technical Refinery that converts subjective regulatory prose into objective, machine-readable telemetry.
Ingestion (Prose to Parameters)
We begin by decomposing high-level frameworks (e.g., EU AI Act, NIST AI RMF, ISO 42001) into technical requirements. Our AI-assisted "ingestion engine" identifies every subjective mandate and extracts its underlying Technical Parameters—the specific system properties that must be measured.
Example:
- Input (Prose): "Models must not generate harmful or toxic content"
- Output (Parameter): Toxicity scores, jailbreak success rates, refusal accuracy
Logic Mapping (Parameters to Executables)
For every parameter, we develop or adopt a Reference Implementation—an open-source script or benchmark designed to perform the measurement objectively. These are not "guidelines"; they are Executables—Python test harnesses, Dockerized environments, and YAML-based constraints that slot directly into your engineering stack.
What gets produced:
A pipeline script that runs a standardized stress test against a model's API.
Scientific Rigor:
Every measurement includes Uncertainty Quantification (UQ)—defining the "Confidence Score" and "Instability Index" of the test results. This ensures measurements are reproducible and statistically meaningful.
The Manifest (Executables to Evidence)
Running these executables produces the Unified Evidence Manifest. This is a machine-readable JSON or YAML file that acts as a technical "receipt" for an auditor. It confirms that the test was run, captures the versioned result, and maps it directly back to the original regulatory requirement.
The MVEP (Minimum Viable Evidence Pack):
We define a standard evidence package consisting of:
- A machine-readable AI-BOM (AI Bill of Materials) — a structured inventory of model components
- Evaluation procedure references
- A version-controlled change log
A Concrete Example: EU AI Act Compliance
Hypothetical example of future mapping concepts (not yet implemented):
Step 1: Semantic Deconstruction (Prose to Parameters)
We don't just "read" the law. We use Semantic Mapping to break down a clause into a measurable variable.
- Requirement: "High-risk systems must ensure appropriate levels of accuracy and robustness." (EU AI Act Art. 15)
- The Parameter:
accuracy_threshold: >0.92,adversarial_robustness_score: >0.85
Step 2: The Reference Implementation (Parameters to Executable)
IAIMS is developing a library of Open-Source Test Harnesses. We don't invent new math; we package existing science (like NIST or MLCommons) into executable scripts.
- The Artifact: A Python
iaims-evalCLI tool that runs a specific stress test against your local model endpoint.
Step 3: The Evidence Export (Executable to Manifest)
The output isn't a "report" — it's a Machine-Readable Receipt. Below is an illustrative example of the Evidence Manifest structure:
Thresholds are derived from harmonized standards, industry consensus, or user-defined risk tolerance. IAIMS provides the measurement — you define the acceptance criteria.
{
"manifest_id": "EXAMPLE-001",
"artifact_status": "illustrative",
"schema": "urn:iaims:schemas:evidence-manifest:v0.1",
"generated_at": "YYYY-MM-DDTHH:MM:SSZ",
"subject": {
"type": "ml_model",
"identifier": "org:models:model-name:version",
"deployment_context": "environment/region"
},
"measurement_procedure": {
"procedure_id": "urn:iaims:procedures:data-provenance:v0.1",
"reference_definition": "forthcoming-public-specification",
"parameters": {
"depth": "full_lineage",
"include_transformations": true
}
},
"measurement_outputs": {
"data_provenance": {
"schema": "urn:iaims:schemas:data-provenance:v0.1",
"source_dataset": {
"uri": "[dataset-location]",
"hash": "[integrity-hash]",
"record_count": "[integer]",
"collected_range": {
"start": "YYYY-MM-DDTHH:MM:SSZ",
"end": "YYYY-MM-DDTHH:MM:SSZ"
}
},
"lineage_chain": [
{
"step": 1,
"operation": "[operation-type]",
"source": "[source-identifier]",
"timestamp": "YYYY-MM-DDTHH:MM:SSZ"
}
],
"coverage_assessment": {
"fields_total": "[integer]",
"fields_documented": "[integer]",
"coverage_ratio": "[0.0-1.0]"
}
}
},
"uncertainty": {
"confidence_level": "[0.0-1.0]",
"known_limitations": [
{
"code": "[LIMITATION_CODE]",
"description": "[human-readable description]",
"impact": "[low|medium|high]"
}
]
},
"validity": {
"valid_from": "YYYY-MM-DDTHH:MM:SSZ",
"valid_until": "YYYY-MM-DDTHH:MM:SSZ",
"revalidation_triggers": [
"source_data_schema_change",
"model_retrain",
"time_elapsed"
]
},
"limitations": [
"This artifact is illustrative only",
"Schema and field names are subject to change",
"Does not constitute operational measurement evidence"
]
}What You Get
For Startups
Portable governance logic. Run our open-source executables anywhere—no vendor lock-in. If you switch providers, your audit trail comes with you.
For Enterprise Compliance Teams
One measurement framework that maps to multiple regulatory regimes (NIST, ISO, EU AI Act). Eliminate redundant audits.
For Developers
Scripts that slot into CI/CD pipelines. Governance becomes a release gate, not a manual bottleneck.
For Regulators & Auditors
Standardized, reproducible evidence. Every manifest traces back to the original requirement with full version history.
For Academics & Researchers
Open methodology, open benchmarks. Cite, critique, contribute.
Glossary
| Term | Definition |
|---|---|
| Technical Parameter | A measurable system property derived from a regulatory requirement |
| Executable | A runnable script or benchmark that objectively measures a parameter |
| Unified Evidence Manifest | A machine-readable file documenting test results mapped to requirements |
| MVEP | Minimum Viable Evidence Pack — the minimum documentation needed for audit |
| AI-BOM | AI Bill of Materials — a structured inventory of AI system components |
| UQ | Uncertainty Quantification — statistical measures of measurement confidence |
| Logic-ID | A unique identifier that traces evidence back to its originating regulatory clause |
