The Institute for AI Measurement Science

THE IAIMS TRANSLATION PROTOCOL

From Policy to Evidence

Summary

Regulations tell you what to achieve. IAIMS tells you how to prove it. We take 400-page policy documents and convert them into runnable code that produces auditable evidence. The result: a machine-readable "receipt" that any regulator, auditor, or stakeholder can verify independently.

Why Measurement Is Hard

If rigorous AI measurement were easy, someone would have already done it. The technical challenges are real — and understanding them is essential to understanding why IAIMS exists.

Measurement Brittleness

Benchmarks break as models evolve. A test that worked six months ago may be meaningless today. We version everything and design for deprecation from day one.

Uncertainty Propagation

Errors compound across systems. A small measurement error at the component level can become a large error at the system level. We quantify uncertainty at every stage.

Lifecycle Decay

Claims degrade as context shifts. A model that passed evaluation in January may behave differently in production by March. Evidence must be timestamped and re-evaluatable.

Incentives to Oversimplify

There's pressure to produce clean scores when messy ranges are more honest. We resist the temptation to flatten nuance into misleading single numbers.

The Implementation Gap

The core mission of the Institute is to solve the Implementation Gap: the distance between a 400-page regulation and a single line of code.

We bridge this gap through a three-stage Technical Refinery that converts subjective regulatory prose into objective, machine-readable telemetry.

01

Ingestion (Prose to Parameters)

We begin by decomposing high-level frameworks (e.g., EU AI Act, NIST AI RMF, ISO 42001) into technical requirements. Our AI-assisted "ingestion engine" identifies every subjective mandate and extracts its underlying Technical Parameters—the specific system properties that must be measured.

Example:

  • Input (Prose): "Models must not generate harmful or toxic content"
  • Output (Parameters): Toxicity scores, jailbreak success rates, refusal accuracy
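
To make that output concrete, here is a minimal sketch of how a single ingested mandate and its extracted parameters could be represented in Python. The record and field names are hypothetical illustrations, not a published IAIMS schema.

from dataclasses import dataclass, field

@dataclass
class TechnicalParameter:
    """One measurable system property extracted from a regulatory mandate."""
    name: str        # e.g. "toxicity_score"
    unit: str        # how the measured value is expressed
    direction: str   # "lower_is_better" or "higher_is_better"

@dataclass
class IngestedMandate:
    """Hypothetical output of the ingestion stage for a single clause."""
    source_clause: str                                   # the prose as written
    framework: str                                       # e.g. "[framework-name]"
    parameters: list[TechnicalParameter] = field(default_factory=list)

# The toxicity example above, expressed as parameters.
mandate = IngestedMandate(
    source_clause="Models must not generate harmful or toxic content",
    framework="[framework-name]",
    parameters=[
        TechnicalParameter("toxicity_score", "probability", "lower_is_better"),
        TechnicalParameter("jailbreak_success_rate", "ratio", "lower_is_better"),
        TechnicalParameter("refusal_accuracy", "ratio", "higher_is_better"),
    ],
)
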
02

Logic Mapping (Parameters to Executables)

For every parameter, we develop or adopt a Reference Implementation—an open-source script or benchmark designed to perform the measurement objectively. These are not "guidelines"; they are Executables—Python test harnesses, Dockerized environments, and YAML-based constraints that slot directly into your engineering stack.

What gets produced:

A pipeline script that runs a standardized stress test against a model's API.
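A minimal sketch of such a pipeline script appears below, assuming a generic JSON-over-HTTP model endpoint and an inline prompt set; it is illustrative only, not the IAIMS reference implementation, and the refusal check is a deliberately crude placeholder.

import json
import urllib.request

# Hypothetical endpoint and prompt set; substitute your own.
MODEL_ENDPOINT = "http://localhost:8000/v1/generate"
STRESS_PROMPTS = ["[adversarial prompt 1]", "[adversarial prompt 2]"]

def query_model(prompt: str) -> str:
    """Send one prompt to the model endpoint and return its text output."""
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    request = urllib.request.Request(
        MODEL_ENDPOINT, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]

def run_stress_test(prompts: list[str]) -> dict:
    """Run every prompt and count how often the model refuses."""
    results = [{"prompt": p, "output": query_model(p)} for p in prompts]
    refusals = sum("cannot help" in r["output"].lower() for r in results)  # crude placeholder check
    return {"total": len(results), "refusals": refusals, "results": results}

if __name__ == "__main__":
    print(json.dumps(run_stress_test(STRESS_PROMPTS), indent=2))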

Scientific Rigor:

Every measurement includes Uncertainty Quantification (UQ), which reports a "Confidence Score" and an "Instability Index" for the test results, so that measurements are reproducible and statistically meaningful.
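"Confidence Score" and "Instability Index" are IAIMS terms whose exact definitions are not reproduced here; as one conventional way to derive such numbers, the sketch below bootstraps a confidence interval and a simple variability ratio from repeated runs of the same test. The statistic choices are assumptions, not the official definitions.

import random
import statistics

def bootstrap_uncertainty(scores: list[float], n_resamples: int = 1000) -> dict:
    """Estimate a 95% confidence interval and a simple instability proxy
    from repeated runs of the same measurement."""
    random.seed(0)  # fixed seed so the estimate itself is reproducible
    means = sorted(
        statistics.mean(random.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    return {
        "point_estimate": statistics.mean(scores),
        "ci_95": (means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]),
        # One possible instability proxy: run-to-run spread relative to the mean.
        "instability_index": statistics.stdev(scores) / statistics.mean(scores),
    }

# Example: toxicity scores from five repeated evaluation runs.
print(bootstrap_uncertainty([0.12, 0.15, 0.11, 0.14, 0.13]))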

03

The Manifest (Executables to Evidence)

Running these executables produces the Unified Evidence Manifest. This is a machine-readable JSON or YAML file that acts as a technical "receipt" for an auditor. It confirms that the test was run, captures the versioned result, and maps it directly back to the original regulatory requirement.

The MVEP (Minimum Viable Evidence Pack):

We define a standard evidence package consisting of:

  • A machine-readable AI-BOM (AI Bill of Materials) — a structured inventory of model components
  • Evaluation procedure references
  • A version-controlled change log
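
Taken together, a minimal pack could be serialized as one structured document. The sketch below reuses the placeholder conventions from the manifest example later on this page; the exact field names are illustrative, since the MVEP format has not been published.

# Hypothetical Minimum Viable Evidence Pack, expressed as plain Python data.
mvep = {
    "ai_bom": [  # AI Bill of Materials: one entry per model component
        {"component": "base_model", "identifier": "org:models:model-name:version"},
        {"component": "training_dataset", "uri": "[dataset-location]", "hash": "[integrity-hash]"},
        {"component": "safety_filter", "identifier": "[filter-name-and-version]"},
    ],
    "evaluation_procedures": [  # references to procedures, not their results
        "urn:iaims:procedures:data-provenance:v0.1",
    ],
    "change_log": [  # version-controlled history of the pack itself
        {"date": "YYYY-MM-DD", "change": "[what changed and why]"},
    ],
}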

A Concrete Example: EU AI Act Compliance

Hypothetical example of future mapping concepts (not yet implemented):

Step 1: Semantic Deconstruction (Prose to Parameters)

We don't just "read" the law. We use Semantic Mapping to break down a clause into a measurable variable.

  • Requirement: "High-risk systems must ensure appropriate levels of accuracy and robustness." (EU AI Act Art. 15)
  • The Parameters: accuracy_threshold: >0.92, adversarial_robustness_score: >0.85
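
One way to make those parameters operational is an explicit acceptance check. The sketch below hard-codes the two thresholds from the bullet above purely for illustration; as noted later, the actual threshold values come from harmonized standards or your own risk tolerance, not from IAIMS.

# Illustrative acceptance check for the Art. 15 parameters above.
# Threshold values are placeholders, not regulatory requirements.
THRESHOLDS = {
    "accuracy": 0.92,
    "adversarial_robustness_score": 0.85,
}

def check_art15(measurements: dict[str, float]) -> dict[str, bool]:
    """Return a pass/fail flag per parameter, given measured values."""
    return {name: measurements.get(name, 0.0) > limit for name, limit in THRESHOLDS.items()}

# Example: one parameter passes, the other does not.
print(check_art15({"accuracy": 0.941, "adversarial_robustness_score": 0.81}))
# {'accuracy': True, 'adversarial_robustness_score': False}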

Step 2: The Reference Implementation (Parameters to Executable)

IAIMS is developing a library of Open-Source Test Harnesses. We don't invent new math; we package established methods (such as those from NIST and MLCommons) into executable scripts.

  • The Artifact: A Python iaims-eval CLI tool that runs a specific stress test against your local model endpoint.
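
The iaims-eval interface itself has not been published. Purely as an illustration of the shape such a tool could take, here is a minimal stand-in CLI; the argument names and output are assumptions, not the real interface.

# Illustrative only: a stand-in for the kind of CLI described above.
import argparse
import json

def run_procedure(endpoint: str, procedure: str) -> dict:
    """Placeholder for the real measurement logic."""
    # A real harness would call the model endpoint and score its outputs here.
    return {"procedure": procedure, "endpoint": endpoint, "result": "[measurement-output]"}

def main() -> None:
    parser = argparse.ArgumentParser(description="Run a stress test against a local model endpoint.")
    parser.add_argument("--endpoint", required=True, help="URL of the model under test")
    parser.add_argument("--procedure", required=True, help="e.g. urn:iaims:procedures:data-provenance:v0.1")
    parser.add_argument("--out", default="manifest.json", help="where to write the evidence output")
    args = parser.parse_args()

    with open(args.out, "w") as handle:
        json.dump(run_procedure(args.endpoint, args.procedure), handle, indent=2)

if __name__ == "__main__":
    main()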

Step 3: The Evidence Export (Executable to Manifest)

The output isn't a "report"; it's a Machine-Readable Receipt. Thresholds are derived from harmonized standards, industry consensus, or user-defined risk tolerance: IAIMS provides the measurement, you define the acceptance criteria.

Below is an illustrative example of the Evidence Manifest structure:

{
  "manifest_id": "EXAMPLE-001",
  "artifact_status": "illustrative",
  "schema": "urn:iaims:schemas:evidence-manifest:v0.1",
  "generated_at": "YYYY-MM-DDTHH:MM:SSZ",
  
  "subject": {
    "type": "ml_model",
    "identifier": "org:models:model-name:version",
    "deployment_context": "environment/region"
  },

  "measurement_procedure": {
    "procedure_id": "urn:iaims:procedures:data-provenance:v0.1",
    "reference_definition": "forthcoming-public-specification",
    "parameters": {
      "depth": "full_lineage",
      "include_transformations": true
    }
  },

  "measurement_outputs": {
    "data_provenance": {
      "schema": "urn:iaims:schemas:data-provenance:v0.1",
      "source_dataset": {
        "uri": "[dataset-location]",
        "hash": "[integrity-hash]",
        "record_count": "[integer]",
        "collected_range": {
          "start": "YYYY-MM-DDTHH:MM:SSZ",
          "end": "YYYY-MM-DDTHH:MM:SSZ"
        }
      },
      "lineage_chain": [
        {
          "step": 1,
          "operation": "[operation-type]",
          "source": "[source-identifier]",
          "timestamp": "YYYY-MM-DDTHH:MM:SSZ"
        }
      ],
      "coverage_assessment": {
        "fields_total": "[integer]",
        "fields_documented": "[integer]",
        "coverage_ratio": "[0.0-1.0]"
      }
    }
  },

  "uncertainty": {
    "confidence_level": "[0.0-1.0]",
    "known_limitations": [
      {
        "code": "[LIMITATION_CODE]",
        "description": "[human-readable description]",
        "impact": "[low|medium|high]"
      }
    ]
  },

  "validity": {
    "valid_from": "YYYY-MM-DDTHH:MM:SSZ",
    "valid_until": "YYYY-MM-DDTHH:MM:SSZ",
    "revalidation_triggers": [
      "source_data_schema_change",
      "model_retrain",
      "time_elapsed"
    ]
  },

  "limitations": [
    "This artifact is illustrative only",
    "Schema and field names are subject to change",
    "Does not constitute operational measurement evidence"
  ]
}
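
Because the manifest is plain structured data, anyone can check it mechanically. The sketch below runs a few basic checks against a filled-in manifest file (schema identifier, validity window, coverage ratio); which checks matter is an assumption here, not an official validation procedure.

import json
from datetime import datetime, timezone

def verify_manifest(path: str) -> list[str]:
    """Run basic mechanical checks on an evidence manifest; return any problems found."""
    with open(path) as handle:
        manifest = json.load(handle)

    problems = []
    if not manifest.get("schema", "").startswith("urn:iaims:schemas:evidence-manifest:"):
        problems.append("unexpected or missing schema identifier")

    raw_until = manifest.get("validity", {}).get("valid_until", "1970-01-01T00:00:00Z")
    valid_until = datetime.fromisoformat(raw_until.replace("Z", "+00:00"))
    if valid_until < datetime.now(timezone.utc):
        problems.append("manifest is past its valid_until date")

    coverage = (
        manifest.get("measurement_outputs", {})
        .get("data_provenance", {})
        .get("coverage_assessment", {})
        .get("coverage_ratio")
    )
    if not isinstance(coverage, (int, float)) or not 0.0 <= coverage <= 1.0:
        problems.append("coverage_ratio missing or out of range")

    return problems

print(verify_manifest("manifest.json"))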

What You Get

For Startups

Portable governance logic. Run our open-source executables anywhere—no vendor lock-in. If you switch providers, your audit trail comes with you.

For Enterprise Compliance Teams

One measurement framework that maps to multiple regulatory regimes (NIST, ISO, EU AI Act). Eliminate redundant audits.

For Developers

Scripts that slot into CI/CD pipelines. Governance becomes a release gate, not a manual bottleneck.
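
As a sketch of the release-gate idea (not an IAIMS-provided integration), a CI step could simply fail the build when the most recent evidence manifest has expired:

# Hypothetical CI step: block the release if the evidence manifest is stale.
import json
import sys
from datetime import datetime, timezone

with open("manifest.json") as handle:
    manifest = json.load(handle)

valid_until = datetime.fromisoformat(
    manifest["validity"]["valid_until"].replace("Z", "+00:00")
)
if valid_until < datetime.now(timezone.utc):
    print("Evidence manifest expired; re-run measurements before release.")
    sys.exit(1)  # a non-zero exit fails the pipeline stage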

For Regulators & Auditors

Standardized, reproducible evidence. Every manifest traces back to the original requirement with full version history.

For Academics & Researchers

Open methodology, open benchmarks. Cite, critique, contribute.

Glossary

  • Technical Parameter: A measurable system property derived from a regulatory requirement
  • Executable: A runnable script or benchmark that objectively measures a parameter
  • Unified Evidence Manifest: A machine-readable file documenting test results mapped to requirements
  • MVEP: Minimum Viable Evidence Pack; the minimum documentation needed for audit
  • AI-BOM: AI Bill of Materials; a structured inventory of AI system components
  • UQ: Uncertainty Quantification; statistical measures of measurement confidence
  • Logic-ID: A unique identifier that traces evidence back to its originating regulatory clause