What is AI measurement science?

AI measurement science is the discipline of creating standardized, reproducible methods to evaluate AI system capabilities, safety, and compliance. It translates ambiguous governance requirements into technical executables that produce audit-ready evidence.

Why does AI governance need measurement infrastructure?

AI governance requires measurement infrastructure because trust claims need verifiable evidence. Without standardized measurement, AI safety remains a marketing claim rather than a proven fact. Measurement infrastructure enables audit-ready documentation, interoperable compliance tools, and reproducible safety evaluations.

What is the Public Logic Layer for AI?

The Public Logic Layer is IAIMS' approach to AI governance that provides open-source, standardized measurement tools between AI systems and regulatory requirements. It translates frameworks like NIST AI RMF, ISO 42001, and the EU AI Act into executable technical specifications.

How does IAIMS approach AI audit and compliance?

IAIMS provides measurement infrastructure that produces audit-ready evidence for AI compliance. This includes standardized evaluation protocols, interoperable data formats, and open-source tools that work across regulatory frameworks including NIST, ISO, and EU AI Act requirements.

What AI governance frameworks does IAIMS support?

IAIMS measurement infrastructure supports major AI governance frameworks including NIST AI Risk Management Framework (AI RMF), ISO/IEC 42001 AI Management Systems, the EU AI Act, and emerging international standards for AI safety and compliance.

THE IAIMS TRANSLATION PROTOCOL

From Policy to Evidence

Summary

Regulations tell you what to achieve. IAIMS tells you how to prove it. We take 400-page policy documents and convert them into runnable code that produces auditable evidence. The result: a machine-readable "receipt" that any regulator, auditor, or stakeholder can verify independently.

Why Measurement Is Hard

If rigorous AI measurement were easy, someone would have already done it. The technical challenges are real — and understanding them is essential to understanding why IAIMS exists.

Measurement Brittleness

Benchmarks break as models evolve. A test that worked six months ago may be meaningless today. We version everything and design for deprecation from day one.

Uncertainty Propagation

Errors compound across systems. A small measurement error at the component level can become a large error at the system level. We quantify uncertainty at every stage.

Lifecycle Decay

Claims degrade as context shifts. A model that passed evaluation in January may behave differently in production by March. Evidence must be timestamped and re-evaluatable.

Incentives to Oversimplify

There's pressure to produce clean scores when messy ranges are more honest. We resist the temptation to flatten nuance into misleading single numbers.

The Implementation Gap

The core mission of the Institute is to solve the Implementation Gap: the distance between a 400-page regulation and a single line of code.

We bridge this gap through a three-stage Technical Refinery that converts subjective regulatory prose into objective, machine-readable telemetry.

Ingestion (Prose to Parameters)

We begin by decomposing high-level frameworks (e.g., EU AI Act, NIST AI RMF, ISO 42001) into technical requirements. Our AI-assisted "ingestion engine" identifies every subjective mandate and extracts its underlying Technical Parameters—the specific system properties that must be measured.

Example:

Input (Prose): "Models must not generate harmful or toxic content"
Output (Parameter): Toxicity scores, jailbreak success rates, refusal accuracy

Logic Mapping (Parameters to Executables)

For every parameter, we develop or adopt a Reference Implementation—an open-source script or benchmark designed to perform the measurement objectively. These are not "guidelines"; they are Executables—Python test harnesses, Dockerized environments, and YAML-based constraints that slot directly into your engineering stack.

What gets produced:

A pipeline script that runs a standardized stress test against a model's API.

Scientific Rigor:

Every measurement includes Uncertainty Quantification (UQ)—defining the "Confidence Score" and "Instability Index" of the test results. This ensures measurements are reproducible and statistically meaningful.

The Manifest (Executables to Evidence)

Running these executables produces the Unified Evidence Manifest. This is a machine-readable JSON or YAML file that acts as a technical "receipt" for an auditor. It confirms that the test was run, captures the versioned result, and maps it directly back to the original regulatory requirement.

The MVEP (Minimum Viable Evidence Pack):

We define a standard evidence package consisting of:

A machine-readable AI-BOM (AI Bill of Materials) — a structured inventory of model components
Evaluation procedure references
A version-controlled change log

A Concrete Example: EU AI Act Compliance

Hypothetical example of future mapping concepts (not yet implemented):

Step 1: Semantic Deconstruction (Prose to Parameters)

We don't just "read" the law. We use Semantic Mapping to break down a clause into a measurable variable.

Requirement: "High-risk systems must ensure appropriate levels of accuracy and robustness." (EU AI Act Art. 15)
The Parameter: accuracy_threshold: >0.92, adversarial_robustness_score: >0.85

Step 2: The Reference Implementation (Parameters to Executable)

IAIMS is developing a library of Open-Source Test Harnesses. We don't invent new math; we package existing science (like NIST or MLCommons) into executable scripts.

The Artifact: A Python iaims-eval CLI tool that runs a specific stress test against your local model endpoint.

Step 3: The Evidence Export (Executable to Manifest)

The output isn't a "report" — it's a Machine-Readable Receipt. Below is an illustrative example of the Evidence Manifest structure:

Thresholds are derived from harmonized standards, industry consensus, or user-defined risk tolerance. IAIMS provides the measurement — you define the acceptance criteria.

{
  "manifest_id": "EXAMPLE-001",
  "artifact_status": "illustrative",
  "schema": "urn:iaims:schemas:evidence-manifest:v0.1",
  "generated_at": "YYYY-MM-DDTHH:MM:SSZ",
  
  "subject": {
    "type": "ml_model",
    "identifier": "org:models:model-name:version",
    "deployment_context": "environment/region"
  },

  "measurement_procedure": {
    "procedure_id": "urn:iaims:procedures:data-provenance:v0.1",
    "reference_definition": "forthcoming-public-specification",
    "parameters": {
      "depth": "full_lineage",
      "include_transformations": true
    }
  },

  "measurement_outputs": {
    "data_provenance": {
      "schema": "urn:iaims:schemas:data-provenance:v0.1",
      "source_dataset": {
        "uri": "[dataset-location]",
        "hash": "[integrity-hash]",
        "record_count": "[integer]",
        "collected_range": {
          "start": "YYYY-MM-DDTHH:MM:SSZ",
          "end": "YYYY-MM-DDTHH:MM:SSZ"
        }
      },
      "lineage_chain": [
        {
          "step": 1,
          "operation": "[operation-type]",
          "source": "[source-identifier]",
          "timestamp": "YYYY-MM-DDTHH:MM:SSZ"
        }
      ],
      "coverage_assessment": {
        "fields_total": "[integer]",
        "fields_documented": "[integer]",
        "coverage_ratio": "[0.0-1.0]"
      }
    }
  },

  "uncertainty": {
    "confidence_level": "[0.0-1.0]",
    "known_limitations": [
      {
        "code": "[LIMITATION_CODE]",
        "description": "[human-readable description]",
        "impact": "[low|medium|high]"
      }
    ]
  },

  "validity": {
    "valid_from": "YYYY-MM-DDTHH:MM:SSZ",
    "valid_until": "YYYY-MM-DDTHH:MM:SSZ",
    "revalidation_triggers": [
      "source_data_schema_change",
      "model_retrain",
      "time_elapsed"
    ]
  },

  "limitations": [
    "This artifact is illustrative only",
    "Schema and field names are subject to change",
    "Does not constitute operational measurement evidence"
  ]
}

What You Get

For Startups

Portable governance logic. Run our open-source executables anywhere—no vendor lock-in. If you switch providers, your audit trail comes with you.

For Enterprise Compliance Teams

One measurement framework that maps to multiple regulatory regimes (NIST, ISO, EU AI Act). Eliminate redundant audits.

For Developers

Scripts that slot into CI/CD pipelines. Governance becomes a release gate, not a manual bottleneck.

For Regulators & Auditors

Standardized, reproducible evidence. Every manifest traces back to the original requirement with full version history.

For Academics & Researchers

Open methodology, open benchmarks. Cite, critique, contribute.

Glossary

Term	Definition
Technical Parameter	A measurable system property derived from a regulatory requirement
Executable	A runnable script or benchmark that objectively measures a parameter
Unified Evidence Manifest	A machine-readable file documenting test results mapped to requirements
MVEP	Minimum Viable Evidence Pack — the minimum documentation needed for audit
AI-BOM	AI Bill of Materials — a structured inventory of AI system components
UQ	Uncertainty Quantification — statistical measures of measurement confidence
Logic-ID	A unique identifier that traces evidence back to its originating regulatory clause