ROAR Task Integration with Measurement Services: Technical Specification

Purpose and Scope

This specification defines how the ROAR ecosystem generates, stores, computes, validates, and retrieves scores associated with task runs, and how tasks integrate with psychometric measurement services. It supports:

  • Trial-level and run-level score handling
  • Final vs. partial score storage
  • Reliability tracking
  • Browser interaction tracking
  • Score update auditing
  • Persistent post-run score storage
  • On-demand score computation from item responses
  • Score validation submitted by external clients
  • Integration with stopping condition and item selection services

The API is grouped under the /api/measurement/ namespace to support extensibility and clear separation from task execution flows.

System Overview

Definitions

  • Run: An attempt by a user to complete a task.

  • Trial: A single stimulus-response interaction within a task.

  • Run Scores vs. Trial Scores: In-progress, item-level scoring and run-level (summary) scoring are treated as conceptually and structurally distinct.

    • Run Scores, referred to hereafter as simply scores, are stored once at the completion of an assessment. They are derived from the full set of a user's item responses.

    • Trial scores are real-time estimates and are updated throughout the task after each trial.

    Scores and trial scores are stored in separate database tables.

  • Raw Score: Direct count/aggregation from item correctness (e.g., total correct, total incorrect, total attempted). Ability estimates from computer adaptive testing are also considered raw scores (e.g., theta estimates and standard errors).

  • Computed Score: Derived from raw scores via normalization or other statistical transformations. This includes percentile scores and standard scores.

  • Score Name: A string representing the name of a particular score (e.g., "percentile", "num_correct", or "ROAR score"). This field can be any arbitrary string.

  • Score Type: Either "raw" or "computed." See the definitions above.

  • Assessment Phase: Indicates the testing stage of the assessment run. Acceptable values are:

    • "practice" – Practice or warm-up activity
    • "test" – Core task run used for scoring

    In the design below, we use a string phase field to capture this information instead of a boolean is_practice field so that we can extend the types of phases in the future (e.g., to add a "review" phase).

  • Assessment Domain: The skill or subdomain being measured by a particular score within an assessment. A single task may report multiple domains if it includes blocks targeting different subskills (e.g., sound deletion or first-sound matching in a phonological awareness assessment). The domain field will default to "composite" if not specified, indicating that the score is relevant to the entire assessment.

  • Reliability: A judgment about whether a run results in valid scores. In question form: "would a researcher believe that the scores resulting from this run accurately reflect the user's abilities?" Reliability judgments can evolve over the course of a run or after researcher review.

  • Reliability Events: Events that indicate issues with the validity or trustworthiness of a run. Examples include response times being too fast or patterned guessing. Reliability events are the evidence upon which a final reliability decision is made.

Component Flow Diagram

Runtime Behavior

The task runtime operates as a thin orchestrator. It presents items, collects responses and metadata, and invokes services to interpret that data. Services are invoked after each item chunk and may operate in parallel. In practice, the chunk size N is set to one, but tasks are designed to support arbitrary item chunk sizes. For each item chunk (size N):

  1. Present Items: The task presents a chunk of N items to the user.
  2. Capture Responses and Metadata, including:
    • Trial-level responses
    • Response timestamps
    • Interaction events (e.g., focus/blur, fullscreen)
    • Device-level metadata
    • Eye tracking data
  3. Parallel Service Calls:
    • Database Writes:
      • POST /api/measurement/trial-scores (one per trial)
      • POST /api/measurement/browser-interactions (if applicable)
    • Score Computation:
      • Call the score service with trial-level response data
      • Receive a list of raw and computed scores
      • Write scores via POST /api/measurement/scores
    • Reliability Evaluation:
      • Call the reliability service with responses + interactions
      • Receive a judgment and optional list of reliability events
      • Write events via POST /api/measurement/reliability-events
      • Set reliability status via PATCH /api/runs/{run_id}
  4. Stopping Condition Evaluation:
    • Call the stopping condition service with trial and run metadata
    • If should_stop = true, finalize the run
    • Otherwise, continue
  5. Item Selection:
    • Call the item selection service
    • Present the next chunk of N items
  • During each trial, the task computes scores via POST /internal/measurement/compute-scores and submits trial-level scores via POST /api/measurement/trial-scores.
  • The task also records browser interactions and submits them via POST /api/measurement/browser-interactions.
  • The run may be annotated with reliability events via POST /api/measurement/reliability-events.
  • These reliability events may be marked as resolved if the task determines that they should not invalidate the run.
  • After a run completes, final scores are submitted via POST /api/measurement/scores.
  • If the run is abandoned, a post-processing job may promote trial scores into a partial score record via POST /api/measurement/scores with status=partial.
  • Scores may be updated later by researchers or staff, with all changes logged.
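The chunked loop above can be sketched as follows. The service callables (`present`, `compute_scores`, and so on) are hypothetical stand-ins for the injected measurement services; only the endpoint paths come from this specification.

```python
# Hypothetical sketch of the chunked runtime loop described above.
# Each parameter is an injected service callable; names are illustrative,
# not part of the ROAR API.

def run_task(present, compute_scores, evaluate_reliability,
             should_stop, select_items, submit, n=1):
    """Run the per-chunk orchestration loop until the stopping
    condition fires or the selector returns no more items."""
    all_scores = []
    chunk = select_items(n)                        # initial chunk
    while chunk:
        responses = present(chunk)                 # steps 1-2: present, capture
        scores = compute_scores(responses)         # step 3: score computation
        submit("/api/measurement/trial-scores", scores)
        events = evaluate_reliability(responses)   # step 3: reliability
        if events:
            submit("/api/measurement/reliability-events", events)
        all_scores.extend(scores)
        if should_stop(all_scores):                # step 4: stopping condition
            break
        chunk = select_items(n)                    # step 5: item selection
    submit("/api/measurement/scores", all_scores)  # final score write
    return all_scores
```

Because every dependency is injected, the same loop can run against local modules in tests and remote services in production.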

Edge Cases and Error Handling

| Scenario | Behavior |
| --- | --- |
| Run completed normally | Scores logged with status = 'final' |
| Run ended early but is usable | Trial scores promoted; scores logged with status = 'partial' |
| Run aborted with insufficient data | No scores logged |
| Score service unavailable | Task retries or defers; run marked incomplete |
| Reliability service fails | Reliability status left undefined or deferred |
| Stopping condition service fails | Default stopping heuristic used (e.g., item count threshold) |
| No items returned from selector | Run ends with status = 'complete' |
| Reliability issue detected | Add entry to reliability_events and update run metadata |

Design Rationale

  • Measurement service abstraction separates raw data capture from psychometric logic; supports plugging in different scoring engines, stopping models, or reliability classifiers.
  • Chunked item loop: Improves control over runtime memory, UI responsiveness, and async evaluation of trial data.
  • Parallel service invocation: Decouples response collection from scoring and reliability computation; enables responsive UIs.
  • Separation of scoring from reliability: Allows independent evaluation and debugging of accuracy vs. validity.
  • Pluggable, injectable services: Supports experimentation, model versioning, and local vs. cloud-based execution.
  • Explicit stopping and item selection logic: Makes adaptive behaviors testable, observable, and replaceable.
  • Use of /api/measurement/ namespace: Reflects full scope of evaluation logic, not limited to scoring.
  • Separation of trial and final scores: Enables real-time feedback and post-hoc evaluation without cluttering the final scores table.
  • Partial scoring: Promotes best-effort summaries when assessments terminate early.
  • Domain and phase fields: Allow disaggregated and nuanced reporting across subskills and assessment stages.

Pluggable Services

These endpoints represent pluggable interfaces. Their implementation may vary by environment (e.g., local module, internal microservice, or remote API).

These services may be exposed publicly or remain internal-only, depending on how the system is deployed. Clients should treat this as a logical service contract rather than a fixed URL.

Do not hardcode assumptions about endpoint location or availability. If you're implementing a client, inject the service endpoint via configuration.
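One way to honor that contract is to resolve endpoints from configuration at startup. This is a minimal sketch; the class name, environment variables, and fallback paths are illustrative, not prescribed by the spec.

```python
# Illustrative sketch: resolve measurement-service endpoints from
# configuration instead of hardcoding them. The MEASUREMENT_* variable
# names are hypothetical; only the default paths come from this spec.
import os
from dataclasses import dataclass

@dataclass
class MeasurementConfig:
    compute_scores_url: str
    evaluate_reliability_url: str

    @classmethod
    def from_env(cls, prefix="MEASUREMENT_"):
        # Fall back to the internal default paths only when nothing
        # is configured for this environment.
        return cls(
            compute_scores_url=os.environ.get(
                prefix + "COMPUTE_URL",
                "/internal/measurement/compute-scores"),
            evaluate_reliability_url=os.environ.get(
                prefix + "RELIABILITY_URL",
                "/internal/measurement/evaluate-reliability"),
        )
```

A deployment can then point the same client at a local module shim, an internal microservice, or a remote API without code changes.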

POST /internal/measurement/compute-scores

Computes scores (raw, computed, IRT) from item responses. This simply returns scores and does not write to the database.

Request:

POST /internal/measurement/compute-scores
{
  "task_slug": "roar-word",
  "responses": [
    { "phase": "test", "domain": "blockA", "a": 1, "b": 0, "c": 0, "d": 1, "correct": true },
    { "phase": "test", "domain": "blockA", "a": 1, "b": 0, "c": 0, "d": 1, "correct": false }
  ]
}

Response:

{
  "scores": [
    {
      "name": "total_correct",
      "value": 1,
      "type": "raw",
      "domain": "blockA",
      "phase": "test"
    },
    {
      "name": "total_correct",
      "value": 1,
      "type": "raw",
      "domain": "composite",
      "phase": "test"
    },
    {
      "name": "theta_estimate",
      "value": 0.91,
      "type": "raw",
      "domain": "blockA",
      "phase": "test"
    },
    {
      "name": "theta_se",
      "value": 0.08,
      "type": "raw",
      "domain": "blockA",
      "phase": "test"
    },
    {
      "name": "theta_estimate",
      "value": -0.85,
      "type": "raw",
      "domain": "composite",
      "phase": "test"
    },
    {
      "name": "theta_se",
      "value": 0.1,
      "type": "raw",
      "domain": "composite",
      "phase": "test"
    },
    {
      "name": "percentile",
      "value": 48.2,
      "type": "computed",
      "domain": "composite",
      "phase": "test"
    },
    {
      "name": "standard_score",
      "value": 180,
      "type": "computed",
      "domain": "composite",
      "phase": "test"
    }
  ]
}
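A minimal sketch of how the service might derive the per-domain and composite `total_correct` entries shown above. The aggregation logic is an assumption for illustration; the real engine also produces IRT ability estimates and normed scores.

```python
# Hypothetical sketch of the total_correct aggregation only. The real
# compute-scores service also returns theta estimates, percentiles,
# and standard scores.
from collections import Counter

def total_correct_scores(responses):
    """Return one raw total_correct score per (phase, domain),
    plus a composite roll-up per phase."""
    counts = Counter()
    for r in responses:
        if not r["correct"]:
            continue
        counts[(r["phase"], r["domain"])] += 1
        if r["domain"] != "composite":          # avoid double-counting
            counts[(r["phase"], "composite")] += 1
    return [
        {"name": "total_correct", "value": v, "type": "raw",
         "domain": domain, "phase": phase}
        for (phase, domain), v in sorted(counts.items())
    ]
```

Run against the two example responses above, this yields a blockA score and a composite score, each with value 1.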

POST /internal/measurement/evaluate-reliability

Evaluates reliability of a task run based on response patterns and interaction data.

TODO

The Request/Response needs refinement.

Request:

POST /internal/measurement/evaluate-reliability
{
  "task_slug": "roar-word",
  "trials": [
    {
      "trial_id": "t1",
      "response_time_ms": 420,
      "correct": true,
      "response_pattern": "ABCD"
    },
    {
      "trial_id": "t2",
      "response_time_ms": 190,
      "correct": false,
      "response_pattern": "DDDD"
    }
  ],
  "interactions": [
    {
      "interaction_type": "fullscreen_exit",
      "timestamp": "2025-07-03T10:00:00Z",
      "trial_id": "t1",
      "metadata": { "window_width": 1024, "window_height": 768 }
    }
  ]
}

Response:

{
  "reliable": false,
  "events": [
    {
      "reason": "Mean RT under 200ms for 5+ trials",
      "reason_code": "fast_response"
    },
    {
      "reason": "Fullscreen exited twice",
      "reason_code": "fullscreen_exit"
    }
  ]
}
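The kind of rule that could emit the `fast_response` event above might look like the following. The 200 ms threshold and five-trial window come from the example payload; the function itself is an illustrative sketch, not the reliability service's actual logic.

```python
# Illustrative reliability rule: flag a run whose mean response time
# over the most recent window of trials falls under a threshold.
def fast_response_event(trials, threshold_ms=200, window=5):
    recent = trials[-window:]
    if len(recent) < window:
        return None  # not enough evidence yet
    mean_rt = sum(t["response_time_ms"] for t in recent) / window
    if mean_rt < threshold_ms:
        return {
            "reason": f"Mean RT under {threshold_ms}ms for {window}+ trials",
            "reason_code": "fast_response",
        }
    return None
```

A production classifier would combine several such rules (focus loss, patterned responses, low accuracy) into the final `reliable` judgment.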

POST /internal/measurement/evaluate-stopping-condition

Determines whether the task should stop based on accumulated scores, standard error, item count, or elapsed time.

TODO

The Request/Response needs refinement.

Request:

POST /internal/measurement/evaluate-stopping-condition
{
  "task_slug": "roar-word",
  "elapsed_time_sec": 305,
  "num_items": 32,
  "theta_se": 0.12
}

Response:

{
  "should_stop": true,
  "reason": "Item count threshold reached",
  "reason_code": "item_count"
}
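The edge-case table above names a default stopping heuristic for when this service is unreachable. A client-side fallback could look like the following; the thresholds and the `time_limit`/`se_threshold` reason codes are illustrative assumptions, only `item_count` appears in the spec.

```python
# Hypothetical default stopping heuristic, used only when the stopping
# condition service fails. Thresholds are illustrative, not normative.
def default_should_stop(num_items, elapsed_time_sec, theta_se,
                        max_items=40, max_time_sec=600, target_se=0.15):
    if num_items >= max_items:
        return {"should_stop": True, "reason_code": "item_count"}
    if elapsed_time_sec >= max_time_sec:
        return {"should_stop": True, "reason_code": "time_limit"}
    if theta_se is not None and theta_se <= target_se:
        return {"should_stop": True, "reason_code": "se_threshold"}
    return {"should_stop": False, "reason_code": None}
```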

POST /internal/measurement/select-items

Selects the next chunk of items based on current ability estimate and available item pool.

TODO

The Request/Response needs refinement.

SQL Schema

scores

CREATE TABLE scores (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  run_id UUID REFERENCES runs(id) ON DELETE CASCADE,
  task_id UUID REFERENCES tasks(id),
  variant_id UUID REFERENCES variants(id),
  user_id UUID REFERENCES users(id),
  assignment_id UUID REFERENCES assignments(id),
  value NUMERIC NOT NULL,
  name TEXT NOT NULL,
  type TEXT CHECK (type IN ('raw', 'computed')),
  phase TEXT CHECK (phase IN ('practice', 'test')) DEFAULT 'test',
  domain TEXT DEFAULT 'composite',
  status TEXT CHECK (status IN ('final', 'partial', 'invalid')),
  created_at TIMESTAMP DEFAULT now(),
  updated_at TIMESTAMP DEFAULT now(),
  deleted_at TIMESTAMP
);

trial_scores

CREATE TABLE trial_scores (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  trial_id UUID REFERENCES trials(id) ON DELETE CASCADE,
  run_id UUID REFERENCES runs(id) ON DELETE CASCADE,
  task_id UUID REFERENCES tasks(id),
  variant_id UUID REFERENCES variants(id),
  user_id UUID REFERENCES users(id),
  assignment_id UUID REFERENCES assignments(id),
  value NUMERIC NOT NULL,
  name TEXT NOT NULL,
  type TEXT CHECK (type IN ('raw', 'computed')),
  phase TEXT CHECK (phase IN ('practice', 'test')) DEFAULT 'test',
  domain TEXT DEFAULT 'composite',
  created_at TIMESTAMP DEFAULT now(),
  updated_at TIMESTAMP DEFAULT now(),
  deleted_at TIMESTAMP
);

score_update_log

CREATE TABLE score_update_log (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  score_id UUID REFERENCES scores(id) ON DELETE CASCADE,
  old_domain TEXT NOT NULL,
  old_phase TEXT NOT NULL,
  old_type TEXT NOT NULL,
  old_value NUMERIC NOT NULL,
  new_domain TEXT NOT NULL,
  new_phase TEXT NOT NULL,
  new_type TEXT NOT NULL,
  new_value NUMERIC NOT NULL,
  updated_by UUID REFERENCES users(id),
  updated_at TIMESTAMP DEFAULT now(),
  reason TEXT
);
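A score update becomes auditable when the before/after snapshot is written in the same transaction as the update itself. This sketch builds the log row; the column names follow the schema above, but the helper function is hypothetical.

```python
# Illustrative helper: build a score_update_log row from before/after
# snapshots of a score. The real write should happen in the same
# transaction as the UPDATE on scores.
def build_score_update_log(old, new, updated_by, reason=None):
    """Return a dict matching the score_update_log columns."""
    return {
        "score_id": old["id"],
        "old_domain": old["domain"], "new_domain": new["domain"],
        "old_phase": old["phase"],   "new_phase": new["phase"],
        "old_type": old["type"],     "new_type": new["type"],
        "old_value": old["value"],   "new_value": new["value"],
        "updated_by": updated_by,
        "reason": reason,
    }
```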

reliability_events

CREATE TABLE reliability_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  run_id UUID REFERENCES runs(id) ON DELETE CASCADE,
  user_id UUID REFERENCES users(id),
  task_id UUID REFERENCES tasks(id),
  variant_id UUID REFERENCES variants(id),
  assignment_id UUID REFERENCES assignments(id),
  trial_id UUID REFERENCES trials(id),
  reason TEXT,
  reason_code TEXT CHECK (
    reason_code IN (
      'fast_response',
      'blurred_focus',
      'fullscreen_exit',
      'inconsistent_response',
      'low_accuracy',
      'manual_review'
    )
  ),
  resolution TEXT,
  resolution_code TEXT CHECK (
    resolution_code IN (
      'recovered',
      'invalidated',
      'manual_review'
    )
  ),
  resolved_by UUID REFERENCES users(id),
  created_at TIMESTAMP DEFAULT now(),
  updated_at TIMESTAMP DEFAULT now(),
  deleted_at TIMESTAMP
);

browser_interactions

CREATE TABLE browser_interactions (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  trial_id UUID REFERENCES trials(id) ON DELETE CASCADE,
  run_id UUID REFERENCES runs(id) ON DELETE CASCADE,
  user_id UUID REFERENCES users(id) ON DELETE CASCADE,
  interaction_type TEXT CHECK (
    interaction_type IN ('focus', 'blur', 'fullscreen_enter', 'fullscreen_exit')
  ) NOT NULL,
  timestamp TIMESTAMP DEFAULT now(),
  metadata JSONB,
  created_at TIMESTAMP DEFAULT now(),
  updated_at TIMESTAMP DEFAULT now(),
  deleted_at TIMESTAMP
);

runs

See the assessment-execution section for the full schema.

API Contract

POST /api/measurement/validate

Validates provided scores against computed results

/api/measurement/validate request

POST /api/measurement/validate
{
  "task_slug": "roar-word",
  "item_responses": [
    { "phase": "test", "a": 1, "b": 0, "c": 0, "d": 1, "correct": true },
    { "phase": "test", "a": 1, "b": 0, "c": 0, "d": 1, "correct": false }
  ],
  "scores": [
    { "name": "total_correct", "value": 1, "type": "raw", "domain": "composite", "phase": "test" },
    { "name": "theta_estimate", "value": -0.85, "type": "raw", "domain": "composite", "phase": "test" },
    { "name": "theta_se", "value": 0.1, "type": "raw", "domain": "composite", "phase": "test" },
    { "name": "percentile", "value": 48.2, "type": "computed", "domain": "composite", "phase": "test" },
    { "name": "standard_score", "value": 180, "type": "computed", "domain": "composite", "phase": "test" }
  ]
}

/api/measurement/validate response

If valid, returns

{ "valid": true }

If invalid, returns

{
  "valid": false,
  "discrepancies": [
    {
      "name": "total_correct",
      "phase": "test",
      "domain": "composite",
      "type": "computed",
      "expected": 1,
      "received": 2
    }
  ]
}
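Validation can be implemented by recomputing scores from the submitted item responses and diffing against the submitted scores on the (name, phase, domain, type) key. This sketch assumes exact equality; a real implementation would likely need numeric tolerances for IRT estimates.

```python
# Illustrative validation: diff submitted scores against recomputed ones,
# keyed on (name, phase, domain, type). Exact equality is an assumption;
# theta estimates may need a numeric tolerance in practice.
def validate_scores(submitted, computed):
    expected = {
        (s["name"], s["phase"], s["domain"], s["type"]): s["value"]
        for s in computed
    }
    discrepancies = []
    for s in submitted:
        key = (s["name"], s["phase"], s["domain"], s["type"])
        if key in expected and expected[key] != s["value"]:
            discrepancies.append({
                "name": s["name"], "phase": s["phase"],
                "domain": s["domain"], "type": s["type"],
                "expected": expected[key], "received": s["value"],
            })
    if discrepancies:
        return {"valid": False, "discrepancies": discrepancies}
    return {"valid": True}
```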

POST /api/measurement/reliability-events

Records a reliability event for a run.

POST /api/measurement/reliability-events
{
  "run_id": uuid,
  "user_id": uuid,
  "task_id": uuid,
  "variant_id": uuid,
  "assignment_id": uuid,
  "trial_id": uuid,
  "reason": "Mean RT under 200ms for 5+ trials",
  "reason_code": "fast_response"
}

PATCH /api/measurement/reliability-events/{run_id}

Marks all reliability events for a run as resolved.

PATCH /api/measurement/reliability-events/{run_id}
{
  "resolution": "Run behavior normalized after block 2",
  "resolution_code": "recovered"
}

POST /api/measurement/browser-interactions

Captures a browser interaction during a trial.

POST /api/measurement/browser-interactions
{
  "trial_id": uuid,
  "run_id": uuid,
  "user_id": uuid,
  "task_id": uuid,
  "variant_id": uuid,
  "assignment_id": uuid,
  "interaction_type": "fullscreen_exit",
  "metadata": { "window_width": 1024, "window_height": 768 }
}

POST /api/measurement/scores

Creates final or partial scores for a completed or aborted run.

POST /api/measurement/scores
{
  "run_id": uuid,
  "user_id": uuid,
  "task_id": uuid,
  "variant_id": uuid,
  "assignment_id": uuid,
  "scores": [
    {
      "name": "total_correct",
      "value": 1,
      "type": "raw",
      "domain": "composite",
      "phase": "test"
    },
    {
      "name": "theta_estimate",
      "value": -0.85,
      "type": "raw",
      "domain": "composite",
      "phase": "test"
    },
    {
      "name": "theta_se",
      "value": 0.1,
      "type": "raw",
      "domain": "composite",
      "phase": "test"
    },
    {
      "name": "percentile",
      "value": 48.2,
      "type": "computed",
      "domain": "composite",
      "phase": "test"
    },
    {
      "name": "standard_score",
      "value": 180,
      "type": "computed",
      "domain": "composite",
      "phase": "test"
    }
  ]
}

POST /api/measurement/trial-scores

Writes scores from an individual trial (typically in development or adaptive scenarios).

POST /api/measurement/trial-scores
{
  "trial_id": uuid,
  "run_id": uuid,
  "user_id": uuid,
  "task_id": uuid,
  "variant_id": uuid,
  "assignment_id": uuid,
  "scores": [
    {
      "name": "total_correct",
      "value": 1,
      "type": "raw",
      "domain": "composite",
      "phase": "test"
    },
    {
      "name": "theta_estimate",
      "value": -0.85,
      "type": "raw",
      "domain": "composite",
      "phase": "test"
    },
    {
      "name": "theta_se",
      "value": 0.1,
      "type": "raw",
      "domain": "composite",
      "phase": "test"
    },
    {
      "name": "percentile",
      "value": 48.2,
      "type": "computed",
      "domain": "composite",
      "phase": "test"
    },
    {
      "name": "standard_score",
      "value": 180,
      "type": "computed",
      "domain": "composite",
      "phase": "test"
    }
  ]
}

Migration Plan

  • Scores are currently stored in runs documents in Firestore but are converted into a separate table using BigQuery. We will use these BigQuery views to populate the new scores table in Postgres.
  • Derive reliability flags from existing metadata where available and populate reliability_events.
  • trial_scores will not be backfilled.
  • Introduce score_update_log forward-looking only — no need to backfill.
  • Update all scoring-related API endpoints to align with new schema.
  • The /internal/measurement/compute-scores and /api/measurement/validate endpoints (and any services required to support them) will be the last to be implemented. The delivery date for those services and endpoints is after the larger "backend" refactoring of Q3 2025.

Summary

The ROAR scoring system is built for flexibility, reproducibility, and auditability. By clearly separating trial data, final scores, and reliability annotations, we support both exploratory research and robust production-grade deployment.

Contributors: Adam Richie-Halford