ASRA Phase 2 — Original ARC Full-Dataset Evaluation Results
Empirical results for the ASRA Observation Engine on the Original ARC corpus (800 tasks): 100% rule-candidate coverage, ~98% cross-demo common-rule consistency, transform-event distributions, training vs evaluation complexity gradient, and branched-per-demo resolution for 17 exception tasks.
Status: Published preprint (SciLayer Systems) v1
Repository copy: documents/evaluation/asra-phase-2-original-arc-evaluation-results.md
SciLayer: https://sci-layer.vercel.app/articles/asra-phase-2-original-arc-evaluation-results
Companion theory: Phase 2 preprint
Purpose: Empirical results for the ASRA Observation Engine on the Original ARC corpus (800 tasks). Reports object extraction coverage, transform-event distributions, cross-demo rule consistency, and exception analysis — complementing the Phase 2 conceptual preprint with reproducible benchmark numbers.
Abstract
ASRA Phase 2 segments integer grids into object scenes, detects transform events between demonstration pairs, and induces rule candidates. We evaluate the full BeforeAfterAnalyzer + RuleCandidateGenerator pipeline on all 400 training and 400 evaluation tasks from fchollet/ARC. Every task produces rule candidates (100% coverage). Cross-demo common rules at confidence 1.0 appear in 98.0% of training tasks and 97.75% of evaluation tasks. Evaluation-split grids yield 1.93× the objects per scene and 1.83× the transform events per demo pair of the training split, consistent with harder held-out tasks. Seventeen tasks initially lacked perfect cross-demo agreement; root-cause analysis shows mixed transform types across demos, resolved by BRANCHED_PER_DEMO rule emission. These metrics measure perception and demo consistency, not ARC test-output solve rate.
1. Evaluation scope
| Dimension | Detail |
|---|---|
| Pipeline | asra.perception.BeforeAfterAnalyzer → RuleCandidateGenerator |
| Corpus | Original ARC — 400 training + 400 evaluation tasks |
| Run date | 2026-06-01 |
| Wall time | ~60 s total (~14 s training, ~46 s evaluation) |
| Output | 800 per-task JSON reports (~87 MB) |
What this eval measures:
- Object scene extraction succeeds on every task
- Transform events are detected per demo pair
- Rule candidates summarize demo-pair consistency
What this eval does not measure:
- ARC test-set solve accuracy (no test outputs used)
- Interactive ARC-AGI-3 competition score
- End-to-end agent win rate
Phase 2 on Original ARC is a supervised abstraction laboratory; interactive deployment uses compact object-scene hints in Kaggle agents (see Phase 2 preprint §8).
2. Run configuration
```bash
cd asra-arc
python scripts/eval_phase2_batch.py --split training
python scripts/eval_phase2_batch.py --split evaluation
```
Each report includes: object extraction → region annotation → transform detection → rule candidates per task.
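The first stage of that chain, object extraction, can be illustrated with a minimal connected-component sketch. This is not the code in asra/perception/objects.py. The function name `extract_objects`, the choice of 4-connectivity, the same-color grouping, and the background color 0 are all assumptions made for illustration.

```python
from collections import deque

Grid = list[list[int]]


def extract_objects(grid: Grid, background: int = 0) -> list[dict]:
    """Group same-colored, 4-connected, non-background cells into objects.

    Hypothetical sketch: the real extractor may use other connectivity
    or grouping rules. Expects a non-empty rectangular grid.
    """
    height, width = len(grid), len(grid[0])
    seen = [[False] * width for _ in range(height)]
    objects = []
    for r in range(height):
        for c in range(width):
            if seen[r][c] or grid[r][c] == background:
                continue
            color = grid[r][c]
            cells = []
            queue = deque([(r, c)])
            seen[r][c] = True
            # Breadth-first flood fill over cells of the same color.
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and not seen[ny][nx] and grid[ny][nx] == color:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            objects.append({"color": color, "cells": sorted(cells)})
    return objects
```

Under these assumptions, two cells of the same color that touch only diagonally form two separate objects, which is one way dense grids can inflate object counts.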
| Split | Tasks | Reports path | Wall time |
|---|---|---|---|
| Training | 400 | data/analysis/phase2/reports/training/*.json | ~14 s |
| Evaluation | 400 | data/analysis/phase2/reports/evaluation/*.json | ~46 s |
| Total | 800 | 800 JSON files | ~60 s |
Aggregates: data/analysis/phase2/summary_training.json, summary_evaluation.json.
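After a run, a quick inventory of the report directories can confirm that each split produced the expected 400 files. The sketch below only counts files and sums their sizes; it makes no assumption about the JSON schema. The helper name `inventory` is hypothetical and is not part of the ASRA scripts.

```python
from pathlib import Path


def inventory(report_dir: Path) -> tuple[int, int]:
    """Return (number of *.json files, total size in bytes) directly under report_dir."""
    files = sorted(report_dir.glob("*.json"))
    return len(files), sum(f.stat().st_size for f in files)


def describe(split: str, root: Path, expected: int = 400) -> str:
    """Format a one-line status for one split's report directory."""
    count, size = inventory(root / split)
    status = "ok" if count == expected else "unexpected count"
    return f"{split}: {count} reports, {size / 1e6:.1f} MB ({status})"
```

Calling `describe("training", Path("data/analysis/phase2/reports"))` from the asra-arc directory should report 400 files after a complete run, and the two splits together should come to roughly the ~87 MB listed in section 1.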
3. Summary metrics
3.1 Training split (400 tasks)
| Metric | Value |
|---|---|
| Tasks with rule candidates | 400 / 400 (100%) |
| Tasks with full-demo common rule (confidence 1.0) | 392 / 400 (98.0%) |
| Avg objects per input scene | 13.16 |
| Avg transform events per demo pair | 16.86 |
| Parse errors | 0 |
Transform event distribution (aggregate):
| Class | Count |
|---|---|
| DELETE | 6,588 |
| ROTATE | 5,065 |
| CREATE | 4,821 |
| IDENTITY | 3,334 |
| TRANSLATE | 2,146 |
3.2 Evaluation split (400 tasks)
| Metric | Value |
|---|---|
| Tasks with rule candidates | 400 / 400 (100%) |
| Tasks with full-demo common rule (confidence 1.0) | 391 / 400 (97.75%) |
| Avg objects per input scene | 25.40 |
| Avg transform events per demo pair | 30.90 |
| Parse errors | 0 |
Transform event distribution (aggregate):
| Class | Count |
|---|---|
| ROTATE | 11,768 |
| DELETE | 11,444 |
| CREATE | 7,492 |
| IDENTITY | 6,332 |
| TRANSLATE | 5,074 |
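The raw counts in the two distribution tables are easier to compare as shares. The sketch below recomputes per-class percentages from the numbers reported above. It assumes the five listed classes are the full set of reported events, so the shares are relative to the listed classes only.

```python
# Counts copied from the transform event tables in sections 3.1 and 3.2.
TRAINING = {"DELETE": 6588, "ROTATE": 5065, "CREATE": 4821,
            "IDENTITY": 3334, "TRANSLATE": 2146}
EVALUATION = {"ROTATE": 11768, "DELETE": 11444, "CREATE": 7492,
              "IDENTITY": 6332, "TRANSLATE": 5074}


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each class's percentage of the listed events."""
    total = sum(counts.values())
    return {cls: 100 * n / total for cls, n in counts.items()}


def top_share(counts: dict[str, int], classes: tuple[str, ...]) -> float:
    """Combined percentage of the given classes."""
    pct = shares(counts)
    return sum(pct[c] for c in classes)


if __name__ == "__main__":
    for name, counts in (("training", TRAINING), ("evaluation", EVALUATION)):
        combined = top_share(counts, ("DELETE", "CREATE", "ROTATE"))
        print(f"{name}: {sum(counts.values())} events, "
              f"DELETE+CREATE+ROTATE = {combined:.1f}%")
```

On these figures, DELETE, CREATE, and ROTATE together account for about 75% of listed training events and about 73% of listed evaluation events.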
3.3 Training vs evaluation comparison
| Metric | Training | Evaluation | Ratio (eval/train) |
|---|---|---|---|
| Avg objects per scene | 13.16 | 25.40 | 1.93× |
| Avg transform events / pair | 16.86 | 30.90 | 1.83× |
| Common-rule coverage | 98.0% | 97.75% | ~equal |
Evaluation tasks are structurally richer, with more objects and more per-pair events, while common-rule coverage remains nearly identical (392 vs 391 of 400 tasks, a one-task difference).
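The ratio column can be reproduced directly from the split averages. The snippet below is a small arithmetic check using only the values in the tables above; the variable names are illustrative.

```python
# Split averages copied from sections 3.1 and 3.2.
train = {"objects_per_scene": 13.16, "events_per_pair": 16.86}
evaluation = {"objects_per_scene": 25.40, "events_per_pair": 30.90}

# Ratio of evaluation to training, rounded as in the comparison table.
ratios = {key: round(evaluation[key] / train[key], 2) for key in train}

# Common-rule coverage gap in percentage points (98.0% vs 97.75%).
coverage_gap_pp = round(100 * 392 / 400 - 100 * 391 / 400, 2)

if __name__ == "__main__":
    print(ratios)
    print(f"coverage gap: {coverage_gap_pp} percentage points")
```

Both ratios fall a little short of 2, so "~2×" in this report should be read as shorthand for 1.93× (objects) and 1.83× (events).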
4. Interpretation
4.1 Coverage
100% rule-candidate coverage, together with zero parse errors, indicates that the perception pipeline does not silently drop any task in the full ARC corpus. Every task yields at least one interpretable structural hypothesis from its demonstration pairs.
~98% common-rule coverage (confidence 1.0 across all demonstration pairs of a task) measures structural regularity within tasks, not puzzle-solving success. A task can have perfect demo consistency yet still require compositional reasoning at test time.
4.2 Transform mix
DELETE, CREATE, and ROTATE dominate both splits. This is expected for greedy object matching and differencing: many ARC tasks recompose objects (delete + create) or rotate components rather than pure translation.
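To see why greedy matching produces many DELETE and CREATE events, consider the following minimal sketch. It is not the detector in asra/perception/transforms.py. The function names, the object representation (a dict with `color` and `cells`), and the match criterion (same color and same normalized shape, first candidate wins) are assumptions for illustration, and ROTATE detection is omitted entirely.

```python
Cell = tuple[int, int]


def normalize(cells: list[Cell]) -> tuple[frozenset, Cell]:
    """Return (shape translated to the origin, top-left offset)."""
    r0 = min(r for r, _ in cells)
    c0 = min(c for _, c in cells)
    return frozenset((r - r0, c - c0) for r, c in cells), (r0, c0)


def diff_scenes(before: list[dict], after: list[dict]) -> list[str]:
    """Greedily pair objects and label each pairing outcome."""
    remaining = list(after)
    events = []
    for obj in before:
        shape, origin = normalize(obj["cells"])
        hit = None
        # Greedy: take the first unmatched output object that fits.
        for idx, cand in enumerate(remaining):
            cand_shape, cand_origin = normalize(cand["cells"])
            if cand["color"] == obj["color"] and cand_shape == shape:
                hit = (idx, cand_origin)
                break
        if hit is None:
            events.append("DELETE")  # no counterpart in the output scene
        else:
            idx, cand_origin = hit
            remaining.pop(idx)
            events.append("IDENTITY" if cand_origin == origin else "TRANSLATE")
    # Output objects nobody claimed are reported as new.
    events.extend("CREATE" for _ in remaining)
    return events
```

Under this scheme, any change the matcher cannot pair, such as a recolored object, surfaces as one DELETE plus one CREATE rather than as a single event, which inflates both classes.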
RECOLOR and REFLECT are under-reported in this baseline — often folded into ROTATE or IDENTITY via shape_hash equivalence. Future Phase 2 revisions should split these explicitly.
4.3 Complexity gradient
The ~2× object/event ratio on evaluation supports using Original ARC evaluation split as a harder perception stress test relative to training — useful for regression testing perception changes before ARC-AGI-3 integration.
5. Exception tasks (17 total)
Before the Phase 2B resolution described below, 8 training and 9 evaluation tasks had a top per-object rule confidence below 1.0.
| Split | Exception count | Detail |
|---|---|---|
| Training | 8 | EXCEPTIONS_TRAINING.md |
| Evaluation | 9 | EXCEPTIONS_EVALUATION.md |
Root cause: mixed_transform_types_across_demos — not parse failures. Demos within the same task follow different transform patterns; forcing a single global rule yields low confidence.
Resolution (Phase 2B): RuleCandidateGenerator emits:
- BRANCHED_PER_DEMO (confidence 1.0) when demos disagree
- PER_DEMO_{i}_* rules for each demonstration branch
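The branching behavior can be sketched as follows. This is a hedged illustration, not the RuleCandidateGenerator implementation: the function `emit_rules`, the per-demo signature strings, the `COMMON_` prefix, and the output dict layout are all hypothetical. Only the names BRANCHED_PER_DEMO and PER_DEMO_{i}_* come from this report.

```python
def emit_rules(demo_signatures: list[str]) -> list[dict]:
    """Emit one common rule if all demos agree, else a branch per demo.

    demo_signatures[i] is a hypothetical label summarizing the transform
    pattern observed in demonstration pair i.
    """
    if not demo_signatures:
        return []
    if len(set(demo_signatures)) == 1:
        # All demos follow the same pattern: a single global rule suffices.
        return [{"rule": f"COMMON_{demo_signatures[0]}", "confidence": 1.0}]
    # Demos disagree: record the disagreement itself as a rule, then
    # attach one branch per demonstration instead of forcing a global rule.
    rules = [{"rule": "BRANCHED_PER_DEMO", "confidence": 1.0}]
    for i, signature in enumerate(demo_signatures):
        rules.append({"rule": f"PER_DEMO_{i}_{signature}", "demo": i})
    return rules
```

For example, `emit_rules(["ROTATE", "TRANSLATE"])` yields a BRANCHED_PER_DEMO entry followed by PER_DEMO_0_ROTATE and PER_DEMO_1_TRANSLATE, whereas two agreeing demos yield a single common rule.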
Training exception task IDs: 22eb0ac0, 67385a82, 794b24be, 9565186b, a740d043, aedd82e4, b1948b0a, cce03e0d
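The exception counts can be cross-checked against the coverage figures in section 3. The check below assumes the exception set is exactly the complement of the tasks with a confidence-1.0 common rule, which is how the numbers in this report line up.

```python
TASKS_PER_SPLIT = 400

# Tasks with a full-demo common rule at confidence 1.0 (section 3).
common_rule_tasks = {"training": 392, "evaluation": 391}

# Training exception IDs as listed in this section.
training_exception_ids = [
    "22eb0ac0", "67385a82", "794b24be", "9565186b",
    "a740d043", "aedd82e4", "b1948b0a", "cce03e0d",
]

# Exceptions are the tasks left over once common-rule tasks are removed.
exceptions = {split: TASKS_PER_SPLIT - n for split, n in common_rule_tasks.items()}

if __name__ == "__main__":
    print(exceptions, "total:", sum(exceptions.values()))
    print("listed training IDs:", len(training_exception_ids))
```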
Regenerate exception reports:
```bash
cd asra-arc
python scripts/inspect_phase2_exceptions.py \
  --report-dir data/analysis/phase2/reports/training \
  --summary-json data/analysis/phase2/summary_training.json \
  --output-md data/analysis/phase2/EXCEPTIONS_TRAINING.md --label training
```
Machine-readable: exceptions_training.json, exceptions_evaluation.json.
6. Implementation reference
| Module | Path |
|---|---|
| Object extractor | asra-arc/src/asra/perception/objects.py |
| Transform detector | asra-arc/src/asra/perception/transforms.py |
| Rule generator | asra-arc/src/asra/perception/rules.py |
| Batch eval | asra-arc/scripts/eval_phase2_batch.py |
| Exception inspector | asra-arc/scripts/inspect_phase2_exceptions.py |
CLI: python -m asra run-phase2
Integration with Phase 1: object scenes are optional in transition exports and are enabled via ASRA_OBJECT_SCENES=1 and snapshot.py.
7. Relation to ARC-AGI-3
| Context | Phase 2 role |
|---|---|
| Original ARC (this report) | Full perception + rule induction on static demos |
| ARC-AGI-3 Kaggle agent | Compact object_scene() / object_delta() hints in template agent |
| Competition score | Not directly predicted by Original ARC metrics — see Evaluation Report v0 |
Object-centric hints bias interactive exploration when cell-level diffs are ambiguous; they do not replace transition logging (Phase 1).
8. Limitations
- No test-output solving — metrics are perception + demo consistency only.
- Greedy object matching — may mis-align objects on dense grids; affects event counts.
- Under-reported transform classes — RECOLOR/REFLECT folded into other labels.
- Static corpus only — no action semantics or multi-step episodes in this eval.
- Heuristic rule generator — not a learned program synthesizer.
9. Artifacts
| Path | Description |
|---|---|
| data/analysis/phase2/PHASE2_EVALUATION_REPORT.md | Pipeline run summary (source) |
| data/analysis/phase2/summary_training.json | Aggregated training metrics |
| data/analysis/phase2/summary_evaluation.json | Aggregated evaluation metrics |
| data/analysis/phase2/reports/training/ | Per-task JSON (400) |
| data/analysis/phase2/reports/evaluation/ | Per-task JSON (400) |
| data/analysis/phase2/EXCEPTIONS_REPORT.md | Exception overview |
10. References
- ASRA Phase 2 preprint — Object-Centric Reasoning
- ASRA Integrated Architecture
- Original ARC corpus
- Local theory companion: kaggle-notebooks/phase2/asra-phase2-object-centric-reasoning.md
11. One-line takeaway
ASRA Phase 2 achieves 100% rule-candidate coverage and ~98% cross-demo rule consistency on all 800 Original ARC tasks, with evaluation split showing ~2× structural complexity — establishing a reproducible perception baseline before interactive ARC-AGI-3 deployment.