Paper review: Not quite the AlphaGo moment yet
Summary
ASI-ARCH presents a fully autonomous LLM-driven pipeline that reports discovering 106 "state-of-the-art" linear-attention architectures and frames this as an "AlphaGo-like" leap with a scaling law for discovery (arXiv).
Table 1 shows only 1-3 point gains over DeltaNet and Mamba2 at 340M parameters and provides no confidence intervals or efficiency data (arXiv).
Key modules (novelty filter, LLM-as-Judge, cognition base) still depend on a hand-curated set of ≈100 prior papers and carefully engineered prompts, so the process is not fully "human-free" (GitHub).
Overall, the study is an interesting automation prototype, but the evidence falls short of an AlphaGo-scale breakthrough.
Claimed contributions vs. documented evidence
- Autonomous discovery loop - A multi-agent system handles idea generation, coding, debugging, and scoring (GitHub); yet it starts from a fixed DeltaNet baseline and curated knowledge, which limits its autonomy.
- 106 SOTA models - Achieved after 1,773 experiments and 20,000 GPU-hours (arXiv); the evaluation compares only against DeltaNet and Mamba2, omitting stronger baselines.
- Scaling law - Figure 1 shows a linear relation between discoveries and compute; this is expected when each run costs roughly the same and does not model diminishing returns (arXiv). A curve-fitting sketch after this list makes the point concrete.
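To make the scaling-law critique concrete, here is a minimal curve-fitting sketch. The compute/discovery points are illustrative placeholders, not the paper's raw data; the point is only that a linear fit and a saturating fit should be compared before calling the relationship a law.

```python
# Sketch: does "discoveries vs. compute" stay linear, or does it saturate?
# The data points below are illustrative placeholders, NOT the paper's numbers.
import numpy as np
from scipy.optimize import curve_fit

gpu_hours   = np.array([2_000, 5_000, 8_000, 12_000, 16_000, 20_000], dtype=float)
discoveries = np.array([12, 30, 48, 70, 90, 106], dtype=float)  # hypothetical

def linear(x, slope):
    return slope * x

def saturating(x, cap, rate):
    # Diminishing-returns form: approaches `cap` as compute grows.
    return cap * (1.0 - np.exp(-rate * x))

p_lin, _ = curve_fit(linear, gpu_hours, discoveries)
p_sat, _ = curve_fit(saturating, gpu_hours, discoveries, p0=[150.0, 1e-4])

def rss(pred):
    return float(np.sum((discoveries - pred) ** 2))

# Compare with AIC (2k + n*ln(RSS/n)); lower is better.
n = len(gpu_hours)
aic_lin = 2 * 1 + n * np.log(rss(linear(gpu_hours, *p_lin)) / n)
aic_sat = 2 * 2 + n * np.log(rss(saturating(gpu_hours, *p_sat)) / n)
print(f"linear AIC = {aic_lin:.1f}, saturating AIC = {aic_sat:.1f}")
```

If the saturating model fit the real data better, the "scaling law" would reduce to roughly constant cost per attempt with diminishing returns.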
Metrics & evaluation design
- Benchmarks - Seven reasoning datasets plus WikiText-103 and LAMBADA cover only limited aspects of language quality (arXiv).
- Scale - Experiments stop at 340M parameters; DeltaNet itself reaches 1.3B and improves more with size (arXiv).
- Baselines - Mamba usually shines at around 3B parameters, but only a reduced 340M "Mamba2" is tested (tridao.me).
- Statistical rigor - Only single runs are reported; no variance, p-values, or ablations. A multi-seed significance sketch follows this list.
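To illustrate what the missing statistical reporting would look like, here is a minimal multi-seed sketch. The per-seed accuracies are made up for the example, not results from the paper.

```python
# Sketch: report variance and significance across seeds instead of single runs.
# Accuracy arrays are hypothetical placeholders, NOT numbers from the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
candidate = np.array([0.512, 0.507, 0.516, 0.509, 0.514])  # 5 seeds, hypothetical
deltanet  = np.array([0.498, 0.503, 0.495, 0.501, 0.499])  # 5 seeds, hypothetical

# Paired t-test on per-seed differences.
t_stat, p_value = stats.ttest_rel(candidate, deltanet)

# Bootstrap 95% confidence interval on the mean difference.
diffs = candidate - deltanet
boot = np.array([rng.choice(diffs, size=len(diffs), replace=True).mean()
                 for _ in range(10_000)])
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

print(f"mean diff = {diffs.mean():.4f} ± {diffs.std(ddof=1):.4f}, "
      f"p = {p_value:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

A table of such numbers per benchmark would cost only a handful of extra seeds per model and would make the 1-3 point gains in Table 1 interpretable.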
Pipeline robustness
- Self-revision - The engineer agent iteratively fixes training errors using captured logs (GitHub).
- LLM-as-Judge - Provides qualitative novelty/complexity scores without inter-rater agreement or calibration (arXiv).
- Exploration-verification - A two-stage funnel filters 1,350 candidates down to 106 but still trains ≈400 large models, so efficiency versus random search remains unclear (arXiv); see the back-of-the-envelope comparison after this list.
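As a back-of-the-envelope illustration of the efficiency question, the sketch below compares the funnel's 106 verified architectures against what random search would be expected to find at the same ≈400-run verification budget. The random-candidate hit rate is an assumption for illustration only; it is precisely the number the paper would need to report for the comparison to mean anything.

```python
# Sketch: funnel vs. random search at the same verification budget.
# The random hit rate is an ASSUMPTION for illustration, not a measured value.
from scipy.stats import binom

verification_budget = 400        # approx. number of large training runs
funnel_hits = 106                # reported "SOTA" architectures

assumed_random_hit_rate = 0.05   # ASSUMPTION: chance a random candidate verifies

expected_random_hits = verification_budget * assumed_random_hit_rate
# Probability that random search matches or beats the funnel at this budget.
p_match_or_beat = binom.sf(funnel_hits - 1, verification_budget,
                           assumed_random_hit_rate)

print(f"funnel: {funnel_hits} hits; random (p={assumed_random_hit_rate}): "
      f"expected {expected_random_hits:.0f}, "
      f"P(random >= {funnel_hits}) = {p_match_or_beat:.2e}")
```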
External perspective
Verdict & recommendations
- Interesting engineering - The LLM-centric automation pipeline is worth replicating.
- Claims overstated - Results are narrow, mid-scale, and lack statistical depth.
- Future validation - concrete next steps:
  - Add Transformer, Performer, and SSM baselines at ≥1B parameters.
  - Report variance, significance, and compute/energy per model.
  - Benchmark inference speed vs. sequence length (see the timing sketch below).
  - Ablate each agent (Judge, cognition base, etc.) to measure its contribution.
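For the inference-speed recommendation, a minimal PyTorch timing sketch is given below. It is model-agnostic: `build_model` is a hypothetical loader standing in for any discovered architecture or baseline, and the batch size, vocabulary size, and sequence lengths are placeholders.

```python
# Sketch: wall-clock forward-pass latency vs. sequence length.
# `build_model` is a hypothetical placeholder for loading any candidate/baseline.
import time
import torch

def benchmark(model, vocab_size=32_000, batch=1,
              lengths=(512, 1024, 2048, 4096),
              warmup=3, iters=10, device="cuda"):
    model = model.to(device).eval()
    results = {}
    for seq_len in lengths:
        tokens = torch.randint(0, vocab_size, (batch, seq_len), device=device)
        with torch.no_grad():
            for _ in range(warmup):           # warm up kernels and allocator
                model(tokens)
            torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(iters):
                model(tokens)
            torch.cuda.synchronize()
        results[seq_len] = (time.perf_counter() - start) / iters
    return results  # seconds per forward pass, keyed by sequence length

# usage (hypothetical): print(benchmark(build_model("delta_net_340m")))
```

Plotting these curves for the discovered models against DeltaNet, Mamba2, and a standard Transformer would substantiate (or undercut) the efficiency claims that Table 1 omits.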
© Tobias Klein 2025 · All rights reserved
LinkedIn: https://www.linkedin.com/in/deep-learning-mastery/