Paper review: Not quite the AlphaGo moment yet
Summary
ASI‑ARCH presents a fully autonomous LLM‑driven pipeline that reports discovering 106 “state‑of‑the‑art” linear‑attention architectures and frames this as an “AlphaGo‑like” leap with a scaling law for discovery (arXiv).
Table 1 shows only 1–3‑point gains over DeltaNet and Mamba2 at 340M parameters and provides no confidence intervals or efficiency data (arXiv).
Key modules (novelty filter, LLM‑as‑Judge, cognition base) still depend on a hand‑curated set of ≈100 prior papers and carefully engineered prompts, so the process is not fully “human‑free” (GitHub).
Overall, the study is an interesting automation prototype, but the evidence falls short of an AlphaGo‑scale breakthrough.
Claimed contributions vs. documented evidence
- Autonomous discovery loop – A multi‑agent system handles idea generation, coding, debugging, and scoring (GitHub); yet it starts from a fixed DeltaNet baseline and curated knowledge, limiting autonomy.
- 106 SOTA models – Achieved after 1,773 experiments and 20,000 GPU‑hours (arXiv); the evaluation compares only to DeltaNet and Mamba2, omitting stronger baselines.
- Scaling law – Figure 1 shows a linear relation between discoveries and compute; this is expected when each run has similar cost, and the fit does not model diminishing returns (arXiv). The toy simulation below makes the point concrete.
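The “linear scaling” observation is easy to reproduce with a toy model: if every run costs about the same and succeeds with a roughly constant probability, discoveries grow linearly in compute by construction. The sketch below uses back‑of‑envelope values derived from the reported totals (≈20,000 GPU‑hours over ≈1,773 runs yielding 106 hits); treating per‑run cost and hit rate as constants is an assumption for illustration, not a figure from the paper.

```python
import random

# Back-of-envelope values derived from the reported totals; a constant per-run
# cost and a constant hit probability are assumptions made for illustration.
GPU_HOURS_PER_RUN = 20_000 / 1_773   # ~11.3 GPU-hours per experiment
HIT_RATE = 106 / 1_773               # ~6% of runs flagged as "SOTA"

def simulated_discoveries(total_gpu_hours: float, seed: int = 0) -> int:
    """Count 'discoveries' when runs are independent, equally priced,
    and equally likely to succeed."""
    rng = random.Random(seed)
    runs = int(total_gpu_hours / GPU_HOURS_PER_RUN)
    return sum(rng.random() < HIT_RATE for _ in range(runs))

for budget in (5_000, 10_000, 20_000, 40_000):
    print(f"{budget:>6} GPU-hours -> {simulated_discoveries(budget)} discoveries")
# The count grows ~linearly in the budget by construction, so a straight line
# in Figure 1 does not by itself show the search is getting smarter over time.
```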
Metrics & evaluation design
- Benchmarks – Seven reasoning datasets plus WikiText‑103 and LAMBADA cover only limited aspects of language quality (arXiv).
- Scale – Experiments stop at 340M parameters; DeltaNet itself reaches 1.3B and improves more with size (arXiv).
- Baselines – Mamba usually shines at 3B parameters, but only a reduced 340M “Mamba2” is tested (tridao.me).
- Statistical rigor – Only single runs are reported, with no variance, p‑values, or ablations; the bootstrap sketch after this list shows one cheap remedy.
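One low‑cost way to supply the missing rigor is a paired bootstrap over per‑example benchmark outcomes, which gives a confidence interval for the gap between a discovered architecture and the DeltaNet baseline. The sketch below is a generic illustration; the 0/1 correctness vectors are placeholders, not outputs from ASI‑ARCH.

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap CI for the mean score difference between two models
    evaluated on the same benchmark examples (paired resampling)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    n = len(a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)      # resample examples with replacement
        diffs[i] = a[idx].mean() - b[idx].mean()
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return a.mean() - b.mean(), (lo, hi)

# Placeholder per-example correctness (0/1) vectors, NOT real ASI-ARCH results.
rng = np.random.default_rng(42)
candidate = rng.integers(0, 2, 1_000)
deltanet = rng.integers(0, 2, 1_000)
print(paired_bootstrap_ci(candidate, deltanet))
```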
Pipeline robustness
- Self‑revision – The Engineer agent iteratively fixes training errors using captured logs (GitHub).
- LLM‑as‑Judge – Provides qualitative novelty/complexity scores without inter‑rater agreement or calibration (arXiv); the kappa sketch after this list shows how agreement could be reported.
- Exploration–verification – A two‑stage funnel filters 1,350 candidates down to 106 but still trains ≈400 large models, leaving efficiency versus random search unclear (arXiv).
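The judge critique above could be addressed cheaply by scoring the same candidates with independent judge passes (or different judge models) and reporting agreement. Below is a minimal Cohen’s kappa sketch over hypothetical novelty verdicts; the labels are invented for illustration and are not taken from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical novelty verdicts from two independent LLM-judge passes.
judge_1 = ["novel", "novel", "trivial", "novel", "trivial", "novel"]
judge_2 = ["novel", "trivial", "trivial", "novel", "trivial", "trivial"]
print(f"kappa = {cohens_kappa(judge_1, judge_2):.2f}")  # prints kappa = 0.40
```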
Verdict & recommendations
- Interesting engineering – The LLM‑centric automation pipeline is worth replicating.
- Claims overstated – Results are narrow, mid‑scale, and lack statistical depth.
- Future validation:
  - Add Transformer, Performer, and SSM baselines at ≥1B parameters.
  - Report variance, significance, and compute/energy per model.
  - Benchmark inference speed vs. sequence length (see the timing sketch below).
  - Ablate each agent (Judge, cognition base, etc.) to measure its contribution.
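For the inference‑speed item above, a simple timing harness over growing sequence lengths is enough to show whether a candidate actually delivers linear‑attention efficiency. The sketch below assumes a PyTorch module whose forward pass accepts a batch of token IDs; the function name and defaults are hypothetical, not part of the ASI‑ARCH release.

```python
import time
import torch

@torch.no_grad()
def latency_vs_length(model, lengths=(512, 1024, 2048, 4096, 8192),
                      vocab_size=50_257, batch_size=1, warmup=3, iters=10,
                      device="cuda"):
    """Median forward-pass latency (seconds) for each sequence length."""
    sync = torch.cuda.synchronize if device == "cuda" else (lambda: None)
    model = model.to(device).eval()
    results = {}
    for seq_len in lengths:
        ids = torch.randint(0, vocab_size, (batch_size, seq_len), device=device)
        for _ in range(warmup):          # warm up kernels and allocator caches
            model(ids)
        times = []
        for _ in range(iters):
            sync()
            start = time.perf_counter()
            model(ids)
            sync()
            times.append(time.perf_counter() - start)
        results[seq_len] = sorted(times)[len(times) // 2]
    return results

# Usage with a loaded checkpoint (hypothetical): print(latency_vs_length(my_model))
```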