Hacker News
We've been running coding agents against standardized repos with natural-language prompts — no tool names, no hints — and measuring what they actually choose.

Early finding: Claude Code picks Custom/DIY in 12 of 20 categories. Not because it can't use the tools (BFCL scores suggest it can), but because it doesn't reach for them. That's a different failure mode than the one capability benchmarks measure.

We score each tool on: agent visibility, pick rate vs Custom/DIY, cross-context breadth, expert human ratings, and implementation success rate. Tools above survival=1 persist. Below it, agents synthesize around them.
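To make the shape of that scoring concrete, here is a toy sketch. Everything in it is an assumption on my part, not the published methodology: the function name, the equal-ish weights, the weighted-geometric-mean combination, and the convention that each signal is normalized so 1.0 means parity with the Custom/DIY baseline.

```python
# Hypothetical sketch only; the real formula, weights, and scales
# at survivalindex.org/methodology may differ.

def survival_score(visibility, pick_rate, breadth, human_rating, success_rate,
                   weights=(0.2, 0.3, 0.15, 0.15, 0.2)):
    """Combine the five signals into one score.

    Each signal is assumed normalized so that 1.0 is parity with the
    Custom/DIY baseline. Weights sum to 1.0, so a tool at parity on
    every signal lands exactly on the survival=1 threshold.
    """
    signals = (visibility, pick_rate, breadth, human_rating, success_rate)
    # Weighted geometric mean: a tool that scores zero on any one
    # signal cannot be rescued by strength on the others.
    score = 1.0
    for s, w in zip(signals, weights):
        score *= s ** w
    return score

# Parity on every signal sits exactly at the threshold.
print(survival_score(1.0, 1.0, 1.0, 1.0, 1.0))  # → 1.0
```

The geometric mean is just one plausible choice; a weighted arithmetic mean would let strong visibility paper over a zero pick rate, which seems contrary to the "agents synthesize around them" failure mode.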

Methodology is at survivalindex.org/methodology. Very curious what people think of the measurement approach, especially the human coefficient variable.
