Neo Benchmark Report

Model: deepseek-v4-flash · Date: 2026-05-24 21:55 UTC

This report evaluates Neo across 10 realistic software engineering and terminal tasks modeled after SWE-bench Lite (code generation and refactoring) and Terminal-Bench 2.0 (CLI and shell scripting). For each task, a solution is generated via the model API and the resulting workspace is verified with file-system assertions.

Passed: 2 · Failed: 8 · Total: 10 · Avg Time: 5s · Total Time: 241s

Results

ID	Domain	Type	Status	Time	Detail
Q1	SWE-bench Lite	swe	FAIL	3s	Verification failed
Q2	SWE-bench Lite	swe	FAIL	13s	Verification failed
Q3	SWE-bench Lite	swe	FAIL	4s	Verification failed
Q4	SWE-bench Lite	swe	PASS	4s	Verification passed
Q5	SWE-bench Lite	swe	PASS	4s	Verification passed
Q6	Terminal-Bench 2.0	terminal	FAIL	10s	Verification failed
Q7	Terminal-Bench 2.0	terminal	FAIL	6s	Verification failed
Q8	Terminal-Bench 2.0	terminal	FAIL	3s	Verification failed
Q9	Terminal-Bench 2.0	terminal	FAIL	3s	Verification failed
Q10	Terminal-Bench 2.0	terminal	FAIL	8s	Verification failed

Task Details

Q1: SWE-bench Lite [FAIL]

Type: swe · Time: 3s

Q2: SWE-bench Lite [FAIL]

Type: swe · Time: 13s

Q3: SWE-bench Lite [FAIL]

Type: swe · Time: 4s

Q4: SWE-bench Lite [PASS]

Type: swe · Time: 4s

Q5: SWE-bench Lite [PASS]

Type: swe · Time: 4s

Q6: Terminal-Bench 2.0 [FAIL]

Type: terminal · Time: 10s

Q7: Terminal-Bench 2.0 [FAIL]

Type: terminal · Time: 6s

Q8: Terminal-Bench 2.0 [FAIL]

Type: terminal · Time: 3s

Q9: Terminal-Bench 2.0 [FAIL]

Type: terminal · Time: 3s

Q10: Terminal-Bench 2.0 [FAIL]

Type: terminal · Time: 8s

Methodology

SWE-bench Lite tasks (Q1–Q5): The model is prompted to modify Rust source files (add functions, structs, tests, command completions). Verification checks if the expected code pattern exists in the file.
Terminal-Bench 2.0 tasks (Q6–Q10): The model is prompted to create shell scripts, Makefiles, git hooks, and configuration files. Verification checks file existence, executability, and content.
Each task runs in an isolated workspace with a fresh copy of the Neo source repository.
Tasks are evaluated by file-system assertions, not by running the generated code (to avoid compilation overhead in benchmark time).
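
The verification style described above can be sketched roughly as follows. This is a minimal illustration and not Neo's actual harness: the function names (`fresh_workspace`, `check_pattern`, `check_script`) and any paths or patterns passed to them are hypothetical, and the executability check assumes a POSIX file system.

```python
import re
import shutil
import stat
import tempfile
from pathlib import Path


def fresh_workspace(repo: Path) -> Path:
    """Copy the repository into a new temporary directory, one per task."""
    workspace = Path(tempfile.mkdtemp(prefix="bench-")) / repo.name
    shutil.copytree(repo, workspace)
    return workspace


def check_pattern(path: Path, pattern: str) -> bool:
    """SWE-style assertion: the file exists and contains the expected pattern."""
    if not path.is_file():
        return False
    return re.search(pattern, path.read_text(encoding="utf-8")) is not None


def check_script(path: Path, must_contain: str) -> bool:
    """Terminal-style assertion: file exists, is executable, has the content."""
    if not path.is_file():
        return False
    # Executable here means the owner-execute bit is set (POSIX assumption).
    is_executable = bool(path.stat().st_mode & stat.S_IXUSR)
    return is_executable and must_contain in path.read_text(encoding="utf-8")
```

Because checks of this kind are purely textual, a working solution that is phrased differently from the expected pattern is recorded as a failure, and code that would not compile can still be recorded as a pass.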

Analysis

Neo achieved a 20% pass rate (2/10) across all benchmark tasks with an average completion time of 5s per task using the deepseek-v4-flash model.
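
The headline figures can be recomputed from the Results table. The sketch below copies the status and time columns by hand and assumes the listed times are already rounded to whole seconds, so its mean only approximates the reported average; how the report itself rounds is not stated.

```python
# (task id, type, status, rounded time in seconds), copied from the Results table.
results = [
    ("Q1", "swe", "FAIL", 3), ("Q2", "swe", "FAIL", 13),
    ("Q3", "swe", "FAIL", 4), ("Q4", "swe", "PASS", 4),
    ("Q5", "swe", "PASS", 4), ("Q6", "terminal", "FAIL", 10),
    ("Q7", "terminal", "FAIL", 6), ("Q8", "terminal", "FAIL", 3),
    ("Q9", "terminal", "FAIL", 3), ("Q10", "terminal", "FAIL", 8),
]

total = len(results)
passed = sum(1 for _id, _kind, status, _secs in results if status == "PASS")
pass_rate = passed / total
mean_time = sum(secs for _id, _kind, _status, secs in results) / total

# Passes and task counts per task type.
by_type = {}
for _id, kind, status, _secs in results:
    wins, count = by_type.get(kind, (0, 0))
    by_type[kind] = (wins + (status == "PASS"), count + 1)

print(f"pass rate: {passed}/{total} = {pass_rate:.0%}")  # pass rate: 2/10 = 20%
print(f"mean time: {mean_time:.1f}s")                    # mean time: 5.8s
for kind, (wins, count) in by_type.items():
    print(f"{kind}: {wins}/{count}")                     # swe: 2/5, terminal: 0/5
```

The per-task times sum to 58s, which gives a mean of 5.8s over ten tasks.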

Both passes came from the SWE-bench Lite tasks (Q4 and Q5, 2 of 5); all five Terminal-Bench 2.0 tasks failed. Failures typically occurred when the model produced a description of the changes rather than actually writing the code to disk, or when the generated code did not match the exact verification pattern. Because verification is pattern-based and does not run the generated code, some failures may reflect a format mismatch rather than incorrect code.