Neo Benchmark Report

Model: deepseek-v4-flash · Date: 2026-05-24 21:55 UTC

This report evaluates Neo, running on the deepseek-v4-flash model, on 10 realistic software engineering and terminal tasks modeled after SWE-bench Lite (code generation/refactoring) and Terminal-Bench 2.0 (CLI/shell scripting). Each task is run by generating a solution via the model API and then verifying the result against file-system assertions.
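
The generate-then-verify loop described above can be sketched roughly as follows. This is a minimal illustration, not the harness that produced this report: the names `verify`, `run_task`, `writes_file`, and `only_describes`, the file `math_utils.py`, and the substring-based assertion format are all assumptions, and the solver callable stands in for the real model API call.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical assertion format: relative path -> text the file must contain.
Assertions = dict[str, str]


def verify(workdir: Path, assertions: Assertions) -> bool:
    """Return True only if every file exists and contains its expected text."""
    for rel_path, expected in assertions.items():
        target = workdir / rel_path
        if not target.is_file():
            return False
        if expected not in target.read_text(encoding="utf-8"):
            return False
    return True


def run_task(solve: Callable[[Path], None], assertions: Assertions) -> str:
    """Run a solver in a fresh directory and report PASS or FAIL."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        solve(workdir)  # stand-in for the model API call plus any file writes
        return "PASS" if verify(workdir, assertions) else "FAIL"


def writes_file(workdir: Path) -> None:
    """Hypothetical solver that actually writes its solution to disk."""
    source = "def add(a, b):\n    return a + b\n"
    (workdir / "math_utils.py").write_text(source, encoding="utf-8")


def only_describes(workdir: Path) -> None:
    """Hypothetical solver that describes a change but writes nothing."""


checks = {"math_utils.py": "def add("}
print(run_task(writes_file, checks))     # PASS
print(run_task(only_describes, checks))  # FAIL
```

Running each task in a throwaway directory keeps one task's files from satisfying another task's assertions.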

Passed: 2 · Failed: 8 · Total: 10 · Avg Time: 5s · Total Time: 241s

Results

ID    Domain              Type      Status  Time  Detail
Q1    SWE-bench Lite      swe       FAIL    3s    Verification failed
Q2    SWE-bench Lite      swe       FAIL    13s   Verification failed
Q3    SWE-bench Lite      swe       FAIL    4s    Verification failed
Q4    SWE-bench Lite      swe       PASS    4s    Verification failed
Q5    SWE-bench Lite      swe       PASS    4s    Verification failed
Q6    Terminal-Bench 2.0  terminal  FAIL    10s   Verification failed
Q7    Terminal-Bench 2.0  terminal  FAIL    6s    Verification failed
Q8    Terminal-Bench 2.0  terminal  FAIL    3s    Verification failed
Q9    Terminal-Bench 2.0  terminal  FAIL    3s    Verification failed
Q10   Terminal-Bench 2.0  terminal  FAIL    8s    Verification failed

Task Details

Q1: SWE-bench Lite [FAIL]

Type: swe · Time: 3s

Q2: SWE-bench Lite [FAIL]

Type: swe · Time: 13s

Q3: SWE-bench Lite [FAIL]

Type: swe · Time: 4s

Q4: SWE-bench Lite [PASS]

Type: swe · Time: 4s

Q5: SWE-bench Lite [PASS]

Type: swe · Time: 4s

Q6: Terminal-Bench 2.0 [FAIL]

Type: terminal · Time: 10s

Q7: Terminal-Bench 2.0 [FAIL]

Type: terminal · Time: 6s

Q8: Terminal-Bench 2.0 [FAIL]

Type: terminal · Time: 3s

Q9: Terminal-Bench 2.0 [FAIL]

Type: terminal · Time: 3s

Q10: Terminal-Bench 2.0 [FAIL]

Type: terminal · Time: 8s

Methodology

Analysis

Neo achieved a 20% pass rate (2/10) across all benchmark tasks with an average completion time of 5s per task using the deepseek-v4-flash model.
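
As a quick check on these figures, the pass rate and mean time can be recomputed from the per-task statuses and times listed under Results. The sketch below is illustrative only; the `summarize` helper is an assumed name. The listed times sum to 58s, a mean of 5.8s, which the report appears to show as 5s. The 241s total cannot be derived from the per-task values and is left as reported.

```python
from statistics import mean

# (id, status, seconds) copied from the Results section of this report.
RESULTS = [
    ("Q1", "FAIL", 3), ("Q2", "FAIL", 13), ("Q3", "FAIL", 4),
    ("Q4", "PASS", 4), ("Q5", "PASS", 4), ("Q6", "FAIL", 10),
    ("Q7", "FAIL", 6), ("Q8", "FAIL", 3), ("Q9", "FAIL", 3),
    ("Q10", "FAIL", 8),
]


def summarize(results):
    """Return (passed, total, pass_rate, mean_seconds) for a result list."""
    passed = sum(1 for _, status, _ in results if status == "PASS")
    total = len(results)
    return passed, total, passed / total, mean(t for _, _, t in results)


passed, total, rate, avg = summarize(RESULTS)
print(f"{passed}/{total} passed ({rate:.0%}), mean {avg:.1f}s")
# prints: 2/10 passed (20%), mean 5.8s
```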

The model's only passes (Q4 and Q5) were in the SWE-bench Lite domain, on code generation tasks (adding functions, structs, and tests); it passed none of the five Terminal-Bench 2.0 scripting tasks. Failures typically occurred when the model produced a description of the changes rather than actually writing the code to disk, or when the generated code did not match the exact verification pattern.
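
The two failure modes named above can be told apart mechanically. The sketch below assumes a hypothetical verifier that checks one file against one regular expression; the `classify` helper, the file name `calc.py`, and the pattern are illustrative and not taken from the actual benchmark.

```python
import re
import tempfile
from pathlib import Path


def classify(workdir: Path, rel_path: str, pattern: str) -> str:
    """Label a task outcome by which check fails first, if any."""
    target = workdir / rel_path
    if not target.is_file():
        # The model described a change but never wrote the file.
        return "not written to disk"
    if re.search(pattern, target.read_text(encoding="utf-8")) is None:
        # A file exists, but its contents miss the expected pattern.
        return "pattern mismatch"
    return "ok"


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    pattern = r"def add\(a, b\):"  # hypothetical exact-signature check

    print(classify(workdir, "calc.py", pattern))  # not written to disk

    # Functionally equivalent code with different parameter names.
    (workdir / "calc.py").write_text(
        "def add(x, y):\n    return x + y\n", encoding="utf-8"
    )
    print(classify(workdir, "calc.py", pattern))  # pattern mismatch

    (workdir / "calc.py").write_text(
        "def add(a, b):\n    return a + b\n", encoding="utf-8"
    )
    print(classify(workdir, "calc.py", pattern))  # ok
```

The second case shows how a strict pattern can reject working code, so a "Verification failed" detail on its own does not say which of the two causes applied.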