Model: deepseek-v4-flash · Date: 2026-05-24 21:55 UTC
This report evaluates the Neo model across 10 realistic software engineering and terminal tasks: 5 modeled after SWE-bench Lite (code generation/refactoring) and 5 modeled after Terminal-Bench 2.0 (CLI/shell scripting). Each task is evaluated by generating a solution via the model API and then verifying the resulting files against file-system assertions.
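The verification step described above can be sketched roughly as follows. This is a minimal illustration, not the harness actually used for this report: the `verify` function, the mapping of relative paths to regex patterns, and the `mathutil.py` file name are all hypothetical.

```python
import re
import tempfile
from pathlib import Path


def verify(workdir: Path, assertions: dict[str, str]) -> bool:
    """Return True only if every expected file exists and matches its regex.

    `assertions` maps a path relative to `workdir` to a regex pattern
    (a hypothetical format, assumed for this sketch).
    """
    for rel_path, pattern in assertions.items():
        target = workdir / rel_path
        if not target.is_file():
            return False  # the solution was never written to disk
        if re.search(pattern, target.read_text(encoding="utf-8")) is None:
            return False  # the file exists but lacks the expected content
    return True


# Usage: simulate a model that wrote one file into a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "mathutil.py").write_text(
        "def add(a, b):\n    return a + b\n", encoding="utf-8"
    )
    print(verify(workdir, {"mathutil.py": r"def add\("}))  # file present, pattern found
    print(verify(workdir, {"missing.py": r"def add\("}))   # file was never created
```

Checking the file system rather than the model's text response means a task only passes when the change actually lands on disk.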
| ID | Domain | Type | Status | Time | Detail |
|---|---|---|---|---|---|
| Q1 | SWE-bench Lite | swe | FAIL | 3s | Verification failed |
| Q2 | SWE-bench Lite | swe | FAIL | 13s | Verification failed |
| Q3 | SWE-bench Lite | swe | FAIL | 4s | Verification failed |
| Q4 | SWE-bench Lite | swe | PASS | 4s | Verification passed |
| Q5 | SWE-bench Lite | swe | PASS | 4s | Verification passed |
| Q6 | Terminal-Bench 2.0 | terminal | FAIL | 10s | Verification failed |
| Q7 | Terminal-Bench 2.0 | terminal | FAIL | 6s | Verification failed |
| Q8 | Terminal-Bench 2.0 | terminal | FAIL | 3s | Verification failed |
| Q9 | Terminal-Bench 2.0 | terminal | FAIL | 3s | Verification failed |
| Q10 | Terminal-Bench 2.0 | terminal | FAIL | 8s | Verification failed |
Neo achieved a 20% pass rate (2/10) across all benchmark tasks, with an average completion time of 5.8s per task (58s total), using the deepseek-v4-flash model.
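The summary figures can be recomputed from the Status and Time columns of the results table. The snippet below is a small sketch of that arithmetic; the `results` list is transcribed by hand from the table, and the variable names are illustrative.

```python
# (task ID, status, time in seconds), transcribed from the results table
results = [
    ("Q1", "FAIL", 3), ("Q2", "FAIL", 13), ("Q3", "FAIL", 4),
    ("Q4", "PASS", 4), ("Q5", "PASS", 4), ("Q6", "FAIL", 10),
    ("Q7", "FAIL", 6), ("Q8", "FAIL", 3), ("Q9", "FAIL", 3),
    ("Q10", "FAIL", 8),
]

passed = sum(1 for _, status, _ in results if status == "PASS")
pass_rate = passed / len(results)
total_seconds = sum(seconds for _, _, seconds in results)
mean_seconds = total_seconds / len(results)

print(f"pass rate: {passed}/{len(results)} = {pass_rate:.0%}")  # 2/10 = 20%
print(f"time: {total_seconds}s total, {mean_seconds:.1f}s mean")  # 58s total, 5.8s mean
```

The task times sum to 58s, so the mean is 5.8s; truncating to whole seconds gives 5s, while rounding gives 6s.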
Both passes (Q4 and Q5) came from the code generation category, which covers adding functions, structs, and tests, for a 2/5 pass rate there; all five terminal scripting tasks failed. Failures typically occurred when the model produced a description of the changes rather than actually writing the code to disk, or when the generated code didn't match the exact verification pattern.
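The table reports every failure only as "Verification failed", so the two failure modes named above are not distinguishable from the results alone. A verifier could report them separately along the following lines. This is a hedged sketch: the `diagnose` function, its return labels, and the file names are hypothetical and not part of the harness used here.

```python
import re
import tempfile
from pathlib import Path


def diagnose(workdir: Path, rel_path: str, pattern: str) -> str:
    """Classify one file-system assertion as 'ok' or one of two failure modes."""
    target = workdir / rel_path
    if not target.is_file():
        # The model described the change but never wrote the file.
        return "not_written"
    if re.search(pattern, target.read_text(encoding="utf-8")) is None:
        # Code was written, but it does not match the expected pattern.
        return "pattern_mismatch"
    return "ok"


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    # The model wrote a working function under a different name than expected.
    (workdir / "mathutil.py").write_text(
        "def sum_two(a, b):\n    return a + b\n", encoding="utf-8"
    )
    print(diagnose(workdir, "mathutil.py", r"def add\("))      # pattern_mismatch
    print(diagnose(workdir, "mathutil.py", r"def sum_two\("))  # ok
    print(diagnose(workdir, "stringutil.py", r"def upper\("))  # not_written
```

Recording the label in the Detail column would show whether a failure reflects missing output or an overly strict pattern.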