Reference code & inference utilities for VEFX-Bench, a comprehensive benchmark for evaluating text-driven video editing and visual effects.
The benchmark contains 5,049 annotated examples spanning 9 categories & 32 subcategories. Edits are evaluated by VEFX-Reward, a VLM-based reward model that scores each edit along three dimensions on a 1–4 scale:
- **IF**: Does the edit accurately reflect the editing instruction?
- **RQ**: Visual clarity, temporal consistency, and physical plausibility.
- **EE**: Were only the intended regions modified, without side effects?
VEFX-Reward scores on a 1–4 scale. Ranked by GeoAgg (α=2 for IF, β=1 for RQ, γ=1 for EE). Higher is better.
| Rank | Model | Type | IF ↑ | RQ ↑ | EE ↑ | GeoAgg ↑ |
|---|---|---|---|---|---|---|
| 🥇 | Kling o3 Omni | Commercial | 3.033 | 3.588 | 3.043 | 3.057 |
| 🥈 | Kling o1 | Commercial | 3.040 | 3.534 | 2.976 | 2.985 |
| 🥉 | Runway Gen-4.5 | Commercial | 2.817 | 3.319 | 2.923 | 2.912 |
| 4 | Seedance 2.0 | Commercial | 2.811 | 3.421 | 3.088 | 2.766 |
| 5 | Grok Imagine | Commercial | 2.606 | 3.346 | 3.376 | 2.723 |
| 6 | Luma Ray 3 | Commercial | 2.702 | 3.403 | 2.705 | 2.717 |
| 7 | UniVideo | Open-source | 2.294 | 3.266 | 3.091 | 2.516 |
| 8 | Wan 2.6 | Commercial | 2.012 | 3.317 | 2.446 | 2.146 |
| 9 | Luma Ray 2 | Commercial | 2.038 | 2.532 | 1.363 | 1.804 |
| 10 | VACE | Open-source | 2.027 | 3.172 | 1.180 | 1.775 |
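
The GeoAgg ranking column can be read as a weighted geometric mean of the three scores. The sketch below assumes the form GeoAgg = (IF^α · RQ^β · EE^γ)^(1/(α+β+γ)); only the exponents are stated here, so the exact formula and the function name `geo_agg` are assumptions rather than the benchmark's reference implementation. Applying this form to the column means in the table does not reproduce the GeoAgg column (the top row gives roughly 3.17 rather than 3.057), which would be consistent with GeoAgg being computed per example and then averaged, though that is also an assumption.

```python
import math


def geo_agg(if_score, rq_score, ee_score, alpha=2.0, beta=1.0, gamma=1.0):
    """Weighted geometric mean of the three scores (assumed form of GeoAgg).

    The default weights follow the leaderboard caption: alpha=2 for IF,
    beta=1 for RQ, gamma=1 for EE.
    """
    scores = (if_score, rq_score, ee_score)
    weights = (alpha, beta, gamma)
    if any(s <= 0 for s in scores):
        raise ValueError("scores must be positive (expected range is 1 to 4)")
    # Work in log space: exp(sum(w_i * log(s_i)) / sum(w_i)).
    log_sum = sum(w * math.log(s) for w, s in zip(weights, scores))
    return math.exp(log_sum / sum(weights))


if __name__ == "__main__":
    # Column means of the top leaderboard row, used purely as sample inputs.
    print(round(geo_agg(3.033, 3.588, 3.043), 3))
```

Because IF carries twice the weight of the other two dimensions, a model with a low IF score is penalized more heavily than one with an equally low RQ or EE score.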
An example pair from examples/sample_videos/: original input on the left, edited output on the right.
```bash
pip install -r requirements.txt
python examples/quick_start.py \
    --original examples/sample_videos/original.mp4 \
    --edited examples/sample_videos/edited.mp4 \
    --instruction "Change the color of the trailer to bright yellow"
```
See examples/ for batch & multi-GPU scoring scripts.
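
For a handful of pairs, the quick-start command can also be driven from Python. This is a minimal sketch, assuming `examples/quick_start.py` accepts exactly the three flags shown above; the helpers `build_command` and `score_pairs` are hypothetical and not part of the repository, and the bundled batch scripts are likely the better choice for large runs.

```python
import shlex
import subprocess
import sys

QUICK_START = "examples/quick_start.py"  # script path as shown in the quick start


def build_command(original, edited, instruction):
    """Build the argv list for one quick_start.py invocation."""
    return [
        sys.executable, QUICK_START,
        "--original", original,
        "--edited", edited,
        "--instruction", instruction,
    ]


def score_pairs(pairs, dry_run=True):
    """Run (or, with dry_run=True, only print) one command per pair.

    `pairs` is an iterable of (original_path, edited_path, instruction).
    Returns the list of commands that were built.
    """
    commands = []
    for original, edited, instruction in pairs:
        cmd = build_command(original, edited, instruction)
        commands.append(cmd)
        if dry_run:
            print(shlex.join(cmd))
        else:
            # check=True raises CalledProcessError if scoring fails.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    score_pairs([
        (
            "examples/sample_videos/original.mp4",
            "examples/sample_videos/edited.mp4",
            "Change the color of the trailer to bright yellow",
        ),
    ])
```

Passing `dry_run=False` executes each command sequentially from the repository root; each call starts a fresh process, so any model loading cost is paid once per pair.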