I ran the same YOLO inference workload through four completely different video pipeline stacks — from plain OpenCV on CPU all the way to PyNvVideoCodec 2.1 with CV-CUDA preprocessing — and measured every stage independently. Here's what I found, and why it matters for anyone building real-time video AI.
Video inference pipelines are everywhere — surveillance systems, autonomous vehicles, content moderation, real-time analytics. But most tutorials still show the naive approach: decode with OpenCV on CPU, resize on CPU, then hand frames to a GPU model. I wanted to know how much performance you're actually leaving on the table, and which modern GPU-native libraries close that gap most effectively in 2025/2026.
The GPU-accelerated approach using NVDEC and dedicated preprocessing libraries has been talked about for years, but a lot has changed recently. NVIDIA's own PyNvVideoCodec 2.0 dropped in mid-2025, and CV-CUDA (co-developed with ByteDance) has been steadily maturing. So: if you were building a video inference pipeline from scratch today, which stack would actually win?
Four libraries, one A100, one YOLO model, 1024 frames. Which pipeline finishes first — and why?
I wrote a benchmarking harness that times each stage independently — decode, preprocess, inference, and postprocess — so you can see exactly where the time goes, not just a single wall-clock number.
Before diving into numbers, it helps to understand the four stages every frame passes through: decode, preprocess, inference, and postprocess. Think of it like a factory assembly line — each station transforms the data before passing it to the next.
The bottleneck has historically been decode + preprocess. When you use OpenCV, both of these run on the CPU, and every frame has to make a round trip from GPU memory back to CPU and then back to GPU for inference. That's a lot of wasted bus bandwidth. The entire point of the DALI-style approach is to keep frames on the GPU from the moment they're decoded.
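A back-of-envelope estimate makes the round-trip cost concrete. Assuming 1080p RGB frames and roughly 25 GB/s of effective PCIe Gen4 host-device bandwidth (both illustrative assumptions, not measurements from this benchmark):

```python
# Back-of-envelope cost of the GPU -> CPU -> GPU round trip.
# Assumptions (illustrative, not measured): 1080p RGB uint8 frames,
# ~25 GB/s effective host<->device bandwidth on PCIe Gen4.
FRAME_BYTES = 1920 * 1080 * 3           # one decoded RGB frame
N_FRAMES = 1024
PCIE_BYTES_PER_S = 25e9                 # effective bytes/second

one_way = FRAME_BYTES * N_FRAMES        # download (or upload) volume
round_trip_s = 2 * one_way / PCIE_BYTES_PER_S

print(f"{one_way / 1e9:.2f} GB each way, "
      f"{round_trip_s * 1e3:.0f} ms round trip")
```

Roughly half a second of pure bus transfer for 1024 frames, all of which disappears once frames stay GPU-resident from decode to inference.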
Each pipeline represents a different philosophy about how to move video through the system. Here's the lineup:
Most people's first video inference pipeline looks like this:
cv2.VideoCapture to read frames, cv2.resize to
scale them, then hand the result to a PyTorch model. It works. It's
simple. But it's also leaving most of your hardware idle — the GPU sits
waiting while the CPU decodes and resizes each frame one at a time.
NVIDIA's Data Loading Library (DALI) was designed to fix exactly this. It uses the dedicated NVDEC hardware inside the GPU to decode video, then runs resize and normalization on GPU cores — all without the frame ever touching CPU memory. Published write-ups of this approach report speedups of roughly 6× over OpenCV. DALI is powerful but opinionated: it has its own pipeline definition language and works best with MP4 files and multi-video training scenarios.
Released in mid-2025 (with 2.1 dropping in January 2026), PyNvVideoCodec
is NVIDIA's official replacement for VPF (Video Processing Framework). It
gives you direct Python access to the NVDEC hardware with a much simpler
API than DALI — just ThreadedDecoder,
get_batch_frames(n), and you're done. The key feature is the
ThreadedDecoder: it decodes in a background thread and
pre-buffers frames so your inference loop never waits for decode to
finish.
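That pre-buffering design is, at heart, a bounded producer-consumer queue. A stdlib-only sketch of the idea, with a fake `decode_one` callable standing in for the NVDEC decode (nothing here is PyNvVideoCodec's actual implementation):

```python
import queue
import threading

def threaded_frames(decode_one, n_frames, buffer_size=8):
    """Decode in a background thread into a bounded buffer so the
    consumer never waits, as long as decode keeps up."""
    q = queue.Queue(maxsize=buffer_size)
    SENTINEL = object()

    def worker():
        for i in range(n_frames):
            q.put(decode_one(i))   # blocks when the buffer is full
        q.put(SENTINEL)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        yield item

# Usage: iterate frames while decode runs in the background
frames = list(threaded_frames(lambda i: i * i, 5))
```

The bounded queue is what provides back-pressure: the decode thread stays a few frames ahead but cannot run away with memory.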
CV-CUDA is an open-source library (originally a collaboration between
NVIDIA and ByteDance) that provides GPU-accelerated image processing
operators — resize, normalize, color convert — as zero-copy operations.
The idea is to keep the entire pipeline on GPU with no intermediate CPU
copies. cvcuda.resize + cvcuda.convertto replace
torch.nn.functional.interpolate with purpose-built CUDA
kernels.
Getting a fair comparison between these libraries took more iteration than I expected. The key challenge: they all have different APIs, and naively wrapping them produced highly misleading numbers.
My solution: pre-decode all frames once at setup, cache them in GPU memory, then benchmark only the preprocess → inference → postprocess loop, with decode measured separately as a single timed pass. That way every pipeline's preprocess and inference stages run over identical cached frames, and decode variance can't pollute the other numbers.
All pipelines used the same YOLOv8n model, same 640×640 input resolution, FP16 inference, on an NVIDIA A100-SXM4-40GB with PyTorch 2.6 + CUDA 12.4. Each pipeline ran 3 warmup iterations before 10 timed benchmark runs, processing 1024 frames per run. The full sweep covers batch sizes 16, 32, 64, and 128. CV-CUDA fails at batch 128 with a tensor size limit and is excluded from that column.
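Trustworthy per-stage numbers also require draining the GPU's work queue around each stage; otherwise CUDA's asynchronous launches make GPU stages look nearly free. A sketch of the timing pattern — the class name and API are illustrative, not the harness's actual code, and on GPU runs you would pass `torch.cuda.synchronize` as the `sync` hook:

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Accumulate wall-clock seconds per pipeline stage."""

    def __init__(self, sync=None):
        self.sync = sync   # e.g. torch.cuda.synchronize on GPU runs
        self.totals = {}

    @contextmanager
    def stage(self, name):
        if self.sync:
            self.sync()                  # drain pending GPU work first
        start = time.perf_counter()
        try:
            yield
        finally:
            if self.sync:
                self.sync()              # wait for this stage's kernels
            elapsed = time.perf_counter() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed

timer = StageTimer()                     # no sync hook in this CPU-only demo
with timer.stage("preprocess"):
    time.sleep(0.01)                     # stand-in for real work
```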
All figures at batch size 16, 1024 frames, A100-SXM4-40GB.
At batch 16, CV-CUDA leads at 451 FPS, followed by PyNvVideoCodec at 405 FPS, DALI at 383 FPS, and OpenCV at 140 FPS. But the more interesting story is how these numbers shift as batch size grows — which we'll get to after the stage breakdown.
This is the most important chart. OpenCV CPU spends 2.47s on preprocess — resizing and colour-converting 1024 frames on CPU. Every GPU pipeline collapses that to around 0.086–0.107s. That single change — moving preprocess off the CPU — is where essentially all of the 3× speedup comes from. Decode times are within ~0.3s of each other. Inference across the three GPU pipelines is nearly identical (0.73–0.91s) once the GPU isn't being starved — OpenCV's 3.07s inference is inflated because the CPU bottleneck forces many small, inefficient kernel launches.
CV-CUDA's 0.086s preprocess vs DALI/PyNvVC's 0.107s looks small in absolute terms, but it compounds: CV-CUDA also gets faster inference (0.728s vs 0.910s), likely because its zero-copy tensor output lands in a more cache-friendly layout for the model input.
The batch sweep tells the complete story. A few things stand out:
CV-CUDA wins at every batch size it supports, and the gap widens at
batch 64 — 459 FPS vs PyNvVideoCodec's 427 FPS, a 7.5% advantage. At batch 128
CV-CUDA throws NVCV_ERROR_INVALID_ARGUMENT: Input or output tensors are too
large and fails entirely, making PyNvVideoCodec the only option at that scale.
OpenCV actually improves slightly at larger batch sizes (140 → 149 FPS from batch 16 to 128). This is because larger batches amortise the per-batch inference overhead, and the GPU utilisation improves. The CPU preprocess bottleneck stays roughly constant, but inference becomes slightly more efficient.
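A toy cost model shows why. Suppose (purely illustrative numbers, not measurements from this benchmark) CPU preprocess costs a fixed 2.4 ms per frame, inference a fixed 0.8 ms per frame, and each batch launch adds 2 ms of overhead. Only the launch term shrinks as batches grow:

```python
def toy_fps(batch_size, n_frames=1024,
            cpu_pre_per_frame=2.4e-3,   # illustrative, not measured
            launch_per_batch=2.0e-3,    # fixed per-batch overhead
            infer_per_frame=0.8e-3):
    """Toy model: only the per-batch launch overhead amortises."""
    n_batches = n_frames / batch_size
    total_s = (n_frames * cpu_pre_per_frame
               + n_batches * launch_per_batch
               + n_frames * infer_per_frame)
    return n_frames / total_s

for b in (16, 32, 64, 128):
    print(b, round(toy_fps(b), 1))
```

The per-frame CPU terms dominate and stay flat, so throughput creeps up by a few percent rather than scaling, which is the same shape as the 140 → 149 FPS curve above.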
DALI is consistently third, sitting between OpenCV and PyNvVideoCodec across all batch sizes. Its preprocess and inference times are actually identical to PyNvVideoCodec's — the gap comes entirely from slower decode (1.64s vs 1.50s at batch 16), likely due to DALI's internal pipeline orchestration overhead around the NVDEC path.
PyNvVideoCodec peaks at batch 64 (427 FPS) and actually dips slightly to 420 FPS at batch 128 — the only GPU pipeline to regress at larger batch sizes. This may be a memory pressure effect at batch 128 that doesn't affect DALI's simpler memory management.
| Pipeline | B16 FPS | B32 FPS | B64 FPS | B128 FPS |
|---|---|---|---|---|
| OpenCV CPU | 140.2 | 141.9 | 144.4 | 148.8 |
| FFmpeg + DALI | 382.8 | 387.7 | 391.3 | 396.7 |
| PyNvVideoCodec 2.1 | 405.0 | 414.0 | 427.3 | 420.4 |
| PyNvVC + CV-CUDA | 451.1 | 455.0 | 459.3 | ❌ crash |

| Pipeline | Decode (s) | Preprocess (s) | Inference (s) | Total B16 (s) |
|---|---|---|---|---|
| OpenCV CPU | 1.752 | 2.471 | 3.065 | 7.303 |
| FFmpeg + DALI | 1.644 | 0.107 | 0.910 | 2.675 |
| PyNvVideoCodec 2.1 | 1.497 | 0.107 | 0.910 | 2.528 |
| PyNvVC + CV-CUDA | 1.444 | 0.086 | 0.728 | 2.270 |
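As a sanity check, the two tables agree with each other: each B16 FPS figure is just 1024 frames divided by the corresponding total time:

```python
# B16 totals in seconds, from the stage-breakdown table
totals_b16 = {
    "OpenCV CPU": 7.303,
    "FFmpeg + DALI": 2.675,
    "PyNvVideoCodec 2.1": 2.528,
    "PyNvVC + CV-CUDA": 2.270,
}
fps_b16 = {name: 1024 / t for name, t in totals_b16.items()}
for name, fps in fps_b16.items():
    print(f"{name}: {fps:.1f} FPS")
```

The quotients come out at 140.2, 382.8, 405.1, and 451.1 FPS, matching the throughput table to within rounding.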
Before running this benchmark, I assumed inference would dominate. It
doesn't. cv2.resize + cv2.cvtColor for 1024
frames takes 2.47 seconds on CPU. The same operation on GPU takes around
86–107 milliseconds depending on the pipeline — a 23–29× reduction.
That single change accounts for essentially all of the end-to-end speedup.
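Those ratios follow directly from the stage table:

```python
cpu_preprocess_s = 2.471                 # OpenCV, 1024 frames on CPU
gpu_preprocess_s = {"DALI / PyNvVC": 0.107, "CV-CUDA": 0.086}

speedups = {name: cpu_preprocess_s / t for name, t in gpu_preprocess_s.items()}
for name, x in speedups.items():
    print(f"{name}: {x:.1f}x faster than CPU preprocess")
```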
CV-CUDA wins at every batch size it supports: 451 FPS at batch 16, 455 at batch 32, 459 at batch 64. The gap over plain PyNvVideoCodec widens as batch size grows — 7.5% faster at batch 64 (459 vs 427 FPS). The advantage comes from CV-CUDA's purpose-built resize and normalize kernels, which replace PyTorch's generic interpolate path with cheaper, GPU-resident operations and produce a zero-copy tensor that feeds directly into the model with better cache locality.
At batch 128, CV-CUDA throws
NVCV_ERROR_INVALID_ARGUMENT: Input or output tensors are too large
and fails entirely. If your production workload runs at batch 128 or above,
PyNvVideoCodec with torch.nn.functional.interpolate is your only
option.
DALI is powerful but has a steep learning curve — its pipeline DSL is
non-trivial, it's sensitive to video container formats (MOV files caused
180-second decode times in my testing due to seeking behavior), and it's
overkill for single-video inference. PyNvVideoCodec 2.1, by contrast, gives
you direct NVDEC access via a pip install and a simple Python
API, works at all batch sizes, and beats DALI at every batch size in this
benchmark.
PyNvVideoCodec's online ThreadedDecoder documentation doesn't
always match the installed API: on the cluster I used, the actual interface
was get_batch_frames(n) plus end(), not an iterator.
Always introspect with dir(ThreadedDecoder) on your target
machine before writing code.
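A tiny helper makes that habit cheap. `public_api` is a hypothetical name of mine, not part of PyNvVideoCodec; the point is just to dump what the installed class actually exposes:

```python
def public_api(obj):
    """Return the public attribute names an object or class exposes."""
    return sorted(name for name in dir(obj) if not name.startswith("_"))

# On the target machine you'd run, e.g.:
#   from PyNvVideoCodec import ThreadedDecoder
#   print(public_api(ThreadedDecoder))
# Demonstrated here on a stdlib class instead:
print(public_api(list))
```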
Once decode and preprocess move to GPU, inference dominates at 0.73–0.91s per 1024 frames across all GPU pipelines. The next logical optimisation is TensorRT — exporting YOLOv8n to a TensorRT FP16 engine typically halves inference time on A100, which would push these pipelines toward 600–800 FPS. That's the remaining 2× sitting on the table.
One thing not tested here: pinning decode and preprocessing to the same explicit CUDA stream. In the current setup there's implicit synchronisation between NVDEC output and CV-CUDA ops, which likely costs a few milliseconds per batch. Explicit stream management is on the list for a follow-up, along with TensorRT inference timing.
Here's the core of the PyNvVideoCodec pipeline — the part that actually matters:
```python
# Setup: decode once, cache in GPU memory
from PyNvVideoCodec import ThreadedDecoder, OutputColorType

dec = ThreadedDecoder(
    video_path,
    1024 * 2,  # buffer_size — required positional arg
    gpu_id=0,
    output_color_type=OutputColorType.RGB,  # interleaved HWC
)
all_frames = dec.get_batch_frames(1024)
dec.end()

# Benchmark loop: preprocess → infer → postprocess
for batch_hwc in batches:
    # [B, H, W, C] uint8 → [B, C, H, W] float32
    f = batch_hwc.float().div(255.0).permute(0, 3, 1, 2).contiguous()
    f = torch.nn.functional.interpolate(
        f, size=(640, 640), mode="bilinear", align_corners=False
    )
    results = model(f, verbose=False, half=True)
```
And the CV-CUDA variant, which replaces the preprocess step:
```python
import cvcuda

nhwc = batch_hwc.contiguous()        # [N, H, W, C] uint8 on GPU; N = batch size
ct = cvcuda.as_tensor(nhwc, "NHWC")  # zero-copy
resized = cvcuda.resize(ct, (N, 640, 640, 3), cvcuda.Interp.LINEAR)
normed = cvcuda.convertto(resized, cvcuda.Type.F32, scale=1.0 / 255.0)
final = torch.as_tensor(normed.cuda()).permute(0, 3, 1, 2).contiguous()
```
Moving preprocess from CPU to GPU gives you the bulk of the speedup (23× on preprocess alone). PyNvVideoCodec 2.1 is the most practical choice for 2025/2026 — simpler API than DALI, faster decode, pip-installable. CV-CUDA is the faster preprocessor, with a 7.5% edge over PyNvVideoCodec at batch 64. At batch 128 CV-CUDA hits a tensor size limit, so PyNvVideoCodec is the only option at that scale. All GPU pipelines are 2.7–3.2× faster than OpenCV CPU end-to-end on A100.
If you're starting a video inference project today: skip OpenCV for anything performance-sensitive, use PyNvVideoCodec 2.1 as your decoder, and use CV-CUDA for preprocessing if your batch size is 32 or above. If you need TensorRT inference, that's the remaining 2× sitting on the table.
All benchmark code is available here — the harness handles pre-decode caching, per-stage timing with CUDA synchronization, CSV export, and chart generation automatically.