Beyond DALI: Benchmarking Modern GPU Video Inference Pipelines

Yash Jain · April 2026

I ran the same YOLO inference workload through four completely different video pipeline stacks — from plain OpenCV on CPU all the way to PyNvVideoCodec 2.1 with CV-CUDA preprocessing — and measured every stage independently. Here's what I found, and why it matters for anyone building real-time video AI.

Why I built this benchmark

Video inference pipelines are everywhere — surveillance systems, autonomous vehicles, content moderation, real-time analytics. But most tutorials still show the naive approach: decode with OpenCV on CPU, resize on CPU, then hand frames to a GPU model. I wanted to know how much performance you're actually leaving on the table, and which modern GPU-native libraries close that gap most effectively in 2025/2026.

The GPU-accelerated approach using NVDEC and dedicated preprocessing libraries has been talked about for years. A lot has changed recently. NVIDIA's own PyNvVideoCodec 2.0 dropped in mid-2025, and CV-CUDA (co-developed with ByteDance) has been steadily maturing. I wanted to know: if you were building a video inference pipeline from scratch today, what stack would actually win?

The question

Four libraries, one A100, one YOLO model, 1024 frames. Which pipeline finishes first — and why?

I wrote a benchmarking harness that times each stage independently — decode, preprocess, inference, and postprocess — so you can see exactly where the time goes, not just a single wall-clock number.

What's inside a video inference pipeline?

Before diving into numbers, it helps to understand the five stages every frame passes through. Think of it like a factory assembly line — each station transforms the data before passing it to the next.

Demux (CPU / FFmpeg, splits streams) → Decode (NVDEC, raw frames) → Preprocess (GPU cores, resize/normalize) → Inference (YOLO on GPU, predictions) → Postprocess (PyTorch, boxes/labels)

The bottleneck has historically been decode + preprocess. When you use OpenCV, both of these run on the CPU, and every frame has to make a round trip from GPU memory back to CPU and then back to GPU for inference. That's a lot of wasted bus bandwidth. The entire point of the DALI-style approach is to keep frames on the GPU from the moment they're decoded.
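To make the GPU-resident pattern concrete, here is a minimal torch sketch. The frame dimensions and batch size are illustrative stand-ins for decoder output, and it falls back to CPU when no GPU is present, but the point is the same: nothing in the chain calls `.cpu()` or `.numpy()`.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for a batch of freshly decoded frames already in device memory
frames = torch.randint(0, 256, (4, 720, 1280, 3), dtype=torch.uint8, device=device)

# GPU-resident preprocess: HWC uint8 -> CHW float32 in [0, 1], then resize
x = frames.permute(0, 3, 1, 2).float().div_(255.0)
x = F.interpolate(x, size=(640, 640), mode="bilinear", align_corners=False)

assert x.device == frames.device  # the data never left the device
```

The naive pipeline does the same math, but with a device-to-host copy before the resize and a host-to-device copy after it, once per frame.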

The four pipelines I tested

Each pipeline represents a different philosophy about how to move video through the system. Here's the lineup:

| Pipeline | Notes | Decode | Preprocess | Inference |
|---|---|---|---|---|
| OpenCV CPU | Baseline | CPU (cv2) | CPU (cv2) | GPU (torch) |
| FFmpeg + DALI | Article baseline | NVDEC (DALI) | GPU (torch) | GPU (torch) |
| PyNvVideoCodec 2.1 | NVIDIA 2025 | NVDEC direct | GPU (torch) | GPU (torch) |
| PyNvVC + CV-CUDA | NVIDIA + ByteDance | NVDEC direct | CV-CUDA ops | GPU (torch) |

OpenCV CPU — the baseline everyone starts with

Most people's first video inference pipeline looks like this: cv2.VideoCapture to read frames, cv2.resize to scale them, then hand the result to a PyTorch model. It works. It's simple. But it's also leaving most of your hardware idle — the GPU sits waiting while the CPU decodes and resizes each frame one at a time.

FFmpeg + DALI — the 2022 state of the art

NVIDIA's Data Loading Library (DALI) was designed to fix exactly this. It uses dedicated NVDEC hardware inside the GPU to decode video, then runs resize and normalization on GPU cores — all without the frame ever touching CPU memory. The article I mentioned used this approach and got ~6× over OpenCV. DALI is powerful but opinionated: it has its own pipeline definition language and works best with MP4 files and multiple-video training scenarios.

PyNvVideoCodec 2.1 — the 2025 successor

Released in mid-2025 (with 2.1 dropping in January 2026), PyNvVideoCodec is NVIDIA's official replacement for VPF (Video Processing Framework). It gives you direct Python access to the NVDEC hardware with a much simpler API than DALI — just ThreadedDecoder, get_batch_frames(n), and you're done. The key feature is the ThreadedDecoder: it decodes in a background thread and pre-buffers frames so your inference loop never waits for decode to finish.

PyNvVideoCodec + CV-CUDA — fused GPU preprocessing

CV-CUDA is an open-source library (originally a collaboration between NVIDIA and ByteDance) that provides GPU-accelerated image processing operators — resize, normalize, color convert — as zero-copy operations. The idea is to keep the entire pipeline on GPU with no intermediate CPU copies. cvcuda.resize + cvcuda.convertto replace torch.nn.functional.interpolate with purpose-built CUDA kernels.

How I benchmarked fairly

Getting a fair comparison between these libraries took more iteration than I expected. The key challenge: they all have different APIs, and naively wrapping them produced highly misleading numbers.

My solution: pre-decode all frames once at setup, cache them in GPU memory, then benchmark only the preprocess → inference → postprocess loop. Decode time is measured separately as a single timed pass, so decoder warm-up and buffering behaviour never bleed into the other stages' numbers.
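The harness's timing pattern can be sketched as a small per-stage timer (the `StageTimer` name is mine). The important detail for GPU work is the `sync` hook: CUDA launches are asynchronous, so you pass `torch.cuda.synchronize` to drain pending work before each timestamp; the default no-op works for CPU stages.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StageTimer:
    """Accumulate wall-clock seconds per pipeline stage."""

    def __init__(self, sync=lambda: None):
        self.sync = sync          # e.g. torch.cuda.synchronize on GPU
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        self.sync()               # drain pending async work first
        t0 = time.perf_counter()
        yield
        self.sync()               # ...and again before stopping the clock
        self.totals[name] += time.perf_counter() - t0

timer = StageTimer()
with timer.stage("preprocess"):
    time.sleep(0.01)              # stand-in for real work
with timer.stage("inference"):
    time.sleep(0.02)
```

Without the synchronize calls, a GPU stage's time would be attributed to whichever later stage happens to block on the result.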

Methodology note

All pipelines used the same YOLOv8n model, same 640×640 input resolution, FP16 inference, on an NVIDIA A100-SXM4-40GB with PyTorch 2.6 + CUDA 12.4. Each pipeline ran 3 warmup iterations before 10 timed benchmark runs, processing 1024 frames per run. The full sweep covers batch sizes 16, 32, 64, and 128. CV-CUDA fails at batch 128 with a tensor size limit and is excluded from that column.

Results

OpenCV CPU: 140 FPS · FFmpeg + DALI: 383 FPS · PyNvVideoCodec 2.1: 405 FPS · PyNvVC + CV-CUDA: 451 FPS

All figures at batch size 16, 1024 frames, A100-SXM4-40GB.

Fig 1 — End-to-end throughput at batch 16 (FPS) · higher is better

At batch 16, CV-CUDA leads at 451 FPS, followed by PyNvVideoCodec at 405 FPS, DALI at 383 FPS, and OpenCV at 140 FPS. But the more interesting story is how these numbers shift as batch size grows — which we'll get to after the stage breakdown.

Fig 2 — Per-stage latency breakdown at batch 16 (avg over 10 runs)

This is the most important chart. OpenCV CPU spends 2.47s on preprocess — resizing and colour-converting 1024 frames on CPU. Every GPU pipeline collapses that to around 0.086–0.107s. That single change — moving preprocess off the CPU — is where essentially all of the 3× speedup comes from. Decode times are within ~0.3s of each other. Inference across the three GPU pipelines is nearly identical (0.73–0.91s) once the GPU isn't being starved — OpenCV's 3.07s inference is inflated because the CPU bottleneck forces many small, inefficient kernel launches.

CV-CUDA's 0.086s preprocess vs DALI/PyNvVC's 0.107s looks small in absolute terms, but it compounds: CV-CUDA also gets faster inference (0.728s vs 0.910s), likely because its zero-copy tensor output lands in a more cache-friendly layout for the model input.

Fig 3 — Speedup vs OpenCV CPU baseline (batch 16)
Fig 4 — FPS across all batch sizes · all 4 pipelines

The batch sweep tells the complete story. A few things stand out:

CV-CUDA wins at every batch size it supports, and the gap widens at batch 64 — 459 FPS vs PyNvVideoCodec's 427 FPS, a 7.5% advantage. At batch 128 CV-CUDA fails outright with a tensor-size error (details below), making PyNvVideoCodec the only option at that scale.

OpenCV actually improves slightly at larger batch sizes (140 → 149 FPS from batch 16 to 128). This is because larger batches amortise the per-batch inference overhead, and the GPU utilisation improves. The CPU preprocess bottleneck stays roughly constant, but inference becomes slightly more efficient.

DALI is consistently third, sitting between OpenCV and PyNvVideoCodec across all batch sizes. Its preprocess and inference times are actually identical to PyNvVideoCodec's — the gap comes entirely from slower decode (1.64s vs 1.50s at batch 16), likely due to DALI's internal pipeline orchestration overhead around the NVDEC path.

PyNvVideoCodec peaks at batch 64 (427 FPS) and actually dips slightly to 420 FPS at batch 128 — the only GPU pipeline to regress at larger batch sizes. This may be a memory pressure effect at batch 128 that doesn't affect DALI's simpler memory management.

| Pipeline | B16 FPS | B32 FPS | B64 FPS | B128 FPS |
|---|---|---|---|---|
| OpenCV CPU | 140.2 | 141.9 | 144.4 | 148.8 |
| FFmpeg + DALI | 382.8 | 387.7 | 391.3 | 396.7 |
| PyNvVideoCodec 2.1 | 405.0 | 414.0 | 427.3 | 420.4 |
| PyNvVC + CV-CUDA | 451.1 | 455.0 | 459.3 | ❌ crash |
| Pipeline | Decode (s) | Preprocess (s) | Inference (s) | Total B16 (s) |
|---|---|---|---|---|
| OpenCV CPU | 1.752 | 2.471 | 3.065 | 7.303 |
| FFmpeg + DALI | 1.644 | 0.107 | 0.910 | 2.675 |
| PyNvVideoCodec 2.1 | 1.497 | 0.107 | 0.910 | 2.528 |
| PyNvVC + CV-CUDA | 1.444 | 0.086 | 0.728 | 2.270 |
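The headline FPS and speedup figures follow directly from these totals. A quick sanity check (the totals are rounded to milliseconds, so recomputed FPS can differ from the FPS table in the last digit):

```python
FRAMES = 1024  # frames per run

total_s = {  # end-to-end seconds at batch 16, from the table above
    "OpenCV CPU": 7.303,
    "FFmpeg + DALI": 2.675,
    "PyNvVideoCodec 2.1": 2.528,
    "PyNvVC + CV-CUDA": 2.270,
}

fps = {name: FRAMES / s for name, s in total_s.items()}
speedup = {name: total_s["OpenCV CPU"] / s for name, s in total_s.items()}

print({k: round(v, 1) for k, v in fps.items()})
print({k: round(v, 2) for k, v in speedup.items()})
```

This reproduces the 140.2 and 451.1 FPS endpoints, and a roughly 2.7–3.2× spread across the GPU pipelines.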

What I learned

Preprocess is the real CPU bottleneck — not inference

Before running this benchmark, I assumed inference would dominate. It doesn't. cv2.resize + cv2.cvtColor for 1024 frames takes 2.47 seconds on CPU. The same operation on GPU takes around 86–107 milliseconds depending on the pipeline — a 23–29× reduction — and that single change accounts for essentially all of the end-to-end speedup.
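That reduction factor is just the ratio of the preprocess times from the stage table:

```python
cpu_preprocess = 2.471                      # OpenCV CPU, seconds per 1024 frames
gpu_preprocess = {"DALI / PyNvVC": 0.107,   # torch interpolate path
                  "CV-CUDA": 0.086}         # fused resize + normalize

reduction = {k: cpu_preprocess / v for k, v in gpu_preprocess.items()}
# roughly 23x for the torch path and 29x for CV-CUDA
```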

CV-CUDA is the fastest — if your batch size is under 128

CV-CUDA wins at every batch size it supports: 451 FPS at batch 16, 455 at batch 32, 459 at batch 64. The gap over plain PyNvVideoCodec widens as batch size grows — 7.5% faster at batch 64 (459 vs 427 FPS). The advantage comes from CV-CUDA's fused resize + normalize kernel, which avoids multiple CUDA kernel launches and produces a zero-copy tensor that feeds directly into the model with better cache locality.

At batch 128, CV-CUDA throws NVCV_ERROR_INVALID_ARGUMENT: Input or output tensors are too large and fails entirely. If your production workload runs at batch 128 or above, PyNvVideoCodec with torch.nn.functional.interpolate is your only option.

PyNvVideoCodec 2.1 is the safest default for new projects

DALI is powerful but has a steep learning curve — its pipeline DSL is non-trivial, it's sensitive to video container formats (MOV files caused 180-second decode times in my testing due to seeking behavior), and it's overkill for single-video inference. PyNvVideoCodec 2.1 gives you direct NVDEC access via a pip install, a simple Python API, works at all batch sizes, and beats DALI at every batch size in this benchmark.

Gotcha I hit

PyNvVideoCodec's ThreadedDecoder documentation online doesn't match the installed API on clusters. The actual interface is get_batch_frames(n) + end(), not an iterator. Always introspect with dir(ThreadedDecoder) on your target machine before writing code.
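A small stdlib helper makes that introspection step repeatable; it works on any class, including the installed ThreadedDecoder (the `public_api` name is mine):

```python
import inspect

def public_api(cls):
    """Map each public callable of `cls` to its signature, with a fallback
    string when the binding doesn't expose one (common for C extensions)."""
    api = {}
    for name, member in inspect.getmembers(cls, callable):
        if name.startswith("_"):
            continue
        try:
            api[name] = str(inspect.signature(member))
        except (TypeError, ValueError):
            api[name] = "(signature unavailable)"
    return api

# Demonstrated on a stdlib class; on the cluster you'd call
# public_api(PyNvVideoCodec.ThreadedDecoder) instead.
import threading
print(sorted(public_api(threading.Event)))
```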

Inference is the new bottleneck

Once decode and preprocess move to GPU, inference dominates at 0.73–0.91s per 1024 frames across all GPU pipelines. The next logical optimisation is TensorRT — exporting YOLOv8n to a TensorRT FP16 engine typically halves inference time on A100, which would push these pipelines toward 600–800 FPS. That's the remaining 2× sitting on the table.

What's next

One thing not tested here: pinning decode and preprocessing to the same explicit CUDA stream. In the current setup there's implicit synchronisation between NVDEC output and CV-CUDA ops, which likely costs a few milliseconds per batch. Explicit stream management is on the list for a follow-up, along with TensorRT inference timing.

The key code pattern

Here's the core of the PyNvVideoCodec pipeline — the part that actually matters:

```python
# Setup: decode once, cache in GPU memory
from PyNvVideoCodec import ThreadedDecoder, OutputColorType

dec = ThreadedDecoder(
    video_path,
    1024 * 2,         # buffer_size — required positional arg
    gpu_id=0,
    output_color_type=OutputColorType.RGB,  # interleaved HWC
)
all_frames = dec.get_batch_frames(1024)
dec.end()

# Benchmark loop: preprocess → infer → postprocess
for batch_hwc in batches:  # cached GPU frames, split into fixed-size batches
    # [B, H, W, C] uint8 → [B, C, H, W] float32
    f = batch_hwc.float().div(255.0).permute(0, 3, 1, 2).contiguous()
    f = torch.nn.functional.interpolate(
        f, size=(640, 640), mode="bilinear", align_corners=False
    )
    results = model(f, verbose=False, half=True)
```

And the CV-CUDA variant, which replaces the preprocess step:

```python
import cvcuda

N = batch_hwc.shape[0]                               # batch size
nhwc    = batch_hwc.contiguous()
ct      = cvcuda.as_tensor(nhwc, "NHWC")             # zero-copy
resized = cvcuda.resize(ct, (N, 640, 640, 3), cvcuda.Interp.LINEAR)
normed  = cvcuda.convertto(resized, cvcuda.Type.F32, scale=1.0 / 255.0)
final   = torch.as_tensor(normed.cuda()).permute(0, 3, 1, 2).contiguous()
```

Summary

TL;DR

Moving preprocess from CPU to GPU gives you the bulk of the speedup (23× on preprocess alone). PyNvVideoCodec 2.1 is the most practical choice for 2025/2026 — simpler API than DALI, faster decode, pip-installable. CV-CUDA is the faster preprocessor, with a 7.5% edge over PyNvVideoCodec at batch 64. At batch 128 CV-CUDA hits a tensor size limit, so PyNvVideoCodec is the only option at that scale. All GPU pipelines are 2.7–3.2× faster than OpenCV CPU end-to-end on A100.

If you're starting a video inference project today: skip OpenCV for anything performance-sensitive, use PyNvVideoCodec 2.1 as your decoder, and add CV-CUDA for preprocessing as long as your batch size stays below 128. And if you need more throughput, TensorRT inference is the remaining 2× sitting on the table.

All benchmark code is available here — the harness handles pre-decode caching, per-stage timing with CUDA synchronization, CSV export, and chart generation automatically.