What does video object detection actually do?

The tool samples your video at regular intervals (1 to 10 frames per second), runs each sampled frame through an AI model, and returns three things for every detected object: its class (e.g. person, dog, car), its confidence (0 to 1), and its bounding box in pixel coordinates (xyxy: left, top, right, bottom). You don't get a cut video file; you get structured data ready for further analysis.

Which object classes does the tool recognize?

Eighty everyday classes — people, animals (dog, cat, bird, horse, …), vehicles (car, bicycle, motorcycle, bus, …), furniture, sports gear, kitchen items, electronics. You can filter the class list before analysis so only the ones you care about are counted and drawn.

Are my videos uploaded?

No. The analysis runs entirely in your browser. Neither the video nor the computed boxes or classes ever leave your device. Only the AI model is fetched once on first use (about 9 MB for the fast variant, about 43 MB for the accurate one) — that download carries no video data, just the model file.

How do the model variants differ?

The fast variant (~9 MB) prioritizes first-inference speed and fits comfortably on mobile devices with limited memory. The accurate variant (~43 MB) gives noticeably tighter boxes and higher confidence, but takes roughly three times as long per frame. Recommended workflow: first run with the fast variant to verify the classes work for your footage, then re-run with the accurate variant for the final pass.

What does the confidence threshold mean?

Every detection carries a score between 0 and 1 — how sure the model is. 0.5 is a sensible default. Lower (e.g. 0.3) finds more objects but yields more false positives. Higher (e.g. 0.8) shows only very confident detections but misses small or partially occluded objects. Use 0.5 to 0.6 for statistics, 0.7 to 0.85 for visual review.

How long does the analysis take?

Three factors: video length, sample rate, and model variant. Example: a 5-minute video sampled at 1 fps with the fast variant takes about 100 seconds on a mid-range laptop. The same file with the accurate variant takes about 5 minutes. At 10 fps there are ten times as many frames to process, so the time grows roughly tenfold. The estimate appears in the status line after the model finishes loading.

What is the heatmap for?

The heatmap PNG aggregates the centers of every detection across the entire video onto a pixel map at the original resolution. You see at a glance where detections concentrate, which is valuable for sports analysis, traffic studies, or picking crops for video editing. Bright cells mean many detections, dark cells mean few.
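The aggregation step can be sketched in a few lines of Python. This is only an illustration of counting box centers per grid cell, not the tool's actual code: the function name `accumulate_heatmap`, the `cell` size parameter, and the plain list-of-lists grid are all assumptions made for the example.

```python
def accumulate_heatmap(boxes, width, height, cell=32):
    """Count box centers per grid cell (hypothetical helper, not the tool's API).

    boxes  : iterable of (x1, y1, x2, y2) tuples in source-video pixels
    width  : video width in pixels
    height : video height in pixels
    cell   : assumed cell edge length in pixels
    """
    cols = -(-width // cell)   # ceiling division
    rows = -(-height // cell)
    grid = [[0] * cols for _ in range(rows)]
    for x1, y1, x2, y2 in boxes:
        cx = (x1 + x2) / 2
        cy = (y1 + y2) / 2
        # Clamp so a center on the right or bottom edge still lands in the grid.
        col = min(max(int(cx // cell), 0), cols - 1)
        row = min(max(int(cy // cell), 0), rows - 1)
        grid[row][col] += 1
    return grid


boxes = [(0, 0, 20, 20), (4, 4, 24, 24), (100, 100, 120, 120)]
grid = accumulate_heatmap(boxes, width=128, height=128, cell=32)
print(grid[0][0], grid[3][3])
```

Cells with higher counts would be drawn brighter. Mapping counts to colors is left out here because the source does not describe the palette.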

What are the JSON, CSV, and SVG exports good for?

The **JSON file** holds the complete detection list per frame, with timestamps, class, confidence, and xyxy box — ready for Python, JavaScript, or a spreadsheet. The **CSV file** flattens the same to one row per detection — ideal for pivot tables in [Microsoft Excel](https://www.microsoft.com/en-us/microsoft-365/excel) or [Google Sheets](https://www.google.com/sheets/about/). The **SVG bundle** stacks one block per keyframe with the boxes drawn over the frame — great for reports and visual spot-checks.
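To make the SVG layout more concrete, here is a rough Python sketch that builds one block for a single keyframe. It assumes each block is an SVG `<g>` group containing one `<rect>` and one `<text>` per detection. The function name `frame_to_svg_group`, the dictionary keys, and the styling attributes are hypothetical and may differ from what the tool writes.

```python
from xml.sax.saxutils import escape


def frame_to_svg_group(frame_index, detections):
    """Build one <g> block for a keyframe (illustrative layout only)."""
    parts = [f'<g id="frame-{frame_index}">']
    for d in detections:
        w = d["x2"] - d["x1"]
        h = d["y2"] - d["y1"]
        # Label text: class name plus confidence as a percentage.
        caption = escape(f'{d["label"]} {d["score"]:.0%}')
        parts.append(
            f'  <rect x="{d["x1"]}" y="{d["y1"]}" width="{w}" height="{h}" '
            f'fill="none" stroke="red"/>'
        )
        parts.append(f'  <text x="{d["x1"]}" y="{d["y1"]}">{caption}</text>')
    parts.append("</g>")
    return "\n".join(parts)


group = frame_to_svg_group(
    0, [{"label": "person", "score": 0.91, "x1": 120, "y1": 80, "x2": 260, "y2": 400}]
)
print(group)
```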

Video Object Detection — AI Boxes Offline in Browser

What does video object detection do?

Video object detection samples your video at regular intervals and runs each sampled frame through a specialized neural network for object detection. For every detected object you get its class (such as person, dog, or car), its confidence between 0 and 1, and its pixel position as a bounding box (xyxy: left, top, right, bottom). The result is not a cut video — it’s structured data ready for statistics, analytics, visualization, or as input to downstream workflows.
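As a small illustration of the xyxy convention, one detection could be modelled in Python like this. The class name `Detection` and its field names are assumptions for the example, not the tool's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object; field names are illustrative only."""
    label: str    # class name, e.g. "person"
    score: float  # confidence between 0 and 1
    x1: float     # left edge in source-video pixels
    y1: float     # top edge
    x2: float     # right edge
    y2: float     # bottom edge

    @property
    def width(self) -> float:
        return self.x2 - self.x1

    @property
    def height(self) -> float:
        return self.y2 - self.y1

    @property
    def center(self) -> tuple[float, float]:
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)


det = Detection("dog", 0.87, x1=100, y1=50, x2=300, y2=250)
print(det.width, det.height, det.center)
```

Width, height, and center are derived from the four corner values, so the xyxy form loses nothing compared to an x, y, width, height representation.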

The tool runs entirely inside your browser tab via WebAssembly or WebGPU. No video data is sent to a server. Only the AI model is fetched once on first use and cached in your browser; subsequent videos run fully offline.

Which object classes are recognized?

Eighty everyday classes. They cover the categories that appear most often in normal footage:

Living things: person, bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe.
Vehicles: car, bicycle, motorcycle, bus, train, truck, boat, airplane.
Street furniture and signage: traffic light, fire hydrant, stop sign, parking meter, bench.
Sports and leisure: frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket.
Bags, clothing, accessories: backpack, umbrella, handbag, tie, suitcase.
Kitchen: bottle, wine glass, cup, fork, knife, spoon, bowl, microwave, oven, toaster, sink, refrigerator.
Food: banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake.
Furniture: chair, couch, potted plant, bed, dining table, toilet.
Electronics: TV, laptop, mouse, remote, keyboard, cell phone.
Other indoor: book, clock, vase, scissors, teddy bear, hair dryer, toothbrush.

Before analysis you filter the list by clicking the class pills. By default all 80 classes are active — if you only need people and dogs, click “Clear all” and re-activate the two you want. Filtering shortens the result list and makes the exported data immediately useful.

How does the frame-by-frame analysis work?

The tool decodes your video using your browser’s native WebCodecs APIs. The decoder seeks along the video’s actual timeline to the chosen sample points: once per second at 1 fps, ten times per second at 10 fps. Each extracted frame is converted to an internal image representation and handed to the loaded AI model.
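The sample points follow directly from the video length and the sample rate. The Python sketch below shows the arithmetic. It assumes sampling starts at 0 seconds and stops before the end of the video, which the source does not specify, and the function name `sample_timestamps` is made up for the example.

```python
import math


def sample_timestamps(duration_s: float, sample_fps: float) -> list[float]:
    """Timestamps (seconds) of the sampled frames; hypothetical helper.

    Assumes the first sample is at t = 0 and the last one falls before
    the end of the video.
    """
    if duration_s < 0 or sample_fps <= 0:
        raise ValueError("duration must be >= 0 and sample_fps must be > 0")
    # Small epsilon guards against float results like 28.999999999999996.
    count = math.floor(duration_s * sample_fps + 1e-9)
    return [i / sample_fps for i in range(count)]


print(sample_timestamps(3, 1))
print(len(sample_timestamps(300, 1)), len(sample_timestamps(300, 10)))
```

Under these assumptions a 5-minute video yields 300 sampled frames at 1 fps and 3000 at 10 fps.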

The model performs classic bounding-box detection: it tiles the frame into many regions, estimates a class probability for each, and returns the regions whose best class score is above the threshold you set. The boxes are in the source video’s pixel coordinates, so they fit your video 1:1 with no need to denormalize.

During processing the browser shows a progress bar and a streaming frame list. On the accurate variant a single frame can take about a second for large videos; on the fast variant it’s more like 300 to 500 milliseconds. You can stop the analysis at any time with the “Cancel” button — the data for already-processed frames is kept and stays exportable.

What does the class filter do before analysis?

You set the class filter before the run, not afterwards. Two benefits:

First, the model still scans for all 80 classes, but only those you care about land in the result stream. This shrinks the exported file and makes the JSON/CSV directly meaningful — no second filter pass in a spreadsheet.

Second, it simplifies the heatmap. If you only care about people, you don’t want a heatmap dominated by chairs and tables. With the filter on, the heatmap shows only the positions of the chosen classes — perfect for crowd heatmaps or movement studies.

Classic use cases: just “person” for crowd tracking; “person, dog” for dog-walk paths; “car, truck, bus, motorcycle” for traffic flow; “bird, sports ball” for wildlife or sports clips.
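Conceptually, the class filter is a simple set-membership test on each detection's label. The Python sketch below illustrates the idea on a list of dictionaries. The function name `filter_by_class` and the dictionary layout are assumptions for the example; the tool itself applies the filter in the browser before anything is exported.

```python
def filter_by_class(detections, wanted):
    """Keep only detections whose label is in the wanted set (illustrative)."""
    wanted_set = set(wanted)
    return [d for d in detections if d["label"] in wanted_set]


detections = [
    {"label": "person", "score": 0.91},
    {"label": "chair", "score": 0.66},
    {"label": "dog", "score": 0.74},
    {"label": "dining table", "score": 0.58},
]

kept = filter_by_class(detections, ["person", "dog"])
print([d["label"] for d in kept])
```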

What exports are available?

Four formats, one click each:

JSON: a structured file with frame list, timestamps, class, confidence, and box. Ready for Python (pandas.read_json), JavaScript (JSON.parse), or any analytics pipeline. The primary export for your own analysis.
CSV: one row per detection with a fixed header order: frame_index, timestamp_s, class_id, label, score, x1, y1, x2, y2. Ideal for pivot tables in Microsoft Excel, LibreOffice Calc, or Google Sheets.
SVG: a vector file with one <g> block per keyframe; each block draws the boxes with class labels and confidence percentages. Crisp at any zoom, perfect for reports and visual spot-checks.
Heatmap PNG: a pixel map at the source-video resolution with the box centers of every detection (or only the filtered classes) plotted as a density. Bright cells mean many detections, dark cells mean few. Useful for sports clips, traffic analysis, or picking the perfect crop in a video editor.
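As an example of working with the CSV export outside a spreadsheet, the following Python sketch counts detections per class using only the standard library. The header row matches the column order listed above. The sample rows and the `class_id` values are invented for illustration, and `count_labels` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Made-up sample data using the documented header order.
SAMPLE_CSV = """frame_index,timestamp_s,class_id,label,score,x1,y1,x2,y2
0,0.0,0,person,0.91,120,80,260,400
0,0.0,16,dog,0.64,300,310,420,400
1,1.0,0,person,0.88,130,82,268,402
"""


def count_labels(csv_text: str) -> Counter:
    """Count how many detections each class label has."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["label"] for row in reader)


counts = count_labels(SAMPLE_CSV)
print(counts.most_common())
```

For a real export, the text would come from reading the downloaded file instead of the inline sample string.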

How does confidence work and how do I pick the right threshold?

Every detection has a confidence score between 0 and 1. At 0.5 the model is moderately sure; at 0.9 it is very sure. The threshold you set before the run drops every detection that scores below it, so those detections appear neither in the stream nor in the exports.

Recommendation: start with the default 0.5. If the result shows many false positives (furniture as people, shadows as animals), raise to 0.7. If you’re sure objects are present but they don’t appear, drop to 0.4 or 0.35.
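One way to get a feel for the threshold is to export once at a low setting and then count how many detections would survive stricter values. The Python sketch below does this for a list of scores. The function name `count_at_thresholds` is hypothetical, the scores are invented, and the sketch assumes a detection is kept when its score is at or above the threshold.

```python
def count_at_thresholds(scores, thresholds):
    """Map each threshold to the number of scores at or above it."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


# Invented scores, e.g. the "score" column of an export made at a low threshold.
scores = [0.32, 0.41, 0.55, 0.71, 0.93]
survivors = count_at_thresholds(scores, [0.35, 0.5, 0.7])
print(survivors)
```

A sharp drop between two neighbouring thresholds suggests that many detections sit in that score range and are worth a visual spot-check before settling on a value.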

Important: confidence is not a probability in the strict statistical sense — it’s a model-internal score. For critical applications such as security or legal review, validate the results with a manual spot-check.

How fast is it on my device?

Three factors govern runtime: video length, sample rate, and model variant.

5-minute video, 1 fps, fast variant: about 100 seconds on a current laptop. Most users start here.
5-minute video, 1 fps, accurate variant: about 5 minutes. Worth it when the fast variant misses too much in your footage.
5-minute video, 10 fps, fast variant: about 15 minutes. Useful for motion studies or sports clips where sub-second motion matters.
Phone browser: about three times slower than a laptop. For larger videos prefer desktop.
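The figures above follow a simple rule of thumb: number of sampled frames times seconds per frame. The Python sketch below reproduces that arithmetic. The per-frame times are back-derived from the quoted laptop figures (100 seconds and 5 minutes for 300 frames) and are assumptions, not measurements; `estimate_runtime_s` is a hypothetical helper.

```python
def estimate_runtime_s(duration_s: float, sample_fps: float,
                       seconds_per_frame: float) -> float:
    """Rough runtime estimate: sampled frames times per-frame cost."""
    frames = duration_s * sample_fps
    return frames * seconds_per_frame


# Assumed per-frame costs, derived from the laptop figures quoted above.
FAST_S_PER_FRAME = 100 / 300      # about 0.33 s
ACCURATE_S_PER_FRAME = 300 / 300  # about 1 s

fast_1fps = estimate_runtime_s(300, 1, FAST_S_PER_FRAME)
accurate_1fps = estimate_runtime_s(300, 1, ACCURATE_S_PER_FRAME)
fast_10fps = estimate_runtime_s(300, 10, FAST_S_PER_FRAME)

print(f"fast, 1 fps: {fast_1fps:.0f} s")
print(f"accurate, 1 fps: {accurate_1fps / 60:.0f} min")
print(f"fast, 10 fps: {fast_10fps / 60:.0f} min")
```

The 10 fps estimate comes out slightly above the quoted 15 minutes, which is within the spread you would expect from per-frame times of 300 to 500 milliseconds.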

The estimate appears in the status line after the model finishes loading. If the run is taking too long, click “Cancel” any time — the data for already-processed frames remains and stays exportable.

How private is this?

Everything runs on your device. There is no upload, no server component, no cloud inference. That’s a meaningful difference compared to many commercial offerings where the video is uploaded to a US server, analyzed there, and the result returned. Even without active tracking, an upload like that takes control of the data out of your hands.

Here the video stays in the browser tab. Closing the tab frees the memory and the video is gone. The only network connection the tool makes is the one-time model download on first use — afterwards the tool runs offline.

That’s GDPR-compatible and works for confidential or legally sensitive scenarios — kids’ sports clips, business videos with people, medical or security footage.

Possible use cases

Sports analysis: how many players are visible over time, heatmap of player positions, count of ball sightings.
Traffic flow: vehicles per second, heatmap of traffic nodes.
Crowd counting: number of people per frame as a CSV time series.
Video editing: heatmap as a crop guide to find the perfect 9:16 cut for social media.
Research: animal observations with timestamps instead of manual annotation.
Content review: a class list of every category appearing in a video before publishing.
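As an illustration of the crowd-counting use case, the Python sketch below turns CSV rows into a people-per-frame time series. It relies on the documented CSV header; the sample rows are invented and `people_per_frame` is a hypothetical helper name. Because the CSV holds one row per detection, frames without any matching detection simply do not appear in the result.

```python
import csv
import io
from collections import Counter

# Made-up sample data using the documented header order.
SAMPLE_CSV = """frame_index,timestamp_s,class_id,label,score,x1,y1,x2,y2
0,0.0,0,person,0.91,120,80,260,400
0,0.0,0,person,0.77,400,90,520,410
0,0.0,16,dog,0.64,300,310,420,400
2,2.0,0,person,0.83,125,85,262,398
"""


def people_per_frame(csv_text: str, label: str = "person"):
    """Return sorted (timestamp_s, count) pairs for one class label."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["label"] == label:
            counts[float(row["timestamp_s"])] += 1
    return sorted(counts.items())


series = people_per_frame(SAMPLE_CSV)
print(series)
```

If gaps matter for your analysis, fill in zero counts for the missing timestamps using the sample rate you chose for the run.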


How It Works

Pick a video

Choose analysis settings

Start analysis and export
