How do you use this tool?
- Drop a video file in the dropzone or click to pick one (up to 500 MB; MP4, WebM, MOV, or MKV)
- Pick a sample rate — 1 fps saves time, 10 fps gives finer motion data
- Set the confidence threshold (default 0.5) and the class filter
- Start analysis — detections stream in frame-by-frame during processing
- Download JSON, CSV, SVG, or heatmap PNG
What does video object detection do?
Video object detection samples your video at regular intervals and runs each sampled frame through a specialized neural network for object detection. For every detected object you get its class (such as person, dog, or car), its confidence between 0 and 1, and its pixel position as a bounding box (xyxy: left, top, right, bottom). The result is not a cut video — it’s structured data ready for statistics, analytics, visualization, or as input to downstream workflows.
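To make the shape of that data concrete, here is a minimal sketch of a single detection as a Python record. The field names are borrowed from the CSV export's column names; the class itself and the helper properties are illustrative assumptions, not part of the tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object in one sampled frame (illustrative structure)."""
    frame_index: int    # index of the sampled frame
    timestamp_s: float  # position in the video, in seconds
    label: str          # class name, e.g. "person"
    score: float        # confidence between 0 and 1
    x1: float           # left edge, in source-video pixels
    y1: float           # top edge
    x2: float           # right edge
    y2: float           # bottom edge

    @property
    def width(self) -> float:
        return self.x2 - self.x1

    @property
    def height(self) -> float:
        return self.y2 - self.y1

    @property
    def center(self) -> tuple[float, float]:
        # Midpoint of the xyxy box
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)


# Made-up example values, for illustration only
example = Detection(frame_index=3, timestamp_s=3.0, label="dog",
                    score=0.82, x1=100, y1=50, x2=300, y2=250)
```

Because the box is stored as xyxy, width, height, and center fall out of simple subtraction and averaging.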
The tool runs entirely inside your browser tab via WebAssembly or WebGPU. No video data is sent to a server. Only the AI model is fetched once on first use and cached in your browser; subsequent videos run fully offline.
Which object classes are recognized?
Eighty everyday classes. They cover the categories that appear most often in normal footage:
- Living things: person, bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe.
- Vehicles: car, bicycle, motorcycle, bus, train, truck, boat, airplane.
- Street furniture and signage: traffic light, fire hydrant, stop sign, parking meter, bench.
- Sports and leisure: frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket.
- Bags, clothing, accessories: backpack, umbrella, handbag, tie, suitcase.
- Kitchen: bottle, wine glass, cup, fork, knife, spoon, bowl, microwave, oven, toaster, sink, refrigerator.
- Food: banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake.
- Furniture: chair, couch, potted plant, bed, dining table, toilet.
- Electronics: TV, laptop, mouse, remote, keyboard, cell phone.
- Other indoor: book, clock, vase, scissors, teddy bear, hair dryer, toothbrush.
Before analysis you filter the list by clicking the class pills. By default all 80 classes are active — if you only need people and dogs, click “Clear all” and re-activate the two you want. Filtering shortens the result list and makes the exported data immediately useful.
How does the frame-by-frame analysis work?
The tool decodes your video using your browser’s native WebCodecs APIs. The decoder seeks along the video’s timeline to the chosen sample points: once per second of footage at 1 fps, ten times per second at 10 fps. Each extracted frame is converted to an internal image representation and handed to the loaded AI model.
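The sample points can be pictured as evenly spaced timestamps. The sketch below assumes that sampling starts at 0 seconds and stops before the end of the video; the tool's exact behavior at the edges is not documented here, and the function name is hypothetical.

```python
import math


def sample_timestamps(duration_s: float, sample_fps: float) -> list[float]:
    """Evenly spaced sample times in seconds (assumes sampling starts at t=0)."""
    if duration_s < 0 or sample_fps <= 0:
        raise ValueError("duration must be >= 0 and sample rate must be > 0")
    # Small epsilon guards against floating-point products like 28.999999
    count = math.floor(duration_s * sample_fps + 1e-9)
    return [i / sample_fps for i in range(count)]
```

For a 5 minute video this yields 300 sample points at 1 fps and 3000 at 10 fps, which is why the higher rate takes roughly ten times as long.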
The model performs classic bounding-box detection: it tiles the frame into many regions, estimates a class probability for each, and returns the regions whose best class score reaches the threshold you set. The boxes are in the source video’s pixel coordinates, so they fit your video 1:1 with no need to denormalize.
During processing the browser shows a progress bar and a streaming frame list. On the accurate variant a single frame can take about a second for large videos; on the fast variant it’s more like 300 to 500 milliseconds. You can stop the analysis at any time with the “Cancel” button — the data for already-processed frames is kept and stays exportable.
What does the class filter do before analysis?
You set the class filter before the run, not afterwards. Two benefits:
First, the model still scans for all 80 classes, but only those you care about land in the result stream. This shrinks the exported file and makes the JSON/CSV directly meaningful — no second filter pass in a spreadsheet.
Second, it simplifies the heatmap. If you only care about people, you don’t want a heatmap dominated by chairs and tables. With the filter on, the heatmap shows only the positions of the chosen classes — perfect for crowd heatmaps or movement studies.
Classic use cases: just “person” for crowd tracking; “person, dog” for dog-walk paths; “car, truck, bus, motorcycle” for traffic flow; “bird, sports ball” for wildlife or sports clips.
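Conceptually, the class filter is a simple membership test on each detection's label. The following is a minimal sketch of that idea, assuming detections are plain dictionaries with a "label" key; it is not the tool's actual code.

```python
def filter_by_class(detections: list[dict], wanted: list[str]) -> list[dict]:
    """Keep only detections whose label is in the wanted set."""
    wanted_set = set(wanted)  # set lookup keeps the filter fast for long lists
    return [d for d in detections if d["label"] in wanted_set]


# Made-up detections, for illustration only
detections = [
    {"label": "person", "score": 0.91},
    {"label": "chair", "score": 0.64},
    {"label": "dog", "score": 0.78},
]
walkers = filter_by_class(detections, ["person", "dog"])
```

Here `walkers` holds the person and the dog, while the chair is dropped.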
What exports are available?
Four formats, one click each:
- JSON: a structured file with frame list, timestamps, class, confidence, and box. Ready for Python (pandas.read_json), JavaScript (JSON.parse), or any analytics pipeline. The primary export for your own analysis.
- CSV: one row per detection with a fixed header order: frame_index,timestamp_s,class_id,label,score,x1,y1,x2,y2. Ideal for pivot tables in Microsoft Excel, LibreOffice Calc, or Google Sheets.
- SVG: a vector file with one <g> block per keyframe; each block draws the boxes with class labels and confidence percentages. Crisp at any zoom, perfect for reports and visual spot-checks.
- Heatmap PNG: a pixel map at the source-video resolution with the box centers of every (or filtered) detection plotted as a density. Bright cells mean many detections, dark cells mean few. Useful for sports clips, traffic analysis, or picking the perfect crop in a video editor.
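The heatmap idea can be reproduced from the exported boxes. The sketch below counts box centers in a coarse grid; the cell size and function name are assumptions for illustration, and the tool's own rendering (resolution, smoothing, color mapping) may differ.

```python
def center_density(boxes, frame_width: int, frame_height: int, cell: int = 32):
    """Count box centers per grid cell; boxes are (x1, y1, x2, y2) in pixels."""
    cols = -(-frame_width // cell)   # ceiling division
    rows = -(-frame_height // cell)
    grid = [[0] * cols for _ in range(rows)]
    for x1, y1, x2, y2 in boxes:
        cx = (x1 + x2) / 2
        cy = (y1 + y2) / 2
        # Clamp so centers on or slightly outside the border land in an edge cell
        col = min(max(int(cx // cell), 0), cols - 1)
        row = min(max(int(cy // cell), 0), rows - 1)
        grid[row][col] += 1
    return grid


# Made-up boxes on a 128x128 frame, for illustration only
grid = center_density(
    [(0, 0, 10, 10), (0, 0, 20, 20), (100, 100, 120, 120)],
    frame_width=128, frame_height=128, cell=64,
)
```

Higher counts correspond to the bright cells of the heatmap, zeros to the dark ones.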
How does confidence work and how do I pick the right threshold?
Every detection has a confidence score between 0 and 1. A score around 0.5 means the model is moderately sure; around 0.9 it is very sure. The threshold you set before the run drops everything below it, so those detections appear in neither the stream nor the exports.
Recommendation: start with the default 0.5. If the result shows many false positives (furniture as people, shadows as animals), raise to 0.7. If you’re sure objects are present but they don’t appear, drop to 0.4 or 0.35.
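One way to judge a threshold is to count how many detections would survive at several candidate values. This is a minimal sketch on made-up scores; the function name is hypothetical. Note that an export only contains detections at or above the threshold used for the run, so this kind of sweep can only explore stricter values.

```python
def survivors_per_threshold(scores: list[float],
                            thresholds: list[float]) -> dict[float, int]:
    """For each candidate threshold, count the scores at or above it."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


# Made-up confidence scores from a run at the default threshold of 0.5
scores = [0.52, 0.61, 0.74, 0.90, 0.55]
counts = survivors_per_threshold(scores, [0.5, 0.7, 0.9])
```

A sharp drop between two neighboring thresholds suggests that many borderline detections sit in that range and deserve a visual spot-check.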
Important: confidence is not a probability in the strict statistical sense — it’s a model-internal score. For critical applications such as security or legal review, validate the results with a manual spot-check.
How fast is it on my device?
Three factors govern runtime: video length, sample rate, and model variant.
- 5 minute video, 1 fps, fast variant: about 100 seconds on a current laptop — most users start here.
- 5 minute video, 1 fps, accurate variant: about 5 minutes. Worth it when the fast variant misses too much in your footage.
- 5 minute video, 10 fps, fast variant: about 15 minutes. Useful for motion studies or sports clips where sub-second motion matters.
- Phone browser: about three times slower than a laptop. For larger videos prefer desktop.
The estimate appears in the status line after the model finishes loading. If the run is taking too long, click “Cancel” any time — the data for already-processed frames remains and stays exportable.
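The figures in the list above follow from simple arithmetic: number of sampled frames times time per frame. The sketch below is a back-of-the-envelope estimate, not the tool's own estimator; the per-frame times are assumptions chosen from the ranges quoted on this page (roughly 0.3 to 0.5 seconds for the fast variant, about 1 second for the accurate one).

```python
def estimate_runtime_s(duration_s: float, sample_fps: float,
                       seconds_per_frame: float) -> float:
    """Rough runtime: sampled frames multiplied by the time per frame."""
    frames = duration_s * sample_fps
    return frames * seconds_per_frame


five_minutes = 5 * 60
fast_1fps = estimate_runtime_s(five_minutes, 1, 1 / 3)     # about 100 seconds
accurate_1fps = estimate_runtime_s(five_minutes, 1, 1.0)   # about 5 minutes
fast_10fps = estimate_runtime_s(five_minutes, 10, 0.3)     # about 15 minutes
```

Actual runtimes depend on the device, the video resolution, and whether WebGPU is available, so treat these numbers as orders of magnitude.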
How private is this?
Everything runs on your device. There is no upload, no server component, no cloud inference. That’s a meaningful difference compared to many commercial offerings where the video is uploaded to a US server, analyzed there, and the result returned. Even without active tracking, you give up control over your data once it is uploaded.
Here the video stays in the browser tab. Closing the tab frees the memory and the video is gone. The only network connection the tool makes is the one-time model download on first use — afterwards the tool runs offline.
That’s GDPR-compatible and works for confidential or legally sensitive scenarios — kids’ sports clips, business videos with people, medical or security footage.
Possible use cases
- Sports analysis: how many players are visible over time, heatmap of player positions, count of ball sightings.
- Traffic flow: vehicles per second, heatmap of traffic nodes.
- Crowd counting: number of people per frame as a CSV time series.
- Video editing: heatmap as a crop guide to find the perfect 9:16 cut for social media.
- Research: animal observations with timestamps instead of manual annotation.
- Content review: a class list of every category appearing in a video before publishing.
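As an example of the crowd-counting case, the CSV export can be turned into a people-per-frame time series with the Python standard library. The sketch below uses the documented header order; the data rows, including the class_id values, are invented for illustration, and the function name is hypothetical.

```python
import csv
import io
from collections import Counter

# Header as documented for the CSV export; the rows are made-up sample data
SAMPLE_CSV = """frame_index,timestamp_s,class_id,label,score,x1,y1,x2,y2
0,0.0,0,person,0.91,10,20,110,220
0,0.0,0,person,0.77,300,40,380,210
0,0.0,16,dog,0.66,200,150,260,210
1,1.0,0,person,0.88,15,22,112,221
"""


def count_per_frame(csv_text: str, label: str) -> list[tuple[float, int]]:
    """Return (timestamp_s, count) pairs for one label, sorted by time."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["label"] == label:
            counts[float(row["timestamp_s"])] += 1
    return sorted(counts.items())


people = count_per_frame(SAMPLE_CSV, "person")
```

Frames without any matching detection produce no rows in a one-row-per-detection file, so they are missing from the series and need to be filled in as zero if a gap-free timeline is required.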