This February release introduces major improvements across DeepVA, with a strong focus on live transcription quality, latency reduction, and more predictable AI output. From ultra-low-latency roll-up captions in the Deep Live Hub to structured Visual Understanding in the Deep Media Analyzer, this update lays the groundwork for more workflow automation and lower latency.
Earlier this month, we introduced single sign-on, unifying the DeepVA platform and the Deep Live Hub under a single login and bringing the two platforms closer together. With your DeepVA account, you can now access the Deep Live Hub and vice versa. This lets you run quick tests, evaluate our live system, or quickly build a POC, since the platform is already included in your subscription. Try it out!
Deep Live Hub Updates
Now is a great time to try out our Deep Live Hub: we have introduced numerous updates that improve working with custom dictionaries, decrease latency, add rolling subtitles, and enhance the model's overall quality.
Dictionaries & “Sounds Like” – Smarter ASR Assistance
Dictionaries act as an additional knowledge source for the ASR engine whenever the model's transcription confidence is low. With this release, dictionary handling has been significantly expanded through the introduction of “Sounds Like” entries.
“Sounds Like” entries allow you to define alternative phonetic representations of a dictionary term, helping the ASR engine recognize how a word might be spoken while enforcing how it should appear in the transcript. This is especially useful for:
- Names from foreign languages
- Brand or product names, abbreviations, and acronyms
- Technical or domain-specific terminology
Dictionary management remains available via the Dictionaries & Glossaries section and supports manual editing as well as bulk CSV imports with “Sounds Like” definitions. In the future, we also plan to offer a way to compile dictionaries with an LLM. See the full documentation here.
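To give a feel for what a bulk import could look like, here is a minimal Python sketch; the column names and the variant separator are assumptions for illustration, not the documented import schema:

```python
import csv

# Hypothetical CSV layout for a bulk dictionary import: one dictionary
# term per row, plus one or more phonetic "Sounds Like" variants.
# Column names are illustrative -- check the DeepVA documentation for
# the exact schema expected by the importer.
entries = [
    # (term as it should appear in the transcript, spoken variants the ASR might hear)
    ("DeepVA", ["deep v a", "deep vah"]),
    ("Qwen 3", ["kwen three", "chwen three"]),
    ("HLS",    ["h l s"]),
]

with open("dictionary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["term", "sounds_like"])
    for term, variants in entries:
        # Multiple variants joined with ";" -- an assumed separator,
        # not necessarily the one the importer expects.
        writer.writerow([term, ";".join(variants)])
```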
Roll-Up Captions – Ultra-Low Latency Live Subtitles
This release introduces roll-up captions as a feature in the Deep Live Hub, dramatically reducing latency for live subtitles while maintaining readability. Roll-up captions can now be configured directly in the ASR settings with 1, 2, 3, or 4‑second roll-up intervals for the HLS output.
Unlike traditional live subtitles, which only appear after a full sentence has been completed, roll-up captions stream spoken content word by word, almost simultaneously with the speaker. As new words arrive, the current subtitle line grows dynamically, while completed sentences smoothly move upward.
Roll-up captions are now available as an HLS output, making them suitable for broadcast and professional live workflows. One example use case is integration with Steam Engineering’s SDI Teletext inserter, enabling near real-time subtitle insertion for linear broadcast environments. Roll-up captions are not yet supported for other output formats or in the live editor.
By combining faster ASR processing with rolling caption output, Deep Live Hub can now display spoken content almost in sync with live speech, significantly improving accessibility and viewer experience.
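To illustrate the roll-up behavior described above, here is a toy Python model of a two-line roll-up window; this is purely a sketch of the display mechanics, not our actual rendering pipeline:

```python
# Toy model of a two-line roll-up caption window (illustrative only):
# incoming words extend the bottom line; when a sentence ends, the
# completed line rolls upward and a fresh bottom line begins.
class RollUpWindow:
    def __init__(self, lines: int = 2):
        self.lines = [""] * lines

    def add_word(self, word: str) -> None:
        # Append each word to the current (bottom) line as it arrives.
        sep = " " if self.lines[-1] else ""
        self.lines[-1] += sep + word
        if word.endswith((".", "!", "?")):
            # Sentence finished: scroll lines up and open a new one.
            self.lines = self.lines[1:] + [""]

    def render(self) -> str:
        return "\n".join(self.lines)

window = RollUpWindow()
for w in "Roll-up captions stream word by word. New sentences scroll up.".split():
    window.add_word(w)
    # In a live player, each update would be redrawn immediately.
print(window.render())
```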
Deep Live Hub — Further Improvements
- Updated ASR Engine: Improved quality and speed (now ultra-low latency)
- Improved stability during live transcription
- Minor bug fixes
DeepVA Platform Updates
Our DeepVA platform has also been updated: all Visual Understanding updates that were previously only available via the API are now included in the UI. Many new models have also been added for our API customers to evaluate.
Visual Understanding – Structured Output & Expanded Model Choice
The Visual Understanding module now supports structured output in the UI as well, making AI-driven image and video analysis more predictable and easier to integrate into workflows. Already available via the API for two months, structured output can now be enabled directly in the UI when configuring the module.
Instead of receiving free-form responses, you can now define a JSON schema that the Vision Language Model (VLM) must follow, combining prompt flexibility with reliable, machine-readable results. Additionally, new VLMs have been added, including a larger Qwen 3 8B model.
See the full documentation here.
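As a rough illustration of the idea (the exact request format is described in the documentation; the field names below are made up for the example), defining a schema and consuming a structured response could look like this:

```python
import json

# Illustrative JSON schema a Vision Language Model could be asked to
# follow for image or shot analysis. The fields are example choices,
# not a schema mandated by DeepVA.
schema = {
    "type": "object",
    "properties": {
        "description": {"type": "string"},
        "objects":     {"type": "array", "items": {"type": "string"}},
        "is_indoor":   {"type": "boolean"},
    },
    "required": ["description", "objects"],
}

# With structured output enabled, the VLM response conforms to the
# schema and can be parsed directly, instead of arriving as free text.
response_text = (
    '{"description": "A news anchor at a desk",'
    ' "objects": ["desk", "microphone"], "is_indoor": true}'
)
result = json.loads(response_text)
assert isinstance(result["objects"], list)  # machine-readable, as promised
```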
In the coming months, we’ll be preparing new AI-powered workflows for transcription, export creation, and metadata generation to further optimize your media processing.
Visual Understanding – Shot Segmentation Upgrade
We’ve completely revamped shot segmentation to give you far more control over how videos are split and analyzed. Previously, shot detection offered just two settings: an on/off toggle and a sensitivity threshold. Useful, but limited. Now it has grown into a fully flexible, multi-method shot segmentation system for the Visual Understanding module.
Note: This is not yet available as a standalone function; it can only be used in combination with Visual Understanding.
In Visual Understanding, you can now choose how shots are detected and how sensitive each method is, or skip detection entirely and use fixed-length segments instead. These are the new options:
A. Shot Detection
When Enable shot detection is turned on, the video is automatically split into shots based on visual changes. Five detection methods are available:
- Content – detects semantic visual changes (default)
- Adaptive – dynamically adjusts sensitivity based on the video
- Threshold – brightness-based fade/cut detection
- Histogram – color distribution changes
- Hash – perceptual image differences
Each detected shot is then processed independently with your prompt and JSON schema, and each shot produces its own result segment.
B. Fixed-Length Segments
If you set Fixed shot length (in seconds) to a value greater than 0:
- The video is split into equal time segments (e.g. every 5 or 10 seconds)
- Each segment is analyzed independently
- Fixed-length segmentation cannot be combined with shot detection and overrides it if both are activated
This method is ideal for:
- Long videos with few shots, such as CCTV footage
- Uniform analysis based on timed chunks
- Timeline-based chunking
This gives you more accurate shot boundaries, better results, and fine-grained control for your use case. In short: you decide how your video is split and how detailed the analysis should be.
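As a rough sketch of how the two modes relate (the parameter names below are illustrative assumptions, not the exact API fields; see the documentation for the real schema), a configuration could look like this:

```python
# Illustrative configuration payloads for the two segmentation modes.
# All keys and values here are assumptions made for the sketch.

# A. Shot detection: pick one of the five methods and a sensitivity.
shot_detection_config = {
    "enable_shot_detection": True,
    "method": "content",      # "content" | "adaptive" | "threshold" | "histogram" | "hash"
    "sensitivity": 0.4,       # method-specific threshold
    "fixed_shot_length": 0,   # 0 = fixed-length splitting disabled
}

# B. Fixed-length segments: a value > 0 overrides shot detection.
fixed_length_config = {
    "enable_shot_detection": False,
    "fixed_shot_length": 10,  # split every 10 seconds, e.g. for CCTV footage
}
```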
Improved Speech Recognition
The Speech Recognition model has also been updated to improve transcript quality, reduce latency, and shorten processing time.
DeepVA — Further Improvements
- Fixed-length shot segmentation now supported
- Longer prompts and larger JSON inputs now supported
- On-Prem: fixed RAM exhaustion for long videos
- Detailed Text Recognition results can now also be visualized as a vertical list
Over the coming months, we will be preparing more post-processing results and breaking down silos between metadata types on our platform. We will also add extra inputs, outputs, and dictionary management for the Deep Live Hub.