Changelog February 2026: New Sign-on, faster and smarter Speech Recognition & improved Visual Understanding

This February release introduces major improvements across DeepVA, with a strong focus on live transcription quality, latency reduction, and more predictable AI output. From ultra-low-latency roll-up captions in the Deep Live Hub to structured Visual Understanding in the Deep Media Analyzer, this update lays the groundwork for more workflow automation and lower latency.
 
Earlier this month, we introduced single sign-on, unifying the DeepVA platform and the Deep Live Hub under a single login and bringing the two platforms closer together. Using your DeepVA account, you can now access the Deep Live Hub and vice versa. This enables you to run quick tests, evaluate our live system or quickly build a POC, since the platform is already included in your subscription. Try it out!

Deep Live Hub Updates

Now is a great time to test out our Deep Live Hub: we have introduced numerous updates that improve working with custom dictionaries, decrease latency, add rolling subtitles, and enhance the model’s overall quality.

Dictionaries & “Sounds Like” – Smarter ASR Assistance

Dictionaries act as an additional knowledge source for the ASR engine whenever the model’s transcription confidence is low. With this release, dictionary handling has been significantly expanded through the introduction of “Sounds Like” entries.

“Sounds Like” entries allow you to define alternative phonetic representations of a dictionary term, helping the ASR engine recognize how a word might be spoken while enforcing how it should appear in the transcript.

The new dictionary function with “Sounds Like” entries added.
For example, you may want the word CEO to always be written without dots. Spoken audio might sound like “See E Oh” or “C.E.O.”. By creating a dictionary entry for CEO and adding “See E Oh” and “C.E.O.” as “Sounds Like” variants, the ASR engine will consistently output CEO in the finished transcript. This feature is particularly valuable when working with:
  • Names from foreign languages 
  • Brand or product names, abbreviations, and acronyms 
  • Technical or domain-specific terminology

Dictionary management remains available via the Dictionaries & Glossaries section and supports manual editing as well as bulk CSV imports with “Sounds Like” definitions. In the future, we will also add a way of using an LLM to compile dictionaries. See the full documentation here.
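As a rough illustration of what a bulk import could look like (the column layout shown here is an assumption, not the documented import format; please refer to the documentation for the authoritative fields), such a CSV file can be generated with a few lines of Python:

```python
import csv

# Hypothetical layout: one row per dictionary term, with the enforced
# spelling in the first column and "Sounds Like" variants after it.
entries = [
    ("CEO", ["See E Oh", "C.E.O."]),
    ("DeepVA", ["Deep V A", "Deep-V-A"]),
]

with open("dictionary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["term", "sounds_like_1", "sounds_like_2"])
    for term, variants in entries:
        writer.writerow([term, *variants])
```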

Roll-Up Captions – Ultra-Low Latency Live Subtitles

This release introduces roll-up captions as a feature in the Deep Live Hub, dramatically reducing latency for live subtitles while maintaining readability. Roll-up captions can now be configured directly in the ASR settings with 1, 2, 3, or 4-second roll-up intervals for the HLS output.

Example of roll-up captions appearing and shifting lines upward.

Unlike traditional live subtitles, which only appear after a full sentence has been completed, roll-up captions stream spoken content word by word, almost simultaneously with the speaker. As new words arrive, the current subtitle line grows dynamically, while completed lines move smoothly upward.
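To make this behaviour concrete, here is a minimal, purely illustrative sketch of a two-line roll-up buffer (a toy model, not the Deep Live Hub implementation; the line length and word-by-word update are assumptions):

```python
# Toy two-line roll-up caption buffer, for illustration only.
# Words are appended to the bottom line; when it is full, the bottom
# line rolls up and the oldest line is discarded.
MAX_CHARS_PER_LINE = 32

def roll_up(words, max_chars=MAX_CHARS_PER_LINE):
    lines = ["", ""]                   # top line, growing bottom line
    frames = []
    for word in words:
        candidate = (lines[1] + " " + word).strip()
        if len(candidate) > max_chars:
            lines = [lines[1], word]   # bottom line rolls up
        else:
            lines[1] = candidate
        frames.append(tuple(lines))    # one display state per word
    return frames

for top, bottom in roll_up("roll-up captions stream spoken content word by word".split()):
    print(f"{top!r:40} | {bottom!r}")
```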

Roll-up captions are now available as an HLS output, making them suitable for broadcast and professional live workflows. One example use case is integration with Steam Engineering’s SDI Teletext inserter, enabling near real-time subtitle insertion for linear broadcast environments. If you are using other output formats or the live editor, roll-up captions are not supported yet.

By combining faster ASR processing with rolling caption output, Deep Live Hub can now display spoken content almost in sync with live speech, significantly improving accessibility and viewer experience.

Deep Live Hub — Further Improvements

  • Updated ASR Engine: Improved quality and speed (now ultra-low latency)
  • Improved stability during live transcription
  • Minor bug fixes

DeepVA Platform Updates

Our DeepVA platform has also been updated, and all Visual Understanding updates that were already available via the API are now included in the UI. Many new models have also been added for our API customers to evaluate.

Visual Understanding – Structured Output & Expanded Model Choice

The Visual Understanding module now supports structured output, making AI-driven image and video analysis more predictable and easier to integrate into workflows. Structured output has been available via the API for two months and can now also be configured directly in the UI.

Instead of receiving free-form responses, you can now define a JSON schema that the Vision Language Model (VLM) must follow, combining prompt flexibility with reliable, machine-readable results. Additionally, new VLM models have been added, including a larger Qwen 3 8B model.
See the full documentation here.
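As a rough sketch (the field names below are illustrative assumptions, not the documented DeepVA request format), a schema constraining the VLM to return a fixed set of keys could look like this, expressed as a Python dict so it can be serialized and attached to a job:

```python
import json

# Hypothetical example schema: field names are illustrative only.
# The VLM is constrained to return exactly these keys with these types.
schema = {
    "type": "object",
    "properties": {
        "description": {"type": "string"},
        "objects": {"type": "array", "items": {"type": "string"}},
        "contains_text": {"type": "boolean"},
    },
    "required": ["description", "objects", "contains_text"],
}

print(json.dumps(schema, indent=2))
```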

In the coming months, we’ll be preparing new AI-powered workflows for transcription, export creation, and metadata generation to further optimize your media processing.

Visual Understanding – Shot Segmentation Upgrade

We’ve completely revamped shot segmentation to give you far more control over how videos are split and analyzed. Previously, shot detection was limited to an on/off switch and a threshold controlling sensitivity. Useful, but limited.

It is now a fully flexible, multi-method shot segmentation feature for the Visual Understanding module.

New configuration options for shot segmentation in Visual Understanding.

Note: This is not yet available as a standalone function, only in combination with Visual Understanding.

In Visual Understanding you can now choose how shots are detected, how sensitive each method is, or skip detection entirely and use fixed-length segments instead. These are the new options:

A. Shot Detection

When Enable shot detection is turned on:

  • The video is automatically split into shots based on visual changes, using one of the following methods:

    • Content – detects semantic visual changes (default)

    • Adaptive – dynamically adjusts sensitivity based on the video

    • Threshold – brightness-based fade/cut detection

    • Histogram – color distribution changes

    • Hash – perceptual image differences

  • Each detected shot is then processed independently with the prompt & JSON schema

  • Each shot produces its own result segment

B. Fixed-Length Segments

If you set Fixed shot length (in seconds) > 0:

  • The video is split into equal time segments (e.g. every 5s, 10s, etc.)

  • Each segment is analyzed independently

  • This cannot be combined with shot detection; if both are activated, fixed-length segmentation overrides shot detection.

This method is ideal for:

  • Long videos with few shots, like CCTV footage

  • Uniform analysis based on timed chunks

  • Timeline-based chunking

This gives you more accurate shot boundaries, better results, and fine-grained control for your use case. In short: you decide how your video is split and how detailed the analysis should be, as sketched in the configuration example below.
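As a rough illustration of the two segmentation modes (the parameter names are assumptions for readability, not the documented job format; consult the API documentation for the exact fields), a Visual Understanding job could be configured in one of two ways:

```python
# Hypothetical job configurations: field names are illustrative only.

# A. Shot detection: split on visual changes with a chosen method.
shot_detection_job = {
    "shot_detection": {
        "enabled": True,
        "method": "content",      # content | adaptive | threshold | histogram | hash
        "sensitivity": 0.4,       # method-specific sensitivity/threshold
    },
    "prompt": "Describe what happens in this shot.",
}

# B. Fixed-length segments: ignore shot boundaries, analyze timed chunks.
fixed_length_job = {
    "fixed_shot_length_seconds": 10,   # > 0 overrides shot detection
    "prompt": "Describe what happens in this segment.",
}
```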

Improved Speech Recognition

The Speech Recognition model has also been updated to improve transcript quality, reduce latency, and shorten processing time.

DeepVA — Further Improvements

  • Fixed-length shot segmentation now supported 
  • Longer prompts and larger JSON inputs now supported 
  • On-Prem: fixed a RAM-exhaustion issue with long videos 
  • Detailed Text Recognition results now also visualized as a vertical list

Over the coming months, we will be preparing more post-processing results and breaking down silos between metadata types on our platform. We will also be adding extra inputs, outputs, and dictionary management for the Deep Live Hub.
