Ever felt like prompt-based AI gives you flexibility but not quite the reliability you need in your workflow? You’re not alone. Prompts are great—they let you ask domain-specific questions, explore edge cases, even get creative. But without structure? You’re often left guessing what the output will look like.
That’s where structured visual understanding comes in. Imagine combining the agility of prompt engineering with the predictability of JSON output. Suddenly, you’ve got clear expectations, consistent results, and full control over how your Vision Language Model (VLM) behaves. Oh, and yes—you get to choose the model that best fits your use case.
So, what does that actually look like in practice? Let’s walk through it.
What’s new here?
- Prompt-based analysis for both images and videos, opening up countless possibilities, from sports to journalism to compliance.
- Structured output via predefined JSON schemas, making results easier to parse and integrate, whether you’re working with fixed categories or open text fields (see the sketch after this list).
- Model choice flexibility, so you can tailor the analysis to the nature of your content.
- File upload or API access: just like all our modules, it’s built API-first but fully UI-accessible too.
- Composite AI capabilities: combine visual summaries with speech recognition and large language models (LLMs) for richer content understanding.
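To make the prompt-plus-schema idea concrete, here is a minimal sketch of what such a request could look like. The endpoint URL, payload fields and schema below are illustrative assumptions rather than DeepVA’s actual API; the point is simply that pairing the prompt with a JSON schema makes the response shape predictable.

```python
# Minimal sketch of a prompt-plus-schema request. Endpoint, field names and
# payload structure are assumptions for illustration, not a real API contract.
import requests

API_URL = "https://api.example.com/v1/jobs"   # placeholder endpoint
API_KEY = "YOUR_API_KEY"

schema = {
    "type": "object",
    "properties": {
        "scene_type": {"type": "string", "enum": ["sports", "news", "studio", "other"]},
        "summary": {"type": "string"},
        "contains_logo": {"type": "boolean"},
    },
    "required": ["scene_type", "summary"],
}

job = {
    "media_url": "https://example.com/assets/clip.mp4",
    "prompt": "Classify the scene and summarise what happens in one sentence.",
    "response_schema": schema,
}

resp = requests.post(API_URL, json=job, headers={"Authorization": f"Bearer {API_KEY}"})
resp.raise_for_status()
print(resp.json())  # e.g. {"scene_type": "sports", "summary": "...", "contains_logo": false}
```

Because the schema fixes the allowed fields and categories up front, downstream systems can parse the result without guessing what the model will return.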
Use cases
- Highlight clipping just got smarter
- Ingest automation for consistency (API)
- Finding archive treasures through data
Let’s be real: manually clipping highlights in video content is tedious. But what if your system could spot key moments on its own?
With DeepVA’s Visual Understanding Module tightly integrated into your Media Asset Management (MAM) system, those moments can now be detected, tagged, and clipped—automatically. Whether it’s:
- A game-winning goal or a crowd’s reaction in a sports broadcast,
- Speaker changes in a panel discussion (as a visual supplement to speaker identification),
- High-drama scenes in a movie trailer, or
- Brand-relevant segments in influencer content or product videos.
The structured results—delivered with frame-accurate timecodes—can then be used to generate Edit Decision Lists (EDLs) automatically, allowing editors to quickly fine-tune and export highlights for further processing in NLEs or for direct publishing.
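As an illustration of that last step, the sketch below turns a list of timecoded highlight segments, as a structured analysis job might return them, into a minimal cuts-only CMX 3600 EDL. The segment fields and the 25 fps frame rate are assumptions made for the example.

```python
# Hedged sketch: converting timecoded highlight segments into a minimal
# cuts-only CMX 3600 EDL. Segment structure and 25 fps are assumptions.

FPS = 25

def tc_to_frames(tc: str) -> int:
    """Convert 'HH:MM:SS:FF' to an absolute frame count."""
    hh, mm, ss, ff = (int(x) for x in tc.split(":"))
    return ((hh * 60 + mm) * 60 + ss) * FPS + ff

def frames_to_tc(frames: int) -> str:
    """Convert an absolute frame count back to 'HH:MM:SS:FF'."""
    ff = frames % FPS
    ss = (frames // FPS) % 60
    mm = (frames // (FPS * 60)) % 60
    hh = frames // (FPS * 3600)
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"

def segments_to_edl(segments, title="AUTO HIGHLIGHTS", reel="AX"):
    """Build a cuts-only EDL; record timecodes are laid out back to back."""
    lines = [f"TITLE: {title}", "FCM: NON-DROP FRAME", ""]
    record_pos = 0
    for i, seg in enumerate(segments, start=1):
        src_in, src_out = tc_to_frames(seg["tc_in"]), tc_to_frames(seg["tc_out"])
        rec_in, rec_out = record_pos, record_pos + (src_out - src_in)
        lines.append(
            f"{i:03d}  {reel:<8} V     C        "
            f"{frames_to_tc(src_in)} {frames_to_tc(src_out)} "
            f"{frames_to_tc(rec_in)} {frames_to_tc(rec_out)}"
        )
        lines.append(f"* FROM CLIP NAME: {seg.get('label', 'highlight')}")
        record_pos = rec_out
    return "\n".join(lines)

# Example input resembling structured VLM output with frame-accurate timecodes
highlights = [
    {"tc_in": "00:12:04:10", "tc_out": "00:12:19:00", "label": "goal celebration"},
    {"tc_in": "00:47:31:05", "tc_out": "00:47:44:12", "label": "crowd reaction"},
]
print(segments_to_edl(highlights))
```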
How it works (MAM-integrated workflow):
- Video is ingested into the MAM, and an internal job sends it to DeepVA with a defined prompt and JSON schema.
- Structured metadata is returned per video or shot (e.g., content classification or presence of products).
- The MAM’s workflow engine evaluates this data, combines it with our audio transcription, and automatically marks highlight-worthy segments (see the sketch after this list).
- In addition, our Speech Recognition can extract the spoken text in parallel, and the same workflow can condense it via an LLM, e.g. for an automated text-to-speech voice-over.
- An EDL is generated based on timecoded markers and exported for editorial approval or further post-production, combined with the text-to-speech output.
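The sketch below illustrates the kind of evaluation step mentioned in the list: a workflow engine merging structured visual results with transcript segments and flagging highlight-worthy spans. All field names and categories are assumptions for illustration; a real workflow would use whatever the configured schema returns.

```python
# Hedged sketch of a workflow-engine evaluation step: merge structured visual
# results with transcript segments and flag highlight candidates. Field names
# ("category", "start", "end", "text") are assumptions for illustration.

def mark_highlights(visual_segments, transcript_segments,
                    keywords=("goal", "penalty", "incredible")):
    """Return segments whose visual label or overlapping transcript suggests a highlight."""
    highlights = []
    for seg in visual_segments:
        # Visual cue: the VLM classified the shot as a key moment
        visually_relevant = seg.get("category") in {"goal", "celebration", "crowd_reaction"}
        # Audio cue: any transcript chunk overlapping this shot mentions a keyword
        spoken = " ".join(
            t["text"].lower()
            for t in transcript_segments
            if t["start"] < seg["end"] and t["end"] > seg["start"]
        )
        audibly_relevant = any(k in spoken for k in keywords)
        if visually_relevant or audibly_relevant:
            highlights.append({**seg, "reason": "visual" if visually_relevant else "audio"})
    return highlights
```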
This automates the highlight clipping process, dramatically speeds up production timelines, reduces repetitive editing work, and ensures a consistent editorial standard, which is especially useful for newsrooms, sports broadcasters, and social media publishing teams. The real power comes not from the algorithms themselves, but from combining them smartly in a workflow engine such as Helmut Cloud or the one provided by your MAM.
Ingest automation streamlines the way video and image assets are processed at scale. With DeepVA’s structured Visual Understanding, all incoming media can be analyzed using a consistent prompt and metadata schema—ensuring uniformity and reliability across departments.
Automated tagging and content descriptions not only support editorial workflows and marketing reuse but also enable accessibility features such as ALT texts. Custom prompts can extend this even further, allowing for use-case-specific analysis like:
- Logo or text detection (for compliance in public broadcasting)
- Scene context (e.g., indoor/outdoor, event type)
- Emotion analysis for storytelling tone
- Presence of minors, animals, product placement or sensitive content
By integrating this analysis directly into your ingest pipeline, media companies can save time, reduce legal risks, and unlock new value from their content from day one.
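As an example of how such ingest-time analysis can be pinned down, the schema below (expressed as a Python dict) fixes the fields every incoming asset would be tagged with, from ALT text to compliance flags. The field names and categories are illustrative assumptions, not a prescribed DeepVA schema.

```python
# Hedged sketch of a fixed ingest schema so every incoming asset is tagged
# with the same fields. Field names and category values are assumptions.
INGEST_SCHEMA = {
    "type": "object",
    "properties": {
        "alt_text": {"type": "string", "maxLength": 250},
        "scene_context": {"type": "string", "enum": ["indoor", "outdoor", "studio", "unknown"]},
        "detected_logos": {"type": "array", "items": {"type": "string"}},
        "contains_minors": {"type": "boolean"},
        "sensitive_content": {"type": "boolean"},
        "emotional_tone": {"type": "string", "enum": ["neutral", "joyful", "tense", "sad"]},
    },
    "required": ["alt_text", "scene_context", "contains_minors", "sensitive_content"],
}
```

Keeping the schema identical across departments is what makes the resulting metadata comparable, searchable, and safe to automate against.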
Media archives often contain large amounts of unlabelled or inconsistently tagged content. Structured VLM analysis can automatically enrich existing image and video archives with detailed, contextual and searchable metadata, eliminating the need for manual effort.
By reprocessing archive content using a predefined JSON structure (see the sketch after this list), you can extract:
- Scene summaries for quick previews
- Additional metadata such as categories, sentiment, tone, time of day or weather conditions
- Text overlays or signs in the footage (e.g. historical references)
- Text and handwritten information on film cans, the backs of photos or packaging, to enable automated tagging
- Cultural/historical symbols (important for documentaries or regional content)
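The sketch referenced above shows how such a backfill might be orchestrated: iterate over untagged assets, send each through the same prompt and schema, and write the structured result back as metadata. The endpoint, prompt wording and field names are assumptions for illustration.

```python
# Hedged sketch of an archive backfill: run each asset through one shared
# prompt/schema pair and persist the structured result. Endpoint and fields
# are placeholders, not DeepVA's actual API.
import requests

ARCHIVE_PROMPT = (
    "Summarise the scene, list visible text or signage, and note any "
    "cultural or historical symbols."
)

def enrich_asset(asset_url: str, schema: dict, api_url: str, api_key: str) -> dict:
    """Run one archive asset through the structured analysis and return its metadata."""
    resp = requests.post(
        api_url,
        json={"media_url": asset_url, "prompt": ARCHIVE_PROMPT, "response_schema": schema},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()

def backfill(asset_urls, schema, api_url, api_key, store):
    """Process a batch of archive assets and persist the returned metadata."""
    for url in asset_urls:
        metadata = enrich_asset(url, schema, api_url, api_key)
        store(url, metadata)  # e.g. write back into the MAM via its own API
```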
With structured metadata from DeepVA, media archives are transformed into dynamic, searchable resources. Smart search and content discovery within Media Asset Management (MAM) systems becomes more intuitive, while similar scenes or topics can be automatically clustered for easier access and thematic organization. Journalists and editors benefit from faster research workflows and quicker access to relevant footage.
Combined with our customizable face and landmark recognition, you can enrich even highly local archives with granular tagging.
Why it matters
- Sovereignty: Use VLM capabilities and prompt-based questioning in your own secure environment, without data leaving your company.
- Automation: Editing automation needs contextual awareness. Structured visual data provides the necessary understanding of scenes, people, and content to trigger smart editing decisions and downstream processes.
- Consistency: Metadata is only valuable when it follows a reliable structure. By using predefined JSON schemas, this tool ensures uniform tagging across all media types, laying the groundwork for scalable AI applications.
- Flexibility: Whether you’re automating image ALT-text generation, highlight clipping, or checking content compliance, this tool adapts to dozens of real-world use cases with customizable prompts and models.
- Foundation: Structured visual data becomes a building block for composite AI workflows, such as combining speech recognition with visual cues to create highly searchable, fully indexed, and context-aware media assets.