Visual Understanding – Now with Machine-Readable Structured Output

Ever felt like prompt-based AI gives you flexibility but not quite the reliability you need in your workflow? You’re not alone. Prompts are great—they let you ask domain-specific questions, explore edge cases, even get creative. But without structure? You’re often left guessing what the output will look like.

That’s where structured visual understanding comes in. Imagine combining the agility of prompt engineering with the predictability of JSON output. Suddenly, you’ve got clear expectations, consistent results, and full control over how your Vision Language Model (VLM) behaves. Oh, and yes—you get to choose the model that best fits your use case.

So, what does that actually look like in practice? Let’s walk through it.

So, what’s new here?

This isn’t just about throwing prompts at an image and hoping for the best. The Visual Understanding Module now gives you:
  • Prompt-based analysis for both images and videos—opening up countless possibilities, from sports to journalism to compliance.

  • Structured output via predefined JSON schemas, making results easier to parse and integrate, whether you’re working with fixed categories or open text fields (see the sketch after this list).

  • Model choice flexibility, so you can tailor the analysis to the nature of your content.

  • File upload or API access—just like all our modules, it’s built API-first but fully UI-accessible too.

  • Composite AI capabilities—combine visual summaries with speech recognition and large language models (LLMs) for richer content understanding.
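
To make the structured-output idea concrete before diving into use cases, here is a minimal sketch of a predefined schema and a response that conforms to it. The field names and categories below are illustrative assumptions, not DeepVA’s actual schema.

```python
# Illustrative only: a hypothetical schema for prompt-based image/video analysis.
# Field names and allowed categories are assumptions, not DeepVA's actual contract.
SHOT_SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},                 # open text field
        "category": {                                  # fixed categories
            "type": "string",
            "enum": ["sports", "news", "interview", "advertisement", "other"],
        },
        "contains_logos": {"type": "boolean"},
    },
    "required": ["summary", "category"],
}

# A response that conforms to the schema is trivially machine-readable:
example_result = {
    "summary": "A striker celebrates a late goal in front of a packed stand.",
    "category": "sports",
    "contains_logos": True,
}
```

Fixed categories stay fixed and free text stays confined to the fields designed for it, so downstream systems never have to guess what the output will look like.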

And that’s just the beginning.

Use cases

Let’s be real: manually clipping highlights in video content is tedious. But what if your system could spot key moments on its own?

With DeepVA’s Visual Understanding Module tightly integrated into your Media Asset Management (MAM) system, those moments can now be detected, tagged, and clipped—automatically. Whether it’s:

  • A game-winning goal or a crowd’s reaction in a sports broadcast,
  • Speaker changes in a panel discussion (as a visual supplement to speaker identification),
  • High-drama scenes in a movie trailer, or
  • Brand-relevant segments in influencer content or product videos.


The structured results—delivered with frame-accurate timecodes—can then be used to generate Edit Decision Lists (EDLs) automatically, allowing editors to quickly fine-tune and export highlights for further processing in NLEs or for direct publishing.
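
As a rough illustration of that last step, the sketch below turns hypothetical timecoded highlight markers into a minimal CMX3600-style EDL. The marker structure and the 25 fps frame rate are assumptions for the sketch; a real integration would take both from the MAM.

```python
# Minimal sketch: turn timecoded highlight markers into a CMX3600-style EDL.
# Marker format and the 25 fps frame rate are assumptions for illustration.

def to_timecode(seconds: float, fps: int = 25) -> str:
    """Convert seconds to HH:MM:SS:FF."""
    frames = int(round(seconds * fps))
    ff = frames % fps
    s = frames // fps
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}:{ff:02d}"

def markers_to_edl(markers, title="AUTO HIGHLIGHTS", fps=25):
    """Emit one cut event per highlight marker, laid back-to-back on the record side."""
    lines = [f"TITLE: {title}", "FCM: NON-DROP FRAME", ""]
    record_pos = 0.0
    for i, m in enumerate(markers, start=1):
        src_in, src_out = to_timecode(m["start"], fps), to_timecode(m["end"], fps)
        rec_in = to_timecode(record_pos, fps)
        record_pos += m["end"] - m["start"]
        rec_out = to_timecode(record_pos, fps)
        lines.append(f"{i:03d}  AX       V     C        {src_in} {src_out} {rec_in} {rec_out}")
    return "\n".join(lines)

# Example: two highlight markers detected by the visual analysis.
print(markers_to_edl([{"start": 72.4, "end": 80.2}, {"start": 310.0, "end": 318.5}]))
```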

How it works (MAM-integrated workflow):

  1. Video is ingested into the MAM, and an internal job sends it to DeepVA with a defined prompt and JSON schema (a request sketch follows this list).
  2. Structured metadata is returned per video or shot (e.g., content classification or presence of products).
  3. The MAM’s workflow engine evaluates this data, combines it with our audio transcription, and automatically marks highlight-worthy segments.
  4. In addition, our Speech Recognition can extract the spoken text in parallel, and the same workflow can condense it via an LLM, e.g. for an automated text-to-speech voice-over.
  5. An EDL is generated based on timecoded markers and exported for editorial approval or further post-production, combined with the text-to-speech output.
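
To give step 1 a concrete shape, here is a minimal sketch of what such a job submission could look like over HTTP. The endpoint path, header, and payload field names are hypothetical assumptions; the actual request format is defined by the DeepVA API documentation.

```python
# Hypothetical sketch of step 1: submit a video for prompt-based analysis with a
# predefined JSON schema. Endpoint, auth header, and payload field names are
# assumptions, not DeepVA's documented API.
import requests

API_URL = "https://deepva.example.com/api/jobs"   # placeholder URL
API_KEY = "YOUR_API_KEY"

HIGHLIGHT_SCHEMA = {
    "type": "object",
    "properties": {
        "highlights": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "start_seconds": {"type": "number"},
                    "end_seconds": {"type": "number"},
                    "description": {"type": "string"},
                },
            },
        }
    },
}

payload = {
    "source": "https://mam.example.com/assets/match_final.mp4",  # asset reference from the MAM
    "module": "visual_understanding",
    "prompt": "Identify highlight-worthy moments such as goals and crowd reactions.",
    "output_schema": HIGHLIGHT_SCHEMA,
}

response = requests.post(API_URL, json=payload,
                         headers={"Authorization": f"Bearer {API_KEY}"})
response.raise_for_status()
print(response.json())   # structured, schema-conforming metadata per video or shot
```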


This automates highlight clipping, dramatically speeds up production timelines, reduces repetitive editing work, and ensures a consistent editorial standard—especially useful for newsrooms, sports broadcasters, and social media publishing teams. The real power comes not from any single algorithm, but from combining these steps in a workflow engine such as Helmut Cloud or the one provided by your MAM.

Ingest automation streamlines the way video and image assets are processed at scale. With DeepVA’s structured Visual Understanding, all incoming media can be analyzed using a consistent prompt and metadata schema—ensuring uniformity and reliability across departments.

Automated tagging and content descriptions not only support editorial workflows and marketing reuse but also enable accessibility features such as ALT texts. Custom prompts can extend this even further, allowing for use-case-specific analysis like:

  • Logo or text detection → for compliance in public broadcasting
  • Scene context (e.g., indoor/outdoor, event type)
  • Emotion analysis for storytelling tone
  • Presence of minors, animals, product placement or sensitive content

By integrating this analysis directly into your ingest pipeline, media companies can save time, reduce legal risks, and unlock new value from their content from day one.
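
As one illustration of such a use-case-specific setup, the sketch below pairs an ingest-time ALT-text prompt with a small compliance-oriented schema. The prompt wording and all field names are illustrative assumptions rather than a prescribed DeepVA configuration.

```python
# Illustrative ingest-time configuration: one consistent prompt and schema
# applied to every incoming asset. Prompt and field names are assumptions.
INGEST_PROMPT = (
    "Describe the image in one sentence suitable as an ALT text, and flag any "
    "visible logos, on-screen text, minors, or sensitive content."
)

INGEST_SCHEMA = {
    "type": "object",
    "properties": {
        "alt_text": {"type": "string"},
        "visible_logos": {"type": "array", "items": {"type": "string"}},
        "on_screen_text": {"type": "string"},
        "contains_minors": {"type": "boolean"},
        "sensitive_content": {"type": "boolean"},
    },
    "required": ["alt_text"],
}
```

Because every asset is described against the same schema, the MAM can route compliance flags into review queues while ALT texts flow straight into publishing.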

Media archives often contain large amounts of unlabeled or inconsistently tagged content. Structured VLM analysis can automatically enrich existing image and video archives with detailed, contextual, and searchable metadata, eliminating the need for manual effort.

By reprocessing archive content using a predefined JSON structure, you can extract (a schema sketch follows this list):

  • Scene summaries for quick previews
  • Additional metadata such as categories, sentiment, tone, time of day, or weather conditions
  • Text overlays or signs in the footage (e.g. historical references)
  • Printed and handwritten information on film cans, the backs of photos, or packaging, enabling automated tagging
  • Cultural or historical symbols (important for documentaries or regional content)
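
Here is a minimal sketch of what such a predefined structure could look like for archive enrichment, together with a small helper that flattens a conforming result into searchable tags. The field names are assumptions, not a prescribed DeepVA schema.

```python
# Hypothetical enrichment schema covering the items listed above, plus a helper
# that flattens a conforming result into keyword tags for the MAM's search index.
ARCHIVE_SCHEMA = {
    "type": "object",
    "properties": {
        "scene_summary": {"type": "string"},
        "categories": {"type": "array", "items": {"type": "string"}},
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "time_of_day": {"type": "string"},
        "weather": {"type": "string"},
        "visible_text": {"type": "array", "items": {"type": "string"}},   # overlays, signs, labels
        "cultural_symbols": {"type": "array", "items": {"type": "string"}},
    },
}

def result_to_tags(result: dict) -> list[str]:
    """Flatten a schema-conforming result into plain keyword tags for search."""
    tags = list(result.get("categories", []))
    for key in ("sentiment", "time_of_day", "weather"):
        if result.get(key):
            tags.append(f"{key}:{result[key]}")
    tags += [f"text:{t}" for t in result.get("visible_text", [])]
    tags += [f"symbol:{s}" for s in result.get("cultural_symbols", [])]
    return tags
```

Tags like these can be written straight back into the MAM, which is what makes the smart search and clustering described next possible.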


With structured metadata from DeepVA, media archives are transformed into dynamic, searchable resources. Smart search and content discovery within Media Asset Management (MAM) systems become more intuitive, while similar scenes or topics can be automatically clustered for easier access and thematic organization. Journalists and editors benefit from faster research workflows and quicker access to relevant footage.

Combined with our customizable face and landmark recognition, you can automate granular tagging even for very local archives.

Why it matters

Structured visual understanding is more than metadata extraction—it’s a strategic enabler for intelligent media workflows:
  • Sovereignty

    Use VLM capabilities and prompt-based questioning in your own secure environment, without data leaving your company.

  • Automation

    Editing automation needs contextual awareness. Structured visual data provides the necessary understanding of scenes, people, and content to trigger smart editing decisions and downstream processes.

  • Consistency

    Metadata is only valuable when it follows a reliable structure. By using predefined JSON schemas, this tool ensures uniform tagging across all media types—laying the groundwork for scalable AI applications.

  • Flexibility

    Whether you’re automating image ALT-text generation, highlight clipping, or checking content compliance, this tool adapts to dozens of real-world use cases with customizable prompts and models.

  • Foundation

    Structured visual data becomes a building block for composite AI workflows—such as combining speech recognition with visual cues to create highly searchable, fully indexed, and context-aware media assets.
