Visual Understanding – Now with Machine-Readable Structured Output

Ever felt like prompt-based AI gives you flexibility but not quite the reliability you need in your workflow? You’re not alone. Prompts are great—they let you ask domain-specific questions, explore edge cases, even get creative. But without structure? You’re often left guessing what the output will look like.

That’s where structured visual understanding comes in. Imagine combining the agility of prompt engineering with the predictability of JSON output. Suddenly, you’ve got clear expectations, consistent results, and full control over how your Vision Language Model (VLM) behaves. Oh, and yes—you get to choose the model that best fits your use case.

So, what does that actually look like in practice? Let’s walk through it.

So, what’s new here?

This isn’t just about throwing prompts at an image and hoping for the best. The Visual Understanding Module now gives you:
  • Prompt-based analysis for both images and videos—opening up countless possibilities, from sports to journalism to compliance.

  • Structured output via predefined JSON schemas, making results easier to parse and integrate, whether you’re working with fixed categories or open text fields (see the sketch after this list).

  • Model choice flexibility, so you can tailor the analysis to the nature of your content.

  • File upload or API access—just like all our modules, it’s built API-first but fully UI-accessible too.

  • Composite AI capabilities—combine visual summaries with speech recognition and large language models (LLMs) for richer content understanding.
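
To make the structured-output idea concrete before diving into use cases, here is a minimal sketch of a predefined schema and a response that conforms to it. The field names and categories below are illustrative assumptions, not DeepVA’s actual schema.

```python
# Illustrative only: a hypothetical schema for prompt-based image/video analysis.
# Field names and allowed categories are assumptions, not DeepVA's actual contract.
SHOT_SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},                 # open text field
        "category": {                                  # fixed categories
            "type": "string",
            "enum": ["sports", "news", "interview", "advertisement", "other"],
        },
        "contains_logos": {"type": "boolean"},
    },
    "required": ["summary", "category"],
}

# A response that conforms to the schema is trivially machine-readable:
example_result = {
    "summary": "A striker celebrates a late goal in front of a packed stand.",
    "category": "sports",
    "contains_logos": True,
}
```

Fixed categories stay fixed and free text stays confined to the fields designed for it, so downstream systems never have to guess what the output will look like.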

And that’s just the beginning.

Use cases

Let’s be real: manually clipping highlights in video content is tedious. But what if your system could spot key moments on its own?

With DeepVA’s Visual Understanding Module tightly integrated into your Media Asset Management (MAM) system, those moments can now be detected, tagged, and clipped—automatically. Whether it’s:

  • A game-winning goal or a crowd’s reaction in a sports broadcast,
  • Speaker changes in a panel discussion (as a visual supplement to speaker identification),
  • High-drama scenes in a movie trailer, or
  • Brand-relevant segments in influencer content or product videos.


The structured results—delivered with frame-accurate timecodes—can then be used to generate Edit Decision Lists (EDLs) automatically, allowing editors to quickly fine-tune and export highlights for further processing in NLEs or for direct publishing.
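
As a rough illustration of that last step, the sketch below turns hypothetical timecoded highlight markers into a minimal CMX3600-style EDL. The marker structure and the 25 fps frame rate are assumptions for the sketch; a real integration would take both from the MAM.

```python
# Minimal sketch: turn timecoded highlight markers into a CMX3600-style EDL.
# Marker format and the 25 fps frame rate are assumptions for illustration.

def to_timecode(seconds: float, fps: int = 25) -> str:
    """Convert seconds to HH:MM:SS:FF."""
    frames = int(round(seconds * fps))
    ff = frames % fps
    s = frames // fps
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}:{ff:02d}"

def markers_to_edl(markers, title="AUTO HIGHLIGHTS", fps=25):
    """Emit one cut event per highlight marker, laid back-to-back on the record side."""
    lines = [f"TITLE: {title}", "FCM: NON-DROP FRAME", ""]
    record_pos = 0.0
    for i, m in enumerate(markers, start=1):
        src_in, src_out = to_timecode(m["start"], fps), to_timecode(m["end"], fps)
        rec_in = to_timecode(record_pos, fps)
        record_pos += m["end"] - m["start"]
        rec_out = to_timecode(record_pos, fps)
        lines.append(f"{i:03d}  AX       V     C        {src_in} {src_out} {rec_in} {rec_out}")
    return "\n".join(lines)

# Example: two highlight markers detected by the visual analysis.
print(markers_to_edl([{"start": 72.4, "end": 80.2}, {"start": 310.0, "end": 318.5}]))
```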

How it works (MAM-integrated workflow):

  1. Video is ingested into the MAM, and an internal job sends it to DeepVA with a defined prompt and JSON schema (a request sketch follows this list).
  2. Structured metadata is returned per video or shot (e.g., content classification or presence of products).
  3. The MAM’s workflow engine evaluates this data, combines it with our audio transcription, and automatically marks highlight-worthy segments.
  4. In addition, our Speech Recognition can extract the spoken text in parallel, and the same workflow can condense it via an LLM, e.g. for an automated text-to-speech voice-over.
  5. An EDL is generated based on timecoded markers and exported for editorial approval or further post-production, combined with the text-to-speech output.
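
To give step 1 a concrete shape, here is a minimal sketch of what such a job submission could look like over HTTP. The endpoint path, header, and payload field names are hypothetical assumptions; the actual request format is defined by the DeepVA API documentation.

```python
# Hypothetical sketch of step 1: submit a video for prompt-based analysis with a
# predefined JSON schema. Endpoint, auth header, and payload field names are
# assumptions, not DeepVA's documented API.
import requests

API_URL = "https://deepva.example.com/api/jobs"   # placeholder URL
API_KEY = "YOUR_API_KEY"

HIGHLIGHT_SCHEMA = {
    "type": "object",
    "properties": {
        "highlights": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "start_seconds": {"type": "number"},
                    "end_seconds": {"type": "number"},
                    "description": {"type": "string"},
                },
            },
        }
    },
}

payload = {
    "source": "https://mam.example.com/assets/match_final.mp4",  # asset reference from the MAM
    "module": "visual_understanding",
    "prompt": "Identify highlight-worthy moments such as goals and crowd reactions.",
    "output_schema": HIGHLIGHT_SCHEMA,
}

response = requests.post(API_URL, json=payload,
                         headers={"Authorization": f"Bearer {API_KEY}"})
response.raise_for_status()
print(response.json())   # structured, schema-conforming metadata per video or shot
```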


This automates highlight clipping, dramatically speeds up production timelines, reduces repetitive editing work, and ensures a consistent editorial standard—especially useful for newsrooms, sports broadcasters, and social media publishing teams. The real power comes not from any single algorithm, but from combining these steps in a workflow engine such as Helmut Cloud or the one provided by your MAM.

Ingest automation streamlines the way video and image assets are processed at scale. With DeepVA’s structured Visual Understanding, all incoming media can be analyzed using a consistent prompt and metadata schema—ensuring uniformity and reliability across departments.

Automated tagging and content descriptions not only support editorial workflows and marketing reuse but also enable accessibility features such as ALT texts. Custom prompts can extend this even further, allowing for use-case-specific analysis like:

  • Logo or text detection → for compliance in public broadcasting
  • Scene context (e.g., indoor/outdoor, event type)
  • Emotion analysis for storytelling tone
  • Presence of minors, animals, product placement or sensitive content

By integrating this analysis directly into your ingest pipeline, media companies can save time, reduce legal risks, and unlock new value from their content from day one.
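
As one illustration of such a use-case-specific setup, the sketch below pairs an ingest-time ALT-text prompt with a small compliance-oriented schema. The prompt wording and all field names are illustrative assumptions rather than a prescribed DeepVA configuration.

```python
# Illustrative ingest-time configuration: one consistent prompt and schema
# applied to every incoming asset. Prompt and field names are assumptions.
INGEST_PROMPT = (
    "Describe the image in one sentence suitable as an ALT text, and flag any "
    "visible logos, on-screen text, minors, or sensitive content."
)

INGEST_SCHEMA = {
    "type": "object",
    "properties": {
        "alt_text": {"type": "string"},
        "visible_logos": {"type": "array", "items": {"type": "string"}},
        "on_screen_text": {"type": "string"},
        "contains_minors": {"type": "boolean"},
        "sensitive_content": {"type": "boolean"},
    },
    "required": ["alt_text"],
}
```

Because every asset is described against the same schema, the MAM can route compliance flags into review queues while ALT texts flow straight into publishing.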

Media archives often contain large amounts of unlabeled or inconsistently tagged content. Structured VLM analysis can automatically enrich existing image and video archives with detailed, contextual, and searchable metadata, eliminating the need for manual effort.

By reprocessing archive content using a predefined JSON structure, you can extract (a schema sketch follows this list):

  • Scene summaries for quick previews
  • Additional metadata such as categories, sentiment, tone, time of day, or weather conditions
  • Text overlays or signs in the footage (e.g. historical references)
  • Printed and handwritten information on film cans, the backs of photos, or packaging, enabling automated tagging
  • Cultural or historical symbols (important for documentaries or regional content)
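
Here is a minimal sketch of what such a predefined structure could look like for archive enrichment, together with a small helper that flattens a conforming result into searchable tags. The field names are assumptions, not a prescribed DeepVA schema.

```python
# Hypothetical enrichment schema covering the items listed above, plus a helper
# that flattens a conforming result into keyword tags for the MAM's search index.
ARCHIVE_SCHEMA = {
    "type": "object",
    "properties": {
        "scene_summary": {"type": "string"},
        "categories": {"type": "array", "items": {"type": "string"}},
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "time_of_day": {"type": "string"},
        "weather": {"type": "string"},
        "visible_text": {"type": "array", "items": {"type": "string"}},   # overlays, signs, labels
        "cultural_symbols": {"type": "array", "items": {"type": "string"}},
    },
}

def result_to_tags(result: dict) -> list[str]:
    """Flatten a schema-conforming result into plain keyword tags for search."""
    tags = list(result.get("categories", []))
    for key in ("sentiment", "time_of_day", "weather"):
        if result.get(key):
            tags.append(f"{key}:{result[key]}")
    tags += [f"text:{t}" for t in result.get("visible_text", [])]
    tags += [f"symbol:{s}" for s in result.get("cultural_symbols", [])]
    return tags
```

Tags like these can be written straight back into the MAM, which is what makes the smart search and clustering described next possible.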


With structured metadata from DeepVA, media archives are transformed into dynamic, searchable resources. Smart search and content discovery within Media Asset Management (MAM) systems become more intuitive, while similar scenes or topics can be automatically clustered for easier access and thematic organization. Journalists and editors benefit from faster research workflows and quicker access to relevant footage.

Combined with our customizable face and landmark recognition, you can automate granular tagging even for very local archives.

Why it matters

Structured visual understanding is more than metadata extraction—it’s a strategic enabler for intelligent media workflows:
  • Sovereignty

    Use VLM capabilities and prompt-based questioning in your own secure environment, without data leaving your company.

  • Automation

    Editing automation needs contextual awareness. Structured visual data provides the necessary understanding of scenes, people, and content to trigger smart editing decisions and downstream processes.

  • Consistency

    Metadata is only valuable when it follows a reliable structure. By using predefined JSON schemas, this tool ensures uniform tagging across all media types—laying the groundwork for scalable AI applications.

  • Flexibility

    Whether you’re automating image ALT-text generation, highlight clipping, or checking content compliance, this tool adapts to dozens of real-world use cases with customizable prompts and models.

  • Foundation

    Structured visual data becomes a building block for composite AI workflows—such as combining speech recognition with visual cues to create highly searchable, fully indexed, and context-aware media assets.
