Metadata is the new model tuning: How DeepVA makes GenAI ready for production

Generative AI often feels like magic, until you try using it in real production workflows. It quickly becomes clear that while it may suffice for creative tasks and entertainment, it often falls short on precision, taxonomies, and above all, repeatability. Non-reproducible results, inconsistent terminology for identical concepts, hallucinations, and contradictions have long made classic generative image models unreliable.

This is precisely why structured data is experiencing a major renaissance. Notably, even companies like Black Forest Labs, famous for groundbreaking generative AI innovation, demonstrate how essential formal structure is for achieving reliable results.

The start-up, headquartered like us in Freiburg, is rightly seen as a true thought leader in generative AI. It was recently ranked among Germany’s most promising start-ups, even though our Freiburg “neighbors” are three years younger than we are.

The Trend: Structured Prompts Become the Industry Standard (Example: Black Forest Labs)

The latest FLUX.2 release by Black Forest Labs showcases just how rapidly generative AI is evolving in the creative domain. The model is clearly designed for real production and studio use cases:

  • consistent characters and styles across multiple reference images   
  • precise execution of complex prompts   
  • realistic lighting, materials, and layouts   
  • reliably readable typography—a long-standing weakness of earlier models   


This is made possible through the combination of a latent-flow architecture with a Mistral VLM as the control layer. And this VLM layer is crucial:

Without clear structure in the input, even a cutting-edge model like FLUX.2 cannot deliver the desired precision and reproducibility.

FLUX.2 demonstrates just how important structured prompts have become: only structure transforms generative AI from a creative machine into a reliable tool.

Even its direct competitor, the open-source model Z‑Image from China, is built on the same principle of structured input architectures.
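To make the idea concrete, here is a minimal sketch of what a structured prompt for a JSON-capable image model can look like. All field names are illustrative assumptions, not the official FLUX.2 schema; each model’s documentation defines the exact format.

    import json

    # Illustrative structured prompt for a JSON-capable image model.
    # Every field name below is hypothetical, not the official FLUX.2 schema.
    prompt = {
        "subject": "product shot of a ceramic coffee mug",
        "style": "studio photography, soft key light from the left",
        "camera": {"angle": "45 degrees", "focal_length_mm": 85},
        "typography": {"text": "AICONIX", "placement": "front of the mug"},
        "constraints": ["white seamless background", "no reflections"],
    }

    # The serialized JSON string is what gets sent to the model as the prompt.
    print(json.dumps(prompt, indent=2))

Compared with a free-text prompt, each requirement lives in its own field, so it can be validated, versioned, and reproduced exactly.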

Competitors have already paved the way

Google & Alibaba: Google’s models, including Nano Banana (Gemini 2.5 Flash Image), effectively use JSON prompts to generate hyper-realistic images. In addition, the Veo 3 video generation model accepts complex JSON instructions. Alibaba’s Qwen-Image stands out for its strong prompt adherence, making JSON ideal for product images that require strict consistency.


OpenAI: Since GPT‑3 in 2020 and increasingly with GPT‑4 Turbo, OpenAI has built robust JSON handling directly into its models, so JSON prompting is expected to become the standard for reliable automation.
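As a minimal sketch of what this looks like in practice (the model name and tag keys are illustrative assumptions), the openai Python SDK can enforce syntactically valid JSON output via its JSON mode:

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # JSON mode forces the model to emit syntactically valid JSON;
    # the model name and the tag keys below are illustrative.
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Return image tags as JSON with keys: objects, scene, mood."},
            {"role": "user",
             "content": "Tag this description: a rainy city street at night."},
        ],
    )

    tags = json.loads(response.choices[0].message.content)
    print(tags["scene"])

Because the output is guaranteed to parse, downstream automation can consume it without brittle string matching.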

Why structure remains essential for analytical AI, too

Generative models are developed using huge amounts of structured training data. But in practical use, many users suddenly expect the models to meet every requirement without any structure, as if they were pure oracles.

However, the more complex the tasks become, the clearer it becomes that GenAI needs structured inputs and structured outputs in order to function reliably. This principle is also the foundation on which DeepVA is built.

DeepVA Visual Understanding: Structured Metadata for Real-World Workflows

For AICONIX customers and partners, this commitment to structure is implemented through the Visual Understanding Module (DeepVA), which offers a competitive advantage in media workflows.


Aiconix provides structured visual understanding by combining the flexibility of prompt-based analysis with the reliability of predefined JSON schemas for output. This approach was developed for professional media operations:

  • Workflow Automation

    Structured visual data provides the necessary contextual awareness of scenes, people, and content to trigger intelligent editing decisions and downstream processes. Structured results with frame-accurate timecodes can be used to automatically generate edit decision lists (EDLs) for tasks such as cutting highlights in sports broadcasts; a sketch follows after this list.

  • Consistency and Compliance

    The use of predefined JSON schemas guarantees a uniform tagging convention across all media types. This supports important automation functions such as content classification, logo recognition for compliance purposes, and emotion analysis.

  • Composite AI Foundation

    The structured visual data serves as a building block for composite AI workflows, allowing Aiconix to seamlessly combine visual cues with other capabilities such as speech recognition and large language models to obtain richer, fully indexed media resources.

  • Sovereignty

    The ability to use VLM functions and prompt-based queries in a secure environment ensures that valuable data remains within the company.
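To make the workflow-automation point above concrete, here is a minimal sketch in Python. The detection records and their field names are hypothetical, not DeepVA’s actual output schema; the sketch only illustrates how frame-accurate, schema-conformant metadata can be turned into CMX 3600-style EDL events.

    # Minimal sketch: schema-conformant detections to CMX 3600-style EDL events.
    # Detection records and field names are hypothetical, not DeepVA's real schema.

    detections = [
        {"label": "goal_celebration", "tc_in": "01:03:10:00", "tc_out": "01:03:18:12"},
        {"label": "goal_celebration", "tc_in": "01:27:44:05", "tc_out": "01:27:51:00"},
    ]

    def frames(tc: str, fps: int = 25) -> int:
        """Convert HH:MM:SS:FF timecode to an absolute frame count."""
        h, m, s, f = (int(x) for x in tc.split(":"))
        return ((h * 60 + m) * 60 + s) * fps + f

    def timecode(total: int, fps: int = 25) -> str:
        """Convert an absolute frame count back to HH:MM:SS:FF."""
        f, s = total % fps, total // fps
        return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}:{f:02d}"

    # One EDL event per detection; clips are laid back-to-back on the record side.
    record = 0
    for i, d in enumerate(detections, start=1):
        length = frames(d["tc_out"]) - frames(d["tc_in"])
        print(f"{i:03d}  AX       V     C        "
              f"{d['tc_in']} {d['tc_out']} {timecode(record)} {timecode(record + length)}")
        record += length

Because every detection carries machine-readable in and out points, the edit list needs no manual transcription from a viewing log.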

Structured Prompting: a win for your workflows

  • Automated and consistent tagging according to your taxonomy

    Supports content classification, logo recognition (compliance), and emotion analysis.

  • Consistent, searchable metadata

    Perfect for MAM/DAM, archives, or recommendation engines.

  • Faster content pipelines

    Reliable data speeds up analysis, packaging, and redistribution.

  • Better training data for AI systems

    Structured visual metadata is essential for robust models.

  • Intelligent editing decisions to trigger downstream processes

  • Automatic creation of Edit Decision Lists

  • Seamless combination with other functions

Structure creates the magic, not the other way around

The developments at Black Forest Labs make one thing clear:

Unstructured creativity is impressive, but structured intelligence is what creates real value. Generative AI shines when given the right framework. In media, this framework is structured input, enabling entire smart generative workflows to emerge.

Vidispine showcased this with us at IBC 2025:

Using our structured metadata from automated ingest and an intelligent workflow engine, an LLM can generate a rough cut aligned to the speaker’s script, without manual review or tedious assembly.
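A minimal sketch of this idea (segment records, prompt wording, and model name are illustrative assumptions, not the actual demo code): the LLM receives timecoded transcript segments plus the script and returns a JSON shot list.

    import json
    from openai import OpenAI

    # Hypothetical timecoded segments from automated ingest; not real demo data.
    segments = [
        {"id": "s1", "tc_in": "00:00:05:00", "tc_out": "00:00:12:00",
         "transcript": "Welcome to the show."},
        {"id": "s2", "tc_in": "00:01:40:00", "tc_out": "00:01:55:00",
         "transcript": "Today: structured metadata in media workflows."},
    ]
    script = "Open with the greeting, then introduce the metadata topic."

    # Ask the model for a JSON shot list; model name is illustrative.
    response = OpenAI().chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": "Select the segments that match the script and return JSON "
                       'of the form {"rough_cut": ["<segment id>", ...]} '
                       f"in playout order.\nScript: {script}\n"
                       f"Segments: {json.dumps(segments)}",
        }],
    )
    print(json.loads(response.choices[0].message.content)["rough_cut"])

The returned segment IDs can then feed the same kind of EDL generation sketched earlier, closing the loop from ingest to rough cut.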

Thanks to structured metadata and smart reasoning, creative teams can focus on creative work while repetitive processes are automated.

This is exactly the approach we have successfully pursued with Visual Understanding for a year now, and one we are integrating into the workflows of our many partners.


If you would like to learn how structured metadata and visual understanding can transform your own workflows, please feel free to contact us. We look forward to hearing from you.
