
Cooking Up Narrative Consistency for Long Video Generation


The recent public release of the Hunyuan Video generative AI model has intensified ongoing discussions about the potential of large multimodal vision-language models to one day create entire movies.

However, as we have observed, this is a very distant prospect at the moment, for a number of reasons. One is the very short attention window of most AI video generators, which struggle to maintain consistency even in a short single shot, let alone a series of shots.

Another is that consistent references to video content (such as explorable environments, which should not change randomly if you retrace your steps through them) can only be achieved in diffusion models through customization techniques such as low-rank adaptation (LoRA), which limits the out-of-the-box capabilities of foundation models.

Therefore the evolution of generative video seems set to stall unless new approaches to narrative continuity are developed.

Recipe for Continuity

With this in mind, a new collaboration between the US and China has proposed the use of instructional cooking videos as a possible template for future narrative continuity systems.

Click to play. The VideoAuteur project systematizes the analysis of the components of a cooking process, to produce a finely-captioned new dataset and an orchestration method for the generation of cooking videos. Refer to the source site for better resolution. Source: https://videoauteur.github.io/

Titled VideoAuteur, the work proposes a two-stage pipeline to generate instructional cooking videos using coherent states combining keyframes and captions, achieving state-of-the-art results in – admittedly – an under-subscribed field.

VideoAuteur's project page also includes a number of rather more interesting videos that use the same technique, such as a proposed trailer for a (non-existent) Marvel/DC crossover:

Click to play. Two superheroes from alternate universes come face to face in a fake trailer from VideoAuteur. Refer to the source site for better resolution.

The page also features similarly-styled promo videos for an equally non-existent Netflix animal series and a Tesla car ad.

In developing VideoAuteur, the authors experimented with a variety of loss functions and other novel approaches. To develop a recipe how-to generation workflow, they also curated CookGen, the largest dataset focused on the cooking domain, featuring 200,000 video clips with an average duration of 9.5 seconds.

At an average of 768.3 words per video, CookGen is comfortably the most extensively-annotated dataset of its kind. Various vision/language models were used, among other approaches, to ensure that the descriptions were as detailed, relevant and accurate as possible.

Cooking videos were chosen because cooking instruction walk-throughs have a structured and unambiguous narrative, making annotation and evaluation an easier task. Apart from pornographic videos (likely to enter this particular field sooner rather than later), it is difficult to think of any other genre quite as visually and narratively 'formulaic'.

The authors state:

‘Our proposed two-stage auto-regressive pipeline, which includes a long narrative director and visual-conditioned video generation, demonstrates promising improvements in semantic consistency and visual fidelity in generated long narrative videos.

Through experiments on our dataset, we observe improvements in spatial and temporal coherence across video sequences.

‘We hope our work can facilitate further research in long narrative video generation.’

The new work is titled VideoAuteur: Towards Long Narrative Video Generation, and comes from eight authors across Johns Hopkins University, ByteDance, and ByteDance Seed.

Dataset Curation

To develop CookGen, which powers a two-stage generative system for producing AI cooking videos, the authors used material from the YouCook and HowTo100M collections. The authors compare the scale of CookGen to previous datasets focused on narrative development in generative video, such as the Flintstones dataset, the Pororo cartoon dataset, StoryGen, Tencent's StoryStream, and VIST.

Comparison of images and text length between CookGen and the nearest-most populous similar datasets. Source: https://arxiv.org/pdf/2501.06173

CookGen focuses on real-world narratives, particularly procedural activities such as cooking, offering clearer and easier-to-annotate stories compared to image-based comic datasets. It exceeds the largest existing dataset, StoryStream, with 150x more frames and 5x denser textual descriptions.

The researchers fine-tuned a captioning model using the methodology of LLaVA-NeXT as a base. The automatic speech recognition (ASR) pseudo-labels obtained for HowTo100M were used as 'actions' for each video, and then refined further by large language models (LLMs).

For instance, ChatGPT-4o was used to produce a caption dataset, and was asked to focus on subject-object interactions (such as hands handling utensils and food), object attributes, and temporal dynamics.
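As a rough illustration of that refinement step (the prompt wording, model parameters and client code below are assumptions made for the example, not the authors' published pipeline), a caption-cleanup call might look something like this:

```python
from openai import OpenAI

client = OpenAI()

def refine_caption(asr_snippet: str) -> str:
    """Hypothetical sketch: rewrite a noisy ASR transcript fragment as a precise caption,
    steering the model toward the qualities the article mentions."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": ("Rewrite this noisy cooking-video ASR transcript as a precise caption. "
                         "Focus on subject-object interactions (e.g. hands handling utensils and food), "
                         "object attributes, and temporal dynamics.")},
            {"role": "user", "content": asr_snippet},
        ],
    )
    return response.choices[0].message.content
```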

Since ASR scripts are likely to contain inaccuracies and to be generally 'noisy', Intersection-over-Union (IoU) was used as a metric to measure how closely the captions conformed to the section of the video they were addressing. The authors note that this was important for the creation of narrative consistency.
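The article does not reproduce the exact alignment calculation, but temporal IoU between a caption's time span and its source clip is conventionally the overlap of the two intervals divided by their union; a minimal sketch:

```python
def temporal_iou(caption_span, clip_span):
    """IoU between two (start, end) intervals in seconds: higher means the caption
    sticks more closely to the portion of video it claims to describe."""
    start = max(caption_span[0], clip_span[0])
    end = min(caption_span[1], clip_span[1])
    intersection = max(0.0, end - start)
    union = (caption_span[1] - caption_span[0]) + (clip_span[1] - clip_span[0]) - intersection
    return intersection / union if union > 0 else 0.0

print(temporal_iou((12.0, 21.0), (10.0, 19.5)))  # ≈ 0.68
```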

The curated clips were evaluated using Fréchet Video Distance (FVD), which measures the disparity between ground truth (real-world) examples and generated examples, both with and without ground truth keyframes, arriving at a performative result:

Using FVD to evaluate the distance between videos generated with the new captions, both with and without the use of keyframes captured from the sample videos.
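For reference, FVD is the Fréchet distance between the feature distributions of real and generated clips (conventionally features from an I3D video network); the sketch below shows the distance computation itself, not the paper's evaluation code:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between two N x D sets of video features (e.g. I3D embeddings)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # sqrtm can return tiny imaginary parts from numerical noise
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```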

Additionally, the clips were rated both by GPT-4o and by six human annotators, following LLaVA-Hound's definition of 'hallucination' (i.e., the capacity of a model to invent spurious content).

The researchers compared the quality of the captions to the Qwen2-VL-72B collection, obtaining a slightly improved score.

Comparison of FVD and human evaluation scores between Qwen2-VL-72B and the authors' collection.

Method

VideoAuteur's generative phase is divided between the Long Narrative Director (LND) and the visual-conditioned video generation model (VCVGM).

The LND generates a sequence of visual embeddings or keyframes that characterize the narrative flow, similar to 'essential highlights'. The VCVGM generates video clips based on these choices.

Schema for the VideoAuteur processing pipeline. The Long Narrative Video Director makes apposite selections to feed to the Seed-X-powered generative module.

The authors extensively discuss the differing merits of an interleaved image-text director and a language-centric keyframe director, and conclude that the former is the more effective approach.

The interleaved image-text director generates a sequence by interleaving text tokens and visual embeddings, using an auto-regressive model to predict the next token based on the combined context of both text and images. This ensures a tight alignment between visuals and text.
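A structural sketch of that interleaved loop is shown below; the director and tokenizer method names are hypothetical stand-ins rather than VideoAuteur's actual interface:

```python
import torch

def interleaved_rollout(director, tokenizer, global_prompt: str, num_shots: int):
    """Sketch of an interleaved image-text director: at each step the auto-regressive
    model emits caption tokens, then regresses a continuous visual embedding, and both
    are appended to the running context so later predictions see all earlier text and visuals."""
    context = [tokenizer.encode(global_prompt)]               # token ids for the global prompt
    captions, keyframe_latents = [], []
    for _ in range(num_shots):
        caption_ids = director.next_caption(context)                    # hypothetical call
        latent = director.next_visual_latent(context + [caption_ids])   # hypothetical call
        captions.append(tokenizer.decode(caption_ids))
        keyframe_latents.append(latent)
        context += [caption_ids, latent]                      # interleave text and visual states
    return captions, torch.stack(keyframe_latents)
```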

By contrast, the language-centric keyframe director synthesizes keyframes using a text-conditioned diffusion model based solely on captions, without incorporating visual embeddings into the generation process.

The researchers found that while the language-centric method generates visually appealing keyframes, it lacks consistency across frames, arguing that the interleaved method achieves higher scores in realism and visual consistency. They also found that the interleaved method was better able to learn a realistic visual style through training, though sometimes with repetitive or noisy elements.

Unusually, in a research strand dominated by the co-opting of Stable Diffusion and Flux into workflows, the authors used Tencent's SEED-X 7B-parameter multi-modal LLM foundation model for their generative pipeline (though this model does leverage Stability.ai's SDXL release of Stable Diffusion for a limited part of its architecture).

The authors state:

‘Unlike the classic Image-to-Video (I2V) pipeline that uses an image as the starting frame, our approach leverages [regressed visual latents] as continuous conditions throughout the [sequence].

‘Additionally, we improve the robustness and quality of the generated videos by adapting the model to handle noisy visual embeddings, since the regressed visual latents are not perfect, due to regression errors.’

Although typical visual-conditioned generative pipelines of this sort typically use preliminary keyframes as a place to begin for mannequin steering, VideoAuteur expands on this paradigm by producing multi-part visible states in a semantically coherent latent area, avoiding the potential bias of basing additional technology solely on ‘beginning frames’.

Schema for the use of visual state embeddings as a superior conditioning method.

Tests

In line with the methods of SeedStory, the researchers use SEED-X to apply LoRA fine-tuning on their narrative dataset, enigmatically describing the result as a 'Sora-like model', pre-trained on large-scale video/text couplings, and capable of accepting both visual and text prompts and conditions.
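As a loose illustration of LoRA fine-tuning on a large model (the checkpoint path, rank and target modules below are placeholders, not the authors' SEED-X configuration), using the Hugging Face peft library:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/multimodal-llm-checkpoint")  # placeholder id
lora_cfg = LoraConfig(
    r=16,                                  # adapter rank (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # typical attention projections; actual targets may differ
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # only the low-rank adapter weights are trainable
```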

32,000 narrative videos were used for model development, with 1,000 held aside as validation samples. The videos were cropped to 448 pixels on the short side and then center-cropped to 448x448px.
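Applied per frame, that preprocessing corresponds roughly to the following torchvision transform (interpolation mode and any further normalization are assumptions):

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(448),        # scale the shorter side to 448px, preserving aspect ratio
    transforms.CenterCrop(448),    # then keep the central 448x448 region
])
```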

For training, narrative generation was evaluated mainly on the YouCook2 validation set. The HowTo100M set was used for data quality evaluation and also for image-to-video generation.

For the visual conditioning loss, the authors used the diffusion loss from DiT and a 2024 work based around Stable Diffusion.
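The DiT-style objective referenced here is the standard epsilon-prediction diffusion loss; the following is a textbook sketch rather than the paper's exact implementation, and the model call signature is assumed:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, cond, timesteps, alphas_cumprod):
    """Noise a clean latent x0 to a random timestep and train the model to predict
    that noise, given the visual/text condition."""
    noise = torch.randn_like(x0)
    a = alphas_cumprod[timesteps].view(-1, *([1] * (x0.dim() - 1)))  # broadcast per sample
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise                   # forward diffusion step
    pred = model(x_t, timesteps, cond)                               # predicted noise (assumed signature)
    return F.mse_loss(pred, noise)
```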

To demonstrate their contention that interleaving is the superior approach, the authors pitted VideoAuteur against a number of methods that rely solely on text-based input: EMU-2, SEED-X, SDXL and FLUX.1-schnell (FLUX.1-s).

Given a global prompt, 'Step-by-step guide to cooking mapo tofu', the interleaved director generates actions, captions, and image embeddings sequentially to narrate the process. The first two rows show keyframes decoded from EMU-2 and SEED-X latent spaces. These images are realistic and consistent but less polished than those from advanced models like SDXL and FLUX.

The authors state:

‘The language-centric approach using text-to-image models produces visually appealing keyframes, but suffers from a lack of consistency across frames due to limited mutual information. In contrast, the interleaved generation method leverages language-aligned visual latents, achieving a realistic visual style through training.

‘However, it occasionally generates images with repetitive or noisy elements, as the auto-regressive model struggles to create accurate embeddings in a single pass.’

Human evaluation further confirms the authors' contention regarding the improved performance of the interleaved approach, with interleaved methods achieving the highest scores in a survey.

Comparisons of approaches from a human study conducted for the paper.

However, we note that the language-centric approaches achieve the best aesthetic scores. The authors contend, though, that this is not the central issue in the generation of long narrative videos.

Click to play. Segments generated for a pizza-building video, by VideoAuteur.

Conclusion

The most popular strand of research regarding this challenge, i.e., narrative consistency in long-form video generation, is concerned with single images. Projects of this kind include DreamStory, StoryDiffusion, TheaterGen and NVIDIA's ConsiStory.

In a sense, VideoAuteur also falls into this 'static' category, since it uses seed images from which clip sections are generated. However, the interleaving of video and semantic content brings the approach a step closer to a practical pipeline.

 

First published Thursday, January 16, 2025
