
OmniHuman-1: Advancing Single-Image Video Generation

A new GenAI video model from ByteDance, the company behind TikTok

Input: first frame + audio track

Output: a full-motion video with natural movement and synchronized speech

ByteDance has introduced OmniHuman-1, a sophisticated AI model that transforms single images into dynamic videos, complete with natural movements and synchronized speech.

The model represents a significant step forward in video synthesis technology, particularly in its ability to handle full-body animation and complex motion patterns.

Before you get too excited: there are no downloads, and you can’t try it yourself on any service right now. For now, it’s research only.

“If you created a TikTok video there’s a good chance you’re now in a database that’s going to be used to create virtual humans.”

Freddy Tran Nager, clinical associate professor of communications at the University of Southern California’s Annenberg School for Communication and Journalism


https://omnihuman-lab.github.io/

Technical Innovation Through Scale

At the heart of OmniHuman-1's capabilities lies an impressive training dataset of over 18,700 hours of human video data. This extensive dataset, combined with a novel "omni-conditions" training approach, enables the model to process multiple types of input simultaneously, including text, audio, and physical poses.
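To make the idea concrete, here is a minimal sketch of what mixed-condition training can look like in practice: each modality is embedded separately and whole modalities are randomly dropped during training so the model learns to generate from whatever subset of signals is available. This is an illustrative assumption about the general technique, not ByteDance's code; every module name and dimension below is made up for the example.

```python
# Illustrative sketch of multi-condition training (not OmniHuman-1's actual code).
import torch
import torch.nn as nn


class OmniConditionEncoder(nn.Module):
    def __init__(self, dim: int = 1024, drop_prob: float = 0.3):
        super().__init__()
        self.text_proj = nn.Linear(768, dim)    # e.g. a text encoder's hidden size
        self.audio_proj = nn.Linear(512, dim)   # e.g. wav2vec-style audio features
        self.pose_proj = nn.Linear(134, dim)    # e.g. 2D keypoints flattened per frame
        self.drop_prob = drop_prob              # per-modality dropout rate (assumed)

    def forward(self, text_feat, audio_feat, pose_feat):
        conds = []
        for feat, proj in [(text_feat, self.text_proj),
                           (audio_feat, self.audio_proj),
                           (pose_feat, self.pose_proj)]:
            if feat is None:
                continue
            emb = proj(feat)
            # Randomly drop an entire modality during training so the model
            # still works when only some conditions are provided at inference.
            if self.training and torch.rand(()) < self.drop_prob:
                emb = torch.zeros_like(emb)
            conds.append(emb)
        # Concatenate along the sequence axis; the video backbone attends to this.
        return torch.cat(conds, dim=1)


# Example: batch of 2, with 77 text tokens, 100 audio frames, 100 pose frames.
enc = OmniConditionEncoder()
ctx = enc(torch.randn(2, 77, 768), torch.randn(2, 100, 512), torch.randn(2, 100, 134))
print(ctx.shape)  # torch.Size([2, 277, 1024])
```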

The model employs a Diffusion Transformer (DiT) architecture, merging diffusion-based generative modeling with transformer-based sequence handling. This allows for one-stage, end-to-end generation of output videos, streamlining what traditionally required multiple specialized models. Notably, the audio-driven video comes out of that single pass, without separate lip-syncing tricks bolted on afterward.
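For readers unfamiliar with the DiT family, the rough shape of one such block is shown below: the noisy video latents attend to themselves and cross-attend to the condition sequence from the previous sketch. Again, this is a generic illustration of the architecture class, not OmniHuman-1's actual layers; all sizes are assumptions.

```python
# Generic DiT-style block with cross-attention to conditions (illustrative only).
import torch
import torch.nn as nn


class DiTBlock(nn.Module):
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, cond):
        # x:    noisy video latents as a token sequence (batch, tokens, dim)
        # cond: concatenated text/audio/pose embeddings (batch, cond_tokens, dim)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]          # spatio-temporal self-attention
        h = self.norm2(x)
        x = x + self.cross_attn(h, cond, cond)[0]   # attend to the conditions
        return x + self.mlp(self.norm3(x))          # feed-forward


block = DiTBlock()
latents = torch.randn(2, 1024, 1024)   # e.g. flattened video latent tokens
out = block(latents, torch.randn(2, 277, 1024))
print(out.shape)  # torch.Size([2, 1024, 1024])
```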

I would bet that within six months we'll have an OpenPose/ControlNet-style UI that lets us textually direct an entire scene from a single image. And I believe that will be doable on my RTX 4090. Watch this open-source space.
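The image-only version of that workflow already exists today: Stable Diffusion plus an OpenPose ControlNet runs comfortably on a consumer GPU via Hugging Face diffusers. The snippet below uses the commonly used public checkpoints (the exact model IDs may have moved on the Hub, and "pose.png" is a placeholder for a pre-extracted pose map); the part that is still missing is the video.

```python
# Existing image-only analogue: pose-controlled Stable Diffusion with diffusers.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Pose-conditioning ControlNet plus a base Stable Diffusion checkpoint.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# An OpenPose skeleton image drives the body layout; the prompt directs the scene.
pose_image = load_image("pose.png")  # placeholder path for a pre-extracted pose map
result = pipe("a person giving a presentation on stage", image=pose_image).images[0]
result.save("directed_scene.png")
```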

Thanks for reading 🤖 Agentic Newsletter! This post is public so feel free to share it.
