I am checking out another AI video fine-tune that makes characters talk. It is called HuMo, which stands for Human-Centric Video Generation via Collaborative Multi-Modal Conditioning. The core idea is simple: it generates videos where characters speak, with facial motion synced to the audio.
This project comes from the ByteDance research team. Before HuMo, the team released another video generation project named Phantom, which focused on subject-consistent generation built on the Wan 2.1 14B model. With HuMo, the team added multimodal conditioning, which expands what the model can do.
What Multimodal Conditioning Means in HuMo
With HuMo, the model is not limited to a single character reference image. Multimodal conditioning means audio can also be provided, so the character's facial movement matches the spoken content.
The workflow supports:
- A reference image of a character’s face
- An audio file for speech
- Optional text prompts to guide the scene
This setup allows the model to generate scenes where the character talks, with lip movement synced to the provided audio.
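To make the input combinations concrete, here is a minimal sketch of how such a workflow might be wired up. Everything here (HumoInputs, generate_talking_video, the field names) is an illustrative assumption of mine, not the project's actual API.

```python
# Illustrative sketch only: HumoInputs and generate_talking_video are
# hypothetical names, not HuMo's real API.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HumoInputs:
    audio: str                                    # path to a speech audio file
    reference_images: Optional[List[str]] = None  # zero or more face images
    prompt: Optional[str] = None                  # optional text guidance


def generate_talking_video(inputs: HumoInputs, output_path: str) -> None:
    """Stub standing in for the model call: identity comes from the
    reference images, lip timing from the audio, and the scene from
    the text prompt."""
    ...


generate_talking_video(
    HumoInputs(
        audio="speech.wav",
        reference_images=["character_face.png"],
        prompt="a news anchor speaking in a bright studio",
    ),
    output_path="talking_character.mp4",
)
```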
More Than Just Image References
One important detail is that this multimodal setup does not strictly require an image reference. It is also possible to use only a text prompt and an audio file. In that configuration, the model generates characters that speak with expressive facial movement matched to the audio.
HuMo also supports attaching multiple reference images within a single video scene. This allows more than one subject to appear in the same frame, and the model can place these subjects together and animate them in response to the same audio input.
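Continuing the hypothetical sketch above, those two variants would look something like this (again, all names are assumptions):

```python
# Variant 1: text + audio only. With no reference image, the model
# generates a speaker whose expressions follow the audio.
audio_and_text = HumoInputs(
    audio="speech.wav",
    reference_images=None,
    prompt="an elderly fisherman telling a story at dusk",
)

# Variant 2: multiple reference images. Both subjects share one frame
# and are animated against the same audio track.
two_subjects = HumoInputs(
    audio="dialogue.wav",
    reference_images=["person_a.png", "person_b.png"],
    prompt="two friends chatting at a cafe table",
)
```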
Text Condition and Edit Feature
Another feature available in HuMo is called text condition and edit. With this feature:
- Different text prompts can be applied to the same inputs
- Each prompt produces a different video output
- The changes in output reflect the changes in text conditioning
On the project page, there are comparisons that show how the video output shifts based on different input combinations. This helps in understanding how text, image, and audio inputs affect the final result.
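As a rough illustration of that behavior, reusing the hypothetical names from the earlier sketch, a prompt sweep over fixed image and audio inputs might look like this:

```python
# Same reference image and audio throughout; only the text condition varies.
fixed = dict(audio="speech.wav", reference_images=["character_face.png"])

prompts = [
    "speaking indoors under warm lamp light",
    "speaking outdoors on a windy beach",
    "speaking on a dark stage under a single spotlight",
]

# Each prompt should yield a different video; the differences in the
# outputs track the differences in the text conditioning.
for i, prompt in enumerate(prompts):
    generate_talking_video(
        HumoInputs(prompt=prompt, **fixed),
        output_path=f"text_edit_variant_{i}.mp4",
    )
```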
Accessing the HuMo Model and Files
On the Hugging Face research page maintained by the ByteDance team, the HuMo repository is available. The official website is humoai.net; all the details are provided there, including the full model weights for this fine-tuned model, named HuMo-17B.
Inside the repository, there are multiple .safetensors files. The model is based on Wan 2.1, so much of the setup feels familiar since I have used Wan-based workflows before. However, because these are full model weights, this is not something that can simply be dropped into ComfyUI and run easily on a local PC in a user-friendly way.
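For anyone who still wants the raw weights locally, a minimal sketch using the huggingface_hub client would look like the following. The repo id here is my assumption based on the team's Hugging Face page, so double-check it against the official links, and expect a very large download since the 17B weights are split across multiple .safetensors shards.

```python
# Minimal sketch: fetch the full HuMo weights from Hugging Face.
# "bytedance-research/HuMo" is an assumed repo id; verify it first.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="bytedance-research/HuMo",  # assumption, check the official page
    local_dir="./HuMo",
)
print(f"Weights downloaded to {local_dir}")
```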