muapi/omnihuman-1-5

Audio to Video

Generate a realistic talking-head video from a portrait image and an audio track using KIE OmniHuman 1.5.

Price varies by resolution per second of audio (max 60s billed)

Resolution  Duration  Cost
720p        5s        $0.22
720p        10s       $0.45
720p        30s       $1.35
720p        60s       $2.70
1080p       5s        $0.30
1080p       10s       $0.60
1080p       30s       $1.80
1080p       60s       $3.60

Related Models (all Audio to Video)

    • creatify-lipsync: Realistic lipsync video - optimized for speed, quality, and consistency.
    • kling-v1-avatar-standard: Kling AI Avatar Standard creates talking avatar videos from a single image + audio input. It supports realistic humans, animals, or stylized characters, producing lip-synced avatar videos easily.
    • veed-lipsync: Generate realistic lipsync from any audio using VEED's latest model.
    • infinitetalk-image-to-video: InfiniteTalk Image-to-Video brings still portraits and character photos to life by generating natural, realistic talking videos. You provide a single face image and a dialogue script, and the model animates lip movement, facial expressions, and subtle head gestures to match the speech.
    • kling-v2-avatar-pro: AI-Avatar v2 Pro takes a reference image of a person/character and an audio dialogue clip, then generates a realistic talking-avatar video. It preserves identity, lip syncs accurately to the audio, adds natural head movement, eye motion, expressions, and cinematic lighting.
    • latent-sync: LatentSync is a video-to-video model that generates lip sync animations from audio using advanced algorithms for high-quality synchronization.
    • kling-v1-avatar-pro: Kling AI Avatar Pro is the premium tier for making high-quality talking avatars. You upload a character image plus an audio file, and the model generates a realistic avatar video with lip-sync.
    • ltx-2-19b-lipsync: LTX-2-19B LipSync generates a realistic talking video by synchronizing a person’s mouth movements to an input audio clip. It preserves facial identity, head position, lighting, and natural expressions while producing accurate lip motion, subtle blinking, and stable temporal consistency. Ideal for avatars, dubbing, dialogue replacement, and character narration.
    • ltx-2.3-lipsync: LTX-2.3 LipSync generates a realistic talking video by synchronizing mouth movements to an input audio clip. It preserves facial identity, head position, lighting, and natural expressions while producing accurate lip motion, subtle blinking, and stable temporal consistency, powered by the upgraded LTX-2.3 architecture.
    • sync-lipsync: Generate realistic lipsync animations from audio using advanced algorithms for high-quality synchronization.
    • wan2.2-speech-to-video: WAN2.2 Speech-to-Video transforms a static image into a talking video by synchronizing lip movements and facial expressions with an audio input. Simply provide a character image along with a speech dialogue, and the model generates a natural, expressive video where the subject speaks your lines.
    • kling-v2-avatar-standard: AI-Avatar v2 Standard generates a talking-avatar video from a reference image and an audio dialogue. It performs accurate lip-sync, natural facial expressions, subtle head motion, blinking, and light emotional cues based on voice tone. This Standard version focuses on speed and natural realism.

Overview

About this model

OmniHuman 1.5 is a lipsync and talking-head model that animates a portrait image using an input audio track. It produces high-fidelity lip-syncing, natural facial expressions, and fluid head movements, yielding realistic speaking or singing videos.

  1. Digital Avatars: Create realistic talking avatars for virtual presentations and customer engagement.
  2. Entertainment: Animate portrait photos and artwork to sing or speak with natural lip sync.
  3. Social Media: Generate engaging video clips with animated characters matching voiceovers.

Pricing & Value

Cost analysis

muapiapp: $0.045/sec (720p) / $0.060/sec (1080p)

Dynamic per-second billing based on audio duration (5s to 60s).

Fal.ai: Not available

Replicate: Not available

* Competitor pricing is estimated based on similar model architectures and usage tiers.

Technical Details

Configuration schema

Prompt (string)

Optional prompt to guide lipsync style.

Default Value: Make her sing confidently into a microphone with natural lip sync

Image URL (string)

URL of the input portrait image.

Default Value: https://cdn.muapi.ai/assets/omnihuman-1-5.jpg

Audio URL (string)

URL of the input audio track.

Default Value: https://cdn.muapi.ai/assets/omnihuman-1-5.mp3

Output Resolution (enum, 2 options)

Output video resolution.

Default Value: 1080

Fast Mode (boolean)

Enable fast generation mode.

Default Value: false

Implementation Guide

Developer documentation

How to Use OmniHuman 1.5

  1. Prepare Your Inputs:

    • Image URL: Select a clear, high-quality portrait image. Upload or provide its URL.
    • Audio URL: Provide an audio file containing the speech or song that will drive the avatar's lip-sync.
  2. Configure Parameters:

    • Prompt: Optionally add a prompt to guide the style of the lip sync (e.g., 'Make her sing confidently with natural lip sync').
    • Output Resolution: Choose between 720p or 1080p. The default is 1080p.
    • Fast Mode: Enable fast generation mode (pe_fast_mode) for quicker iterations.
    • Seed: Use a custom seed for reproducible results, or set to -1 for random.
  3. Submit Your Request:

    • Send your prepared inputs to the omnihuman-1-5 endpoint as defined in the technical schema.
  4. Receive and Review the Output:

    • Once processing completes, retrieve the output video URL. Review the generated animation and iterate as needed.
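
The steps above can be sketched as a small request-body builder. This is a minimal illustration, not the official client: `image_url`, `audio_url`, and `pe_fast_mode` appear on this page, while the `prompt`, `resolution`, and `seed` key names, the `"720"`/`"1080"` enum strings, and the `build_payload` helper itself are assumptions made for the example. Check the technical schema for the exact field names before sending anything.

```python
import json
from typing import Any, Dict, Optional

# Assumed enum values; the page only says the resolution has two options.
ALLOWED_RESOLUTIONS = {"720", "1080"}


def build_payload(
    image_url: str,
    audio_url: str,
    prompt: Optional[str] = None,
    resolution: str = "1080",
    fast_mode: bool = False,
    seed: int = -1,
) -> Dict[str, Any]:
    """Build a request body for the omnihuman-1-5 endpoint (hypothetical helper)."""
    # image_url and audio_url are the two mandatory parameters.
    if not image_url or not audio_url:
        raise ValueError("image_url and audio_url are both required")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(ALLOWED_RESOLUTIONS)}")

    payload: Dict[str, Any] = {
        "image_url": image_url,
        "audio_url": audio_url,
        "resolution": resolution,   # assumed key name
        "pe_fast_mode": fast_mode,  # name taken from the Fast Mode description
        "seed": seed,               # assumed key name; -1 means random
    }
    # The prompt is optional, so it is only included when provided.
    if prompt:
        payload["prompt"] = prompt  # assumed key name
    return payload


if __name__ == "__main__":
    body = build_payload(
        image_url="https://cdn.muapi.ai/assets/omnihuman-1-5.jpg",
        audio_url="https://cdn.muapi.ai/assets/omnihuman-1-5.mp3",
        prompt="Make her sing confidently with natural lip sync",
    )
    print(json.dumps(body, indent=2))
```

Validating locally before submission avoids spending a request on a body that is missing one of the two required URLs.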

Common Questions

Frequently asked

What is the maximum duration of generation?

The generation duration is determined by the input audio track length, with a minimum of 5 seconds and a maximum of 60 seconds.

What parameters are required?

Both `image_url` and `audio_url` are mandatory parameters to generate a talking head video.

How is the credit cost calculated?

Cost is calculated dynamically per second of the input audio's duration: $0.045 per second for 720p resolution and $0.06 per second for 1080p resolution.
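
As a rough illustration of the per-second arithmetic, the sketch below estimates the charge from the audio length. The rates and the 5s/60s bounds come from this page; the `estimate_cost` helper, the choice to reject audio shorter than 5 seconds, and the choice to cap longer audio at 60 billed seconds are assumptions, and the platform's rounding rule is not stated here, so the result is left unrounded.

```python
from decimal import Decimal

# Per-second rates from the pricing section, keyed by output resolution.
RATE_PER_SECOND = {"720": Decimal("0.045"), "1080": Decimal("0.060")}
MIN_SECONDS = 5          # minimum audio duration stated in the FAQ
MAX_BILLED_SECONDS = 60  # at most 60 seconds are billed


def estimate_cost(audio_seconds: float, resolution: str = "1080") -> Decimal:
    """Estimate the charge in USD for one generation (hypothetical helper)."""
    if resolution not in RATE_PER_SECOND:
        raise ValueError("resolution must be '720' or '1080'")
    # Assumption: audio under the minimum duration is rejected, not padded.
    if audio_seconds < MIN_SECONDS:
        raise ValueError(f"audio must be at least {MIN_SECONDS} seconds long")
    # Assumption: audio longer than the maximum is billed at the cap.
    billed = min(Decimal(str(audio_seconds)), Decimal(MAX_BILLED_SECONDS))
    return RATE_PER_SECOND[resolution] * billed


if __name__ == "__main__":
    for seconds in (5, 10, 30, 60):
        print(seconds, estimate_cost(seconds, "720"), estimate_cost(seconds, "1080"))
```

For 5 seconds at 720p the product is $0.225, which the price table lists as $0.22, so treat sub-cent digits as approximate.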