In this paper, we propose HyperHuman, a novel framework for generating high-quality in-the-wild human images, built on a Latent Structural Diffusion Model that jointly denoises the RGB image together with its depth and surface-normal maps.
Extensive experiments demonstrate that our framework yields superior performance, generating realistic humans under diverse scenarios while capturing both image appearance and spatial relationships.
However, limitations remain. Because existing pose, depth, and normal estimators are imperfect on in-the-wild humans, the framework can fail to generate subtle details such as fingers and eyes.
In addition, our current pipeline still requires a body skeleton as input; future work could explore deep priors such as large language models to enable text-to-pose generation.