This paper presents X-Oscar, a progressive framework designed for generating high-quality text-guided avatars. While notable advancements have been made in this field, existing methods suffer from several limitations, such as producing over-saturated and low-quality output. To create high-quality 3D avatars, X-Oscar follows a “Geometry→Texture→Animation” paradigm. This framework enables the gradual generation of avatars, reducing the complexity of optimization through step-by-step learning. To address the issue of over-saturation, we propose Adaptive Variational Parameter (AVP), which represents a 3D avatar as a distribution rather than fixed parameters. Additionally, we introduce Avatar-aware Score Distillation Sampling (ASDS), a novel technique that incorporates avatar-aware noise into the rendered image instead of random noise. This modification significantly enhances the quality of the generated results. Extensive evaluations have been conducted to assess the effectiveness of X-Oscar in generating complex shapes, appearances, and poses of 3D avatars.
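As an illustration of the idea behind AVP, the following minimal Python sketch treats an avatar parameter vector as a diagonal Gaussian and draws samples with the reparameterization trick. The Gaussian form, the class name `VariationalParameter`, and all numeric values are assumptions made for illustration. They are not taken from the X-Oscar implementation, and ASDS is not modeled here.

```python
import math
import random


class VariationalParameter:
    """Hypothetical parameter vector modeled as a diagonal Gaussian."""

    def __init__(self, mean, log_std):
        self.mean = list(mean)
        self.log_std = list(log_std)

    def sample(self, rng):
        # Reparameterization: theta = mu + sigma * eps, with eps ~ N(0, 1).
        return [
            m + math.exp(s) * rng.gauss(0.0, 1.0)
            for m, s in zip(self.mean, self.log_std)
        ]

    def deterministic(self):
        # The distribution mean, usable as a single fixed parameter vector.
        return list(self.mean)


rng = random.Random(0)
param = VariationalParameter(mean=[0.5, -1.0, 2.0], log_std=[-2.0, -2.0, -2.0])

# Draw many samples and average them; the average should stay close to the mean.
samples = [param.sample(rng) for _ in range(1000)]
avg = [sum(column) / len(column) for column in zip(*samples)]
```

The sketch is meant to convey only the general pattern: each optimization step would see a freshly sampled parameter vector rather than one fixed vector.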
Overview of X-Oscar, which consists of geometry modeling, appearance modeling, and animation refinement. X-Oscar is a progressive framework for text-to-avatar generation that follows a “Geometry→Texture→Animation” paradigm. This approach decomposes the complex task of avatar generation into a series of manageable subtasks, each focusing on a specific aspect of the avatar’s creation. In geometry modeling, we optimize the geometry of the avatar, represented by the SMPL-X model, to align with the input text prompt using a differentiable rendering pipeline. This stage yields a mesh whose shape matches the prompt. In appearance modeling, we represent the appearance of the avatar by optimizing an albedo map. In animation refinement, we change the pose of the avatar and optimize both geometry and appearance to handle parts that are inevitably occluded. Minimizing the animation loss refines the geometry and appearance of the avatar across various poses, improving quality and reducing artifacts in the final result.
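The three-stage progression described above can be sketched as a simple schedule that maps a global optimization step to the active stage and the quantities it updates. This is a minimal sketch, not the authors' training code: the `Stage` type, the `stage_for_step` helper, the pose labels, and the step budgets are hypothetical. Only the stage names and what each stage optimizes come from the overview.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    optimizes: tuple  # quantities updated during this stage
    pose: str         # assumed pose setting, used here only for illustration


STAGES = (
    Stage("geometry modeling", ("geometry",), "canonical"),
    Stage("appearance modeling", ("albedo",), "canonical"),
    Stage("animation refinement", ("geometry", "albedo"), "varied"),
)


def stage_for_step(step, budgets):
    """Return the stage active at a global step, given per-stage step budgets."""
    if step < 0:
        raise ValueError("step must be non-negative")
    start = 0
    for stage, budget in zip(STAGES, budgets):
        if step < start + budget:
            return stage
        start += budget
    raise ValueError("step is beyond the total budget")


# Hypothetical step budgets for the three stages.
budgets = (100, 200, 50)
for step in (0, 150, 320):
    stage = stage_for_step(step, budgets)
    print(step, stage.name, stage.optimizes)
```

Keeping the stages as plain data makes the progressive structure explicit: each stage touches only its own quantities until the final stage refines geometry and appearance jointly.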
X-Oscar enables the creation of delicate animatable 3D avatars from text prompts.
We demonstrate the generation process of X-Oscar. The generated avatars exhibit high quality and fidelity.
Anna in Frozen | Warren Buffett | Hermione Granger |
Ada Wong | Aragorn from The Lord of the Rings | Flynn Rider |
 
Guided only by a text description, X-Oscar can produce a high-quality canonical 3D avatar.
Aladdin in Aladdin | Frodo Baggins from The Lord of the Rings | Batman | Captain America |
Gardener | Geralt of Rivia | IronMan | Jeff Bezos |
Knight | Link from Zelda | Mulan | Steven Paul Jobs |
 
Given motion sequences, X-Oscar can animate the generated 3D avatars.
 
X-Oscar can generate 3D human avatars in diverse poses while preserving high-quality texture and geometry.
 
(The motion of AvatarCLIP is generated by the professional tool Mixamo)
 
X-Oscar supports avatar customization by editing the text prompt, as shown in the following results. The customized avatars remain realistic.
Jack Ma wearing a flowing sky-blue sundress | Jack Ma wearing a blue beanie, a black leather jacket, and blue jeans | Jack Ma wearing a blue shirt | Jack Ma wearing a down jacket | Jack Ma wearing a green t-shirt and blue jeans |
Jack Ma wearing a pink jacket | Jack Ma wearing a suit | Jack Ma wearing fitness clothing | Jack Ma wearing ski clothes | Jack Ma wearing a navy blue beanie, a blue sweater, and gray trousers |
 
The high-quality 3D models produced by X-Oscar can be edited directly in popular 3D graphics and image-editing software such as Blender.