Qwen Team Open Sources State-of-the-Art Image Model Qwen-Image
Briefly

Qwen-Image is an open-source multimodal image foundation model built from a Qwen2.5-VL text encoder, a Variational AutoEncoder for images, and a Multimodal Diffusion Transformer for generation. The model demonstrates strong text rendering in both English and Chinese and attains the highest overall score across T2I and TI2I benchmarks such as DPG, GenEval, GEdit, and ImgEdit. Its image understanding performance approaches that of task-specific models. On AI Arena, a comparison site where human evaluators rate pairs of generated images, Qwen-Image currently ranks third, behind several high-quality closed models. The training set comprises billions of annotated image-text pairs across nature, design, people, and synthetic categories.
Qwen-Image uses a Qwen2.5-VL text encoder for text inputs, a Variational AutoEncoder (VAE) for image inputs, and a Multimodal Diffusion Transformer (MMDiT) for image generation. The combined model "excels" at text rendering, including both English and Chinese text. Qwen evaluated the model on a suite of T2I and TI2I benchmarks, including DPG, GenEval, GEdit, and ImgEdit, where it achieved the highest overall score. On image understanding tasks, although it does not match specially trained models, Qwen-Image's performance is "remarkably close" to theirs.
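For readers who want to try the model, a minimal text-to-image inference sketch is shown below, assuming the open-source weights are published on the Hugging Face Hub under the `Qwen/Qwen-Image` identifier and load through the standard diffusers pipeline interface; the prompt, resolution, and step count are illustrative only.

```python
# Hypothetical inference sketch: generating an image with Qwen-Image via
# Hugging Face diffusers. Requires a CUDA GPU and a large model download;
# the "Qwen/Qwen-Image" repo id and pipeline compatibility are assumptions.
import torch
from diffusers import DiffusionPipeline

# The pipeline bundles the three components described above:
# the Qwen2.5-VL text encoder, the VAE, and the MMDiT generator.
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Text rendering is a highlighted strength, so the example prompt
# asks for legible bilingual text in the scene.
image = pipe(
    prompt='A coffee-shop chalkboard that reads "Qwen-Image" in English and Chinese',
    width=1024,
    height=1024,
    num_inference_steps=50,
).images[0]
image.save("qwen_image_demo.png")
```

This mirrors the usual diffusers workflow, so swapping in a different checkpoint or adjusting resolution and step count follows the same pattern.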
According to the Qwen team:

Qwen-Image is more than a state-of-the-art image generation model: it represents a paradigm shift in how we conceptualize and build multimodal foundation models. Its contributions extend beyond technical benchmarks, challenging the community to rethink the roles of generative models in perception, interface design, and cognitive modeling... As we continue to scale and refine such systems, the boundary between visual understanding and generation will blur further, paving the way for truly interactive, intuitive, and intelligent multimodal agents.
Read at InfoQ