
"Zhipu, which styles itself Z.ai and runs a chatbot at that address, offers several models named General Language Model (GLM). On Wednesday the company announced GLM-Image, which it says employs "an independently developed 'autoregressive + diffusion decoder' hybrid architecture, which enables the joint generation of image and language models." Zhipu says this represents an important advance on the Nano Banana Pro image-generating AI."
"On model-mart Hugging Face, Zhipu describes GLM-Image's architecture as comprising two elements: Autoregressive generator: a 9B-parameter model initialized from GLM-4-9B-0414, with an expanded vocabulary to incorporate visual tokens. The model first generates a compact encoding of approximately 256 tokens, then expands to 1K-4K tokens, corresponding to 1K-2K high-resolution image outputs. Diffusion Decoder: a 7B-parameter decoder based on a single-stream DiT architecture for latent-space image decoding. It is equipped with a Glyph Encoder text module, significantly improving accurate text rendering within images."
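The figures in that description can be combined into a rough size estimate. The following is a minimal sketch, not anything published by Zhipu: the `Stage` class and helper names are hypothetical, "1K" is assumed to mean 1,024 tokens, and the 2-bytes-per-parameter figure (FP16/BF16 storage) is an assumption, since the excerpt does not state a storage precision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One component of the two-stage architecture (hypothetical helper type)."""
    name: str
    params_billions: float


# Parameter counts as quoted from the Hugging Face description.
AR_GENERATOR = Stage("autoregressive generator", 9.0)
DIFFUSION_DECODER = Stage("diffusion decoder", 7.0)

COMPACT_TOKENS = 256            # "approximately 256 tokens" in the quote
K = 1024                        # assumption: "1K" means 1,024 tokens
EXPANDED_TOKENS = (1 * K, 4 * K)  # "1K-4K tokens" in the quote


def total_params_billions(*stages: Stage) -> float:
    """Sum the parameter counts of the given stages, in billions."""
    return sum(stage.params_billions for stage in stages)


def expansion_factors(compact: int, expanded: tuple[int, int]) -> tuple[float, float]:
    """Return how many times larger the expanded sequence is (low end, high end)."""
    low, high = expanded
    return low / compact, high / compact


def weight_gigabytes(params_billions: float, bytes_per_param: int = 2) -> float:
    """Estimate weight storage in GB (10^9 bytes).

    Assumption: 2 bytes per parameter, i.e. FP16/BF16 weights.
    """
    return params_billions * bytes_per_param


if __name__ == "__main__":
    total = total_params_billions(AR_GENERATOR, DIFFUSION_DECODER)
    low, high = expansion_factors(COMPACT_TOKENS, EXPANDED_TOKENS)
    print(f"Combined parameters: {total:.0f}B")
    print(f"Token expansion: {low:.0f}x to {high:.0f}x")
    print(f"Estimated weights at 2 bytes/param: {weight_gigabytes(total):.0f} GB")
```

Under those assumptions the two components total 16B parameters, and the compact encoding grows by a factor of 4 to 16 before decoding. The memory estimate covers weights only and ignores activations and any optimizer state.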
"The post also states that Z.ai developed the model using the Ascend Atlas 800T A2, a Huawei server that can run four Kunpeng 920 processors packing either 64 or 48 cores. The servers also use Huawei's Ascend 910 AI processors. The most recent Ascend model is 2025's 910C, which Huawei claims "can achieve around 800 TFLOPS of computing power per card at FP16 precision, which is approximately 80% of the computing power of NVIDIA's H100 chip (launched in 2022).""
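The comparison in Huawei's quoted claim can be checked with simple arithmetic. This is a back-of-envelope sketch only: the function name is hypothetical, and the result is the H100 figure implied by the quote, not a number taken from Nvidia's specifications.

```python
# Figures taken from the quoted Huawei claim.
ASCEND_910C_FP16_TFLOPS = 800.0   # "around 800 TFLOPS ... at FP16 precision"
CLAIMED_FRACTION_OF_H100 = 0.80   # "approximately 80%" of an H100


def implied_baseline_tflops(value_tflops: float, fraction: float) -> float:
    """Return the baseline throughput implied by 'value is fraction of baseline'."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")
    return value_tflops / fraction


if __name__ == "__main__":
    implied = implied_baseline_tflops(ASCEND_910C_FP16_TFLOPS, CLAIMED_FRACTION_OF_H100)
    print(f"Implied H100 FP16 throughput: about {implied:.0f} TFLOPS")
```

Both inputs are hedged in the quote ("around", "approximately"), so the implied figure of roughly 1,000 TFLOPS should be read as an order-of-magnitude check, not a benchmark.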
Zhipu AI says it developed GLM-Image on Huawei Ascend Atlas 800T A2 servers with Ascend 910 AI processors, and claims the entire pipeline ran on Chinese hardware. GLM-Image uses an 'autoregressive + diffusion decoder' hybrid architecture to jointly generate images and language. The autoregressive generator is a 9B-parameter model initialized from GLM-4-9B-0414 with an expanded vocabulary for visual tokens; it first produces a compact encoding of approximately 256 tokens, then expands it to 1K–4K tokens for high-resolution image outputs. The diffusion decoder is a 7B-parameter latent-space image decoder based on a single-stream DiT architecture, and it includes a Glyph Encoder text module to improve text rendering inside images. Zhipu positions the model as an advance over Nano Banana Pro.
Read at Theregister