GlmImageTransformer2DModel
A Diffusion Transformer model for 2D image data used by GLM-Image (reference TODO).
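The transformer can be loaded on its own with the generic `from_pretrained()` API. A minimal loading sketch, assuming a published GLM-Image checkpoint that stores the transformer weights in a `transformer` subfolder; the repository id below is a placeholder, not a real checkpoint name:

```python
import torch
from diffusers import GlmImageTransformer2DModel

# Placeholder repository id (assumption): substitute the actual GLM-Image checkpoint.
transformer = GlmImageTransformer2DModel.from_pretrained(
    "org/glm-image",            # hypothetical repo id
    subfolder="transformer",    # assumes weights live in a "transformer" subfolder
    torch_dtype=torch.bfloat16,
)
```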
GlmImageTransformer2DModel
class diffusers.GlmImageTransformer2DModel
( patch_size: int = 2 in_channels: int = 16 out_channels: int = 16 num_layers: int = 30 attention_head_dim: int = 40 num_attention_heads: int = 64 text_embed_dim: int = 1472 time_embed_dim: int = 512 condition_dim: int = 256 prior_vq_quantizer_codebook_size: int = 16384 )
Parameters
- patch_size (int, defaults to 2) — The size of the patches to use in the patch embedding layer.
- in_channels (int, defaults to 16) — The number of channels in the input.
- num_layers (int, defaults to 30) — The number of layers of Transformer blocks to use.
- attention_head_dim (int, defaults to 40) — The number of channels in each head.
- num_attention_heads (int, defaults to 64) — The number of heads to use for multi-head attention.
- out_channels (int, defaults to 16) — The number of channels in the output.
- text_embed_dim (int, defaults to 1472) — Input dimension of text embeddings from the text encoder.
- time_embed_dim (int, defaults to 512) — Output dimension of timestep embeddings.
- condition_dim (int, defaults to 256) — The embedding dimension of the input SDXL-style resolution conditions (original_size, target_size, crop_coords).
- prior_vq_quantizer_codebook_size (int, defaults to 16384) — The codebook size of the prior VQ quantizer.
- pos_embed_max_size (int, defaults to 128) — The maximum resolution of the positional embeddings, from which slices of shape H x W are taken and added to the input patched latents, where H and W are the latent height and width respectively. A value of 128 means that the maximum supported height and width for image generation is 128 * vae_scale_factor * patch_size => 128 * 8 * 2 => 2048.
- sample_size (int, defaults to 128) — The base resolution of the input latents. If height/width is not provided during generation, this value is used to determine the resolution as sample_size * vae_scale_factor => 128 * 8 => 1024.
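As a quick check of the resolution bounds described above, the class can be instantiated directly from its documented defaults. A minimal sketch, assuming the standard 8x spatial VAE downsampling factor implied by the formulas above; `num_layers` is reduced to 1 only to keep the random initialization cheap:

```python
from diffusers import GlmImageTransformer2DModel

# Randomly initialized model; a single layer keeps the sketch lightweight.
model = GlmImageTransformer2DModel(num_layers=1)

vae_scale_factor = 8                  # 8x spatial VAE downsampling (from the formulas above)
patch_size = model.config.patch_size  # 2, registered from the constructor arguments

# Default generation resolution when height/width are omitted (sample_size = 128):
default_res = 128 * vae_scale_factor           # 1024
# Maximum supported resolution (pos_embed_max_size = 128):
max_res = 128 * vae_scale_factor * patch_size  # 2048
print(default_res, max_res)  # 1024 2048
```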