The field of generative visual models is shifting towards diffusion-based architectures, which show promise for improving training efficiency, inference speed, and transferability to broader vision tasks. This shift is driven by the need to move beyond traditional variational autoencoders (VAEs) and generative adversarial networks (GANs), whose representations offer limited semantic separation and discriminative structure. Recent work has focused on latent diffusion models that leverage self-supervised representations and distribute the representational burden across layers, enabling more efficient learning and improved generative quality. Notable papers in this area include "Latent Diffusion Model without Variational Autoencoder," which replaces the VAE with self-supervised representations as the latent space for visual generation, and "Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge," which proposes a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models.
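To make the core idea concrete, below is a minimal sketch, not any paper's actual implementation, of running DDPM-style diffusion in the feature space of a frozen self-supervised encoder rather than a VAE latent space. The `FrozenEncoder`, `Denoiser`, and all hyperparameters here are illustrative assumptions; in practice the encoder would be a pretrained self-supervised model (e.g. DINO-style features) and a learned decoder would map generated latents back to pixels.

```python
# Minimal sketch: diffusion in a frozen self-supervised representation space
# instead of a VAE latent space. All module names and hyperparameters are
# illustrative stand-ins, not the configuration of any cited paper.

import torch
import torch.nn as nn

T = 1000                                            # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)      # cumulative signal level

class FrozenEncoder(nn.Module):
    """Stand-in for a frozen self-supervised encoder (e.g. DINO features)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
        for p in self.parameters():
            p.requires_grad_(False)                  # representation is not trained

    def forward(self, x):
        return self.net(x)

class Denoiser(nn.Module):
    """Small MLP that predicts the added noise from (noisy latent, timestep)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 512), nn.SiLU(),
            nn.Linear(512, 512), nn.SiLU(),
            nn.Linear(512, dim),
        )

    def forward(self, z_t, t):
        t_feat = (t.float() / T).unsqueeze(-1)       # crude timestep embedding
        return self.net(torch.cat([z_t, t_feat], dim=-1))

encoder, denoiser = FrozenEncoder(), Denoiser()
opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

def train_step(images):
    """One DDPM-style training step in the frozen representation space."""
    with torch.no_grad():
        z0 = encoder(images)                         # self-supervised latents, no VAE
    t = torch.randint(0, T, (z0.shape[0],))
    eps = torch.randn_like(z0)
    ab = alpha_bars[t].unsqueeze(-1)
    z_t = ab.sqrt() * z0 + (1 - ab).sqrt() * eps     # forward noising q(z_t | z_0)
    loss = ((denoiser(z_t, t) - eps) ** 2).mean()    # standard noise-prediction loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Example: one step on a random batch standing in for real image data.
print(train_step(torch.randn(8, 3, 32, 32)))
```

The design point the sketch illustrates is that only the denoiser is trained: the latent space is inherited from self-supervised pretraining rather than learned jointly through a VAE reconstruction objective, which is what gives these models their claimed gains in semantic structure and training efficiency.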