The field of multimodal learning is moving toward greater interpretability, with a focus on understanding and explaining the alignments between modalities such as vision and language. Recent research shows that sparse autoencoders can be effective for interpreting and improving these alignments, and that incorporating clinical knowledge and structured observations can yield more transparent and trustworthy AI systems. Noteworthy papers include VL-SAE, which proposes a sparse autoencoder that encodes vision-language representations into a unified concept set, and CXR-LanIC, which introduces a framework for interpretable chest X-ray diagnosis through task-aligned pattern discovery. MedSAE and ZETA demonstrate the potential of sparse autoencoders and structured clinical knowledge for medical vision and ECG diagnosis, respectively, while CAVE provides a benchmark for detecting and explaining commonsense anomalies in visual environments, highlighting both the challenges and the opportunities in this area.
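To make the sparse-autoencoder idea concrete, the sketch below trains a minimal overcomplete autoencoder with an L1 sparsity penalty on pooled multimodal embeddings, so that each latent unit can be read as a candidate concept shared across vision and language. The architecture, dimensions, and training loop are illustrative assumptions only and do not reproduce the specific designs of VL-SAE or MedSAE.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder: maps d-dim embeddings to k >> d sparse concept activations."""

    def __init__(self, embed_dim: int, num_concepts: int):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, num_concepts)
        self.decoder = nn.Linear(num_concepts, embed_dim, bias=False)

    def forward(self, x: torch.Tensor):
        # ReLU keeps concept activations non-negative and encourages sparse codes
        z = F.relu(self.encoder(x))
        x_hat = self.decoder(z)
        return x_hat, z


def sae_loss(x, x_hat, z, l1_weight: float = 1e-3):
    # Reconstruction error plus an L1 penalty that drives most concept activations to zero
    return F.mse_loss(x_hat, x) + l1_weight * z.abs().mean()


if __name__ == "__main__":
    # Hypothetical setup: 512-dim pooled embeddings from a CLIP-style encoder,
    # expanded into 4096 candidate concepts (values chosen only for illustration)
    embed_dim, num_concepts = 512, 4096
    sae = SparseAutoencoder(embed_dim, num_concepts)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

    embeddings = torch.randn(256, embed_dim)  # stand-in for real vision/language embeddings
    for _ in range(10):
        x_hat, z = sae(embeddings)
        loss = sae_loss(embeddings, x_hat, z)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Because image and text embeddings would share the same decoder dictionary,
    # an active unit can be interpreted as a concept aligned across both modalities
    print(f"active concepts per sample: {(z > 0).float().sum(dim=1).mean():.1f}")
```

In practice, the embeddings would come from a frozen vision-language backbone rather than random tensors, and the learned decoder columns would be inspected (e.g., by retrieving the inputs that most strongly activate each unit) to assign human-readable concept labels.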