The field of spoken language modeling is moving toward more advanced approaches, with a focus on bridging the acoustic-semantic gap and building more effective models for speech-to-speech and speech-to-text translation. Researchers are exploring methods such as generative modeling and cross-modal knowledge distillation to improve the performance of spoken language models. Noteworthy papers include GmSLM, which introduces a novel pipeline for modeling Marmoset vocal communication and demonstrates its advantage over basic human-speech-based baselines, and EchoX, which mitigates the acoustic-semantic gap via echo training for speech-to-speech LLMs and achieves strong results on multiple knowledge-based question-answering benchmarks.
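To make the cross-modal knowledge distillation idea mentioned above concrete, the following is a minimal, generic sketch: a speech "student" encoder is trained so that its utterance embedding matches the embedding a frozen text "teacher" produces for the corresponding transcript. All module names, dimensions, and the cosine-based loss are illustrative assumptions, not the method of GmSLM or EchoX.

```python
# Hedged sketch of cross-modal knowledge distillation (speech student <- text teacher).
# Names and shapes are placeholders; the teacher embeddings stand in for a real text encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeechStudent(nn.Module):
    """Toy speech encoder: frames of acoustic features -> one utterance vector."""

    def __init__(self, feat_dim: int = 80, hidden: int = 256, embed_dim: int = 512):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim)
        out, _ = self.rnn(feats)
        pooled = out.mean(dim=1)      # mean-pool over time
        return self.proj(pooled)      # (batch, embed_dim)


def distillation_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    # Pull the student's speech embedding toward the teacher's text embedding.
    # A cosine-based loss is one common choice; MSE on normalized vectors is similar.
    return 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()


if __name__ == "__main__":
    student = SpeechStudent()
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

    # Dummy batch: 4 utterances of 200 frames of 80-dim features, paired with
    # precomputed 512-dim text-teacher embeddings (assumed given).
    feats = torch.randn(4, 200, 80)
    teacher_emb = torch.randn(4, 512)

    loss = distillation_loss(student(feats), teacher_emb.detach())
    loss.backward()
    optimizer.step()
    print(f"distillation loss: {loss.item():.4f}")
```

In practice the teacher embeddings would come from a pretrained text model over the transcripts, so that semantic knowledge in the text space is transferred into the speech encoder.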