The field of music and video generation is advancing rapidly, with a focus on improving control and realism in generated content. Researchers are developing techniques that preserve the temporal structure of source music during editing and that provide precise motion control in video generation. Attention mechanisms, diffusion models, and hierarchical conditional models are increasingly prominent, enabling more accurate modification of musical characteristics, finer motion control, and tighter integration of visual features in video-to-music generation. Noteworthy papers include Melodia, which presents a training-free technique for music editing that preserves the temporal structure of the source music; Time-to-Move, which introduces a training-free framework for motion- and appearance-controlled video generation; and Diff-V2M, which proposes a hierarchical conditional diffusion model for video-to-music generation with explicit rhythmic modeling.
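To make the idea of conditioning music generation on both visual and rhythmic cues concrete, the toy sketch below shows a generic epsilon-prediction denoiser and a DDPM-style sampling loop in which an audio latent is generated per video frame, conditioned on frame-level visual features and a beat-strength envelope. This is not the Diff-V2M architecture; all class names, dimensions, and the `ToyConditionalDenoiser`/`sample` functions are hypothetical and intended only to illustrate the general hierarchical-conditioning pattern, assuming PyTorch is available.

```python
# Hypothetical sketch of conditional diffusion sampling for video-to-music generation.
# Names and dimensions are illustrative assumptions, not taken from any cited paper.
import torch
import torch.nn as nn

class ToyConditionalDenoiser(nn.Module):
    """Predicts noise in an audio latent sequence, conditioned on video and rhythm cues."""
    def __init__(self, latent_dim=64, video_dim=128, hidden=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)   # high-level visual semantics
        self.rhythm_proj = nn.Linear(1, hidden)          # low-level beat/onset envelope
        self.time_proj = nn.Linear(1, hidden)            # diffusion timestep embedding
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 3 * hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, x_t, t, video_feat, rhythm):
        # x_t: (B, T, latent_dim); video_feat: (B, T, video_dim); rhythm: (B, T, 1)
        cond = torch.cat([
            self.video_proj(video_feat),
            self.rhythm_proj(rhythm),
            self.time_proj(t.expand(x_t.shape[0], x_t.shape[1], 1)),
        ], dim=-1)
        return self.net(torch.cat([x_t, cond], dim=-1))

@torch.no_grad()
def sample(model, video_feat, rhythm, steps=50, latent_dim=64):
    """Ancestral DDPM-style sampling of an audio latent aligned with the video frames."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    B, T, _ = video_feat.shape
    x = torch.randn(B, T, latent_dim)                    # start from pure noise
    for i in reversed(range(steps)):
        t = torch.full((1, 1, 1), i / steps)
        eps = model(x, t, video_feat, rhythm)            # predicted noise
        # Standard epsilon-parameterised posterior-mean update.
        coef = betas[i] / torch.sqrt(1.0 - alpha_bars[i])
        mean = (x - coef * eps) / torch.sqrt(alphas[i])
        noise = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[i]) * noise
    return x

if __name__ == "__main__":
    model = ToyConditionalDenoiser()
    video_feat = torch.randn(2, 100, 128)  # 100 frames of visual features (assumed shape)
    rhythm = torch.rand(2, 100, 1)         # beat-strength envelope aligned to frames (assumed)
    audio_latent = sample(model, video_feat, rhythm)
    print(audio_latent.shape)              # torch.Size([2, 100, 64])
```

In a real system the two conditioning streams would typically enter at different levels of the model (for example, visual semantics via cross-attention and rhythm via frame-aligned addition), which is the intuition behind calling such conditioning "hierarchical"; the flat concatenation above is only a simplification.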