The field of autonomous aerial systems is moving towards more advanced and realistic benchmarking of AI models. Researchers are building comprehensive benchmarks that assess the capabilities of large language models and vision-language models in complex scenarios such as multi-drone collaborative perception and UAV navigation, evaluating performance under realistic operational conditions, including degraded perception and dynamic environments. These benchmarks are expected to drive the development of next-generation UAV reasoning intelligence.

Notable papers include UAVBench, which introduces an open benchmark dataset for autonomous and agentic AI UAV systems, and AirCopBench, which provides a comprehensive benchmark for multi-drone collaborative embodied perception and reasoning. Other work, such as From Synthetic Scenes to Real Performance and Is your VLM Sky-Ready, highlights the importance of fine-tuning and evaluating vision-language models in UAV scenarios.