Advancements in Large Language Models

The field of large language models (LLMs) is evolving rapidly, with a focus on improving reliability, security, and the ability to reason about complex tasks. Recent work introduces new benchmarks and evaluation frameworks, such as GAUSS for structured mathematical skills and AECBench for domain knowledge in the architecture, engineering, and construction (AEC) field. Researchers have also proposed methods for strengthening the self-awareness and introspection of LLMs, including question-side effect quantification and semantic compression techniques. Noteworthy papers include 'Quantifying Self-Awareness of Knowledge in Large Language Models', which disentangles question-side shortcuts from true model-side introspection, and 'Beyond Pointwise Scores', which decomposes LLM responses into individual criteria and evaluates their precision and recall rather than assigning a single overall score.
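To make the decomposed-evaluation idea concrete, the sketch below scores a response by comparing atomic claims rather than whole answers. This is only an illustration of the general precision/recall decomposition, assuming claims have already been extracted and that a simple `match` predicate decides support; it is not the pipeline from 'Beyond Pointwise Scores'.

```python
# Minimal sketch of decomposed precision/recall evaluation (assumed simplification,
# not the paper's actual method). Claim extraction and the match() criterion are
# placeholders supplied by the caller.

def decomposed_scores(response_claims, reference_claims, match):
    """Score an LLM response at the claim level.

    response_claims:  atomic claims extracted from the model's answer
    reference_claims: atomic claims extracted from the gold/reference answer
    match(a, b):      returns True if claim a is supported by claim b
    """
    if not response_claims or not reference_claims:
        return {"precision": 0.0, "recall": 0.0}

    # Precision: fraction of generated claims supported by some reference claim.
    supported = sum(
        any(match(c, r) for r in reference_claims) for c in response_claims
    )
    precision = supported / len(response_claims)

    # Recall: fraction of reference claims covered by some generated claim.
    covered = sum(
        any(match(c, r) for c in response_claims) for r in reference_claims
    )
    recall = covered / len(reference_claims)

    return {"precision": precision, "recall": recall}


if __name__ == "__main__":
    # Toy example using exact string matching as the support criterion.
    response = ["Paris is the capital of France", "France uses the euro"]
    reference = ["Paris is the capital of France", "France is in Europe"]
    print(decomposed_scores(response, reference, lambda a, b: a == b))
    # -> {'precision': 0.5, 'recall': 0.5}
```

In practice the `match` predicate would typically be an entailment model or an LLM judge; the point of the decomposition is that partial correctness shows up as separate precision and recall numbers instead of being collapsed into one pointwise score.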

Sources

Quantifying Self-Awareness of Knowledge in Large Language Models

Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses

GAUSS: Benchmarking Structured Mathematical Skills for Large Language Models

An N-Plus-1 GPT Agency for Critical Solution of Mechanical Engineering Analysis Problems

G\"odel Test: Can Large Language Models Solve Easy Conjectures?

ATLAS: Benchmarking and Adapting LLMs for Global Trade via Harmonized Tariff Code Classification

CCQA: Generating Question from Solution Can Improve Inference-Time Reasoning in SLMs

Solving Math Word Problems Using Estimation Verification and Equation Generation

AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

Memory in Large Language Models: Mechanisms, Evaluation and Evolution

Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs

Benchmarking PDF Accessibility Evaluation: A Dataset and Framework for Assessing Automated and LLM-Based Approaches for Accessibility Testing

LLMs as verification oracles for Solidity

Identifying and Addressing User-level Security Concerns in Smart Homes Using "Smaller" LLMs

Estimating the Self-Consistency of LLMs

GuessingGame: Measuring the Informativeness of Open-Ended Questions in Large Language Models

Do Before You Judge: Self-Reference as a Pathway to Better LLM Evaluation

CON-QA: Privacy-Preserving QA using cloud LLMs in Contract Domain

Integrated Framework for LLM Evaluation with Answer Generation

CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning
