Capabilities of Large Language Models in Control Engineering: A Benchmark Study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra

1University of Illinois Urbana-Champaign 2University of Michigan 3Allen Institute for AI 4University of California San Diego

Abstract

In this paper, we explore the capabilities of state-of-the-art large language models (LLMs) such as GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra in solving undergraduate-level control problems. Controls provides an interesting case study for LLM reasoning due to its combination of mathematical theory and engineering design. We introduce ControlBench, a benchmark dataset tailored to reflect the breadth, depth, and complexity of classical control design. We use this dataset to study and evaluate the problem-solving abilities of these LLMs in the context of control engineering. We present evaluations conducted by a panel of human experts, providing insights into the accuracy, reasoning, and explanatory prowess of LLMs in control engineering. Our analysis reveals the strengths and limitations of each LLM in the context of classical control, and our results imply that Claude 3 Opus has become the state-of-the-art LLM for solving undergraduate control problems. Our study serves as an initial step towards the broader goal of employing artificial general intelligence in control engineering.

ControlBench Dataset

We introduce ControlBench, a benchmark dataset tailored to reflect the breadth, depth, and complexity of classical control design. The statistics of ControlBench for each sub-topic are summarized below.

  • ControlBench comprises 147 problems sourced from exercises in Schaum's Outline of Feedback and Control Systems and from undergraduate control courses at the University of Michigan (EECS 460) and the University of Illinois Urbana-Champaign (ECE 486).
  • Both textual and visual elements are included in ControlBench to mirror the multifaceted nature of real-world applications.
  • We also introduce ControlBench-C, a variant of ControlBench obtained by reformulating 100 open-ended problems into multiple-choice questions for efficient evaluation. A sketch of a possible record layout follows this list.
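For concreteness, a ControlBench-style record might be organized along the following lines. This is only an illustrative sketch: the field names (problem_id, topic, statement, figure, choices, and so on) are our own assumptions, not the released dataset format.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ControlBenchProblem:
    """One open-ended ControlBench problem (field names are illustrative)."""
    problem_id: str                 # e.g., "root-locus-07"
    topic: str                      # sub-topic, e.g., "Root-Locus Design"
    statement: str                  # problem text
    figure: Optional[str] = None    # path to an accompanying diagram, if any
    reference_solution: str = ""    # expert solution used during human grading

@dataclass
class ControlBenchCProblem:
    """The ControlBench-C variant of the same problem, recast as multiple choice."""
    problem_id: str
    statement: str
    choices: List[str] = field(default_factory=list)  # e.g., ["(A) ...", "(B) ...", "(C) ...", "(D) ..."]
    answer: str = ""                                   # correct choice label, e.g., "B"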

Below we show an example problem from ControlBench together with its multiple-choice counterpart in ControlBench-C.

Evaluations of Leading LLMs on ControlBench

We assess the capabilities of GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra on ControlBench using a zero-shot prompting strategy followed by self-checking, and we examine the responses through human annotation. Below we present our zero-shot prompt and self-checking prompt.
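The sketch below shows one way this two-stage protocol could be wired up. The query_llm helper and the prompt wording are placeholders for illustration; they are not the exact prompts used in our evaluation.

def query_llm(prompt: str) -> str:
    """Placeholder for a call to GPT-4, Claude 3 Opus, or Gemini 1.0 Ultra."""
    raise NotImplementedError

def solve_with_self_check(problem_statement: str) -> tuple[str, str]:
    # Stage 1: zero-shot attempt -- the model sees only the problem.
    zero_shot_prompt = (
        "Solve the following undergraduate control problem step by step.\n\n"
        + problem_statement
    )
    first_answer = query_llm(zero_shot_prompt)

    # Stage 2: self-checking -- the model reviews its own solution and may revise it.
    self_check_prompt = (
        "Check the following solution for reasoning and calculation errors, "
        "then give a final answer.\n\nProblem:\n" + problem_statement
        + "\n\nProposed solution:\n" + first_answer
    )
    revised_answer = query_llm(self_check_prompt)

    # Both answers are graded by human experts (ACC uses the first, ACC-s the second).
    return first_answer, revised_answer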

Evaluation Results

Evaluation Metrics: Accuracy (ACC) is the proportion of problems an LLM solves correctly on its first attempt, while Self-Checked Accuracy (ACC-s) is the proportion solved correctly after the model reviews and, where needed, revises its initial answer. The best results for each metric are highlighted in bold.
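In other words, both metrics are plain proportions over the set of graded problems. A minimal sketch, assuming each problem has been marked correct or incorrect before and after self-checking:

def accuracy_metrics(graded):
    """graded: list of (correct_zero_shot, correct_after_self_check) booleans, one pair per problem."""
    n = len(graded)
    acc = sum(first for first, _ in graded) / n       # ACC: zero-shot accuracy
    acc_s = sum(second for _, second in graded) / n   # ACC-s: accuracy after self-checking
    return acc, acc_s

# Example with 3 problems: the model fixes one of its two mistakes during self-checking.
print(accuracy_metrics([(True, True), (False, True), (False, False)]))  # (0.33..., 0.66...)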

  • Claude 3 Opus emerges as the standout model, demonstrating superior performance in both ACC and ACC-s.
  • GPT-4 displays competitive accuracy in specific areas, such as Block Diagrams, Root-Locus Design, and System Sensitivity Measures, but does not match Claude 3 Opus in overall performance.
  • Gemini 1.0 Ultra trails both GPT-4 and Claude 3 Opus in overall performance.

Failure Modes Analysis

Despite their great potential, LLMs can fail in many different ways. The following figure categorizes the LLM failures into seven types and highlights the proportion (%) of each type for GPT-4 and Claude 3 Opus, respectively.

  • The biggest bottleneck preventing GPT-4 from achieving better accuracy on ControlBench is its limited reasoning capability.
  • The performance of Claude 3 Opus can be further boosted by improving its calculation abilities or by letting it call external tools; a numeric-checking sketch follows below.
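For example, many of the calculation errors we observe could be caught by re-evaluating the model's symbolic result numerically. The sketch below is one such check (NumPy is our choice here, not something the benchmark prescribes): it verifies a hand-computed characteristic polynomial for a unity-feedback loop.

import numpy as np

# Unity-feedback loop with plant G(s) = K / (s (s + 2)) and gain K = 5.
# A hand calculation gives the characteristic polynomial s^2 + 2s + 5,
# i.e., closed-loop poles at -1 +/- 2j.
K = 5.0
char_poly = [1.0, 2.0, K]        # coefficients of s^2 + 2s + K
poles = np.roots(char_poly)

print("closed-loop poles:", poles)              # [-1.+2.j, -1.-2.j]
print("stable:", bool(np.all(poles.real < 0)))  # True: both poles lie in the open left half-plane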

Examples

Here are sampled responses from GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra for selected problems in ControlBench:

[Figures 1–6: sampled LLM responses]

BibTeX

@article{Kevian2024ControlBench,
  author    = {Kevian, Darioush and Syed, Usman and Guo, Xingang and Havens, Aaron and Dullerud, Geir and Seiler, Peter and Qin, Lianhui and Hu, Bin},
  title     = {Capabilities of Large Language Models in Control Engineering: A Benchmark Study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra},
  journal   = {arXiv preprint arXiv:2404.03647},
  year      = {2024},
}