
Toward Engineering AGI: Benchmarking the Engineering Design Capabilities of LLMs

Xingang Guo*†, Yaxin Li*, Xiangyi Kong*, Yilan Jiang*, Xiayu Zhao*, Zhihua Gong*, Yufan Zhang*,
Daixuan Li, Tianle Sang, Beixiao Zhu, Gregory Jun, Yingbing Huang, Yiqi Liu, Yuqi Xue, Rahul Dev Kundu, Qi Jian Lim, Yizhou Zhao, Luke Alexander Granger, Mohamed Badr Younis, Darioush Keivan,
Nippun Sabharwal, Shreyanka Sinha, Prakhar Agarwal, Kojo Vandyck, Hanlin Mai, Zichen Wang, Aditya Venkatesh, Ayush Barik, Jiankun Yang, Chongying Yue, Jingjie He, Libin Wang, Licheng Xu,
Hao Chen, Jinwen Wang, Liujun Xu, Rushabh Shetty, Ziheng Guo, Dahui Song, Manvi Jha, Weijie Liang, Weiman Yan, Bryan Zhang, Sahil Bhandary Karnoor, Jialiang Zhang, Rutva Pandya,
Xinyi Gong, Mithesh Ballae Ganesh, Feize Shi, Ruiling Xu, Yifan Zhang,
Yanfeng Ouyang, Lianhui Qin, Elyse Rosenbaum, Corey Snyder, Peter Seiler, Geir Dullerud, Xiaojia Shelly Zhang, Zuofu Cheng, Pavan Kumar Hanumolu, Jian Huang,
Mayank Kulkarni, Mahdi Namazifar, Huan Zhang, Bin Hu†

EngDesign Team

*Core Contributors
†Correspondence to: xingang2@illinois.edu, binhu7@illinois.edu
Benchmark overview

Comparison between conventional QA-style benchmarks (left) and the design-style benchmark EngDesign (right). Conventional QA benchmarks evaluate LLMs through static answer extraction and string-matching, while EngDesign involves open-ended design tasks with potentially non-unique solutions. LLMs must propose candidate design specifications, which are evaluated via program-based simulations and performance validation pipelines.

🔔 News

2025-09: Thrilled to share that EngDesign has been officially accepted to the NeurIPS 2025 Datasets & Benchmarks Track! 🎉

2025-06: EngDesign website is live! 🎉

Introduction

Engineering design represents a fundamentally different challenge for AI compared to traditional problem solving. While existing benchmarks focus on factual recall and textbook-style questions, real-world engineering design demands synthesis of domain knowledge, navigation of complex trade-offs, and management of practical design processes.

We introduce EngDesign, a benchmark that evaluates AI systems' abilities to perform practical engineering design tasks across nine domains: Operating System Design, Computer Architecture Design, Control System Design, Mechanical Systems, Structural Design, Digital Hardware Design, Analog Integrated Circuit Design, Robotics, and Signal Processing.

EngDesign pioneers a simulation-based evaluation paradigm in which AI-generated designs undergo rigorous testing through executable, domain-specific simulations, from SPICE circuit simulation to structural finite element analysis. This moves evaluation beyond textbook knowledge and assesses genuine engineering capability through dynamic, simulation-driven functional verification.

EngDesign

Overview

EngDesign is a comprehensive multi-domain benchmark that evaluates the capabilities of Large Language Models (LLMs) in real-world engineering design tasks. Unlike conventional question-answering benchmarks, EngDesign employs a rigorous simulation-based evaluation pipeline to assess model performance in practical, design-oriented scenarios. The benchmark comprises 101 design tasks spanning 9 engineering domains, with a total of 473 gradable items. With an average prompt length of 778.71 tokens, substantially longer than in typical QA benchmarks, EngDesign captures the contextual richness and complexity of realistic engineering design problems.


Key features of the EngDesign benchmark:

  • Task Distribution: 48 tasks require domain-specific scientific software (MATLAB, Cadence), while 53 tasks are fully open-sourced with manually authored evaluation scripts.
  • EngDesign-OPEN: A consolidated subset of 53 fully open-sourced tasks that supports broader community adoption without licensing constraints.
  • Multimodal Content: 23 tasks incorporate images as part of the task input to LLMs, enhancing the complexity and realism of design challenges.

Each EngDesign task consists of the following four key components (a schematic sketch follows the list):

  • Task Description: A comprehensive query prompt that clearly defines the engineering design problem, including design objectives, specifications, and constraints.
  • Evaluation Rubrics: Detailed assessment criteria that break down each task into multiple gradable items, enabling partial credit scoring with a maximum of 100 points.
  • Evaluation Pipeline: Automated evaluation scripts that assess LLM-generated designs and return pass/fail indicators, numerical scores, and detailed evaluation logs.
  • Reference Design: A validated reference solution that fully satisfies all specified requirements, ensuring the feasibility and realism of each design challenge.
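
To make this anatomy concrete, the sketch below bundles the four components into a single grading call. It is a minimal illustration under an assumed schema: the names EngTask, rubric, evaluation_pipeline, and the report fields are hypothetical and do not reflect the benchmark's released data format.

    from dataclasses import dataclass
    from typing import Callable, Dict

    # Hypothetical container for one EngDesign task; the field names are
    # illustrative assumptions, not the benchmark's released schema.
    @dataclass
    class EngTask:
        task_description: str                       # full design prompt given to the LLM
        rubric: Dict[str, int]                      # gradable item -> points (totals 100)
        evaluation_pipeline: Callable[[str], Dict]  # runs simulations on a candidate design
        reference_design: str                       # validated solution meeting all specs

    def grade(task: EngTask, candidate_design: str) -> Dict:
        """Run the task's evaluation pipeline and assemble a score report."""
        results = task.evaluation_pipeline(candidate_design)  # per-item pass/fail flags
        earned = sum(pts for item, pts in task.rubric.items() if results.get(item, False))
        return {
            "score": earned,                                   # partial credit out of 100
            "passed": earned == sum(task.rubric.values()),
            "log": results,                                    # detailed evaluation log
        }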


Experiment Results

Leaderboard

We evaluate frontier models on EngDesign, including both general-purpose chat models and reasoning models. Each task is evaluated over three independent trials per model using three primary metrics (a scoring sketch follows the list):

  • Average Pass Rate (APR): Percentage of tasks that models correctly solve
  • Average Score (AS): Average numerical score across all tasks
  • Reasoning Robustness (RR): Model robustness across different reasoning tasks
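
The short sketch below shows one way APR and AS could be computed from per-trial results; the record layout is an assumption made for illustration, and RR is omitted here because its exact formula is not given above.

    import statistics

    # Hypothetical per-task records: three trials per model, each with a 0-100
    # score and a pass flag. The layout is an assumption, not EngDesign's format.
    results = {
        "control_task_01":   [{"score": 80, "passed": False},
                              {"score": 100, "passed": True},
                              {"score": 100, "passed": True}],
        "analog_ic_task_07": [{"score": 40, "passed": False},
                              {"score": 55, "passed": False},
                              {"score": 45, "passed": False}],
    }

    trials = [t for task_trials in results.values() for t in task_trials]

    # Average Pass Rate (APR): fraction of (task, trial) runs that pass.
    apr = 100 * sum(t["passed"] for t in trials) / len(trials)

    # Average Score (AS): mean numerical score across all trials.
    avg_score = statistics.mean(t["score"] for t in trials)

    print(f"APR = {apr:.1f}%, AS = {avg_score:.1f}")   # APR = 50.0%, AS = 70.0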



Last updated: 2025-06-18


Overall results of different models on the EngDesign leaderboard. The best-performing model in each category is shown in bold, and the second-best is underlined.

Iterative Design

To emulate the workflow of human engineers, we implement an iterative design protocol that allows LLMs to refine their solutions based on feedback from previous attempts. The LLM is provided with its previous design output along with corresponding evaluation results—including scores, performance metrics, and diagnostic logs—and is then prompted to generate an improved design in the subsequent iteration.
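
A minimal sketch of this feedback loop, under assumed interfaces, is given below. The callables query_llm and evaluate_design are hypothetical stand-ins for the model API and a task's grading pipeline, and the prompt wording is illustrative rather than the actual template.

    from typing import Callable, Dict

    def iterative_design(
        task_prompt: str,
        query_llm: Callable[[str], str],         # hypothetical model API wrapper
        evaluate_design: Callable[[str], Dict],  # hypothetical task grading pipeline
        max_iters: int = 10,
    ):
        """Propose a design, evaluate it, and feed the results back into the next prompt."""
        prev_design, prev_report = None, None
        for _ in range(max_iters):
            if prev_design is None:
                prompt = task_prompt
            else:
                prompt = (
                    f"{task_prompt}\n\nYour previous design:\n{prev_design}\n\n"
                    f"Evaluation results (score, metrics, logs):\n{prev_report['log']}\n\n"
                    "Generate an improved design that addresses these issues."
                )
            design = query_llm(prompt)        # LLM under test proposes a candidate design
            report = evaluate_design(design)  # simulation-based grading: score, pass flag, log
            if report["passed"]:              # stop early once all specifications are met
                return design, report
            prev_design, prev_report = design, report
        return prev_design, prev_report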

As shown in the figure below, model performance consistently improves with additional iterations. Notably, o3 achieves nearly a 60% pass rate after ten iterations, demonstrating the effectiveness of iterative refinement. However, we also observe that iterative design does not help in all cases—for example, in Analog IC design tasks, models still fail to meet requirements even after multiple iterations.

Iterative design performance improvement

Average pass rate of GPT-4o, o1, o3, and o4-mini with the iterative design setup.

Error Analysis

To understand the failure modes of LLMs on engineering design tasks, we conducted a comprehensive error analysis of o4-mini's responses to 70 failed tasks. Given the complexity of engineering design problems, many responses exhibited multiple failure modes, leading us to allow multi-label assignments per task.
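
As an illustration of this multi-label bookkeeping, the sketch below tallies per-task error labels into an overall distribution; the task IDs and label names are placeholders, not EngDesign's actual error taxonomy.

    from collections import Counter

    # Placeholder annotations: each failed task receives one or more error labels.
    annotations = {
        "task_012": ["constraint_violation", "unit_error"],
        "task_047": ["infeasible_topology"],
        "task_063": ["constraint_violation"],
    }

    # Multi-label counting: a task contributes one count per assigned label, so the
    # number of annotations can exceed the number of failed tasks.
    error_distribution = Counter(label for labels in annotations.values() for label in labels)

    print(error_distribution.most_common())
    print(f"{sum(error_distribution.values())} annotations across {len(annotations)} failed tasks")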

Our analysis identified 111 distinct error types across the failed responses. The distribution of these errors is illustrated in the figure below, providing insights into the most common challenges that LLMs face when tackling engineering design problems.

Error type distribution analysis

Distribution of 111 annotated error types for o4-mini on EngDesign tasks.

Error Type Examples

Below are representative examples of different error types identified in our analysis, showcasing the diverse challenges that LLMs face when attempting engineering design tasks.

BibTeX


@article{guo2025engdesign,
  title={Toward Engineering AGI: Benchmarking the Engineering Design Capabilities of LLMs},
  author={Guo, Xingang and Li, Yaxin and Kong, Xiangyi and Jiang, Yilan and Zhao, Xiayu and Gong, Zhihua and Zhang, Yufan and Li, Daixuan and Sang, Tianle and Zhu, Beixiao and Jun, Gregory and Huang, Yingbing and Liu, Yiqi and Xue, Yuqi and Kundu, Rahul Dev and Lim, Qi Jian and Zhao, Yizhou and Granger, Luke Alexander and Younis, Mohamed Badr and Keivan, Darioush and Sabharwal, Nippun and Sinha, Shreyanka and Agarwal, Prakhar and Vandyck, Kojo and Mai, Hanlin and Wang, Zichen and Venkatesh, Aditya and Barik, Ayush and Yang, Jiankun and Yue, Chongying and He, Jingjie and Wang, Libin and Xu, Licheng and Chen, Hao and Wang, Jinwen and Xu, Liujun and Shetty, Rushabh and Guo, Ziheng and Song, Dahui and Jha, Manvi and Liang, Weijie and Yan, Weiman and Zhang, Bryan and Karnoor, Sahil Bhandary and Zhang, Jialiang and Pandya, Rutva and Gong, Xinyi and Ganesh, Mithesh Ballae and Shi, Feize and Xu, Ruiling and Zhang, Yifan and Ouyang, Yanfeng and Qin, Lianhui and Rosenbaum, Elyse and Snyder, Corey and Seiler, Peter and Dullerud, Geir and Zhang, Xiaojia Shelly and Cheng, Zuofu and Hanumolu, Pavan Kumar and Huang, Jian and Kulkarni, Mayank and Namazifar, Mahdi and Zhang, Huan and Hu, Bin},
  journal={arXiv preprint arXiv:2509.16204},
  year={2025}
}