2025-09: Thrilled to share that EngDesign has been officially accepted to the NeurIPS 2025 Datasets & Benchmarks Track! 🎉
2025-06: EngDesign website is live! 🎉
Engineering design represents a fundamentally different challenge for AI compared to traditional problem solving. While existing benchmarks focus on factual recall and textbook-style questions, real-world engineering design demands synthesis of domain knowledge, navigation of complex trade-offs, and management of practical design processes.
We introduce EngDesign, a benchmark that evaluates AI systems' abilities to perform practical engineering design tasks across nine domains: Operating System Design, Computer Architecture Design, Control System Design, Mechanical Systems, Structural Design, Digital Hardware Design, Analog Integrated Circuit Design, Robotics, and Signal Processing.
EngDesign pioneers a simulation-based evaluation paradigm in which AI-generated designs undergo rigorous testing through executable, domain-specific simulations, from SPICE circuit simulation to structural finite element analysis. This moves evaluation beyond textbook knowledge and assesses genuine engineering capability through dynamic, simulation-driven functional verification.
EngDesign is a comprehensive multi-domain benchmark that evaluates the capabilities of Large Language Models (LLMs) on real-world engineering design tasks. Unlike conventional question-answering benchmarks, EngDesign employs a rigorous simulation-based evaluation pipeline to assess model performance in practical, design-oriented scenarios. The benchmark comprises 101 design tasks spanning 9 engineering domains, with a total of 473 gradable items. With an average prompt length of 778.71 tokens, substantially longer than prompts in typical QA benchmarks, EngDesign captures the contextual richness and complexity of realistic engineering design problems.
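To make the simulation-driven grading concrete, here is a minimal sketch of how a single design might be checked against its gradable items. The `run_domain_simulation` stub, the spec names, and the scoring rule are illustrative assumptions, not the benchmark's actual evaluation code.

```python
from dataclasses import dataclass

@dataclass
class GradableItem:
    """One spec to verify against simulation output (names and thresholds are made up)."""
    name: str
    threshold: float
    higher_is_better: bool = True

def run_domain_simulation(design: str) -> dict[str, float]:
    """Placeholder for a domain simulator (SPICE, finite element analysis, ...)."""
    # A real pipeline would invoke an external tool on the LLM-generated design.
    return {"dc_gain_db": 42.0, "phase_margin_deg": 55.0}

def grade_design(design: str, items: list[GradableItem]) -> tuple[float, dict]:
    """Score a design as the fraction of gradable items whose simulated metric meets its spec."""
    metrics = run_domain_simulation(design)
    per_item = {}
    for item in items:
        value = metrics[item.name]
        ok = value >= item.threshold if item.higher_is_better else value <= item.threshold
        per_item[item.name] = (value, ok)
    return sum(ok for _, ok in per_item.values()) / len(items), per_item

specs = [GradableItem("dc_gain_db", 40.0), GradableItem("phase_margin_deg", 60.0)]
print(grade_design("<llm-generated netlist>", specs))
```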
Key features of the EngDesign benchmark:
Each task of EngDesign consists of the following four key components:
EngDesign example XG_05
EngDesign example TB_04
EngDesign example RK_03
EngDesign example DL_01
We evaluate frontier models on EngDesign, including both general-purpose chat models and reasoning models. Each task is evaluated over three independent trials per model using three primary metrics: Average Pass Rate (APR), Average Score (AS), and Reasoning Robustness (RR).
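As a rough illustration of how per-trial outcomes might be aggregated, the sketch below computes a pass rate and a mean score over three trials per task. The trial data is invented, the macro-averaging over tasks is an assumption, and Reasoning Robustness (RR) is omitted because its exact definition is given in the paper.

```python
from statistics import mean

# Hypothetical per-task outcomes: (passed, normalized score in [0, 1]) for three trials.
trials = {
    "CTRL_01": [(True, 1.0), (True, 0.9), (False, 0.4)],
    "AICD_02": [(False, 0.2), (False, 0.3), (False, 0.1)],
}

# Average Pass Rate: mean fraction of passing trials, averaged across tasks.
apr = mean(sum(passed for passed, _ in t) / len(t) for t in trials.values())
# Average Score: mean per-trial score, averaged across tasks.
avg_score = mean(mean(score for _, score in t) for t in trials.values())
print(f"APR = {apr:.2f}, AS = {avg_score:.2f}")
```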
Last updated: 2025-06-18
Leaderboard columns: Name and Date, followed by Average Pass Rate (APR) and Average Score (AS), each reported Overall and per domain (AICD, Arch, Ctrl, DHD, Mech, OS, Robo, SigP, Stru), plus Overall Reasoning Robustness (RR).
Overall results of different models on the EngDesign leaderboard. The best-performing model in each category is shown in bold, and the second best is underlined.
To emulate the workflow of human engineers, we implement an iterative design protocol that allows LLMs to refine their solutions based on feedback from previous attempts. The LLM is provided with its previous design output along with corresponding evaluation results—including scores, performance metrics, and diagnostic logs—and is then prompted to generate an improved design in the subsequent iteration.
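A minimal sketch of this feedback loop is shown below, assuming hypothetical `llm` and `grade` callables; the prompt format and stopping rule are placeholders rather than the protocol's actual implementation.

```python
from typing import Callable, Tuple

def iterative_design(
    task_prompt: str,
    llm: Callable[[str], str],                  # takes a prompt, returns a design
    grade: Callable[[str], Tuple[float, str]],  # returns (score, diagnostic report)
    max_iters: int = 10,
) -> Tuple[float, str]:
    """Re-prompt the model with its previous design and its evaluation results."""
    prompt, best_score, best_design = task_prompt, 0.0, ""
    for _ in range(max_iters):
        design = llm(prompt)
        score, report = grade(design)
        if score > best_score:
            best_score, best_design = score, design
        if score >= 1.0:  # every gradable item satisfied; stop early
            break
        prompt = (f"{task_prompt}\n\nPrevious design:\n{design}\n\n"
                  f"Evaluation feedback (scores, metrics, logs):\n{report}\n\n"
                  "Produce an improved design that addresses the feedback.")
    return best_score, best_design

# Toy usage with stand-in callables; a real run would call an LLM API and a simulator.
score, design = iterative_design(
    "Design a lead compensator meeting the given phase-margin spec.",
    llm=lambda p: "C(s) = 10*(s + 1)/(s + 10)",
    grade=lambda d: (0.5, "phase margin 35 deg < required 45 deg"),
)
print(score, design)
```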
As shown in the figure below, model performance consistently improves with additional iterations. Notably, o3 achieves nearly a 60% pass rate after ten iterations, demonstrating the effectiveness of iterative refinement. However, we also observe that iterative design does not help in all cases—for example, in Analog IC design tasks, models still fail to meet requirements even after multiple iterations.
Average pass rate of GPT-4o, o1, o3, and o4-mini with the iterative design setup.
To understand the failure modes of LLMs on engineering design tasks, we conducted a comprehensive error analysis of o4-mini's responses to 70 failed tasks. Given the complexity of engineering design problems, many responses exhibited multiple failure modes, leading us to allow multi-label assignments per task.
Our analysis identified 111 distinct error types across the failed responses. The distribution of these errors is illustrated in the figure below, providing insights into the most common challenges that LLMs face when tackling engineering design problems.
Distribution of 111 annotated error types for o4-mini on EngDesign tasks.
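For illustration only, a multi-label error tally can be computed as in the sketch below; the task IDs and error labels are invented and do not reflect the paper's error taxonomy.

```python
from collections import Counter

# Hypothetical annotations: each failed task may carry several error labels.
annotations = {
    "AICD_03": ["misread specification", "incorrect formula"],
    "CTRL_07": ["incorrect formula"],
    "DHD_02": ["tool misuse", "misread specification", "unit error"],
}

counts = Counter(label for labels in annotations.values() for label in labels)
total = sum(counts.values())  # total annotated errors across all failed tasks
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / total:.0%})")
```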
Below are representative examples of different error types identified in our analysis, showcasing the diverse challenges that LLMs face when attempting engineering design tasks.
Error Type Example 1
Error Type Example 2
Error Type Example 3
Error Type Example 4
Error Type Example 5
@article{guo2025engdesign,
title={Toward Engineering AGI: Benchmarking the Engineering Design Capabilities of LLMs},
author={Guo, Xingang and Li, Yaxin and Kong, Xiangyi and Jiang, Yilan and Zhao, Xiayu and Gong, Zhihua and Zhang, Yufan and Li, Daixuan and Sang, Tianle and Zhu, Beixiao and Jun, Gregory and Huang, Yingbing and Liu, Yiqi and Xue, Yuqi and Kundu, Rahul Dev and Lim, Qi Jian and Zhao, Yizhou and Granger, Luke Alexander and Younis, Mohamed Badr and Keivan, Darioush and Sabharwal, Nippun and Sinha, Shreyanka and Agarwal, Prakhar and Vandyck, Kojo and Mai, Hanlin and Wang, Zichen and Venkatesh, Aditya and Barik, Ayush and Yang, Jiankun and Yue, Chongying and He, Jingjie and Wang, Libin and Xu, Licheng and Chen, Hao and Wang, Jinwen and Xu, Liujun and Shetty, Rushabh and Guo, Ziheng and Song, Dahui and Jha, Manvi and Liang, Weijie and Yan, Weiman and Zhang, Bryan and Karnoor, Sahil Bhandary and Zhang, Jialiang and Pandya, Rutva and Gong, Xinyi and Ganesh, Mithesh Ballae and Shi, Feize and Xu, Ruiling and Zhang, Yifan and Ouyang, Yanfeng and Qin, Lianhui and Rosenbaum, Elyse and Snyder, Corey and Seiler, Peter and Dullerud, Geir and Zhang, Xiaojia Shelly and Cheng, Zuofu and Hanumolu, Pavan Kumar and Huang, Jian and Kulkarni, Mayank and Namazifar, Mahdi and Zhang, Huan and Hu, Bin},
journal={arXiv preprint arXiv:2509.16204},
year={2025},
}