Benchmarking the Capabilities of Large Language Models in Transportation System Engineering: Accuracy, Consistency, and Reasoning Behaviors

¹University of Illinois Urbana-Champaign  ²University of California San Diego

Abstract

In this paper, we explore the capabilities of state-of-the-art large language models (LLMs) such as GPT-4, GPT-4o, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, Llama 3, and Llama 3.1 in solving selected undergraduate-level transportation engineering problems. We introduce TransportBench, a benchmark dataset of transportation engineering problems covering a wide range of subjects in the planning, design, management, and control of transportation systems. Human experts use this dataset to evaluate the capabilities of various commercial and open-source LLMs, especially their accuracy, consistency, and reasoning behaviors, in solving transportation engineering problems. Our comprehensive analysis uncovers the unique strengths and limitations of each LLM; for example, it reveals the impressive accuracy as well as some unexpectedly inconsistent behaviors of Claude 3.5 Sonnet in solving TransportBench problems. Our study marks a thrilling first step toward harnessing artificial general intelligence for complex transportation challenges.

TransportBench Dataset

We introduce TransportBench, a collection of 140 undergraduate problems spanning a broad spectrum of topics in transportation engineering. TransportBench consists of both True/False problems and general Q&A problems. We summarize the statistics of the TransportBench dataset for each topic below.

  • TransportBench comprises 140 problems sourced from a junior-level introductory course CEE 310 - Transportation Engineering and a senior-level focused course CEE 418 - Public Transportation Systems, both offered regularly at UIUC.
  • Each problem is human-crafted and selected by the domain expert Prof. Yanfeng Ouyang based on his teaching at UIUC.
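
To make the dataset composition concrete, here is a minimal sketch of how a single TransportBench record could be represented as a Python data class; the field names and the example contents are illustrative assumptions, not the released data format.

from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class TransportBenchProblem:
    problem_id: str                            # hypothetical ID such as "CEE310-001"
    course: Literal["CEE 310", "CEE 418"]      # source course at UIUC
    topic: str                                 # transportation engineering topic
    problem_type: Literal["true_false", "qa"]  # True/False or general Q&A
    statement: str                             # full problem text
    reference_answer: str                      # expert solution used for grading
    reference_label: Optional[bool] = None     # ground-truth label for True/False items

# Placeholder instance (not an actual benchmark problem).
example = TransportBenchProblem(
    problem_id="CEE310-001",
    course="CEE 310",
    topic="geometric design",
    problem_type="true_false",
    statement="(problem statement text)",
    reference_answer="False, because ...",
    reference_label=False,
)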

Here we show two representative examples extracted from TransportBench.

Evaluating Accuracy of Leading LLMs on TransportBench

We evaluate the accuracy of leading LLMs, including GPT-4, GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3 (70B), and Llama 3.1 (405B), on TransportBench using a zero-shot prompting strategy, with correctness judged via human expert annotation. Our main evaluation metric is Accuracy (ACC), defined as the proportion of problems that an LLM solves correctly. The results are reported below.
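
As a reference for this metric, here is a minimal sketch of how ACC can be computed from expert-graded responses; the accuracy helper and the grading record format are hypothetical and only illustrate the definition, not the authors' evaluation pipeline.

from collections import defaultdict

def accuracy(gradings):
    """ACC per model: fraction of problems graded correct by human experts.

    gradings: list of dicts like
    {"model": "GPT-4o", "problem_id": "p1", "correct": True}
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for g in gradings:
        totals[g["model"]] += 1
        hits[g["model"]] += int(g["correct"])
    return {model: hits[model] / totals[model] for model in totals}

# Toy example with three expert-graded responses from two models.
gradings = [
    {"model": "GPT-4o", "problem_id": "p1", "correct": True},
    {"model": "GPT-4o", "problem_id": "p2", "correct": False},
    {"model": "Llama 3.1", "problem_id": "p1", "correct": True},
]
print(accuracy(gradings))  # {'GPT-4o': 0.5, 'Llama 3.1': 1.0}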

  • Claude 3.5 Sonnet achieves the best ACC for most topics and the entire TransportBench dataset.
  • Gemini 1.5 Pro, GPT-4o, and Claude 3 Opus all demonstrate competitive performance.
  • The open-source model Llama 3.1 reaches the performance level of the commercial model GPT-4.

CEE 310 vs. CEE 418 and True/False vs. General Q&A

It is interesting to investigate the impact of problem difficulty. The problems in TransportBench are sourced from two courses: CEE 310 and CEE 418. CEE 310 is an introductory course that covers a broad range of topics in transportation engineering, while CEE 418 is a more advanced and more focused follow-up course (whose prerequisite is CEE 310). The following table shows the ACC for these two courses.

  • All LLMs achieve lower ACC on CEE 418 than on CEE 310.
  • Claude 3.5 Sonnet significantly outperforms all the other LLMs on CEE 418, demonstrating its superior capabilities in handling more advanced topics in transportation engineering.

We have also studied how problem type affects LLM performance. TransportBench consists of True/False problems and general Q&A problems; intuitively, True/False problems are easier than general Q&A problems. We report the ACC of the seven evaluated LLMs for each problem type in the table above. The ACC results for True/False vs. general Q&A show that:

  • Most LLMs show consistently lower ACC on general Q&A problems than on True/False problems.
  • Claude 3.5 Sonnet achieves similar ACC on general Q&A problems (71.6%) and True/False problems (72.6%).

Zero-shot Consistency on True/False Problems

Consistency refers to uniform, reliable, and logically coherent responses that maintain the same principles and reasoning across different inquiries. We study the zero-shot consistency of LLMs on the True/False problems by independently running five trials of each problem in the zero-shot setting. We use two metrics to quantify zero-shot consistency: (i) Mixed Response Rate (MRR), the percentage of True/False problems that receive mixed (non-identical) answers across the five trials; and (ii) aggregate ACC, the proportion of trials in which the LLM gives the correct True/False label, out of the total 73 × 5 = 365 trials.
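
The sketch below shows one way to compute these two metrics from per-problem trial results; the trials/labels data layout and function names are assumptions introduced only for illustration.

def mixed_response_rate(trials):
    """MRR: fraction of True/False problems whose five answers are not all identical."""
    mixed = sum(1 for answers in trials.values() if len(set(answers)) > 1)
    return mixed / len(trials)

def aggregate_accuracy(trials, labels):
    """Aggregate ACC: correct answers over all problem-trial pairs
    (73 x 5 = 365 trials for TransportBench's True/False problems)."""
    total = sum(len(answers) for answers in trials.values())
    correct = sum(sum(answer == labels[pid] for answer in answers)
                  for pid, answers in trials.items())
    return correct / total

# Toy example with two problems and five trials each.
trials = {"p1": [True, True, True, True, True],
          "p2": [True, False, True, True, False]}
labels = {"p1": True, "p2": False}
print(mixed_response_rate(trials))         # 0.5
print(aggregate_accuracy(trials, labels))  # (5 + 2) / 10 = 0.7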

  • Llama 3 achieves the lowest MRR. However, the aggregate ACC of Llama 3 is also the lowest, which suggests a strong bias: the model consistently generates incorrect answers. Our study shows that Llama 3 reports True for almost 90% of the total problem trials.
  • Claude 3.5 Sonnet achieves the highest aggregate ACC while maintaining a very low MRR, making it the state-of-the-art LLM in terms of zero-shot consistency on TransportBench.

Consistency under Self-checking Prompts

The LLM literature reports that LLMs can sometimes correct their mistakes when given simple self-checking prompts, such as "carefully check your solutions". We examine whether LLMs generate consistent answers and reasoning when prompted to double-check their original answers. This provides a complementary perspective on LLM consistency using two metrics. The first metric is the self-checking accuracy (denoted ACC-s̄), which quantifies the proportion of problems for which an LLM gives the correct answer after the self-checking process. The second metric is the number of True/False problems in which an LLM flips an originally correct answer to a wrong one. For a consistent LLM, we ideally want ACC-s̄ to be higher than ACC and the number of incorrect flips to be low.
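
A minimal sketch of how these two self-checking metrics could be computed is given below; the record format and the self_check_metrics helper are hypothetical, introduced only to make the definitions concrete.

def self_check_metrics(records):
    """records: list of dicts {"label": bool, "before": bool, "after": bool},
    where "before"/"after" are a model's answers before/after the self-checking prompt."""
    n = len(records)
    acc_before = sum(r["before"] == r["label"] for r in records) / n   # ACC
    acc_after = sum(r["after"] == r["label"] for r in records) / n     # ACC-s̄
    incorrect_flips = sum(r["before"] == r["label"] and r["after"] != r["label"]
                          for r in records)
    return acc_before, acc_after, incorrect_flips

# Toy example: one stable correct answer, one corrected mistake, one incorrect flip.
records = [
    {"label": True,  "before": True, "after": True},
    {"label": False, "before": True, "after": False},
    {"label": True,  "before": True, "after": False},
]
print(self_check_metrics(records))  # approximately (0.667, 0.667, 1)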

  • The self-checking prompts prove useful for GPT-4o and GPT-4, boosting their accuracy. The number of incorrect flips for GPT-4 and GPT-4o is very low, showing that these models are consistent under self-checking.
  • Given self-checking prompts, Claude 3.5 Sonnet is still more consistent than Claude 3 Opus, Gemini 1.5 Pro, Llama 3, and Llama 3.1, but less consistent than GPT-4 and GPT-4o.

Examples

Here are selected examples from TransportBench:

Image 1
Image 2
Image 3
Image 4
Image 5
Image 6

BibTeX

@article{Usman2024TransportBench,
  author    = {Syed, Usman and Light, Ethan and Guo, Xingang and Zhang, Huan and Qin, Lianhui and Ouyang, Yanfeng and Hu, Bin},
  title     = {Benchmarking the Capabilities of Large Language Models in Transportation System Engineering: Accuracy, Consistency, and Reasoning Behaviors},
  journal   = {arXiv preprint arXiv:2408.08302},
  year      = {2024},
}