Evaluating OpenAI Large Language Models for Generating Logical Abstractions of Technical Requirements Documents

Alexander Perko*, Franz Wotawa

*Corresponding author for this work

Publication: Chapter in book/report/conference proceedings › Contribution to a conference proceedings › Peer-reviewed

Abstract

Since the advent of Large Language Models (LLMs) a few years ago, they have not only reached the mainstream but have become a commodity. Their application areas expand steadily thanks to sophisticated model architectures and enormous training corpora. However, accessible chatbot user interfaces and human-like responses may lead to an overestimation of their abilities. This study demonstrates strengths and weaknesses of LLMs. In this work, we bridge methods from sub-symbolic and symbolic AI. In particular, we evaluate the capability of LLMs to convert textual requirements documents into a logical representation, enabling analysis and reasoning. This task represents a use case close to industry, as requirements analysis is key in requirements and systems engineering. Our experiments evaluate GPT-3.5 and GPT-4, the popular model family behind OpenAI's ChatGPT. The underlying goal of testing for a correct abstraction of meaning is not trivial, as the relationship between input and output semantics is not directly measurable. It is therefore necessary to approximate translation correctness through quantifiable criteria. Most notably, we define consistency-based metrics for the plausibility and stability of translations. Our experiments give insights into syntactic validity, semantic plausibility, stability of translations, and parameter configurations for LLM translations. We use real-world requirements and test the LLMs' performance out of the box and after pre-training. Experimentally, we demonstrate a strong relationship between ChatGPT parameters and the stability of translations. Finally, we show that even the best model configurations produced syntactically faulty (5%) or semantically implausible (7%) output and are not stable in their results.
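The abstract mentions consistency-based metrics for the stability of translations but does not define them here. As an illustrative sketch only (the function name, the normalization step, and the pairwise-agreement formula are assumptions, not the paper's actual definitions), one simple way to quantify stability is the fraction of pairs of repeated LLM runs that yield an identical normalized formula:

```python
from itertools import combinations

def stability(translations: list[str]) -> float:
    """Hypothetical stability metric: pairwise agreement among repeated
    translations of the same requirement.

    Returns 1.0 when every run produced the same (whitespace-normalized)
    formula, and 0.0 when no two runs agree.
    """
    if len(translations) < 2:
        return 1.0
    # Normalize whitespace so cosmetic differences do not count as disagreement.
    normalized = [" ".join(t.split()) for t in translations]
    pairs = list(combinations(normalized, 2))
    agreeing = sum(1 for a, b in pairs if a == b)
    return agreeing / len(pairs)

# Three runs on one requirement, two of which agree: 1 of 3 pairs match.
runs = ["G(req -> F ack)", "G(req -> F ack)", "G(req -> X ack)"]
print(round(stability(runs), 3))  # -> 0.333
```

A production metric would likely compare formulas up to logical equivalence rather than string equality; exact matching is used here only to keep the sketch self-contained.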

Original language: English
Title: Proceedings - 2024 IEEE 24th International Conference on Software Quality, Reliability and Security, QRS 2024
Publisher: IEEE
Pages: 238-249
Number of pages: 12
ISBN (electronic): 9798350365634
DOIs
Publication status: Published - 26 Sep 2024
Event: 24th IEEE International Conference on Software Quality, Reliability and Security, QRS 2024 - Cambridge, United Kingdom
Duration: 1 July 2024 - 5 July 2024

Publication series

Name: IEEE International Conference on Software Quality, Reliability and Security, QRS
ISSN (Print): 2693-9177

Conference

Conference: 24th IEEE International Conference on Software Quality, Reliability and Security, QRS 2024
Country/Territory: United Kingdom
City: Cambridge
Period: 1/07/24 - 5/07/24

ASJC Scopus subject areas

  • Software
  • Safety, Risk, Reliability and Quality
  • Artificial Intelligence

Fields of Expertise

  • Information, Communication & Computing

