Evaluating the Effectiveness of LLM-Generated Unit Tests in the Context of Industrial DBMS

  • Software testing consumes considerable resources in industrial database management systems, where undetected faults can cause data loss and costly operational disruptions. While Large Language Models (LLMs) have demonstrated remarkable code-generation capabilities, existing evaluations mostly rely on benchmarks that are likely present in LLM training data. This makes it unclear whether reported performance reflects genuine reasoning or memorization. We provide an empirical evaluation of LLMs on unit test generation in a large proprietary database system that is absent from public training corpora. We compare three commercial general-purpose models and one open-weight coding model across two systems: the proprietary SAP HANA and the open-source LevelDB. The study employs two generation methodologies. The first is test amplification, where existing human-written test suites are extended. The second is whole-suite generation, where tests are synthesized from source code files. To address the compilation challenges inherent in complex C++ projects, we implement an iterative repair loop guided by compiler feedback, enabling models to resolve compilation errors over multiple attempts. Additionally, to assess the impact of project-specific context on generation quality, we evaluate model performance under two configurations: source code only, and source code augmented with the header files it depends on. We adopt an evaluation framework comprising compilation success rate, code coverage, and mutation score to evaluate the syntactic correctness and fault-detection effectiveness of the generated tests. Our results reveal a substantial performance gap between open-source and proprietary software systems. On the LevelDB project, multiple models achieved near-perfect mutation scores of up to 100%, surpassing the human-written baseline by up to 50 percentage points, while GPT-5 increased line coverage from 73.78% to 82.69%. In contrast, performance on SAP HANA degraded considerably.
Our test amplification approach achieved 77.33% line coverage and a mutation score of 39.54%, while whole-suite generation with auxiliary context reached a maximum of 62.11% line coverage and a 25.14% mutation score. Both approaches performed substantially below the human-written baseline (baseline values are confidential). The iterative repair loop increased compilation success rates by approximately 2–3× within the first few iterations, with GPT-5 reaching up to 99%. However, qualitative analysis reveals that as the number of repair iterations increases, models shift their optimization toward compilability at the expense of effectiveness. Including header files as auxiliary context in the prompt increased line coverage by 10–15 percentage points and mutation scores by 1.5–4.4× across all four models compared to source-only input.
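
The iterative repair loop described in the abstract can be illustrated with a minimal sketch. The abstract does not give the report's actual prompts, compiler invocation, or stopping rule, so everything named here is an assumption made for illustration: `repair_until_compiles`, the `compile_fn` and `repair_fn` callables, and the `max_repairs` cap are all hypothetical. In a real setup `compile_fn` would presumably invoke the project's C++ build and `repair_fn` would query a model; here both are replaced by toy stand-ins so the sketch runs on its own.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces, not taken from the report:
#   compile_fn(source) -> (succeeded, compiler_diagnostics)
#   repair_fn(source, compiler_diagnostics) -> revised source
CompileFn = Callable[[str], Tuple[bool, str]]
RepairFn = Callable[[str, str], str]


def repair_until_compiles(
    test_source: str,
    compile_fn: CompileFn,
    repair_fn: RepairFn,
    max_repairs: int = 5,
) -> Tuple[Optional[str], int]:
    """Return (compiling_source, repairs_used), or (None, max_repairs) on failure."""
    current = test_source
    for repairs_used in range(max_repairs + 1):
        succeeded, diagnostics = compile_fn(current)
        if succeeded:
            return current, repairs_used
        if repairs_used < max_repairs:
            # Feed the compiler output back so the next attempt can target the error.
            current = repair_fn(current, diagnostics)
    return None, max_repairs


# Toy stand-ins so the sketch runs without a compiler or a model.
def fake_compile(source: str) -> Tuple[bool, str]:
    if "#include <string>" in source:
        return True, ""
    return False, "error: 'string' is not a member of 'std'"


def fake_repair(source: str, diagnostics: str) -> str:
    return "#include <string>\n" + source


fixed, repairs = repair_until_compiles("std::string name;", fake_compile, fake_repair)
print(repairs)
```

The explicit cap on repairs is a design choice of this sketch. It is motivated by the qualitative finding above that longer repair sequences tend to favor compilability over fault-detection effectiveness, which suggests checking coverage and mutation score on the repaired test rather than treating a successful compile as the end goal.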

Metadata
Author: Vekilmuhammet Bekmyradov
URN: urn:nbn:de:hbz:832-cos4-14410
DOI: https://doi.org/10.57684/COS-1441
Series (Serial Number): CIplus (1/2026)
Document Type: Report
Language: English
Release Date: 2026/04/14
GND Keyword: Künstliche Intelligenz (Artificial Intelligence)
Page Number: 69
Institutes and Central Facilities: Fakultät für Informatik und Ingenieurwissenschaften (F10) / Fakultät 10 / Institut für Data Science, Engineering, and Analytics
CCS-Classification: D. Software
Dewey Decimal Classification: 000 Generalities, computer science, information science
JEL-Classification: C Mathematical and Quantitative Methods
Open Access: Open Access
Licence: Creative Commons - CC BY-NC-ND - Attribution - NonCommercial - NoDerivatives 4.0 International