Study and quality evaluation of LLM’s generated unit test sets for C Programs
Abstract
As technology becomes more integrated into our daily routines, reliable software becomes increasingly critical. However, the high cost of manual test generation often leads developers to neglect software quality. In this context, the growing demand for automated test generation is a response to the potential negative consequences of inadequate software testing. Problem: Tools designed explicitly for automated program testing exist for many programming languages, including C. However, learning and properly configuring these tools is often not trivial, and users must install and set them up before use. Solution: This work takes advantage of the rapid rise of Large Language Models (LLMs) and evaluates their capability to generate unit tests for C programs, using code coverage and mutation score as metrics to assess the quality of the generated test sets. Method: This study selected 27 C programs from the literature. We grouped these programs into three non-overlapping categories according to how each one accepts inputs (Basic Input: inputs are provided as program parameters; Driver Type 1: each test case is a case in a switch statement, with the inputs hard-coded inside that case; and Driver Type 2: similar to Driver Type 1, but with the inputs stored in external data files). For each program, we interactively asked the LLM to generate tests automatically. After generating the test sets, we collected code coverage, mutation score, and test execution success rate to evaluate the efficiency and effectiveness of each set. We then used these metrics as additional parameters to improve the test sets. Results: The test sets generated by LLMs achieve strong results, particularly given their ease of use and the little human intervention needed to adjust configuration.
On average, the LLM-generated test sets reached 100% code coverage and a 98.7% mutation score on programs with basic inputs. The worst results were on programs requiring a Type 1 driver, with 91.8% code coverage and a 95.2% mutation score. Even so, these results are very satisfactory, mainly because of the simplicity of the prompts and the low effort required to generate the test cases.
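To make the Driver Type 1 category concrete, the following is a minimal sketch of what such a driver could look like. It is an illustration only and is not taken from the study: the function under test (`max_of_three`), the helper `run_test_case`, the chosen inputs, and the `DRIVER_MAIN` build flag are all hypothetical names introduced here. Each test case is one case in a switch statement, and its inputs are hard-coded inside that case.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical function under test (an assumption, not one of the 27 programs). */
static int max_of_three(int a, int b, int c)
{
    int m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    return m;
}

/* Type 1 driver sketch: one test case per case label, inputs hard-coded.
 * Returns 0 if the selected test passes, 1 if it fails,
 * and -1 if the test id is unknown. */
int run_test_case(int id)
{
    switch (id) {
    case 1:  return max_of_three(1, 2, 3) == 3 ? 0 : 1;    /* maximum is last */
    case 2:  return max_of_three(3, 2, 1) == 3 ? 0 : 1;    /* maximum is first */
    case 3:  return max_of_three(2, 3, 1) == 3 ? 0 : 1;    /* maximum is in the middle */
    case 4:  return max_of_three(-5, -5, -5) == -5 ? 0 : 1; /* all values equal */
    default: return -1;
    }
}

#ifdef DRIVER_MAIN
/* Command-line entry point: the test id is passed as the only argument. */
int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <test-id>\n", argv[0]);
        return 2;
    }
    int rc = run_test_case(atoi(argv[1]));
    if (rc < 0) {
        fprintf(stderr, "unknown test id: %s\n", argv[1]);
        return 2;
    }
    puts(rc == 0 ? "PASS" : "FAIL");
    return rc;
}
#endif
```

Compiled with `-DDRIVER_MAIN`, the driver runs one test per invocation, selected by its id. A Driver Type 2 variant would presumably keep the same switch structure but read each case's inputs from an external data file rather than hard-coding them.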