Sophisticated Large Language Models (LLMs), such as ChatGPT, have become highly effective in comprehending and generating human-like texts and have become pivotal in various applications, including writing assistance (Seßler et al., 2023). From an ethical standpoint, appropriately acknowledging the use of LLMs is conscientious as it underscores a commitment to transparency, honesty, and integrity in writing (Sallam, 2023). In scientific communication, where the pursuit and dissemination of knowledge are deeply guided by these principles (Sikes, 2009), clearly articulating the involvement of these models in the writing process is of great importance (Bin-Nashwan et al., 2023). Indeed, such acknowledgment is mandated by most publishers (Nature editorial, 2023).
Unfortunately, the current landscape presents a formidable challenge in terms of enforcing the explicit acknowledgment of LLMs in general and in scientific communication in particular. First, the evolving capabilities of LLMs and their intricate role in the writing process introduce uncertainty and ambiguity; as a result, estimating the extent of LLM influence and establishing clear boundaries for acknowledgment can be elusive. In addition, certain authors may hesitate to overtly admit their use of LLMs for various reasons, including a traditional authorship viewpoint, concerns about potential negative perceptions, or the lack of established guidelines in this regard, to name a few (Draxler et al., 2023).
One promising strategy to promote genuine disclosure of LLM usage involves the development of automated tools for LLM use detection. Indeed, various detectors have been created to distinguish between human- and LLM-generated texts (Tang et al., 2023). Nevertheless, these detectors are not necessarily proficient in detecting LLM-assisted writing, as they were not originally designed for this purpose. To our knowledge, the automated detection of LLM-assisted writing in scientific communication has yet to be explicitly considered by the scientific community.
In this work, we investigate the viability of using four state-of-the-art LLM-generated text detectors for detecting LLM-assisted writing in scientific communication. Our findings reveal subpar performance, raising substantial concerns regarding the practical value of using these models to detect potentially undisclosed LLM-assisted writing. To make the case for the viability of the LLM-assisted writing detection challenge, we present and evaluate an alternative detector designed to identify abrupt “writing style changes” occurring around the period of LLM proliferation, which can reasonably be associated with LLM writing. While our proposed approach does not claim optimality and is limited in several respects, it does demonstrate a noteworthy improvement over existing detectors, indicating that the challenge of detecting LLM-assisted writing remains unsolved.
For our evaluation, we curated two data sets. First, an assessment set consisting of a meticulously garnered set of twenty-two scientific publications in the form of eleven matched samples. Specifically, we manually identified and extracted eleven publications in which ChatGPT was either listed as a co-author or appropriately acknowledged in the text. As these publications self-evidently belong to the “LLM-Assisted” category, each was matched with a counterpart publication authored by the leading human author (i.e., the first author) during the 2021-2022 period, resulting in eleven paired samples. Note that the publications chosen from the 2021-2022 period are assumed to be free of any LLM influence, given that this period predates LLM proliferation (Lund & Wang, 2023). The full list of publications considered is provided in Appendix 1. Second, a false-positive set was assembled, comprising a varied compilation of 1,094 publications published in or before 2022 (i.e., devoid of any LLM influence). The curation process, which is detailed in Appendix 2, follows a technique similar to that presented in recent literature (Alexi et al., 2024; Zargari et al., 2023). The resulting false-positive set consists of full-text manuscripts from established authors across diverse academic institutions, disciplines, and ranks.
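The matching procedure described above can be sketched as follows. This is an illustrative sketch only; the record fields (`first_author`, `year`) and the choice of the first eligible match are hypothetical, not the paper's exact curation code.

```python
# Sketch of the matched-sample construction: for each LLM-assisted
# publication, find a pre-LLM (2021-2022) publication by the same
# first author. Field names are hypothetical.

def build_matched_pairs(llm_assisted, candidate_pool):
    """Pair each LLM-assisted paper with a 2021-2022 paper by its first author."""
    pairs = []
    for paper in llm_assisted:
        matches = [p for p in candidate_pool
                   if p["first_author"] == paper["first_author"]
                   and 2021 <= p["year"] <= 2022]
        if matches:  # keep the first eligible counterpart
            pairs.append((paper, matches[0]))
    return pairs
```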
We consider four state-of-the-art open-access LLM-generated text detectors: DetectLLM (Su et al., 2023), ZipPy, LLMDet (Wu et al., 2023), and ConDA (Bhattacharjee et al., 2023). These four detectors exemplify the two predominant approaches to detecting LLM-generated text: zero-shot, represented by the first two detectors, meaning they require no input at inference time other than the provided text of interest; and few-shot, represented by the latter two detectors, requiring a small number of reference samples for their inference. We utilized the implementations of these detectors as originally published by their authors for our analysis. It is worth noting that two of the detectors offer a “soft classification”, meaning they provide a continuous measure that requires conversion into a “hard classification”, i.e., a binary label of “LLM-assisted” or not. We tune these decision-threshold parameters using a simple grid search approach with the assessment set.
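The text does not specify the grid used; a minimal sketch of accuracy-maximizing threshold tuning over a soft detector score might look like the following, where `scores` and `labels` are hypothetical inputs (continuous detector outputs and binary ground-truth labels for the assessment set):

```python
# Sketch of tuning a decision threshold for a "soft" detector score
# via grid search, maximizing accuracy on a labeled assessment set.

def tune_threshold(scores, labels, grid_steps=100):
    """Return the threshold (and accuracy) that best separates the labels."""
    lo, hi = min(scores), max(scores)
    best_t, best_acc = lo, 0.0
    for i in range(grid_steps + 1):
        t = lo + (hi - lo) * i / grid_steps  # evenly spaced candidate thresholds
        preds = [1 if s >= t else 0 for s in scores]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

With only eleven paired samples, such tuning is of course prone to overfitting, which is one reason the false-positive set provides an important complementary check.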
We further evaluate a simple writing style-based approach for detecting LLM-assisted writing. The approach is based on the premise that a sudden change in one’s writing style around the time of LLM proliferation could potentially indicate LLM-assisted writing, especially if the change aligns with LLM writing style. Our detector, which we term the LLM-Assisted Writing (LAW) detector for simplicity, works as illustrated in Figure 1: First, for training, we adopt the writing style modeling technique of Lazebnik and Rosenfeld (2023), and for a given author
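While the LAW detector relies on the style modeling technique of Lazebnik and Rosenfeld (2023), the core style-change idea can be illustrated with a toy sketch. Everything here is an assumption for illustration: the function-word frequency representation, the Euclidean distance, and the fixed threshold are hypothetical stand-ins, not the authors' implementation.

```python
from collections import Counter

# Toy writing-style representation: relative frequencies of a few
# function words. Purely illustrative of the style-change idea.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "we"]

def style_vector(text):
    """Map a text to a vector of function-word frequencies."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def style_distance(v1, v2):
    """Euclidean distance between two style vectors."""
    return sum((a - b) ** 2 for a, b in zip(v1, v2)) ** 0.5

def flags_style_change(pre_llm_texts, new_text, threshold):
    """Flag `new_text` if it deviates from the author's pre-LLM style centroid."""
    vectors = [style_vector(t) for t in pre_llm_texts]
    centroid = [sum(col) / len(col) for col in zip(*vectors)]
    return style_distance(centroid, style_vector(new_text)) > threshold
```

A new publication whose style vector sits far from the centroid of the author's pre-2023 publications would be flagged as a candidate for LLM-assisted writing.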
Table 1 presents the results of each model in terms of accuracy, F1-score, and false-positive rate.
Table 1: The performance of the examined detectors (columns) on the assessment set (top rows) and the false-positive set (bottom row, reported as the false-positive rate).

| Model | LLMDet | DetectLLM | ZipPy | ConDA | LAW |
|---|---|---|---|---|---|
| | 0.546 | 0.591 | 0.637 | 0.637 | 0.727 |
| | 0.286 | 0.471 | 0.600 | 0.600 | 0.700 |
| | 0.334 | 0.534 | 0.627 | 0.627 | 0.700 |
| | 0.250 | 0.421 | 0.575 | 0.575 | 0.700 |
| False-positive rate | 17.2% | 13.8% | 9.7% | 8.8% | 3.1% |
Statistically, the five detectors do not differ significantly on the assessment set, which is unsurprising given its very limited size (11 paired samples). Nevertheless, the detectors do differ significantly in their performance on the false-positive set (χ2 = 133, p < 0.001).
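The reported χ2 plausibly derives from comparing false-positive counts across the five detectors on the 1,094-paper set. The exact test is not specified in the text; the following sketch assumes a Pearson chi-square test of homogeneity over a 2×5 contingency table, with counts back-computed (rounded) from the reported false-positive rates.

```python
# Sketch: Pearson chi-square statistic over the detectors' false-positive
# counts on the 1,094-paper false-positive set. Counts are rounded
# back-calculations from the reported rates, not the authors' raw data.

def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (obs - expected) ** 2 / expected
    return stat

n = 1094
fp_rates = [0.172, 0.138, 0.097, 0.088, 0.031]  # LLMDet .. LAW
false_positives = [round(r * n) for r in fp_rates]
table = [false_positives,                         # row 1: false positives
         [n - c for c in false_positives]]        # row 2: correct rejections
```

Under these assumptions the statistic comes out near the reported 133, with 4 degrees of freedom for a 2×5 table, so a p-value far below 0.001.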
The observed results suggest that existing state-of-the-art LLM-generated text detectors are suboptimal, at the very least, for the task of detecting LLM-assisted writing in scientific communication. This subpar performance manifests as low accuracy and F1-scores and a high false-positive rate, particularly when contrasted with the simple writing style change detector we implemented in this study. We contend that these results should prompt a call for the development of specialized detectors, exclusively dedicated to LLM-assisted writing detection, aiming for more robust performance in the near future. We are of the opinion that such development is warranted and could play a pivotal role in fostering more authentic recognition of LLM-assisted writing and, consequently, has the potential to enhance transparency, honesty, and integrity in scientific communication.
Our study is not without limitations. First, our ad-hoc writing style-based detector is consciously designed to detect an unexpected change in writing style around the time of LLM proliferation. As such, authors with a limited number of publications prior to that period cannot be considered by the detector. Moreover, mild changes in writing style from one publication to the next would likely go undetected, allowing more substantial LLM-assisted writing to accumulate unnoticed over time. In general, applying our detector to future publications would entail a major challenge of determining whether and how to use current publications, be they classified by the detector as LLM-assisted or not, for future inference. Regarding our evaluation, it relies on two data sets, the first of which is relatively small. Unfortunately, at least in the realm of scientific communication, gathering more unsolicited instances of LLM-assisted writing is highly challenging since, as currently believed, most authors who practice LLM-assisted writing avoid explicitly reporting it for a variety of reasons (Dergaa et al., 2023; Yuan et al., 2022).