<abstract xmlns="http://www.w3.org/1999/xhtml">

<sec>
<h3>Purpose</h3>
<p>Assess whether ChatGPT 4.0 is accurate enough to perform research evaluations on journal articles to automate this time-consuming task.</p>
</sec>
<sec>
<h3>Design/methodology/approach</h3>
<p>Test the extent to which ChatGPT-4 can assess the quality of journal articles, using the published scoring guidelines of the UK Research Excellence Framework (REF) 2021 as a case study to create a research evaluation ChatGPT. This was applied to 51 of my own articles, and its scores were compared against my own quality judgements.</p>
</sec>
<sec>
<h3>Findings</h3>
<p>ChatGPT-4 can produce plausible document summaries and quality evaluation rationales that match the REF criteria. Its overall scores have weak correlations with my self-evaluation scores of the same documents (averaging r=0.281 over 15 iterations, with 8 being statistically significantly different from 0). In contrast, the average scores from the 15 iterations produced a statistically significant positive correlation of 0.509. Thus, averaging scores from multiple ChatGPT-4 rounds seems more effective than individual scores. The positive correlation may be due to ChatGPT being able to extract the author’s significance, rigour, and originality claims from inside each paper. If my weakest articles are removed, then the correlation with average scores (r=0.200) falls below statistical significance, suggesting that ChatGPT struggles to make fine-grained evaluations.</p>
</sec>
<sec>
<h3>Research limitations</h3>
<p>The data is self-evaluations of a convenience sample of articles from one academic in one field.</p>
</sec>
<sec>
<h3>Practical implications</h3>
<p>Overall, ChatGPT does not yet seem to be accurate enough to be trusted for any formal or informal research quality evaluation tasks. Research evaluators, including journal editors, should therefore take steps to control its use.</p>
</sec>
<sec>
<h3>Originality/value</h3>
<p>This is the first published attempt at post-publication expert review accuracy testing for ChatGPT.</p>
</sec>
</abstract>

Can ChatGPT evaluate research quality?

Journal of Data and Information Science

This work is licensed under the Creative Commons Attribution 4.0 International License.

Score	GPT	%	Me	%
1*	0	0.0%	2	4%
1.5*	0	0.0%	3	6%
2*	14	1.8%	12	24%
2.33*	1	0.1%	0	0%
2.5*	2	0.3%	9	18%
2.67*	2	0.3%	0	0%
2.75*	0	0.0%	1	2%
3*	509	66.5%	8	16%
3.33*	9	1.2%	0	0%
3.5*	14	1.8%	7	14%
3.67*	15	2.0%	0	0%
4*	199	26.0%	9	18%
Total	765	100.0%	51	100%
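
The totals in the score table above can be cross-checked with a few lines of Python. This is only a sketch: the counts are copied from the table, but the function and variable names are illustrative, and treating the star levels as plain numbers when averaging is an assumption made here, not something the source states.

```python
# Counts copied from the score table above (star suffixes dropped).
gpt_counts = {2: 14, 2.33: 1, 2.5: 2, 2.67: 2, 3: 509,
              3.33: 9, 3.5: 14, 3.67: 15, 4: 199}
author_counts = {1: 2, 1.5: 3, 2: 12, 2.5: 9, 2.75: 1,
                 3: 8, 3.5: 7, 4: 9}


def total(counts):
    """Number of scores recorded in a {score: count} mapping."""
    return sum(counts.values())


def mean_score(counts):
    """Count-weighted mean, treating star levels as numbers (an assumption)."""
    return sum(score * n for score, n in counts.items()) / total(counts)


def count_at_least(counts, threshold):
    """How many scores are at or above the given star level."""
    return sum(n for score, n in counts.items() if score >= threshold)


print("GPT scores:", total(gpt_counts))        # 15 rounds x 51 articles
print("Author scores:", total(author_counts))
print("Mean GPT score:", round(mean_score(gpt_counts), 2))
print("Mean author score:", round(mean_score(author_counts), 2))
print("Articles scored 2.5+ by author:", count_at_least(author_counts, 2.5))
print("Articles scored 3+ by author:", count_at_least(author_counts, 3))
```

The GPT total equals 15 iterations times 51 articles, and the two subset counts equal the sample sizes (34 and 24) reported in the correlation table.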

Correlation	All articles	Articles scored 2.5+ by me	Articles scored 3+ by me
GPT average vs. author (95% CI)	0.509 (0.271, 0.688)	0.200 (-0.148, 0.504)	0.246 (-0.175, 0.590)
GPT vs. author, average of 15 pairs (fraction of 95% CIs excluding 0)	0.281 (8/15)	0.102 (1/15)	0.128 (1/15)
GPT vs. GPT (average of 105 pairs)	0.245	0.194	0.215
Sample size (articles)	51	34	24
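
The 95% confidence intervals in the correlation table can be approximated from the reported r values and sample sizes alone. The sketch below assumes the intervals come from Fisher's z-transformation with a normal critical value of 1.96; the source does not state the method, so `fisher_ci` is a hypothetical helper, not the author's procedure. Under that assumption the computed bounds agree with the tabulated ones to within about 0.001.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate CI for a Pearson correlation via Fisher's z-transformation.

    r: sample correlation, n: number of paired observations (n > 3).
    z_crit=1.96 gives a 95% interval under a normal approximation.
    """
    z = math.atanh(r)                 # map r onto the z scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# (r, n) pairs taken from the "GPT average vs. author" row and sample sizes.
for r, n in [(0.509, 51), (0.200, 34), (0.246, 24)]:
    lo, hi = fisher_ci(r, n)
    excludes_zero = lo > 0 or hi < 0
    print(f"r={r:.3f}, n={n}: ({lo:.3f}, {hi:.3f}), excludes 0: {excludes_zero}")

# Number of distinct pairs among 15 ChatGPT rounds (the "GPT vs. GPT" row).
print(math.comb(15, 2))
```

Only the all-articles interval excludes zero, which is consistent with the abstract's statement that the correlation falls below statistical significance once the weakest articles are removed.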

Article Category: Research Papers

Published Online: May 27, 2024

Page range: 1 - 21

Received: Feb 06, 2024

Accepted: Apr 22, 2024

DOI: https://doi.org/10.2478/jdis-2024-0013

Keywords
ChatGPT, Large Language Models, LLM, Research Excellence Framework, REF 2021, Research quality, Research assessment

© 2024 Mike Thelwall, published by Sciendo

The scores given by ChatGPT-4 REF D and me to 51 of my open access articles.

Pearson correlations for 51 of my open access articles, comparing my initial scores, and scores from ChatGPT-4 REF D.
