ROBOT-TALK Corpus (English)
The ROBOT-TALK corpus was created as a language sample for quantitative and qualitative contrastive linguistic analyses. These analyses address the main question of the project: can the linguistic features of a Spanish text reveal whether it was generated by an LLM or written by a person?
This is a comparable monitor corpus in Spanish. It consists of author-comparable texts (human, Gemini, Claude, GPT-3.5-Turbo, GPT-4, Mixtral, DeepSeek) in three main genres (news, film reviews, and scientific articles specialised in linguistics).
Sample of the corpus
Characteristics of the corpus
- Texts written in Spanish
- Comparable by author
  - Human
  - Gemini
  - Claude
  - GPT-3.5-Turbo
  - GPT-4
  - Mixtral
  - DeepSeek
Composition of the corpus
| Genre \ Author | Human | Gemini | Claude | GPT-3.5-Turbo | GPT-4 | Mixtral | DeepSeek | No. of texts by genre |
|---|---|---|---|---|---|---|---|---|
| Scientific articles | 144 | 144 | 144 | 90 | 144 | 90 | 144 | 900 |
| News | 171 | 171 | 171 | 111 | 171 | 111 | 171 | 1077 |
| Film reviews | 160 | 160 | 160 | 95 | 160 | 95 | 160 | 990 |
| Total no. of texts | 475 | 475 | 475 | 296 | 475 | 296 | 475 | 2967 |
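
The totals in the table can be checked with a few lines of Python. This is only an illustrative sketch: the counts are copied by hand from the table, and the names `AUTHORS`, `COUNTS`, `texts_per_genre` and `texts_per_author` are hypothetical helpers, not part of any tooling distributed with the corpus.

```python
# Counts copied by hand from the composition table (one list per genre,
# ordered as in AUTHORS). All names here are illustrative assumptions.
AUTHORS = ["Human", "Gemini", "Claude", "GPT-3.5-Turbo", "GPT-4", "Mixtral", "DeepSeek"]
COUNTS = {
    "Scientific articles": [144, 144, 144, 90, 144, 90, 144],
    "News": [171, 171, 171, 111, 171, 111, 171],
    "Film reviews": [160, 160, 160, 95, 160, 95, 160],
}


def texts_per_genre(counts):
    """Sum each row: number of texts per genre across all authors."""
    return {genre: sum(row) for genre, row in counts.items()}


def texts_per_author(counts, authors):
    """Sum each column: number of texts per author across all genres."""
    columns = zip(*counts.values())
    return {author: sum(column) for author, column in zip(authors, columns)}


genre_totals = texts_per_genre(COUNTS)
author_totals = texts_per_author(COUNTS, AUTHORS)
grand_total = sum(genre_totals.values())

print(genre_totals)
print(author_totals)
print(grand_total)
```

The row sums, the column sums and the grand total all come from the same `COUNTS` mapping, so a transcription error in any cell would surface as a mismatch with the table.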