Corpus ROBOT-TALK (english)

The ROBOT-TALK corpus was created as a language sample for quantitative and qualitative contrastive linguistic analyses. These analyses address the main question of the project: can the linguistic features of a text in Spanish show whether it was generated by an LLM or written by a person?

This is a comparable monitor corpus in Spanish. It is composed of author-comparable texts (human, Gemini, Claude, GPT-3.5-Turbo, GPT-4, Mixtral, DeepSeek) in three main genres (news, film reviews and scientific articles specialised in linguistics).

Sample of the corpus

Characteristics of the corpus

Texts written in Spanish
Comparable by author:
- human
- Gemini
- Claude
- GPT-3.5-Turbo
- GPT-4
- Mixtral
- DeepSeek

GENRES

Scientific articles on linguistics

News

Film reviews

SOURCES

Scientific journals in linguistics:

RSEL, Revista de Investigación Lingüística, Revista Electrónica de Lingüística Aplicada, Sintagma, Círculo de Lingüística Aplicada a la Comunicación, Asterisco, …

Online news:

RTVE, EFE

Film review website:

Filmaffinity

Composition of the corpus

Genre	Human	Gemini	Claude	GPT-3.5-Turbo	GPT-4	Mixtral	DeepSeek	No. of texts by genre
Scientific articles	152	152	152	90	144	90	144	982
News	182	182	182	111	171	111	171	1250
Film reviews	171	171	171	95	160	95	160	1050
Total no. of texts	505	505	505	296	475	296	475	3282
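
The totals in the table can be re-derived from the individual cells. The following is a minimal Python sketch, assuming the per-author, per-genre counts above are the reference figures; the names `AUTHORS`, `COUNTS`, `totals_by_author` and `totals_by_genre` are illustrative and not part of the corpus distribution.

```python
# Cell values copied from the composition table above.
# Each list follows the author order given in AUTHORS.
AUTHORS = ["Human", "Gemini", "Claude", "GPT-3.5-Turbo",
           "GPT-4", "Mixtral", "DeepSeek"]

COUNTS = {
    "Scientific articles": [152, 152, 152, 90, 144, 90, 144],
    "News":                [182, 182, 182, 111, 171, 111, 171],
    "Film reviews":        [171, 171, 171, 95, 160, 95, 160],
}


def totals_by_author(counts):
    """Sum each author's column across all genres."""
    return {
        author: sum(row[i] for row in counts.values())
        for i, author in enumerate(AUTHORS)
    }


def totals_by_genre(counts):
    """Sum each genre's row across all authors."""
    return {genre: sum(row) for genre, row in counts.items()}


if __name__ == "__main__":
    for author, n in totals_by_author(COUNTS).items():
        print(f"{author}: {n}")
    for genre, n in totals_by_genre(COUNTS).items():
        print(f"{genre}: {n}")
    print("All texts:", sum(totals_by_genre(COUNTS).values()))
```

Summing the cells this way reproduces the per-author totals in the last row (505, 296 and 475). The per-genre sums it yields (924, 1110 and 1023, or 3057 overall) do not match the by-genre totals column as printed, so that column is best checked against the corpus itself.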
