Research Projects

ROBOT-TALK Corpus (English)

The ROBOT-TALK corpus was created as a language sample for the quantitative and qualitative contrastive linguistic analyses needed to answer the project's main question: can the linguistic features of a Spanish text reveal whether it was generated by an LLM or written by a person?

This is a comparable monitor corpus in Spanish. It is composed of author-comparable texts (human, Gemini, Claude, GPT-3.5-Turbo, GPT-4, Mixtral, DeepSeek) in three main genres (news, film reviews and scientific articles specialised in linguistics).

 

  Sample of the corpus

 

Characteristics of the corpus

  • Texts written in Spanish
  • Comparable by author
    • human
    • Gemini
    • Claude
    • GPT-3.5-Turbo
    • GPT-4
    • Mixtral
    • DeepSeek
 
Genres and sources

  • Scientific articles on linguistics
    • Sources: scientific journals in linguistics (RSEL, Revista de Investigación Lingüística, Revista electrónica de lingüística aplicada, Sintagma, Círculo de Lingüística Aplicada a la Comunicación, Asterisco, …)
  • News
    • Sources: online news (RTVE, EFE)
  • Film reviews
    • Source: film review website (Filmaffinity)

 

Composition of the corpus

Comparable corpus: number of texts by genre (rows) and author (columns)

Genre                 Human  Gemini  Claude  GPT-3.5-Turbo  GPT-4  Mixtral  DeepSeek  No. of texts by genre
Scientific articles     152     152     152             90    144       90       144                    982
News                    182     182     182            111    171      111       171                   1250
Film reviews            171     171     171             95    160       95       160                   1050
Total no. of texts      505     505     505            296    475      296       475                   3282
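
The composition figures can be cross-checked with a few lines of Python. The sketch below is only an illustration: the names `AUTHORS`, `COUNTS`, `STATED_AUTHOR_TOTALS`, `author_totals` and `genre_sums` are hypothetical and not part of the corpus distribution, and the numbers are copied by hand from the table. It confirms that the per-author totals (505, 505, 505, 296, 475, 296, 475) follow from the per-genre cells. Summing the cells of each row gives 924, 1110 and 1023, which differs from the stated per-genre totals (982, 1250, 1050), so the sketch recomputes row sums without assuming they match.

```python
# Per-author text counts, copied by hand from the composition table.
# All names in this sketch are illustrative assumptions.
AUTHORS = ["Human", "Gemini", "Claude", "GPT-3.5-Turbo",
           "GPT-4", "Mixtral", "DeepSeek"]

COUNTS = {
    "Scientific articles": [152, 152, 152, 90, 144, 90, 144],
    "News":                [182, 182, 182, 111, 171, 111, 171],
    "Film reviews":        [171, 171, 171, 95, 160, 95, 160],
}

# "Total no. of texts" row as stated in the table.
STATED_AUTHOR_TOTALS = [505, 505, 505, 296, 475, 296, 475]


def author_totals(counts):
    """Sum each author's column across all genres."""
    return [sum(column) for column in zip(*counts.values())]


def genre_sums(counts):
    """Sum the per-author cells of each genre row."""
    return {genre: sum(row) for genre, row in counts.items()}


if __name__ == "__main__":
    totals = author_totals(COUNTS)
    for author, total, stated in zip(AUTHORS, totals, STATED_AUTHOR_TOTALS):
        status = "ok" if total == stated else "differs"
        print(f"{author:15s} computed={total:4d} stated={stated:4d} {status}")
    for genre, total in genre_sums(COUNTS).items():
        print(f"{genre:20s} sum of cells={total}")
```

Keeping the cell values and the stated totals as separate data makes any mismatch visible instead of silently overwriting one with the other.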