Research Projects

ROBOT-TALK Corpus

The ROBOT-TALK corpus was created as a language sample for the quantitative and qualitative contrastive linguistic analyses needed to answer the project's main question: can the linguistic features of a Spanish text reveal whether it was generated by an LLM or written by a person?

This is a comparable monitor corpus in Spanish. It consists of texts that are comparable by author (human, Gemini, Claude, GPT-3.5-Turbo, GPT-4, Mixtral, DeepSeek) across three main genres: news, film reviews and scientific articles specialised in linguistics.

 

  Sample of the corpus

 

Characteristics of the corpus

  • Texts written in Spanish
  • Comparable by author:
    • human
    • Gemini
    • Claude
    • GPT-3.5-Turbo
    • GPT-4
    • Mixtral
    • DeepSeek
 
Genres and sources

  • Scientific articles on linguistics, from scientific journals in linguistics: RSEL, Revista de Investigación Lingüística, Revista electrónica de lingüística aplicada, Sintagma, Círculo de Lingüística Aplicada a la Comunicación, Asterisco, …
  • News, from online news outlets: RTVE, EFE
  • Film reviews, from a film review website: Filmaffinity

 

Composition of the corpus

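Comparable corpus: number of texts by genre and author

| Genre               | Human | Gemini | Claude | GPT-3.5-Turbo | GPT-4 | Mixtral | DeepSeek | Texts per genre |
|---------------------|-------|--------|--------|---------------|-------|---------|----------|-----------------|
| Scientific articles | 144   | 144    | 144    | 90            | 144   | 90      | 144      | 900             |
| News                | 171   | 171    | 171    | 111           | 171   | 111     | 171      | 1077            |
| Film reviews        | 160   | 160    | 160    | 95            | 160   | 95      | 160      | 990             |
| Texts per author    | 475   | 475    | 475    | 296           | 475   | 296     | 475      | 2967            |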
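
The marginal totals in the composition table above can be recomputed from the per-cell counts. The following is a minimal sketch in Python, not part of the project's own tooling: the names `AUTHORS`, `COUNTS`, `totals_by_genre` and `totals_by_author` are hypothetical and introduced only for illustration, and the figures are copied from the table.

```python
# Hypothetical representation of the composition table (illustration only).
# Author order follows the table columns.
AUTHORS = ["Human", "Gemini", "Claude", "GPT-3.5-Turbo", "GPT-4", "Mixtral", "DeepSeek"]

# Number of texts per genre, one value per author, in the order of AUTHORS.
COUNTS = {
    "Scientific articles": [144, 144, 144, 90, 144, 90, 144],
    "News": [171, 171, 171, 111, 171, 111, 171],
    "Film reviews": [160, 160, 160, 95, 160, 95, 160],
}


def totals_by_genre(counts):
    """Sum each row: total number of texts per genre."""
    return {genre: sum(row) for genre, row in counts.items()}


def totals_by_author(counts, authors):
    """Sum each column: total number of texts per author."""
    columns = zip(*counts.values())
    return {author: sum(column) for author, column in zip(authors, columns)}


genre_totals = totals_by_genre(COUNTS)
author_totals = totals_by_author(COUNTS, AUTHORS)
grand_total = sum(genre_totals.values())

print(genre_totals)
print(author_totals)
print(grand_total)
```

Summing by row and by column gives the same grand total, which is a quick consistency check on the table. It also makes the imbalance visible: GPT-3.5-Turbo and Mixtral contribute fewer texts than the other authors in every genre.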