Caracterització i detecció de textos generats artificialment

Casas Muñoz, Antoni

dc.contributor

Universitat Politècnica de Catalunya. Departament de Ciències de la Computació

dc.contributor

Martín Muñoz, Mario

dc.contributor.author

Casas Muñoz, Antoni

dc.date.issued

2020-06-26

dc.identifier

https://hdl.handle.net/2117/329241

dc.identifier

152499

dc.description.abstract

Millores recents en el camp dels models de llenguatge natural han portat a la creació de nous models generadors de llenguatge. Aquests nous models són de gran qualitat i, en certes ocasions, diferenciar-los d'allò que un humà escriuria o faria és extremadament complex. A la vegada, els usos il·legítims d'aquesta nova tecnologia estan creixent; per tant, és d'interès la comprensió d'aquests models per a la seva millora i per a la detecció dels seus usos il·legítims. Aquest treball examina diferents lleis i distribucions sobre el llenguatge natural, i examina quines diferències existeixen entre el text generat pel model màquina GPT2, el state of the art actual, i text escrit per humans. Específicament analitza la distribució de categories gramaticals, l'entropia condicional sobre el text, l'entropia condicional sobre les seves categories gramaticals i l'entropia condicional sobre els caràcters del text, la distribució de Zipf, la distribució de les mides de grups de correferències, la distribució de mides de paraula i la distribució de la polisèmia de cada paraula. També s'ha desenvolupat una API REST documentada per Swagger 2.0 per a facilitar l'extracció d'aquesta informació, fer futures anàlisis d'aquest estil més fàcils i permetre la integració d'aquesta informació a processos d'extracció d'informació per a l'avaluació de models de llenguatge natural creats amb aprenentatge màquina.

dc.description.abstract

Recent advances in natural language modelling have led to new generative language models. These models are of high quality, and on certain occasions distinguishing their output from what a human would write is extremely difficult. At the same time, illegitimate uses of this technology are growing, so understanding these models is important both for improving them and for detecting their illegitimate use. This work examines several laws and distributions of natural language, and examines which differences exist between human-written text and text generated by GPT2, the current state of the art. Specifically, it analyzes the distribution of parts of speech; the conditional entropy over the text, over its parts of speech, and over its characters; the Zipf distribution; the distribution of coreference cluster sizes; the distribution of word lengths; and the distribution of each word's polysemy. In addition, a REST API documented with Swagger 2.0 has been developed to facilitate the extraction of this information, to make future analyses of this kind easier, and to allow its integration into information extraction processes for evaluating natural language models built with machine learning.
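
As a rough illustration of three of the measures named in the abstract (conditional entropy, the Zipf rank-frequency distribution, and the word length distribution), the following is a minimal sketch using only the Python standard library. It is not taken from the thesis or from its REST API: the function names, the first-order (bigram) conditioning, and the whitespace tokenization are all assumptions made for this example.

```python
import math
from collections import Counter


def conditional_entropy(symbols):
    """Estimate H(X_i | X_{i-1}) in bits from bigram counts.

    Assumption: first-order conditioning on the previous symbol only;
    the thesis may use a different context length or estimator.
    """
    pairs = list(zip(symbols, symbols[1:]))
    if not pairs:
        return 0.0
    pair_counts = Counter(pairs)
    context_counts = Counter(prev for prev, _ in pairs)
    total = len(pairs)
    entropy = 0.0
    for (prev, _), count in pair_counts.items():
        p_joint = count / total                # P(prev, cur)
        p_cond = count / context_counts[prev]  # P(cur | prev)
        entropy -= p_joint * math.log2(p_cond)
    return entropy


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent token first.

    Plotted on log-log axes, a Zipf-like text gives a roughly straight line.
    """
    counts = Counter(tokens)
    return [(rank, freq)
            for rank, (_, freq) in enumerate(counts.most_common(), start=1)]


def length_distribution(tokens):
    """Return the relative frequency of each token length."""
    lengths = Counter(len(token) for token in tokens)
    total = sum(lengths.values())
    return {length: n / total for length, n in sorted(lengths.items())}


if __name__ == "__main__":
    # Hypothetical sample text; whitespace tokenization is an assumption.
    words = "the cat saw the dog and the dog saw the cat".split()
    print(conditional_entropy(words))                # over words
    print(conditional_entropy(list(" ".join(words))))  # over characters
    print(rank_frequency(words))
    print(length_distribution(words))
```

The same `conditional_entropy` function can be applied to words, characters, or part-of-speech tags simply by changing the input sequence; producing the tags would require a separate tagger, which this sketch does not include. Comparing the resulting values for a human-written sample and a machine-generated sample is one way such differences might be examined.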

dc.format

application/pdf

dc.language

cat

dc.publisher

Universitat Politècnica de Catalunya

dc.rights

Open Access

dc.subject

Àrees temàtiques de la UPC::Informàtica

dc.subject

Computational linguistics

dc.subject

Natural language processing (Computer science)

dc.subject

GPT2

dc.subject

lingüística quantitativa

dc.subject

quantitative linguistics

dc.subject

Lingüística computacional

dc.subject

Tractament del llenguatge natural (Informàtica)

dc.title

Caracterització i detecció de textos generats artificialment

dc.type

Bachelor thesis

This item appears in the following collection(s)

Treballs acadèmics [82545]