Large-scale analysis of Zipf’s law in English texts

Moreno-Sánchez, Isabel; Font-Clos, Francesc; Corral, Alvaro

dc.contributor.author

Moreno-Sánchez, Isabel

dc.contributor.author

Font-Clos, Francesc

dc.contributor.author

Corral, Alvaro

dc.date.accessioned

2020-10-29T11:51:52Z

dc.date.accessioned

2024-09-19T13:37:08Z

dc.date.available

2020-10-29T11:51:52Z

dc.date.available

2024-09-19T13:37:08Z

dc.date.issued

2015-01-01

dc.identifier.uri

http://hdl.handle.net/2072/377691

dc.description.abstract

Despite being a paradigm of quantitative linguistics, Zipf’s law for words suffers from three main problems: its formulation is ambiguous, its validity has not been tested rigorously from a statistical point of view, and it has not been tested against a representatively large number of texts. So, we can summarize the current support for Zipf’s law in texts as anecdotal. We try to solve these issues by studying three different versions of Zipf’s law and fitting them to all available English texts in the Project Gutenberg database (consisting of more than 30 000 texts). To do so, we use state-of-the-art tools in fitting and goodness-of-fit tests, carefully tailored to the peculiarities of text statistics. Remarkably, one of the three versions of Zipf’s law, consisting of a pure power-law form in the complementary cumulative distribution function of word frequencies, is able to fit more than 40% of the texts in the database (at the 0.05 significance level), for the whole domain of frequencies (from 1 to the maximum value) and with only one free parameter (the exponent).

eng

dc.format.extent

cat

dc.language.iso

eng

cat

dc.source

RECERCAT (Dipòsit de la Recerca de Catalunya)

dc.subject.other

Mathematics

cat

dc.title

Large-scale analysis of Zipf’s law in English texts

cat

dc.type

info:eu-repo/semantics/preprint

cat

dc.subject.udc

cat

dc.embargo.terms

none

cat

dc.rights.accessLevel

info:eu-repo/semantics/openAccess

Documents

L5-p_big_zipfMaRcAt.pdf

1.243Mb PDF

This item appears in the following Collection(s)

Prepublicacions del Centre de Recerca Matemàtica [619]