To access the full-text documents, please follow this link: http://hdl.handle.net/10230/34367

PDFdigest: an adaptable layout-aware PDF-to-XML textual content extractor for scientific articles
Ferrés, Daniel; Saggion, Horacio; Ronzano, Francesco; Bravo Serrano, Àlex, 1984-
Paper presented at the Language Resources and Evaluation Conference (LREC) 2018, held 7-12 May 2018 in Miyazaki, Japan.
Automated approaches and tools that extract structured textual content from PDF articles are essential for scientific text mining. This paper describes and evaluates PDFdigest, a PDF-to-XML textual content extraction system designed to extract the headings, logical structure (title, authors, abstract, etc.), and textual content of scientific articles. The extractor handles both text-based and image-based PDF articles using custom rule-based algorithms built on existing state-of-the-art open-source tools for PDF-to-HTML conversion and for Optical Character Recognition of image-based PDFs.
This work was partly funded by the TUNER project (TIN2015-65308-C5-5-R, MINECO/FEDER, UE) and the Spanish MINECO Ministry (MDM-2015-0502).
-Language resources
-Scientific text mining
-Digital libraries
-Information extraction
-PDF conversion
http://creativecommons.org/licenses/by-nc/4.0/
Conference object
Article - Accepted version
ACL (Association for Computational Linguistics)
         
