To access the full-text documents, please follow this link: http://hdl.handle.net/10230/34367
Title: | PDFdigest: an adaptable layout-aware PDF-to-XML textual content extractor for scientific articles |
---|---|
Author(s): | Ferrés, Daniel; Saggion, Horacio; Ronzano, Francesco; Bravo Serrano, Àlex, 1984- |
Abstract: | Paper presented at the Language Resources and Evaluation Conference (LREC) 2018, held from 7 to 12 May 2018 in Miyazaki, Japan. |
Abstract: | The availability of automated approaches and tools to extract structured textual content from PDF articles is essential to enable scientific text mining. This paper describes and evaluates the PDFdigest tool, a PDF-to-XML textual content extraction system specifically designed to extract the headings and logical structure (title, authors, abstract, ...) of scientific articles, together with their textual content. The extractor handles both text-based and image-based PDF articles using custom rule-based algorithms implemented on top of existing state-of-the-art open-source tools for PDF-to-HTML conversion and for Optical Character Recognition of image-based PDFs. |
Abstract: | This work was partly funded by the TUNER project (TIN2015-65308-C5-5-R, MINECO/FEDER, UE) and the Spanish MINECO Ministry (MDM-2015-0502). |
Subject(s): | -Language resources -Scientific text mining -Digital libraries -Information extraction -PDF conversion |
Rights: | http://creativecommons.org/licenses/by-nc/4.0/ |
Document type: | Conference object Article - Accepted version |
Publisher: | ACL (Association for Computational Linguistics) |