PDFdigest: an adaptable layout-aware PDF-to-XML textual content extractor for scientific articles

To access the full text documents, please follow this link: http://hdl.handle.net/10230/34367

Title:	PDFdigest: an adaptable layout-aware PDF-to-XML textual content extractor for scientific articles
Author:	Ferrés, Daniel; Saggion, Horacio; Ronzano, Francesco; Bravo Serrano, Àlex, 1984-
Abstract:	Paper presented at the Language Resources and Evaluation Conference (LREC) 2018, held 7 to 12 May 2018 in Miyazaki, Japan.
Abstract:	The availability of automated approaches and tools to extract structured textual content from PDF articles is essential to enable scientific text mining. This paper describes and evaluates the PDFdigest tool, a PDF-to-XML textual content extraction system specially designed to extract the headings and logical structure (title, authors, abstract, etc.) of scientific articles, together with their textual content. The extractor handles both text-based and image-based PDF articles using custom rule-based algorithms built on existing state-of-the-art open-source tools for PDF-to-HTML conversion and for Optical Character Recognition of image-based PDFs.
Abstract:	This work was partly funded by the TUNER project (TIN2015-65308-C5-5-R, MINECO/FEDER, UE) and the Spanish MINECO Ministry (MDM-2015-0502).
Subject(s):	-Language resources -Scientific text mining -Digital libraries -Information extraction -PDF conversion
Rights:	http://creativecommons.org/licenses/by-nc/4.0/
Document type:	Conference Object Article - Accepted version
Published by:	ACL (Association for Computational Linguistics)