Title:
|
PDFdigest: an adaptable layout-aware PDF-to-XML textual content extractor for scientific articles
|
Author:
|
Ferrés, Daniel; Saggion, Horacio; Ronzano, Francesco; Bravo Serrano, Àlex, 1984-
|
Abstract:
|
Paper presented at the Language Resources and Evaluation Conference (LREC) 2018, held 7-12 May 2018 in Miyazaki, Japan. |
Abstract:
|
The availability of automated approaches and tools to extract structured textual content from PDF articles is essential to enable scientific
text mining. This paper describes and evaluates the PDFdigest tool, a PDF-to-XML textual content extraction system specially designed
to extract scientific articles’ headings and logical structure (title, authors, abstract, ...) and their textual content. The extractor deals
with both text-based and image-based PDF articles using custom rule-based algorithms implemented with existing state-of-the-art
open-source tools for both PDF-to-HTML conversion and image-based PDF Optical Character Recognition. |
Abstract:
|
This work was partly funded by the TUNER project
(TIN2015-65308-C5-5-R, MINECO/FEDER, UE) and the
Spanish MINECO Ministry (MDM-2015-0502). |
Subject(s):
|
Language resources; Scientific text mining; Digital libraries; Information extraction; PDF conversion |
Rights:
|
http://creativecommons.org/licenses/by-nc/4.0/ |
Document type:
|
Conference Object Article - Accepted version |
Published by:
|
ACL (Association for Computational Linguistics)
|