Assessment of NER solutions against the first and second CALBC Silver Standard Corpus

Rebholz-Schuhmann, Dietrich; Furlong, Laura I., 1971-; Rautschka, Michael; Hahn, Udo

dc.contributor.author

Rebholz-Schuhmann, Dietrich

dc.contributor.author

Furlong, Laura I., 1971-

dc.contributor.author

Rautschka, Michael

dc.contributor.author

Hahn, Udo

dc.date.issued

2015-03-26T10:52:20Z

dc.date.issued

2011

dc.identifier

Rebholz-Schuhmann D, Yepes A, Li C, Kafkas S, Lewin I, Kang N et al. Assessment of NER solutions against the first and second CALBC silver standard corpus. Journal of Biomedical Semantics. 2011;2(S5):S11. DOI: 10.1186/2041-1480-2-S5-S11

dc.identifier

2041-1480

dc.identifier

http://hdl.handle.net/10230/23289

dc.identifier

http://dx.doi.org/10.1186/2041-1480-2-S5-S11

dc.description.abstract

Background: Competitions in text mining have been used to measure the performance of automatic text processing solutions against a manually annotated gold standard corpus (GSC). The preparation of the GSC is time-consuming and costly, and the final corpus consists of at most a few thousand documents annotated with a limited set of semantic groups. To overcome these shortcomings, the CALBC project partners (PPs) have produced a large-scale annotated biomedical corpus with four different semantic groups through the harmonisation of annotations from automatic text mining solutions, the first version of the Silver Standard Corpus (SSC-I). The four semantic groups are chemical entities and drugs (CHED), genes and proteins (PRGE), diseases and disorders (DISO) and species (SPE). This corpus was used for the First CALBC Challenge, which asked participants to annotate the corpus with their text processing solutions.
Results: All four PPs from the CALBC project and, in addition, 12 challenge participants (CPs) contributed annotated data sets for an evaluation against the SSC-I. CPs could ignore the training data and deliver the annotations from their genuine annotation system, or could train a machine-learning approach on the provided pre-annotated data. In general, the performance of the annotation solutions was lower for entities from the categories CHED and PRGE than for entities categorized as DISO and SPE. The best performance over all semantic groups was achieved by two annotation solutions that had been trained on the SSC-I. The data sets from participants who did not use the annotated data set from the SSC-I for training purposes were used to generate the harmonised Silver Standard Corpus II (SSC-II). The performance of the participants’ solutions was then measured again, this time against the SSC-II. The annotation solutions again showed better results for DISO and SPE than for CHED and PRGE.
Conclusions: The SSC-I delivers a large set of annotations (1,121,705) for a large number of documents (100,000 Medline abstracts). The annotations cover four different semantic groups and are sufficiently homogeneous to be reproduced with a trained classifier, leading to an average F-measure of 85%. Benchmarking the CPs’ annotation solutions against the SSC-II yields better performance than benchmarking them against the SSC-I.

dc.description.abstract

This work was funded by the EU Support Action grant 231727 under the 7th EU Framework Programme within Theme “Intelligent Content and Semantics” (ICT 2007.4.2). The work performed at IMIM (Laura Furlong) was funded by the EU Support Action grant 231727 under the 7th EU Framework Programme within Theme “Intelligent Content and Semantics” (ICT 2007.4.2) and the Instituto de Salud Carlos III FEDER (CP10/00524) grant. Fabio Rinaldi and Simon Clematide are supported by the Swiss National Science Foundation (grant 105315_130558/1).

dc.format

application/pdf

dc.language

eng

dc.publisher

BioMed Central

dc.relation

Journal of Biomedical Semantics. 2011;2(S5):S11

dc.relation

info:eu-repo/grantAgreement/EC/FP7/231727

dc.rights

© 2011 Rebholz-Schuhmann et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

dc.rights

http://creativecommons.org/licenses/by/2.0

dc.rights

info:eu-repo/semantics/openAccess

dc.subject

Semantics

dc.subject

Data -- Analysis

dc.title

Assessment of NER solutions against the first and second CALBC Silver Standard Corpus

dc.type

info:eu-repo/semantics/article

dc.type

info:eu-repo/semantics/publishedVersion

This item appears in the following collection(s)

Recerca: articles, congressos, llibres [21085]