Intrinsic evaluation of text mining tools may not predict performance on realistic tasks

James G Caporaso, Nita Deshpande, J. Lynn Fink, Philip E. Bourne, K. Bretonnel Cohen, Lawrence Hunter

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

29 Citations (Scopus)

Abstract

Biomedical text mining and other automated techniques are beginning to achieve performance which suggests that they could be applied to aid database curators. However, few studies have evaluated how these systems might work in practice. In this article we focus on the problem of annotating mutations in Protein Data Bank (PDB) entries, and evaluate the relationship between performance of two automated techniques, a text-mining-based approach (MutationFinder) and an alignment-based approach, in intrinsic versus extrinsic evaluations. We find that high performance on gold standard data (an intrinsic evaluation) does not necessarily translate to high performance for database annotation (an extrinsic evaluation). We show that this is in part a result of lack of access to the full text of journal articles, which appears to be critical for comprehensive database annotation by text mining. Additionally, we evaluate the accuracy and completeness of manually annotated mutation data in the PDB, and find that it is far from perfect. We conclude that currently the most cost-effective and reliable approach for database annotation might incorporate manual and automatic annotation methods.

Original language: English (US)
Title of host publication: Pacific Symposium on Biocomputing 2008, PSB 2008
Pages: 640-651
Number of pages: 12
State: Published - 2008
Externally published: Yes
Event: 13th Pacific Symposium on Biocomputing, PSB 2008 - Kohala Coast, HI, United States
Duration: Jan 4 2008 to Jan 8 2008

Fingerprint

Data Mining
Databases
Proteins
Mutation
Costs
Costs and Cost Analysis

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Biomedical Engineering
  • Medicine(all)

Cite this

Caporaso, J. G., Deshpande, N., Fink, J. L., Bourne, P. E., Cohen, K. B., & Hunter, L. (2008). Intrinsic evaluation of text mining tools may not predict performance on realistic tasks. In Pacific Symposium on Biocomputing 2008, PSB 2008 (pp. 640-651)

Intrinsic evaluation of text mining tools may not predict performance on realistic tasks. / Caporaso, James G; Deshpande, Nita; Fink, J. Lynn; Bourne, Philip E.; Cohen, K. Bretonnel; Hunter, Lawrence.

Pacific Symposium on Biocomputing 2008, PSB 2008. 2008. p. 640-651.

Caporaso, JG, Deshpande, N, Fink, JL, Bourne, PE, Cohen, KB & Hunter, L 2008, Intrinsic evaluation of text mining tools may not predict performance on realistic tasks. in Pacific Symposium on Biocomputing 2008, PSB 2008. pp. 640-651, 13th Pacific Symposium on Biocomputing, PSB 2008, Kohala Coast, HI, United States, 1/4/08.

Caporaso JG, Deshpande N, Fink JL, Bourne PE, Cohen KB, Hunter L. Intrinsic evaluation of text mining tools may not predict performance on realistic tasks. In Pacific Symposium on Biocomputing 2008, PSB 2008. 2008. p. 640-651.

Caporaso, James G ; Deshpande, Nita ; Fink, J. Lynn ; Bourne, Philip E. ; Cohen, K. Bretonnel ; Hunter, Lawrence. / Intrinsic evaluation of text mining tools may not predict performance on realistic tasks. Pacific Symposium on Biocomputing 2008, PSB 2008. 2008. pp. 640-651.

@inproceedings{73af7a988db8432c8116b6cc751521fa,
title = "Intrinsic evaluation of text mining tools may not predict performance on realistic tasks",
author = "Caporaso, {James G} and Nita Deshpande and Fink, {J. Lynn} and Bourne, {Philip E.} and Cohen, {K. Bretonnel} and Lawrence Hunter",
year = "2008",
language = "English (US)",
isbn = "9812776087",
pages = "640--651",
booktitle = "Pacific Symposium on Biocomputing 2008, PSB 2008",

}

TY - GEN

T1 - Intrinsic evaluation of text mining tools may not predict performance on realistic tasks

AU - Caporaso, James G

AU - Deshpande, Nita

AU - Fink, J. Lynn

AU - Bourne, Philip E.

AU - Cohen, K. Bretonnel

AU - Hunter, Lawrence

PY - 2008

Y1 - 2008

UR - http://www.scopus.com/inward/record.url?scp=40549141170&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=40549141170&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9812776087

SN - 9789812776082

SP - 640

EP - 651

BT - Pacific Symposium on Biocomputing 2008, PSB 2008

ER -