Uniqueness of medical data mining

https://doi.org/10.1016/S0933-3657(02)00049-0

Abstract

This article addresses the special features of data mining with medical data. Researchers in other fields may not be aware of the particular constraints and difficulties of the privacy-sensitive, heterogeneous, but voluminous data of medicine. Ethical and legal aspects of medical data mining are discussed, including data ownership, fear of lawsuits, expected benefits, and special administrative issues. The mathematical understanding of estimation and hypothesis formation in medical data may be fundamentally different from that in other data collection activities. Medicine is primarily a patient-care activity, and serves only secondarily as a research resource; almost the only justification for collecting medical data is to benefit the individual patient. Finally, medical data have a special status based upon their applicability to all people, their urgency (including life-or-death), and a moral obligation to use them for beneficial purposes.

Introduction

This article emphasizes the uniqueness of medical data mining. It is a position paper in which the authors' intent, based on their medical and data mining experience, is to alert the data mining community to the unique features of medical data mining. We wrote it because researchers who perform data mining in other fields may not be aware of the constraints and difficulties of mining the privacy-sensitive, heterogeneous data of medicine. We discuss ethical, security, and legal aspects of medical data mining. In addition, we pose several questions that must be answered by the community, so that both the patients on whom the data are collected and the data miners can benefit [15].

Human medical data are at once the most rewarding and difficult of all biological data to mine and analyze. Humans are the most closely watched species on earth. Human subjects can provide observations that cannot easily be gained from animal studies, such as visual and auditory sensations, the perception of pain, discomfort, hallucinations, and recollection of possibly relevant prior traumas and exposures. Most animal studies are short-term, and therefore cannot track long-term disease processes of medical interest, such as preneoplasia or atherosclerosis. With human data, there is no issue of having to extrapolate animal observations to the human species.

Some three-quarters of a billion persons living in North America, Europe, and Asia have at least some of their medical information collected in electronic form, at least transiently. These subjects generate volumes of data that an animal experimentalist can only dream of. On the other hand, there are ethical, legal, and social constraints on data collection and distribution that do not apply to non-human species, and that limit the scientific conclusions that may be drawn.

The major points of uniqueness of medical data may be organized under four general headings:

  • Heterogeneity of medical data

  • Ethical, legal, and social issues

  • Statistical philosophy

  • Special status of medicine

Heterogeneity of medical data

Raw medical data are voluminous and heterogeneous. Medical data may be collected from various images, interviews with the patient, laboratory data, and the physician’s observations and interpretations. All these components may bear upon the diagnosis, prognosis, and treatment of the patient, and cannot be ignored. The major areas of heterogeneity of medical data may be organized under these headings:

  • Volume and complexity of medical data

  • Physician’s interpretation

  • Sensitivity and specificity

Ethical, legal, and social issues

Because medical data are collected on human subjects, there is an enormous ethical and legal tradition designed to prevent the abuse of patients and misuse of their data. The major points of the ethical, legal, and social issues in medicine may be organized under five headings:

  • Data ownership

  • Fear of lawsuits

  • Privacy and security of human data

  • Expected benefits

  • Administrative issues

Statistical philosophy

There is an emerging doctrine that data mining methods themselves, especially statistics, and the basic assumptions underlying these methods, may be fundamentally different for medical data. Human medicine is primarily a patient-care activity, and serves only secondarily as a research resource. Generally, the only justification for collecting data in medicine, or for refusing to collect certain data, is to benefit the individual patient. Some patients might consent to be involved in research

Special status of medicine

Finally, medicine has a special status in science, philosophy, and daily life. The outcomes of medical care are life-or-death, and they apply to everybody. Medicine is a necessity, not merely an optional luxury, pleasure, or convenience.

Among all the professions, medicine has the longest apprenticeship. Most medical specialists in the USA require at least 11 years of training after high school graduation, and some surgical subspecialties require up to 16. In the USA, medical care costs consume

Summary

In summary, data mining in medicine is distinct from that in other fields because the data are heterogeneous; special ethical, legal, and social constraints apply to private medical information; statistical methods must address this heterogeneity and these social issues; and medicine itself has a special status in life.

Data from medical sources are voluminous, but they come from many different sources, not all of commensurate structure or quality. The physician’s interpretations are an

References (57)

  • Kurgan LA, et al. Knowledge discovery approach to automated cardiac SPECT diagnosis. Artif Intell Med, 2001.
  • Moore GW, et al. Token swap test of significance for serial medical databases. Am J Med, 1986.
  • Pawlak Z. Rough classification. Int J Man-Mach Stud, 1984.
  • Apps E. New mining industry standards: moving from monks to the mainstream. PC AI, 2000.
  • Banerjee S, Krishnamurthy V, Krishnaprasad M, Murthy R. Oracle8I—the XML-enabled data management system. In:...
  • Bauer, CJ. Data mining digs. Special advertising recruitment supplement to the Washington Post. Washington Post,...
  • Berman JJ, Moore GW, Hutchins GM. Maintaining patient confidentiality in the public domain Internet autopsy database...
  • Berman JJ. Tissue microarray data exchange standards: frequently asked questions, 2002...
  • Bray T, Paoli J, Maler E. eXtensible Markup Language (XML) 1.0. 2nd ed. W3C recommendation, October 2000...
  • Brewka G, Dix J, Konolige K. Nonmonotonic reasoning: an overview. CSLI Lecture Notes No. 73, ISBN 1-881526-83-6, 1997....
  • Büchner AG, Baumgarten M, Mulvenna MD, Böhm R, Anand SS. Data mining and XML: current and future issues. In:...
  • Cheng J, Xu J. IBM DB2 extender. In: Proceedings of the 16th International Conference on Data Engineering, San Diego...
  • Ceusters W. Medical natural language understanding as a supporting technology for data mining in healthcare. In: Cios...
  • Changeux J-P, Connes A. Conversations on mind, matter, and mathematics [DeBevoise MB, Trans.]. Princeton (NJ):...
  • Cios KJ, Pedrycz W, Swiniarski R. Data mining methods for knowledge discovery. Boston: Kluwer Academic Publishers,...
  • Cios KJ, et al. Diagnosing myocardial perfusion SPECT bull’s-eye maps: a knowledge discovery approach. IEEE Eng Med Biol, 2000.
  • Cios KJ, Moore GW. Medical data mining and knowledge discovery: an overview. In: Cios KJ, editor. Medical data mining...
  • Cios KJ, editor. Medical data mining and knowledge discovery. Heidelberg: Springer, 2001...
  • Cios KJ, Kurgan LA. Trends in data mining and knowledge discovery. In: Pal NR, Jain LC, Teodorescu N, editors....
  • CRISP-DM, 1998...
  • Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R. Advances in knowledge discovery and data mining. Boston: AAAI...
  • Fayyad UM, Piatetsky-Shapiro G, Smyth P. Knowledge discovery and data mining: towards a unifying framework. In:...
  • Friedman C, et al. Evaluating natural language processors in the clinical domain. Meth Inform Med, 1998.
  • Goebel M, et al. A survey of data mining software tools. SIGKDD Explor, 1999.
  • Goldner MG, et al. Effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes. 3. Clinical implications of UGDP results. JAMA, 1971.
  • Informix object translator, 2001...
  • Lai HC, et al. Risk of persistent growth impairment after alternate-day prednisone treatment in children with cystic fibrosis. N Engl J Med, 2000.
  • Manning CD, Schuetze H. Foundations of statistical natural language processing. Cambridge (MA): MIT Press,...