Uniqueness of medical data mining
Introduction
This position paper emphasizes the uniqueness of medical data mining. Drawing on their medical and data mining experience, the authors aim to alert the data mining community to the unique features of medical data mining. Researchers who perform data mining in other fields may not be aware of the constraints and difficulties of mining the privacy-sensitive, heterogeneous data of medicine. We discuss ethical, security, and legal aspects of medical data mining. In addition, we pose several questions that the community must answer, so that both the patients on whom the data are collected and the data miners can benefit [15].
Human medical data are at once the most rewarding and difficult of all biological data to mine and analyze. Humans are the most closely watched species on earth. Human subjects can provide observations that cannot easily be gained from animal studies, such as visual and auditory sensations, the perception of pain, discomfort, hallucinations, and recollection of possibly relevant prior traumas and exposures. Most animal studies are short-term, and therefore cannot track long-term disease processes of medical interest, such as preneoplasia or atherosclerosis. With human data, there is no issue of having to extrapolate animal observations to the human species.
Some three-quarters of a billion persons living in North America, Europe, and Asia have at least some of their medical information collected in electronic form, at least transiently. These subjects generate volumes of data that an animal experimentalist can only dream of. On the other hand, there are ethical, legal, and social constraints on data collection and distribution that do not apply to non-human species and that limit the scientific conclusions that may be drawn.
The major points of uniqueness of medical data may be organized under four general headings:
- Heterogeneity of medical data
- Ethical, legal, and social issues
- Statistical philosophy
- Special status of medicine
Heterogeneity of medical data
Raw medical data are voluminous and heterogeneous. Medical data may be collected from various images, interviews with the patient, laboratory data, and the physician’s observations and interpretations. All these components may bear upon the diagnosis, prognosis, and treatment of the patient, and cannot be ignored. The major areas of heterogeneity of medical data may be organized under these headings:
- Volume and complexity of medical data
- Physician’s interpretation
- Sensitivity and specificity
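The last heading refers to two standard measures of diagnostic test accuracy. As a minimal sketch (not taken from the paper), sensitivity is the fraction of diseased patients the test flags, and specificity is the fraction of disease-free patients it clears. The class name, function names, and counts below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for a binary diagnostic test (hypothetical container)."""
    true_pos: int   # diseased, test positive
    false_neg: int  # diseased, test negative
    true_neg: int   # disease-free, test negative
    false_pos: int  # disease-free, test positive


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of diseased patients correctly flagged: TP / (TP + FN)."""
    diseased = c.true_pos + c.false_neg
    if diseased == 0:
        raise ValueError("sensitivity is undefined with no diseased patients")
    return c.true_pos / diseased


def specificity(c: ConfusionCounts) -> float:
    """Fraction of disease-free patients correctly cleared: TN / (TN + FP)."""
    healthy = c.true_neg + c.false_pos
    if healthy == 0:
        raise ValueError("specificity is undefined with no disease-free patients")
    return c.true_neg / healthy


# Illustrative counts only; these are not data from any study.
counts = ConfusionCounts(true_pos=90, false_neg=10, true_neg=160, false_pos=40)
print(f"sensitivity = {sensitivity(counts):.2f}")
print(f"specificity = {specificity(counts):.2f}")
```

Reporting the two numbers separately matters in medicine because a missed disease and a false alarm usually carry very different costs, so a single accuracy figure can hide the trade-off.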
Ethical, legal, and social issues
Because medical data are collected on human subjects, there is an enormous ethical and legal tradition designed to prevent the abuse of patients and misuse of their data. The major points of the ethical, legal, and social issues in medicine may be organized under five headings:
- Data ownership
- Fear of lawsuits
- Privacy and security of human data
- Expected benefits
- Administrative issues
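To make the privacy heading concrete, here is one common way a data miner might pseudonymize records before analysis. It is a sketch under stated assumptions, not a procedure described in the paper: the field names (`patient_id`, `name`, `address`, `phone`) and both function names are hypothetical, and keyed hashing is used here only as one plausible technique. Removing direct identifiers in this way does not by itself guarantee anonymity, since combinations of the remaining fields may still identify a patient.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Map an identifier to a stable pseudonym using a keyed hash (HMAC-SHA256).

    The same identifier and key always give the same pseudonym, so records
    can still be linked, but the mapping cannot be recomputed without the key.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def strip_direct_identifiers(
    record: dict,
    secret_key: bytes,
    id_field: str = "patient_id",                     # hypothetical field name
    direct_identifiers=("name", "address", "phone"),  # hypothetical field names
) -> dict:
    """Return a copy of the record without direct identifiers and with a pseudonymous ID."""
    cleaned = {k: v for k, v in record.items() if k not in direct_identifiers}
    cleaned[id_field] = pseudonymize(str(record[id_field]), secret_key)
    return cleaned


# Illustrative record only; all values are made up.
key = b"example-key-held-by-the-data-custodian"
record = {
    "patient_id": "12345",
    "name": "Jane Doe",
    "phone": "555-0100",
    "diagnosis": "atherosclerosis",
}
cleaned = strip_direct_identifiers(record, key)
print(sorted(cleaned))
```

In this arrangement the key would stay with the institution that owns the data, so the miner can link records belonging to one patient without being able to recover the original identifier.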
Statistical philosophy
There is an emerging doctrine that data mining methods themselves, especially statistics, and the basic assumptions underlying these methods, may be fundamentally different for medical data. Human medicine is primarily a patient-care activity, and serves only secondarily as a research resource. Generally, the only justification for collecting data in medicine, or for refusing to collect certain data, is to benefit the individual patient. Some patients might consent to be involved in research
Special status of medicine
Finally, medicine has a special status in science, philosophy, and daily life. The outcomes of medical care are life-or-death, and they apply to everybody. Medicine is a necessity, not merely an optional luxury, pleasure, or convenience.
Among all the professions, medicine has the longest apprenticeship. Most medical specialists in the USA require at least 11 years of training after high school graduation, and some surgical subspecialties require up to 16. In the USA, medical care costs consume
Summary
In summary, data mining in medicine is distinct from that in other fields because the data are heterogeneous; special ethical, legal, and social constraints apply to private medical information; statistical methods must address this heterogeneity and these social issues; and medicine itself has a special status in life.
Data from medical sources are voluminous, but they come from many different sources, not all of commensurate structure or quality. The physician’s interpretations are an
References (57)
- et al. Knowledge discovery approach to automated cardiac SPECT diagnosis. Artif Intell Med (2001)
- et al. Token swap test of significance for serial medical databases. Am J. Med. (1986)
- Rough classification. Int. J. Man-Mach. Stud. (1984)
- New mining industry standards: moving from monks to the mainstream. PC AI (2000)
- Banerjee S, Krishnamurthy V, Krishnaprasad M, Murthy R. Oracle8I—the XML-enabled data management system. In:...
- Bauer, CJ. Data mining digs. Special advertising recruitment supplement to the Washington Post. Washington Post,...
- Berman JJ, Moore GW, Hutchins GM. Maintaining patient confidentiality in the public domain Internet autopsy database...
- Berman JJ. Tissue microarray data exchange standards: frequently asked questions, 2002...
- Bray T, Paoli J, Maler E. eXtensible Markup Language (XML) 1.0. 2nd ed. W3C recommendation, October 2000...
- Brewka G, Dix J, Konolige K. Nonmonotonic reasoning: an overview. CSLI Lecture Notes No. 73, ISBN 1-881526-83-6, 1997....
- Diagnosing myocardial perfusion SPECT bull’s-eye maps—a knowledge discovery approach. IEEE Eng Med Biol
- Evaluating natural language processors in the clinical domain. Meth. Inform. Med.
- A survey of data mining software tools. SIGKDD Explor.
- Effects of hypoglycemic agents on vascular complications in patients with adult-onset diabetes. 3. Clinical implications of UGDP results. JAMA
- Risk of persistent growth impairment after alternate-day prednisone treatment in children with cystic fibrosis. N Engl J Med