PT - JOURNAL ARTICLE
AU - Yvette Pyne
AU - Yik Ming Wong
AU - Haishuo Fang
AU - Edwin Simpson
TI - Analysis of ‘One in a Million’ primary care consultation conversations using natural language processing
AID - 10.1136/bmjhci-2022-100659
DP - 2023 Apr 01
TA - BMJ Health & Care Informatics
PG - e100659
VI - 30
IP - 1
4099 - http://informatics.bmj.com/content/30/1/e100659.short
4100 - http://informatics.bmj.com/content/30/1/e100659.full
SO - BMJ Health Care Inform 2023 Apr 01; 30
AB - Background: Modern patient electronic health records form a core part of primary care; they contain both clinical codes and free text entered by the clinician. Natural language processing (NLP) could be employed to generate these records through ‘listening’ to a consultation conversation. Objectives: This study develops and assesses several text classifiers for identifying clinical codes for primary care consultations based on the doctor–patient conversation. We evaluate the possibility of training classifiers using medical code descriptions, and the benefits of processing transcribed speech from patients as well as doctors. The study also highlights steps for improving future classifiers. Methods: Using verbatim transcripts of 239 primary care consultation conversations (the ‘One in a Million’ dataset) and novel additional datasets for distant supervision, we trained NLP classifiers (naïve Bayes, support vector machine, nearest centroid, a conventional BERT classifier and few-shot BERT approaches) to identify the International Classification of Primary Care-2 clinical codes associated with each consultation. Results: Of all models tested, a fine-tuned BERT classifier was the best performer. Distant supervision improved the model’s performance (F1 score over 16 classes) from 0.45 with conventional supervision with 191 labelled transcripts to 0.51.
Incorporating patients’ speech in addition to the clinician’s speech increased the BERT classifier’s performance from 0.45 to 0.55 F1 (p=0.01, paired bootstrap test). Conclusions: Our findings demonstrate that NLP classifiers can be trained to identify the clinical area(s) being discussed in a primary care consultation from audio transcriptions; this could represent an important step towards a smart digital assistant in the consultation room. The ‘One in a Million’ dataset is available for research use following valid ethics approval. ICPC-2 codes and descriptions are freely downloadable from the web. The NICE CKS Health Topics dataset is freely downloadable from the NICE website using freely available ‘web-scraping’ tools.