Background
Non-alcoholic fatty liver disease (NAFLD) is an umbrella term that describes two subtypes of liver disease: non-alcoholic fatty liver (NAFL) and non-alcoholic steatohepatitis (NASH).1 2 NAFL is characterised by fat accumulation (steatosis) in the liver without significant inflammation. NASH is a more severe form of NAFLD and is characterised by steatosis with inflammation and fibrosis, which can progress to cirrhosis.1 The prevalence of NAFLD in the USA is estimated to be 24%–26% of adults, of whom an estimated 20%–30% have NASH.3
The transition from simple hepatic steatosis to NASH is a crucial point in the development of severe liver disease, putting patients at higher risk for fibrosis and progression to chronic liver disease.4 Nevertheless, NASH is often underdiagnosed in clinical practice.5–7 This may be due to several factors. First, there is a lack of clear patient symptoms and reliable biomarkers to help identify NASH,8 9 and there are no universal routine screening standards.10 Second, liver biopsy is the gold standard for NASH diagnosis but is costly, invasive, complicated by sampling errors and requires a specialist to perform.4 11 Finally, despite ongoing clinical trials, there are currently no approved pharmacological treatments for NASH outside of India.12 Thus, detection of NASH remains a challenge, and reliable diagnostic tools, including minimally invasive or even non-invasive techniques, are warranted.
Machine learning (ML) with real-world data may help address the underdiagnosis of common and rare diseases. We recently demonstrated the application of ML in a retrospective case–control cohort study based on a US claims database to identify patients with undiagnosed hepatitis C virus.13 For the detection of NASH, studies have yielded encouraging results using metabolomics,14–17 electronic health records18–20 or combined clinical-claims data.21 The use case of each approach may be influenced by the chosen data type, characteristics of the model training population or the targeted application to patients with documented NAFL. Continuing to build on these efforts will further enable ML approaches to facilitate NASH detection.
This study examined supervised ML using medical claims as a non-invasive strategy to identify patients with likely NASH who might benefit from appropriate clinical follow-up, such as monitoring or diagnostic screening. We used a retrospective rolling cross-sectional study design,22 taking multiple snapshots of each patient's prescription and medical claim history. This design emulates the patient data available during real-world deployment and provides training examples of patients with NASH at different points in the patient journey prior to diagnosis. We also evaluated both knowledge-driven (‘hypothesis-driven’) and automated data-driven strategies for developing clinical predictors for NASH detection.
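To make the rolling cross-sectional idea concrete, the following is a minimal sketch of how repeated snapshots of claim histories could be turned into labelled training rows. It is an illustration under stated assumptions, not the study's actual pipeline: the function name `build_snapshots`, the 365-day lookback window, the 180-day prediction horizon, the record layout and the placeholder codes `CODE_A` and `CODE_B` are all hypothetical.

```python
from datetime import date, timedelta


def build_snapshots(claims, diagnosis_dates, snapshot_dates,
                    lookback_days=365, horizon_days=180):
    """Build one row per (patient, snapshot date) from claim histories.

    claims: list of (patient_id, claim_date, code) tuples.
    diagnosis_dates: dict mapping patient_id to first NASH diagnosis date;
        patients with no diagnosis are absent from the dict.
    lookback_days and horizon_days are illustrative assumptions only.
    """
    rows = []
    patients = sorted({pid for pid, _, _ in claims})
    for snap in snapshot_dates:
        window_start = snap - timedelta(days=lookback_days)
        horizon_end = snap + timedelta(days=horizon_days)
        for pid in patients:
            dx = diagnosis_dates.get(pid)
            # Patients already diagnosed at the snapshot are excluded,
            # since only pre-diagnosis history is useful for detection.
            if dx is not None and dx <= snap:
                continue
            # Only claims observed strictly before the snapshot are used,
            # mimicking what would be visible at deployment time.
            codes = sorted({code for p, d, code in claims
                            if p == pid and window_start <= d < snap})
            if not codes:
                continue
            # Positive label: diagnosis falls within the horizon after the snapshot.
            label = int(dx is not None and dx <= horizon_end)
            rows.append({"patient_id": pid, "snapshot": snap,
                         "codes": codes, "label": label})
    return rows


# Hypothetical toy data: two patients, one of whom is later diagnosed.
claims = [
    ("p1", date(2019, 1, 10), "CODE_A"),
    ("p1", date(2019, 5, 2), "CODE_B"),
    ("p2", date(2019, 3, 15), "CODE_A"),
]
diagnosis_dates = {"p1": date(2019, 8, 1)}
snapshot_dates = [date(2019, 4, 1), date(2019, 7, 1), date(2019, 10, 1)]

rows = build_snapshots(claims, diagnosis_dates, snapshot_dates)
for row in rows:
    print(row["patient_id"], row["snapshot"], row["codes"], row["label"])
```

In this sketch the same patient can contribute several rows, each reflecting a different stage of the pre-diagnosis history, and a patient drops out of later snapshots once the diagnosis date has passed.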