PURPOSE: Administrative health datasets are widely used in public health research but often lack information about common confounders. We aimed to develop and validate machine learning (ML)-based models using medication data from Australia's Pharmaceutical Benefits Scheme (PBS) database to predict obesity and smoking. METHODS: We used data from the D-Health Trial (N=18,000) and the QSkin Study (N=43,794). Smoking history, height, and weight were self-reported at study entry. Linkage to the PBS dataset captured 5 years of medication data after cohort entry. We used age, sex, and medication use, classified using Anatomical Therapeutic Chemical (ATC) classification codes, as potential predictors of smoking (current or quit <10 years ago; never or quit ≥10 years ago) and obesity (obese; non-obese). We trained gradient-boosted machine learning models using data for the first 80% of participants enrolled; models were validated using the remaining 20%. We assessed model performance overall and by sex and age, and compared models generated using 3 and 5 years of PBS data. RESULTS: Based on the validation dataset using 3 years of PBS data, the area under the receiver operating characteristic curve (AUC) was 0.70 (95% confidence interval (CI) 0.68-0.71) for predicting obesity and 0.71 (95% CI 0.70-0.72) for predicting smoking. Models performed better in women than in men. Using 5 years of PBS data resulted in marginal improvement. CONCLUSIONS: Medication data in combination with age and sex can be used to predict obesity and smoking. These models may be of value to researchers using data collected for administrative purposes.
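
The validation design described in the abstract (train on the first 80% of participants enrolled, validate on the remaining 20%, report AUC) can be illustrated with a minimal sketch. This is not the authors' code: the function names `split_by_enrolment` and `auc` are hypothetical, the records are assumed to be already sorted by enrolment order, and the gradient-boosted model itself is omitted, so the scores below are made-up placeholders rather than study results.

```python
from typing import List, Sequence, Tuple


def split_by_enrolment(records: Sequence, train_fraction: float = 0.8) -> Tuple[List, List]:
    """Order-preserving split (hypothetical helper).

    Assumes `records` is already sorted by enrolment order, so the first
    `train_fraction` of participants go to training and the rest to validation.
    """
    cut = int(len(records) * train_fraction)
    return list(records[:cut]), list(records[cut:])


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Placeholder participant IDs in enrolment order (illustrative only).
participants = list(range(10))
train_ids, valid_ids = split_by_enrolment(participants)

# Placeholder labels and predicted scores (illustrative only, not study data).
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Splitting by enrolment order rather than at random means the validation participants were all recruited later than the training participants, which gives a mild check on how the model carries over to later enrollees. The pairwise AUC shown here is quadratic in the number of cases; for cohorts of this size a rank-based implementation from a statistics library would be the practical choice.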
Authors | Ali, Sitwat; Na, Renhua; Waterhouse, Mary; Jordan, Susan J; Olsen, Catherine M; Whiteman, David C; Neale, Rachel E |
---|---|
Journal | Pharmacoepidemiology and drug safety |
Pages | 91-99 |
Volume | 50 |
Date | 1/01/2021 |
Grant ID | |
Funding Body | |
URL | http://www.ncbi.nlm.nih.gov/pubmed/?term=10.1002/pds.5367 |