Scaling Clinician-Grade Feature Generation from Clinical Notes with Multi-Agent Language Models

By Jiayi Wang, Jacqueline Jil Vallon, Nikhil V. Kotha, Neil Panjwani, Xi Ling, Margaret Redfield, Sushmita Vij, Sandy Srinivas, John Leppert, Mark K. Buyyounouski, Mohsen Bayati

December 12, 2025 | Working Paper No. 4307

Operations, Information & Technology

Developing accurate clinical prediction models is often bottlenecked by the difficulty of generating meaningful predictive features from unstructured data. While electronic health records (EHRs) contain rich narrative information, extracting a comprehensive list of structured features from them requires extensive domain knowledge and granular clinical judgment, a process that has historically been manual, unscalable, and impractical for large cohorts. In this study, we first established a rigorous patient-level Clinician Feature Generation (CFG) protocol, in which domain experts manually reviewed notes to define and extract nuanced features for a cohort of 147 patients with prostate cancer. As a high-fidelity ground truth, this labor-intensive process provided the blueprint for SNOW (Scalable Note-to-Outcome Workflow), a transparent multi-agent large language model (LLM) system designed to autonomously mimic the iterative reasoning and validation workflow of clinical experts. In the prostate cancer cohort, SNOW achieved predictive performance for 5-year recurrence (AUC-ROC 0.767 ± 0.041) that was indistinguishable from the gold-standard manual CFG (0.762 ± 0.026) and superior to structured baselines, clinician-guided LLM extraction, and six representational feature generation (RFG) approaches. Manual CFG required prolonged expert review and per-patient abstraction; in contrast, once configured, SNOW generated the full patient-level feature table in 12 hours with 5 hours of clinician oversight, reducing human expert effort by approximately 48-fold. To assess scalability in a setting where manual CFG is infeasible, we deployed SNOW on an external population of 2,084 patients with heart failure with preserved ejection fraction (HFpEF) from the MIMIC-IV database. Without task-specific tuning, SNOW generated prognostic features that outperformed baseline and RFG methods for 30-day (SNOW: 0.851 ± 0.008) and 1-year (SNOW: 0.763 ± 0.003) mortality prediction.
These results demonstrate that a modular LLM agent-based system can scale expert-level feature generation from clinical notes, while enabling interpretable use of unstructured EHR text in outcome prediction and preserving generalizability across a variety of settings and conditions.