Ifeoma_Gbenga
Global Health Researcher · Lagos
Feb 2026
Something the major papers systematically underreport: essentially all the high-performing chest X-ray models were trained on NIH ChestX-ray14, CheXpert (Stanford), or MIMIC-CXR (Beth Israel Deaconess, Boston), which are almost entirely adult patients from US academic hospitals, predominantly posteroanterior projections acquired on high-quality digital DR systems. Deploy these in a resource-limited setting, with portable X-ray units, paediatric populations, higher prevalence of TB and HIV-associated pneumonia, and far more anteroposterior projections, and performance degrades substantially. The NIAID-funded TBnet and WHO-supported qXR (from Qure.ai) models are trained on deliberately diverse global datasets and perform significantly better in high-TB-burden settings. If you work in global health this is not a minor caveat; it is the difference between a useful tool and a harmful one. Qure.ai's technical white papers at qure.ai/resources are actually transparent about their demographic coverage, which is rare and worth appreciating.
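Practical corollary: never accept a vendor's single headline AUC. Before any go/no-go decision, score the model on a locally collected validation set and stratify by projection and age group. A minimal sketch of what I mean, assuming a hypothetical CSV of local model scores and reference-standard labels (all column names here are made up for illustration):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical columns: 'prob_tb' is the model's TB score, 'label_tb' the
# reference-standard label, 'projection' is PA/AP, 'age_group' adult/paediatric.
df = pd.read_csv("local_validation_set.csv")

# An overall AUC can hide subgroup failure; always report both.
print("overall AUC:", roc_auc_score(df.label_tb, df.prob_tb))
for (proj, age), grp in df.groupby(["projection", "age_group"]):
    if grp.label_tb.nunique() < 2:
        continue  # AUC is undefined unless both classes are present
    print(proj, age, roc_auc_score(grp.label_tb, grp.prob_tb))
```

If the paediatric AP subgroup drops well below the overall number, that is exactly the training-distribution mismatch I am describing, and no amount of benchmark performance on CheXpert compensates for it.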
Robin_Walsh
ML Engineer · Carestream Health
Jan 2026
Adding technical context to Priya's excellent overview: the shift from classification-only models to vision-language models (VLMs) in chest X-ray AI is a genuine paradigm change, but it introduces failure modes that pure classification didn't have. VLMs can produce fluent, confident-sounding text that is clinically wrong. A DenseNet-121 either fires the "effusion" label or it doesn't; a VLM can write "small left pleural effusion is present" with high textual confidence when the finding is actually absent. Calibration of VLMs for medical imaging is an active research area, and the best current practice is to evaluate not just AUC on the standard benchmarks but expected calibration error (ECE) on your deployment-specific patient population, because distribution shift between the training cohort and your clinical site can destroy calibration even when overall AUC looks fine. The ReXVal dataset and the RadGraph-F1 metric are the current community standards for evaluating radiology report quality beyond simple pathology detection; worth knowing both if you're benchmarking models internally.
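For anyone who hasn't implemented it, ECE is about a dozen lines of numpy: bin the predicted probabilities, then compare each bin's mean confidence to its observed positive rate. A minimal sketch for a single binary finding (the equal-width binning and the bin count of 10 are just common defaults, not anything mandated by the literature):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for one binary finding: bin predicted probabilities and
    compare each bin's mean confidence to its observed positive rate."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(probs, edges[1:-1])  # indices 0 .. n_bins-1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        conf = probs[mask].mean()   # mean predicted probability in the bin
        acc = labels[mask].mean()   # observed frequency of the finding
        ece += mask.mean() * abs(conf - acc)  # weight by bin occupancy
    return ece
```

Compute this per finding and per subgroup on your own site's data, not on the public test splits: a model can be well calibrated in aggregate and badly miscalibrated on exactly the patients your triage queue cares about.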
Priya_Thomas ✓ Chest Radiologist
Radiologist · Apollo Hospitals
Jan 2026
CheXNet (Stanford's 2017 DenseNet-121 model) is effectively a historical baseline at this point: still cited in papers, but significantly outperformed by everything released in the last two years. The models that are genuinely production-ready for chest X-ray triage in 2026 are BioViL-T (Microsoft Research) and CheXagent (Stanford AIMI), both vision-language foundation models that handle free-text clinical findings rather than a fixed label set. What this means practically is that you can query the model with a clinical question like "is there evidence of pulmonary oedema?" and get a grounded answer with attention heatmaps highlighting the relevant regions, which is far more useful for radiologist workflow integration than a fixed 14-label output. Google's Med-Gemini, the medically fine-tuned version of Gemini, also does chest X-ray interpretation and was reported in the 2025 paper to exceed radiologist-level performance on certain CheXpert tasks, though any "exceeds radiologist" framing needs careful reading of the specific task setup. For open-source triage work, CheXagent's weights are publicly available and it runs on a single A100.
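For those wanting to try the free-text querying, here is roughly what the workflow looks like through the standard Hugging Face interface. Treat this as a sketch only: the repo id, prompt format, and exact processor call are my assumptions from memory, so check the official CheXagent model card for the correct identifiers and input template before running anything.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Repo id is an assumption; confirm it on the CheXagent model card.
model_id = "StanfordAIMI/CheXagent-8b"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).to("cuda")

image = Image.open("cxr_pa.png").convert("RGB")
prompt = "Is there evidence of pulmonary oedema?"

# The exact processor signature and prompt template may differ from this;
# custom trust_remote_code processors vary between releases.
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Whatever you get back, remember Robin's point above: a fluent answer is not a calibrated one, so validate the free-text outputs against a local reference standard before letting them near a worklist.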