Recent progress in large-scale CLIP-like vision-language models (VLMs) has greatly advanced medical image analysis. However, most existing medical VLMs still rely on coarse image-text contrastive objectives and fail to capture the systematic visual knowledge encoded in well-defined medical phenotype ontologies. To address this gap, we construct PhenoKG, the first large-scale, phenotype-centric multimodal knowledge graph, encompassing over 520K high-quality image-text pairs linked to more than 3,000 phenotypes. Building upon PhenoKG, we propose PhenoLIP, a novel pretraining framework that explicitly incorporates structured phenotype knowledge into medical VLMs through a two-stage process: we first learn a knowledge-enhanced phenotype embedding space from textual ontology data, and then distill this structured knowledge into multimodal pretraining via a teacher-guided knowledge distillation objective. To support evaluation, we further introduce PhenoBench, an expert-verified benchmark for phenotype recognition comprising over 7,800 image-caption pairs covering more than 1,000 phenotypes. Extensive experiments demonstrate that PhenoLIP outperforms previous state-of-the-art baselines, surpassing BiomedCLIP by 8.85% in phenotype classification accuracy and BIOMEDICA by 15.03% in cross-modal retrieval, underscoring the value of integrating phenotype-centric priors into medical VLMs for structured and interpretable medical image understanding.
524K+ Image-Text Pairs
3,000+ Phenotypes
7,800+ Benchmark Pairs
36.56% Zero-Shot Accuracy
PhenoKG Construction
We construct PhenoKG, a large-scale phenotype-centric multimodal knowledge graph encompassing over 520K high-quality image-text pairs linked to more than 3,000 phenotypes.

PhenoLIP Framework
Our two-stage pretraining approach first learns a knowledge-enhanced phenotype embedding space from textual ontology data, then distills this structured knowledge into multimodal pretraining via a teacher-guided knowledge distillation objective.

PhenoBench Benchmark
We introduce an expert-verified benchmark for phenotype recognition with over 7,800 image-caption pairs covering more than 1,000 phenotypes.
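The two-stage objective described above can be sketched as a standard image-text contrastive loss combined with a distillation term that pulls the student's phenotype-similarity distribution toward the teacher's. This is an illustrative sketch only: the function names, temperatures, and the weighting `lam` are assumptions, not the paper's exact formulation, and the encoders producing the features are omitted.

```python
import numpy as np

def _log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(img_feats, txt_feats, tau=0.07):
    """Symmetric InfoNCE over a batch of L2-normalized, paired features."""
    sim = img_feats @ txt_feats.T / tau                 # (B, B) cosine logits
    idx = np.arange(sim.shape[0])
    i2t = -_log_softmax(sim, axis=1)[idx, idx].mean()   # image -> text
    t2i = -_log_softmax(sim, axis=0)[idx, idx].mean()   # text -> image
    return 0.5 * (i2t + t2i)

def distill_loss(student_sim, teacher_sim, tau=1.0):
    """KL(teacher || student) between phenotype-similarity distributions."""
    log_p = _log_softmax(teacher_sim / tau)
    log_q = _log_softmax(student_sim / tau)
    p = np.exp(log_p)
    return (p * (log_p - log_q)).sum(axis=1).mean()

def phenolip_objective(img_feats, txt_feats, student_sim, teacher_sim, lam=0.5):
    # `lam` is a hypothetical weighting; the actual objective may differ
    return contrastive_loss(img_feats, txt_feats) + lam * distill_loss(student_sim, teacher_sim)
```

In this sketch the teacher is the frozen text-side phenotype embedding space learned in stage one; the distillation term transfers its similarity structure into the multimodal student.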
Zero-Shot Phenotype Classification
Key Finding: PhenoLIP achieves an average accuracy of 36.56%, demonstrating strong generalization across diverse phenotype recognition tasks.
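Zero-shot classification in CLIP-style models is typically done by encoding each candidate phenotype name with the text encoder and picking the label whose embedding is most cosine-similar to the image embedding. A minimal sketch over precomputed features (the encoders and any prompt templates are omitted and assumed):

```python
import numpy as np

def zero_shot_classify(image_feat, phenotype_feats):
    """Return the index of the phenotype whose text embedding has the
    highest cosine similarity to the image embedding."""
    img = image_feat / np.linalg.norm(image_feat)
    phe = phenotype_feats / np.linalg.norm(phenotype_feats, axis=1, keepdims=True)
    return int(np.argmax(phe @ img))
```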
Cross-Modal Retrieval on PhenoBench
Caption: Cross-modal retrieval results on PhenoBench (7,819 images, 1,187 phenotypes). I2T and T2I denote image-to-text and text-to-image retrieval; I2P and P2I denote image-to-phenotype and phenotype-to-image retrieval.
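Cross-modal retrieval results like these are conventionally reported as Recall@K. A minimal sketch, assuming a query-by-gallery similarity matrix whose ground-truth matches lie on the diagonal (as in paired image-caption evaluation):

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose ground-truth item (diagonal entry)
    ranks within the top-k of that query's row."""
    idx = np.arange(sim.shape[0])
    gt = sim[idx, idx][:, None]
    ranks = (sim > gt).sum(axis=1)   # items scored strictly above ground truth
    return float((ranks < k).mean())
```

Computing this on the image-to-text similarity matrix gives I2T recall; transposing the matrix gives T2I.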
Rare Facial Phenotype Identification on Face2Gene
Key Finding: On the challenging Face2Gene dataset with 321 images and 993 candidate phenotypes, PhenoLIP achieves the highest F1-score of 2.62%, demonstrating enhanced ability to capture subtle visual signatures of rare diseases.
Linear Probing Performance
Caption: Benchmarking results on Linear Evaluation (Acc) across different data ratios. The best-performing model for each setting is in bold, and the second-best is underlined.
Ablation Studies
Impact of different components on model performance.
Key Findings: Domain-specific pre-trained encoders provide strong baselines. Knowledge distillation significantly improves performance by injecting structured ontological knowledge. Data curation (subfigure detection and LLM-based caption refinement) consistently enhances model performance. The full model combining all components achieves the best results. (KB: Knowledge-enhanced BERT; PMB: PubMedBERT)
C. Liang, C. Wu, W. Zhao, Y. Zhang, Y. Wang, W. Xie. PhenoLIP: Integrating Phenotype Ontology Knowledge into Medical Vision–Language Pretraining. In Submission, 2026. [Paper] [Code] [BibTeX]
Acknowledgements