PhenoLIP: Integrating Phenotype Ontology Knowledge
into Medical Vision–Language Pretraining
Cheng Liang1,2
Chaoyi Wu1
Weike Zhao1,2
Ya Zhang1,2
Yanfeng Wang1,2
Weidi Xie1,2
1School of Artificial Intelligence, Shanghai Jiao Tong University
2Shanghai Artificial Intelligence Laboratory
[Paper]
[GitHub]
[Dataset]
[Demo]
Overview of PhenoKG and PhenoLIP: (a) PhenoKG organizes diverse anatomical systems hierarchically; (b) Unified multimodal integration aligning phenotype images, descriptions, and ontology knowledge; (c) PhenoLIP framework with knowledge-enhanced vision–language pretraining via distillation.

Abstract

Recent progress in large-scale CLIP-like vision-language models (VLMs) has greatly advanced medical image analysis. However, most existing medical VLMs still rely on coarse image-text contrastive objectives and fail to capture the systematic visual knowledge encoded in well-defined medical phenotype ontologies. To address this gap, we construct PhenoKG, the first large-scale, phenotype-centric multimodal knowledge graph that encompasses over 520K high-quality image-text pairs linked to more than 3,000 phenotypes. Building upon PhenoKG, we propose PhenoLIP, a novel pretraining framework that explicitly incorporates structured phenotype knowledge into medical VLMs through a two-stage process. We first learn a knowledge-enhanced phenotype embedding space from textual ontology data and then distill this structured knowledge into multimodal pretraining via a teacher-guided knowledge distillation objective. To support evaluation, we further introduce PhenoBench, an expert-verified benchmark designed for phenotype recognition, comprising over 7,800 image-caption pairs covering more than 1,000 phenotypes. Extensive experiments demonstrate that PhenoLIP outperforms previous state-of-the-art baselines, improving upon BiomedCLIP in phenotype classification accuracy by 8.85% and BIOMEDICA in cross-modal retrieval by 15.03%, underscoring the value of integrating phenotype-centric priors into medical VLMs for structured and interpretable medical image understanding.
524K+ Image-Text Pairs · 3,000+ Phenotypes · 7,800+ Benchmark Pairs · 36.56% Zero-Shot Accuracy


Method

PhenoLIP pipeline: Two-stage training process incorporating structured phenotype knowledge into vision-language pretraining through knowledge distillation.

PhenoKG Construction

We construct PhenoKG, a large-scale phenotype-centric multimodal knowledge graph, by integrating:
  • Image-text pairs: Over 524K high-quality medical images with detailed phenotype descriptions
  • Phenotype ontology: Structured knowledge from Human Phenotype Ontology (HPO) covering 3,000+ phenotypes
  • Hierarchical organization: Multi-level anatomical system categorization for systematic knowledge representation
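For intuition, the graph can be pictured as HPO nodes carrying their is-a edges plus attached image-caption pairs. A minimal Python sketch (the field names and traversal are illustrative, not the actual PhenoKG schema):

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a PhenoKG entry; field names are illustrative,
# not the actual dataset schema.
@dataclass
class PhenotypeNode:
    hpo_id: str                # HPO identifier, e.g. "HP:0000234"
    name: str                  # phenotype label
    parents: list = field(default_factory=list)           # is-a edges
    image_text_pairs: list = field(default_factory=list)  # (image_path, caption)

# Toy two-node graph: attach an image-caption pair and walk the hierarchy.
graph = {
    "HP:0000152": PhenotypeNode("HP:0000152", "Abnormality of head or neck"),
    "HP:0000234": PhenotypeNode("HP:0000234", "Abnormality of the head",
                                parents=["HP:0000152"]),
}
graph["HP:0000234"].image_text_pairs.append(("img_001.png", "enlarged cranium"))

def ancestors(graph, hpo_id):
    """Collect all ancestor IDs by following is-a edges upward."""
    out, stack = set(), list(graph[hpo_id].parents)
    while stack:
        pid = stack.pop()
        if pid not in out:
            out.add(pid)
            stack.extend(graph[pid].parents)
    return out
```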

PhenoLIP Framework

Our two-stage pretraining approach:
  1. Stage 1 - Knowledge Embedding: Learn a structured phenotype embedding space from textual ontology data using language models
  2. Stage 2 - Knowledge Distillation: Distill the learned structured knowledge into vision-language pretraining via teacher-guided objectives
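The Stage-2 objective can be pictured as matching the student's image-text similarity distribution to the teacher's phenotype-embedding similarities. A minimal numpy sketch, assuming a row-wise KL(teacher ‖ student) over in-batch similarities (the exact loss form in the paper may differ):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def distill_loss(student_img, student_txt, teacher_emb, tau=0.07):
    """Row-wise KL(teacher || student): the teacher's phenotype-space
    similarity distribution supervises the student's image-text logits.
    All inputs are (batch, dim) and assumed L2-normalized."""
    s = (student_img @ student_txt.T) / tau   # student image-text logits
    t = (teacher_emb @ teacher_emb.T) / tau   # teacher phenotype-space logits
    p_t = softmax(t)
    return float((p_t * (log_softmax(t) - log_softmax(s))).sum(axis=1).mean())
```

The loss is zero when the student's similarity matrix matches the teacher's, and strictly positive otherwise.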

PhenoBench Benchmark

We introduce an expert-verified benchmark with:
  • 7,800+ carefully curated image-caption pairs
  • 1,000+ phenotypes for comprehensive evaluation
  • Tasks: zero-shot classification and cross-modal retrieval
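Zero-shot classification on PhenoBench follows the usual CLIP recipe: embed each candidate phenotype name as text and predict the one most similar to the image embedding. A minimal numpy sketch:

```python
import numpy as np

def zero_shot_classify(image_emb, phenotype_text_embs):
    """CLIP-style zero-shot prediction: return the index of the phenotype
    whose text embedding has the highest cosine similarity with the image."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = phenotype_text_embs / np.linalg.norm(
        phenotype_text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))
```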

Results

Zero-Shot Phenotype Classification

| Method | Vision encoder | Text encoder | Dermatology | Pathology | Radiology | Hematology | Histology | Ophthalmology | PhenoBench | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| General VLMs | | | | | | | | | | |
| OpenCLIP | ViT-B | GPT2 | 14.77 | 33.33 | 29.75 | 23.00 | 4.52 | 19.43 | 2.26 | 18.15 |
| SigLIP2 | So400m | SigLIP64 | 11.31 | 20.00 | 4.95 | 8.30 | 9.00 | 23.00 | 0.25 | 10.97 |
| CoCa | ViT-B | GPT2 | 12.63 | 33.00 | 25.87 | 6.55 | 3.39 | 20.55 | 1.38 | 14.77 |
| Biomedical VLMs | | | | | | | | | | |
| PMC-CLIP | ResNet50 | PubmedBert | 41.35 | 45.12 | 29.10 | 21.50 | 5.68 | 43.75 | 7.50 | 27.71 |
| BiomedCLIP | ViT-B | PubmedBert | 47.59 | 42.87 | 28.47 | 20.40 | 4.23 | 40.18 | 8.15 | 27.41 |
| BIOMEDICA | ViT-L | GPT2 | 56.76 | 57.40 | 37.72 | 9.35 | 5.40 | 26.15 | 6.69 | 28.50 |
| PhenoLIP | ViT-B | PubmedBert | 55.08 | 58.33 | 49.03 | 23.24 | 7.87 | 51.63 | 10.76 | 36.56 |

Key Finding: PhenoLIP achieves an average accuracy of 36.56%, demonstrating strong generalization across diverse phenotype recognition tasks.

Cross-Modal Retrieval on PhenoBench

| Method | Vision encoder | Text encoder | I2T R@10 | I2T R@50 | T2I R@10 | T2I R@50 | I2P R@10 | I2P R@50 | P2I R@10 | P2I R@50 |
|---|---|---|---|---|---|---|---|---|---|---|
| General VLMs | | | | | | | | | | |
| OpenCLIP | ViT-B | GPT2 | 10.72 | 24.25 | 10.08 | 22.95 | 2.88 | 8.34 | 2.79 | 8.24 |
| SigLIP2 | So400m | SigLIP64 | 14.02 | 28.16 | 10.33 | 23.44 | 0.31 | 0.76 | 0.08 | 0.62 |
| CoCa | ViT-B | GPT2 | 9.27 | 22.16 | 7.57 | 18.93 | 2.02 | 6.42 | 1.82 | 5.59 |
| Biomedical VLMs | | | | | | | | | | |
| PMC-CLIP | ViT-L | PubmedBert | 40.00 | 64.82 | 36.68 | 61.83 | 7.22 | 23.44 | 6.64 | 18.92 |
| BiomedCLIP | ViT-B | PubmedBert | 32.91 | 56.63 | 32.43 | 56.08 | 3.71 | 13.38 | 3.77 | 12.17 |
| BIOMEDICA | ViT-L | GPT2 | 40.51 | 66.28 | 40.03 | 67.38 | 8.12 | 25.27 | 6.82 | 19.60 |
| PhenoLIP | ViT-B | PubmedBert | 63.30 | 81.92 | 66.61 | 87.68 | 13.84 | 36.88 | 12.77 | 31.30 |

Caption: Cross-modal retrieval results on PhenoBench (7,819 images, 1,187 phenotypes). I2T represents image-to-text retrieval and T2I represents text-to-image retrieval. I2P represents image-to-phenotype retrieval and P2I represents phenotype-to-image retrieval.
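Recall@K, as reported above, counts a query as a hit if its ground-truth match appears among the top-K retrieved items. A small numpy implementation (assuming query i matches gallery item i):

```python
import numpy as np

def recall_at_k(sim, k):
    """sim[i, j]: similarity of query i to gallery item j; the ground-truth
    match for query i is gallery item i. Returns Recall@K in percent."""
    topk = np.argsort(-sim, axis=1)[:, :k]   # top-k gallery indices per query
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return 100.0 * hits.mean()
```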

Rare Facial Phenotype Identification on Face2Gene

| Method | R@5 | R@10 | R@50 | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| General VLMs | | | | | | |
| OpenCLIP | 6.97 | 11.94 | 39.80 | 1.63 | 2.56 | 1.81 |
| SigLIP2 | 6.85 | 11.80 | 39.50 | 1.50 | 2.10 | 1.75 |
| CoCa | 4.48 | 7.46 | 38.31 | 1.04 | 2.21 | 1.25 |
| Biomedical VLMs | | | | | | |
| PMC-CLIP | 6.75 | 11.60 | 49.50 | 1.65 | 3.20 | 2.17 |
| BiomedCLIP | 6.84 | 11.73 | 49.84 | 1.80 | 3.77 | 2.22 |
| BIOMEDICA | 3.48 | 7.46 | 43.28 | 1.10 | 2.37 | 1.36 |
| PhenoLIP | 7.49 | 12.05 | 55.05 | 2.08 | 4.56 | 2.62 |

Retrieval and matching metrics are reported in percent.

Key Finding: On the challenging Face2Gene dataset with 321 images and 993 candidate phenotypes, PhenoLIP achieves the highest F1-score of 2.62%, demonstrating enhanced ability to capture subtle visual signatures of rare diseases.
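The matching metrics above compare a predicted phenotype set against the ground-truth set per image. A minimal sketch of per-image precision/recall/F1 (micro-averaging across images is one common aggregation; the paper's exact protocol may differ):

```python
def prf1(predicted, actual):
    """Precision/recall/F1 between a predicted and a ground-truth phenotype
    set for one image. Both arguments are Python sets of phenotype IDs."""
    tp = len(predicted & actual)                      # true positives
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(actual) if actual else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```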

Linear Probing Performance

Linear-probing accuracy (%) at different labeled-data ratios (1%, 10%, and 100% of training labels):

Each cell reports accuracy at 1% / 10% / 100% of labeled data.

| Method | RSNA | BreastMNIST | ChestMNIST | DermaMNIST | OCTMNIST | RetinaMNIST | HAM10000 |
|---|---|---|---|---|---|---|---|
| General VLMs | | | | | | | |
| OpenCLIP | 77.07 / 79.52 / 81.61 | 32.05 / 73.72 / 83.33 | 51.31 / 52.88 / 53.69 | 69.53 / 77.51 / 83.74 | 67.40 / 73.20 / 73.10 | 43.50 / 55.00 / 61.75 | 52.81 / 68.65 / 77.23 |
| SigLIP2 | 54.55 / 68.90 / 77.50 | 32.50 / 73.50 / 83.00 | 51.05 / 52.50 / 53.80 | 69.00 / 77.10 / 83.50 | 67.00 / 72.90 / 72.50 | 43.80 / 54.50 / 61.00 | 53.24 / 66.67 / 72.57 |
| CoCa | 77.29 / 79.59 / 81.36 | 36.54 / 78.21 / 82.05 | 51.62 / 52.85 / 53.48 | 68.73 / 75.26 / 81.40 | 43.75 / 55.75 / 59.50 | 52.24 / 56.92 / 58.37 | 55.45 / 66.67 / 72.94 |
| Biomedical VLMs | | | | | | | |
| PMC-CLIP | 82.49 / 83.74 / 83.76 | 26.92 / 75.00 / 82.05 | 49.50 / 53.70 / 55.19 | 69.23 / 75.26 / 80.55 | 76.20 / 69.90 / 71.60 | 45.00 / 56.50 / 60.00 | 54.13 / 64.36 / 75.25 |
| BiomedCLIP | 81.99 / 83.66 / 83.99 | 51.92 / 79.49 / 84.62 | 51.31 / 54.46 / 55.03 | 69.28 / 74.86 / 79.25 | 68.20 / 73.80 / 74.10 | 43.50 / 55.75 / 59.00 | 59.74 / 65.02 / 72.61 |
| BIOMEDICA | 80.09 / 82.94 / 83.44 | 37.82 / 77.56 / 85.90 | 51.99 / 53.57 / 54.45 | 70.87 / 79.10 / 85.14 | 61.40 / 76.40 / 76.70 | 43.50 / 55.50 / 63.50 | 57.76 / 69.31 / 73.60 |
| PhenoLIP | 83.06 / 83.36 / 84.11 | 49.36 / 74.36 / 86.54 | 52.65 / 54.71 / 55.15 | 69.93 / 75.86 / 83.74 | 73.90 / 78.30 / 77.10 | 43.50 / 57.75 / 61.75 | 62.71 / 69.31 / 77.89 |

Caption: Benchmarking results for linear evaluation (accuracy, %) at 1%, 10%, and 100% of the labeled training data.
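Linear probing freezes the pretrained vision encoder and fits only a linear classifier on its features. A minimal numpy sketch using a closed-form ridge-regression probe on one-hot targets (a simple stand-in for the usual logistic-regression probe):

```python
import numpy as np

def linear_probe_fit(feats, labels, n_classes, lam=1e-3):
    """Fit a ridge-regression linear probe on frozen features:
    W = (X^T X + lam*I)^{-1} X^T Y with one-hot targets Y."""
    X = np.hstack([feats, np.ones((len(feats), 1))])   # append bias column
    Y = np.eye(n_classes)[labels]                      # one-hot targets
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def linear_probe_predict(W, feats):
    """Predict class indices with the fitted probe."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    return (X @ W).argmax(axis=1)
```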

Ablation Studies

Impact of different components on model performance:

| Vision encoder | Text encoder | Knowledge Distillation | Data Curation | I2T R@10 | I2T R@50 | T2I R@10 | T2I R@50 | Classification Acc |
|---|---|---|---|---|---|---|---|---|
| Scratch | KB | – | – | 28.15 | 45.33 | 31.04 | 49.81 | 2.13 |
| CLIP | KB | – | – | 47.20 | 68.91 | 51.72 | 73.05 | 6.44 |
| BiomedCLIP | KB | – | – | 50.11 | 71.22 | 53.68 | 75.90 | 7.02 |
| CLIP | PMB | – | – | 49.53 | 70.88 | 54.88 | 78.14 | 6.95 |
| BiomedCLIP | PMB | – | – | 53.42 | 74.10 | 58.03 | 81.25 | 7.81 |
| CLIP | PMB | ✓ | – | 58.91 | 78.54 | 62.19 | 84.33 | 9.57 |
| CLIP | PMB | ✓ | ✓ | 60.13 | 79.62 | 63.55 | 85.18 | 9.98 |
| BiomedCLIP | PMB | ✓ | ✓ | 63.30 | 81.92 | 66.61 | 87.68 | 10.76 |

Key Findings: Domain-specific pre-trained encoders provide strong baselines. Knowledge distillation significantly improves performance by injecting structured ontological knowledge. Data curation (subfigure detection and LLM-based caption refinement) consistently enhances model performance. The full model combining all components achieves the best results. (KB: Knowledge-enhanced BERT; PMB: PubmedBERT)


Paper and Citation

C. Liang, C. Wu, W. Zhao, Y. Zhang, Y. Wang, W. Xie
PhenoLIP: Integrating Phenotype Ontology Knowledge into Medical Vision–Language Pretraining
In Submission, 2026.

[Paper]   [Code]   [BibTeX]

Acknowledgements

This webpage template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project. The code can be found here.