PhenoLIP: Integrating Phenotype Ontology Knowledge
into Medical Vision–Language Pretraining
Cheng Liang1,2
Chaoyi Wu1
Weike Zhao1,2
Ya Zhang1,2
Yanfeng Wang1,2
Weidi Xie1,2
1School of Artificial Intelligence, Shanghai Jiao Tong University
2Shanghai Artificial Intelligence Laboratory
[Paper]
[GitHub]
[Dataset]
[Demo]
Overview of PhenoKG and PhenoLIP: (a) PhenoKG organizes diverse anatomical systems hierarchically; (b) Unified multimodal integration aligning phenotype images, descriptions, and ontology knowledge; (c) PhenoLIP framework with knowledge-enhanced vision–language pretraining via distillation.

Abstract

Recent progress in large-scale CLIP-like vision-language models (VLMs) has greatly advanced medical image analysis. However, most existing medical VLMs still rely on coarse image-text contrastive objectives and fail to capture the systematic visual knowledge encoded in well-defined medical phenotype ontologies. To address this gap, we construct PhenoKG, the first large-scale, phenotype-centric multimodal knowledge graph that encompasses over 520K high-quality image-text pairs linked to more than 3,000 phenotypes. Building upon PhenoKG, we propose PhenoLIP, a novel pretraining framework that explicitly incorporates structured phenotype knowledge into medical VLMs through a two-stage process. We first learn a knowledge-enhanced phenotype embedding space from textual ontology data and then distill this structured knowledge into multimodal pretraining via a teacher-guided knowledge distillation objective. To support evaluation, we further introduce PhenoBench, an expert-verified benchmark designed for phenotype recognition, comprising over 7,800 image-caption pairs covering more than 1,000 phenotypes. Extensive experiments demonstrate that PhenoLIP outperforms previous state-of-the-art baselines, improving upon BiomedCLIP in phenotype classification accuracy by 8.85% and BIOMEDICA in cross-modal retrieval by 15.03%, underscoring the value of integrating phenotype-centric priors into medical VLMs for structured and interpretable medical image understanding.
524K+ Image-Text Pairs · 3,000+ Phenotypes · 7,800+ Benchmark Pairs · 36.56% Zero-Shot Accuracy


Method

PhenoLIP pipeline: Two-stage training process incorporating structured phenotype knowledge into vision-language pretraining through knowledge distillation.

PhenoKG Construction

We construct PhenoKG, a large-scale phenotype-centric multimodal knowledge graph, by integrating:
  • Image-text pairs: Over 524K high-quality medical images with detailed phenotype descriptions
  • Phenotype ontology: Structured knowledge from Human Phenotype Ontology (HPO) covering 3,000+ phenotypes
  • Hierarchical organization: Multi-level anatomical system categorization for systematic knowledge representation
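For intuition, the graph can be pictured as HPO nodes carrying their is-a edges plus attached image-caption pairs. A minimal Python sketch (the field names and traversal are illustrative, not the actual PhenoKG schema):

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a PhenoKG entry; field names are illustrative,
# not the actual dataset schema.
@dataclass
class PhenotypeNode:
    hpo_id: str                # HPO identifier, e.g. "HP:0000234"
    name: str                  # phenotype label
    parents: list = field(default_factory=list)           # is-a edges
    image_text_pairs: list = field(default_factory=list)  # (image_path, caption)

# Toy two-node graph: attach an image-caption pair and walk the hierarchy.
graph = {
    "HP:0000152": PhenotypeNode("HP:0000152", "Abnormality of head or neck"),
    "HP:0000234": PhenotypeNode("HP:0000234", "Abnormality of the head",
                                parents=["HP:0000152"]),
}
graph["HP:0000234"].image_text_pairs.append(("img_001.png", "enlarged cranium"))

def ancestors(graph, hpo_id):
    """Collect all ancestor IDs by following is-a edges upward."""
    out, stack = set(), list(graph[hpo_id].parents)
    while stack:
        pid = stack.pop()
        if pid not in out:
            out.add(pid)
            stack.extend(graph[pid].parents)
    return out
```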

PhenoLIP Framework

Our two-stage pretraining approach:
  1. Stage 1 - Knowledge Embedding: Learn a structured phenotype embedding space from textual ontology data using language models
  2. Stage 2 - Knowledge Distillation: Distill the learned structured knowledge into vision-language pretraining via teacher-guided objectives
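The Stage-2 objective can be pictured as matching the student's image-text similarity distribution to the teacher's phenotype-embedding similarities. A minimal numpy sketch, assuming a row-wise KL(teacher ‖ student) over in-batch similarities (the exact loss form in the paper may differ):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def distill_loss(student_img, student_txt, teacher_emb, tau=0.07):
    """Row-wise KL(teacher || student): the teacher's phenotype-space
    similarity distribution supervises the student's image-text logits.
    All inputs are (batch, dim) and assumed L2-normalized."""
    s = (student_img @ student_txt.T) / tau   # student image-text logits
    t = (teacher_emb @ teacher_emb.T) / tau   # teacher phenotype-space logits
    p_t = softmax(t)
    return float((p_t * (log_softmax(t) - log_softmax(s))).sum(axis=1).mean())
```

The loss is zero when the student's similarity matrix matches the teacher's, and strictly positive otherwise.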

PhenoBench Benchmark

We introduce an expert-verified benchmark with:
  • 7,800+ carefully curated image-caption pairs
  • 1,000+ phenotypes for comprehensive evaluation
  • Tasks: zero-shot classification and cross-modal retrieval
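Zero-shot classification on PhenoBench follows the usual CLIP recipe: embed each candidate phenotype name as text and predict the one most similar to the image embedding. A minimal numpy sketch:

```python
import numpy as np

def zero_shot_classify(image_emb, phenotype_text_embs):
    """CLIP-style zero-shot prediction: return the index of the phenotype
    whose text embedding has the highest cosine similarity with the image."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = phenotype_text_embs / np.linalg.norm(
        phenotype_text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))
```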

Results

Zero-Shot Phenotype Classification

| Method | Vision encoder | Text encoder | Dermatology | Pathology | Radiology | Hematology | Histology | Ophthalmology | PhenoBench | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| General VLMs | | | | | | | | | | |
| OpenCLIP | ViT-B | GPT2 | 14.77 | 33.33 | 29.75 | 23.00 | 4.52 | 19.43 | 2.26 | 18.15 |
| SigLIP2 | So400m | SigLIP64 | 11.31 | 20.00 | 4.95 | 8.30 | 9.00 | 23.00 | 0.25 | 10.97 |
| CoCa | ViT-B | GPT2 | 12.63 | 33.00 | 25.87 | 6.55 | 3.39 | 20.55 | 1.38 | 14.77 |
| Biomedical VLMs | | | | | | | | | | |
| PMC-CLIP | ResNet50 | PubmedBert | 41.35 | 45.12 | 29.10 | 21.50 | 5.68 | 43.75 | 7.50 | 27.71 |
| BiomedCLIP | ViT-B | PubmedBert | 47.59 | 42.87 | 28.47 | 20.40 | 4.23 | 40.18 | 8.15 | 27.41 |
| BIOMEDICA | ViT-L | GPT2 | 56.76 | 57.40 | 37.72 | 9.35 | 5.40 | 26.15 | 6.69 | 28.50 |
| PhenoLIP | ViT-B | PubmedBert | 55.08 | 58.33 | 49.03 | 23.24 | 7.87 | 51.63 | 10.76 | 36.56 |

Key Finding: PhenoLIP achieves an average accuracy of 36.56%, demonstrating strong generalization across diverse phenotype recognition tasks.

Cross-Modal Retrieval on PhenoBench

| Method | Vision encoder | Text encoder | I2T R@10 | I2T R@50 | T2I R@10 | T2I R@50 | I2P R@10 | I2P R@50 | P2I R@10 | P2I R@50 |
|---|---|---|---|---|---|---|---|---|---|---|
| General VLMs | | | | | | | | | | |
| OpenCLIP | ViT-B | GPT2 | 10.72 | 24.25 | 10.08 | 22.95 | 2.88 | 8.34 | 2.79 | 8.24 |
| SigLIP2 | So400m | SigLIP64 | 14.02 | 28.16 | 10.33 | 23.44 | 0.31 | 0.76 | 0.08 | 0.62 |
| CoCa | ViT-B | GPT2 | 9.27 | 22.16 | 7.57 | 18.93 | 2.02 | 6.42 | 1.82 | 5.59 |
| Biomedical VLMs | | | | | | | | | | |
| PMC-CLIP | ViT-L | PubmedBert | 40.00 | 64.82 | 36.68 | 61.83 | 7.22 | 23.44 | 6.64 | 18.92 |
| BiomedCLIP | ViT-B | PubmedBert | 32.91 | 56.63 | 32.43 | 56.08 | 3.71 | 13.38 | 3.77 | 12.17 |
| BIOMEDICA | ViT-L | GPT2 | 40.51 | 66.28 | 40.03 | 67.38 | 8.12 | 25.27 | 6.82 | 19.60 |
| PhenoLIP | ViT-B | PubmedBert | 63.30 | 81.92 | 66.61 | 87.68 | 13.84 | 36.88 | 12.77 | 31.30 |

Caption: Cross-modal retrieval results on PhenoBench (7,819 images, 1,187 phenotypes). I2T represents image-to-text retrieval and T2I represents text-to-image retrieval. I2P represents image-to-phenotype retrieval and P2I represents phenotype-to-image retrieval.
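Recall@K, as reported above, counts a query as a hit if its ground-truth match appears among the top-K retrieved items. A small numpy implementation (assuming query i matches gallery item i):

```python
import numpy as np

def recall_at_k(sim, k):
    """sim[i, j]: similarity of query i to gallery item j; the ground-truth
    match for query i is gallery item i. Returns Recall@K in percent."""
    topk = np.argsort(-sim, axis=1)[:, :k]   # top-k gallery indices per query
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return 100.0 * hits.mean()
```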

Rare Facial Phenotype Identification on Face2Gene

| Method | R@5 | R@10 | R@50 | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| General VLMs | | | | | | |
| OpenCLIP | 6.97 | 11.94 | 39.80 | 1.63 | 2.56 | 1.81 |
| SigLIP2 | 6.85 | 11.80 | 39.50 | 1.50 | 2.10 | 1.75 |
| CoCa | 4.48 | 7.46 | 38.31 | 1.04 | 2.21 | 1.25 |
| Biomedical VLMs | | | | | | |
| PMC-CLIP | 6.75 | 11.60 | 49.50 | 1.65 | 3.20 | 2.17 |
| BiomedCLIP | 6.84 | 11.73 | 49.84 | 1.80 | 3.77 | 2.22 |
| BIOMEDICA | 3.48 | 7.46 | 43.28 | 1.10 | 2.37 | 1.36 |
| PhenoLIP | 7.49 | 12.05 | 55.05 | 2.08 | 4.56 | 2.62 |

Retrieval and matching metrics are reported in percent.

Key Finding: On the challenging Face2Gene dataset with 321 images and 993 candidate phenotypes, PhenoLIP achieves the highest F1-score of 2.62%, demonstrating enhanced ability to capture subtle visual signatures of rare diseases.
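The matching metrics above compare a predicted phenotype set against the ground-truth set per image. A minimal sketch of per-image precision/recall/F1 (micro-averaging across images is one common aggregation; the paper's exact protocol may differ):

```python
def prf1(predicted, actual):
    """Precision/recall/F1 between a predicted and a ground-truth phenotype
    set for one image. Both arguments are Python sets of phenotype IDs."""
    tp = len(predicted & actual)                      # true positives
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(actual) if actual else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```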

Linear Probing Performance

Linear-probing accuracy (%) at different labeled-data ratios (1%, 10%, and 100% of training labels):

Each cell reports accuracy at 1% / 10% / 100% of labeled data.

| Method | RSNA | BreastMNIST | ChestMNIST | DermaMNIST | OCTMNIST | RetinaMNIST | HAM10000 |
|---|---|---|---|---|---|---|---|
| General VLMs | | | | | | | |
| OpenCLIP | 77.07 / 79.52 / 81.61 | 32.05 / 73.72 / 83.33 | 51.31 / 52.88 / 53.69 | 69.53 / 77.51 / 83.74 | 67.40 / 73.20 / 73.10 | 43.50 / 55.00 / 61.75 | 52.81 / 68.65 / 77.23 |
| SigLIP2 | 54.55 / 68.90 / 77.50 | 32.50 / 73.50 / 83.00 | 51.05 / 52.50 / 53.80 | 69.00 / 77.10 / 83.50 | 67.00 / 72.90 / 72.50 | 43.80 / 54.50 / 61.00 | 53.24 / 66.67 / 72.57 |
| CoCa | 77.29 / 79.59 / 81.36 | 36.54 / 78.21 / 82.05 | 51.62 / 52.85 / 53.48 | 68.73 / 75.26 / 81.40 | 43.75 / 55.75 / 59.50 | 52.24 / 56.92 / 58.37 | 55.45 / 66.67 / 72.94 |
| Biomedical VLMs | | | | | | | |
| PMC-CLIP | 82.49 / 83.74 / 83.76 | 26.92 / 75.00 / 82.05 | 49.50 / 53.70 / 55.19 | 69.23 / 75.26 / 80.55 | 76.20 / 69.90 / 71.60 | 45.00 / 56.50 / 60.00 | 54.13 / 64.36 / 75.25 |
| BiomedCLIP | 81.99 / 83.66 / 83.99 | 51.92 / 79.49 / 84.62 | 51.31 / 54.46 / 55.03 | 69.28 / 74.86 / 79.25 | 68.20 / 73.80 / 74.10 | 43.50 / 55.75 / 59.00 | 59.74 / 65.02 / 72.61 |
| BIOMEDICA | 80.09 / 82.94 / 83.44 | 37.82 / 77.56 / 85.90 | 51.99 / 53.57 / 54.45 | 70.87 / 79.10 / 85.14 | 61.40 / 76.40 / 76.70 | 43.50 / 55.50 / 63.50 | 57.76 / 69.31 / 73.60 |
| PhenoLIP | 83.06 / 83.36 / 84.11 | 49.36 / 74.36 / 86.54 | 52.65 / 54.71 / 55.15 | 69.93 / 75.86 / 83.74 | 73.90 / 78.30 / 77.10 | 43.50 / 57.75 / 61.75 | 62.71 / 69.31 / 77.89 |

Caption: Benchmarking results for linear evaluation (accuracy, %) at 1%, 10%, and 100% of the labeled training data.
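Linear probing freezes the pretrained vision encoder and fits only a linear classifier on its features. A minimal numpy sketch using a closed-form ridge-regression probe on one-hot targets (a simple stand-in for the usual logistic-regression probe):

```python
import numpy as np

def linear_probe_fit(feats, labels, n_classes, lam=1e-3):
    """Fit a ridge-regression linear probe on frozen features:
    W = (X^T X + lam*I)^{-1} X^T Y with one-hot targets Y."""
    X = np.hstack([feats, np.ones((len(feats), 1))])   # append bias column
    Y = np.eye(n_classes)[labels]                      # one-hot targets
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def linear_probe_predict(W, feats):
    """Predict class indices with the fitted probe."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    return (X @ W).argmax(axis=1)
```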

Ablation Studies

Impact of different components on model performance:

| Vision encoder | Text encoder | Knowledge Distillation | Data Curation | I2T R@10 | I2T R@50 | T2I R@10 | T2I R@50 | Classification Acc |
|---|---|---|---|---|---|---|---|---|
| Scratch | KB | – | – | 28.15 | 45.33 | 31.04 | 49.81 | 2.13 |
| CLIP | KB | – | – | 47.20 | 68.91 | 51.72 | 73.05 | 6.44 |
| BiomedCLIP | KB | – | – | 50.11 | 71.22 | 53.68 | 75.90 | 7.02 |
| CLIP | PMB | – | – | 49.53 | 70.88 | 54.88 | 78.14 | 6.95 |
| BiomedCLIP | PMB | – | – | 53.42 | 74.10 | 58.03 | 81.25 | 7.81 |
| CLIP | PMB | ✓ | – | 58.91 | 78.54 | 62.19 | 84.33 | 9.57 |
| CLIP | PMB | ✓ | ✓ | 60.13 | 79.62 | 63.55 | 85.18 | 9.98 |
| BiomedCLIP | PMB | ✓ | ✓ | 63.30 | 81.92 | 66.61 | 87.68 | 10.76 |

Key Findings: Domain-specific pre-trained encoders provide strong baselines. Knowledge distillation significantly improves performance by injecting structured ontological knowledge. Data curation (subfigure detection and LLM-based caption refinement) consistently enhances model performance. The full model combining all components achieves the best results. (KB: Knowledge-enhanced BERT; PMB: PubmedBERT)


Paper and Citation

C. Liang, C. Wu, W. Zhao, Y. Zhang, Y. Wang, W. Xie
PhenoLIP: Integrating Phenotype Ontology Knowledge into Medical Vision–Language Pretraining
In Submission, 2026.

[Paper]   [Code]   [BibTeX]

Acknowledgements

This webpage template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project. The code can be found here.