Microsoft recently introduced Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive multimodal input, follow instructions, and perform in-context learning for multimodal tasks.
Microsoft trained Kosmos-1 on web-scale multimodal corpora that included interleaved text and images, image-caption pairs, and text data. According to Microsoft’s evaluation, Kosmos-1 achieved impressive performance on language understanding and generation, multimodal dialogue, image captioning, visual question answering, and vision tasks.
Microsoft evaluated the Kosmos-1 model on the following tasks:
- Language tasks
  - Language understanding
  - Language generation
  - OCR-free text classification
- Cross-modal transfer
  - Commonsense reasoning
- Nonverbal reasoning
  - IQ test (Raven’s Progressive Matrices)
- Perception-language tasks
  - Image captioning
  - Visual question answering
  - Web page question answering
- Vision tasks
  - Zero-shot image classification
  - Zero-shot image classification with descriptions
You can find the detailed evaluation results at the source link below. Microsoft also revealed that it plans to scale up Kosmos-1 in model size and to integrate speech capability into the model.