Modalities
Text, multimodal, image, video, audio, embedding and rerank — what each modality filter means.
ToRouter tags every catalog model with a modality. The filter on the /models page narrows the list to one kind of capability at a time.
The eight modalities
Modality vs endpoint
Modality describes what a model does. The endpoint you call describes which protocol you use:
- Text + Multimodal → POST /v1/chat/completions, POST /v1/responses, POST /v1/messages, POST /v1beta/models/<id>:generateContent
- Image → POST /v1/images/generations, POST /v1/images/edits
- Embedding → POST /v1/embeddings
- Audio → POST /v1/audio/transcriptions, POST /v1/audio/speech
- Rerank → vendor-specific, usually POST /v1/rerank
The filter is a UI convenience; what a model can actually do is defined by the underlying upstream model. Always confirm on the model detail page that the endpoint you need is supported.
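The modality-to-endpoint mapping can be kept as a small lookup table in client code. The sketch below is illustrative only: the lowercase keys, the `ENDPOINTS_BY_MODALITY` name, and the `endpoints_for` helper are assumptions made for this example, not part of ToRouter's API, and the paths are copied from the list in this section. Video has no entry because this section does not list an endpoint for it.

```python
# Hypothetical lookup table built from the endpoint list in this section.
# Keys are lowercase modality names chosen for this sketch.
ENDPOINTS_BY_MODALITY = {
    "text": [
        "/v1/chat/completions",
        "/v1/responses",
        "/v1/messages",
        "/v1beta/models/<id>:generateContent",
    ],
    "multimodal": [
        "/v1/chat/completions",
        "/v1/responses",
        "/v1/messages",
        "/v1beta/models/<id>:generateContent",
    ],
    "image": ["/v1/images/generations", "/v1/images/edits"],
    "embedding": ["/v1/embeddings"],
    "audio": ["/v1/audio/transcriptions", "/v1/audio/speech"],
    # Rerank is vendor-specific; this is only the usual path.
    "rerank": ["/v1/rerank"],
}


def endpoints_for(modality: str) -> list[str]:
    """Return the candidate POST paths for a modality (case-insensitive)."""
    key = modality.strip().lower()
    if key not in ENDPOINTS_BY_MODALITY:
        raise KeyError(f"no endpoint listed for modality: {modality!r}")
    return ENDPOINTS_BY_MODALITY[key]
```

A lookup like this only narrows the candidates. Whether a specific model supports a given path still has to be checked on its model detail page.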
Multimodal in practice
Multimodal models accept image content blocks alongside text in standard OpenAI / Anthropic / Gemini formats:
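from openai import OpenAI

# Assumption: an OpenAI-compatible client pointed at your ToRouter base URL.
# The host and key below are placeholders, not documented values.
client = OpenAI(base_url="https://<your-torouter-host>/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="gpt-5",
    messages=[{
        "role": "user",
        "content": [
            # A text block and an image block travel in the same user message.
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)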
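The image block is shaped differently in each vendor format. As a rough sketch, the helpers below build the block for a local image in the OpenAI Chat Completions style (a base64 data URL) and in the Anthropic Messages style (a base64 `source` object). The function names are hypothetical, and whether a given model accepts inline base64 images is an assumption to verify on its model detail page.

```python
import base64


def openai_image_block(data: bytes, media_type: str = "image/jpeg") -> dict:
    """OpenAI Chat Completions style: the image is passed as a data URL."""
    b64 = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{media_type};base64,{b64}"},
    }


def anthropic_image_block(data: bytes, media_type: str = "image/jpeg") -> dict:
    """Anthropic Messages style: the image is passed as a base64 source."""
    b64 = base64.b64encode(data).decode("ascii")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": b64},
    }
```

Either block goes into the `content` list of a user message next to a text block: the first for POST /v1/chat/completions, the second for POST /v1/messages.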