Multimodal

Multimodal AI can process multiple data types — text, image, audio, video — in the same model. Claude 4.x, GPT-5, and Gemini are all multimodal: you can drop in an image plus a question and the model reads both before answering. Real application: we send candlestick chart screenshots plus strategy notes to Claude and it identifies patterns directly from the image — Judy AI Lab AI Glossary

core beginner

What is Multimodal?

Multimodal AI can process multiple data types — text, image, audio, video — within the same model. Early LLMs handled text only; today’s Claude 4.x, GPT-5, and Gemini are all multimodal. You can drop in an image plus a question, and the model reads both before answering. Video generators like Sora flip the direction, going from text to video.

Real application: we send candlestick chart screenshots plus strategy notes to Claude, and it identifies head-and-shoulders, gaps, and volume-price divergences directly from the image before factoring them into the trading plan. This was impossible with text-only LLMs — you’d first need a separate CV model to convert chart to numbers before feeding it in.

What is Multimodal?#

Related Terms

Get our weekly AI digest:

What is Multimodal?