What is Multimodal?

Multimodal AI can process multiple data types — text, image, audio, video — within the same model. Early LLMs handled text only; today’s Claude 4.x, GPT-5, and Gemini are all multimodal. You can drop in an image plus a question, and the model reads both before answering. Video generators like Sora flip the direction, going from text to video.

Real application: we send candlestick chart screenshots plus strategy notes to Claude, and it identifies head-and-shoulders, gaps, and volume-price divergences directly from the image before factoring them into the trading plan. This was impossible with text-only LLMs — you’d first need a separate CV model to convert chart to numbers before feeding it in.