While the world is still buzzing about ChatGPT and text-based artificial intelligence (AI), recent tech conferences have unveiled the next paradigm shift: multimodal AI.
Simply put, it is AI that can process images, video, and/or audio in addition to text. You may have seen demos from Google or OpenAI (https://www.youtube.com/watch?v=HU_4vMu9xFI) in which an AI “looks” around a room, describing and reasoning about what it sees.
While full-fledged multimodal AI features are still emerging, we can experiment with their basic forms now. Learning leaders should start developing a “multimodal mindset” today to prepare for this significant shift in AI capabilities.
HYPER-PERSONALIZATION POTENTIAL
A key question to ask is: What visual input could I give AI to help it answer my questions or requests more effectively?
Once you have some ideas, test them using the ChatGPT app. Screenshots and photos are quick to capture and upload, making the app ideal for experimentation. This approach applies to both work and personal life, since every experience helps build the mindset.
For instance, while playing Pokémon Go with my kids, I found myself lost among new features and puzzled by grayed-out options. Instead of Googling or scouring forums, I turned to multimodal AI and discovered it was a game-changer.
Using the ChatGPT app with the GPT-4o model, I uploaded screenshots of my gameplay issues. The responses were incredible. The AI analyzed my inventory, nearby Pokémon, and location, then offered tailored advice on strategies for gameplay and battles. It was truly as if a seasoned player were looking over my shoulder coaching me.
This highly contextual guidance, based on visual input, exemplifies the power of multimodal AI—its hyper-personalization potential.
APPLICATIONS FOR LEARNING
The applications for learning, of course, are vast; we are already working on many of them with clients:
- Employees recording presentations for AI feedback on speaking skills, content clarity, and engagement
- Sales trainees uploading pitch videos for AI analysis of body language, tone, and content
- Technicians photographing machinery for instant, visual troubleshooting guides (alluded to in the Google demo)
MULTIMODAL CAPABILITIES
These rich, context-aware learning experiences greatly expand what’s possible with AI-enhanced education. This multimodal approach isn’t just a future concept; it’s something we can start exploring today. While advanced features may not yet be integrated into most official work tools, learning leaders can begin experimenting with several platforms that already offer multimodal capabilities:
- ChatGPT (GPT-4o model): Beyond text conversations, try uploading screenshots of software interfaces, flowcharts, or even handwritten notes. Ask it to explain, troubleshoot, or expand on what it sees. This exercise helps you identify scenarios where visual input can significantly enhance AI’s ability to assist or teach. (For technically inclined readers, a minimal API sketch of this image-plus-text pattern appears after this list.)
- Microsoft’s PowerPoint Copilot: When creating presentations, use Copilot to generate slides based on your content. Then experiment with refining these results by providing additional visual cues or references. This helps you understand how AI interprets and integrates different types of information to create cohesive outputs, including its strengths and limitations.
- Microsoft Designer or Canva: Use these AI-powered design tools to create visuals for your learning materials. Experiment with different text prompts and see how AI generates or suggests images. This process helps you think about the relationship between textual descriptions and visual representations, a key aspect of multimodal AI.
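If you want to go beyond the ChatGPT app, the same image-plus-text pattern can also be scripted. Below is a minimal sketch, assuming OpenAI’s official `openai` Python SDK, an `OPENAI_API_KEY` environment variable, and a placeholder screenshot file; the file path and prompt are illustrative, so adapt them to your own experiment.

```python
# Minimal sketch: send a screenshot plus a question to GPT-4o.
# Assumes the official `openai` Python SDK (v1.x) and an
# OPENAI_API_KEY set in the environment. "screenshot.png" is a
# placeholder for whatever image you want the model to analyze.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the local screenshot as a base64 data URL
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                # The text part carries your question...
                {"type": "text",
                 "text": "What does this screen show, and what should I do next?"},
                # ...and the image part carries the visual context.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The key point is the same one the app demonstrates: the model receives your question and the image together, so its answer can be grounded in exactly what is on screen.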
By exploring how text, images, audio, and video can interact, you’ll naturally develop a multimodal mindset—essential for leveraging AI’s full potential. Those who embrace and combine different modes will lead the next wave of learning innovation.