Agent-building tools should support advanced multimodal state design, letting developers define and manage what persists as pixels, text, or latent features. This architectural shift is crucial for agents that plan, cache, and retrieve information across modalities, moving beyond naive PDF-to-text conversion.
I hit a frustrating wall when I first started building AI agents. But the lessons I learned now define how I build enterprise-grade apps. Let me explain...

Early on, I was extremely comfortable manipulating text. But the moment I tried to add PDFs, images, and audio, my clean architectures collapsed.

I built pipelines that chained together:
• OCR engines for PDFs
• Layout detection for tables and diagrams
• Custom classifiers for images

It looked sophisticated, but it behaved like a brittle machine that broke every time a document layout changed.

The breakthrough came when I realised I was solving the wrong problem. I did not need to convert documents to text...

I just needed to treat them as images, and let multimodal LLMs handle them natively.

Once I understood that:
• Every PDF page is effectively an image
• Modern LLMs can “see” as well as they can read
• Images, audio, and text all become tokens

The entire system simplified. And the accuracy increased.

Here's the gist: if you keep normalising everything to text, you are throwing away the information that matters most.

In the next installment of the AI Agents Foundations series in Decoding AI Magazine, I break down how to build agents that work with this reality instead of fighting it.

Here is what I will walk you through:
• Foundations of multimodal LLMs
• Practical implementation
• Multimodal state management
• Building the agent

If you want this lesson in your inbox the moment it goes live, subscribe to Decoding AI Magazine.

Link: https://lnkd.in/dgKZFc5j
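P.S. For the curious, the "treat the page as an image" idea can be sketched in a few lines. This is a minimal illustration, not the full implementation from the article: it assumes an OpenAI-style chat payload with base64 data URIs, and it leaves the page-rendering step (e.g. with a library like pymupdf) as a comment. The function name `image_message` is my own placeholder.

```python
import base64

def image_message(image_bytes: bytes, question: str, mime: str = "image/png") -> list:
    """Build a multimodal chat message: the raw page image plus a text question."""
    # Upstream, you would render each PDF page to PNG bytes
    # (e.g. page.get_pixmap().tobytes("png") with pymupdf) instead of OCR-ing it.
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            # The model "sees" the page directly; no lossy text conversion.
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }]
```

No OCR engine, no layout detector, no classifier: the page stays an image end to end, and layout changes stop breaking the pipeline.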