Users need a way to reason across multiple documents or scraped site data without losing the connections between chunks. Naive chunking defeats the purpose of comprehensive retrieval.
I'm using Firecrawl to scrape multiple websites and get back full markdown. This markdown is fed to an LLM agent whose job is to reason over all of it and return a structured response.

The problem: the combined markdown from even 3–4 sites (after preprocessing) blows past the context window. I know chunking is a common solution, but it feels like it defeats the purpose. If the answer to my query lives across multiple chunks from multiple sites, won't a naive retrieval step miss the connections between them? (I might be misunderstanding this; please guide me if I'm wrong.)

My question specifically concerns MULTI-DOCUMENT, REAL-TIME SCRAPED DATA, not static knowledge bases and not single-document summarization.

What I'm trying to understand is:

- Are there any patterns or strategies that allow an agent to reason across multiple documents or sites, rather than just retrieving isolated chunks?
- How can hallucinations be minimized when the model only sees partial context?
- How can we ensure that relevant information isn't ignored during retrieval?

PS: I'm relatively new to this area, but I'm very interested in learning about the design patterns and approaches used to handle these kinds of problems in practice.
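For concreteness, here is a minimal sketch of the pipeline I described. The `scrape_markdown` and token-count helpers are hypothetical stubs (the real scrape is a Firecrawl call, and real token counting would use the model's tokenizer); the sketch just shows how concatenating a few scraped pages overflows an assumed context window:

```python
# Sketch of the pipeline described above.
# scrape_markdown() is a stand-in for the actual Firecrawl scrape;
# rough_token_count() is a crude heuristic, not a real tokenizer.

CONTEXT_WINDOW_TOKENS = 128_000  # assumed model context limit


def scrape_markdown(url: str) -> str:
    """Stub: returns the kind of large markdown a full-page scrape produces."""
    return f"# Page at {url}\n\n" + "lorem ipsum " * 50_000


def rough_token_count(text: str) -> int:
    # Common rule of thumb: roughly 1 token per 4 characters of English text.
    return len(text) // 4


urls = [
    "https://site-a.example",
    "https://site-b.example",
    "https://site-c.example",
]

# Concatenate all scraped markdown into one prompt, separated by rules.
combined = "\n\n---\n\n".join(scrape_markdown(u) for u in urls)

# Even three sites blow past the window, which is what forces chunking.
overflows = rough_token_count(combined) > CONTEXT_WINDOW_TOKENS
print(overflows)
```

Running this prints `True`: three pages of this size add up to several hundred thousand estimated tokens, well past the assumed 128k window, which is exactly the point where chunking (and the retrieval concerns above) enters the picture.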