Detailed tracing across RAG chains is critical. Tools should instrument and expose specific span-level metrics like token counts, ANN lookup durations, and LLM latencies to help identify hidden bottlenecks, guide optimizations (e.g., adjusting embedding thresholds, prompt construction), and accurately forecast API costs.
AI Observability is now a must-have in your tool belt as an AI Engineer. Tracing sits at the core of it. Why is it important?

Tracing and instrumentation of software have been around for decades. As AI systems come to resemble regular software more and more, we are now moving the practice here as well (with a few key differences). Let's look at the process of tracing from the perspective of a naive RAG system.

Key definitions:

a) An Orchestrator in a GenAI application is the central piece of software that coordinates the end-to-end process. Think of apps built with LangChain, LlamaIndex or Haystack.
b) A Trace is the end-to-end application flow, from the entry point until the answer is produced. It is composed of smaller pieces called spans.
c) A Span is a smaller piece of the application flow that represents an atomic action like a function call or a database query. Spans can run sequentially or in parallel.

ℹ️ For each span we capture general metadata like start and end time, plus the inputs and outputs of the span. On top of this metadata we track information specific to the GenAI system elements.

What might a trace look like for a naive RAG system?

1. A query is submitted to the chat application.
2. The query is embedded into a vector.
➡️ Additional metadata like the input token count is persisted with the span so that we can estimate the cost of the procedure.
3. An ANN lookup is performed against the Vector DB to retrieve the most relevant context.
➡️ Additional metadata about the query is persisted as part of the span, together with the retrieved pieces of context and their relevance scores.
4. A prompt is constructed from the system prompt and the retrieved context.
5. The prompt is passed to the LLM to construct the answer.
➡️ Additional metadata about input and output token counts is captured with the span so that we can estimate the cost of the procedure.

Why is tracing of GenAI systems important?
- These applications are usually complex chains; errors can happen at different steps of your application. E.g. embedding the query takes longer than expected, or you hit the API limits of your LLM provider.
- The cost of calling LLM APIs is variable, depending on the length of the inputs and the produced outputs. You would usually trace this information and analyze it to help forecast expenses.
- GenAI systems are non-deterministic and will deteriorate over time. They need to be evaluated at the span level rather than on the input/output of the entire system, so that you can tune each piece separately.
- …

Are you tracing your Agents? Let me know in the comments 👇
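The trace walkthrough above can be sketched as a minimal hand-rolled tracer. This is an illustrative toy, not any real tracing library's API; the span names, token counts, and relevance scores are made up to mirror the five RAG steps:

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Span:
    """One atomic step of the flow: name, timing, and step-specific metadata."""
    name: str
    start: float
    end: float = 0.0
    metadata: dict = field(default_factory=dict)


class Tracer:
    """Collects the spans that make up one end-to-end trace."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: list[Span] = []

    def start_span(self, name: str, **metadata) -> Span:
        span = Span(name, start=time.monotonic(), metadata=metadata)
        self.spans.append(span)
        return span

    def end_span(self, span: Span) -> None:
        span.end = time.monotonic()


# Trace the naive RAG flow step by step (all numbers are illustrative).
tracer = Tracer()

s = tracer.start_span("embed_query", input_tokens=9)        # step 2
tracer.end_span(s)

s = tracer.start_span("ann_lookup", top_k=3)                # step 3
s.metadata["relevance_scores"] = [0.91, 0.87, 0.80]
tracer.end_span(s)

s = tracer.start_span("build_prompt")                       # step 4
tracer.end_span(s)

s = tracer.start_span("llm_call", input_tokens=812, output_tokens=154)  # step 5
tracer.end_span(s)

for span in tracer.spans:
    print(f"{span.name}: {(span.end - span.start) * 1000:.3f} ms, {span.metadata}")
```

In a production system you would use a standard such as OpenTelemetry instead, which adds trace/span IDs for distributed propagation, nesting of spans, and exporters to observability backends.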