Current retrieval quality (`probe_recall_at_k` = 0.22) is materially below the published benchmark range (0.5–0.8). The proposal is to add paragraph-level chunks alongside the existing sentence chunks to improve retrieval quality, especially when it is measured over character spans.
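
To make the span-based framing concrete, here is a minimal sketch of how character-span recall could differ between sentence-sized and paragraph-sized chunks. It is not the harness's actual `probe_recall_at_k` implementation, which is not shown here. The names `merge_spans` and `span_recall` are hypothetical, and all offsets are made-up illustrative values.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character range [start, end)


def merge_spans(spans: List[Span]) -> List[Span]:
    """Merge overlapping or touching spans so each character is counted once."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def span_recall(gold: List[Span], retrieved: List[Span]) -> float:
    """Fraction of gold characters covered by at least one retrieved span.

    Hypothetical metric for illustration only; the real harness may differ.
    """
    gold_m = merge_spans(gold)
    retr_m = merge_spans(retrieved)
    total = sum(end - start for start, end in gold_m)
    if total == 0:
        return 0.0
    covered = 0
    # Both lists are disjoint after merging, so pairwise overlaps never double count.
    for g_start, g_end in gold_m:
        for r_start, r_end in retr_m:
            covered += max(0, min(g_end, r_end) - max(g_start, r_start))
    return covered / total


# Illustrative gold passage: 400 characters, i.e. five 80-character sentences.
gold = [(100, 500)]

# Sentence chunks: the top-k contains only two of the five gold sentences
# plus one unrelated sentence elsewhere in the document.
sentence_hits = [(100, 180), (340, 420), (900, 980)]

# Paragraph chunk: a single hit that encloses the whole gold passage.
paragraph_hit = [(80, 520)]

print(span_recall(gold, sentence_hits))  # 0.4
print(span_recall(gold, paragraph_hit))  # 1.0
```

In this toy case the sentence-level retrieval covers 160 of the 400 gold characters, while one correct paragraph hit covers all of them.
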
## Context

The LegalBench-RAG benchmark harness added in #1239 measures retrieval quality over character spans. In session evaluation on the `privacy_qa` subset we observed:

- `probe_recall_at_k` (single-shot top-10 vector search) = **0.22**
- `citation_span_overlaps_gold` (agent's actual iterative retrieval) = **0.60**

The published paper's reported recall@k for comparable embedder + retriever combos on the same subset sits at **0.5–0.8**. Our probe number is materially below that band, which points at chunking granularity as the first suspect.

## Hypothesis

`TxtParser` currently produces **sentence-level** structural annotations as the retrieval units. LegalBench-RAG gold spans are consistently multi-sentence (average ~400 chars, 3–5 sentences). A top-10 retrieval of isolated sentences has to assemble the right 5-of-10 window to cover a gold passage; a top-10 of *paragraphs* only has to find the right paragraph.

## Proposal

Add a paragraph-level structural annotation layer tha