Implement memory optimizations for large crawls to prevent the system from becoming unresponsive due to excessive memory consumption. The current issue appears to stem from crawled data accumulating in memory without proper cleanup, particularly in the `url_to_full_document` dictionary.
# Memory Optimization Issue for Large Crawls

## Problem Description

During large crawls, Archon fills up the host memory and causes the server to become unresponsive. This has required increasing RAM to 32GB on VMs, but the issue persists. The root cause appears to be that crawled data accumulates in memory without proper cleanup.

## Investigation Summary

### Key Memory Issues Identified

#### 1. **Accumulation of Data in Memory**

- The `url_to_full_document` dictionary stores **entire document contents** in memory for all crawled pages
- The batch crawling accumulates all results in a `successful_results` list before processing
- All chunks, metadata, and contents are accumulated in lists before batch processing

#### 2. **No Memory Cleanup**

- No explicit garbage collection or memory clearing between batches
- Results from processed batches are kept in memory throughout the entire crawl
- The `url_to_full_document` dictionary is never cleared during the crawl

#### 3. **Large Batch
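The cleanup issues above could be addressed by scoping the accumulators to a single batch rather than to the whole crawl. The sketch below is a minimal illustration, not Archon's actual crawler: `fetch`, `process_batch`, and `crawl_in_batches` are hypothetical names, and only `url_to_full_document` comes from the issue. The key ideas are clearing the document dictionary after each batch and explicitly invoking the garbage collector so large strings are reclaimed promptly.

```python
import gc


def process_batch(url_to_full_document):
    """Placeholder for per-batch work (chunking, embedding, storage)."""
    return len(url_to_full_document)


def crawl_in_batches(urls, fetch, batch_size=50):
    """Fetch and process URLs in fixed-size batches, releasing memory between batches.

    Instead of accumulating every document for the lifetime of the crawl,
    each batch's results are processed immediately and the per-batch
    containers are cleared before the next batch starts.
    """
    processed = 0
    url_to_full_document = {}  # holds at most one batch of documents
    for start in range(0, len(urls), batch_size):
        for url in urls[start:start + batch_size]:
            url_to_full_document[url] = fetch(url)
        processed += process_batch(url_to_full_document)
        url_to_full_document.clear()  # drop references to document contents
        gc.collect()                  # encourage prompt reclamation of large objects
    return processed
```

With this shape, peak memory is bounded by `batch_size` documents instead of the full crawl, at the cost of re-fetching nothing and one `gc.collect()` pause per batch.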