Challenge accepted: build a production-quality application without writing or reading any code.

I'm currently building a personal side project. My dev expertise is in smart contracts, not traditional apps, so I'm leaning heavily on AI coding agents to write the actual code. As with anything AI-generated, verification and evaluation are critical. But since I'm no expert in this area, me reviewing the code is slow and probably not that helpful anyway. So the challenge is: how do I set up programmatic verification so I can be sufficiently confident that the code is good?

I'm still figuring this out (I suspect it will be a long process), but so far I've converged on a few principles:

1. Enforce deterministic verification wherever possible. For example:
- a pre-push git hook that runs code linters, etc.
- a pre-push Claude Code hook that enforces all tests pass
- via GitHub, enforce that changes go through PRs
- require all tests to pass in CI / GitHub Actions

2. Invoke extremely strict agentic verification. The main principle here is that writing code is now quite cheap and fast, so we can afford to set a really high bar for quality and for taking on tech debt. I have a Claude code review GitHub Action that runs with very strict custom review guidelines. If it catches anything, it blocks the PR. I also try to make use of sub-agents during development. Anthropic's feature-dev plugin is pretty good at this, but my custom PR reviewer still catches a lot of issues, so there's lots of room for improvement.

3. Match the intended user type to UX testing/review. My project is an app for myself, so I test the UX and request changes if it's not good for me. But when the intended user is a computer or an AI, that may no longer be necessary, as long as you can write tests that cover all desired behaviors.

4. I still make the final merge decision. It's my project, not my agent's.

With all this in place, I feel comfortable letting the agent run autonomously on a new feature.
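For the deterministic-verification layer, the pre-push git hook can be sketched roughly like this (save as `.git/hooks/pre-push` and `chmod +x` it). The specific commands here are placeholders, not my exact toolchain — swap in whatever linters and test runner your project actually uses:

```shell
#!/bin/sh
# Sketch of a pre-push git hook: run each check in sequence, and abort the
# push (nonzero exit) the moment any check fails.

run_check() {
  desc="$1"; shift
  echo "pre-push: $desc"
  if ! "$@"; then
    echo "pre-push: '$desc' failed -- push blocked" >&2
    exit 1
  fi
}

# Placeholder checks -- replace `true` with your real tools,
# e.g. run_check "lint" npm run lint / run_check "tests" pytest -q
run_check "lint" true
run_check "tests" true
echo "pre-push: all checks passed"
```

Git only proceeds with the push if the hook exits 0, which is what makes this check deterministic rather than advisory.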
I have a containerized sandbox on my Raspberry Pi where I can run Claude and Codex in yolo/dangerous mode without risking my computing life. This also lets me do long-running sessions overnight or when I'm away from my computer (h/t @banteg for takopi).

There are a lot of improvements to make. A few things I'm grappling with:

- How do I enable the system to automatically get smarter as it works? This is roughly a cross-session, cross-agent, cross-environment memory problem. I'm inspired by the numerous people trying different approaches here.
- How do I do human-facing UI design faster? One trick I've learned is to have Claude create HTML mockup files, test them in my browser, provide feedback, and iterate from there. Not bad, but I think it can be better.
- I could be a lot better at planning, e.g. I want to try out Ralph Wiggum-style flows, but I need to prepare more features in advance to actually take advantage of that.

Tell me what you do!
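For anyone curious about the sandbox piece: a minimal sketch of running an agent in yolo mode inside a disposable container looks something like the below. The image name and mount layout are hypothetical placeholders, not my actual setup — the point is that the agent's blast radius is limited to the mounted project directory:

```shell
#!/bin/sh
# Hedged sketch: run Claude Code with permission prompts disabled inside a
# throwaway container. "agent-sandbox" is a placeholder image you'd build
# yourself with the CLI preinstalled. --rm discards the container on exit,
# and only the current project directory is visible to the agent.
docker run --rm -it \
  -v "$PWD:/work" \
  -w /work \
  agent-sandbox:latest \
  claude --dangerously-skip-permissions
```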