Implement an evaluation runner that compares an amended skill version, v(n+1), against its predecessor, v(n), on the same task patterns. The runner should compare success rates, execution metrics, and user feedback, then promote v(n+1) if it improves on v(n) or roll it back otherwise. Evaluation results should be stored.
Run the amended skill (v(n+1)) against the same task patterns that failed for v(n). Compare success rates, execution metrics, and user feedback. If v(n+1) improved, promote it; if not, roll back to v(n). Store the evaluation results.
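
One possible shape for the runner, as a minimal sketch rather than a definitive design. Every name here (RunResult, VersionSummary, summarize, evaluate, the store list) is hypothetical and not taken from the request. The sketch also assumes that "improved" means a strictly higher success rate; the request does not define the rule, so execution time and feedback are recorded for comparison but do not drive the decision.

```python
import json
from dataclasses import dataclass, asdict
from statistics import mean
from typing import Callable, List, Optional, Sequence


@dataclass
class RunResult:
    """Outcome of running one skill version on one task pattern (hypothetical)."""
    success: bool
    duration_s: float                 # execution metric: wall-clock seconds
    feedback: Optional[float] = None  # assumed user rating; None if not given


@dataclass
class VersionSummary:
    success_rate: float
    mean_duration_s: float
    mean_feedback: Optional[float]


def summarize(results: Sequence[RunResult]) -> VersionSummary:
    """Aggregate per-task results into the three compared quantities."""
    ratings = [r.feedback for r in results if r.feedback is not None]
    return VersionSummary(
        success_rate=sum(r.success for r in results) / len(results),
        mean_duration_s=mean(r.duration_s for r in results),
        mean_feedback=mean(ratings) if ratings else None,
    )


def evaluate(
    run_old: Callable[[str], RunResult],
    run_new: Callable[[str], RunResult],
    task_patterns: Sequence[str],
    store: List[str],
) -> dict:
    """Run v(n) and v(n+1) on the same task patterns and record the outcome."""
    if not task_patterns:
        raise ValueError("at least one task pattern is required")
    old = summarize([run_old(t) for t in task_patterns])
    new = summarize([run_new(t) for t in task_patterns])
    # Assumed rule: promote only on a strictly higher success rate.
    decision = "promote" if new.success_rate > old.success_rate else "rollback"
    record = {
        "task_patterns": list(task_patterns),
        "old": asdict(old),
        "new": asdict(new),
        "decision": decision,
    }
    # The store is a plain list of JSON strings here; a real runner would
    # presumably write to a database or file instead.
    store.append(json.dumps(record))
    return record
```

A caller would pass one callable per skill version plus the list of task patterns that failed for v(n), then act on record["decision"]. Keeping the decision rule on a single line makes it easy to swap in a stricter policy, for example one that also requires execution time or feedback not to regress.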