Integrate OpenTelemetry support into SkyPilot to provide visibility into command usage, success rates, error tracking, and performance metrics.
The goal is to give visibility into SkyPilot, so we'd ideally expose things that help answer questions like:

- Which sky commands are users calling?
- Are commands succeeding or failing?
- What errors are users hitting?
- How long do commands take to complete?
- Are error rates spiking for a particular cluster/command/user?
- Is sky failing to connect to any cluster?
- How is SkyPilot deciding which cluster to allocate for a given `<job-id>`?
- Preemption events, recovery attempts, failovers

> Why OTel? Industry standard, supported by all the different observability backends.
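
To make the first few questions concrete, here is a rough, stdlib-only sketch of the data each command invocation might capture: command name, outcome, error type, and duration. Nothing in it is SkyPilot or OpenTelemetry API. `instrument_command`, `RECORDS`, and `fake_launch` are hypothetical names for illustration, and in a real integration these fields would presumably become span attributes handed to an OTel exporter rather than dicts in a list.

```python
import functools
import time

# Hypothetical in-memory sink; a real integration would hand these records
# to an OpenTelemetry exporter instead of appending to a list.
RECORDS = []


def instrument_command(name):
    """Record name, outcome, error type, and duration for each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            record = {"command": name, "status": "ok", "error_type": None}
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Note the failure, then re-raise so callers still see it.
                record["status"] = "error"
                record["error_type"] = type(exc).__name__
                raise
            finally:
                # Runs on both success and failure.
                record["duration_s"] = time.perf_counter() - start
                RECORDS.append(record)
        return wrapper
    return decorator


@instrument_command("sky launch")  # command name used only as a label
def fake_launch(fail=False):
    """Stand-in for a real command handler (assumption, not SkyPilot code)."""
    if fail:
        raise RuntimeError("could not connect to cluster")
    return "launched"
```

Grouping such records by `command`, `status`, and `error_type` would be enough to answer "which commands are called", "are they succeeding", and "what errors are users hitting". Per-cluster or per-user breakdowns would need those identifiers added as extra fields.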