"One time I was woken up at 3am by a pager that escalated. I instantly asked DrDroid to investigate it and in a few minutes, I was able to close the issue directly from Slack."
— Moiz Arsiwala, CTO
WorkIndia is one of India's largest job marketplaces, with 28M+ active users. Given the breadth of their infrastructure and applications, incidents can adversely affect their customers. Their on-call process needed to scale without scaling headcount.
01 Problem Context
WorkIndia had set up on-call processes and alerting to handle issues, but multiple challenges were slowing their team down.
Manual investigation overhead
Frequent alerts required 15-20 minutes of manual investigation each, jumping across k8s, ElasticAPM, Grafana dashboards, Loki logs, and code.
Escalation bottleneck
With so many tools and so much context to navigate, on-call engineers escalated frequently, and those escalations often held up identifying and fixing issues.
Engineers pulled off-rotation
Engineers who were not on-call were frequently involved in production issues, breaking focus and disrupting feature work.
Knowledge gaps
On-call engineers would get stuck when they lacked deep knowledge of a specific component (e.g. k8s) or could not correlate signals across the full stack.
02 The Vision
WorkIndia's CTO and tech team were working towards Zero Touch Production. They were hands-on with AI, actively using and building agents in their product, and wanted an agentic solution for on-call that would reduce the burden on engineers to investigate and debug production issues.
03 Trying DrDroid
One of their engineers came across DrDroid and, after checking the demo, decided to try it. Their evaluation criteria:
Relevant integrations
ElasticAPM, Grafana, k8s, PagerDuty, Loki, Jenkins, GitHub, Jira.
Slack-first workflows
Everything needed to work through Slack, where their on-call lived.
VPC integration support
Their infrastructure runs behind a VPC, so self-managed integration was a hard requirement.
Access management and security
Well-defined RBAC and audit capabilities.
04 What WorkIndia Achieved
Using DrDroid, the WorkIndia team achieved the following:
Junior engineers own on-call end-to-end
New and junior engineers can investigate any production alert in minutes without escalating, because DrDroid surfaces the context they need.
Automated runbook execution
Prompt-based runbooks take action on domain-specific alerts and auto-resolve them.
Continuous retrospectives
Daily on-call retrospectives run through DrDroid to improve alert actionability.
- Further improve their autonomous detection stack to catch failures in deployment pipelines before alerts fire.
- Further enhance operational efficiency by automating actions on more alert classes.
"DrDroid works amazingly for initial investigation. It gives exact alerting traces that help me understand what's happening quickly. With the time I save on debugging, I can actually focus on implementing long-term fixes instead of just firefighting all the time."