Stephen Wild, the observability manager at William Hill, runs a 10-strong team that oversees everything happening across IT at the online bookmaker. Describing what observability means within William Hill, he says it allows the company to “keep an eye on all our services”. To support this, the company chose New Relic as its observability platform.
William Hill used to monitor the individual nodes that made up its software stack. The bookie has since been migrating workloads to the cloud as part of a strategy to modernise IT infrastructure that could not cope with the huge peaks in bets placed during major sporting events such as the Grand National.
The challenge for Wild and the observability team is how to tackle failures that only occur during peak betting periods. “In the past,” he says, “it was a bit of a nightmare because we had infrastructure that wasn’t really built for the single huge day or huge week that we have. It was built to handle load over a year, which meant we were seriously struggling with IT infrastructure that was collapsing around us.” This, he says, meant it was hard to pinpoint where failures were occurring.
Understanding the revenue impact of technical outages across all production business services is a key objective of William Hill’s observability strategy. To help teams gain the real-time observability needed to achieve this, the observability team built a tool called Impact Listener on top of New Relic, which William Hill uses to track high-priority “P1” incidents.
The tool can be mapped onto any business service and any metric in real time to provide context and insights into service-impacting incidents during the entire incident lifecycle. New Relic is the primary trigger to launch the Impact Listener workflow. Alerts for critical incidents are sent to PagerDuty.
“The Impact Listener lets us prioritise what needs fixing first. It shows where most of the revenue is being lost,” says Wild. “There is an urgency to fix the problem that is costing us the most money.” He says that, thanks to Impact Listener, William Hill can now resolve 80% of P1 issues within one hour.
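The prioritisation Wild describes, fixing first whatever is losing the most revenue, can be illustrated with a minimal sketch. It is not William Hill's implementation and does not use any New Relic or PagerDuty API: the `Incident` class, the `prioritise` function, the service names and all revenue figures are hypothetical, chosen only to show the ranking idea.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical record of a service-impacting incident."""
    service: str
    baseline_revenue_per_min: float  # assumed normal revenue rate
    current_revenue_per_min: float   # assumed revenue rate during the incident

    @property
    def loss_per_min(self) -> float:
        # Revenue lost per minute, never negative.
        return max(0.0, self.baseline_revenue_per_min - self.current_revenue_per_min)


def prioritise(incidents: list[Incident]) -> list[Incident]:
    """Order incidents so the one losing the most revenue comes first."""
    return sorted(incidents, key=lambda i: i.loss_per_min, reverse=True)


# Illustrative figures only; none come from the article.
incidents = [
    Incident("sportsbook", 1000.0, 400.0),
    Incident("casino", 500.0, 450.0),
    Incident("payments", 800.0, 0.0),
]
ranked = [i.service for i in prioritise(incidents)]
print(ranked)
```

In this sketch the hypothetical payments outage ranks first because it loses the most revenue per minute, which mirrors the urgency Wild describes.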