How Twilio Builds AI at Internet Scale (w/ Head of AI)
Twilio powers billions of messages between businesses and customers every month. At that scale, even the smallest model error - one missed detection, or one mistimed send - can affect millions of people.
We spoke to Zachary Hanif, who leads AI, ML, and Data at Twilio, about what it really takes to deploy AI across a product that operates at such scale.
Here’s what we learned:
The hidden tax of AI
For Zach, the real cost of AI emerges after launch.
“Building AI has a cost. But operationalizing and maintaining AI has its own cost that goes beyond the normal expectations of software engineering.”
AI systems, he says, age faster than code. Models drift, data changes, and what once worked perfectly starts to fail silently.
Twilio treats model maintenance like infrastructure. Every model is monitored for accuracy, retrained when the world shifts, and tracked against both technical and business metrics.
“Your model is a representation of how the world worked when it was trained - and sometimes that world changes very slowly, sometimes really fast.”
Deploy and forget is not an option.
When 99% isn’t good enough
AI leaders often talk about “human-level” accuracy. At Twilio’s scale, Zach says that bar doesn’t cut it.
“At scale, something with 99% efficacy is wrong a lot of the time.”
With billions of messages in motion, a 1% failure rate translates to millions of mistakes. That’s why his teams chase the final tenth of a percent - the part that makes AI viable for production use, not just a fancy demo.
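The arithmetic behind that point is worth making explicit. A quick back-of-the-envelope sketch (the monthly volume below is illustrative - the article only says "billions" - not a Twilio statistic):

```python
# Back-of-the-envelope: how many mis-handled messages a given accuracy
# implies at message scale. Volume figure is illustrative, not Twilio's.
def expected_errors(volume: int, accuracy: float) -> int:
    """Expected number of mis-handled messages at a given accuracy."""
    return round(volume * (1 - accuracy))

monthly_messages = 5_000_000_000  # illustrative stand-in for "billions"

print(expected_errors(monthly_messages, 0.99))    # 99%    -> 50,000,000 errors
print(expected_errors(monthly_messages, 0.999))   # 99.9%  ->  5,000,000
print(expected_errors(monthly_messages, 0.9999))  # 99.99% ->    500,000
```

Even at 99.99%, hundreds of thousands of messages per month are still handled wrong - which is why the "final tenth of a percent" matters.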
Reliability isn’t a nice-to-have when you're operating at the scale Twilio does.
Measure the code and the consequence
Zach draws a sharp line between technical success and business success.
“There was a famous case with the Netflix Prize - a team built a model that performed better, but it was too expensive to run. Netflix never used it.”
That story has become a lesson for how Twilio evaluates AI.
Each model is measured twice:
- once for technical efficacy (F1 score, AUC, inference cost)
- once for business impact (NPS, adoption, fraud reduction)
A model that scores well on paper but fails in production doesn’t ship.
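That two-sided evaluation can be sketched as a simple ship gate: a model must clear both a technical bar and a business bar, and winning on one does not compensate for failing the other. All names and thresholds below are hypothetical illustrations, not Twilio's actual criteria:

```python
# Hypothetical two-sided ship gate: thresholds and metric names are
# illustrative, not Twilio's real evaluation pipeline.
from dataclasses import dataclass

@dataclass
class ModelReport:
    f1: float                       # technical efficacy
    cost_per_1k_inferences: float   # inference cost, USD
    fraud_reduction_pct: float      # business impact vs. current baseline

def should_ship(report: ModelReport) -> bool:
    technical_ok = report.f1 >= 0.95 and report.cost_per_1k_inferences <= 0.50
    business_ok = report.fraud_reduction_pct >= 5.0
    return technical_ok and business_ok

# The Netflix Prize lesson encoded as a gate: better accuracy, but too
# expensive to run, so it does not ship.
accurate_but_costly = ModelReport(f1=0.97, cost_per_1k_inferences=2.00,
                                  fraud_reduction_pct=8.0)
print(should_ship(accurate_but_costly))  # False
```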
The UX is the hardest part
When it comes to integrating AI into products, Zach says the hardest part is designing the way users interact with it.
“The hardest problem isn’t usually the technical part. It’s finding product–market fit and making sure the thing you’ve done is maintainable.”
At Twilio, AI features go through countless design iterations before a single change ships. The challenge is helping users trust what the system does - especially when it makes decisions on their behalf.
For most companies, that’s where the work really begins.
From compliance to care
One of Twilio’s newest AI products is its Compliance Toolkit for Messaging, which automatically checks whether messages comply with local regulations before they’re sent.
For example, some EU countries enforce “quiet hours” after certain times of day.
Twilio’s AI can now detect when a company is about to send a marketing message past that threshold, and automatically delay it until morning.
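The quiet-hours behavior described above amounts to a scheduling check: if a message would land inside a country's quiet window, push it to the next window opening. A minimal sketch, assuming per-country windows and timezone-aware send times (the windows and the `next_allowed_send` helper are illustrative, not actual regulations or Twilio APIs):

```python
# Illustrative quiet-hours gate. Windows are made up, not real regulations.
from datetime import datetime, timedelta, time
from zoneinfo import ZoneInfo

QUIET_HOURS = {  # country -> (quiet window start, quiet window end), local time
    "FR": (time(20, 0), time(8, 0)),
    "DE": (time(21, 0), time(9, 0)),
}

def next_allowed_send(when: datetime, country: str) -> datetime:
    """Return `when` if outside quiet hours, else delay to the window's end."""
    window = QUIET_HOURS.get(country)
    if window is None:
        return when
    start, end = window
    t = when.time()
    if t >= start:  # quiet window spans midnight: delay to next morning
        return datetime.combine(when.date() + timedelta(days=1), end,
                                tzinfo=when.tzinfo)
    if t < end:     # still in the early-morning tail of the window
        return datetime.combine(when.date(), end, tzinfo=when.tzinfo)
    return when

paris = ZoneInfo("Europe/Paris")
late = datetime(2024, 6, 1, 22, 30, tzinfo=paris)
print(next_allowed_send(late, "FR"))  # delayed to 08:00 the next morning
```

The real system would also have to resolve the recipient's locale and handle jurisdictions without quiet-hour rules, but the core decision is this local-time comparison.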
“The right human gets the right message at the right time.”
The next five years
Zach sees AI gradually disappearing into the background - doing more work invisibly so humans don’t have to.
“As more becomes transparent to the end user, it just works. Twilio becomes less of a communication layer between machines and humans… and more a communication layer between humans and humans.”
It’s about software that understands intent, and does the right thing automatically.
We had such a great time jamming with Zachary! You can catch our full conversation on YouTube, alongside episodes with engineering and product leaders from Intercom, Monday.com, and Vercel.