Maintaining High-Velocity Feature Development, Without Sacrificing Quality
As in any high-growth environment, expanding your suite of products and capabilities can lead to a growing backlog of errors and difficulty prioritizing them. That scenario was not lost on the team at Airtable, a connected apps platform that more than 300k organizations, including 80% of the Fortune 100, rely on to connect their teams, data, and workflows.
Airtable ships new features and updates through multiple deployments a week. Keenly aware of the importance of release health and the ability to reproduce and fix errors from staging through production, they set out to find an APM solution to support several key outcomes:
The ability to define issue ownership
Improved canary analysis of code before and during deployments
Detailed context to search by arbitrary tags and product area
The ability to optimize workflows for better triaging
The ability to maintain release stability at scale
An overall improvement in the developer experience
“The main business driver was our ‘Scalable and Safe’ deployments effort, led by our Service Orchestration engineering team in collaboration with product engineering teams to improve and decentralize our release process.”
Doug Forster, Software Engineer, Airtable
Airtable’s infrastructure team partnered with Sentry to streamline alerting, improve alert accuracy, and collaborate more effectively with teams on issue ownership and incident response.
“We want to use best-in-class tools to help our engineers be effective, and having a solution that other organizations widely use makes onboarding faster for new team members.”
Turning expectations into outcomes
By enriching errors with custom context, developers were able, for the first time, to search by arbitrary tags or view errors by product area.
“We’ve seen a much better user experience for our engineers, particularly in the ability to search by tags, which we did not have with our previous tool.”
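As a rough illustration of what enriching errors with custom context can look like, here is a minimal sketch of an event hook that derives tags from the request URL. It is not Airtable's implementation: the tag names (`product_area`, `owning_team`), the path prefixes, and the team names are all hypothetical, and the event is assumed to be a plain dictionary with an optional `request.url` field and a `tags` dictionary.

```python
from typing import Any, Optional
from urllib.parse import urlparse

# Hypothetical mapping from URL path prefix to (product area, owning team).
AREA_BY_PREFIX = {
    "/marketing": ("marketing_pages", "growth"),
    "/base": ("bases", "core-product"),
}


def tag_event(event: dict[str, Any], hint: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Add product_area and owning_team tags based on the request URL."""
    url = event.get("request", {}).get("url", "")
    path = urlparse(url).path
    tags = event.setdefault("tags", {})
    for prefix, (area, team) in AREA_BY_PREFIX.items():
        if path.startswith(prefix):
            tags["product_area"] = area
            tags["owning_team"] = team
            break
    else:
        # No prefix matched: keep the event searchable under a catch-all value.
        tags["product_area"] = "unknown"
    return event
```

In the Sentry Python SDK, a function with this `(event, hint)` shape can be supplied as the `before_send` option to `sentry_sdk.init`, so every outgoing event would carry the extra tags and become searchable by them.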
New alert rules let teams see the volume of events over time, highlighting those with the biggest impact on customers. They’ve further customized this to show errors by product area, such as marketing pages, which has shortened the time it takes to catch regressions or new issues.
Soon they were monitoring for issues related to ongoing project work and seeing errors in their staging environment that might’ve been caused by recent code changes.
“In the past, this type of investigation would require pivoting between tools to get alerted, and it would take 10 to 15 minutes to correlate an alert with the impacted customers. Now, we can do the same thing in Sentry in a couple of minutes.”
Safer, more scalable deployments
Custom tags help identify which engineering team owns a particular feature, so if there’s an issue during deployment they know who to reach out to. Teams also have the ability to prioritize issues based on impact, so that not all errors automatically block a deployment.
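The idea that ownership tags route issues and that only high-impact errors block a deployment can be sketched as a small triage function. This is an assumption-laden illustration, not the rule Airtable uses: the `Issue` fields, the `owning_team` tag, and the `max_users` cutoff of 10 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    title: str
    owning_team: str        # value of a hypothetical "owning_team" tag
    users_affected: int
    new_in_release: bool


def blocks_deploy(issue: Issue, max_users: int = 10) -> bool:
    """Block only on new errors whose impact exceeds the allowed user count."""
    return issue.new_in_release and issue.users_affected > max_users


def triage(issues: list[Issue], max_users: int = 10) -> tuple[list[Issue], list[Issue]]:
    """Split issues into (blocking, non_blocking), highest impact first."""
    ranked = sorted(issues, key=lambda i: i.users_affected, reverse=True)
    blocking = [i for i in ranked if blocks_deploy(i, max_users)]
    non_blocking = [i for i in ranked if not blocks_deploy(i, max_users)]
    return blocking, non_blocking
```

Each blocking issue still carries its `owning_team`, so whoever is running the deployment knows which team to contact, while lower-impact or pre-existing errors are ranked for follow-up without halting the release.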
Putting this into practice, they recently set up an alert for unexpected errors related to loading Airtable bases. First, they established an error threshold and configured the alert to notify the feature team in Slack. If there’s an issue, the on-call team member is notified, drills down, and reviews the tags, which include metadata about the request and error. The tag histogram usually provides a breakdown by user ID or other metadata, which helps identify the impact as well as patterns related to the cause.
This makes deployments safer and more scalable by detecting problems more quickly. In most cases, the errors are detected during the ‘canarying’ phase of the deployment, so any customer impact is limited.
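The two mechanics in this workflow, an error threshold and a tag histogram, can be approximated in a few lines. The sketch below is illustrative only and does not use Sentry's alerting API: the window length, the threshold value, and the event shape (a dictionary with a `tags` dictionary) are assumptions.

```python
from collections import Counter
from typing import Any


def exceeds_threshold(timestamps: list[float], now: float,
                      window_seconds: float = 300, threshold: int = 50) -> bool:
    """Return True if at least `threshold` errors occurred in the recent window."""
    recent = [t for t in timestamps if 0 <= now - t <= window_seconds]
    return len(recent) >= threshold


def tag_histogram(events: list[dict[str, Any]], tag_key: str) -> list[tuple[str, int]]:
    """Count events per value of one tag, most frequent value first."""
    counts = Counter(e.get("tags", {}).get(tag_key, "(missing)") for e in events)
    return counts.most_common()
```

If the threshold check fires, a breakdown such as `tag_histogram(events, "user_id")` shows whether the errors come from one user or many, which is the kind of impact signal the on-call engineer reads from the tag histogram.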
With a focus on customer and developer experience, Airtable tweaked Sentry to fit how their teams work. This lets them route issues to the right people for faster investigations, reduces triaging times and the overall duration of incidents, and frees up developer time to work on other projects.
To learn more, read our full conversation with Doug here.