Crash by API
Crashes are something we know a thing or two about. We see a lot of reasons why apps crash, but we also see when a lot of popular mobile apps crash at once.
Crashes across many mobile apps at the same time are not typically a result of an app-specific code update. I mean, there's a non-zero chance that hundreds of developers published broken apps... but these widespread issues are likely a result of one of two things. One is a bug caused by something related to time (like the mishandling of daylight saving time, time-based overflows, expiring certificates, etc.). The second is something we're all familiar with because we've experienced it twice in the last two months: crashes caused by code failing to talk to an API, or by the API starting to return unexpected data.
The Problem
Facebook made a change in its SDK that caused a ridiculous number of mobile applications to crash. On application startup, the SDK tried to hit an API endpoint but the callback that works with the response data would cause the app to crash immediately. The function looked like this:
+ (void)updateFilters:(nullable NSDictionary<NSString *, id> *)restrictiveParams
{
    restrictiveParams = [FBSDKTypeUtility dictionaryValue:restrictiveParams];
    if (restrictiveParams.count > 0) {
        // ... snip
    }
}
The code crashed on restrictiveParams.count, which is syntactic sugar for [restrictiveParams count] and sends the count message to the given Objective-C object. In this particular case, the restrictiveParams object was nil (NULL). The code clearly does not honor its own contract: both the input parameter to the method (restrictiveParams is declared nullable) and the return value of [FBSDKTypeUtility dictionaryValue] (nil if the passed object is not an NSDictionary) can be nil, but the code does not check for nil.
The input value itself comes from code that parses a JSON response from the web API, which again can return nil if the key does not exist.
NSDictionary<NSString *, id> *restrictiveParams =
    [FBSDKBasicUtility objectForJSONString:
        resultDictionary[FBSDK_SERVER_CONFIGURATION_RESTRICTIVE_PARAMS_FIELD]
        error:nil];
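As a rough illustration of what a more defensive lookup could look like, here is a minimal sketch. It is not the SDK's code: SafeDictionary, ParamsFromJSON, and the "params" key are hypothetical names made up for this example. It relies on two Foundation behaviors: NSJSONSerialization decodes a JSON null as an NSNull object (which does not respond to count), and it returns nil for a payload that is not valid JSON.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of any SDK): returns obj only if it really
// is an NSDictionary, otherwise nil.
static NSDictionary *SafeDictionary(id obj) {
    return [obj isKindOfClass:[NSDictionary class]] ? obj : nil;
}

// Hypothetical parser: extracts the dictionary stored under `key` from a raw
// JSON payload and falls back to an empty dictionary whenever anything is off.
static NSDictionary *ParamsFromJSON(NSData *data, NSString *key) {
    if (data == nil) {
        // JSONObjectWithData: raises an exception for nil data, so guard first.
        return @{};
    }
    NSError *error = nil;
    id root = [NSJSONSerialization JSONObjectWithData:data options:0 error:&error];
    // Malformed JSON yields nil. A top-level array, a missing key, or a JSON
    // null under the key (decoded as NSNull) is rejected by the type checks.
    NSDictionary *params = SafeDictionary(SafeDictionary(root)[key]);
    return params ?: @{};
}
```

With a helper like this, a payload such as {"params": null} or an HTML error page produces an empty dictionary, so the caller can read params.count without first worrying about what the server sent.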
The Fix
In the same way the problem appeared, through a server-side change, the problem was resolved. If you look into the SDK, you can still see that nothing was changed in the code, which means that if the server were to return bad data again, all applications using this SDK would crash again. The client code was not really fixed and is still broken with regard to handling invalid data coming from the server.
Given that the Facebook SDK is widely adopted and that it would be time-consuming for every application to update it, the problem needed to be fixed on the server quickly.
The Impact
Fortunately, Facebook has the means and the technical staff to respond and resolve these issues quickly. Unfortunately, end-users don't see the poor experience as a Facebook problem; they see it as an app problem. This particular SDK problem impacted navigation apps, music services, and many more. While popular SDKs provide many benefits, they also serve as a single point of failure with significant downstream consequences.
The most apparent is the spike in support cases. If users can't open an app they deem important, you can bet they'll tweet at you, reach out via chat support, and probably email in. Not to mention, some apps automatically send crash reports upon a crash. Imagine waking up to that backlog? No thanks.
Another downstream impact is on the tools application developers use to maintain their software, like an error monitoring tool :wink wink:. On a normal day, applications crash for all kinds of reasons, and usually the crashes are distributed throughout the course of the day. For an error monitoring product ingesting those events, evenly distributed crashes are great. Low event volume throughout the day means the error and performance monitoring provider doesn't have to throttle how many events they process and can serve them back to the developer in real time.
But when a large number of widely used applications start crashing, that error monitoring software needs to scale. Spikes like the one we saw last week put extra load on the backend, and some providers essentially need to rate-limit what they serve their customers. This means you as the developer are not getting crashes in real time, and you're not getting all of them anytime soon.
In other words, third-party SDKs aren't going away. They make it easy for users to adopt our software, but at the same time they pose a big risk because critical parts of your application depend on them. Select software that’s built to scale so that in the rare unpredictable occasions when you need it to work flawlessly…it does.
Prepare for Next Time
So let's look at how to prevent a ton of applications from crashing at startup without having to wait for the SDK author (in this case Facebook) to fix it:
Option A: Prevent the initialization of such SDKs until absolutely necessary. For example, if such an SDK were only to be used for sign-in it might be possible to isolate the incident to customers who haven't signed in yet.
Option B: The Facebook SDK issue was fixed through a server-side deploy. In many cases, app developers are left waiting for the deploy even if they have already identified the issue. But if we expect such things to happen again, we can modify our application to download a small configuration file from a well-known URL on application startup. This should give us the ability to disable individual components of the application.
For instance, if we know which SDK degraded, we could disable that component in the config. A lot of functionality can be disabled this way. You could disable analytics and other SDKs without having to publish or having your customers download a new version. In essence: we want to create a kill switch.
In a way, building a kill switch is like fighting fire with fire, because the problem in the first place was code responding badly to an API sending bad data. So when you build a feature like this yourself, make sure that you're not running into similar issues.
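A kill switch along these lines could be sketched as follows. Everything here is an assumption for illustration: the component names ("analytics", "social_login"), the flat JSON format, and the function names are invented, and the download step is left out entirely. The point is that any failure (no data, invalid JSON, wrong types) leaves the defaults shipped with the app in place.

```objc
#import <Foundation/Foundation.h>

// Defaults shipped with the app, used whenever the remote config is missing
// or malformed. The component names are hypothetical.
static NSDictionary<NSString *, NSNumber *> *DefaultFlags(void) {
    return @{ @"analytics": @YES, @"social_login": @YES };
}

// Merges a downloaded kill-switch payload over the defaults. Unknown keys in
// the payload are ignored; bad values never override a default.
static NSDictionary<NSString *, NSNumber *> *FlagsFromRemoteData(NSData *data) {
    NSMutableDictionary *flags = [DefaultFlags() mutableCopy];
    if (data == nil) {
        return flags;  // download failed: keep defaults
    }
    id root = [NSJSONSerialization JSONObjectWithData:data options:0 error:NULL];
    if (![root isKindOfClass:[NSDictionary class]]) {
        return flags;  // invalid JSON or unexpected top-level type
    }
    for (NSString *name in DefaultFlags()) {
        id value = ((NSDictionary *)root)[name];
        // Accept only numbers/booleans; NSNull, strings, etc. are ignored.
        if ([value isKindOfClass:[NSNumber class]]) {
            flags[name] = @([value boolValue]);
        }
    }
    return flags;
}

// Returns YES if the named component should be initialized.
static BOOL IsComponentEnabled(NSDictionary<NSString *, NSNumber *> *flags,
                               NSString *name) {
    return [flags[name] boolValue];
}
```

An app could then wrap the initialization of a third-party SDK in a check such as IsComponentEnabled(flags, @"social_login"), which also combines naturally with Option A: the SDK is only touched when it is both needed and not switched off.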
Defensive Programming
This leads us to how to write code so that it does not crash on bad data. When we consume something from an API on a mobile phone we want to make sure the following things are happening:
Errors are handled well: instead of crashing when the data does not look like it should, an application should handle the error and do something sensible by default. For instance, instead of assuming that there will be a list of configuration options, validate that first to prevent the crash. Instead of failing if the file cannot be downloaded or if the server did not send some values, ship a default config with the app.
Avoid common pitfalls: when it comes to expecting where errors happen when working with downloaded data, here are some suggestions:
Expect that your JSON parsing can fail. A lot of code assumes that the server always produces well-formed JSON. Often that might be true for the backend server itself, but the load balancer could start returning errors, which will typically look more like HTML than JSON.
Expect the API to return invalid types. The most obvious things are wrong types coming back for already known keys, values being null or entire keys missing.
Another common issue is when values are out of range for a specific type. Frequent offenders are timestamps too far in the past or future for the type your language or ecosystem provides.
Consider using a schema validation library. If you consume data from a web API consider doing basic schema validation. Web APIs change over time and we have seen many cases where some old versions of clients suddenly start erroring in unexpected ways, because whoever changed the API was not aware that some old clients were not compatible. If you do schema validation, don't reject data you don't know about yet so you can update the config later (by adding new keys).
Client devices are careful about retrying: this is something I cannot stress enough. If something is wrong, don't blindly retry. Especially during a global outage (overloaded backend, bad deploy, etc.), you don't want all your client apps to go into an uncontrolled retry loop. It's very easy to DDoS yourself when this happens.
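One common way to keep retries under control is capped exponential backoff with random jitter, so that clients spread out instead of hammering the backend in lockstep. The sketch below is only one possible shape: the function name RetryDelay and the constants (five attempts, one-second base, sixty-second cap) are assumptions chosen for illustration, not recommendations from any particular SDK.

```objc
#import <Foundation/Foundation.h>
#include <stdlib.h>

// Returns the delay in seconds before retry number `attempt` (0-based),
// or a negative value when the client should stop retrying.
static NSTimeInterval RetryDelay(NSUInteger attempt) {
    static const NSUInteger kMaxAttempts = 5;      // assumed retry budget
    static const NSTimeInterval kBaseDelay = 1.0;  // assumed base, seconds
    static const NSTimeInterval kMaxDelay = 60.0;  // assumed cap, seconds
    if (attempt >= kMaxAttempts) {
        return -1.0;  // give up instead of looping forever
    }
    // Exponential growth: 1s, 2s, 4s, ... capped at kMaxDelay.
    NSTimeInterval ceiling = MIN(kMaxDelay, kBaseDelay * (double)(1UL << attempt));
    // Jitter: pick a random point in [0, ceiling] so that many clients
    // failing at the same moment do not all retry at the same moment.
    double fraction = (double)arc4random_uniform(1001) / 1000.0;
    return ceiling * fraction;
}
```

A caller would schedule the next request only when the returned delay is non-negative, and otherwise fall back to cached data or the shipped defaults until the next app launch.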