AI, Privacy and Terms of Service Updates
Update
Hey everyone. We’ve gotten your feedback and heard your concerns; we were less than artful in expressing our intentions. Many of the things that people are worried about are not things that we plan to pursue, and we should have been more clear.
First off, we’re going to delay the effective date of our TOS change indefinitely, until we’ve addressed the concerns you’ve shared.
As part of these changes, we agree that we need to provide a better consent mechanism for many of our customers. When reviewing this, we quickly realized that the majority of what we’re looking to accomplish (e.g., improving our fingerprinting heuristics) was being overshadowed by the less well-understood, hypothetical investments we’d like to explore.
We will not push customers to agree now to use cases that are not well-defined or understood. Rather, we commit to offering a consent mechanism for those future hypothetical use cases when we can clearly define them so customers have the opportunity to evaluate whether the feature is valuable enough to them to contribute their data. That could include, for example, a system which might use customer data to train a large language model. Additionally, we’re exploring an opt-out mechanism for other uses of data, such as training a heuristic model to better group errors together. While the implications of these two applications are very different, we understand customers’ desire for more control in how their data is used.
We need some time to explore what these changes would look like and how we would implement them in a way that stays true to protecting our customers.
Thanks for bearing with us.
Original Post
Like everyone else in the world, we are thinking hard about how we can harness the power of AI and machine learning while also staying true to our core values around respecting the security and privacy of our users’ data.
If you use Sentry, you might have seen our “Suggested Fix” button, which uses GPT-3.5 to try to explain and resolve a problem. We have additional ideas in development as well that we’re excited to preview. For example, our plan to use machine learning to rank and group issues might be less flashy, but it is significantly more impactful, as knowing which issues interest our users the most will help us make our products more compelling.
To make any of these things happen, we need to be able to train machine learning models on data sets built from error, event, and other data sent to Sentry (we refer to this as service data) and from the way you interact with Sentry (usage data).
Because this is a big change for us, we thought we’d take a minute to address how we’re approaching these issues. While we have used aggregate usage data in the past, using service data for product development would be new for us. Since Sentry’s inception over 10 years ago, we have limited our use of service data to troubleshooting issues or validating functionality. This stance reflects our commitment to demonstrating the utmost respect for your data. However, limiting our use of service data in this way prevents us from advancing our products through AI or machine learning.
Keeping privacy top-of-mind
With machine learning, you throw data into a big ball of machine-learning magic and let the model come up with answers on its own. Using service data in this context has different implications depending on the use case. One primary use of service data for machine learning is in the aggregate, to rank and recommend: the service data goes into the magic 8 ball, and the 8 ball just produces numbers, like scores and rankings. While the training data set itself might retain some of the underlying service data, the output is only numbers. This is the easier problem to deal with, because it means no service data can accidentally be revealed in the output.
A trickier use case for machine learning involves retrieval (i.e., having the ML model provide a contextually relevant response based on underlying service data). Here we need to be extra careful, because the system can surface information contained in that underlying data. It would be highly problematic if, for instance, our “Suggested Fix” feature were to reveal someone else’s confidential information.
So here is how we’re planning on dealing with all of this:
We will continue to encourage all our customers to use our various data scrubbing tools so that service data is sanitized before we receive it (see the sketch after this list).
We will apply the same deletion and retention rules to our training data as we do to the underlying service data. This means that if you delete service data, it will also be removed from our machine learning models automatically.
We will scrub data for PII before it goes into any training set.
We will ensure that the only service data presented in the output of any ML feature belongs to the customer using the feature.
We will only use AI models built in-house or provided by our existing trusted third-party sub-processors who have made contractual commitments that are consistent with the above.
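Here’s what that first item can look like in practice: a minimal sketch of client-side scrubbing using the Python SDK’s `before_send` hook. The exact fields worth redacting depend on your application, so the header names and user fields below are illustrative rather than exhaustive.

```python
import sentry_sdk

SENSITIVE_HEADERS = {"Authorization", "Cookie", "X-Api-Key"}

def scrub_event(event, hint):
    """Redact PII from an event before it leaves your infrastructure."""
    # Drop user identifiers the SDK may have attached.
    user = event.get("user") or {}
    user.pop("email", None)
    user.pop("ip_address", None)
    event["user"] = user

    # Filter sensitive request headers (assumes headers are a dict; names are illustrative).
    headers = event.get("request", {}).get("headers", {})
    for name in list(headers):
        if name in SENSITIVE_HEADERS:
            headers[name] = "[Filtered]"

    return event  # return None instead to drop the event entirely

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
    send_default_pii=False,   # don't attach user PII to events by default
    before_send=scrub_event,
)
```

Server-side data scrubbing rules in your project settings add a second layer of protection for anything the SDK misses.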
We are confident that with these controls in place, we will be able to use service data and usage data to improve our products through AI while still protecting that very data.
To that end, we are updating our Terms of Service to allow us to use service data for AI. The exact language is simple, and it’s consistent with how many of the companies in our space are addressing the issue:
“...Sentry may use Usage Data and Service Data for analytics and product development (including to train or improve AI Features and generate Outputs).”
With recent TOS shenanigans in the news—and especially those surrounding AI—we don’t take this update lightly. As an open company, it’s crucial that we operate transparently; not because we want to avoid a PR disaster, but because it’s fundamental to the way we build and do business.
The new TOS will take effect on February 3, 2024 for all new customers, and on the first renewal after February 3, 2024 for existing customers. This gives all existing customers at least 30 days to review the TOS changes before they go into effect.
See our FAQs for more information.
Do you have any questions? Please reach out to us with concerns or feedback. We’ve also started a GitHub discussion on our first use case, issue severity.
Postscriptum: The nitty-gritty details for interested developers
Our intended improvements will harness text embeddings – converting words into numerical values that capture their semantic information. In this context, we’ll treat embeddings as a set of features for a downstream task. For example, we may use this aggregated dataset to train a model that can predict the severity of a new issue that is sent to your feed based on both usage and service data. Crucially, the information within these embeddings cannot identify specific organizations, projects, or issues. Moreover, our policy on GDPR deletion requests will apply to any data we leverage for AI features - if we are required to delete any sensitive data in our production database, we will also delete the corresponding numerical embedding from our vector database.
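To make that concrete, here’s a minimal sketch of that kind of pipeline. We haven’t committed to specific models or libraries; the embedding model, classifier, labels, and `vector_store` call below are stand-ins used purely to show the shape of the approach: embed the (already scrubbed) issue text, use the vectors as features for a severity model, and delete the stored vector whenever the underlying service data is deleted.

```python
from sentence_transformers import SentenceTransformer  # stand-in embedding model
from sklearn.linear_model import LogisticRegression

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Training: issue text (already scrubbed of PII) plus historical severity labels.
issue_texts = [
    "ConnectionError: read timed out",
    "TypeError: 'NoneType' object is not iterable",
]
severity_labels = [1, 0]  # e.g. 1 = high, 0 = low (illustrative labels)

X = embedder.encode(issue_texts)               # text -> fixed-size numeric vectors
clf = LogisticRegression().fit(X, severity_labels)

# Inference: score a brand-new issue as it arrives.
new_vec = embedder.encode(["OperationalError: could not connect to server"])
print(clf.predict_proba(new_vec)[0])

# Deletion: if the underlying service data is deleted (e.g. a GDPR request),
# the corresponding vector is removed from the vector store as well.
# vector_store.delete(ids=[issue_id])          # hypothetical vector-store call
```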
In the coming months, we intend to launch additional generative AI features that leverage Retrieval Augmented Generation (RAG). In this context, we will use text embeddings as an index - a way to retrieve the most contextually relevant bits of source code (via our GitHub integration) or events for a particular Sentry issue. When it comes to RAG, embeddings will always be logically separated (never crossing customer boundaries). We may use embeddings to provide improved context to our suggested fix feature, help with semantic search, or improve the relevance of the issue details page, but only for the specific customer from whose service data those embeddings were derived. All functionality leveraging RAG will require user opt-in - if you do not intend to take advantage of these features, nothing will change.
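For the curious, this is roughly what “logically separated” means in practice. The in-memory index, org IDs, and helper below are illustrative assumptions, not our actual implementation; the point is that every retrieval is filtered to the requesting customer’s organization before any similarity search runs, so embeddings derived from one customer’s service data can never appear in another customer’s context.

```python
import numpy as np

# Hypothetical in-memory index: (org_id, source_snippet, embedding).
# In production this would be a vector database with tenant isolation.
index = [
    ("org-123", "def handle_timeout(): ...", np.array([0.9, 0.1, 0.0])),
    ("org-456", "class PaymentProcessor: ...", np.array([0.1, 0.8, 0.1])),
]

def retrieve_context(org_id: str, query_embedding: np.ndarray, top_k: int = 3):
    """Return the most similar snippets, restricted to the requesting customer's org."""
    # Hard tenant filter first: embeddings never cross customer boundaries.
    candidates = [(text, emb) for owner, text, emb in index if owner == org_id]

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    ranked = sorted(candidates, key=lambda c: cosine(query_embedding, c[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

# Only org-123's own code can ever show up in org-123's Suggested Fix context.
print(retrieve_context("org-123", np.array([1.0, 0.0, 0.0])))
```

The retrieved snippets would then be passed as context to the generative model, and, as noted above, none of this runs unless you’ve opted in.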