Principles for Online Safety in the Age of Generative AI
Over the past 20 years, the internet has transformed from an emerging technology into a foundational part of life, with over 5 billion internet users worldwide. The sheer volume of content being created and consumed on the web has skyrocketed, with Generative AI throwing gasoline on the fire. Each year it gets harder for platforms to enforce their policies and create a safe online environment as they come under attack from bad actors generating objectionable, explicit, or outright illegal content. Platforms are under immense pressure to keep their online communities safe, and the consequences of failure extend beyond legal or reputational risks. Ultimately, the best platforms know that safety is an essential part of a great user experience. In a sense, safe platforms are profitable platforms.
None of this is new, but it has become doubly pressing in the age of Generative AI. It has never been easier to create content. It's not just the stuff that makes the news, like fake Joe Biden phone calls; GenAI will help bad actors break down language barriers, generate compelling conspiracy videos from a short prompt, or create deepfake explicit images of celebrities.
At the same time, there seems to be growing awareness around how harmful negative online experiences can be. With American teenagers spending 8.5 hours on their phones a day, governments are taking action against social media companies for their role in the teen mental health crisis. There’s an unbelievable amount at stake!
And yet, if you ask folks working on Trust and Safety teams at companies like Meta, Reddit, or Twitter, they'll paint a pretty bleak picture of the state of internet safety today. Content moderation and AI safety are far from solved problems; on the contrary, they're resource-intensive, manual, and inaccurate, and they only get worse with scale. For all their technical sophistication, big tech companies still collectively spend billions of dollars every year on human reviewers, and not for lack of investment in automation. Even with these substantial investments, tens of millions of policy-violating views still slip through the cracks each month.
Their current model involves:
Outsourcing moderation to underpaid, overworked workforces overseas, who are exposed to content so disturbing that many develop mental health issues.
Expecting human reviewers to apply internal guidelines and standards that can run hundreds of pages (and change rapidly), making consistent, accurate moderation nearly impossible.
These content moderation challenges pose real business risks for tech companies, including user churn, loss of advertising revenue, and severe damage to their brands and reputations. CEOs are increasingly called to account for these issues publicly. Each platform's unique approach to content safety, and the lack of a workable standard, has created gaps that allow harmful material to slip through.
But the news is not all bad! GenAI, for all its risks, could hold the key to a scalable, long-term solution. Our approach, which we're calling Structured AI Data Enrichment ("SAIDE"), provides a path forward by giving users fine-grained control over the behavior of an AI agent. Using this approach, repetitive, large-scale tasks are automated away with much of the intelligence and nuance a human would bring to bear. Instead of relying solely on under-resourced human moderation teams, we (as an industry) can leverage an AI system that is precise, scalable, explainable, and adaptive to evolving content types and threats.
—
SAIDE didn't flash into existence. It emerged from years of research and hard knocks. Over time, a set of implicit principles came into view: principles we believe show a clear path forward. The following are our principles for fair, robust online safety tools. We've already adopted them, and we hope you'll adopt them as well.
Eight Principles for Online Safety
Fast
Any effective solution must operate in real or near-real time, because so many online use cases demand real-time intervention.
Consider online gaming, where players (many of them young children) are regularly bullied, or worse. In the best case, they simply churn out of the game. The worst case is, well, much worse.
What about online dating? For many users, harassment and objectification have become so common that they're now seen as just part of the process.
We shouldn't just accept this as the status quo, nor do we have to. Wouldn't it be better to prevent the bullying in the first place than to ban the user after the fact? Wouldn't it be better to nudge online daters toward better behavior in real time? Past intervention efforts have shown great promise, even without access to highly accurate, automated tools. We can, and should, strive to ensure our tools do their work in real time.
Customizable
Every platform is different: its challenges, its bad actors, the modalities of the content it deals with. And that's before you even consider AI safety, which is a whole different ball of wax.
One-size-fits-all approaches, by definition, fit no one. Each platform must be able to define the rules they need to enforce, which means modern safety solutions should be customizable to the needs of each platform.
But why stop there? Safety solutions can and should offer users a measure of control over their online experiences. While platform policies may be enough for most, some users (or parents of users) may want additional protections put in place. Let’s give them the tools to do so.
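To make the idea concrete, here's a minimal sketch of how layered, customizable policies might compose. All names here (`Policy`, the category labels) are invented for illustration, not an actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Policy:
    """A hypothetical policy: simply a set of content categories to block."""
    blocked_categories: set[str] = field(default_factory=set)

    def merged_with(self, override: "Policy") -> "Policy":
        # A user (or parental) override can only tighten the platform
        # baseline, never loosen it, so we take the union of blocked sets.
        return Policy(self.blocked_categories | override.blocked_categories)

    def allows(self, category: str) -> bool:
        return category not in self.blocked_categories

# The platform defines its baseline rules...
platform_policy = Policy({"hate_speech", "threats"})
# ...and a parent layers stricter ones on top for their child's account.
parental_override = Policy({"profanity", "violence"})

effective = platform_policy.merged_with(parental_override)
print(effective.allows("profanity"))  # False: blocked by the override
print(effective.allows("cooking"))    # True: no rule matches
```

The key design choice in this sketch is that user-level controls compose monotonically with the platform baseline: they can add protections but never remove them.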
Testable
A safety solution should be testable, even when it utilizes non-deterministic components like AI models. Users should be able to reliably predict how the system will behave at scale, with a known and acceptable margin of error. Ultimately, testability creates trust: trust for the operators of the safety system, and trust from users in the platform.
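One way to test a non-deterministic component is statistically: run a labeled evaluation set through it many times and check that the observed error stays within an agreed margin. This sketch is purely illustrative; `FlakyClassifier` is an invented stand-in for a real AI model:

```python
class FlakyClassifier:
    """Stand-in for an AI model: flags 'spam', but errs on every 20th call
    to simulate a known ~5% error rate."""

    def __init__(self):
        self.calls = 0

    def __call__(self, text: str) -> bool:
        self.calls += 1
        correct = "spam" in text
        # Deliberately return the wrong answer on every 20th call.
        return (not correct) if self.calls % 20 == 0 else correct

def error_rate(classifier, labeled_examples, trials=100):
    """Run each labeled example many times and measure the observed error."""
    errors = total = 0
    for text, expected in labeled_examples:
        for _ in range(trials):
            total += 1
            if classifier(text) != expected:
                errors += 1
    return errors / total

eval_set = [("buy spam now", True), ("hello friend", False)]
rate = error_rate(FlakyClassifier(), eval_set)
print(rate)  # 0.05, within a 10% margin a platform might deem acceptable
assert rate <= 0.10
```

The point isn't the toy model; it's that even a component with unpredictable individual outputs can have predictable, testable aggregate behavior.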
Agile
Online threats evolve quickly, with adversaries constantly finding new ways to exploit platforms for financial or ideological reasons. This is especially true in the realm of AI safety, where bad actors regularly identify and exploit system weaknesses. Therefore, safety systems must be agile, capable of adapting to emerging tactics and techniques within minutes.
Explainable
Solutions should be as transparent as possible without compromising their effectiveness.
Platforms must be able to explain the rationale behind any action taken, including which policies were violated, and provide clear communication to users on request. This fosters trust and ensures users understand how moderation decisions are made.
Consistent == Accurate
Defining ground truth for content safety is understandably difficult. Rules differ across platforms, human societies are notoriously nuanced, and norms and standards vary greatly across cultures.
At Clavata, we measure our accuracy by how well we enforce the specific rules we’re given. We believe the best approach is to define accuracy as the consistent application of rules by any tool, process, or technology.
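One simple way to quantify "accuracy as consistency" is agreement rate: how often independent reviewers (human or automated) reach the same verdict on the same item under the same ruleset. This is a generic sketch of that metric, not Clavata's actual methodology; all names are invented:

```python
from itertools import combinations

def agreement_rate(verdicts_per_item):
    """verdicts_per_item: a list of lists, one inner list of independent
    verdicts per content item. Returns the fraction of reviewer pairs
    that agreed."""
    agree = pairs = 0
    for verdicts in verdicts_per_item:
        for a, b in combinations(verdicts, 2):
            pairs += 1
            agree += (a == b)
    return agree / pairs

# Three reviewers judged three items against the same policy.
verdicts = [
    ["remove", "remove", "remove"],  # unanimous
    ["remove", "allow", "remove"],   # one dissent
    ["allow", "allow", "allow"],     # unanimous
]
print(agreement_rate(verdicts))  # 7/9 ≈ 0.778
```

A tool that applies the rules consistently will score near 1.0 against itself and against well-trained reviewers; a low score signals ambiguous rules or inconsistent enforcement, regardless of what "ground truth" is.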
Proactive
Work on online safety for a minute and you quickly learn that user reports on content leave much to be desired. When trained moderators can’t apply a platform’s policies consistently, it seems almost silly to assume the platform’s users will get it right.
Let's get proactive instead. The ideal has always been to review every piece of content. And while that was infeasible in the past (for reasons of cost, time, or resources), that's no longer the case. Going forward, we should all strive for 100% proactive coverage. Even if we fall a bit short, it's far better than the status quo.
Scalable
Generative AI will continue to drive the proliferation of content, and the associated risks will grow proportionally. As such, future safety solutions must be scalable (with economies of scale decreasing their cost per review).
Human expertise, informed by life experience, will continue to play a crucial role, but we can leverage that expertise far more efficiently. Human moderators should be empowered by, and, importantly, in control of, AI tools that let them do the work of a thousand reviewers. Yes, there will still be cases where a human moderator has to review a piece of content, but if implemented well, those cases will be worth the reviewer's precious time.
At Clavata.ai, we’ve pioneered a new approach to online safety based on the principles we’ve laid out above.