
Apr 14, 2025

Time to Walk Upright: Evolving content safety technology

In an era saturated with digital content, ensuring safety has become paramount. Yet, the tools available to safety professionals are lacking. They don’t scale and can’t keep up with the explosion of online content driven by generative AI adoption.

If the social media era wasn’t enough, hundreds of millions of people are now using GenAI tools like ChatGPT. Ensuring safety across these tools has never been more important – and more difficult. The tools Trust & Safety professionals have today, like keyword lists and prompt engineering, are sorely lacking: they’re hard to use, they’re brittle, and they don’t scale.

LLM prompt engineering is not the answer

On the surface, automated content safety tools have evolved quite a bit over the past decade, from simple keyword lists and regular expressions, to pre-trained, “black box” ML classifiers, to new approaches that use general purpose LLMs.

One of the latest approaches, now becoming common in the field of Trust & Safety and elsewhere, defines content safety policies as LLM prompts written in natural language. This approach strives to mimic human understanding, but it introduces inherent ambiguities and errors, and it leaves already over-burdened safety professionals maintaining large, complex policies.

At Clavata we think differently about how to leverage recent advancements in AI, and we’ve built an innovative safety product that draws on our backgrounds in software engineering and Trust & Safety. Instead of trying to force natural language to do something it wasn’t intended for, Clavata uses short, structured policies that are easy to use, and don’t require our users to become experts at LLM prompt engineering.

As Turing award-winning computer scientist Edsger W. Dijkstra eloquently argued, we should embrace "narrow interfaces" for reliable machine interaction [1]. This philosophy, advocating for precisely defined assertions, finds its embodiment in the structured policy language we’ve created at Clavata.

“The virtue of formal texts is that their manipulations, in order to be legitimate, need to satisfy only a few simple rules; they are, when you come to think of it, an amazingly effective tool for ruling out all sorts of nonsense that, when we use our native tongues, are almost impossible to avoid.” 

- Edsger W. Dijkstra, “On the foolishness of ‘natural language programming’” (EWD667)

Comparing structured policies to prompt engineering

Let’s take a closer look at a publicly available and highly-tuned natural language policy [2] for a common hazard – ‘hate speech’ – and compare it with an equivalent Clavata structured policy.

            Natural Language Policy   Clavata Structured Policy
Lines       41                        17
Words       708                       24
Characters  5113                      326

The structured policy from Clavata is roughly 5% of the size, and thus much easier to read, edit, and understand, with equal or better performance (more on this later). Which of the above would you prefer working with? Which would you trust your global team to have an easier time managing?

Clavata replaces the ambiguity of natural language with a structured approach. It operates on a foundation of meticulously crafted, atomic operations, each addressing a specific aspect of content analysis. Instead of attempting to "understand" the nuances of a text or image, Clavata decomposes content into its fundamental components and subjects them to a series of deterministic checks.
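To make the idea of atomic checks concrete, here is a minimal sketch in Python. It is illustrative only: the check names, combinators, and placeholder terms are invented for this example, and this is not Clavata’s actual policy language.

```python
# Hypothetical sketch: a structured policy as a boolean composition of
# small, atomic checks. Illustrative only -- not Clavata's actual syntax.
from typing import Callable

Check = Callable[[str], bool]

def contains_any(terms: list[str]) -> Check:
    """Atomic check: does the text contain any of the given terms?"""
    lowered = [t.lower() for t in terms]
    return lambda text: any(t in text.lower() for t in lowered)

def all_of(*checks: Check) -> Check:
    """Combinator: every sub-check must match."""
    return lambda text: all(c(text) for c in checks)

def any_of(*checks: Check) -> Check:
    """Combinator: at least one sub-check must match."""
    return lambda text: any(c(text) for c in checks)

# A tiny "policy": flag text that mentions a target group AND uses
# attacking language. The terms below are placeholders.
policy = all_of(
    contains_any(["group_a", "group_b"]),
    any_of(contains_any(["attack"]), contains_any(["slur"])),
)

print(policy("an attack on group_a"))  # True: both conditions match
print(policy("a friendly note"))       # False: neither condition matches
```

Because each check is narrow and deterministic, the overall decision is reproducible and each component can be tested and tuned in isolation.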

Clavata’s structured policies drastically improve performance

A structured policy is not only easier to understand and change, it also outperforms natural language policies on industry-standard metrics like precision, recall, and F1 score. Let’s look at performance in two dimensions: accuracy and cost.

Accuracy

We used a leading LLM, Google Gemini, to evaluate both policies against an open source dataset. We tested against 300 rows of labeled data from the Nvidia Aegis AI Content Safety Dataset [3], half of which were categorized as “hate speech.” We define accuracy as the fraction of items classified correctly, i.e. true positives plus true negatives divided by the total number of items.

Clavata’s accuracy was nearly 10 percentage points higher in detecting hate speech than the natural language policy, while only being a fraction of the size.
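For reference, here is how the standard metrics mentioned above are computed from confusion-matrix counts. The counts in the example call are made-up placeholders for illustration, not the actual evaluation results.

```python
# Standard classification metrics from confusion-matrix counts.
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Example with placeholder counts: 300 items, half labeled "hate speech".
m = metrics(tp=135, fp=10, tn=140, fn=15)
print(m)  # accuracy = (135 + 140) / 300 ~= 0.917
```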

Cost

Clavata’s structured policies are also much more cost-effective than natural language policies. Here’s why: natural language is verbose, so a natural language policy consumes significantly more LLM input tokens, making every evaluation slower and more expensive to process. Both in this test and across the millions of items Clavata has processed for its customers, Clavata’s structured policies evaluate content at half the cost of natural language policies.
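A back-of-envelope sketch of that cost math, using the character counts from the comparison table above. The per-token price and the 4-characters-per-token ratio are illustrative assumptions, not real pricing figures.

```python
# Rough input-token cost comparison: every request sends the policy plus
# the content being evaluated. Price and chars-per-token are assumptions.
PRICE_PER_1K_INPUT_TOKENS = 0.001   # hypothetical $/1K input tokens
CHARS_PER_TOKEN = 4                 # rough average for English text

def input_cost(policy_chars: int, content_chars: int, items: int) -> float:
    """Estimated input-token cost of evaluating `items` pieces of content."""
    tokens_per_item = (policy_chars + content_chars) / CHARS_PER_TOKEN
    return tokens_per_item * items / 1000 * PRICE_PER_1K_INPUT_TOKENS

# Character counts from the table above; 1M items of ~5,000 chars each.
nl_cost = input_cost(5113, 5000, 1_000_000)
structured_cost = input_cost(326, 5000, 1_000_000)
print(f"natural language: ${nl_cost:.2f}, structured: ${structured_cost:.2f}")
# natural language: $2528.25, structured: $1331.50
```

Under these assumptions the structured policy roughly halves the input cost per item; the exact ratio depends on how large the evaluated content is relative to the policy.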

Structured policies are the future of content safety

Clavata represents a paradigm shift in content safety technology: it makes intelligent use of AI without the complexities of natural language policies and prompt engineering. Our approach offers a robust, reliable, and scalable solution for safeguarding digital content in an increasingly complex online world.

This approach ensures that machines are tools that perform precisely what they are told, instead of trying to guess what is intended from ambiguous human input. Clavata is not just a technological advancement in AI, it’s a big step towards a more responsible and trustworthy digital ecosystem. 

Learn more at www.clavata.ai or contact us: hello@clavata.ai.

Special thanks to the following people for contributing to this post: Samuel Failor, Justin Gage, Brett Levenson, Shubhi Mathur, Ivan Tam, Tyler Tate, and Ilias Tsangaris.

Sources: 

  1. https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667.html

  2. https://huggingface.co/spaces/zentropi-ai/cope-demo

  3. https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0
