Why businesses need to build an AI Quality Flywheel

A huge part of making LLMs useful is Reinforcement Learning from Human Feedback (RLHF).

Think of it like this: you're teaching a puppy a new trick. You don't just give it a textbook on "fetching" and expect it to figure it out. You show it, you guide it, and you give it feedback – "Good dog!" when it brings back the ball, "No, not the cat!" when it, well, you get the idea.

RLHF is the same principle, but with algorithms instead of puppies, and more data. We show the LLM examples, we ask it questions, and then we give it feedback on its responses. A large part of this work for ChatGPT's precursor, InstructGPT, was done by crowd workers and contract workers. This feedback loop is what helps the LLM learn to be more helpful.
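Mechanically, that "Good dog!" signal usually ends up training a reward model on human preference pairs. Here's a minimal sketch of that step, assuming you already have chosen-vs-rejected response pairs and some way to embed them; the `RewardHead` and the random embeddings are placeholders, not any particular production setup.

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Maps a fixed-size text embedding to a scalar reward."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)

def pairwise_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the chosen response's reward above the rejected one's.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: random embeddings stand in for encoded (prompt, response) pairs.
model = RewardHead()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
chosen_emb, rejected_emb = torch.randn(32, 768), torch.randn(32, 768)
loss = pairwise_loss(model(chosen_emb), model(rejected_emb))
loss.backward()
optimizer.step()
```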

Scaling RLHF to massive models, multiple languages and domains, and millions of users is especially hard for business functions whose customer data sits across fragmented teams and platforms. It's the bottleneck holding back the promise of LLMs for business use cases.

So here are the top bottlenecks in scaling RLHF for LLMs:

The Data Deluge (and Drought): This is the big one. RLHF needs human feedback data. Lots of it. And not just any data; it needs high-quality data. That means clear instructions, consistent ratings, and examples that cover business-specific scenarios.

  • The Deluge: For some tasks, we're drowning in data. Think of all the user interactions we have across Google's products. But most of that data is unstructured, noisy, and unlabeled. It's like trying to find a needle in a haystack the size of Mount Everest.

  • The Drought: For other tasks, especially in specialized domains or less common languages, we're facing a data desert. Try finding enough expert-level feedback to train an LLM on, say, quantum physics in Swahili. Good luck.

  • The Problem: Even when there is a fair amount of data, getting it into a format you can actually use is hard.

The Human Cost (and Time): Human feedback isn't free. Even if you're using crowdsourcing platforms, you're still paying people for their time and expertise. And if you're relying on skilled support agents or subject matter experts, the costs can skyrocket.

  • Time is Money: Gathering enough high-quality feedback data can take weeks or months. That's an eternity in the fast-moving world of AI. We need to be able to iterate and improve our models much faster.

  • Human Factors: It's not just about headcount; you also need the right humans, raters with the skills and context for the task.

The Quality Conundrum: Not all human feedback is created equal. People are inconsistent, biased, and sometimes just plain wrong. If you're training your LLM on flawed feedback, you're going to end up with a flawed LLM; one quick way to quantify the problem is sketched after the list below.

  • Subjectivity: Many tasks, especially those involving creativity or nuanced judgment, are inherently subjective. What one person considers a "good" response, another might consider "meh."

  • Bias: Human raters can bring their own biases and preconceptions to the task, leading to unfair or inaccurate feedback.

  • Error: Even well-meaning, well-trained raters can make mistakes. It's human nature.
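Here's that sketch: measure inter-rater agreement before trusting a labeled set. The function below computes Cohen's kappa for two raters over the same items; the rating arrays are invented for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance; 1.0 = perfect, 0 = random."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters labeled at random with their own marginals.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] / n * counts_b[c] / n for c in counts_a)
    return (observed - expected) / (1 - expected)

rater_a = ["good", "good", "bad", "good", "bad", "good"]
rater_b = ["good", "bad",  "bad", "good", "good", "good"]
print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")  # well below 1.0 means noisy labels
```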

The Scaling Spiral: The bigger your LLM, the more feedback data you need. The more feedback data you need, the more humans you need. The more humans you need, the more quality control you need. It's a vicious cycle that can quickly spin out of control.

  • Complexity: Managing a large-scale RLHF operation is a logistical nightmare. You need tools for task assignment, data collection, quality assurance, and model retraining.

  • Infrastructure: You need the computational resources to process all that data and retrain your models. And that's not cheap.

The consequences: users get worse experiences, the business eats unnecessary costs, and the model improvements we need never ship.

So how can one address these bottlenecks?

I don't have all the answers, but having worked in this space and taken a deep dive into industry-leading practices, here are some paths for scaling RLHF that are working in practice:

Smarter Sampling: Instead of randomly throwing data at human raters, we're using AI to identify the most informative examples. This is called "active learning," and it's like focusing your study time on the topics you're least confident about; a sketch of the first two strategies follows the list below.

  • Uncertainty Sampling: Target examples where the LLM is least sure of its answer.

  • Disagreement Sampling: Look for cases where different models (or different versions of the same model) give conflicting outputs.

  • Expected Model Change: Prioritize examples that are likely to have the biggest impact on the model's parameters.
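Here's that sketch, assuming each candidate example comes with per-class probabilities from one or more models; the probability arrays are illustrative, not real model outputs.

```python
import numpy as np

def uncertainty_scores(probs: np.ndarray) -> np.ndarray:
    """Entropy of each example's predicted distribution; higher means the model is less sure."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def disagreement_scores(probs_a: np.ndarray, probs_b: np.ndarray) -> np.ndarray:
    """Total variation distance between two models' predictions; higher means more conflict."""
    return 0.5 * np.abs(probs_a - probs_b).sum(axis=1)

def pick_for_review(scores: np.ndarray, budget: int) -> np.ndarray:
    """Indices of the `budget` examples most worth sending to human raters."""
    return np.argsort(-scores)[:budget]

probs_a = np.array([[0.9, 0.1], [0.55, 0.45], [0.5, 0.5]])
probs_b = np.array([[0.85, 0.15], [0.2, 0.8], [0.45, 0.55]])
print(pick_for_review(uncertainty_scores(probs_a), budget=2))     # most uncertain examples
print(pick_for_review(disagreement_scores(probs_a, probs_b), 2))  # biggest model disagreements
```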

AI-Assisted Labeling: We're not just relying on humans to do all the work. We're using AI to help them; a small sketch follows this list.

  • Pre-labeling: Use a smaller, faster model to provide initial labels or suggestions, which humans can then refine or correct.

  • Active Tooling: Provide user interfaces that make the labeling process easier and more efficient, like Shadow Hawk or other in-house tools.

  • ML Linters: Use AI to check for potential errors or inconsistencies in human-provided labels, like a grammar checker for feedback.
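Here's that sketch: pre-labeling plus a simple ML linter. The `PreLabel` structure, the thresholds, and the queue names are assumptions standing in for whatever your labeling stack actually provides.

```python
from dataclasses import dataclass

@dataclass
class PreLabel:
    example_id: str
    suggested_label: str
    confidence: float

def route(prelabel: PreLabel, auto_accept_at: float = 0.95) -> str:
    """Send confident suggestions to spot-check only; everything else gets full human review."""
    return "spot_check_queue" if prelabel.confidence >= auto_accept_at else "human_queue"

def lint(human_label: str, prelabel: PreLabel, flag_at: float = 0.9) -> bool:
    """Flag a human label that contradicts a high-confidence model suggestion for a second look."""
    return prelabel.confidence >= flag_at and human_label != prelabel.suggested_label

p = PreLabel("ex-42", "helpful", 0.97)
print(route(p))                # spot_check_queue
print(lint("not_helpful", p))  # True -> ask another rater to adjudicate
```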

Data Augmentation: We're finding ways to get more mileage out of the data we do have; a back-translation sketch follows this list.

  • Synthetic Data: Use AI to generate new examples that are similar to the real data, but with controlled variations.

  • Back Translation: Translate data from one language to another and back again, to create new training examples.

  • Paraphrasing: Use AI to rephrase existing examples in different ways, to increase the diversity of the training data.
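Here's the back-translation sketch. The `translate` function is a stand-in for whatever MT model or API you use; the dummy version below just tags the text so the pipeline runs end to end.

```python
def translate(text: str, src: str, tgt: str) -> str:
    # Placeholder: swap in a real translation model or service here.
    return f"[{src}->{tgt}] {text}"

def back_translate(example: dict, pivot: str = "de") -> dict:
    """Round-trip an example through a pivot language, keeping its original label."""
    round_trip = translate(translate(example["text"], "en", pivot), pivot, "en")
    return {"text": round_trip, "label": example["label"], "source": "back_translation"}

seed = {"text": "The agent resolved my billing issue quickly.", "label": "helpful"}
augmented = [back_translate(seed, pivot) for pivot in ("de", "fr", "ja")]
for ex in augmented:
    print(ex)
```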

Automated Benchmark Mining: Use rater agreement itself to build quality controls; a sketch follows this list.

  • Only use data that humans have 100% agreement on.

  • Use a benchmark threshold, and filter out labelers who do not meet that threshold.

  • Dynamically vary the rate at which benchmark items are mixed into the live labeling stream.
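Here's a sketch of the first two ideas, with invented data structures: promote unanimously agreed items into a benchmark set, then score labelers against it and filter out anyone below a threshold.

```python
def mine_benchmark(items):
    """Keep only items where every rater gave the same label."""
    return {item_id: labels[0] for item_id, labels in items.items() if len(set(labels)) == 1}

def labeler_accuracy(labeler_answers, benchmark):
    hits = [benchmark[i] == ans for i, ans in labeler_answers.items() if i in benchmark]
    return sum(hits) / len(hits) if hits else 0.0

items = {"q1": ["good", "good", "good"], "q2": ["good", "bad", "good"], "q3": ["bad", "bad", "bad"]}
benchmark = mine_benchmark(items)                   # {"q1": "good", "q3": "bad"}
answers = {"q1": "good", "q3": "good"}              # one labeler's responses to benchmark items
keep = labeler_accuracy(answers, benchmark) >= 0.8  # drop labelers below the threshold
print(benchmark, keep)
```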

Golden Label Correction: A small set of expert-verified "golden" labels, used correctly, can be leveraged to correct the rest of the data.
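One way this can work, sketched with made-up data: estimate each rater's accuracy from the golden items, then aggregate everything else with a vote weighted by that accuracy.

```python
from collections import defaultdict

golden = {"g1": "good", "g2": "bad"}                       # small expert-verified set
rater_answers = {
    "alice": {"g1": "good", "g2": "bad", "x1": "good"},
    "bob":   {"g1": "bad",  "g2": "bad", "x1": "bad"},
}

def rater_accuracy(answers):
    hits = [golden[i] == a for i, a in answers.items() if i in golden]
    return sum(hits) / len(hits) if hits else 0.5          # neutral weight if never audited

def corrected_label(item_id):
    votes = defaultdict(float)
    for answers in rater_answers.values():
        if item_id in answers:
            votes[answers[item_id]] += rater_accuracy(answers)
    return max(votes, key=votes.get)

print(corrected_label("x1"))  # "good": Alice's clean golden record outweighs Bob's vote
```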

Dataset Weighting: Weight different datasets by their estimated error rates, so noisier sources contribute less to training.
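A minimal sketch, assuming you've already estimated an error rate per data source (say, from golden-label audits):

```python
error_rates = {"expert_reviews": 0.05, "crowd_batch_a": 0.20, "crowd_batch_b": 0.35}

def source_weights(error_rates):
    raw = {src: 1.0 - err for src, err in error_rates.items()}  # simple choice: weight = accuracy
    total = sum(raw.values())
    return {src: w / total for src, w in raw.items()}           # normalize so weights sum to 1

print(source_weights(error_rates))
```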

Increasing Dataset Size: This one helps with pretty much all ML models, in general.

Individual Feedback and Modeling:

  • Use expert-labelled data to correct for individual rater biases; a small calibration sketch follows this list.

  • Match complex instructions to raters with the training to handle them.
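Here's the calibration sketch, assuming each rater has scored a handful of expert-labelled items on a numeric scale; the numbers are made up for illustration.

```python
expert_scores = {"c1": 4.0, "c2": 2.0}  # expert ratings on calibration items

def rater_offset(rater_scores):
    """Average amount a rater over- or under-scores relative to experts on calibration items."""
    diffs = [rater_scores[i] - expert_scores[i] for i in expert_scores if i in rater_scores]
    return sum(diffs) / len(diffs) if diffs else 0.0

def debias(rater_scores):
    """Shift a rater's remaining scores by their estimated offset."""
    offset = rater_offset(rater_scores)
    return {i: s - offset for i, s in rater_scores.items() if i not in expert_scores}

harsh_rater = {"c1": 3.0, "c2": 1.0, "x1": 2.5}  # consistently scores 1 point low
print(debias(harsh_rater))                        # {"x1": 3.5}
```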

The Future is Feedback (and Automation)

RLHF is a fundamental part of how we're going to build the next generation of intelligent systems. But we need to get smarter and more efficient about how we do it. We need to break the bottlenecks, automate the boring stuff, and focus our human energy on the tasks that really require human judgment and creativity.

The way to do this is to build an AI Quality Flywheel. It's a system that combines all of the above strategies into a self-reinforcing cycle of data generation, evaluation, and model retraining.

  • AI helps humans: Pre-labeling, active tooling, and linters make the human feedback process faster and more accurate.

  • Humans help AI: High-quality human feedback data is used to retrain and improve the LLM.

  • The flywheel spins: The improved LLM generates better outputs, which leads to more efficient data collection, which leads to even better models, and so on.
