Why Every Product Manager Needs to Be Reading AI Research Papers
I'm usually not one to dictate what others should read. But in the world of product management, where the hype cycle spins faster than a GPU training the latest LLM, there's one thing I feel strongly about: If you're building AI-powered products, you need to be reading AI research papers.
These lessons come from personal experience building and launching AI products at Google (three of which ended up center-stage at Google’s company-wide TGIF). Others come from observations and rumblings I’ve heard from software engineers and researchers who have strong opinions about the PMs they work with.
Here are four reasons why:
1. AI Literacy Is Table Stakes
Imagine trying to build a house without knowing the difference between a foundation and a roof. That's what it's like building AI products without a solid grasp of the underlying technology.
Reading research papers arms you with the vocabulary and conceptual framework to communicate effectively with your engineering and research teams. You'll understand why "supervised fine-tuning" is not the same as "parameter-efficient fine-tuning," and why confusing "LLMs" with "LAMs" will make you sound like you're speaking gibberish.
More importantly, you'll gain a deeper understanding of what these models can and cannot do, allowing you to set realistic expectations and build products that deliver on their promises.
2. Unmasking the Hype Cycle: Seeing Beyond the Shiny Demos
Headlines scream about the latest AI breakthroughs, but they rarely delve into the limitations and potential pitfalls. Research papers, on the other hand, often highlight areas where models struggle, provide context for their performance, and discuss avenues for future research.
One paper I found particularly insightful explored why large vision models trained on synthetic data were developing distorted "mental concepts" of everyday objects like a chair. The researchers had hypotheses for why this happens, but the takeaway for a PM is to understand the limitations of over-relying on synthetic data, and to recognize when those limitations start harming output. Understanding these nuances, the known unknowns, will make you a more credible and effective product leader.
3. From Overpromising to Overdelivering: The Dangers of AI Ignorance
I’ll share a story, not about my own team or org, but about an adjacent partner engineering org. During the peak of the post-ChatGPT hype, their PMs were gung-ho about building "autonomous AI" for customer service. The sentiment was that they would eventually capture hundreds of millions in cost savings by creating autonomous customer service agents that could answer every customer query and resolve problems end to end. The savings would come from reducing customer service staff.
Their plan was to train current LLMs on customer service agent behaviors and have those agents teach the AI to handle increasingly complex tasks. In other words, they were asking people to train their own replacements.
The vision itself troubled me as a blatant attempt to replace humans with machines to cut costs, but my biggest problem with their roadmap was the timeline: they projected the autonomous AI would be ready within two years of the project's start.
Many components are needed to bring production-quality autonomous AI to customer service, but one of the biggest gaps this org didn't seem to grasp was that for AI to 'watch' customer service agents, the teams would likely need a Large Action Model and far more data points than one vendor team could generate. Second, those data points would need to be cleaned, processed, balanced, and fed through data pipelines to continuously train the AI; synthetic data might also have to come into play. However, the teams generating the data, typically business operations teams working with third-party vendors whose agents sit in the Philippines, Greece, or Africa, don't have this capability, because their focus is human-to-human customer service.
Overall, it was a highly unrealistic goal. A year later, the partner org had to roll back its ambitious 'autonomous AI' deployment and revert to basic automated workflows covering a fraction of top user issues.
Here are some other sentiments that Google software engineers and AI researchers have shared with me about their PMs:
One AI researcher confided in me that they don't hold their PM in high regard for a simple reason: the PM doesn't seem to understand the underlying tech, and it shows in the decks the PM presents to leadership and sponsors. The PM asks the team to corroborate that narrative, but the research and engineering members often find themselves in the awkward position of wanting to refute some of the claims the PM has made about their own product. Enthusiasm for AI does not make up for a lack of technical depth.
Another software engineer vented to me about how their PM doesn't understand why the LLM is good at answering one customer question but not another on the same topic. In this case, the model could explain how a customer might optimize their Cloud instances for lower costs (e.g. switch to running Ubuntu), but it couldn't say how many of the customer's existing instances were already running Ubuntu. The reason: the engineering team has to manually identify and integrate the backend API call that counts the machine instances running Ubuntu; the model can't conjure that number from its training data.
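To make that distinction concrete, here is a minimal sketch of the kind of tool integration involved. Everything in it is hypothetical (the function names, the inventory lookup, the keyword routing); the point is simply that the first question can be answered from the model's general knowledge, while the second requires a live backend call that someone has to wire up by hand.

```python
# Hypothetical sketch, not a real API: general advice vs. account-specific facts.

def count_instances_by_os(customer_id: str, os_name: str) -> int:
    """Stand-in for the backend call engineers must integrate manually.
    In reality this would query the cloud inventory service; here it is stubbed."""
    fake_inventory = {"cust-123": {"Ubuntu": 42, "Windows": 7}}
    return fake_inventory.get(customer_id, {}).get(os_name, 0)

def call_llm(prompt: str) -> str:
    # Placeholder for whatever model endpoint the team actually uses.
    return f"[model response to: {prompt}]"

def answer(question: str, customer_id: str) -> str:
    q = question.lower()

    # Question type 1: general cost-optimization advice.
    # The model's pretrained knowledge is enough; no customer data needed.
    if "optimize" in q:
        return call_llm(f"Give cost-optimization advice: {question}")

    # Question type 2: an account-specific fact the model has never seen.
    # The number must come from a backend call, then be handed to the model.
    if "how many" in q and "ubuntu" in q:
        n = count_instances_by_os(customer_id, "Ubuntu")
        return call_llm(f"The customer has {n} Ubuntu instances. Answer: {question}")

    return call_llm(question)

print(answer("How can I optimize my Cloud instances for cost?", "cust-123"))
print(answer("How many of my instances are running Ubuntu?", "cust-123"))
```

Without that second code path, the model either refuses or hallucinates a number, which is exactly the behavior the PM couldn't explain.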
Blind optimism, fueled by a lack of technical understanding, is a recipe for disaster. It leads to overpromising and underdelivering, and ultimately erodes trust in your leadership and the products you build.
4. Evaluating Quality: From Esoteric Benchmarks to User-Delighting Experiences
Evaluating the quality of AI output is an ongoing challenge. Your executive stakeholders and sponsors won't know whether your model's "95% factuality" score is a good result or not. Is it expected? Is it the best you can get? Or is your fine-tuned model underperforming against expectations? Research papers can provide valuable insight into the latest evaluation techniques and benchmarks, helping you set realistic goals and track progress.
But don't get lost in the world of F1 scores and BLEU scores. Remember, your ultimate goal is to build products that delight users. It's your job as an AI PM to bridge the gap between esoteric research metrics and the user experience, demonstrating the real-world impact of your AI-powered product. At the end of the day, your product is there to serve users and improve their lives, not subject them to testing out a fancy LLM.
That said, you’ll often combine user-centric metrics like CSAT, NPS, average handle time, average resolution time, engagement and session duration, DAUs, and MAUs with the AI research benchmarks to give a full picture of how your launch is going. It's best to ground your model-centric metrics in the ones AI research actually uses.
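As a rough illustration, here is a minimal sketch of what such a combined launch scorecard could look like. Every metric name, number, and target below is made up for the example; real targets would come from your own baselines and from the benchmarks reported in the papers you're reading.

```python
# Hypothetical launch scorecard: model-centric benchmarks next to user-centric metrics.
# All values and targets are illustrative placeholders.

scorecard = {
    "model_metrics": {                      # grounded in research-style benchmarks
        "factuality": {"value": 0.95, "target": 0.90},
        "answer_f1":  {"value": 0.81, "target": 0.85},
    },
    "user_metrics": {                       # what stakeholders and users actually feel
        "csat":                {"value": 4.2,   "target": 4.5},   # out of 5
        "avg_handle_time_min": {"value": 6.3,   "target": 5.0},   # lower is better
        "weekly_active_users": {"value": 12400, "target": 10000},
    },
}

def flag_gaps(scorecard: dict) -> list[str]:
    """Return every metric that misses its target, from either bucket."""
    gaps = []
    for bucket, metrics in scorecard.items():
        for name, m in metrics.items():
            lower_is_better = "handle_time" in name  # crude convention for this sketch
            missed = m["value"] > m["target"] if lower_is_better else m["value"] < m["target"]
            if missed:
                gaps.append(f"{bucket}/{name}: {m['value']} vs target {m['target']}")
    return gaps

print("\n".join(flag_gaps(scorecard)))
```

The value of something this simple is that it forces the model-centric numbers and the user-centric numbers into the same conversation, which is exactly the bridge described above.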
In the rapidly evolving world of AI, product management is no longer just about building the right product—it's about understanding the very building blocks of innovation. And that starts with reading the research. So grab a cup of coffee, open up arXiv, and start exploring the fascinating world of AI research. Your users (and your engineering team) will thank you.