AWS Distinguished Scientist Dr. Byron Cook talks with Werner Vogels on automated reasoning, neural networks, and the future of secure and provably correct AI systems.
WERNER: It’s been a few years since the last time we spoke about Automated Reasoning. For folks who haven’t kept up since the curiosity video, what’s been happening?
BYRON: Wow, a lot has changed in those three and a half years! There are two forces at play here. The first is how modern transformer-based models can make the powerful but difficult-to-use automated reasoning tools (e.g., Isabelle, HOL Light, or Lean) vastly easier to use, since current large language models are in fact usually trained over the outputs of these tools. The second force is the fundamental (and as yet unmet) need that people have for trust in their generative and agentic AI tools. That lack of trust is often what’s blocking deployment into production.
For example, would you trust an agentic investment system to move money in and out of your bank accounts? Do you trust the advice you get from a chatbot about city zoning regulations? The only way to deliver that much-needed trust is through neurosymbolic AI, i.e., the combination of neural networks with the symbolic procedures that provide the mathematical rigor that automated reasoning enjoys. Here we can formally prove or disprove safety properties of multi-agent systems (e.g., that the bank’s agentic system will not share information between its consumer and investment wings). Or we can prove the correctness of outputs from generative AI (e.g., that an optimized cryptographic procedure is semantically equivalent to the original, unoptimized procedure).
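To make that last example concrete, here is a minimal Lean 4 sketch of what such an equivalence proof looks like. The functions are toy stand-ins invented for illustration (a real cryptographic routine would be far more involved): an “optimized” fastDouble is proven to agree with a reference slowDouble on every input.

```lean
-- Toy sketch of proving an "optimized" routine equivalent to a reference.
-- Both function names are invented for this illustration.

def slowDouble (n : Nat) : Nat := n + n   -- the reference implementation

def fastDouble (n : Nat) : Nat := 2 * n   -- the "optimized" rewrite

-- The correctness theorem: the two functions agree on *all* inputs,
-- i.e., the optimization is semantically equivalent to the original.
theorem fastDouble_correct (n : Nat) : fastDouble n = slowDouble n := by
  unfold fastDouble slowDouble  -- expose both definitions
  omega                         -- linear arithmetic closes the goal: 2 * n = n + n
```

Once Lean’s kernel accepts the theorem, the equivalence holds for every input, not just for whatever test cases happened to run.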
With all these advancements, we’ve been able to put automated reasoning in the hands of even more users, including non-scientists. This year, we released a capability called Automated Reasoning checks in Amazon Bedrock Guardrails, which enables customers to prove the correctness of their own AI outputs. The capability delivers up to 99% verification accuracy. That level of accuracy, and the ability to prove it, is critical for organizations in industries like finance, healthcare, and government, where correctness is non-negotiable.
WERNER: You mentioned neurosymbolic AI, which we’re hearing a lot about. Can you go into that in more detail, and explain how it relates to Automated Reasoning?
BYRON: Sure. Generally speaking, it’s the combination of symbolic and statistical methods, e.g., mechanical theorem provers together with large language models. If done right, the two approaches complement each other: think of the correctness that symbolic tools such as theorem provers offer, but with dramatic improvements in ease of use thanks to generative and agentic AI. There are quite a few ways you can combine these techniques, and the field is moving fast. For example, you can combine automated reasoning tools like Lean with reinforcement learning, as we saw with DeepSeek (the Lean theorem prover was in fact created and is led by Amazonian Leo de Moura). You can filter out unwanted hallucinations post-inference, e.g., as Bedrock Guardrails does in its Automated Reasoning checks capability. With advances in agentic technology, you can also drive deeper cooperation between the different approaches. We have some great stuff happening within Kiro and Amazon Nova in this space. Generally speaking, across the AI science sphere, we’re now seeing a lot of teams picking up on these ideas. For example, we see new startups such as Atalanta, Axiom Math, Harmonic.fun, and Leibnitz all developing tools in this space. Most of the large language model builders are also now pushing on neurosymbolic approaches, e.g., DeepSeek and DeepMind/Google.
WERNER: How is AWS applying this technology in practice?
BYRON: To begin with, we’re excited that ten years of proofs about AWS’s most critical building blocks for security (e.g., the AWS policy interpreter, our cryptography, our networking protocols, etc.) now allow us to use agentic development tools with higher confidence, because we can prove the correctness of the changes those tools make. With our existing scaffolding we can simply apply the previously deployed automated reasoning tools to those changes. And the scaffolding continues to grow. For example, this year the AWS security team (under CISO Amy Herzog) rolled out a pan-Amazon whole-service analysis that reasons about where data flows to/from, allowing us to ensure invariants such as “all data at rest is encrypted” and “credentials are never logged.”
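To give a flavor of what such invariants look like once formalized, here is one plausible rendering in temporal logic. This is a sketch in my own notation; the predicate names (atRest, encrypted, logged, credential) are invented for illustration and are not the AWS team’s internal formalism.

```latex
% "All data at rest is encrypted": an invariant that must hold in every state
% (G reads "globally", i.e., at all times).
\mathbf{G}\;\forall d.\ \mathit{atRest}(d) \rightarrow \mathit{encrypted}(d)

% "Credentials are never logged": a safety property over entire executions.
\mathbf{G}\;\forall x.\ \mathit{logged}(x) \rightarrow \lnot\,\mathit{credential}(x)
```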
WERNER: How have you managed to bridge the gap between theoretical computer science and practical applications?
BYRON: I actually gave a talk on precisely this topic a couple of years ago at the University of Washington. The point of the talk was that this is one of Amazon’s great strengths: melding theory and practice in a multiplicative win/win. You of course know this yourself: you came to Amazon from academia and melded advanced research on distributed computing with real-world application, and that changed the game for Amazon and ultimately the industry. We’ve done the same for automated reasoning. One of the most important drivers here is Amazon’s focus on customer obsession. The customers ask us to do this work, and thus it gets funded and we make it happen. That simply wasn’t true at my previous employers. Amazon also has a number of mechanisms that force those who think big (which is easy to do when you work in theory) to deliver incrementally. There’s a quote that inspires me on this topic, from Christopher Strachey:
“It has long been my personal view that the separation of practical and theoretical work is artificial and injurious. Much of the practical work done in computing, both in software and in hardware design, is unsound and clumsy because the people who do it have not any clear understanding of the fundamental design principles of their work. Most of the abstract mathematical and theoretical work is sterile because it has no point of contact with real computing.”
In my experience, the best theoretical work is done under pressure from real-life challenges and events; the invention of the digital computer itself is a prime example. Amazon does a great job of cultivating this environment, giving us just enough pressure to stay out of our comfort zones, but enough space to go deep and innovate.
WERNER: Let’s talk about “trust.” Why is it such an important challenge when it comes to AI systems?
BYRON: Talking to customers and analysts, I think the promise of generative and agentic AI that they’re excited about is the removal of expensive and time-consuming socio-technical mechanisms. For example, rather than waiting in line at the department of buildings to ask questions about, or get sign-off on, a construction project, can’t the city just provide me with an agentic system that processes my questions and requests in seconds? This isn’t job replacement; it’s about helping people do their jobs faster and with more accuracy. It gives access to truth and action at scale, which democratizes access to information and tools. But what if you can’t trust the AI tools to do the right thing? At the scale our customers seek to deploy these tools, they could do a lot of harm to themselves and their customers unless the agentic tools behave correctly, i.e., unless they can be trusted. What’s exciting for us in the automated reasoning space is that the definition of good and bad behavior is a specification, often a temporal specification (e.g., calls to the procedures p() and q() should strictly alternate). Once you have that, you can use automated reasoning tools to prove or disprove the specification. That’s a game changer.
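To show what such a temporal specification looks like on the page, here is one standard way to write the strict-alternation property in linear-time temporal logic. The encoding below is my own sketch for illustration, not output from any particular AWS tool:

```latex
% p = "p() is called in this step", q = "q() is called in this step".
% After each p there is no further p unless a q intervenes (W = weak until),
% and symmetrically for q. Together the conjuncts force strict alternation.
\mathbf{G}\,\bigl(p \rightarrow \mathbf{X}\,(\lnot p \;\mathbf{W}\; q)\bigr)
\;\wedge\;
\mathbf{G}\,\bigl(q \rightarrow \mathbf{X}\,(\lnot q \;\mathbf{W}\; p)\bigr)
```

Given a formula like this, a model checker or theorem prover can decide whether a system satisfies it, or produce a concrete counterexample trace when it doesn’t.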
WERNER: How do you balance building systems that are both powerful and trustworthy?
BYRON: I’m reminded of a quote attributed to Albert Einstein: “Every solution to a problem should be as simple as possible, but no simpler.” When you cross this thought with the reality that the space of customer needs is multidimensional, you come to the conclusion that you have to assess the risks and the consequences. Imagine we’re using generative AI to help write poetry: you don’t need much trust. Now imagine you’re using agentic AI in the banking domain: trust is crucial. In the latter case we need to specify the envelopes in which the agents can operate, use a system like Bedrock AgentCore to restrict the agents to those envelopes, and then reason about the composition of their behavior to ensure that bad things never happen and that good things eventually do happen.
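That last phrase is the classic split between safety and liveness properties, and both are directly expressible in temporal logic (again, my notation, for illustration only):

```latex
% Safety: "bad things never happen": no point of any execution satisfies bad.
\mathbf{G}\,\lnot\,\mathit{bad}

% Liveness: "good things eventually do happen": every execution reaches good.
\mathbf{F}\,\mathit{good}
```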
WERNER: What are the most promising developments you’re seeing in AI reliability? What are the biggest challenges?
BYRON: The most promising developments are the widescale adoption of the Lean theorem prover, the results on distributed solving in SAT and SMT (e.g., the Mallob solver), and the wide interest in autoformalization (e.g., the DARPA expMath program). In my opinion the biggest challenges are: 1/ Getting autoformalization right, allowing everyone to build and understand specifications without specialist knowledge. That’s the domain that tools such as Kiro and Bedrock Guardrails’ Automated Reasoning checks are operating in. We’re learning, doing innovative science, and improving rapidly. 2/ How difficult it is for groups of people to agree on rules and their interpretations. Complex rules and laws often have subtle contradictions that can go unnoticed until someone tries to reach consensus on their interpretation. We’ve seen that within Amazon when trying to nail down the details of AWS’s policy semantics, or the details of virtual networks. You also see this in society, e.g., laws that define copyrightable works as those stemming from an author’s original intellectual creation, while simultaneously offering protection to works that require no creative human input. 3/ The underlying problem of automated reasoning is still NP-complete if you’re lucky, or undecidable, depending on the details of the application. That means scaling will always be a challenge. We see amazing advances in the distributed search for proofs, and in the use of generative AI to give the algorithmic proof search a nudge when it needs one. Really rapid progress is happening right now, making possible what was previously impossible.
WERNER: What are three things that builders should be keeping an eye on in the coming year?
BYRON: 1/ I think that agentic coding tools and formal proof will completely change how code is written. We are seeing that revolution happen in Amazon. 2/ It’s exciting to see the launch of so many startups in the neurosymbolic AI space. 3/ With tools such as Kiro and Automated Reasoning checks, specification is becoming mainstream. There are numerous specification languages and concepts, for example, branching-time vs. linear-time temporal logic, or past-time vs. future-time temporal operators; there are also logics of knowledge and belief, and causal reasoning (see the sketch below for a taste of the differences). I’m excited to see customers discover these concepts and begin demanding them of their specification-driven tools.
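For that taste of the differences, here are three small request/acknowledgment requirements written with different operators. The propositions req and ack are placeholders chosen for illustration:

```latex
% Linear time (LTL): along every execution, each request is eventually acknowledged.
\mathbf{G}\,(\mathit{req} \rightarrow \mathbf{F}\,\mathit{ack})

% Branching time (CTL): from every reachable state, acknowledgment remains possible.
\mathbf{AG}\,(\mathit{req} \rightarrow \mathbf{EF}\,\mathit{ack})

% Past time: an acknowledgment may occur only if a request happened before it
% (P, sometimes written O, reads "at some point in the past").
\mathbf{G}\,(\mathit{ack} \rightarrow \mathbf{P}\,\mathit{req})
```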
WERNER: Last question: What is one thing you’d recommend that all of our builders read?
BYRON: I recently read Creativity, Inc. by Ed Catmull and Amy Wallace, which I found, in many ways, tells a similar story to the journey of automated reasoning. I say this because Pixar’s story is also one of mathematics replacing manual work. It’s about the human and organizational drama it takes to figure out how to do things radically differently. And ultimately, it’s about what’s possible once you’ve revolutionized an old area with new technology. I also loved the parallels I saw between Pixar’s Braintrust and our own principal engineering community here at Amazon. I also think builders might enjoy reading Thomas Kuhn’s The Structure of Scientific Revolutions, published in 1962. We are living through one of those scientific revolutions right now. I found it interesting to see my experiences and feelings validated by historical accounts of similar transformative times.