AI's Evolution: Engineering Smarter Models Beyond Brute Force
The AI industry grapples with the 'bitter lesson' of scaling compute vs. the need for robust engineering. This analysis explores evals, capital flow, and strategic model deployment.
Key Insights
- Insight: Successful AI product development prioritizes robust engineering around models, including effective evals, feedback loops, and testing harnesses, rather than solely relying on the 'smartest' models or brute-force compute. This engineering discipline is crucial for making AI systems reliable and predictable.
  Impact: This shift demands increased investment in MLOps, software engineering best practices, and talent focused on system design, potentially leading to more stable and commercially viable AI applications.
- Insight: The 'bitter lesson' approach of continuously scaling data and compute for AI model improvement is increasingly challenged by the need for system predictability and reliability. There's a fundamental tension between continuous AI (non-deterministic) and discrete systems (predictable, consistent).
  Impact: Businesses must integrate traditional computer science principles with AI research, fostering interdisciplinary teams that can reconcile the probabilistic nature of AI with the deterministic requirements of production systems.
- Insight: Evals (evaluations) are critical for AI development, serving as the 'scientific method' for non-deterministic systems and acting as a modern, declarative Product Requirements Document (PRD). They enable quantitative and qualitative understanding of system behavior and drive iterative improvements.
  Impact: Adopting comprehensive eval frameworks will improve AI product quality, reduce development cycles, and ensure that AI outputs align with business objectives, acting as a competitive advantage.
- Insight: Abundant capital in frontier AI labs enables a 'brute force' approach to model development (throwing more compute/data), which delays the economic incentive for engineering efficiency. However, when incremental model quality gains plateau, significant opportunities arise for engineering to optimize efficiency.
  Impact: This suggests a future market shift from purely 'model-first' companies to those excelling in engineering and efficiency. Investors may begin to scrutinize the sustainability of brute-force scaling as returns diminish.
- Insight: AI agents demonstrate 'comically' better performance when interacting with structured tools like SQL for data management, compared to unstructured environments like Bash. This highlights the importance of grounding AI agents in computer science fundamentals and structured data systems.
  Impact: Developers should prioritize designing agent environments with strong typing, referential transparency, and declarative interfaces (like SQL) to unlock higher accuracy and efficiency, shifting away from naive 'give it a computer' approaches.
- Insight: A cyclical dynamic exists between closed-source frontier models and open-source models. While new commercial models drive initial innovation and mind share, open-source models rapidly approximate their performance, leading to cost-effective alternatives for stable, high-volume use cases.
  Impact: Companies can strategically balance cutting-edge closed-source models for novel applications with optimized, cheaper open-source models for established workflows, leading to significant cost savings and competitive advantage.
- Insight: The speed at which enterprises and consumers can effectively ingest and integrate increasingly advanced AI capabilities may become a limiting factor for AI growth. Demand, while strong, faces human, political, and integration rate limits.
  Impact: Future AI innovation might increasingly focus on improving usability, integration capabilities, and user experience to facilitate broader adoption, rather than just raw model intelligence, to unlock untapped market potential.
Key Quotes
"The companies shipping AI products that actually work aren't using the smartest models. They're the ones with the best engineering around the models. The evals, the feedback loops, the testing harnesses."
"Right now, we are kind of building, you know, like God. And so it's possible and probably economically viable to keep throwing capital at the problem to make God 1% smarter. But when you can't make God 1% smarter, there is like an insane opportunity to engineer God to be more efficient."
"The worst models perform better on SQL than they do on like everything. It feels like this is another dichotomy that's shaping up, which is like there is a whole percentage of population that seems to be like, give it a computer and let us do its thing... and there's another, which is let's give it computer science fundamentals."
Summary
The Brute Force Dilemma: Why AI Needs More Engineering
The artificial intelligence landscape is rapidly evolving, often driven by a 'bitter lesson' mentality: throw more data and compute at a problem, and a smarter model will emerge. While this approach has fueled impressive advancements, it's increasingly evident that sustainable, effective AI product development demands a sophisticated engineering mindset. The real differentiator for companies shipping functional AI isn't necessarily the 'smartest' model, but the robustness of the engineering around those models.
The Evals Imperative: Guiding AI's Non-Determinism
At the heart of this engineering shift lies the concept of 'evals' – essentially the scientific method applied to non-deterministic AI systems. Evals provide a structured way to hypothesize, test, and quantitatively and qualitatively measure the impact of model changes, prompt tweaks, or added context. For product managers, evals become a living, declarative representation of product requirements, ensuring that AI systems evolve with predictability and reliability, rather than just raw power.
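To make the idea concrete, here is a minimal sketch of what such an eval loop might look like. Everything in it is illustrative: `run_model` is a hypothetical stand-in for any LLM call, and the dataset and scorer are invented examples, not from the source.

```python
# Minimal sketch of an eval harness for a non-deterministic system.
# `run_model`, the dataset, and the scorer are all hypothetical
# placeholders for illustration.

def run_model(prompt: str) -> str:
    # Placeholder: in practice this would call a model API.
    return "Paris"

def exact_match(output: str, expected: str) -> float:
    # A simple quantitative scorer; real evals often mix scorers,
    # including qualitative/LLM-based judges.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def run_eval(dataset, scorer):
    """Score every case and aggregate, so that prompt, context, or
    model changes can be compared run-over-run."""
    scores = [scorer(run_model(case["input"]), case["expected"]) for case in dataset]
    return sum(scores) / len(scores)

dataset = [
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "Capital of Japan?", "expected": "Tokyo"},
]

print(run_eval(dataset, exact_match))  # fraction of cases passing, here 0.5
```

The point of the structure is the hypothesis-test loop: change one variable (prompt, model, context), re-run the same dataset, and compare aggregate scores rather than eyeballing individual outputs.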
Capital Flow vs. Engineering Efficiency
The current abundance of venture capital flowing into 'Frontier Labs' allows for an unprecedented scaling of compute and data, effectively delaying the economic pressure to optimize for efficiency. These labs can raise billions and compete on capital rather than engineering efficiency. However, this 'make God 1% smarter' approach has its limits. Once incremental improvements become too expensive, a massive opportunity emerges for engineers to make existing models more efficient and reliable.
Open vs. Closed: A Dynamic Market
The AI market sees a continuous push and pull between open-source and closed-source models. While frontier closed-source models often introduce "step function changes" in capabilities, open-source alternatives quickly approximate their performance. Shrewd businesses are finding significant value and cost savings, particularly for stable, high-volume use cases, in optimizing older open-source models like Llama 3.1, leveraging deep familiarity with their quirks to eke out superior performance and predictability. This suggests that the best model isn't always the newest, but the one best engineered for the specific application.
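The balancing act described above can be sketched as a simple router. The model names and the routing rule here are assumptions for illustration only; real systems would classify requests and consider latency, cost, and quality budgets.

```python
# Illustrative sketch: route novel, capability-pushing work to a frontier
# model and stable, high-volume work to a cheaper, well-understood
# open-source model. Names and categories are hypothetical.

ROUTES = {
    "novel": "frontier-model",    # e.g. a closed-source frontier API
    "stable": "llama-3.1-tuned",  # e.g. a self-hosted, deeply optimized model
}

def route(task_type: str) -> str:
    # Default unknown work to the cheaper open-source path; the costly
    # frontier path must be opted into explicitly.
    return ROUTES.get(task_type, ROUTES["stable"])

print(route("novel"))   # frontier-model
print(route("stable"))  # llama-3.1-tuned
```

The design choice worth noting is the default: making the cheap path the fallback means frontier spend is a deliberate decision per use case, mirroring the cost discipline described in the summary.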
Computer Science Fundamentals: SQL Trumps Bash for Agents
A critical insight highlights the power of computer science fundamentals in AI agent design. Benchmarks comparing Bash and SQL environments for agents reveal a "comical" difference: even less advanced models perform significantly better with SQL. This underscores that providing agents with structured data access and declarative querying capabilities, rather than brute-force Unix environments, enhances their effectiveness and reliability. This duality between simply giving an AI a "computer" and equipping it with "computer science fundamentals" signals a potential golden age for systems engineering in AI.
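A small sketch shows why the declarative interface helps. The schema and data below are invented for illustration; the contrast is that one typed SQL query replaces the fragile `grep`/`awk` text-munging an agent would attempt over flat files in a Bash environment.

```python
# Sketch: giving an agent a declarative, typed interface (SQL) instead
# of a raw shell. Schema and rows are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 120.0), (2, "US", 80.0), (3, "EU", 40.0)],
)

# The agent emits one declarative query; the database enforces the
# schema and semantics, so there is no ad-hoc text parsing to get wrong.
agent_query = "SELECT region, SUM(total) FROM orders GROUP BY region ORDER BY region"
print(conn.execute(agent_query).fetchall())  # [('EU', 160.0), ('US', 80.0)]
```

The equivalent Bash approach would have the agent discover file formats, split fields, and re-implement aggregation by hand at every step, and each step is an opportunity for a weaker model to fail.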
The Demand-Side Limit
Beyond technical and economic factors, a less discussed limiting factor for AI's growth might be consumption itself. The speed at which enterprises and even consumers can meaningfully integrate and utilize increasingly capable AI models could become a bottleneck. While individual usage of tools like ChatGPT and Gemini is widespread, the deeper integration into complex enterprise workflows or the development of truly transformative consumer AI applications is still in its early stages. This suggests a future where engineering efforts might shift not just to making AI smarter or more efficient, but to making it more consumable and integratable into human systems.
Ultimately, the future of AI will be defined not just by raw intelligence, but by the strategic engineering that harnesses it, ensuring reliability, efficiency, and real-world applicability.
Action Items
Prioritize investment in AI engineering infrastructure, including robust evaluation frameworks, feedback loops, and testing harnesses. This ensures reliability, predictability, and continuous improvement of AI products in production.
Impact: Leads to more stable, higher-quality AI applications, reducing operational costs and increasing customer satisfaction through reliable performance.
Adopt a 'throw-away' mentality for the internal components of AI agents (e.g., prompt engineering, context provision). While iterating quickly on these, focus long-term engineering investment on the surrounding evaluation and feedback systems that validate their performance.
Impact: Accelerates agent development and experimentation while maintaining product integrity and quality through systematic validation, enabling rapid adaptation to new models.
For AI agents interacting with data, leverage structured data systems and declarative querying languages like SQL over brute-force shell environments. Design environments that provide agents with computer science fundamentals, not just raw access.
Impact: Significantly improves agent accuracy and efficiency in data-intensive tasks, reducing errors and making AI-powered automation more effective and trustworthy.
For high-volume, stable AI use cases, strategically evaluate and optimize older open-source models instead of always chasing the newest, most expensive frontier models. Deeply understand model quirks and optimize for specific use case performance and cost.
Impact: Achieves substantial cost savings and predictable performance for critical business functions, freeing up resources for innovation in other areas or more complex tasks requiring frontier models.
Product managers for AI products should evolve from traditional PRDs to designing comprehensive 'evals' that declaratively define desired system behavior. These evals should encompass both quantitative metrics and qualitative validation.
Impact: Establishes a more scientific and transparent product development process for AI, bridging the gap between non-deterministic AI capabilities and clear, measurable business outcomes.
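The last action item, evals as a declarative PRD, can be sketched as follows. The requirements and checks here are hypothetical examples; the point is the shape: each product requirement is stated alongside an executable check.

```python
# Hypothetical sketch of an eval set doubling as a declarative PRD:
# each entry pairs a plain-language requirement with a check that can
# be run against model output. Requirements are invented examples.
PRD_EVALS = [
    {"requirement": "answers cite a source",
     "check": lambda out: "http" in out},
    {"requirement": "stays under 50 words",
     "check": lambda out: len(out.split()) <= 50},
]

def validate(output: str) -> dict:
    """Return a pass/fail report keyed by requirement."""
    return {e["requirement"]: e["check"](output) for e in PRD_EVALS}

print(validate("See https://example.com for details."))
```

Unlike a static PRD document, this artifact can be run on every model, prompt, or context change, which is what makes it a living specification rather than a one-time handoff.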
Mentioned Companies
Braintrust
5.0 — Ankur Goyal's company, central to the discussion on evals, agents, and engineering AI products, positioned as a solution provider in this space.
Figma
4.0 — Acquired Ankur Goyal's previous company, Impira, after which he led Figma's AI team, indicating a positive acquisition and career progression.
OpenAI
4.0 — Mentioned for its large models (GPT-5.2, Opus 4.5), significant capital raises, and effective API management/rate limiting, indicating leadership in frontier models.
Datadog
4.0 — Ollie from Datadog provided valuable wisdom on pricing strategies related to cloud spend for Braintrust, indicating a respected and successful business model.
MemSQL
3.0 — Ankur Goyal's prior employer (now SingleStore), relevant to database systems and the discussion of NoSQL's 'head fake' and the value of SQL in enterprises.
Anthropic
3.0 — Referenced for its significant capital raises, illustrating the large investments in frontier model development.
Cursor
3.0 — Cited as an example of an AI product that 'works' due to effective engineering around models, rather than just model intelligence.
Replit
3.0 — Mikaela from Replit is quoted on the common developer desire to discard codebases with new model releases, highlighting a significant industry challenge.