Why LLMs Won't Replace SWEs
Obama recently commented on the rapid progression of AI, stating: “AI can code better than 60% to 70% of human coders.” While AI advancements are indeed impressive, much of the discourse around them reads more like science fiction than reality. We should remember that history is full of failed predictions from prominent people. Cognitive biases, such as the “availability heuristic” (overweighting whatever is most recently visible) and “optimism bias” (overconfidence in positive outcomes), affect even the smartest tech-types. Moreover, tech CEOs have financial incentives to hype AI: VC funding for AI startups surged to $42.5 billion in 2023, and companies like OpenAI and Anthropic are in a race for dominance.
While AI is genuinely revolutionary, SWEs won't be replaced anytime soon due to architectural limitations. One major problem with transformer-based models is context: the information a model can attend to within a single conversation. Large enterprise codebases often span millions of lines of code. Assuming an average of 5 tokens per line, even a one-million-line codebase would require a 5-million-token context window to be read in full, far beyond the 200k-token limit of Claude 3.7 Sonnet. Even with techniques like hierarchical chunking and context management, performance degrades as context length grows: perplexity (a measure of the model's predictive uncertainty) rises, and computational costs scale quadratically with context length, because self-attention compares every token against every other token.
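To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch. The 5-tokens-per-line figure is the rough assumption from above, not a measured value, and the "attention cost" is simply the count of pairwise token comparisons in vanilla self-attention, ignoring constant factors:

```python
# Back-of-the-envelope sketch: why whole-codebase context is impractical.
TOKENS_PER_LINE = 5        # assumed average tokens per line of code
CONTEXT_LIMIT = 200_000    # Claude 3.7 Sonnet's context window

def tokens_needed(lines_of_code: int) -> int:
    """Estimate the tokens required to hold a codebase in context."""
    return lines_of_code * TOKENS_PER_LINE

def attention_pairs(n_tokens: int) -> int:
    """Vanilla self-attention compares every token with every other
    token, so compute grows as n^2 (constant factors ignored)."""
    return n_tokens ** 2

for loc in (10_000, 100_000, 1_000_000):
    n = tokens_needed(loc)
    fits = "fits in" if n <= CONTEXT_LIMIT else "exceeds"
    print(f"{loc:>9,} LOC -> {n:>9,} tokens ({fits} the 200k window), "
          f"~{attention_pairs(n):.1e} pairwise attention comparisons")
```

Even a modest 100,000-line project already overflows the window under this assumption, and the quadratic term is why simply shipping bigger windows gets expensive fast.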
Retrieval-Augmented Generation (RAG) mitigates this by retrieving only the relevant pieces of information from external data sources and injecting them into the prompt. But LLMs still struggle to autonomously solve tasks with long inferential distances, such as beating a video game or writing a large piece of software. For example, AI agents like Claude 3.7 playing Pokémon frequently get stuck in loops or fail to adapt to edge cases; their play isn't even comparable to a human player's.
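The retrieval step itself is simple to sketch. Below is a toy illustration: the bag-of-words `embed()` and the three-chunk corpus are hypothetical stand-ins (a real pipeline would use a learned embedding model and a vector database), but the shape is the same: rank chunks by similarity to the query and feed only the top-k into the model's limited context:

```python
# Toy RAG retrieval sketch. embed() is a stand-in for a real embedding model.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query; only these,
    not the whole codebase, go into the model's context."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Hypothetical indexed "codebase" chunks.
chunks = [
    "def parse_config(path): reads the YAML config file",
    "def connect_db(url): opens a pooled database connection",
    "def render_template(name, ctx): fills an HTML template",
]
print(retrieve("where is the database connection opened?", chunks, k=1))
```

Retrieval keeps the prompt small, but notice what it doesn't do: it can't chain together dozens of interdependent steps on its own, which is exactly where long-horizon tasks break down.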
LLMs are great tools for coding, but they cannot fully replace a coder, for the reasons above. In a recent project in C, I encountered a confusing bug that Claude failed to fix. It got stuck in a self-reinforcing failure loop, creating more bugs and hallucinating fixes. Generally speaking, LLMs are terrible at low-level programming in C and Rust. Even when they do produce a working program, they create technical debt and unoptimized code. But for high-level programming (in JS and Python, say) and frontend development, LLMs can really shine.
AI progress is undeniable, and research interest is skyrocketing. Labs like DeepMind, OpenAI, and Anthropic are pushing the boundaries of model capabilities. Once AI models can independently solve novel, complex problems that would take human researchers years of effort, we should be deeply concerned. Until then, claims of SWE job displacement are premature.