Introduction
The year is 2030. The latest company to get listed on NASDAQ has just 2 employees. There is a CEO and a CTO, and they are supported by a team of over a thousand agents. That seems quite dystopian, right? Agreed. Whether AI Agents are going to augment human productivity (and hence the ROI on hiring human employees) or replace the human workforce, we don't know yet! However, one thing is quite clear: the ratio of AI agents to human employees in any organization is going to increase by at least 100x.

With that being said, let's talk about which functions are going to become the beachheads for these AI Agents (you can even call them AI Colleagues). In 2022, we saw companies like Jasper emerge in the Content Writing space. Then in 2023, we saw some companies breaking out in the Enterprise Search space. In 2024, as LLMs have become more powerful, we are seeing a bunch of companies scaling into the AI Sales Agent space. What is the common thread between these three spaces? All of these tasks are "self-contained". For example, one search query is a one-off task whose output is generally independent of the broader context in which the user is making that query. A good trick for checking whether a task is "self-contained" is to ask yourself: "Can I outsource this to an intern?" This question also makes an indirect claim: the AI Agents deployed today are not good enough replacements for full-time human employees. But what is the difference between a full-time employee and an intern? It's mostly the organizational context. Interns don't have context.

2025 will be the year when AI Agents expand beyond "self-contained" tasks and can drive large projects end-to-end. However, to keep the discussion focused, let's talk about just one function: software development. It is a huge market, much larger than what we've historically seen in developer tooling. Tools like GitHub Copilot significantly enhance the productivity of developers. But why haven't we seen a tremendous increase in the throughput of software consultancy firms? Because developers are still present in the value chain of building software. If we draw a crude parallel with Ford's assembly line, adding GitHub Copilot does not reduce the number of stations in the line; it merely makes every station more efficient. The end-to-end autonomous development wave that is going to hit us in the near future will be akin to a huge 3D printer capable of producing cars at a much faster rate than traditional assembly lines.

Now that we have some imagery to describe the quantum of impact GenAI is going to bring with end-to-end autonomous development, it's time to dive deeper. In this blog, we will try to answer three questions:
- Why is autonomous development unsolved yet? What are the major challenges?
- What are the different approaches for autonomous development?
- What are their pros and cons? And, which approach seems more promising?
Challenges of end-to-end autonomous development
We started this blog with the concept of "self-contained" tasks. Today, SOTA models like GPT-4 are quite adept at solving "self-contained" coding tasks. Let's understand with some examples:
- GitHub Copilot was launched back in 2021, and the initial offering was mostly around generating pieces of code. Partial code generation requires relatively little context because you do not have to think about how the entire project is organized and how different pieces will interact with each other. You just have to solve a very small, isolated problem that the user has asked you to solve.
- Then came a lot of point solutions for things like writing test cases, generating documentation, etc. While code generation sits too far upstream, testing and documentation sit too far downstream! Did you notice the pattern? GenAI-based solutions were able to make an early impact in peripheral areas of software development. These peripheral tasks are "self-contained"!
Challenge 1: Having the right context
That brings us to the first challenge of E2E autonomous development: understanding the whole project, or having the complete context. The word "context" may trigger a quick-and-dirty solution in your mind: LLMs with a context window large enough to ingest the entire codebase in a single call. There are multiple loopholes in this solution:
- Larger context windows lead to a drop in performance: In a paper from Stanford titled "Lost in the Middle," it is shown that state-of-the-art LLMs often encounter difficulties in extracting valuable information from their context windows, especially when the information is buried in the middle portion of the context.
- The problem of unwanted imitation: When using the model to generate code, typically we want the most correct code that the LM is capable of producing, rather than code that reflects the most likely continuation of the previous code, which may include bugs.
- A big part of the context exists outside the codebase – in Jira Issues, PRDs, etc.

[Figure: Having the right context]

The more common workaround, as the figure suggests, is to retrieve only the pieces of the codebase relevant to the task at hand (e.g., via RAG) and feed just those to the LLM; a minimal sketch follows the list below. This approach has two immediate advantages:
- A smaller context window is sufficient as we are not sharing the entire code.
- Since we are not sharing the entire code, the existing bugs are also not getting fed into the LLMs.
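Here is a minimal sketch of that retrieval step, with a hypothetical `embed()` function standing in for whichever embedding model you use (this is the general pattern, not any specific product's pipeline): chunk the codebase, rank chunks by similarity to the task, and prompt with only the top-k.

```python
import math

def embed(text: str) -> list[float]:
    """Hypothetical placeholder: call any embedding model, return its vector."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def build_context(task: str, chunks: list[str], k: int = 5) -> str:
    """Build a prompt from only the k code chunks most relevant to the task."""
    task_vec = embed(task)
    ranked = sorted(chunks, key=lambda c: cosine(task_vec, embed(c)), reverse=True)
    # Only the selected chunks reach the LLM, so the context stays small
    # and unrelated (possibly buggy) code never enters the prompt.
    return "\n\n".join(ranked[:k]) + f"\n\n# Task:\n{task}"
```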
Challenge 2: Personalized code generation and Inverse Scaling
LLMs are great at suggesting the most acceptable solution because that solution was more frequently observed in the training dataset. But sometimes, the most acceptable solution may not be the correct solution. For example, in a world where APIs keep getting deprecated regularly, LLMs are at a natural disadvantage because they have been trained more on the deprecated APIs than on the latest ones. See the following screenshot, where GPT-4o was asked for a simple script to determine the row height of a string with newlines in PIL (an image library in Python):
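To make the failure mode concrete, here is a minimal sketch of the PIL task in question (assuming Pillow 10+). The `draw.textsize()` call that dominates older training data was deprecated in Pillow 9.2 and removed in Pillow 10; the current API is `multiline_textbbox()`:

```python
from PIL import Image, ImageDraw, ImageFont

text = "first line\nsecond line\nthird line"
font = ImageFont.load_default()

img = Image.new("RGB", (400, 200), "white")
draw = ImageDraw.Draw(img)

# What models trained on older code tend to suggest (removed in
# Pillow 10 -- raises AttributeError today):
# width, height = draw.textsize(text, font=font)

# The current API: compute the bounding box of the multiline string.
left, top, right, bottom = draw.multiline_textbbox((0, 0), text, font=font)
total_height = bottom - top
row_height = total_height / len(text.split("\n"))
print(f"total height: {total_height}px, approx. per row: {row_height:.1f}px")
```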
Challenge 3: The Last Mile Problem
Around mid-2023, we started seeing some cool demos of agents that can do end-to-end development. Supercoder from SuperAGI and GPT-Engineer were some of the early players in this area. However, most of these projects were trying to create a new application from scratch. In other words, they were solving for the first mile. Creating a new application looks cooler than making incremental changes, but the latter is the more frequent use case and hence has a larger market size. These projects aimed at creating from scratch because it is the easier use case: the LLMs are free to choose their own stack and design paradigms.

However, in 2024, we are seeing startups that are more focused on making changes/additions to existing applications. The reason this is happening so late is that it was simply not possible for LLMs to understand a huge codebase and build on top of it. But today, we have smart models with large enough context lengths, and we also have techniques like RAG and repository maps to aid code generation in existing projects.

But all of the above still doesn't guarantee that the incremental changes made to the codebase will work on the first try. An interesting idea (reiterated by Andrej Karpathy here) is the concept of flow engineering, which goes past single-prompt or chain-of-thought prompting and focuses on the iterative generation and testing of code.
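Here is a minimal sketch of such a flow-engineering loop, with hypothetical `llm_generate()` and `run_tests()` placeholders (this shows the pattern, not any specific product's implementation): generate a patch, run the tests, feed failures back into the prompt, and retry.

```python
from dataclasses import dataclass

@dataclass
class TestReport:
    passed: bool
    log: str

def llm_generate(prompt: str) -> str:
    """Hypothetical placeholder for a call to your code-generation model."""
    raise NotImplementedError

def run_tests(code: str) -> TestReport:
    """Hypothetical placeholder: apply the patch and run the test suite."""
    raise NotImplementedError

def flow_engineer(task: str, max_iters: int = 5) -> str | None:
    """Iteratively generate and test code until the suite passes."""
    prompt = f"Implement the following change:\n{task}"
    for _ in range(max_iters):
        code = llm_generate(prompt)
        report = run_tests(code)
        if report.passed:
            return code  # tests pass: return the working patch
        # Feed the failure log back so the next attempt repairs the
        # specific breakage instead of regenerating blindly.
        prompt += (
            f"\n\nThe previous attempt failed these tests:\n{report.log}"
            "\nFix the code accordingly."
        )
    return None  # budget exhausted; escalate to a human
```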

Are agents the enduring solution to Autonomous Development? Or do we need code-specific models?
Theoretically, using a code-specific model makes sense, especially when we are trying to optimize for production-ready solutions, which must have low latency. When the domain is limited, smaller models have always been able to match the performance of general models. The reason why very few players are going for this approach is two-fold:
- Empirically, we have seen successful apps like Cursor and Devin that are built on top of generic GPT models, not code-specific models.
- Training a new model is a capital-intensive task. The core question here is whether a new team can outpace frontier-model improvements. The base-model space is moving so fast that if you do try to go deep on a code-specific model, you risk a better base model coming into existence and leapfrogging you before your new model is done training.