For three long years I have been building with AI. For three long years I have suffered, and scraped and cried and raged. For three long years I have seen the light at the end of the tunnel and the sun breaking through the clouds. For three long years I have seen the clouds reconvene and blot out the sun. The darkness seems darker when it returns.
You know what I am talking about. I am talking about working with LLMs. Building with AI is a function of the LLM black box, “inference.” Sending a prompt off into the either and hoping that a decent response comes back in return. And the LLMs are good, some of the time! Most of the time! Never all of the time. And when the LLM replies with something decent I have a sense of feeling awestruck. A sense that this really is the future. A sense that all I need to do now is to build. And then I do build. I build an entire subsystem for talking to LLMs and then I start to scale the process. And inevitably, all too frequently, I get back a response that breaks down the entire experience of the app. I can deal with this myself, sometimes, but I know that my users cannot.
I am never someone to surrender to the shortcomings of the machine. And so I begin my efforts to improve the LLM outputs. There are a variety of tools. But ultimately, they all orbit around one key concept: Prompt Engineering. Even hearing the words make me shudder. At the beginning of the current hype cycle prompt engineering was touted as a new profession. Aspiring profesionals would receive “prompt engineering jobs” and write prompts for a living. If only someone could coax the tokens into just the right order and sequence, with just the right tone, then LLMs would dutifully obey and output correctly, consistently. This theory boiled bad LLM outputs down to a skill issue. But, you and I, we’ve been through that. We know that is not our fate. Let us not talk falsely now, LLMs fundamentally operate on a degree of randomness. We are not up against LLMs with our prompts, but rather chaos itself.
By the end of this I will talk about why Confect AI is the tool to tame the chaos of building with AI. But first, I want to dive into the history of Prompt Engineering. Why it was dead on arrival, and why new methods and tools that actually work will be born from its ashes.
What was called “Prompt Engineering” circa 2022 was dead on arrival. Let’s start with a definition. Prompt Engineering was touted as a way to achieve consistent and correct results from AI by “engineering” the text of the prompt fed to it. The premise was that using specific words, or structuring the white space and punctuation, or even the order of instructions given in the prompt was going to allow a prompt engineer to produce better results with better prompts. To some degree, this is true. The prompt matters greatly. But the fundamental principle is flawed. The transformation and vector layer of LLMs are beyond the comprehension of the human mind. Each token in a prompt is represented by a multi-thousand dimension vector. Those vectors are then transformed and computed against themselves creating a data structure far beyond human comprehension. The truth is there is no guarantee for how tokens are going to interact with one another. And therefore managing for determinism at the prompt layer alone is never going to work. Relying solely on prompt engineering to control an LLM’s output is like trying to change tomorrow’s weather by blowing really hard into the sky. You might create a tiny breeze, but the vast atmospheric forces at play will ultimately carry on, mostly indifferent to your effort.
Let me try to put this into a more concrete example. We are familiar with functions, like X + 5 = A, where A will always be the addition of 5 to whatever you give for X. 10 + 5 = 15. 100 + 5 =105. But this is not how AI operates at all. Instead, the text inputs are going to be converted into tokens, and then each token into a vector, (TODO: link here) and then those vectors will be transformed. Practically, this means that the function is not made up of discrete components, like "X" and "5." Instead, the function will be taken as a whole. The value of X will have unknowable side effects on the value of "5." No two values of X will ever vectorize in the same way. And thus, the most perfectly engineered prompt to the human mind will still suffer from the inevitable chaos caused by varying inputs at scale.
Engineers working with AI quickly observed this big problem when building. They would give the same prompt to the same LLM multiple times, and receive multiple outputs. To make matters worse, most prompts in systems are templates that have a few slots or “variables.” They would spend hours fiddling with the prompts and their set of test inputs to achieve something that seemed consistent and reliable. Then they would deploy their prompt to production, with its many unknowable inputs, and inevitably feel the pain of a weird and breaking output. They'd start to play whackamole, fixing the prompt for one input, and just to have it break with a previously working input. Time-consuming at best, the prompts never would reach their desired level of consistency.
Ultimately, Prompt Engineering v1 was another face of alchemy. An attempt to transmute one material into another, overcoming essential differences using a process that humans cannot understand. Is it possible to turn lead into gold? Possibly, the universe created all of the elements from something. We can even theorize how it might be done. But the theory always breaks down in practice, or requires more resources than would make the effort worthwhile. But the problem persists. In order to build systems with an AI layer, we need reliability from LLM outputs. And the first step of wisdom on this path is to accept the unknowable details of what is going on within the LLM. We know how they work. We know how they process inputs, but even our best tools only scratch the surface of what is actually happening at runtime for any given input.
Producing repeatable outputs that conform to our specifications remains as a problem. And so we must continue to work to solving the randomness problem. I hope you understand why prompt engineering as massaging the prompt itself was never going to work. So what will work? Let's start with the basic principle: the more that a prompt changes, the more its tokens will change and thus the more its vectors will change, etc... all the way to the output. And so the best prompt is a prompt that doesn't change at all. We want our prompts to change as little as possible. But we need AI because the inputs are constantly changing. If the inputs didn't change, we would preprocess an output and never have to use an LLM. The first instinct, which is correct, is prompt chaining. The second instinct, which is more correct, is Prompt Engineering v2.
Prompt Chaining derives from principle of Divide and Conquer. Got a big problem? Or in this case a big prompt? Divide it into smaller and smaller prompts. The less a prompt changes per inference, the more similar the token vectors, and the more likely the results will be consistent. Do you have a prompt that is taking in multple variables? Do you have a prompt that is performing multiple functions? Minimize the dynamic pieces of the prompt. You will likely get a lot of mileage from this technique. The smaller, and more narrowly focused you make your prompts the better. This is really like the Single Responsibility Principle of engineering, and it holds true in this domain as well. One obstacle however, is that once you break a single prompt into many small prompts, you now have to deal with orchestration. That is, getting your many prompts to pipe inputs and outputs together. Working with an asynchronous system of jobs and queues. And the observability problems thare are going to come along with all of that. More on this later.
Prompt Engineering (v2) is like Prompt Chaining, and absorbs many of the same engineering principles such as the Single Responsbility Principle. The key difference with Prompt Engineering is that there are no variables, no conditionals and no code in any of the prompts. Prompts are all written in natural language. And while the context might change for a particular inference at runtime, the prompt itself stays the same. In fact, with prompt engineering done correctly, the only dynamic portion of an inference is the context. This may seem like a trivial distinction with Prompt Chaining, but it is signficant. Prompt Chaining is still a technique reserved for software engineers. It means creating a software system that calls out to an LLM as a function of the bigger system. Prompt Engineering however describes an entire layer of a system that is purely prompts. Logical control can still be implemented, however it is done so using prompts and calls to AI. The software layer can call into the AI layer, much like the software calls into a database layer. The layers are connected and interactive, however they are separate domains.
Both Prompt Engineering and Prompt Chaining are solving for the same problem. Making big prompts smaller. Any time you find a prompt that is acting unreliably, the problem is that you need to rephrase that single prompt into multiple smaller prompts. You may need to connect them with logic. You may need to add a verification prompt afterward to validate and potentially re-run the prompt again. But by using these techniques, AI can be not controlled, but tamed. The AI system's reliability will improve and high success rates at scale are achievable.
But why is Prompt Engineering better than Prompt Chaining? This claim comes from the same basic truth we have been discussing. Prompting LLMs is not the same as programming software. The best metaphor I have is the database. Or the front-end. As we have seen in those two domains, the patterns are signficantly different from application logic. The tools and platforms and concerns are different. Entire ecosystems are built around these differences ultimately to the point that job titles are born around each of these domains. The code to run these separate layers ends up hosted differently. This will be the case for AI interaction as well. And that's why Prompt Engineering deserves its own category, like Database and Front-end Engineering. We are early, but the market is already filling up with tools that will make the ecosytem for Prompt Engineering jobs.
What good is it to be able to build anything you'd like, if you don't know what to build? This problem is true of all product development and engineering disciplines. The people who understand the industry and the domain nuances usually are not coders. This is a big problem in these other layers because they all require some amount of code and technical expertise. Frontend at least has mockups and Figmas. Databases have the analogy of spreadsheets. Application design is still out of reach of the non-technical. But the AI layer does not have to be. Everyone can undertand a flow chart. Everyone can understand giving instructions. AIs are natively natural human language. Of all the domains, creating with AI should require the least amount of technical expertise. The AI layer, when done correctly should reach like an onboarding manual for a new employee. It should codify standard operating procedures for business processes. It should be purely in natural language and devoid of esoteric concepts. Observing the AI layer and iterating on prompts should also be simple and accessible to the non-technical.
As AI gets better, at least in the GPT and LLM paradigm, I expect its speed and cost to decrease. But I don't think the fundamental problem of breaking tasks into bite-sized chunks to go away. And the meta tasks of orchestarting prompts and designing flows will not go away. These are the essential problems of the AI layer. Solving them, is what we call Prompt Engineering.