
What Is This Experiment?

Agent Pipeline - This article is part of a series.
Part 1: This Article

This is Why I’m Looking at AI Coding.
#

There has been a lot of interest in using AI agents and “vibe” coding, but it didn’t interest me. Because I have a long history with DevOps, I cannot look at an application as just the code. I also see the repository, testing, building, packaging, deployment, monitoring, and anything else that could be part of the production pipeline. I could easily see where AI could assist in the initial creation of the code, but studies have shown it doesn’t affect the maintainability of the code at all. (See: Modern Software Engineering) Given the possibility that AI agents could become prohibitively expensive in the future, I felt that agents were unlikely to become part of the actual pipeline.

Then I began to see applications like CorridorKey appear. This was a professional-level program written using AI tools by someone with a weak programming background. He had a strong understanding of the problem he wanted to solve, but a weak understanding of the tools he had available. The question becomes: how could this type of development be brought into a pipeline? Code reviews could be problematic. It would require multiple people working on the same application. Testing would have to be automatically implemented. Is it possible to set up such a pipeline?

Setup for the Experiment.
#

The goal is to determine if AI agents can be used in a complete, stable, production pipeline.

Since I need something that a human can easily interface with in natural language, I’m going to start with Gitea. Along with the repository, it provides issue tracking, pull requests, and a wiki that an agent can use to plot its next action. It will be interesting to see how much can be stored in Gitea and how much has to be set up for the agent.

The Initial Questions
#

  • Can a human read the code?
  • Are automated tests properly generated?
  • Do the tests pass?
  • Does it execute?
  • Is the code change relevant to the requested change?
  • Is the percentage of the code change excessive?
  • Can development continue if the AI agent is unavailable?
  • What happens when changes are submitted by something other than the AI agent?
  • Can changes to the AI provider be made irrelevant?
  • What happens when different providers work on the same application?

As the experiment progresses, other questions will be added.
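Some of these questions lend themselves to automated gates. For example, the “excessive change” question could be checked by parsing `git diff --numstat` output in CI. This is a minimal sketch; the 20% threshold is an assumed value for illustration, not a recommendation.

```python
def change_percentage(numstat: str, total_lines: int) -> float:
    """Percentage of the codebase touched, from `git diff --numstat` output.

    Each numstat line looks like "<added>\t<deleted>\t<path>";
    binary files report "-" in place of the counts.
    """
    changed = 0
    for line in numstat.strip().splitlines():
        added, deleted, _path = line.split("\t")
        if added != "-":  # skip binary files
            changed += int(added) + int(deleted)
    return 100.0 * changed / max(total_lines, 1)

MAX_CHANGE_PCT = 20.0  # assumed threshold

def gate(numstat: str, total_lines: int) -> bool:
    """Hypothetical pipeline gate: fail if one change touches too much code."""
    return change_percentage(numstat, total_lines) <= MAX_CHANGE_PCT
```

A gate like this says nothing about whether the change is *relevant* to the request; that check would still need a reviewer, human or otherwise.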

What is the Application Going to Be?
#

I’ve decided to build an arcade-style game for the following reasons:

  • A game can be safely abandoned. This is an experiment, and failure is always an option. A catastrophic failure could mean the end of the experiment. The loss of a tool or library could also be catastrophic, but the loss of a game is at most an annoyance.
  • Arcade games have been around for decades. Therefore, LLMs have been trained on a large amount of relevant code. My research has shown that LLMs do best with standardized answers, and this gives the best chance of the agent producing a good answer. The experiment is about the pipeline, not the LLM.
  • Arcade games don’t require LLMs to function. Again, the experiment is about the code and the pipeline. LLMs can introduce self-modifying code, which would introduce additional variables to measure. Since arcade code is self-contained, results are easier to test.
  • Arcade games are not simple. They contain scoring, movement, damage, reactions, players, levels, and many other things. This is not something that can be written in a single prompt. It will require several iterations, which will be used to test the pipeline.

Why I am Running This Experiment.
#

I’ve always been a curious person. When LLMs were becoming popular, I built an unreliable OCR program to get a better understanding of how they worked. This meant building the data collections, writing a basic neural network, and writing the evolution program that adjusted the network to its best settings. Though I didn’t produce anything usable, I learned what I wanted to know:

  • LLMs are pattern-recognition systems.
  • Predictive AI only tells you what is most likely to come next in the pattern.
  • AI cannot comprehend. It can tell you what’s most likely to come next, but not why it comes next.
  • AI is a popularity contest. For coding, it recommends the code that is most often used, not the code that is best to use.
  • AI is a self-fulfilling prophecy. Since LLMs consume their own output, they see more examples of the code they recommended earlier, which increases the likelihood of recommending that code again.

Even with the described limitations, AI agents have been able to produce some remarkable products. The question is, “How do I make it worth it?” A dining-room set made of sand on a beach will disappear with the tide. An application that cannot be maintained might as well be built of sand. Or is there a way to make these applications work with a production pipeline?
