How to turn an LLM into a mini-developer

5 min read · Jun 1, 2023

There is a lot of hype around LLMs and their ability to generate code. As someone on the front lines, I can attest that the hype is real and potentially understated. The real jump in capability comes when you give the LLM context about your codebase while it writes code.

As a developer you likely spend 99% of your time working with an existing codebase. You start with a high-level understanding of the codebase: where to put stuff, where to find example code, how things are organized. But the LLM knows none of this. When you feed code into the LLM, it is likely working zero-shot or few-shot: it only has the limited context you provide.

But what if you could give the LLM the context you have about a codebase? Or, even better, more context than you have?

This post is a step-by-step guide to doing exactly that. You’ll learn how to use the LLM to write explanations for your code, how to create vector embeddings of those explanations, and how to store the vectors alongside the code metadata for easy retrieval.

This unlocks the ability to give the LLM context when generating new code, which transforms the LLM from a copy/paste nice-to-have into a full-fledged mini-developer.

Step One — Selecting a vector DB

The first step to indexing a codebase is to select a database that supports storing vectors. There are several options. I’ll leave the cost/benefit analysis of these options for another post.

Pinecone
Weaviate
Elasticsearch (yep, this works really well; I’ll save the discussion on this for another post)
Postgres/pgvector

I’ve used Elasticsearch, Pinecone and pg/pgvector. I chose pg/pgvector because I liked the idea of storing the code metadata in the same location as the vector store.

With this setup, the code explanation and other metadata are stored in the same table row as the embedding vector.
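As a rough sketch of what one row might look like, here is a TypeScript type whose column names are taken from the match_code_file function in Step Four. The type name CodeFileRow, the sample path and the placeholder vector are assumptions for illustration, not the actual schema.

```typescript
// Hypothetical row shape for the code_file table. Column names follow the
// match_code_file SQL function in Step Four; the type name is an assumption.
interface CodeFileRow {
  id: number;
  file_name: string;
  file_path: string;
  file_explaination: string; // spelling matches the column name used in the SQL
  file_explaination_embedding: number[]; // 1536 floats for text-embedding-ada-002
}

// Illustrative values only: the path is made up and the vector is a placeholder.
const exampleRow: CodeFileRow = {
  id: 1,
  file_name: "branch.ts",
  file_path: "src/branch.ts",
  file_explaination:
    "This file, branch.ts, contains TypeScript code that handles requests related to GitHub branches.",
  file_explaination_embedding: new Array<number>(1536).fill(0),
};
```

Keeping the explanation, the file metadata and the vector in one row is what makes a single query enough at search time.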

I also chose to use Supabase to host the pg db. Supabase is very easy to set up and lets me focus on the details of the app vs managing the db.

Step Two — Use the LLM to explain the code

Once you have the db set up (here’s a quickstart guide) you can ask the LLM to explain what the code does and store the embedding. The prompt is simple for this:

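// Blank lines keep the instructions from running into the first and last lines of the code.
const prompt = "What does this code do:\n\n" + content + "\n\nPlease include the coding language.";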

I use GPT-3.5 Turbo for this and it works well. Here is an example output:

This file, branch.ts, contains TypeScript code that handles requests related to GitHub branches. It defines functions for retrieving branch information, creating new branches, and deleting branches. The code also includes error handling for cases where the requested branch does not exist or the user does not have permission to perform the requested action.

Step Three — Embed the explanation

In practice I do steps two and three at the same time: I have a function that loops over the files in the codebase, asks the LLM for an explanation of the code in each file, embeds the explanation and stores all of it in the db.
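A minimal sketch of such a loop might look like the following. It is not the function from my codebase: the names indexCodebase, IndexDeps, SourceFile and IndexedFile are hypothetical, and the LLM call, the embedding call and the db insert are passed in as plain functions so the sketch does not depend on any particular SDK.

```typescript
// Hypothetical dependencies, injected so the loop stays independent of any SDK.
interface IndexDeps {
  explain: (prompt: string) => Promise<string>; // e.g. a chat-completion call
  embed: (text: string) => Promise<number[]>; // e.g. an embeddings call
  store: (row: IndexedFile) => Promise<void>; // e.g. an insert into the db
}

interface SourceFile {
  name: string;
  path: string;
  content: string;
}

// Column names follow the match_code_file function in Step Four.
interface IndexedFile {
  file_name: string;
  file_path: string;
  file_explaination: string;
  file_explaination_embedding: number[];
}

// Explains, embeds and stores each non-empty file; returns how many were indexed.
async function indexCodebase(files: SourceFile[], deps: IndexDeps): Promise<number> {
  let indexed = 0;
  for (const file of files) {
    if (!file.content.trim()) continue; // nothing to explain in an empty file
    const prompt =
      "What does this code do:\n\n" + file.content + "\n\nPlease include the coding language.";
    const explanation = await deps.explain(prompt);
    const embedding = await deps.embed(explanation);
    await deps.store({
      file_name: file.name,
      file_path: file.path,
      file_explaination: explanation,
      file_explaination_embedding: embedding,
    });
    indexed++;
  }
  return indexed;
}
```

Processing files one at a time keeps the sketch simple; batching the embedding calls would likely be faster on a large codebase.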

Here’s the embedding function I use:

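// Embed all documents in a single request.
const response = await openai.createEmbedding({
  model: "text-embedding-ada-002",
  input: documents.map((d) => {
    // Missing documents are sent as empty strings.
    if (!d) return "";
    // Cut long documents down to the model's 8,191-token input limit.
    return truncateStringTokens(d, 8191);
  }),
});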

The truncate function just makes sure the input isn’t too big for the model, which accepts at most 8,191 tokens per input. The response contains one embedding vector per input, and each vector looks like this:
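The truncate helper itself is not shown in this post. As a stand-in, here is a rough character-based approximation. The name truncateStringTokensApprox and the figure of about four characters per token are assumptions; a real implementation would count tokens with the model's tokenizer, and source code often tokenizes more densely than English prose.

```typescript
// Rough heuristic for English text; an assumption, not a tokenizer.
const APPROX_CHARS_PER_TOKEN = 4;

// Hypothetical stand-in for a token-aware truncate function.
// It cuts by character count, so it can over- or undershoot the real token limit.
function truncateStringTokensApprox(text: string, maxTokens: number): string {
  const maxChars = maxTokens * APPROX_CHARS_PER_TOKEN;
  return text.length <= maxChars ? text : text.slice(0, maxChars);
}
```

If the embedding request still fails on very long files, lowering the token budget passed to this helper is the simplest workaround.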

[-0.012312407,0.023326635,-0.019229261,-0.026423331,-0.0042292206,0.0049087354,0.0043103565,0.020054145,-0.010135254,-0.031940587,0.018701877,-0.04392169,-0.009012871,-0.0044591064,0.010297527,-0.011879681,0.012724849,…]

Step Four — Create a similarity function

Once you have the embeddings stored in the db you need a way to search for relevant code. Here’s how that works with an example.

Let’s say I wanted to search for code that handles updating git branches. Here is the query:

Find the code that handles updating git branches.

The first thing you need to do is create a function that can match similar vectors. Here’s a post on how that works. Here’s the function that I added to Supabase to match code:

create or replace function match_code_file (
  query_embedding vector(1536),
  similarity_threshold float,
  match_count int
)
returns table (
  id integer,
  file_explaination text,
  file_name text,
  file_path text,
  similarity float
)
language plpgsql
as $$
begin
  return query
  select
    code_file.id,
    code_file.file_explaination,
    code_file.file_name,
    code_file.file_path,
    1 - (code_file.file_explaination_embedding <=> query_embedding) as similarity
  from code_file
  where 1 - (code_file.file_explaination_embedding <=> query_embedding) > similarity_threshold
  order by code_file.file_explaination_embedding <=> query_embedding
  limit match_count;
end;
$$;
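In pgvector, `<=>` is the cosine distance operator, so `1 - (a <=> b)` is the cosine similarity between the two vectors. To make that concrete, here is a small TypeScript sketch of the same calculation done in application code. The function name cosineSimilarity is mine and is not part of pgvector or Supabase.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// This mirrors what `1 - (a <=> b)` computes inside the SQL function.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Note: an all-zero vector makes the denominator 0 and the result NaN.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Vectors pointing the same way score close to 1; orthogonal vectors score 0.
const sameDirection = cosineSimilarity([1, 2], [2, 4]);
const orthogonal = cosineSimilarity([1, 0], [0, 1]);
```

A row is returned only when this score is above similarity_threshold, and results are ordered by distance, so the closest explanations come first.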

Here’s what happens behind the scenes:

The query is converted to a vector: “Find the code that handles updating git branches” becomes [-0.012312407,0.023326635,-0.019229261,-0.026423331…] using the embedding function above.
That vector is fed into the match_code_file function above with the similarity_threshold (I use 0.81) and the match_count.
The matched code is spit out! Because the code, metadata and vectors are all stored in the same table, all it takes is one call to get the relevant code.
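The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code from my app: searchCode, CodeMatch and RpcClient are hypothetical names, the default match count of 5 is made up, and RpcClient is a minimal structural type for the one method used. A supabase-js client's rpc method should fit that shape, but verify that against your client version.

```typescript
// One row returned by match_code_file (columns from its "returns table" clause).
interface CodeMatch {
  id: number;
  file_explaination: string;
  file_name: string;
  file_path: string;
  similarity: number;
}

// Minimal structural type covering the single client method this sketch needs.
interface RpcClient {
  rpc(
    fn: string,
    args: Record<string, unknown>,
  ): PromiseLike<{ data: unknown; error: { message: string } | null }>;
}

async function searchCode(
  query: string,
  embed: (text: string) => Promise<number[]>, // same embedding model used for indexing
  client: RpcClient,
  similarityThreshold = 0.81, // the value mentioned in this post
  matchCount = 5, // assumption: the post does not state a value
): Promise<CodeMatch[]> {
  // 1. Convert the query into a vector.
  const queryEmbedding = await embed(query);
  // 2. Ask the database for the closest explanations.
  const { data, error } = await client.rpc("match_code_file", {
    query_embedding: queryEmbedding,
    similarity_threshold: similarityThreshold,
    match_count: matchCount,
  });
  if (error) throw new Error("match_code_file failed: " + error.message);
  // 3. Return the matched rows (empty when nothing clears the threshold).
  return (data ?? []) as CodeMatch[];
}
```

The query has to be embedded with the same model as the stored explanations; otherwise the similarity scores are not comparable.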

Step Five — Give the LLM context

Once you create an embedding vector of every file in your codebase you can give the LLM context for any query! Here’s an example:

I recently built a GitHub app and wanted to create a new branch/PR with new code files. One prompt was all it took:

Write a set of functions that creates a new git branch/pr taking a string input and creating a new file for the input

Here’s what happened in the background:

The Exo app looked for existing relevant code based on the nouns in that query: “git branch/pr” for example
It found the branch.ts file above by matching the query against the embedded explanation.
It fed the existing code in branch.ts into the LLM, with a prompt explaining that this is existing code the LLM agent could use to write the new function.
The LLM wrote new code that combined the existing code in branch.ts with new functions that create a new file/branch/PR from a string input.
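The exact prompt used for this step is not shown here, so the following is only a sketch of how retrieved files could be folded into a prompt. The function name buildContextPrompt, the ContextFile type and the wording of the instructions are all assumptions.

```typescript
// A retrieved file: the path from the matched row plus the file's source code.
interface ContextFile {
  file_path: string;
  content: string;
}

// Assembles a prompt that puts the existing code ahead of the task.
// The instruction wording is illustrative, not the prompt used by the app.
function buildContextPrompt(task: string, contextFiles: ContextFile[]): string {
  const context = contextFiles
    .map((f) => "// File: " + f.file_path + "\n" + f.content)
    .join("\n\n");
  return [
    "Here is existing code from the codebase that you can use:",
    context,
    "Using the existing code where it helps, complete this task:",
    task,
  ].join("\n\n");
}

const examplePrompt = buildContextPrompt(
  "Write a set of functions that creates a new git branch/pr taking a string input and creating a new file for the input",
  [{ file_path: "branch.ts", content: "/* contents of branch.ts */" }],
);
```

Labeling each file with its path gives the LLM a way to refer back to existing functions instead of rewriting them.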

Here’s a video of this process in action.

Conclusion

By embedding an entire codebase you can feed context into LLM agents and create your own mini-developer!

All the code referenced above can be found here.

Thanks for reading this far!