Working Notes: a commonplace notebook for recording & exploring ideas.
By Kunal

Experiments in building bespoke tools with AI

A few weeks into my sabbatical, I took another stab at vibe-coding. After a few rounds of watching agents painfully burn tokens as they wasted my time, I asked Claude to reorganize my Emacs configurations.

The next thing I remember is exhausting my Claude Pro limits and upgrading to Claude Max.

Vibe Limits

My challenge with building and deploying software built by agents is a lack of trust in the generated code. I have significantly less patience for debugging agent-authored code and generally feel a sense of unease if I think about deploying it to production. There's also a nagging sense in the back of my head that I'd be much better off just implementing it myself over the long run.

With code generation being (effectively) instantaneous, my new bottleneck is being able to trust the agent's output: either by fitting it into my head or finding ways to mechanically validate correctness. I want to be involved in the design decisions, but I don't need to be involved in every implementation detail as long as the result works correctly and efficiently.

To make this sensation concrete — something I can reason and talk about — I'm calling the complexity of the change I can make with sufficient confidence my vibe limit.

Figure: Vibe Limits (risk against complexity of change)

All parts of the chart are subjective, contextual and personal:

Risk: bugs, outages, and unexpected behavior that I'm willing to accept.
Complexity of Change: a way to describe how big the change is; one measure is how long it takes me to truly understand and internalize what the change is.^[1]

The shape of the curve between change complexity and risk depends a lot on the model and codebase it's working on:

Tightly coupled, ball-of-mud-style code tends to break far more easily, which shows up as an exponential curve
The size of the codebase itself, and the necessary context for the model to behave
Anthropic published a paper on the Hot Mess of AI: how agents become more incoherent as the task becomes more complex

This framing also helps me explain at least part of the discontent online: people at extremely different positions on the chart yelling past each other. The spectrum goes all the way from one-word auto-complete to Gas Town.

Something that stands out immediately is that smaller changes that can be easily vetted should be acceptable in any circumstance: this is a lesson we've learned repeatedly while reviewing code by humans, and it still stands true for code by robots.

I still haven't quite managed to ship fully agent-generated code to production: the one time I did try, I ended up ripping out and rewriting the code within a week because I really didn't want to spend time hardening and debugging generated code.

Complexity-bending

One way to have agents get more work done (or larger pieces of it) without accelerating risk would be to change the shape of the risk/complexity curve. Some approaches to achieve this:

Extensive testing: both unit and end-to-end tests to force correctness, with guard rails to prevent the model from simply deleting the tests
Choosing languages like Rust where compilers can step in aggressively to help prevent regressions, particularly memory-safety bugs and data races
Modular designs that reduce the amount of context a model must maintain to make safe changes
Smarter, appropriately trained models that can deal with larger changes

The other mechanism is to increase the appetite for risk from generated code:

Making it trivial to revert to a known good state, with minimal consequences in case it breaks temporarily
Running in contexts where the end user can reasonably be expected to deal with failure or (ideally) debug it
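Reverting to a known good state can be as simple as snapshotting a file before a change and restoring it when a health check fails. The sketch below illustrates that idea for a single configuration file; the function `apply_or_revert`, the file name, and the health check are all hypothetical, and in practice a version control revert plays the same role.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write `candidate` to `path`, keeping it only if `is_healthy` accepts the result.
/// On rejection the previous contents are restored. Returns whether the change was kept.
fn apply_or_revert(
    path: &Path,
    candidate: &str,
    is_healthy: impl Fn(&Path) -> bool,
) -> io::Result<bool> {
    let known_good = fs::read_to_string(path)?; // snapshot before touching anything
    fs::write(path, candidate)?;
    if is_healthy(path) {
        Ok(true)
    } else {
        fs::write(path, known_good)?; // roll back to the snapshot
        Ok(false)
    }
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("vibe_limit_demo.conf");
    fs::write(&path, "theme = light\n")?;
    // Hypothetical health check: the config must still define a theme.
    let kept = apply_or_revert(&path, "oops\n", |p| {
        fs::read_to_string(p).map_or(false, |text| text.contains("theme ="))
    })?;
    println!("kept = {kept}, contents = {:?}", fs::read_to_string(&path)?);
    fs::remove_file(&path)
}
```

The health check is passed in as a closure so it can be anything from a string match to launching the tool and checking its exit status.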

Jagged edges

Trying to approach the problem in a different way, I thought I'd compare using agent-generated code to relying on upstream libraries, or delegating the work to fellow engineers. This didn't really work out because agent output carries minimal signal on quality by default.

The reason I have trouble with agents is that I don't have a meaningful theory of mind for their behavior. LLMs exhibit jagged intelligence: brilliant in some tasks (generating Python to count the number of r's in strawberry), and astonishingly incompetent in others (directly counting the number of r's in strawberry). Another phrasing I've heard that applies just as neatly is alien intelligence.

Seeing this the first few times is disconcerting: Claude (or Codex) will one-shot a task that would have taken me hours or days (particularly identifying bugs in my code) and then spin indefinitely trying to change and list directories.

LLMs are an extraordinarily powerful tool, but the ultimate accountability for their output rests with me. When delegating to people I can consider motivation, incentives, and consequences; I have none of those for models, which leads to a decidedly strange risk curve.

A computer can never be held accountable, therefore a computer must never make a management decision.
– IBM Training Manual, 1979 (tentatively)

My Sweet Spot

Given my risk tolerance for production projects is extremely low with a correspondingly low vibe limit, my sweet spot is in quickly manufacturing tools for myself. After roughly fifteen years as a professional programmer, I heavily customize any laptops and interfaces I use to my unique preferences. Agents can become a handy shortcut to make software perfectly suited to my taste.

High risk tolerance, controlled environments

If the tools break, I can generally mitigate the risks by reverting to a known good state or even abandoning the tool entirely.

There's also the option to design tools to be self-contained and testable: flattening the risk/complexity curve and increasing the size of changes I can safely ask agents to pull off.

A large surface area to tackle

Tool preferences also tend to be extremely personal: a function of taste, the tools someone was first introduced to, and how people think. Leaning on the extreme programmability of Emacs (and observing Smalltalk and the renaissance of moldable development) set up the frontier of customizable tools, and I'm terribly excited at the idea that everyone can build tools that fit them without having to spend unreasonable amounts of time getting good at programming the tools.

The amount of effort saved is large, and helps me avoid minutiae around manipulating tools that are valuable but not essential to the actual work I want to do.

Design principles

We shape our tools and our tools shape us
— John Culkin paraphrasing Marshall McLuhan

Having explicit principles felt particularly important because the tools I build here will influence what I build and how I build in the future.

I'm not particularly excited about the direction we're subtly pushed towards by the current agents where we delegate most decisions to the AIs: I would much rather be amplified by them instead.

The tools need to adapt to my preferences, not the other way around.
- with software being effectively free, I don't see a reason to spend my time learning arbitrary interfaces.
- lean into self-improving tools that can adapt themselves to how I end up using them
- I must always be able to override behavior.
Apply AI at the most leveraged interfaces.
- the more I can make the tools work at interfaces I directly use (the terminal, the browser, Emacs) the more leverage I get to improve the way I work.
Make it explicit, explainable, auditable, and instrumented.
- understanding how and why things work is particularly important for me as a systems engineer.
- I want to understand what the tools did, be able to debug them trivially, and understand how much they cost in compute.
- A lot of the current approaches to leveraging AI feel as though we're setting money on fire to boil a cup of water.
Build with a deep understanding of the models themselves.
- skills, crafting prompts, etc. feel too shallow to build really good tools
- learning to use post-training and working with model architectural limitations is a much more effective way to design.

Mechanics

Actually trying to build projects (see the next section for specifics) taught me a few things on interacting with agents as they are today. For now I'm exploring what works for me and don't claim to be AI-native (nor particularly aspire to it).

The frequently changing interfaces for using models (MCPs, skills, swarms, etc.) as well as prompt hacks feel somewhat shallow, and I expect them to be shorter-lived as we rapidly iterate on models, tools, and RL to improve the models. I don't want to spend time surfing the current shape of the jagged intelligences we're working with. I plan to lean on the Lindy effect and pick things up once they've stuck for a little bit.

With all of that out of the way, here are my current vibe-based workflows, drawn from a string of failures and successes in applying models:

Iterate on Design first:
- I ask the model to discuss a design with me, and then continuously refine it further.
- Asking the model to poke holes in the design and clarify often helps in getting a great outline out quickly.
- Make sure the design is committed to the same repository and maintained with edits.
- Think of it as if you're doing waterfall but you have the ability to iterate so rapidly it becomes agile.
Incremental commits:
- Always ask the model to make tiny commits, and add this as a rule to the CLAUDE.md (or equivalent) from the beginning.
- This makes the changes more cohesive and bite-sized in case I need to revert or review, reducing the odds that I'll exceed my vibe limits.
- Like a bespoke repository from the dark ages, small commits help in understanding why a repository is shaped the way it is.
Explicitly document and commit plans and conversations:
- See if you can get the agent to regularly summarize and commit your conversations to the repository as well: this could be updates to the design.md, agent.md, or plan.md files.
- Try to keep all of this state in the same repository so that it's somewhat reproducible and manageable.
- Other humans may still end up reading the code; saving your original intent lets them gain context much faster.
- This also makes it much easier to start a new session without having to continuously forward context.
Build in automatic feedback loops for the agent:
- An anti-pattern I watch out for is if I find myself playing QA repeatedly for the agent — I found this to be the easiest way to frustrate myself.
- Have the model generate unit tests, end-to-end tests, or script out a workflow you play for it once. Sadly, I haven't managed to get a good enough loop for UIs yet.
- Also try to add some signal or measure of non-functional metrics like memory utilization to prevent unpleasant surprises.
- Surprisingly, I've also had to instruct the model to always act on compiler warnings; using Rust, linters, and existing tools compounds well.
Build your own knowledge by inverting roles:
- If I need to learn or explore a paper, I'll sometimes ask the agent to make a plan for me to follow in text.
- Then I implement everything by hand, asking the agent for help if I get stuck. This mirrors my days of using Stack Overflow, when I made it a point to never directly copy/paste code and instead always typed it in from scratch.
- I've found this to be an order of magnitude speed up in how I learn with concrete projects.
Have the agent one-shot the problem, delete it and restart:
- There's something about seeing and interacting with something that works end to end that can greatly clarify thought.
- Seeing one approach to a problem can also be very helpful in overcoming programmer's block: I can judge the concrete implementation and decide where I want to go.
- This can be particularly useful for unfamiliar APIs and languages. Asking the model to make my code idiomatic tends to be extremely helpful when building initial muscle.
- Generally the size of the one-shot is so big I can't trust it and definitely don't want to debug backwards: which is why I inevitably throw it away and then use that as one potential path.
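The feedback-loop item above mentions having to tell the model to pay attention to compiler warnings. One way to make that mechanical is to treat any warning in captured build output as a failure. This is a minimal sketch that assumes rustc/cargo-style diagnostics, where warning lines begin with `warning`; the names `warning_lines` and `gate` are hypothetical and the sample output is illustrative rather than captured from a real build.

```rust
/// Return every line of captured compiler output that starts with `warning`.
/// Crude on purpose: cargo's own summary lines are collected too.
fn warning_lines(output: &str) -> Vec<&str> {
    output
        .lines()
        .filter(|line| line.trim_start().starts_with("warning"))
        .collect()
}

/// Turn any warning into a hard failure that the agent has to deal with.
fn gate(output: &str) -> Result<(), String> {
    let warnings = warning_lines(output);
    if warnings.is_empty() {
        Ok(())
    } else {
        Err(format!("{} warning(s), first: {}", warnings.len(), warnings[0]))
    }
}

fn main() {
    // Illustrative build output, not captured from a real build.
    let output = "   Compiling demo v0.1.0\nwarning: unused variable: `x`\n    Finished dev profile\n";
    match gate(output) {
        Ok(()) => println!("clean build"),
        Err(reason) => println!("failing the loop: {reason}"),
    }
}
```

For Rust specifically, passing `-D warnings` to the compiler (for example through `RUSTFLAGS`) gets the same effect without parsing any output; the sketch is mainly useful for tools that have no such switch.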

Projects

Finally getting to the actual experiments I've been playing with during my sabbatical. I ended up subscribing to all of Codex, Claude, and Gemini, but most of the code here is from Claude — as you'll also see in the history from the GitHub repo.

Djn

My original problem with agentic programming was having to pass context around: I really wanted the agent to pull context from whatever I happened to be doing and the windows I had open at that point.

There was also the mechanical part of needing to support different inference endpoints (local or remote) and some way to orchestrate API calls meaningfully which I wanted to centralize for observability and key management.

Which is what led to the idea behind djinn: a set of software that could easily compose to share context, manage inference, and act as the foundation for any AI powered tools I'd build in the future. Ideally I'd bootstrap it, building djn faster and faster as I could get more tools online.

I failed miserably at this attempt.

I started by asking Codex (a few generations ago) to implement this with Python: some tmux-based introspection, some Emacs MCP servers, and a central Python server for making the calls with API support. I didn't really have a fast or sane way to validate any of it.

The agent would keep asking me for help validating the changes, and though I could play QA while streaming Netflix that felt very far from a good use of my time, and progress was minuscule.

I also ran into a different analysis paralysis: in retrospect I was asking for changes way beyond my vibe limit, and didn't quite know how to structure things so that the agents would build something I'd actually use.

After a few days I just reset the repository and decided to try much, much smaller projects instead.

Emacs, ZSH & other configurations

With hindsight I should have started using agents to own and maintain my configurations much earlier; if you're thinking of getting your feet wet I'd strongly recommend using an agent to optimize your .zshrc or .tmux.conf.

dotfiles

Following Jim Meyering's advice I've been dutifully version controlling my dotfiles for several years. After a refactor I had a bit of a mess on my hands because I'd duplicated the configurations to safely update them.

To mitigate risk, I decided to create a new branch and folder for the files so I could adopt them much more incrementally.

I asked the agent to look at the previous checkout and copy changes into the new one while cleaning up any broken settings and deleting completely obsolete configurations for software I don't use anymore.

My main prompt was to make sure changes were incrementally committed so I could easily revert breaks, but this was purely mechanical and a great warmup.

You can compare the before and after on GitHub.

I'd generally recommend giving your dotfile management over to a model; just remember to move your API keys into a separate file first.

Color Schemes

Along the way, I was somewhat dissatisfied with the color schemes I had available on WezTerm (and browsing through all 1,000 options gets monotonous) so I asked Claude to whip up a light theme based on Cosmic Latte as the background color.

Then, obviously, the next step was a dark theme that complemented it. The moment of joy was Claude calling the dark variant Cosmic Espresso.

This was another excellent experience: color schemes can be personalized a lot and not having to learn the syntax/format for each different tool goes a long way. After Poet I haven't wanted to maintain my own, but I can easily see myself building themes around paintings and photographs.

Emacs

Preparation

My Emacs configuration is generally split into a default init.el and a local.el for laptop-specific configurations. I decided I wanted to be able to have models live modify these configurations, and broke them out separately from the dotfiles.

Accordingly I ended up forking a new GitHub repository and local folder. Again, I wanted to make sure I had a backup and could keep working even if my Emacs configuration broke in strange ways, so I set up a new configuration folder at $XDG_CONFIG_HOME/.aiemacs. Instead of relying on my regular wrappers (e and en, to launch Emacs client or standalone Emacs respectively) I ended up using some custom wrappers to launch standalone and Emacs client instances with the new config folder.
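The custom wrappers themselves aren't shown here, so the following is only a sketch of how such a wrapper might assemble its arguments. It assumes Emacs 29 or newer, which accepts `--init-directory`, and uses a hypothetical socket name `aiemacs` for the client; the function names are made up for illustration. It prints the commands instead of spawning Emacs.

```rust
use std::path::{Path, PathBuf};

/// Resolve the sandboxed config folder: $XDG_CONFIG_HOME/.aiemacs, falling back
/// to $HOME/.config/.aiemacs when the variable is unset or empty.
fn ai_config_dir(xdg_config_home: Option<&str>, home: &str) -> PathBuf {
    let base = match xdg_config_home {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(home).join(".config"),
    };
    base.join(".aiemacs")
}

/// Arguments for a standalone Emacs that reads its init files from `config_dir`.
fn standalone_args(config_dir: &Path) -> Vec<String> {
    vec![format!("--init-directory={}", config_dir.display())]
}

/// Arguments for emacsclient. This assumes a daemon was already started from the
/// same config folder with `--daemon=<socket_name>`.
fn client_args(socket_name: &str) -> Vec<String> {
    vec![
        format!("--socket-name={socket_name}"),
        "--create-frame".to_string(),
    ]
}

fn main() {
    let xdg = std::env::var("XDG_CONFIG_HOME").ok();
    let home = std::env::var("HOME").unwrap_or_default();
    let dir = ai_config_dir(xdg.as_deref(), &home);
    // Print the commands rather than running them, so the sketch is safe to try.
    println!("emacs {}", standalone_args(&dir).join(" "));
    println!("emacsclient {}", client_args("aiemacs").join(" "));
}
```

Keeping the experimental configuration behind its own init directory and socket name means the regular Emacs setup is never touched, which is the point of the backup described above.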

(I'm keeping AI-generated Emacs configurations private for now: I'm not entirely certain I'd catch it if something sensitive leaked into the configurations. So you'll find more gists and snippets in this section instead.)

My configurations have grown (cough) organically: I had different settings for Python, languages, and different modes spread out somewhat randomly. The first step was to have Claude reorganize everything and make sure I could version control it: that way I can get incremental updates without having to go all the way back to my original configurations (even though I still maintain that nuclear option).

Claude cleaned up all of my configurations and collected them into init.el. I'd also been introduced to straight.el because of vterm and leaned into it to get fully reproducible configurations. (It relies explicitly on the committed configuration and doesn't allow for package-install packages to be saved.)

To further reduce the amount of context required to make incremental changes and add features, I had Claude refactor all the small plugins baked into my existing configuration (e.g. for desaturating a color scheme) and make them into separate version controlled pieces.

With all of that preparation in place I've been an order of magnitude more comfortable with Claude modifying and extending my Emacs configurations.

Agent integration

I wanted to be able to trivially use agents from within Emacs: Agent Shell, which relies on Zed's Agent Client Protocol, is excellent, but I really wanted it to be trivial to give the agent context based on my current files and position.

Figure: Agent Shell

Claude whipped up a small plugin that makes it trivial to launch into an agent session using C-c a c: I can enter some instructions and immediately get back to whatever I was working on. Grab the plugin here: ai.el.

You can see Claude sanity-check this file as an example:

This has been really helpful in making extremely targeted changes that I can apply regardless of how critical the code I happen to be working on is, while maintaining full personal context.

A fun additional use case I have with this workflow is to drop into my Emacs config with C-c e i and ask Claude to add a new keyboard shortcut or customize behavior and it Just Works. (And, of course, I used Claude to add that shortcut to my configuration.)

X (`exec`)

Agents and UIs often feel incredibly heavyweight when I need to figure out the right flags for a quick command (it's almost always git, of course), but that also feels like the single most tractable problem for agents to solve for me with sophistication.

My desired workflow is that I can just type out a command in English, and get a model to convert it into something that just works that I can then run trivially. I've made several iterations of this tool with different backends, generally with direct API calls.

This time around I had Claude make something slightly different:

Rust-based, both for performance and to reduce risk with help from the compiler
Rely on a partial implementation of the Agent Client Protocol to launch into an agent, avoiding the need to manage API keys
Recognize if running in tmux: that way I can get away with things like asking it to "fix the previous command" and include tmux buffer context
Actually publish the generated command to my shell history too: that way I can press up (or Ctrl-R) and modify the generated command
Use tty inputs so that I can both pipe input into the binary and interact with the generated command live
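Two of the pieces listed above, detecting tmux and publishing the generated command to shell history, are small enough to sketch. This is not the code from the X tool itself. It assumes zsh with `EXTENDED_HISTORY` enabled, where history lines have the form `: <start>:<elapsed>;<command>`, and it relies on tmux setting the `TMUX` environment variable inside its sessions. The function names are hypothetical.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// tmux exports the TMUX variable inside its sessions, so a non-empty value
/// is a reasonable signal that pane context is available.
fn in_tmux(tmux_var: Option<&str>) -> bool {
    tmux_var.is_some_and(|value| !value.is_empty())
}

/// Format a command as a zsh EXTENDED_HISTORY line: `: <start>:<elapsed>;<command>`.
/// Multi-line commands would need extra escaping, which this sketch skips.
fn zsh_history_entry(started_at: u64, command: &str) -> String {
    format!(": {started_at}:0;{command}\n")
}

fn main() {
    let tmux = std::env::var("TMUX").ok();
    if in_tmux(tmux.as_deref()) {
        // `tmux capture-pane -p` prints the current pane; it is only named here, not run.
        println!("would attach output of: tmux capture-pane -p");
    }
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0);
    // A real tool would append this line to the shell's history file.
    print!("{}", zsh_history_entry(now, "git log --oneline -5"));
}
```

Whether a running shell picks up an appended entry right away depends on its history options, so pressing up may only work after the history file is re-read.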

In practice, agent startup times are fairly bad (all that JavaScript); enough that I had Claude profile and give me some estimates: this was a fascinating exercise in itself with the output captured here.

For now, I'm just having it default to codex for these commands, but there are obvious ways to work around this (either maintaining long-running agent sessions, or eating the cost of maintaining API keys explicitly).

With the standard caveats to pay attention to your personal vibe limit and that I haven't vetted this code carefully, you can install it with:

cargo install --git https://github.com/kunalb/djn x

Day.html

I need to give some context on this one, but if you're impatient you can go and directly play with it here.

Figure: day

For the past several years I've been using Stalogy Editor notebooks to map out my day. Each page has a 24-hour grid on the side, and the notebooks are sized to last for either half a year or a full year; I'm on my third notebook at this point. The way I like to use them is to:

map out my fixed plans for the next day by looking up all calendars across devices
write out things I'd like to accomplish
block off time for these in between existing, fixed commitments
at the end of the day, review and annotate what I actually spent my time on

I lean on this heavily when I have a lot going on and need to be precise; the freedom of sketching on paper also works really well while I'm traveling and I can draw things based on my perceived time without having to constantly configure time zones.

Single-page HTML applications can be really powerful and simple: they are extremely self-contained, and can even access files on disk with specific Chrome APIs and permissions.

As an explicit design choice I prefer to build these with plain old JavaScript (both before and after the availability of coding agents), and Claude quickly built an equivalent of my workflow with just 2000 lines (HTML, CSS, and JavaScript included).

This experiment is one of the places where I couldn't get Claude to test what it was doing and it shows: there are still several annoying glitches with small touch areas throughout the UI, though the design itself matches my tastes well. At this point it's good enough for me to use, but not something I'd recommend to others. I've been through several iterations, and generally feel as if I'm approaching the limit at which point I'd be better off digging into the code and fixing it myself.

etcetera

There are several more (surprisingly alliterative) projects that I've been prototyping but won't get into here: Palimpsest, for doing code archaeology on files and commits rapidly so I can see how a specific function evolved; Planner, for something that's like FocalBoard but with a few more features I wanted; and Polyptych, for generating Perfetto traces dynamically (I'm connecting it with DuckDB to have a lot of data sources just work). This is going to be an attempt at making production software with vibe-coding, and I need to devote a lot more time to it.

Looking ahead

It's hard to write about AI without addressing the flashing neon elephant in our collective, virtual drawing room: "Software Engineering is dead", "There will no longer be any jobs", "Coding is meaningless as a career", "100x engineers", and so on and so forth.

Belonging to the tribe of programmers has always meant signing up for a lifetime of learning; AI happens to be one more step, and one that gives us a lot more leverage. We can learn much more rapidly and work in unfamiliar contexts that much faster.

Things will change, perhaps rapidly, but not entirely overnight. See Dario's note in Machines of Loving Grace: the consequences of AI adoption are still limited by physics and human behavior, which can only move so fast. We'll have time to adapt and learn.^[2]

The better AI becomes, the easier it will be to adopt by definition. Surfing at the frontier can be fascinating — and can open opportunities for the entrepreneurial — but it isn't necessary.

There are several well-thought-through documents on using AI tools within organizations: Oxide's RFD on LLM usage, and Monarch's philosophy on AI use — particularly the emphasis on accountability. Isometric.nyc is an example of something that I don't think would exist without AI.

I'm optimistic about intentionally applying AI to magnify my effectiveness across different domains and becoming able to build and learn things I never would have been able to otherwise; I'm pessimistic about the consequences of the hype cycle around it, particularly those that lead to sudden catastrophic failures once complexity compounds in vibe-coded systems.

The tools still have a very long way to go for me: to start, I really don't want to have to give context to the models anymore and would love for it to be trivially fetched based on what I'm doing on the computer. Djn was supposed to help enable this; while I ended up deleting the first vibe-coded attempt, I plan to resurrect it soon.

The second thing I would like to see is getting much better at managing the complexity generated by applying agents; the Agent-in-IDE / UI interfaces do this, but I think there are more interfaces possible to orchestrate and engage with agents in a way that enhances trust and pushes my vibe limit forward.

Being able to understand and navigate someone else's vibe-coded output (preferably with the original dialogue) will become increasingly critical to manage complexity, and something that seems entirely under-served.

A new bottleneck I expect to see as it becomes much easier to build software is distributing it. Relying on the browser (or phone OS) as a sandbox and making it trivial to share applications should unlock a lot of creativity and value.

At the consumer end, it becomes incredibly difficult to understand what's actually worth using: mobile stores already suffer from clones and poorly executed copies; I can only expect that problem to compound significantly.

My intuition suggests that open source software that was already built to be customized (Emacs, (Neo)Vim, Smalltalk, Moldable Development, the suckless suite) should work really well with agents: trivially updated, rebuilt, and easily sanity-checked, with lots of examples available.

A dream I have about the future of software is that the idea of applications disappears entirely; instead we can express our intent — just by talking to the machine — and get something perfectly suited to the hardware it runs on and the person actually using it.

Comments

Email me, Threads, Twitter or Hacker News.

[1] I explicitly call it complexity instead of size because lines of code only occasionally map to complexity. Large changes can sometimes be extremely simple, mechanical codemods: a direct search and replace; small changes can trigger catastrophic performance failures.
[2] The economic consequences are currently beyond my ability to predict: I'm fairly curious about how things change as inference becomes less subsidized, and at the same time technology and hardware improve to make it cheaper regardless. Working out the tokenomics (including physical resources) is something I'll leave to SemiAnalysis.
