Getting agents to code less slop
Coding agents have completely reshaped the way I work. I don’t think this is going to come as a surprise to anyone who’s been using these tools in the last 3-6 months. I now spend most of my engineering time building plans and reviewing agent code.
I don’t know that I’ll ever stop wanting to review agent code. I understand that probably makes me much slower compared to people who accept it as-is and ship to prod, but at the very least it seems like I’m not the only one[1].
My rationale is twofold:
- I echo the sentiments of those who talk about losing touch with code and its architecture. Despite my efforts to combat that, I still find myself asking “how did that thing work again?” Maybe it’s because I’m now covering a lot more breadth over a shorter period of time. Maybe I’m getting older. Maybe it’s both. I don’t remember finding myself in this position as often as I do now.
- If I’m going to be on the hook for the code I ship, then I want to know what’s in it and I want to build it with posterity in mind. I’ve spent quite a lot of time learning what actually works in production: why would I throw that all away? If I can use that experience to shape outcomes, why wouldn’t I?
When I review agent code, I look for architecture or implementation problems. Architecture problems commonly arise from non-exhaustive planning where there was some gap in my understanding of the problem space that wasn’t filled by either my own research or the agent’s. Implementation problems commonly arise out of wrong architecture choices: solutions that don’t fit the framework, the language or even the structure of the existing codebase. These problems are not unique to agents.
Agents have introduced a new kind of implementation problem, one that arises out of the agent’s inherent stochasticity. Even when you get the architecture right, its taste in code structure and modularity is informed more by which branch of the decoding loop it took than by any objective measure.
All of these scenarios produce slop. This post is about the new kind.
Partway through one of my review sessions, I asked myself a simple question: can the intuition I have about what makes code “clean” be automated? If it can be automated, can an agent use it?[2]
Static analysis has been in the engineering toolbelt for a very long time[3], and tools[4] that analyze programs are abundantly available and regularly used. One class of static analysis I haven’t seen used broadly[5] is the kind of tool that tells you your code has too many conditionals, or that your functions are too big, or that you have an object with methods that don’t share any state. Essentially, the kind of tool that tells you how sloppy your code is.
I attribute this to two things:
- Building consensus is hard[6]. Even more so when it’s about something subjective like taste.
- Until recently, there was no generic tool that could automatically transform a body of code in a way that minimizes this kind of objective.
I[7] wrote mdlr to give agents that objective, because:
- I now have half the time to review 2-3x the code
- I need to be able to jump into any part of the diff or codebase and quickly get up to speed
- I want one tool that I can use across multiple programming languages
- Cleaning up dirty code is not my preferred method for staying sharp[8]
mdlr scans a codebase and outputs a list of metrics and their associated symbols, sorted by descending severity. Here’s an example output.
$ mdlr check --pretty
metric         symbol                       value  bucket
function_size  replay_endpoint_error::main  141    critical
cognitive      replay_endpoint_error::main  26     critical
cyclomatic     replay_endpoint_error::main  18     critical
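To make two of these metrics a bit more concrete, here’s a rough sketch of how function_size and cyclomatic could be approximated for Python source using the standard ast module. This is not mdlr’s implementation, and its counting rules may well differ. The names function_metrics, _BRANCH_NODES and SAMPLE are made up for illustration.

```python
import ast

# Node types treated as decision points. This is a rough approximation of
# McCabe's cyclomatic complexity, assumed for illustration only.
_BRANCH_NODES = (
    ast.If,
    ast.For,
    ast.While,
    ast.ExceptHandler,
    ast.IfExp,
    ast.comprehension,
)


def function_metrics(source: str) -> dict[str, dict[str, int]]:
    """Return {function name: {"function_size": ..., "cyclomatic": ...}}."""
    metrics = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        cyclomatic = 1  # a function with no branches has exactly one path
        for child in ast.walk(node):
            if isinstance(child, _BRANCH_NODES):
                cyclomatic += 1
            elif isinstance(child, ast.BoolOp):
                # `a and b and c` contributes two extra decision points
                cyclomatic += len(child.values) - 1
        metrics[node.name] = {
            # Lines spanned by the definition, signature included
            "function_size": node.end_lineno - node.lineno + 1,
            "cyclomatic": cyclomatic,
        }
    return metrics


SAMPLE = '''
def check(x, items):
    if x > 0 and items:
        for item in items:
            if item:
                return item
    return None
'''

print(function_metrics(SAMPLE))
# {'check': {'function_size': 6, 'cyclomatic': 5}}
```

A real tool also has to decide what counts as “too big” or “too branchy”, which is where the buckets in the output above come in. This sketch only produces the raw numbers.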
Getting an agent to use mdlr is very simple: ask it to run mdlr prompt and follow the instructions.
$ mdlr prompt
# Auto-Improve
Use mdlr to identify and improve modularity issues in the codebase.
## mdlr Reference
### Quick Start
# Analyze codebase (diff mode on branches, all files on main/master)
mdlr check
# Force all files even when on a branch
mdlr check -A
# Analyze specific directory or file
mdlr check src/metrics
mdlr check src/main.rs
...
Here’s an example that Claude was able to improve entirely on its own with this tool. This is part of a script I had it build to help me debug query timeouts in ClickHouse.
def main():
    # ...argparse setup for --set, --delete, --gap, --explain, --output...
    args = parser.parse_args()
    if not args.file.exists():
        print(f"Error: {args.file} not found", file=sys.stderr)
        sys.exit(1)
    error = json.loads(args.file.read_text())
    endpoint_name = error.get("endpoint_name")
    if not endpoint_name:
        print("Error: no endpoint_name found in error file", file=sys.stderr)
        sys.exit(1)
    # Extract request from request_json (toJSONString output) or request
    request = {}
    if "request_json" in error:
        request = json.loads(error["request_json"])
    elif "request" in error:
        request = error["request"] if isinstance(error["request"], dict) else {}
    # Apply gap adjustment first
    if args.gap:
        apply_gap(request, parse_duration(args.gap), endpoint_name)
    # Apply overrides
    for kv in args.set:
        key, _, value = kv.partition("=")
        if not _:
            print(
                f"Error: invalid --set format '{kv}', expected key=value",
                file=sys.stderr,
            )
            sys.exit(1)
        request[key] = value
    # Apply deletions
    for key in args.delete:
        request.pop(key, None)
    print(f"Endpoint: {endpoint_name}", file=sys.stderr)
    print(f"Request: {json.dumps(request, indent=2)}", file=sys.stderr)
    # ...load host + token from token file...
    host, token = parse_token(token_path)
    headers = {"Authorization": f"Bearer {token}"}
    # Explain mode: hit /explain for each cte
    if args.explain:
        ctes = discover_ctes(endpoint_name)
        if not ctes:
            # ...error + exit if no .endpoint file found...
            sys.exit(1)
        explain_dir = args.file.parent / f"{args.file.stem}_explain"
        explain_dir.mkdir(parents=True, exist_ok=True)
        print(f"Explaining {len(ctes)} ctes...", file=sys.stderr)
        for cte in ctes:
            resp = httpx.post(
                f"{host}/v0/endpoints/explain",
                json=request,
                headers=headers,
                timeout=60,
            )
            out_path = explain_dir / f"{cte}.json"
            try:
                result = resp.json()
                out_path.write_text(json.dumps(result, indent=2, default=str))
            except json.JSONDecodeError:
                out_path.write_text(resp.text)
            status = "ok" if resp.is_success else str(resp.status_code)
            print(f" {cte}: {status}", file=sys.stderr)
        print(f"Explain output written to {explain_dir}/", file=sys.stderr)
    # Replay: POST to the endpoint
    response = httpx.post(
        f"{host}/v0/endpoints/{endpoint_name}",
        json=request,
        headers=headers,
        timeout=60,
    )
    print(f"Status: {response.status_code}", file=sys.stderr)
    try:
        result = response.json()
        output = json.dumps(result, indent=2, default=str)
    except json.JSONDecodeError:
        output = response.text
    if args.output:
        args.output.write_text(output)
        print(f"Written to {args.output}", file=sys.stderr)
    else:
        print(output)
I have two arguments for why this is bad:
- You can’t quickly skim this to get the function’s story. If I have to spend more than 5 seconds making heads or tails of what something does, it’s usually too big or poorly structured.[9]
- If you want to change this, you have to keep a lot of different requirements in your head to make sure nothing breaks when adding a new thing. I acknowledge that sometimes that’s a feature, not a bug.
Here’s the whole thing afterward:
def main():
    # ...same argparse setup...
    args = parser.parse_args()
    if not args.file.exists():
        print(f"Error: {args.file} not found", file=sys.stderr)
        sys.exit(1)
    endpoint_name, request = load_error_request(args.file)
    if args.gap:
        apply_gap(request, parse_duration(args.gap), endpoint_name)
    apply_overrides(request, args.set, args.delete)
    print(f"Endpoint: {endpoint_name}", file=sys.stderr)
    print(f"Request: {json.dumps(request, indent=2)}", file=sys.stderr)
    # ...load host + token from token file...
    host, token = parse_token(token_path)
    headers = {"Authorization": f"Bearer {token}"}
    if args.explain:
        run_explain(host, headers, endpoint_name, request, args.file)
    run_replay(host, headers, endpoint_name, request, args.output)
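The extracted helpers aren’t shown here, so the following is only a plausible reconstruction of load_error_request and apply_overrides, pieced together from the logic in the original main. The signatures match the calls above, but the bodies are an assumption about what the agent produced, and the demo data at the bottom is invented.

```python
import json
import sys
import tempfile
from pathlib import Path


def load_error_request(path: Path) -> tuple[str, dict]:
    """Read the error file and return (endpoint_name, request)."""
    error = json.loads(path.read_text())
    endpoint_name = error.get("endpoint_name")
    if not endpoint_name:
        print("Error: no endpoint_name found in error file", file=sys.stderr)
        sys.exit(1)
    # Prefer request_json (a JSON string); fall back to an inline dict.
    if "request_json" in error:
        return endpoint_name, json.loads(error["request_json"])
    request = error.get("request")
    return endpoint_name, request if isinstance(request, dict) else {}


def apply_overrides(request: dict, sets: list[str], deletes: list[str]) -> None:
    """Apply key=value overrides, then deletions, to `request` in place."""
    for kv in sets:
        key, sep, value = kv.partition("=")
        if not sep:
            print(
                f"Error: invalid --set format '{kv}', expected key=value",
                file=sys.stderr,
            )
            sys.exit(1)
        request[key] = value
    for key in deletes:
        request.pop(key, None)


# Demo with an invented error file (hypothetical endpoint and parameters)
with tempfile.TemporaryDirectory() as tmp:
    error_file = Path(tmp) / "error.json"
    error_file.write_text(
        json.dumps({"endpoint_name": "demo", "request_json": '{"limit": "10"}'})
    )
    endpoint_name, request = load_error_request(error_file)

apply_overrides(request, ["limit=50", "debug=1"], [])
print(endpoint_name, request)
# demo {'limit': '50', 'debug': '1'}
```

Each helper now owns one step of the story and can be exercised without touching argparse or the network, which is most of what makes the shorter main skimmable.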
Here’s another example that surprised me[10], this time from mdlr’s own codebase. These test cases all repeated the same test setup code, which was surfaced via the code duplication metric.
Before:
#[test]
fn test_load_from_current_dir() {
    let temp = TempDir::new().unwrap();
    let config_dir = temp.path().join(".mdlr");
    fs::create_dir(&config_dir).unwrap();
    fs::write(
        config_dir.join("config.yaml"),
        r#"
thresholds:
  dag_density:
    excellent: 0.3
    good: 0.8
    fair: 1.2
    poor: 1.8
"#,
    )
    .unwrap();
    let config = load_from_dir(temp.path()).unwrap();
    assert_eq!(config.thresholds.dag_density.excellent, 0.3);
    // Defaults still work for unspecified fields
    assert_eq!(config.thresholds.fan_in_max.excellent, 3.0);
}

#[test]
fn test_load_disabled_metrics() {
    let temp = TempDir::new().unwrap();
    let config_dir = temp.path().join(".mdlr");
    fs::create_dir(&config_dir).unwrap();
    fs::write(
        config_dir.join("config.yaml"),
        r#"
disabled_metrics:
  - lcom
  - duplication_pct
"#,
    )
    .unwrap();
    let config = load_from_dir(temp.path()).unwrap();
    assert!(config.is_disabled("lcom"));
    assert!(config.is_disabled("duplication_pct"));
    assert!(!config.is_disabled("cyclomatic"));
}

// ...and a third test with the same setup block...
After:
/// Write `.mdlr/config.yaml` under `root` with the given contents.
fn write_config(root: &Path, yaml: &str) {
    let config_dir = root.join(".mdlr");
    fs::create_dir_all(&config_dir).unwrap();
    fs::write(config_dir.join("config.yaml"), yaml).unwrap();
}

#[test]
fn test_load_from_current_dir() {
    let temp = TempDir::new().unwrap();
    write_config(
        temp.path(),
        r#"
thresholds:
  dag_density:
    excellent: 0.3
    good: 0.8
    fair: 1.2
    poor: 1.8
"#,
    );
    let config = load_from_dir(temp.path()).unwrap();
    assert_eq!(config.thresholds.dag_density.excellent, 0.3);
    // Defaults still work for unspecified fields
    assert_eq!(config.thresholds.fan_in_max.excellent, 3.0);
}

#[test]
fn test_load_disabled_metrics() {
    let temp = TempDir::new().unwrap();
    write_config(
        temp.path(),
        r#"
disabled_metrics:
  - lcom
  - duplication_pct
"#,
    );
    let config = load_from_dir(temp.path()).unwrap();
    assert!(config.is_disabled("lcom"));
    assert!(config.is_disabled("duplication_pct"));
    assert!(!config.is_disabled("cyclomatic"));
}
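For a sense of how a duplication metric can surface repeated setup like this, here’s a minimal sketch that flags runs of identical lines appearing more than once. It is a naive sliding-window approach, assumed purely for illustration. mdlr’s actual duplication detection isn’t described in this post and is likely more sophisticated. The names duplicated_windows and SAMPLE are hypothetical.

```python
from collections import defaultdict


def duplicated_windows(source: str, window: int = 3) -> dict[str, list[int]]:
    """Map each repeated `window`-line chunk to the 1-based lines where it starts."""
    # Strip indentation and skip blank lines, but remember the original
    # line numbers so matches can be reported against the real file.
    lines = [
        (number, text.strip())
        for number, text in enumerate(source.splitlines(), 1)
        if text.strip()
    ]
    seen = defaultdict(list)
    for start in range(len(lines) - window + 1):
        chunk = "\n".join(text for _, text in lines[start:start + window])
        seen[chunk].append(lines[start][0])
    # Keep only chunks that occur at least twice
    return {chunk: starts for chunk, starts in seen.items() if len(starts) > 1}


SAMPLE = """\
fn test_a() {
    let temp = TempDir::new().unwrap();
    let config_dir = temp.path().join(".mdlr");
    fs::create_dir(&config_dir).unwrap();
    assert!(a);
}
fn test_b() {
    let temp = TempDir::new().unwrap();
    let config_dir = temp.path().join(".mdlr");
    fs::create_dir(&config_dir).unwrap();
    assert!(b);
}
"""

for chunk, starts in duplicated_windows(SAMPLE).items():
    print(starts)
    print(chunk)
# Prints [2, 8] followed by the three shared setup lines
```

A signal like this says nothing about how to fix the duplication. Turning it into a write_config helper is the part the agent did on its own.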
mdlr’s goal isn’t to be an authority on what makes code good. I’m also not claiming that optimizing software metrics will automatically give you perfectly clean code. The goals here are:
- Provide a tool that enables agents to think a little more like we do.
- Provide a tool that enables agents to produce directionally better code.
- Get people thinking more about the boundaries between what agents should do versus what traditional engineering can do.
I’m not convinced[11] that today’s agents are the final answer. The idea that we all just have to spin up as many agents as humanly possible to craft our world and review our work feels odd, especially when cheap deterministic tools can be so successful.
- See Simon Willison in this HN thread, Vicki Boykis on being more tired than the model, Nolan Lawson on using AI to write better code, more slowly, and Lalit Maganti on building SyntaQLite with AI.
- There’s a deeper version of this question: dirty code is sometimes a symptom of poor architecture. Could an agent work backwards from that signal and realize the architecture is wrong? This was one of the original goals of this tool and will be the next area of exploration.
- Insert every tool like go vet, eslint, clippy, flake8, etc. that is used daily, and some older ones too.
- I’m not saying these metrics and tools don’t exist, in fact many static analysis tools have these built in or available as community built plugins. There’s nothing truly novel or groundbreaking in this post.
- An extensive conversation about adding cognitive complexity to clippy.
- Really, Claude deserves all the credit. This is yet another vibe-coded tool. I did read it though.
- I broadly agree with Minas Karamanis in The machines are fine. I’m worried about us., but I also agree with this HN comment. Sometimes the minutiae of engineering can be deeply inconvenient, especially when I have a thing I need to do! I think what makes us valuable as engineers is under active metamorphosis, and I’m excited to see where we land.
- I know someone is going to say: “Complex things are complicated! Sometimes you need a big function!”. Yes, I agree. This is more a rule of thumb than a law. I still think it’s only in very rare scenarios where a monolith is better suited to the occasion.
- This is a kind of change I’ve found myself making many times when I used to write unit tests by hand. The fact that the tool got the agent to make this change automatically was ironically surprising.
- While I don’t agree that AI agents are a mistake, I do agree with Hotz in The Eternal Sloptember that today’s coding agents don’t code the way we do.