
Building AERO

At AETHER, we’ve experimented with building agentic pipelines for various use cases. During our internal Agents hackathon in September, we were tasked with implementing an agentic system for machine learning research. What began as a few simple agents grew into a large-scale systems engineering effort, with plenty of blood, sweat, and tears invested along the way, and an open-source package for all to use at the end. This is our story.


Building AERO

At first glance, AERO (Automated Exploration, Research & Orchestration) looks clean. It has modular workflows, a tidy README, and even a PyPI release. But getting it there in just four weeks was far from smooth. What follows is the real story of how we built it, why we set out to cover the entire research cycle, what broke along the way, and how we eventually packaged it into something open-source ready.

When we started, the challenge was open-ended: build something useful with LLMs. Rather than spin up another chatbot, we decided to aim for the entire end-to-end research process. That meant starting from specific tasks and breaking them into properties like variable length or time invariance, mapping those to model architectures, and, for broader topics, generating research plans that were concrete and time-bounded. From there, the loop had to continue into experiment design and runnable code, analyzing the results and planning follow-up experiments, and finally assembling everything into a report that could be slotted into different conference templates. In other words, we didn’t just want one shiny feature; we wanted the full loop of research, from framing to execution to reporting.
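
To make the framing half concrete, here is a minimal sketch of what mapping task properties to candidate architectures could look like. The property names, model families, and the `candidate_architectures` helper are illustrative assumptions, not AERO’s actual schema.

```python
from dataclasses import dataclass

# Illustrative only: these property names and mappings are assumptions,
# not AERO's actual schema.
@dataclass
class TaskProperties:
    variable_length: bool   # sequences whose length differs per sample
    time_invariant: bool    # patterns that do not depend on absolute time

def candidate_architectures(props: TaskProperties) -> list[str]:
    """Map task properties to model families worth exploring first."""
    candidates: list[str] = []
    if props.variable_length:
        candidates += ["Transformer", "RNN/LSTM"]
    if props.time_invariant:
        candidates += ["Temporal CNN"]
    return candidates or ["MLP baseline"]

print(candidate_architectures(TaskProperties(variable_length=True, time_invariant=False)))
```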

The first two weeks were powered almost entirely by vibe coding¹. We didn’t obsess over architecture diagrams or clean abstractions; we wrote just enough code to pass information between stages and prove the concept. Model Researcher was able to surface task properties and recommend model families, Research Planner generated three-month roadmaps with milestones, and Experiment Designer was already producing Python stubs that actually ran. The codebase was messy, but it moved ideas forward quickly, which mattered more than polish at that stage.
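
For flavor, the early glue was little more than passing a shared dictionary between stages, each one adding its piece before handing it on. The stage functions and state keys below are hypothetical stand-ins rather than AERO’s real interfaces.

```python
# Hypothetical stand-ins for the early stages; the real agents were LLM-backed,
# but the "pass a dict between hops" structure was roughly this.
def model_researcher(state: dict) -> dict:
    state["model_families"] = ["Transformer", "Temporal CNN"]   # stubbed output
    return state

def research_planner(state: dict) -> dict:
    state["roadmap"] = [f"Month {i}: evaluate {m}"
                        for i, m in enumerate(state["model_families"], start=1)]
    return state

def experiment_designer(state: dict) -> dict:
    state["code_stub"] = "def train(): ...  # runnable experiment entry point"
    return state

state = {"task": "forecast variable-length sensor streams"}
for stage in (model_researcher, research_planner, experiment_designer):
    state = stage(state)
print(state["roadmap"])
```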

By weeks two and three, the costs of this speed started to show. Long chains would time out in frustrating ways, so we added retries and broke workflows into smaller hops. More painfully, API costs crept up. Over four weeks we spent roughly $300 on calls, and one especially bad day saw $25 disappear in just three runs after we accidentally sent a whole pipeline through a premium model. We fixed that by introducing cheaper defaults, model overrides, and caching, but it was a hard-earned lesson. Meanwhile, environment and dependency juggling ate far too much time: .env files refused to stay consistent, imports broke depending on which folder you ran from, and we had to wrestle with version pinning more than once. Still, the pieces were starting to come together. Experiment Designer moved beyond placeholders and grounded its code in referenced methods, the Experimentalist could analyze results and propose realistic follow-ups, and Report Writer began producing structured drafts with references, complete with the option to target different venue templates.
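
The guardrails boiled down to three things: a cheap default model with an explicit override, a small on-disk cache, and bounded retries per hop. Here is a rough sketch under those assumptions; the default model name and the `cached_call` wrapper are placeholders, not AERO’s actual client.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Callable

# Hypothetical guardrails: a cheap default model, a tiny on-disk cache, and
# bounded retries so one flaky hop doesn't sink the whole chain.
DEFAULT_MODEL = "gpt-4o-mini"        # assumption: any inexpensive model as the default
CACHE_DIR = Path(".aero_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_call(call: Callable[[str, str], str], prompt: str,
                model: str = DEFAULT_MODEL, retries: int = 3) -> str:
    """Wrap an LLM client call with caching and retries; `call` is your own client."""
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():                      # answered before, reuse for free
        return json.loads(cache_file.read_text())["response"]
    for attempt in range(1, retries + 1):
        try:
            response = call(prompt, model)
            cache_file.write_text(json.dumps({"response": response}))
            return response
        except TimeoutError:
            if attempt == retries:
                raise
            time.sleep(2 ** attempt)             # back off before retrying the short hop
```

Caching on the (model, prompt) pair also meant that re-running a pipeline after a late failure cost nothing for the hops that had already succeeded.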

The final week was the real turning point. Building the workflows was one thing, but turning AERO into something other people could use meant packaging. That required restructuring the whole repo into a proper Python package, cleaning up imports, killing circular dependencies, and writing documentation that didn’t assume you could DM us for help. We also pinned dependencies tightly, introduced sensible defaults so nobody accidentally repeated our $25 mistake, and added logging so failures didn’t silently tank entire runs. In short, we had to grow the project from a hackathon prototype into open-source software.
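
The logging change, for example, amounted to wrapping every stage so a failure surfaces loudly instead of silently sinking the rest of the run. A simplified sketch, with the stage structure assumed rather than taken from AERO’s internals:

```python
import logging

# A rough sketch of run-level logging: each stage is announced, and a failure
# is logged with a traceback before it stops the run. Stage names and the
# `stages` structure are illustrative, not AERO's internals.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s: %(message)s")
log = logging.getLogger("aero.run")

def run_pipeline(stages, state):
    """Run (name, stage_fn) pairs in order, logging progress and failures."""
    for name, stage in stages:
        log.info("starting stage %s", name)
        try:
            state = stage(state)
        except Exception:
            log.exception("stage %s failed; stopping run with partial state", name)
            raise
        log.info("finished stage %s", name)
    return state
```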

Looking back, a few lessons stand out. Vibe coding was the only reason we got a working loop so quickly, but without packaging, it would have stayed a hackathon artifact. Costs are real, and budgeting API calls from day one would have saved us stress and money. Designing for short, checkpointed chains made failures recoverable rather than catastrophic. And most importantly, the framing side of research (problem decomposition and planning) has to feed directly into execution, just as execution has to flow back into refined plans. Without both halves, you end up with either abstract slides or disconnected code, neither of which solves the problem.

Today, AERO is installable with a single pip install aeroml and ready to use, but behind the tidy package is four weeks of trial, error, and caffeine. It was built by three interns, Jacob Wong, Ethan Lau, and Charmaine Chua, who learned the hard way that prototypes may win hackathons, but packaging is what makes them last. Special thanks also go to Prannaya, who helped us take the leap from “it runs on our laptops” to a proper open-source package.

Footnotes

  1. Vibe coding is an AI-driven software development method where developers use natural language prompts to direct large language models (LLMs) to generate code, allowing for rapid prototyping.