Major New Version of the hana-cli: How It Was Built With the Help of AI (and a Lot of Human Review)
This week I released the newest hana-cli version, and I wanted to share what actually happened behind the scenes. Not just the feature list, but the real process, since this was the first version built with significant AI assistance. The small steps, the rough edges, and the lessons learned.
Short version: code generation helped the project add more features in two months than I'd been able to deliver in the last two years, but quality still came from discipline, architecture, and a ton of testing.
What Changed in the 4.0 Release
The 4.0 release was not a small maintenance version. It was a structural and architectural shift for the tool.
In this release, I pushed major updates across runtime, architecture, test coverage, docs, and developer workflow. Looking at the git history since January 2026, the repo has seen 235 commits so far, with a huge concentration in February and March. This was not a patch release cycle. This was the kind of work that changes how a tool feels to use.
The version line moved to 4.x, which tells you something important: substantial internal changes happened. We migrated to Express 5, now that CAP supports it. The MCP server evolved from an experimental rough draft into something that actually feels almost production-ready, with stronger tool registration patterns, better resource handling, and practical guidance for setup and troubleshooting.
Under the hood, database connection handling had to be redesigned to better support MCP and multi-environment execution. CLI startup performance improved significantly through lazy loading and faster execution paths. This is an item that's bothered me personally for a while and that I've tackled in the past. We made the import and export commands much stronger, adding richer import options, better schema handling, and more thoughtful validation behavior, while keeping a strict focus on client-side processing. Internationalization was finally a focus (thanks to machine translation), with full translation bundles for additional locales: German, Spanish, French, Japanese, Korean, Portuguese, Simplified Chinese, Hindi, and Polish.
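The lazy-loading idea can be sketched in a few lines. This is an illustrative pattern, not hana-cli's actual code; the `lazy` helper and the command names are assumptions:

```javascript
// Illustrative sketch of the lazy-loading pattern (not hana-cli's real code):
// wrap each command's expensive setup in a thunk and run it only on first use,
// so startup never pays for commands the user does not invoke.
function lazy(loader) {
  let cached;
  // The loader runs once; every later call returns the cached result
  return () => (cached ??= loader());
}

// Hypothetical command module; in a real CLI this thunk would typically wrap
// a dynamic import() of the command's module file.
const loadTablesCommand = lazy(() => ({
  name: 'tables',
  handler: () => 'listing tables'
}));
```

The usage is transparent to the rest of the CLI: calling `loadTablesCommand()` the first time pays the load cost, and every subsequent call gets the cached module.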
One of the bigger quality-of-life improvements was consistency work across our very large command surface. This is where AI coding assistants really shine. They helped normalize parameter naming, improved how the tool suggests fixes for unknown commands and options, and made sure 200+ commands follow the same conventions. The docs got rebuilt from scratch with a modern VitePress site and better technical guides. And the tests scaled hard. We went from roughly 40 percent coverage on critical paths to roughly 85 percent. That matters more than people think. And this was actually the first thing I did before I started using AI. Without more tests I wouldn't have trusted AI to make some of the consistency changes. In the end, the unit and E2E tests will really help when introducing new features in the future, because after each small change you can run the entire battery of tests. If it's just me testing by hand, I often can't test every edge case or every command variation. With a strong test suite, I can be more confident that the changes I make do not break existing behavior.
The Architecture Shift That Matters Most
For me, the most important architecture story in 4.x is this: we moved from "a big bag of commands" toward "a more intentional platform". Instead of four separate operating modes living in isolation, CLI mode, interactive mode, API server mode, and MCP mode now feel like parts of one coherent system. That might sound abstract, but it changed how I approach new features.
The MCP server is a good example. It went from experimental command passthrough to something that actually feels like a real integration layer. Instead of duplicating command definitions, it dynamically reads from the CLI setup or from the help output. We built better registration patterns, better metadata handling, and practical troubleshooting guidance. When I want to add a new command now, I do not have to think about five different places to integrate it. The pattern is clear: dynamic discovery and reuse were the only way to keep scaling.
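A minimal sketch of that discovery idea, with made-up command shapes loosely modeled on yargs-style definitions (none of these names are hana-cli's real API):

```javascript
// Illustrative sketch of dynamic tool discovery: derive the MCP tool
// registrations from the existing CLI command definitions instead of
// maintaining a second, duplicated list. The command shape is an assumption.
const cliCommands = [
  { name: 'tables', describe: 'List database tables', options: { schema: { type: 'string' } } },
  { name: 'views', describe: 'List database views', options: { schema: { type: 'string' } } }
];

function toMcpTools(commands) {
  return commands.map((cmd) => ({
    name: `hana_${cmd.name}`,          // one MCP tool per CLI command
    description: cmd.describe,         // reuse the CLI help text as-is
    inputSchema: { type: 'object', properties: cmd.options }
  }));
}
```

With a mapping like this, adding a command to the CLI automatically makes it available as an MCP tool; there is no second registration step to forget.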
Connection and execution flows became more deliberate too. We handle profile-based and CDS-based cases much more consistently now, with stronger error messaging and fewer of those fragile edge cases that only show up in production. Performance was treated as a product feature, not an afterthought. Startup path optimization and lazy loading were not "nice to have". They changed day-to-day usability in a real way for common, frequently used commands.
How I Used GitHub Copilot in Small, Incremental Steps
I did not sit down and ask for one giant AI-generated rewrite. That sounded like a recipe for disaster to me.
Instead, I worked in small, intentional slices. Start with one behavior. Write a few targeted tests. Implement a narrow change. Re-run all the tests. Refine. Repeat. This rhythm sounds obvious until you are staring at the opportunity to ask Copilot to "just generate the whole feature", which is precisely when discipline matters most.
A key part of the workflow was starting with hand-coded examples first. I wrote the initial 7 to 8 unit tests manually to establish the baseline style and what really mattered. Those tests acted like calibration points. They showed the agent what good tests look like in this codebase. After that, I used the coding agent to extrapolate safely. Generate related scenarios. Extend edge-case coverage. Replicate patterns across similar commands. Fill out the repetitive but necessary test variations. That gave us speed without surrendering control.
The other way that I used AI was for parallel work. This project has long been a weekend/late night/passion project for me. This isn't part of my day job. But free time is precious and limited. I've had a long list of features and wishes for this project that just never seemed to get shorter. With AI coding agents I could kick off tasks while I'm working during the day. I could go to a meeting for TechEd 2026 while, in another window, an AI agent ran tests or renamed parameters. At the end of the day I could review the results, make adjustments, and keep the momentum going. It turned AI from a "weekend experiment" into a "day-to-day teammate". Although it was a teammate that sometimes felt like it had amnesia. Sometimes it would repeat a mistake it made a few days prior. Or it would completely forget a critical architectural pattern that I had painstakingly documented in the repo instructions. That is where the next section comes in.
Why the .github Folder Was Critical
If I had to point to one thing that made AI-assisted work sustainable in this repo, it is the .github customization layer. This is where the project-specific memory and operating model live. I no longer had a team of minions with short-term memory loss, but a team of agents with a shared playbook and role-specific instructions. I finally started seeing real consistency across multiple requests.
Here is how each layer of that system actually works.
copilot-instructions.md — the always-on baseline
This file is loaded into every single agent session, regardless of what you ask or which files are open. It is the unconditional context layer. In this project it does a few things: it redirects the agent to the project-overview.instructions.md file for deeper architecture context, it states the non-negotiables (Node.js ESM module format, i18n text bundles for every user-facing string, VitePress docs structure), and it sets the behavioral guardrails (minimal changes, no new dependencies without human approval, do not touch docs or workflows unless explicitly asked). Think of it as the onboarding document that every new teammate has to read before their first commit. The agent reads it before every conversation.
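A stripped-down sketch of what such a baseline file can look like. The wording below is illustrative, not the repo's actual file, but it reflects the rules described above:

```markdown
# Copilot Instructions

For architecture context, read `.github/instructions/project-overview.instructions.md` first.

## Non-negotiables
- Node.js ESM module format only (`import`/`export`, no `require`)
- Every user-facing string goes through the i18n text bundles
- Docs follow the VitePress structure under `docs/`

## Guardrails
- Make minimal changes; do not refactor beyond the task at hand
- No new dependencies without human approval
- Do not touch docs or workflows unless explicitly asked
```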
.github/instructions/*.instructions.md — file-scope rules with automatic injection
This repo has 25 instruction files, each targeted at a specific area: cli-command-development.instructions.md, testing.instructions.md, route-development.instructions.md, i18n-translation-management.instructions.md, mcp-server-development.instructions.md, vitepress-config-management.instructions.md, and so on. Each file has an applyTo front-matter field that is a glob pattern, for example bin/*.js or tests/**/*.Test.js or docs/.vitepress/**/*.{ts,js,json}. When the agent touches a file that matches a glob, the corresponding instruction file is automatically pulled into context without you having to ask. So when the agent is editing a command file in bin/, it automatically picks up the rules about yargs structure, required JSDoc, i18n bundle usage, error handling shape, and database connection patterns. When it is editing a test file, it automatically picks up the Mocha conventions, assertion style, and the utilities it should reuse. This is the mechanism that prevented AI "drift". The rules followed the files, not the conversation.
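For example, a file-scoped instruction file might look roughly like this. The `applyTo` front-matter field is the real mechanism; the rules shown here are a paraphrased sketch:

```markdown
---
applyTo: "bin/*.js"
---

# CLI Command Development

- Define commands with the standard yargs builder structure used elsewhere in `bin/`
- Pull all user-facing text from the i18n bundles; never hardcode strings
- Follow the shared error handling shape and database connection patterns
- Add or update the matching tests under `tests/`
```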
.github/agents/*.agent.md — specialized agent personas
Each agent file defines a named role that can be invoked explicitly. The YAML declares a name, a description, and a list of allowed tools. The body defines the agent's job, its constraints, its approach steps, and its expected output format. The CLI Agent (cli.agent.md) is scoped to bin/, routes/, utils/, and related tests and is explicitly constrained from touching documentation unless asked. The DocOps Agent (docops.agent.md) knows the docs taxonomy and sidebar structure. The MCP Agent (mcp.agent.md) enforces JSON-RPC compliance and tool registration patterns. The Tooling Agent (tooling.agent.md) handles automation scripts, npm scripts, and GitHub workflows with an eye on cross-platform compatibility. The Version Maintenance Agent orchestrates the full release choreography: version bumps across multiple package.json files, SAPUI5 version updates, changelog management, and documentation synchronization. The E2E Test Agent specializes in multi-step CLI testing scenarios. What this means in practice is that the agent does not have to figure out what role to play from scratch each time. It is handed a specific operating mode with explicit scope and constraints baked in.
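A condensed sketch of what an agent persona file can look like. The exact YAML field names and tool identifiers below are illustrative:

```markdown
---
name: cli
description: Develops and maintains CLI commands, routes, and related tests
tools: ['read', 'edit', 'runTests']
---

You are the CLI Agent. Work only inside `bin/`, `routes/`, `utils/`, and their tests.
Do not modify documentation unless explicitly asked.
Approach: read the matching instruction files, make the minimal change, run the tests,
and report what changed.
```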
.github/skills/*/SKILL.md — bundled, reusable workflows
Skills are a level above agents. Where an agent defines a persona, a skill defines a procedure. Each skill folder contains a SKILL.md that lists the steps to follow for a complete workflow, the instruction files that apply, and the references the agent should load. This repo has three skills: cli-command-development, docs-automation, and mcp-server-workflows. The cli-command-development skill, for example, tells the agent to load the CLI command instructions, the route instructions if a route file is involved, the i18n instructions, the testing instructions, and the test-utilities-reuse instructions, then work through them in sequence. Skills eliminate the overhead of re-explaining which instruction files are relevant and in what order to apply them. They encode accumulated workflow knowledge so the agent does not have to reconstruct it on the fly.
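In sketch form, a skill file reads like an ordered checklist (illustrative wording, not the repo's actual file):

```markdown
# Skill: cli-command-development

Steps:
1. Load `cli-command-development.instructions.md`
2. If a route file is involved, also load `route-development.instructions.md`
3. Apply `i18n-translation-management.instructions.md` for any new user-facing strings
4. Write tests per `testing.instructions.md`, reusing the shared test utilities
5. Report the changes and the tests that cover them
```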
.github/prompts/*.prompt.md — ready-made entry points for common tasks
Prompt files are parameterized task templates. Each one has a name, a description, an argument-hint that tells the user what input to provide, and a body that pre-fills the task framing the agent would otherwise have to reconstruct. For example, update-markdown-docs.prompt.md instructs the agent to read the existing doc style, follow the command-documentation rules when working in docs/02-commands/, keep changes minimal, and then report both the edits and a consistency cross-check. Other prompts cover updating GitHub workflows, VitePress config, MCP server code, tsconfig files, and package.json scripts. Using a prompt file instead of a freeform question means the agent starts from a well-defined task structure rather than having to interpret a vague request and risk misunderstanding the scope.
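A prompt file might look roughly like this (an illustrative sketch of the structure, not the actual file contents):

```markdown
---
name: update-markdown-docs
description: Update command documentation while preserving existing style
argument-hint: "Which doc page or command to update"
---

Read the existing style in `docs/02-commands/` before editing.
Follow the command-documentation rules for any file in that folder.
Keep changes minimal, then report both the edits and a consistency cross-check.
```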
.github/hooks/ and quality-gates.json — deterministic guardrails
This is the most technical layer. The hooks are three Node.js scripts — pre-tool-use.js, post-tool-use.js, and stop.js — wired up via quality-gates.json as lifecycle interceptors. The quality-gates.json file maps hook names to commands: PreToolUse runs pre-tool-use.js before every tool call, PostToolUse runs post-tool-use.js after every tool call that modifies files, and Stop runs stop.js when the agent finishes a session. The pre-hook inspects the tool call payload and blocks a list of risky commands outright: npm install, npm publish, git push, git reset --hard, rm -rf, and similar destructive or irreversible operations.
If the agent tries to run any of those, the hook intercepts it and returns a blocking response before the command ever executes. The post-hook checks what files changed in the working tree after each tool use, runs targeted lint checks if source files changed, and surfaces any new errors to the agent so it can self-correct before moving to the next step. The stop hook runs a final summary of what changed in the session. This made a real difference: instead of discovering that the agent had introduced a linting error three steps after the fact, the post-hook caught it immediately and the agent fixed it in place. It turned error correction from a cleanup pass at the end into a continuous loop within the session.
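The core of the pre-hook can be sketched as a simple blocklist check. This is a simplified stand-in, not the repo's actual hook, which inspects a JSON payload from the agent runtime:

```javascript
// Simplified sketch of a pre-tool-use guard: inspect the command an agent
// wants to run and block anything matching a destructive-operations list
// before it ever executes. The list and return shape are illustrative.
const BLOCKED = [
  'npm install', 'npm publish', 'git push', 'git reset --hard', 'rm -rf'
];

function checkToolCall(command) {
  const hit = BLOCKED.find((blocked) => command.includes(blocked));
  if (hit) {
    // Returning a blocking response stops the tool call before execution
    return { allow: false, reason: `Blocked risky command: ${hit}` };
  }
  return { allow: true };
}
```

The real hooks do more (payload parsing, lint runs, session summaries), but the principle is exactly this: a deterministic check that does not depend on the model remembering the rules.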
This structure sounds elaborate until you use it. What it does is turn Copilot from "autocomplete with opinions" into "a teammate with repo-specific understanding".
The agent stayed aligned with project patterns, especially around ESM syntax, i18n text handling, and command conventions. I could steer style and scope per task type instead of hoping the AI would guess correctly. Generated code drifted less from project norms. I got safer behavior around risky commands through hooks and guardrails. After you read a few horror stories online about AI wiping out a repo or an entire drive, you see why this is so important. And we preserved architectural context across long sessions and many edits, so the tool never forgot why things were designed a certain way.
Each agent came with matching instruction files that spelled out the specific rules, patterns, and anti-patterns for that domain. When I needed to add a new command, I asked the CLI Agent, and it immediately understood command structure, i18n requirements, test patterns, and where to hook into the command metadata. When I wanted to regenerate all command docs at once, I asked the DocOps Agent, and it knew the taxonomy categories, sidebar structure, and the exact generation scripts to use.
Fast Code Generation Is Real, but So Is Review Work
Code generation is fast. Review is not optional.
Even with strong prompts and repo instructions, I still spent many full days validating behavior changes across a 200-plus command surface. Checking cross-command consistency, particularly parameter naming, schema alignment, and error handling patterns. Running tests and fixing failures, especially the tricky cross-platform compatibility issues between Windows, Linux, and macOS. Reviewing docs and examples for completeness and accuracy. Manually verifying edge cases that looked correct but were not robust enough. Testing UI components and WebSocket behavior in the browser. Validating MCP tool registrations and JSON-RPC message flow.
This is the part some people skip when they tell the AI productivity story as pure vibe coding. The output quality that we hopefully have in hana-cli came from that review loop, not from generation alone.
The Testing Story
One of the biggest wins this cycle was test coverage. We went from something like 40 percent coverage on critical paths to around 85 percent. Did code generation write all those tests? No. But it made it fast enough that writing comprehensive tests became practical instead of aspirational.
The pattern that worked was to hand-write 7 to 8 seed tests that show the ideal test structure and what you actually care about validating. Then ask the agent to extend those, creating related scenarios, error cases, and variations. Verify every generated test by running it locally and checking it actually catches bugs. Keep the tests you trust, delete the rest. By the end, we had better coverage and faster iteration. Both mattered.
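To make the seed-test idea concrete, here is the distilled shape of one such test. The helper and command names are illustrative, and the real suite uses Mocha with shared utilities; the point is the pattern, asserting on exit code and output shape rather than just "it ran":

```javascript
// Stand-in for the shared helper that spawns the CLI in the real tests.
// A real implementation would execute the binary and capture its output.
function runCommand(name) {
  if (name === 'version') return { exitCode: 0, stdout: 'hana-cli 4.0.0' };
  return { exitCode: 1, stdout: `Unknown command: ${name}` };
}

// Seed test: check both the exit code and the shape of the output, so
// generated variations inherit meaningful assertions, not smoke checks.
function testVersionCommand() {
  const { exitCode, stdout } = runCommand('version');
  if (exitCode !== 0) throw new Error(`expected exit 0, got ${exitCode}`);
  if (!/\d+\.\d+\.\d+/.test(stdout)) throw new Error('expected a semver in output');
  return true;
}
```

Once a handful of tests like this exist, asking the agent to "write the same style of test for the `tables` command and its error cases" produces output that is easy to verify against the seed.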
What Took the Most Time
If you ask me where the real work happened, it was not generation. Creating the code went from taking most of the time to probably only about 5 percent of it. The real work was architecture decisions about how commands should register with the MCP server, plugin and hook design, and connection flow refactoring. It was writing the tests first, before or alongside the feature. It was validating that the generated code actually works by running it, breaking it on purpose, and fixing it. It was keeping docs in sync every time we changed command behavior. It was cross-platform testing to ensure Windows and Linux do not have surprises. And it was reviewing for consistency, making sure 200-plus commands all follow the same conventions.
That is all human work. Generation helped with the repetitive parts. Discipline came from the human part of the project.
My Honest Take After This Release
Copilot did not replace engineering here. It amplified it.
The wins came from combining clear architecture that you can reference and extend. Repo-specific agent customization so the tool understands your project, not just general code. Small incremental development instead of big risky changes. Human-authored seed tests so the AI has good examples to learn from. And relentless validation through testing and review of every piece.
If you are building serious tooling, that balance works. Skip the validation half and you are gambling.
For this release, I am proud of both the speed and the discipline. And yes, I am also proud that we made it feel better for users while modernizing a lot under the hood.
Special thanks to everyone who uses hana-cli and reports issues. That feedback drives where we invest next.