COMMON IMPLEMENTATION PITFALL AND SOLUTION

Implementations of this interaction in AI apps often fall into a suboptimal trap: the agent hard-codes static knowledge of what the MCP server offers (tools, resources, prompts); that is, the developer (human or AI coding agent) bakes this knowledge into the agent at design time. This is distinct from the optimization practice of caching this knowledge on the agent at runtime for better performance.

Generally, I believe this information should be discovered dynamically at runtime (i.e., on startup, by either the agent or the MCP server, or upon the agent's first request to the server). It can then be cached on the agent for performance (so subsequent calls to the server can be readily constructed and executed), and periodically refreshed to minimize staleness.
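As a minimal sketch of this discover-then-cache pattern: the `discover_capabilities()` helper and the `list_tools` / `list_resources` / `list_prompts` server methods below are hypothetical stand-ins for whatever discovery calls your MCP client library actually exposes, not a real API.

```python
import time

def discover_capabilities(server):
    # Hypothetical discovery calls; real MCP clients expose equivalents
    # for listing the server's tools, resources, and prompts.
    return {
        "tools": server.list_tools(),
        "resources": server.list_resources(),
        "prompts": server.list_prompts(),
    }

class CapabilityCache:
    """Caches dynamically discovered MCP capabilities, refreshing after a TTL."""

    def __init__(self, server, ttl_seconds=300):
        self.server = server
        self.ttl = ttl_seconds
        self._capabilities = None
        self._fetched_at = 0.0

    def get(self):
        # Discover on first use, or re-discover once the cached copy is stale.
        if self._capabilities is None or time.time() - self._fetched_at > self.ttl:
            self._capabilities = discover_capabilities(self.server)
            self._fetched_at = time.time()
        return self._capabilities
```

The key point is that nothing about the server's capabilities appears in the agent's source code; only the discovery mechanism and the refresh policy do.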

PROMPT TO STEER THE CODING AGENT TO THE RIGHT IMPLEMENTATION

Falling into this hard-coding trap is exactly what initially happened in my project: that’s how the AI coding agent implemented the agent <-> MCP server interaction on its first attempt. Having discovered this mistake during code review, I steered my AI teammate in the right direction with the following architectural guidance prompt, which immediately achieved the desired result.

Even without domain knowledge of this app, a reader with sufficient technical background should find this prompt clearly understandable. It articulates the flow of communication between the actors being implemented (agent, LLM, MCP server) under the good, non-hard-coding practice. Hopefully it also demonstrates some attributes of good prompt engineering.

(Note:

  • The eventual implementation in the app is more sophisticated than what this prompt suggests, due to added features including MCP server capabilities caching and refresh on the agent, safety guardrail handling,… but the prompt captures the core communication logic between the actors involved, including dynamic MCP capabilities discovery.
  • This project is 100% vibe-coded from scratch. For some technical behind-the-scenes highlights, see here).

The communication between the actors, as specified in the architectural guidance prompt to the AI coding agent, is illustrated in the following flowchart. (Note: this chart is not part of the prompt; the prompt itself is just text, as seen next. The visual representation is included here only to aid your (human) understanding.)

And here’s the prompt itself (literally, no editing):

ME (Code review comment to the AI coding agent):

I have a couple of comments reviewing your changes to agent.py.

Your changes don’t make the MCP capabilities discovery fully dynamic. Ex: in process_question(), in the INSTRUCTIONS section, you still hard-code those for the tools, assuming this agent is statically aware of those tools in advance. The same flaw exists in extract_tool_calls().

I may or may not be right (correct me to the extent that I’m wrong and suggest alternatives) but I’d imagine the flow (for our currently scoped app) should be as follows (the current implementation follows some of these points, but deviates in some important ways):

  1. The user poses a question to the agent. The agent isn’t aware of any MCP capabilities, so needs to discover them fully in detail and forwards this info, the user question, and appropriate instructions to the LLM.
  2. The user question turns out generic and doesn’t rely on the MCP server; in other words, the LLM could handle it on its own.
  3. Or it could be mapped (neatly) to an available MCP tool. In which case, the LLM should build this tool call and return it to the agent. For this to happen, the instructions the agent sends to the LLM should contain the full signature of the tool (name, params, data types,…) so that the LLM could extract relevant inputs from the user question and build the precise tool call. That’s why in step 1, the agent should discover the MCP capabilities “fully in detail”, which the current implementation does not, because it doesn’t appear to discover the full signature of tools, and thus incapable of providing this info in the system prompt to the LLM.
  4. The user question could be mapped (neatly) to an available MCP resource. In which case, the LLM should return this constructed resource call.
  5. The LLM could “sense” that the user intends to explore some aspect covered by an available prompt (may be the question kind of heads toward that direction but not quite there yet (missing some necessary info,…), so the app can’t answer clearly,…). It returns to the agent the exact call to that prompt on the server.
  6. Whether a tool / resource / prompt call, the LLM, via its intelligence plus detailed instructions containing full server capabilities spec provided by the agent and the user question, will build it and return to the agent. The agent will just execute this call against the MCP server, provide the results to the LLM again to formulate the final answer for the user.
  7. Note: for the prompts, the MCP server just returns the templates to the agent; for the resources and tools, db queries typically need to be executed.
  8. Given the richness of our db, users could ask highly sophisticated and comprehensive questions. There will potentially be many corresponding prompt templates to guide users, and respective supporting tools (and resources). I mean there will be (tight) connection between prompts and tools. This doesn’t yet appear to be the case in this 1st round of implementation, which is fine. Keep this in mind when we evolve the app.
  9. If user question is independent of the MCP server capabilities and only dependent on the LLM’s native capability, the agent should return the 1st LLM response immediately as the final answer.

I think this flow would leverage more the intelligence of the LLM, reduce the burden on the agent, decouple more the agent from the server, and make the server capabilities discovery fully dynamic. Thoughts?
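The flow described in the prompt above could be sketched roughly as follows. Everything here is a hypothetical illustration, not the app's actual code: `llm_complete`, `execute_capability_call`, and the JSON shape the LLM uses to return a constructed call are all assumed names and conventions.

```python
import json

def build_instructions(capabilities):
    # Embed the full capability spec (names, params, data types) in the
    # system prompt, so the LLM can construct a precise tool, resource,
    # or prompt call (steps 1 and 3 in the flow above).
    return (
        "You may answer directly, or return a JSON object "
        '{"call": {"kind": ..., "name": ..., "arguments": ...}} '
        "using one of these server capabilities:\n"
        + json.dumps(capabilities, indent=2)
    )

def answer(question, capabilities, llm_complete, execute_capability_call):
    """One round of the agent flow: instruct LLM -> maybe call server -> answer."""
    first = llm_complete(build_instructions(capabilities), question)
    try:
        call = json.loads(first).get("call")
    except (json.JSONDecodeError, AttributeError):
        call = None
    if call is None:
        # Generic question, independent of the MCP server: return the
        # LLM's first response as the final answer (steps 2 and 9).
        return first
    # Tool / resource / prompt call: the agent just executes it against
    # the server, then hands the results back to the LLM to formulate
    # the final answer for the user (step 6).
    result = execute_capability_call(call)
    return llm_complete(
        f"Server returned: {result}. Formulate the final answer.", question
    )
```

Note how the agent stays thin: it builds instructions from discovered capabilities, executes whatever call the LLM constructs, and never encodes knowledge of any specific tool.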

SPEC / CONTEXT / PROMPT ENGINEERING: STRATEGIC / OPERATIONAL / TACTICAL LEVEL

(This is a point separate from the main theme of this post.)

To execute a vibe coding project (or, more broadly, a heavily AI-assisted software development project) more effectively, I see “working with AI” as operating at three levels of engineering (of which “prompting” is one activity): spec engineering, context engineering, and prompt engineering.

Metaphorically, if the execution of such a non-trivial project is viewed as “conducting a war”, these engineering levels map roughly, in military terms, to the strategic, operational, and tactical levels, respectively. The prompt in the previous section falls at the tactical level; of course, as in the military, the levels are not always clear-cut.

I might elaborate on my evolving conception of these three levels of engineering, as applied to executing an AI project, in a future post. For now, just know that:

Strategy without Tactics is the slowest route to victory.

Tactics without Strategy is noise before defeat.

Sun Tzu

Because:

Vision, Strategy, and Diplomacy win Wars.

Planning, Coordination, and Logistics win Campaigns.

Tactics win Battles.