How AI Agents Run on Kubernetes
Summary
AI agents run on Kubernetes as ordinary workloads (Deployments, Services, HPAs), plus queues for long jobs and GPU nodes for self-hosted models. Includes finance examples.
Short answer: AI agents run on Kubernetes as ordinary workloads — usually a Deployment for the agent service plus Services for the MCP servers and models it calls. Kubernetes gives agents the same self-healing, scaling, and isolation as any service, with two extra concerns: bursty/long-running work (use queues and Jobs) and GPU scheduling for self-hosted models.
Part 17 and the finale of the series. Previous: Running MCP Servers on Kubernetes.
Introduction
Agentic systems sound exotic, but operationally an AI agent is software that calls tools and models in a loop. That makes Kubernetes an excellent home for them. This closing article ties the whole series together around AI infrastructure.
The problem
An agent that triages production incidents or reviews portfolios needs to run reliably, scale with demand, recover from crashes, reach its tools securely, and sometimes run heavy or long jobs. Running it as a script on one box gives you none of that. You need an orchestration platform — which you now understand.
Simple explanation
The agent is a service. It receives a request, reasons, and calls MCP servers and model endpoints, possibly over several steps. On Kubernetes you run it as a Deployment behind a Service, scale it with an HPA, and let the cluster restart it if it fails. Long or spiky work goes onto a queue and is processed by worker Pods or Jobs so a slow task never blocks the request path.
Official Kubernetes concept
- Deployment + Service: the agent API, stateless and replicated.
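- HPA: scale agent replicas with load, often on queue depth or other custom or external metrics. These need a metrics adapter (or a tool such as KEDA), because out of the box the HPA only sees CPU and memory.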
- Job / CronJob: run finite or scheduled agent tasks (batch analysis, nightly reconciliation).
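- Queue (SQS/Kafka) + worker Deployment: decouple long-running agent work from request handling. The queue itself is an external service, not a Kubernetes object.
- GPU node groups + GPU resources: schedule self-hosted models onto GPU Nodes by declaring an extended resource such as `nvidia.com/gpu` in the container's resource limits; managed model APIs need none of this.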
How it works
The agent Pod calls MCP servers (other Pods) by Service name to use tools and resources, and calls a model — either a managed API (Bedrock, OpenAI) or a self-hosted model Pod on GPU Nodes. For interactive requests, keep the loop on the agent Deployment. For heavy work, the agent enqueues a task; worker Pods consume the queue and scale independently. Self-healing, secrets, and network isolation all come from the platform patterns covered earlier in this series.
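To make "calls MCP servers by Service name" concrete, here is a minimal sketch of how the agent might build the in-cluster address. It assumes the default cluster domain `cluster.local` and plain HTTP inside the cluster; the Service name `portfolio-mcp`, the namespace `agents`, port 8080, and the 2-minute timeout are hypothetical values chosen for illustration.

```csharp
using System;
using System.Net.Http;

public static class ServiceAddress
{
    // Builds the in-cluster URL for a Kubernetes Service.
    // Assumes the default cluster domain "cluster.local" and plain HTTP.
    public static Uri For(string service, string ns, int port) =>
        new Uri($"http://{service}.{ns}.svc.cluster.local:{port}/");
}

public static class McpClientFactory
{
    // Hypothetical names: Service "portfolio-mcp" in namespace "agents" on port 8080.
    public static HttpClient CreatePortfolioClient() =>
        new HttpClient
        {
            BaseAddress = ServiceAddress.For("portfolio-mcp", "agents", 8080),
            // Tool calls made by an agent can be slow; this timeout is illustrative.
            Timeout = TimeSpan.FromMinutes(2),
        };
}
```

Within the same namespace the short name (`http://portfolio-mcp:8080/`) resolves as well. In an ASP.NET app you would more likely register the client through `IHttpClientFactory` than construct it directly.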
Finance example
A portfolio-review agent on EKS: an agent Deployment (4 replicas) answers client questions by calling the portfolio MCP server and a managed LLM. Heavy "analyze the entire book" requests are pushed to an SQS queue; a separate worker Deployment scales on queue depth and runs the long analysis as Kubernetes Jobs, writing results to a database. Broker and model credentials come from Secrets Manager via IRSA. If volume spikes at market open, the agent and worker HPAs scale both Deployments out; overnight, both scale back in to control cost. A crashed agent Pod is replaced automatically.
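The arithmetic behind "scales on queue depth" can be sketched as follows. This approximates what an autoscaler does with an average-value target on an external metric: divide the backlog by the target per worker, round up, and clamp to the configured bounds. It deliberately ignores the HPA's tolerance band and stabilization windows, and all numbers are illustrative assumptions.

```csharp
using System;

public static class WorkerScaling
{
    // Approximates the replica count implied by a queue-depth target:
    // ceil(queueDepth / targetPerWorker), clamped to [min, max].
    public static int DesiredReplicas(int queueDepth, int targetPerWorker, int min, int max)
    {
        if (targetPerWorker <= 0)
            throw new ArgumentOutOfRangeException(nameof(targetPerWorker));
        if (min > max)
            throw new ArgumentException("min must not exceed max");

        int raw = (int)Math.Ceiling(queueDepth / (double)targetPerWorker);
        return Math.Clamp(raw, min, max);
    }
}
```

For example, 120 queued reviews with a target of 10 per worker and bounds of 2 to 20 gives 12 workers; an empty queue falls back to the minimum of 2, and a backlog of 1000 is capped at 20.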
C# example
The agent service is a normal ASP.NET app that offloads long work to a queue:
    app.MapPost("/review/{accountId}", async (string accountId, IQueue queue) =>
    {
        // Fast path: enqueue heavy analysis, return immediately.
        await queue.Enqueue(new PortfolioReviewJob(accountId));
        return Results.Accepted();
    });

    public record PortfolioReviewJob(string AccountId);

A worker (`BackgroundService`) running as its own Deployment consumes the queue and scales independently of the request path.
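One way such a worker might look is sketched below. The article only shows the enqueue side, so the consumer interface `IJobSource`, its `DequeueAsync` method, and the placeholder `AnalyzeAsync` are assumptions made for illustration; a real implementation would wrap SQS or Kafka. The record is repeated so the sketch compiles on its own.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

public record PortfolioReviewJob(string AccountId);

// Hypothetical consumer-side abstraction (not part of the article's IQueue).
public interface IJobSource
{
    // Waits for the next job; returns null if none arrived within the
    // implementation's own polling timeout.
    Task<PortfolioReviewJob?> DequeueAsync(CancellationToken ct);
}

public sealed class ReviewWorker : BackgroundService
{
    private readonly IJobSource _source;
    private readonly ILogger<ReviewWorker> _log;

    public ReviewWorker(IJobSource source, ILogger<ReviewWorker> log)
    {
        _source = source;
        _log = log;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Runs until the host shuts down, for example when the Pod receives SIGTERM.
        while (!stoppingToken.IsCancellationRequested)
        {
            PortfolioReviewJob? job;
            try
            {
                job = await _source.DequeueAsync(stoppingToken);
            }
            catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
            {
                break; // Shutdown requested while waiting for work.
            }
            if (job is null) continue;

            try
            {
                await AnalyzeAsync(job, stoppingToken);
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                // Log and keep going so one bad job does not take the worker down.
                _log.LogError(ex, "Review failed for account {AccountId}", job.AccountId);
            }
        }
    }

    // Placeholder for the long-running analysis (model and MCP calls).
    private static Task AnalyzeAsync(PortfolioReviewJob job, CancellationToken ct) =>
        Task.Delay(TimeSpan.FromSeconds(1), ct);
}
```

The worker would be registered with `builder.Services.AddHostedService<ReviewWorker>()` in its own host and shipped as a separate image or entry point, so its replica count can differ from the API's.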
AWS example
On EKS: agent and worker images in ECR; model access via Bedrock (managed, no GPU Nodes) or a self-hosted model on a GPU node group scheduled with `nvidia.com/gpu` requests; queues via SQS; secrets from Secrets Manager, accessed via IRSA; observability via CloudWatch. Karpenter provisions GPU or high-memory Nodes only when agent workloads need them and removes them afterwards, which is important for controlling AI infrastructure cost.
Architecture diagram
Production reality
Agentic workloads break the assumptions of normal request/response services, and the operational details matter:
- Agent loops are long and bursty, not uniform. A single request may fan out into many model and tool calls over minutes. Sizing an HPA on CPU alone misleads; scale on queue depth or in-flight requests, and set generous timeouts.
- Token and GPU cost is the dominant line item. Idle GPU Nodes are extremely expensive. Use Karpenter to provision GPU capacity only when self-hosted-model demand exists, and prefer managed model APIs when utilization is low.
- Runaway agents are a real risk. A looping agent can hammer tools and rack up spend or trip rate limits. Enforce per-agent rate limits, step/iteration caps, and circuit breakers — in the app, since Kubernetes will not stop a "healthy" busy Pod.
- Determinism and audit. In finance, you must be able to reconstruct what an agent did and why. Log every tool call and model decision with correlation IDs; treat it as a regulated audit trail.
- Security: scope each agent's MCP access with NetworkPolicies and least-privilege IRSA so an experimental agent cannot reach production payment or settlement systems.
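The step cap and audit trail from the list above can be sketched in application code as follows. This is a minimal illustration, not a full guardrail: the `step` delegate stands in for one model or tool round trip, `audit` stands in for a structured logger, and all type names are hypothetical. Rate limits and circuit breakers are not shown.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Result of one reasoning step: a final answer, or intermediate output.
public record StepResult(bool IsFinal, string Output);

public sealed class StepLimitExceededException : Exception
{
    public StepLimitExceededException(string correlationId, int maxSteps)
        : base($"Agent run {correlationId} exceeded {maxSteps} steps") { }
}

public static class GuardedAgentLoop
{
    // Runs the agent loop with a hard cap on iterations.
    // 'step' performs one model/tool round trip (hypothetical delegate).
    // 'audit' receives one line per step so the run can be reconstructed later.
    public static async Task<string> RunAsync(
        string correlationId,
        int maxSteps,
        Func<int, CancellationToken, Task<StepResult>> step,
        Action<string> audit,
        CancellationToken ct = default)
    {
        for (int i = 1; i <= maxSteps; i++)
        {
            ct.ThrowIfCancellationRequested();
            StepResult result = await step(i, ct);
            audit($"{correlationId} step={i} final={result.IsFinal} output={result.Output}");
            if (result.IsFinal) return result.Output;
        }

        // Kubernetes sees a busy Pod as healthy, so the app must stop the loop itself.
        throw new StepLimitExceededException(correlationId, maxSteps);
    }
}
```

A caller would pass the same correlation ID to every downstream tool and model call, so the audit lines can be joined with the MCP servers' own logs.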
Interview questions
- How does an AI agent run on Kubernetes? As a stateless Deployment behind a Service, scaled with an HPA, calling MCP servers and models over the network.
- How do you handle long-running agent tasks? Decouple them with a queue and process them in worker Pods or Jobs so they do not block request handling.
- How are self-hosted models scheduled? On GPU node groups using GPU resource requests; managed model APIs avoid GPU scheduling entirely.
- How do agents reach their tools securely? They call MCP server Services by name, with credentials from Secrets Manager via IRSA and NetworkPolicies for isolation.
- How do you control AI infrastructure cost on Kubernetes? Right-size requests, scale Pods and Nodes with HPA and Karpenter, and provision GPU Nodes only on demand.
Key takeaways
- AI agents are ordinary Kubernetes workloads — Deployment, Service, HPA, secrets.
- Use queues + worker Deployments/Jobs for long or bursty agent work.
- Schedule self-hosted models on GPU Nodes; managed APIs skip that.
- Everything in this series — self-healing, scaling, security, cost — applies directly to agentic AI.
Series wrap-up
You now have the full picture: from What Is Kubernetes? through deploying MCP servers and AI agents on EKS. For more on the AI side, see From AI Model Consumer to AI Application Builder and the MCP series.
Frequently asked questions
- How does an AI agent run on Kubernetes?
  As a stateless Deployment behind a Service, scaled with an HPA, calling MCP servers and models over the network just like any other service.
- How do you handle long-running agent tasks?
  Decouple them with a queue (such as SQS or Kafka) and process them in worker Pods or Kubernetes Jobs so they do not block request handling.
- How are self-hosted AI models scheduled on Kubernetes?
  On GPU node groups using GPU resource requests. Managed model APIs like Bedrock avoid GPU scheduling entirely.
Related reading
Running MCP Servers on Kubernetes
An MCP server is just an HTTP service — run it on Kubernetes as a Deployment and Service with probes, autoscaling, and Secrets Manager. The MCP-specific decisions explained.
From AI Model Consumer to AI Application Builder
A practical guide for .NET engineers moving from chat prompts to RAG, MCP servers, agents, and agentic workflows — with security patterns, architecture diagrams, and platform mental models.
Kubernetes Concepts Every Staff Engineer Should Understand
Beyond YAML: the reconciliation model, resource requests, failure design, security, scaling, and cost trade-offs that staff engineers are expected to reason about.