Expert knowledge of the Agent Operating System (AOS) architecture, components, and design patterns. Provides deep understanding of how AOS works as a complete operating system for AI agents.
Installation
Details
Usage
After installing, this skill will be available to your AI coding assistant.
Verify installation:
npx agent-skills-cli list
Skill Instructions
name: aos-architecture
description: Expert knowledge of the Agent Operating System (AOS) architecture, components, and design patterns. Provides deep understanding of how AOS works as a complete operating system for AI agents.
Agent Operating System Architecture
Description
Expert knowledge of the Agent Operating System (AOS) architecture, components, and design patterns. This skill provides deep understanding of how AOS works as a complete operating system for AI agents.
When to Use This Skill
- Understanding AOS architecture and design
- Working across multiple AOS components
- Making architectural decisions
- Integrating new components
- Understanding component interactions
- Debugging cross-component issues
Core Architectural Concept
Operating System Paradigm
AOS is designed as an operating system for AI agents rather than a framework or library. Its parts map onto familiar OS concepts:
- Kernel: Core orchestration engine
- System Services: Storage, messaging, auth, etc.
- Applications: Business agents (CEO, CFO, etc.)
- System Calls: Python APIs to access services
- Processes: Perpetual agents (like daemons)
- IPC: Inter-agent messaging
Perpetual vs Task-Based (The Fundamental Difference)
Task-Based Frameworks (Traditional):
Create Agent → Execute Task → Terminate → State Lost
Perpetual Architecture (AOS):
Register Agent → Runs Forever → Event-Driven → State Persists
This is the defining characteristic of AOS architecture.
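The difference can be sketched without AOS itself. The example below is a minimal illustration using only the standard library; `PerpetualCounter`, its `inbox` queue, and the `None` shutdown sentinel are hypothetical names chosen for the demo and are not part of the AOS API.

```python
import asyncio


class PerpetualCounter:
    """Hypothetical agent: started once, keeps state across events."""

    def __init__(self) -> None:
        self.inbox: asyncio.Queue = asyncio.Queue()
        self.events_seen = 0  # state that outlives any single event

    async def run(self) -> None:
        # The agent sleeps on the queue while idle and wakes per event.
        while True:
            event = await self.inbox.get()
            if event is None:  # sentinel used only to end the demo
                break
            self.events_seen += 1


async def demo() -> int:
    agent = PerpetualCounter()
    loop_task = asyncio.create_task(agent.run())
    for event in ("budget.updated", "report.requested", "budget.updated"):
        await agent.inbox.put(event)
    await agent.inbox.put(None)
    await loop_task
    return agent.events_seen
```

A task-based design would build a fresh object per event and lose `events_seen` each time; here one instance handles all three events.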
Architectural Layers
1. Application Layer
Business applications and agents that use AOS services.
Components:
- PurposeDrivenAgent (fundamental building block)
- Custom business agents (CEO, CFO, CTO)
- Business applications using AOS
Location: External applications, examples/
Key Pattern: Applications extend base classes and use system APIs
from AgentOperatingSystem.agents import PurposeDrivenAgent

class CEOAgent(PurposeDrivenAgent):
    """Business-specific agent using AOS."""

    def __init__(self):
        super().__init__(
            agent_id="ceo",
            purpose="Strategic oversight",
            adapter_name="ceo"
        )
2. System Service Layer
Services that agents use (like OS system calls).
Core Services:
- Authentication & Authorization (auth/)
- Storage Management (storage/)
- Message Bus (messaging/)
- Environment & Config (environment/, config/)
- State Persistence (via ContextMCPServer)
Advanced Services:
- ML Pipeline (ml/)
- MCP Integration (mcp/)
- Knowledge & Learning (knowledge/, learning/)
- Governance & Audit (governance/)
Cross-Cutting Services:
- Reliability (reliability/)
- Observability (observability/)
- Extensibility (extensibility/)
- Platform Services (platform/)
Location: src/AgentOperatingSystem/
Key Pattern: Services provide clean APIs for agents
3. Kernel Layer
Core orchestration and agent lifecycle management.
Components:
- Orchestration Engine (orchestration/)
- Agent Lifecycle Manager (agents/)
- State Machine Manager (reliability/state_machine.py)
Location: src/AgentOperatingSystem/orchestration/, src/AgentOperatingSystem/agents/
Key Pattern: Event-driven, asynchronous operation
Key Components Deep Dive
Orchestration Engine
Purpose: Kernel that manages agent lifecycles and workflows.
Responsibilities:
- Agent registration and discovery
- Workflow state management
- Dependency resolution
- Resource scheduling
- Event routing
Key Files:
- src/AgentOperatingSystem/orchestration/
- src/AgentOperatingSystem/agent_operating_system.py
Usage:
from AgentOperatingSystem import AgentOperatingSystem, AOSConfig

# config is an AOSConfig instance (see Configuration Architecture)
aos = AgentOperatingSystem(config)
await aos.initialize()

# Register agent
await aos.register_agent(agent)

# Process events
await aos.process_event(event)
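To make the registration and event-routing responsibilities concrete, here is a toy sketch of the idea. `MiniOrchestrator`, the `targets` key on events, and the handler signature are all assumptions made for illustration; the real engine in orchestration/ may work quite differently.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Handler = Callable[[dict], Awaitable[str]]


class MiniOrchestrator:
    """Hypothetical registry that routes events to registered agents."""

    def __init__(self) -> None:
        self._agents: Dict[str, Handler] = {}

    def register_agent(self, agent_id: str, handler: Handler) -> None:
        if agent_id in self._agents:
            raise ValueError(f"agent already registered: {agent_id}")
        self._agents[agent_id] = handler

    async def process_event(self, event: dict) -> List[str]:
        # Route to the listed targets, or broadcast to all agents.
        targets = event.get("targets") or list(self._agents)
        unknown = [t for t in targets if t not in self._agents]
        if unknown:
            raise KeyError(f"unknown agents: {unknown}")
        # Handlers run concurrently; results keep the target order.
        results = await asyncio.gather(
            *(self._agents[t](event) for t in targets)
        )
        return list(results)
```

Broadcasting by default keeps the sketch short; a production router would more likely match on event type or subscriptions.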
Agent Lifecycle Manager
Purpose: Process management for agents (like the init system in Linux).
Responsibilities:
- Agent provisioning and initialization
- Health monitoring and auto-recovery
- Capability tracking
- State preservation via ContextMCPServer
Key Files:
- src/AgentOperatingSystem/agents/base_agent.py
- src/AgentOperatingSystem/agents/perpetual_agent.py
- src/AgentOperatingSystem/agents/purpose_driven_agent.py
Agent States:
- Created
- Initialized (ContextMCPServer created)
- Running (event loop active)
- Sleeping (idle, waiting for events)
- Terminated (deregistered)
Message Bus
Purpose: Inter-Process Communication (IPC) for agents.
Responsibilities:
- Topic-based routing
- Message delivery guarantees
- Conversation management
- Azure Service Bus integration
Key Files:
- src/AgentOperatingSystem/messaging/
- src/AgentOperatingSystem/messaging/servicebus_manager.py
Patterns:
# Publish message
await message_bus.publish(topic="agent.events", message=data)
# Subscribe to topic
await message_bus.subscribe(topic="agent.events", handler=my_handler)
# Agent-to-agent communication
await send_agent_message(from_agent, to_agent, message)
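Topic-based routing can be illustrated with an in-memory stand-in. The `publish` and `subscribe` keyword names follow the pattern above; `InMemoryBus` itself is a hypothetical class with no delivery guarantees, persistence, or Azure Service Bus integration.

```python
from collections import defaultdict
from typing import Any, Awaitable, Callable, DefaultDict, List

Handler = Callable[[Any], Awaitable[None]]


class InMemoryBus:
    """Hypothetical topic router: each topic fans out to its handlers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    async def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    async def publish(self, topic: str, message: Any) -> int:
        # Copy the list so handlers may subscribe during delivery.
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            await handler(message)
        return len(handlers)  # number of handlers that received it
```

Publishing to a topic with no subscribers returns 0 instead of raising, so publishers stay decoupled from subscribers.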
Storage Service
Purpose: Unified file system abstraction.
Responsibilities:
- Azure Blob Storage (objects)
- Azure Table Storage (structured data)
- Azure Queue Storage (message queues)
- Cosmos DB (documents)
- Backend-agnostic interface
Key Files:
src/AgentOperatingSystem/storage/
Usage:
from AgentOperatingSystem.storage import AzureBlobStorage
storage = AzureBlobStorage(connection_string)
await storage.upload("container", "blob", data)
data = await storage.download("container", "blob")
MCP Integration
Purpose: Model Context Protocol for tool/resource access.
Responsibilities:
- MCP client/server implementation
- Tool discovery and execution
- Resource access management
- ContextMCPServer for state preservation
Key Files:
- src/AgentOperatingSystem/mcp/
- src/AgentOperatingSystem/mcp/context_server.py
Pattern:
# ContextMCPServer for agent state
agent.context_server = ContextMCPServer(agent_id)
await agent.context_server.initialize()
await agent.context_server.save_state(state)
ML Pipeline
Purpose: Machine learning infrastructure.
Responsibilities:
- Azure ML integration
- LoRA adapter management
- Model versioning and deployment
- Inference with caching
- DPO (Direct Preference Optimization) training
Key Files:
- src/AgentOperatingSystem/ml/
- examples/dpo_training_example.py
Governance Service
Purpose: Enterprise compliance and audit.
Responsibilities:
- Tamper-evident audit logging
- Policy enforcement
- Risk registry
- Decision rationale tracking
Key Files:
src/AgentOperatingSystem/governance/
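One common way to make an audit log tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below shows that general technique with the standard library. It is not taken from governance/, and `append_entry`, `verify`, and the entry layout are hypothetical.

```python
import hashlib
import json
from typing import List

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always hashes identically.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append_entry(log: List[dict], record: dict) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append(
        {"record": record, "prev": prev_hash, "hash": _digest(prev_hash, record)}
    )


def verify(log: List[dict]) -> bool:
    """Return True only if no entry was altered, removed, or reordered."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Editing any earlier record changes its digest and breaks every later link, so `verify` detects the change. Truncation of the tail is not detected unless the latest hash is also stored somewhere the writer cannot modify.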
Observability Service
Purpose: System monitoring and tracing.
Responsibilities:
- Metrics collection (counters, gauges, histograms)
- Distributed tracing (OpenTelemetry)
- Structured logging
- Alert management
Key Files:
- src/AgentOperatingSystem/observability/
- src/AgentOperatingSystem/monitoring/
Usage:
from AgentOperatingSystem.observability import metrics, tracing
# Metrics
metrics.increment("agent.events.processed")
metrics.histogram("agent.processing.duration", duration)
# Tracing
with tracing.span("process_event"):
    await process_event(event)
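For readers unfamiliar with counters and histograms, the sketch below shows roughly what such a metrics object records. The `increment` and `histogram` method names mirror the usage above; `MiniMetrics` and its `timed` helper are hypothetical and do not export to any backend.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import DefaultDict, Iterator, List


class MiniMetrics:
    """Hypothetical in-process registry for counters and histograms."""

    def __init__(self) -> None:
        self.counters: DefaultDict[str, int] = defaultdict(int)
        self.samples: DefaultDict[str, List[float]] = defaultdict(list)

    def increment(self, name: str, amount: int = 1) -> None:
        self.counters[name] += amount

    def histogram(self, name: str, value: float) -> None:
        # A real histogram would bucket values; this keeps raw samples.
        self.samples[name].append(value)

    @contextmanager
    def timed(self, name: str) -> Iterator[None]:
        # Records the elapsed wall time even if the body raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.histogram(name, time.perf_counter() - start)
```

Wrapping event handling in `with metrics.timed("agent.processing.duration"):` yields one duration sample per event.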
Data Flow
Event Processing Flow
1. Event arrives (HTTP, Service Bus, Timer)
↓
2. Azure Functions receives event
↓
3. AOS initialized/retrieved
↓
4. Event routed to appropriate agent(s)
↓
5. Agent(s) process event (async)
↓
6. State persisted (ContextMCPServer)
↓
7. Result returned/published
↓
8. Agent returns to sleep state
Agent Initialization Flow
1. Agent created (PurposeDrivenAgent)
↓
2. await agent.initialize()
├─ ContextMCPServer created
├─ State loaded from storage
└─ Resources allocated
↓
3. Agent registered with orchestrator
↓
4. Agent enters event loop (perpetual)
Inter-Agent Communication Flow
Agent A  →  Message Bus  →  Agent B
   ↓             ↓             ↓
Save State     Route       Load State
(Context)     Message      (Context)
   ↓             ↓             ↓
Response  ←   Message   ←   Process
   ↓            Bus            ↓
 Update                      Update
 State                       State
Design Patterns
1. Perpetual Process Pattern
Agents are long-running processes, not transient tasks.
# Register once, runs forever (use concrete implementation)
agent = LeadershipAgent(...) # Or GenericPurposeDrivenAgent
await agent.initialize() # ContextMCPServer created
manager.register_agent(agent)
# Agent now perpetual, responds to events
2. Event-Driven Architecture
Agents wake on events, sleep when idle.
async def handle_event(self, event):
    """Wake up, process, go back to sleep."""
    result = await self.process(event)
    await self.save_state()  # Persist via ContextMCPServer
    return result
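The wake, process, persist cycle matters because it lets state survive a restart. The sketch below demonstrates that under simplifying assumptions: `DictContextStore` is a hypothetical in-memory stand-in for ContextMCPServer, and `TallyAgent` is an illustrative agent, not an AOS class.

```python
import json
from typing import Dict


class DictContextStore:
    """Hypothetical stand-in for a context server: JSON keyed by agent id."""

    def __init__(self) -> None:
        self._blobs: Dict[str, str] = {}

    async def save_state(self, agent_id: str, state: dict) -> None:
        self._blobs[agent_id] = json.dumps(state)

    async def load_state(self, agent_id: str) -> dict:
        return json.loads(self._blobs.get(agent_id, "{}"))


class TallyAgent:
    """Illustrative agent that counts how often each event type arrives."""

    def __init__(self, agent_id: str, store: DictContextStore) -> None:
        self.agent_id = agent_id
        self.store = store
        self.state: dict = {}

    async def initialize(self) -> None:
        self.state = await self.store.load_state(self.agent_id)

    async def handle_event(self, event: str) -> int:
        self.state[event] = self.state.get(event, 0) + 1
        await self.store.save_state(self.agent_id, self.state)  # persist
        return self.state[event]


async def demo() -> int:
    store = DictContextStore()
    first = TallyAgent("cfo", store)
    await first.initialize()
    await first.handle_event("invoice.received")
    # Simulate a restart: a new instance reloads the persisted state.
    second = TallyAgent("cfo", store)
    await second.initialize()
    return await second.handle_event("invoice.received")
```

Because state is saved after every event, the second instance resumes the count at 2 instead of starting over.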
3. State Machine Pattern
Deterministic state transitions.
from AgentOperatingSystem.reliability import StateMachine
states = ["created", "initialized", "running", "sleeping"]
transitions = [
    ("created", "initialize", "initialized"),
    ("initialized", "start", "running"),
    ("running", "idle", "sleeping"),
]
sm = StateMachine(states, transitions)
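A table-driven state machine of this shape can be sketched in a few lines. `SimpleStateMachine`, its `fire` method, and `InvalidTransition` are hypothetical names; the actual `StateMachine` in reliability/ may expose a different interface.

```python
from typing import Dict, Iterable, Tuple


class InvalidTransition(Exception):
    """Raised when a trigger is not allowed from the current state."""


class SimpleStateMachine:
    def __init__(
        self,
        states: Iterable[str],
        transitions: Iterable[Tuple[str, str, str]],
        initial: str,
    ) -> None:
        self._states = set(states)
        if initial not in self._states:
            raise ValueError(f"unknown initial state: {initial}")
        # Map (current state, trigger) to the next state.
        self._table: Dict[Tuple[str, str], str] = {}
        for source, trigger, target in transitions:
            if source not in self._states or target not in self._states:
                raise ValueError(f"unknown state in {(source, trigger, target)}")
            self._table[(source, trigger)] = target
        self.state = initial

    def fire(self, trigger: str) -> str:
        key = (self.state, trigger)
        if key not in self._table:
            raise InvalidTransition(f"{trigger!r} not allowed from {self.state!r}")
        self.state = self._table[key]
        return self.state
```

With the states and transitions listed above, firing `initialize`, `start`, then `idle` ends in `sleeping`. That table has no transition out of `sleeping`, so a wake-up trigger would need to be added for agents to resume.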
4. Circuit Breaker Pattern
Fault tolerance and resilience.
from AgentOperatingSystem.reliability import CircuitBreaker
breaker = CircuitBreaker(
    failure_threshold=5,
    timeout=60,
    expected_exception=ServiceError
)

async with breaker:
    result = await external_service.call()
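The sketch below shows one way such a breaker could behave as an async context manager. The three constructor parameters mirror the snippet above; `MiniCircuitBreaker`, `CircuitOpenError`, and the injectable `clock` are assumptions made for illustration and testing, not the library's implementation.

```python
import time
from typing import Callable, Optional, Type


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class MiniCircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        timeout: float = 60.0,
        expected_exception: Type[BaseException] = Exception,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.expected_exception = expected_exception
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # After `timeout` seconds, one trial call is let through.
        return self._clock() - self._opened_at < self.timeout

    async def __aenter__(self) -> "MiniCircuitBreaker":
        if self.is_open:
            raise CircuitOpenError("circuit is open; call rejected")
        return self

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        if exc_type is None:
            # Success closes the circuit and clears the failure count.
            self._failures = 0
            self._opened_at = None
        elif issubclass(exc_type, self.expected_exception):
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
        return False  # never swallow the caller's exception
```

While the circuit is open, callers fail fast with `CircuitOpenError` instead of waiting on a service that is already struggling.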
5. Repository Pattern
Backend-agnostic storage access.
# Same interface, different backends
storage = AzureBlobStorage(...) # or
storage = AzureTableStorage(...) # or
storage = CosmosDBStorage(...)
await storage.save(key, value)
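Backend-agnostic access usually rests on a shared interface. The sketch below expresses that with `typing.Protocol`; `KeyValueStore`, `InMemoryStore`, `checkpoint`, and the `load` method are hypothetical, and only the `save(key, value)` call shape comes from the snippet above.

```python
from typing import Dict, Optional, Protocol


class KeyValueStore(Protocol):
    """Interface every backend is assumed to satisfy."""

    async def save(self, key: str, value: bytes) -> None: ...

    async def load(self, key: str) -> Optional[bytes]: ...


class InMemoryStore:
    """Test double; a blob or table backend would expose the same methods."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    async def save(self, key: str, value: bytes) -> None:
        self._data[key] = value

    async def load(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


async def checkpoint(
    store: KeyValueStore, agent_id: str, payload: bytes
) -> Optional[bytes]:
    # Depends only on the interface, never on a concrete backend.
    key = f"agents/{agent_id}/state"
    await store.save(key, payload)
    return await store.load(key)
```

Callers written against `KeyValueStore` can switch backends, or use `InMemoryStore` in tests, without code changes.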
Component Interactions
Agent → Storage
agent = LeadershipAgent(...) # Or GenericPurposeDrivenAgent
await agent.initialize() # ContextMCPServer created
await agent.save_state() # State → ContextMCPServer → Azure Storage
Agent → Message Bus → Agent
# Agent A
await message_bus.publish("task.created", task_data)
# Agent B (subscribed to "task.created")
async def handle_task(self, message):
    await self.process_task(message)
Application → AOS → Azure
# Application layer
aos = AgentOperatingSystem(config)
await aos.initialize()
# AOS uses services
# Services use Azure SDK
# Azure SDK calls Azure cloud services
Configuration Architecture
Hierarchical Configuration
- Default Configuration: Built-in defaults
- File Configuration: config/*.json
- Environment Variables: Override files
- Runtime Configuration: Override all
Example:
import os

config = AOSConfig(
    storage_connection_string=os.getenv("AZURE_STORAGE_CONNECTION_STRING"),
    servicebus_connection_string=os.getenv("AZURE_SERVICEBUS_CONNECTION_STRING"),
    environment=os.getenv("APP_ENVIRONMENT", "production"),
)
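The four-level precedence (defaults, then files, then environment variables, then runtime values) can be sketched with `collections.ChainMap`, which returns the first match across its layers. The `load_config` function, the `AOS_` prefix, and the default keys are hypothetical and only illustrate the override order; they are not how `AOSConfig` is documented to resolve values.

```python
import json
import os
from collections import ChainMap
from typing import Mapping, Optional

# Hypothetical built-in defaults (lowest precedence).
DEFAULTS = {"environment": "production", "log_level": "INFO"}


def load_config(
    file_text: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    runtime: Optional[Mapping[str, str]] = None,
    prefix: str = "AOS_",
) -> dict:
    env = os.environ if env is None else env
    file_layer = json.loads(file_text) if file_text else {}
    # AOS_LOG_LEVEL becomes log_level; unprefixed variables are ignored.
    env_layer = {
        name[len(prefix):].lower(): value
        for name, value in env.items()
        if name.startswith(prefix)
    }
    # ChainMap searches left to right, so earlier layers win.
    layers = ChainMap(dict(runtime or {}), env_layer, file_layer, DEFAULTS)
    return dict(layers)
```

Passing `env` and `file_text` explicitly keeps the function easy to test without touching the real environment or disk.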
Configuration Files
- config/consolidated_config.json - Main configuration
- config/example_app_registry.json - App registry example
- config/self_learning_config.json - Self-learning configuration
- local.settings.json - Local development (not in git)
Deployment Architecture
Azure Functions Deployment
Internet → Azure Load Balancer
                ↓
       Azure Functions (AOS)
                ↓
       ┌────────┴────────┐
       ↓                 ↓
  Service Bus     Storage Account
       ↓                 ↓
    Agents ←────────→ State (ContextMCPServer)
Production Components
- Azure Functions: Serverless compute for AOS
- Azure Service Bus: Message queuing and routing
- Azure Storage: Blob, Table, Queue for persistence
- Azure Key Vault: Secrets management
- Application Insights: Monitoring and telemetry
- Cosmos DB: Optional document storage
File Organization
src/AgentOperatingSystem/
├── __init__.py # Package exports
├── agent_operating_system.py # Main AOS class
├── agents/ # Agent implementations
│ ├── base_agent.py
│ ├── perpetual_agent.py
│ └── purpose_driven_agent.py
├── orchestration/ # Orchestration engine
├── messaging/ # Message bus
├── storage/ # Storage services
├── auth/ # Authentication
├── mcp/ # MCP integration
│ └── context_server.py # ContextMCPServer
├── ml/ # ML pipeline
├── governance/ # Compliance
├── reliability/ # Fault tolerance
├── observability/ # Monitoring
├── knowledge/ # Knowledge base
├── learning/ # Self-learning
└── ...
Best Practices
- Understand the Paradigm: AOS is an OS, not a framework
- Think Perpetual: Agents run forever, not per-task
- Use ContextMCPServer: All state should persist via ContextMCPServer
- Event-Driven Design: Design for event-driven architecture
- Async All The Way: Use async/await throughout
- Leverage Services: Use existing services, don't reinvent
- Follow Patterns: Use established patterns (circuit breaker, state machine)
- Monitor Everything: Use observability services
- Secure by Design: Use auth service, never hardcode secrets
- Test Thoroughly: Test async, concurrent, and failure scenarios
Common Architectural Decisions
When to Create a New Service?
Create a new service when:
- Functionality is cross-cutting (used by multiple agents)
- Clear separation of concerns
- Reusable across applications
- Complex enough to warrant isolation
Add to existing service when:
- Closely related to existing functionality
- Small, focused feature
- No clear reason for separation
When to Use Message Bus vs Direct Calls?
Use Message Bus:
- Async, fire-and-forget communication
- Multiple subscribers
- Decoupled components
- Event broadcasting
Use Direct Calls:
- Synchronous responses needed
- Single recipient
- Tightly coupled components
- Simple request-response
When to Create a New Agent Type?
Create new agent type when:
- Distinct business purpose
- Different capabilities/tools
- Separate domain of responsibility
- Different lifecycle requirements
Extend existing agent when:
- Minor variation
- Same core purpose
- Shared capabilities
Related Skills
- perpetual-agents - Working with agents
- azure-functions - Deployment architecture
- async-python-testing - Testing patterns
- mcp-integration - MCP and ContextMCPServer usage
Additional Resources
- docs/architecture/ARCHITECTURE.md - Detailed architecture documentation
- README.md - Core concepts overview
- docs/development/CONTRIBUTING.md - Development guidelines
- docs/ - Additional technical documentation
- examples/ - Architecture examples in practice
