Overview
Core Components:
- ProcessManager: Handles spawning, communication, and lifecycle of stdio-based MCP servers
- RuntimeState: Maintains in-memory state of all processes with team-grouped tracking
- TeamIsolationService: Validates team-based access control for process operations
Deployment Modes:
- Development: Direct spawn without isolation (cross-platform)
- Production: nsjail isolation with resource limits (Linux only)
Process Spawning
Spawning Modes
The system automatically selects the appropriate spawning mode based on the environment:
Direct Spawn (Development):
- Standard Node.js child_process.spawn() without isolation
- Full environment variable inheritance
- No resource limits or namespace isolation
- Works on all platforms (macOS, Windows, Linux)
nsjail Isolation (Production, Linux only):
- Resource limits: 50MB RAM, 60s CPU time, and one process per started MCP server
- Namespace isolation: PID, mount, UTS, IPC
- Filesystem isolation: Read-only mounts for /usr, /lib, /lib64, /bin with writable /tmp
- Team-specific hostname: mcp-{team_id}
- Non-root user (99999:99999)
- Network access enabled
Mode Selection: The system uses process.env.NODE_ENV === 'production' && process.platform === 'linux' to determine isolation mode. This ensures development works seamlessly on all platforms while production deployments get full security.
Process Configuration
Processes are spawned using MCPServerConfig containing:
- installation_name: Unique identifier in format {server_slug}-{team_slug}-{installation_id}
- installation_id: Database UUID for the installation
- team_id: Team owning the process
- command: Executable command (e.g., npx, node)
- args: Command arguments
- env: Environment variables (credentials, configuration)
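The mode selection rule and the configuration fields above can be sketched in TypeScript as follows. This is a minimal illustration, not the satellite's actual code: the field types, the selectSpawnMode helper, and the values in exampleConfig are assumptions. The helper takes the environment and platform as arguments (rather than reading process.env and process.platform directly) so the rule can be checked in isolation.

```typescript
// Field names follow the MCPServerConfig description; the types are assumed.
interface MCPServerConfig {
  installation_name: string; // {server_slug}-{team_slug}-{installation_id}
  installation_id: string;
  team_id: string;
  command: string; // e.g. "npx" or "node"
  args: string[];
  env: Record<string, string>;
}

type SpawnMode = "direct" | "nsjail";

// Hypothetical helper mirroring the documented condition:
// NODE_ENV === 'production' && platform === 'linux' selects nsjail isolation.
function selectSpawnMode(nodeEnv: string | undefined, platform: string): SpawnMode {
  return nodeEnv === "production" && platform === "linux" ? "nsjail" : "direct";
}

// Placeholder values for illustration only.
const exampleConfig: MCPServerConfig = {
  installation_name: "filesystem-john-R36no6FGoMFEZO9nWJJLT",
  installation_id: "R36no6FGoMFEZO9nWJJLT",
  team_id: "team-placeholder",
  command: "npx",
  args: ["-y", "example-mcp-server"],
  env: {},
};
```

Under this rule, a production build running on macOS or Windows still falls back to direct spawn, because both conditions must hold.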
MCP Handshake Protocol
After spawning, processes must complete an MCP handshake before becoming operational:
Two-Step Process:
- Initialize Request: Sent to process via stdin
  - Protocol version: 2025-11-05
  - Client info: deploystack-satellite v1.0.0
  - Capabilities: roots.listChanged=false, sampling=
- Initialized Notification: Sent after successful initialization response
Handshake Requirements:
- 30-second timeout (accounts for npx package downloads)
- Response must include serverInfo with name and version
- Process marked ‘failed’ and terminated if handshake fails
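The two handshake messages and the serverInfo check can be sketched as below. The method names initialize and notifications/initialized and the parameter names come from the MCP specification; the helper function names are hypothetical, and the protocol version and client info are copied from the list above. The sampling capability value is not shown in the text, so the sketch leaves it out rather than guess.

```typescript
interface JsonRpcMessage {
  jsonrpc: "2.0";
  id?: number;
  method?: string;
  params?: unknown;
  result?: unknown;
  error?: { code: number; message: string };
}

// Step 1: initialize request, written to the process's stdin.
function buildInitializeRequest(id: number): JsonRpcMessage {
  return {
    jsonrpc: "2.0",
    id,
    method: "initialize",
    params: {
      protocolVersion: "2025-11-05",
      clientInfo: { name: "deploystack-satellite", version: "1.0.0" },
      // sampling is omitted here because its value is not given in the text.
      capabilities: { roots: { listChanged: false } },
    },
  };
}

// Step 2: initialized notification (no id, so no response is expected).
function buildInitializedNotification(): JsonRpcMessage {
  return { jsonrpc: "2.0", method: "notifications/initialized" };
}

// The handshake only counts as successful if serverInfo has a name and version.
function hasValidServerInfo(response: JsonRpcMessage): boolean {
  const result = response.result as
    | { serverInfo?: { name?: unknown; version?: unknown } }
    | undefined;
  const info = result?.serverInfo;
  return typeof info?.name === "string" && typeof info?.version === "string";
}
```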
stdio Communication Protocol
Message Format
All communication uses newline-delimited JSON following the JSON-RPC 2.0 specification:
stdin (Satellite → Process):
- Write JSON-RPC messages followed by \n
- Requests include id field for response matching
- Notifications omit id field (no response expected)
stdout (Process → Satellite):
- Buffer-based parsing accumulates chunks
- Split on newlines to extract complete messages
- Incomplete lines remain in buffer for next chunk
- Parse complete lines as JSON
Message Types:
- Requests (with id): Expect response, tracked in active requests map
- Notifications (no id): Fire-and-forget, no response tracking
- Responses: Match id to active request, resolve or reject promise
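The buffer-based parsing described above can be sketched as a small line buffer. The LineBuffer and classify names are hypothetical, and the sketch works on strings for simplicity, assuming stdout chunks have already been decoded as UTF-8 text.

```typescript
// Accumulates stdout chunks and returns complete newline-delimited JSON messages.
class LineBuffer {
  private buffer = "";

  // Malformed lines are returned separately so the caller can log and skip them.
  push(chunk: string): { messages: unknown[]; malformed: string[] } {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is "" (chunk ended on a newline) or an incomplete line;
    // either way it stays in the buffer for the next chunk.
    this.buffer = lines.pop() ?? "";
    const messages: unknown[] = [];
    const malformed: string[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue;
      try {
        messages.push(JSON.parse(trimmed));
      } catch {
        malformed.push(trimmed);
      }
    }
    return { messages, malformed };
  }
}

type MessageKind = "request" | "notification" | "response";

// A message with a method is a request (has id) or a notification (no id);
// anything else is treated as a response to be matched by id.
function classify(msg: { id?: unknown; method?: unknown }): MessageKind {
  if (msg.method !== undefined) {
    return msg.id !== undefined ? "request" : "notification";
  }
  return "response";
}
```

Keeping the trailing partial line in the buffer is what lets a message that is split across two chunks be parsed once its newline arrives.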
Request/Response Handling
Active Request Tracking:
- Map of request ID → {resolve, reject, timeout, startTime}
- Configurable timeout per request (default 30s)
- Automatic cleanup on response or timeout
Request Flow:
- Validate process status (must be ‘starting’ or ‘running’)
- Register timeout handler
- Write JSON-RPC message to stdin
- Wait for response via stdout parsing
- Resolve/reject promise based on response
Error Handling:
- Write errors: Immediate rejection
- Timeout errors: Clean up active request, reject with timeout message
- JSON-RPC errors: Extract error.message from response
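A minimal sketch of the active request map, assuming numeric request IDs. The ActiveRequests class and its method names are hypothetical. It covers the timeout, the id matching, and the cleanup, but leaves out the stdin write and the process status check.

```typescript
interface PendingRequest {
  resolve: (result: unknown) => void;
  reject: (error: Error) => void;
  timeout: ReturnType<typeof setTimeout>;
  startTime: number;
}

interface RpcResponse {
  id: number;
  result?: unknown;
  error?: { message: string };
}

class ActiveRequests {
  private pending = new Map<number, PendingRequest>();

  get size(): number {
    return this.pending.size;
  }

  // Registers a request; the promise settles via settle(), cancelAll(), or the timeout.
  track(id: number, timeoutMs = 30000): Promise<unknown> {
    return new Promise<unknown>((resolve, reject) => {
      const timeout = setTimeout(() => {
        this.pending.delete(id);
        reject(new Error(`Request ${id} timed out after ${timeoutMs}ms`));
      }, timeoutMs);
      this.pending.set(id, { resolve, reject, timeout, startTime: Date.now() });
    });
  }

  // Matches a response to its request; returns false when the id is unknown.
  settle(response: RpcResponse): boolean {
    const entry = this.pending.get(response.id);
    if (!entry) return false;
    clearTimeout(entry.timeout);
    this.pending.delete(response.id);
    if (response.error) entry.reject(new Error(response.error.message));
    else entry.resolve(response.result);
    return true;
  }

  // Rejects everything still in flight, e.g. when the process is terminated.
  cancelAll(reason: string): void {
    this.pending.forEach((entry) => {
      clearTimeout(entry.timeout);
      entry.reject(new Error(reason));
    });
    this.pending.clear();
  }
}
```

Clearing the timer in both settle() and cancelAll() matters: a timer left running would later reject a request that has already been answered and keep the entry's closure alive.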
Process Lifecycle
Idle Process Management: Processes that remain inactive for extended periods are automatically terminated and respawned on-demand to optimize memory usage. See Idle Process Management for details on automatic termination, dormant state tracking, and respawning.
Configuration Updates: When a user updates their MCP server configuration (args, env) via the dashboard, the backend sends a configure command to the satellite. For stdio servers, the satellite automatically restarts the process with the new configuration. See Backend Communication for the command flow.
Lifecycle States
starting:
- Process spawned with handlers attached
- MCP handshake in progress
- Accepts handshake messages only
running:
- Handshake completed successfully
- Ready for JSON-RPC requests
- Tools discovered and cached
terminating:
- Graceful shutdown initiated
- Active requests cancelled
- Awaiting process exit
terminated:
- Process exited
- Removed from tracking maps
failed:
- Spawn or handshake failure
- Not operational
Graceful Termination
Process termination follows a two-step graceful shutdown approach to ensure clean process exit and proper resource cleanup.
Termination Steps
Step 1: SIGTERM (Graceful Shutdown)
- Send SIGTERM signal to the process
- Process has 10 seconds (default timeout) to shut down gracefully
- Process can complete in-flight operations and clean up resources
- Wait for process to exit voluntarily
Step 2: SIGKILL (Forced Termination)
- If process doesn’t exit within timeout period
- Send SIGKILL signal to force immediate termination
- Guaranteed process termination (cannot be caught or ignored)
- Used as last resort for unresponsive processes
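The SIGTERM then SIGKILL sequence can be sketched against a minimal process interface. Killable is a hypothetical shape modelled on the kill() and once("exit") methods of a Node.js child process, and terminateGracefully is an illustrative name, not the ProcessManager's actual method.

```typescript
// Minimal view of a child process: just enough to signal it and observe exit.
interface Killable {
  kill(signal: "SIGTERM" | "SIGKILL"): boolean;
  once(event: "exit", listener: () => void): unknown;
}

function terminateGracefully(
  proc: Killable,
  timeoutMs = 10000,
): Promise<"graceful" | "forced"> {
  return new Promise((resolve) => {
    let forced = false;

    // Step 2 (armed first): if the process is still alive after the timeout,
    // force termination with SIGKILL.
    const timer = setTimeout(() => {
      forced = true;
      proc.kill("SIGKILL");
    }, timeoutMs);

    // Whichever signal ends the process, stop the timer and report the outcome.
    proc.once("exit", () => {
      clearTimeout(timer);
      resolve(forced ? "forced" : "graceful");
    });

    // Step 1: ask the process to shut down on its own.
    proc.kill("SIGTERM");
  });
}
```

The exit listener is attached before SIGTERM is sent so that a process which exits immediately is still observed.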
Termination Types
The system handles four types of intentional terminations differently:
1. Manual Termination
- Triggered by explicit restart or stop commands
- Status set to 'terminating' before sending signals
- No auto-restart triggered
- Standard graceful shutdown with SIGTERM → SIGKILL
2. Idle Timeout (Dormant Shutdown)
- Triggered by idle timeout (default: 180 seconds of inactivity)
- Process marked with isDormantShutdown flag
- Configuration stored in dormant map for fast respawn
- Tools remain cached for instant availability
- No auto-restart triggered (intentional shutdown)
- See Idle Process Management for details
3. Uninstall Shutdown
- Triggered when server removed from configuration
- Process marked with isUninstallShutdown flag
- Complete cleanup: process, dormant config, tools, restart tracking
- No auto-restart triggered (intentional removal)
- Invoked via removeServerCompletely() method
4. Configuration Restart
- Triggered when stdio server configuration is modified (e.g., user args change)
- Detected via DynamicConfigManager comparing old vs new configuration
- Existing process terminated with graceful shutdown
- Tools cleared from cache via stdioToolDiscoveryManager.clearServerTools()
- New process spawned with updated configuration (new args, env)
- Tool discovery runs automatically on the new process
- Enables real-time configuration updates without satellite restart
HTTP/SSE Servers: Unlike stdio servers, HTTP/SSE servers don’t require restart on config changes. Their configuration (headers, query params, URL) is read fresh on each request, so updates are immediate.
Crash Detection vs Intentional Shutdown
The system distinguishes between crashes and intentional shutdowns:
Crash Detection Logic:
- SIGTERM exit code is 143 (non-zero)
- Without flags, graceful termination would trigger auto-restart
- Flags prevent unwanted restarts for intentional shutdowns
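The role of the flags can be sketched as a single predicate. The ExitInfo shape and the isCrash name are assumptions for illustration; only the isDormantShutdown and isUninstallShutdown flags and the ‘terminating’ status come from the text.

```typescript
interface ExitInfo {
  code: number | null;
  signal: string | null;
  status: string; // process status at the time of exit, e.g. "running"
  isDormantShutdown?: boolean;
  isUninstallShutdown?: boolean;
}

// Returns true only for exits that should count as crashes (and so may
// trigger auto-restart).
function isCrash(exit: ExitInfo): boolean {
  // Intentional shutdowns are never crashes, even though a SIGTERM-driven
  // exit reports the non-zero code 143.
  if (exit.isDormantShutdown || exit.isUninstallShutdown) return false;
  if (exit.status === "terminating") return false;
  const nonZeroExit = exit.code !== null && exit.code !== 0;
  const killedBySignal = exit.signal !== null;
  return nonZeroExit || killedBySignal;
}
```

Without the two flag checks, an idle shutdown that ends with code 143 would be indistinguishable from a crash and would be restarted immediately.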
Cleanup Operations
During termination, the following cleanup operations occur:
- Active Request Cancellation
  - All pending JSON-RPC requests are rejected
  - Active requests map is cleared
  - Clients receive termination error
- State Cleanup
  - Remove from processes map (by process ID)
  - Remove from processIdsByName map (by installation name)
  - Remove from team tracking sets
  - Clear dormant config if exists (for uninstall)
- Resource Tracking
  - Restart attempts cleared (for uninstall)
  - Respawn promises cleared
  - Process metrics finalized
- Event Emission
  - Emit processTerminated internal event
  - Emit processExit with exit code and signal
  - Emit mcp.server.crashed if crash detected (Backend event)
Complete Server Removal
The removeServerCompletely() method provides comprehensive cleanup for server uninstall:
Method Signature:
Cleanup Steps:
- Check for active process
  - If found: Set isUninstallShutdown flag
  - Terminate with graceful shutdown
  - Return active: true
- Check for dormant config
  - If found: Remove from dormant map
  - Return dormant: true
- Clear restart tracking
  - Delete restart attempts history
  - Prevent any future restart attempts
Termination Timing
Normal Termination:
- SIGTERM sent: ~1ms
- Process cleanup: 10-500ms (application-dependent)
- Total time: 11-501ms
Forced Termination:
- SIGTERM sent: ~1ms
- Timeout wait: 10,000ms
- SIGKILL sent: ~1ms
- Immediate kill: ~10ms
- Total time: ~10,012ms
Best Practices:
- MCP servers should handle SIGTERM gracefully
- Complete in-flight requests within timeout
- Close file handles and network connections
- Exit with code 0 for clean shutdown
Auto-Restart System
Crash Detection
The system detects crashes based on exit conditions:
- Non-zero exit code
- Process not in ‘terminating’ state
- Unexpected signal termination
Restart Policy
Limits:
- Maximum 3 restart attempts in 5-minute window
- After limit exceeded: Process marked ‘permanently_failed’ in RuntimeState
Backoff Strategy:
- Process ran >60 seconds before crash: Immediate restart
- Quick crashes: Exponential backoff (1s → 5s → 15s)
Restart Flow:
- Detect crash with exit code and signal
- Check restart eligibility (3 attempts in 5 minutes)
- Apply backoff delay based on uptime
- Attempt restart via spawnProcess()
- Emit ‘processRestarted’ or ‘restartLimitExceeded’ event
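The restart policy can be sketched as a pure decision function. The limits (3 attempts, 5 minutes, 60 seconds, and the 1s/5s/15s delays) come from the text; the decideRestart name, the attempt-timestamp list, and the choice to index the backoff table by the number of recent attempts are assumptions about how those pieces fit together.

```typescript
const MAX_ATTEMPTS = 3;
const WINDOW_MS = 5 * 60 * 1000;
const STABLE_UPTIME_MS = 60 * 1000;
const BACKOFF_MS = [1000, 5000, 15000];

type RestartDecision =
  | { action: "restart"; delayMs: number }
  | { action: "permanently_failed" };

// attempts: timestamps (ms) of earlier restart attempts for this installation.
// uptimeMs: how long the process ran before it crashed.
function decideRestart(attempts: number[], uptimeMs: number, now: number): RestartDecision {
  // Only attempts inside the 5-minute window count towards the limit.
  const recent = attempts.filter((t) => now - t < WINDOW_MS);
  if (recent.length >= MAX_ATTEMPTS) return { action: "permanently_failed" };

  // A process that ran for more than 60 seconds restarts immediately.
  if (uptimeMs > STABLE_UPTIME_MS) return { action: "restart", delayMs: 0 };

  // Quick crashes back off: 1s, then 5s, then 15s.
  const delayMs = BACKOFF_MS[recent.length] ?? 15000;
  return { action: "restart", delayMs };
}
```

For example, a third quick crash with two attempts already in the window waits 15 seconds, and a fourth within the same window is marked permanently failed instead of restarting.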
RuntimeState Integration
RuntimeState maintains in-memory tracking of all MCP server processes:
Tracking Methods:
- By process ID (UUID)
- By installation name (for lookups)
- By team ID (for team-grouped operations)
Process Information:
- Extends ProcessInfo with: installationId, installationName, teamId
- Health status: unknown/healthy/unhealthy
- Last health check timestamp
State Management:
- Permanently Failed Map: Separate storage for processes exceeding restart limits
- Team-Grouped Sets: Map of team_id → Set of process IDs for heartbeat reporting
Query Methods:
- Get all processes (includes permanently failed for reporting)
- Get team processes (filter by team_id)
- Get running team processes (status=‘running’)
- Get process count by status
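The three lookup paths (by process ID, by installation name, by team) can be sketched with plain maps and sets. TeamGroupedState and its method names are hypothetical; processIdsByName mirrors the map name mentioned in the cleanup section, and the permanently failed map is left out to keep the sketch short.

```typescript
interface TrackedProcess {
  processId: string;
  installationName: string;
  teamId: string;
  status: string;
}

class TeamGroupedState {
  private processes = new Map<string, TrackedProcess>();
  private processIdsByName = new Map<string, string>();
  private teamProcesses = new Map<string, Set<string>>();

  add(p: TrackedProcess): void {
    this.processes.set(p.processId, p);
    this.processIdsByName.set(p.installationName, p.processId);
    const ids = this.teamProcesses.get(p.teamId) ?? new Set<string>();
    ids.add(p.processId);
    this.teamProcesses.set(p.teamId, ids);
  }

  // Removes the process from all three indexes so none of them goes stale.
  remove(processId: string): void {
    const p = this.processes.get(processId);
    if (!p) return;
    this.processes.delete(processId);
    this.processIdsByName.delete(p.installationName);
    const ids = this.teamProcesses.get(p.teamId);
    if (ids) {
      ids.delete(processId);
      if (ids.size === 0) this.teamProcesses.delete(p.teamId);
    }
  }

  getByName(installationName: string): TrackedProcess | undefined {
    const id = this.processIdsByName.get(installationName);
    return id === undefined ? undefined : this.processes.get(id);
  }

  // Optionally filters by status, e.g. "running".
  getTeamProcesses(teamId: string, status?: string): TrackedProcess[] {
    const result: TrackedProcess[] = [];
    this.teamProcesses.get(teamId)?.forEach((id) => {
      const p = this.processes.get(id);
      if (p && (status === undefined || p.status === status)) result.push(p);
    });
    return result;
  }

  countByStatus(): Record<string, number> {
    const counts: Record<string, number> = {};
    this.processes.forEach((p) => {
      counts[p.status] = (counts[p.status] ?? 0) + 1;
    });
    return counts;
  }
}
```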
Process Monitoring
Metrics Tracked
Each process tracks operational metrics:
- Message count: Total requests sent to process
- Error count: Communication failures
- Last activity: Timestamp of last message sent/received
- Uptime: Calculated from start time
- Active requests: Count of pending requests
Events Emitted
The ProcessManager emits events for monitoring and integration:
- processSpawned: New process started successfully
- processRestarted: Process restarted after crash
- processTerminated: Process shut down
- processExit: Process exited (any reason)
- processError: Spawn or runtime error
- serverNotification: Notification received from MCP server
- restartLimitExceeded: Max restart attempts reached
- restartFailed: Restart attempt failed
Logging
stderr Handling:
- Logged at debug level (informational output, not errors)
- MCP servers often write logs to stderr
Parse Errors:
- Malformed JSON lines logged and skipped
- Does not crash the process or satellite
Structured Logging:
- All operations include: installation_name, installation_id, team_id
- Request tracking includes: request_id, method, duration_ms
- Error context includes: error messages, exit codes, signals
Event Emission
The ProcessManager emits real-time lifecycle events (started, crashed, restarted, permanently_failed) to the Backend for operational visibility and audit trails. ProcessManager internal events (processSpawned, processTerminated) are for satellite-internal coordination. Event System events (mcp.server.started, etc.) are sent to Backend for external visibility. See Event Emission - Process Lifecycle Events for complete event types, payloads, and batching configuration.
Team Isolation
Installation Name Format
Installation names follow a strict format for team isolation, for example:
- filesystem-john-R36no6FGoMFEZO9nWJJLT
- context7-alice-S47mp8GHpNGFZP0oWKKMU
Team Access Validation
TeamIsolationService provides:
- extractTeamInfo(): Parse installation name into components
- validateTeamAccess(): Ensure request team matches process team
- isValidInstallationName(): Validate name format
Team-Based Grouping:
- RuntimeState groups processes by team_id
- nsjail uses team-specific hostname: mcp-{team_id}
- Heartbeat reports processes grouped by team
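Parsing an installation name into its components can be sketched as below. This is not the TeamIsolationService implementation: parseInstallationName and teamSlugMatches are hypothetical helpers, and the sketch assumes that the team slug and the installation ID contain no hyphens, so that any extra hyphens belong to the server slug.

```typescript
interface InstallationParts {
  serverSlug: string;
  teamSlug: string;
  installationId: string;
}

// Splits {server_slug}-{team_slug}-{installation_id}.
// Assumption: the last two hyphen-separated segments are the team slug and
// the installation ID; everything before them is the server slug.
function parseInstallationName(name: string): InstallationParts | null {
  const parts = name.split("-");
  if (parts.length < 3 || parts.some((p) => p === "")) return null;
  const installationId = parts.pop();
  const teamSlug = parts.pop();
  if (installationId === undefined || teamSlug === undefined) return null;
  return { serverSlug: parts.join("-"), teamSlug, installationId };
}

// Access check in the spirit of validateTeamAccess(): the requesting team
// must match the team encoded in the installation name.
function teamSlugMatches(name: string, requestTeamSlug: string): boolean {
  const parsed = parseInstallationName(name);
  return parsed !== null && parsed.teamSlug === requestTeamSlug;
}
```

If team slugs can themselves contain hyphens, the name alone is ambiguous and the split would need the installation record to resolve it.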
Performance Characteristics
Timing:
- Spawn time: 1-3 seconds (includes handshake and tool discovery)
- Message latency: ~10-50ms for stdio communication
- Handshake timeout: 30 seconds
Resource Usage:
- Memory per process: Base ~10-20MB (application-dependent, limited to 50MB in production)
- Event-driven architecture: Handles multiple processes concurrently
- CPU overhead: Minimal (background event loop processing)
Scalability:
- No hard limit on process count (bounded by system resources)
- Team-grouped tracking enables efficient filtering
- Permanent failure tracking prevents infinite restart loops
Development & Testing
Local Development
Development Mode:
- Uses direct spawn (no nsjail required)
- Works on macOS, Windows, Linux
- Full environment inheritance simplifies debugging
Testing Processes
Manual Testing Methods:
- getAllProcesses(): Inspect all active processes
- getServerStatus(installationName): Get detailed process status
- restartServer(installationName): Test restart functionality
- terminateProcess(processInfo): Test graceful shutdown
Platform Support:
- Development: All platforms (macOS/Windows/Linux)
- Production: Linux only (nsjail requirement)
Security Considerations
Environment Injection:
- Credentials passed securely via environment variables
- No credentials stored in process arguments or logs
Resource Limits (Production):
- nsjail enforces hard limits: 50MB RAM, 60s CPU, one process
- Prevents resource exhaustion attacks
Namespace Isolation (Production):
- Complete process isolation per team
- Separate PID, mount, UTS, IPC namespaces
Filesystem Isolation (Production):
- System directories mounted read-only
- Only /tmp writable
- Prevents filesystem tampering
Network Access:
- Enabled by default (MCP servers need external connectivity)
- Can be disabled for higher security requirements
Status Events
Process lifecycle emits status events to backend for real-time monitoring:
Status Event Emission:
- connecting: When process spawn starts
- online: After successful handshake and tool discovery
- permanently_failed: When process crashes 3 times in 5 minutes
Log Buffering
Process stderr output is buffered and batched before emission:
Buffering Strategy:
- Batch interval: 3 seconds after first log
- Max batch size: 20 logs (forces immediate flush)
- Grouping: By installation_id + team_id
Log Level Inference:
- Inferred from message content (error if contains “error”, etc.)
- Metadata includes process_id for debugging
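The batching rules above can be sketched with an explicit clock instead of real timers, which keeps the logic easy to test. LogBatcher, inferLevel, and the entry shape are hypothetical. Only the “error” rule is stated in the text; the “warn” rule and the “info” default are assumptions standing in for the "etc.".

```typescript
type LogLevel = "error" | "warn" | "info";

interface LogEntry {
  message: string;
  level: LogLevel;
  timestamp: number;
}

const BATCH_INTERVAL_MS = 3000;
const MAX_BATCH_SIZE = 20;

function inferLevel(message: string): LogLevel {
  const lower = message.toLowerCase();
  if (lower.includes("error")) return "error";
  if (lower.includes("warn")) return "warn"; // assumed rule
  return "info"; // assumed default
}

class LogBatcher {
  private batches = new Map<string, { firstAt: number; entries: LogEntry[] }>();

  // Adds a stderr line; returns the full batch when the size limit forces a flush.
  add(installationId: string, teamId: string, message: string, now: number): LogEntry[] | null {
    const key = `${installationId}:${teamId}`;
    let batch = this.batches.get(key);
    if (!batch) {
      batch = { firstAt: now, entries: [] };
      this.batches.set(key, batch);
    }
    batch.entries.push({ message, level: inferLevel(message), timestamp: now });
    if (batch.entries.length >= MAX_BATCH_SIZE) {
      this.batches.delete(key);
      return batch.entries;
    }
    return null;
  }

  // Returns every batch whose first log is at least 3 seconds old.
  flushDue(now: number): LogEntry[][] {
    const due: LogEntry[][] = [];
    this.batches.forEach((batch, key) => {
      if (now - batch.firstAt >= BATCH_INTERVAL_MS) {
        due.push(batch.entries);
        this.batches.delete(key);
      }
    });
    return due;
  }
}
```

In a running satellite, flushDue() would be driven by a timer started when the first log of a batch arrives; here the caller passes the current time directly.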
Configuration Restart Flow
When configuration is updated (env vars, args, headers, query params):
- Backend sets installation status to restarting
- Backend sends configure command to satellite
- Satellite receives command and stops old process
- Satellite clears tool cache for installation
- Satellite spawns new process with updated configuration
- Status progresses: restarting → connecting → discovering_tools → online
Related Documentation
- Satellite Architecture Design - Overall system architecture
- Idle Process Management - Automatic termination and respawning of idle processes
- Tool Discovery Implementation - How tools are discovered from processes
- Event Emission - Process lifecycle events
- Log Capture - stderr log buffering
- Status Tracking - Process status management
- Backend Communication - Integration with Backend commands

