How we sped up deploys, updates, and undeploys in Magic Containers

Magic Containers became programmable with the introduction of the Public API. The next step was making it react in real time.

In practice, operations like deploys and updates could take tens of seconds to complete. Not because anything was failing, but because of how control loops work.

Each part of the system waits for the next reconciliation cycle, and those delays stack up across components.

As workloads become more short-lived and automated, lifecycle speed becomes an increasingly important part of the developer experience. Spinning up a container, scaling it, or tearing it down should feel immediate.

This is the problem we set out to solve.

The problem: control loops introduce latency

Magic Containers is built around a control loop architecture.

Each component continuously reconciles desired state → actual state:

  • Application Provisioner → selects regions for deployment
  • Controller Manager → ensures the desired number of replicas
  • Scheduler → assigns pods to nodes
  • Local Container Manager (LCM) → ensures containers run on the assigned node

Each of these components runs independently in a loop:

# simplified reconciliation loop (pseudocode)
while True:
    actual_state = observe_state()
    diff = desired_state - actual_state
    if diff:
        reconcile(diff)
    sleep(interval)

This model is extremely reliable and forms the backbone of many distributed systems, including Kubernetes.

But it comes with a tradeoff.

Latency compounds across loops.

Each loop runs on an interval of ~5 to 10 seconds.

That means a single operation doesn’t execute immediately. Instead, it waits for the next loop iteration.

Now chain multiple components together:

User action
  → waits for Provisioner loop (up to 10s)
  → waits for Controller loop (up to 10s)
  → waits for Scheduler loop (up to 10s)
  → waits for LCM loop (up to 10s)

In the worst case, four loops waiting up to 10 seconds each add up to as much as 40 seconds before a container is fully running.

Even under typical conditions, deploys and updates felt noticeably delayed.
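
To make the compounding concrete, here is a back-of-the-envelope sketch in Go. It assumes a change arrives at a uniformly random moment within each loop's interval, so it waits half an interval per loop on average, and it ignores the time the work itself takes. The function names are illustrative and not part of Magic Containers.

```go
package main

import "fmt"

// worstCaseWait returns the longest time a change can spend waiting when it
// must pass through `loops` polling loops in sequence, each running every
// intervalSeconds.
func worstCaseWait(loops int, intervalSeconds float64) float64 {
	return float64(loops) * intervalSeconds
}

// expectedWait assumes the change arrives at a uniformly random point in each
// interval, so the average wait per loop is half the interval.
func expectedWait(loops int, intervalSeconds float64) float64 {
	return float64(loops) * intervalSeconds / 2
}

func main() {
	const loops = 4 // Provisioner, Controller Manager, Scheduler, LCM
	for _, interval := range []float64{5, 10} {
		fmt.Printf("interval %2.0fs: worst case %2.0fs, expected %4.1fs\n",
			interval, worstCaseWait(loops, interval), expectedWait(loops, interval))
	}
}
```

With four loops at 10 seconds each, that is a 40-second worst case and a 20-second average spent purely waiting.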

Nothing was technically incorrect. The system always converged to the correct state, but the experience lagged behind what modern workflows expect.

What we considered (and rejected)

We explored several approaches before settling on the final solution.

1. Decreasing loop intervals

Reducing intervals from ~10 seconds to sub-second.

Why we rejected it:

  • Significant increase in CPU usage across all control plane components
  • Higher pressure on state storage and coordination systems
  • Still fundamentally polling-based, meaning latency never reaches zero

2. Removing loops entirely

Moving to a purely event-driven system.

Why we rejected it:

  • Loops are critical as a safety mechanism
  • They continuously verify and correct drift between desired and actual state
  • Without them, missed events could lead to permanent inconsistencies

3. Hybrid model (chosen)

Keep loops for correctness and safety, but introduce events for immediacy.

The solution: event-driven acceleration

We introduced a message broker with event queues between components.

The key idea: Loops ensure consistency. Events provide speed.

Instead of waiting for the next loop iteration, components now react immediately when something changes.
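
One way to picture the hybrid model is a single loop that wakes up on either an event or a timer. The sketch below is a minimal illustration under that assumption, not the actual Magic Containers implementation; `runHybrid` and `demo` are hypothetical names, and the channels are fed by hand so the order of triggers is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// runHybrid calls reconcile as soon as an event arrives, and also on every
// tick as a safety net. It returns once the events channel is closed.
func runHybrid(events <-chan string, ticks <-chan time.Time, reconcile func(trigger string)) {
	for {
		select {
		case name, ok := <-events:
			if !ok {
				return
			}
			reconcile("event " + name) // fast path: react immediately
		case <-ticks:
			reconcile("periodic tick") // slow path: catch drift and missed events
		}
	}
}

// demo drives the loop with unbuffered, hand-fed channels.
func demo() []string {
	events := make(chan string)
	ticks := make(chan time.Time)
	done := make(chan struct{})
	var triggers []string

	go func() {
		defer close(done)
		runHybrid(events, ticks, func(trigger string) {
			triggers = append(triggers, trigger)
		})
	}()

	events <- "application.created" // handled without waiting for a tick
	ticks <- time.Now()             // the fallback loop still runs
	events <- "pod.scheduled"
	close(events)
	<-done
	return triggers
}

func main() {
	for _, trigger := range demo() {
		fmt.Println("reconcile triggered by:", trigger)
	}
}
```

In a long-running component, `ticks` would typically be the `C` channel of a `time.Ticker` created with `time.NewTicker`.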

Before: loop-driven propagation

[User Action]
↓
(wait for Provisioner loop)
↓
(wait for Controller loop)
↓
(wait for Scheduler loop)
↓
(wait for LCM loop)
↓
[Container Running]

After: event-accelerated flow

[User Action]
↓
[Event: Application Created] → Queue
↓
[Provisioner triggered immediately]
↓
[Event: Provisioning Complete]
↓
[Controller Manager triggered]
↓
[Event: Replicas Created]
↓
[Scheduler triggered]
↓
[Event: Pod Scheduled]
↓
[LCM triggered]
↓
[Container Running]

How it works

Each component now listens for specific events and reacts instantly.

Example event

{
    "type": "application.created",
    "app_id": "app_123",
    "regions": ["eu-central", "us-east"],
    "timestamp": 1713949200
}
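
A consumer could decode this payload into a typed struct before dispatching it. The following is a sketch only: the `Event` struct and `decodeEvent` function are illustrative names, and treating `timestamp` as Unix seconds is an assumption based on the example value.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Event mirrors the fields of the example payload.
type Event struct {
	Type      string   `json:"type"`
	AppID     string   `json:"app_id"`
	Regions   []string `json:"regions"`
	Timestamp int64    `json:"timestamp"` // assumed to be Unix seconds
}

// decodeEvent parses a raw message and rejects payloads that lack the fields
// every handler depends on.
func decodeEvent(raw []byte) (Event, error) {
	var ev Event
	if err := json.Unmarshal(raw, &ev); err != nil {
		return Event{}, fmt.Errorf("decode event: %w", err)
	}
	if ev.Type == "" || ev.AppID == "" {
		return Event{}, errors.New("decode event: missing type or app_id")
	}
	return ev, nil
}

func main() {
	raw := []byte(`{
		"type": "application.created",
		"app_id": "app_123",
		"regions": ["eu-central", "us-east"],
		"timestamp": 1713949200
	}`)
	ev, err := decodeEvent(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s for %s in %v\n", ev.Type, ev.AppID, ev.Regions)
}
```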

Provisioner

func handleApplicationCreated(event Event) {
    regions := selectRegions(event)
    publish("provisioning.completed", regions)
}

Controller Manager

func handleProvisioningComplete(event Event) {
    createReplicas(event.app_id, desiredReplicas)
    publish("replicas.created", event.app_id)
}

Scheduler

func handleReplicasCreated(event Event) {
    node := selectNode(event)
    publish("pod.scheduled", node)
}

Local Container Manager (LCM)

func handlePodScheduled(event Event) {
    startContainer(event.node, event.pod)
}
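
To show how the four handlers chain together, here is a self-contained sketch that replaces the message broker with a tiny in-process publish/subscribe type. `Broker`, `wire`, and the final `container.started` event are hypothetical and exist only for this illustration; a real broker delivers asynchronously and across processes.

```go
package main

import "fmt"

// Event is a simplified message; the field names are illustrative.
type Event struct {
	Type  string
	AppID string
}

// Broker is an in-process stand-in for a message broker: it delivers each
// event synchronously to the handlers subscribed to its type.
type Broker struct {
	handlers map[string][]func(Event)
	Log      []string // every published event type, in publish order
}

func NewBroker() *Broker {
	return &Broker{handlers: make(map[string][]func(Event))}
}

func (b *Broker) Subscribe(eventType string, h func(Event)) {
	b.handlers[eventType] = append(b.handlers[eventType], h)
}

func (b *Broker) Publish(ev Event) {
	b.Log = append(b.Log, ev.Type)
	for _, h := range b.handlers[ev.Type] {
		h(ev)
	}
}

// wire connects the components so each one's output event triggers the next.
// Each handler here only forwards; the real work is omitted.
func wire(b *Broker) {
	next := func(eventType string) func(Event) {
		return func(ev Event) { b.Publish(Event{Type: eventType, AppID: ev.AppID}) }
	}
	b.Subscribe("application.created", next("provisioning.completed")) // Provisioner
	b.Subscribe("provisioning.completed", next("replicas.created"))    // Controller Manager
	b.Subscribe("replicas.created", next("pod.scheduled"))             // Scheduler
	b.Subscribe("pod.scheduled", next("container.started"))            // LCM (hypothetical event)
}

func main() {
	b := NewBroker()
	wire(b)
	b.Publish(Event{Type: "application.created", AppID: "app_123"})
	for i, eventType := range b.Log {
		fmt.Printf("%d. %s\n", i+1, eventType)
	}
}
```

A single user action walks the whole chain without any component waiting for a timer.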

Important: loops still exist

The control loops were not removed.

They still run continuously to:

  • Detect drift (e.g., crashed containers or missing replicas)
  • Reconcile inconsistencies
  • Act as a fallback if events are delayed or lost

// simplified reconciliation loop
for {
    diff := computeDiff(desiredState, actualState)
    if diff != nil {
        reconcile(diff)
    }
    time.Sleep(5 * time.Second)
}

This hybrid design gives us:

  • Fast reaction time (events)
  • Reliable convergence to the desired state (loops)

These events don’t just drive internal components; they also power real-time updates across the platform, including the Dashboard.

From control plane to user experience

Reducing backend latency is only part of the story.

Before this change, even when operations completed, the Dashboard still relied on polling to fetch updates. This introduced an additional delay between something happening in the system and the user actually seeing it.

In practice, this meant:

  • Deploy finishes → UI updates a few seconds later
  • Scaling event happens → user sees it after the next refresh cycle

To solve this, we extended the same event-driven model all the way to the frontend.

Real-time updates via WebSockets

We introduced WebSocket-based event streaming between the control plane and the Dashboard.

Instead of polling for state changes, the UI now subscribes to live updates:

Client → opens WebSocket connection
→ subscribes to application events

Whenever something changes:

[Control Plane Event]
↓
[Message Broker]
↓
[WebSocket Gateway]
↓
[Dashboard UI updates instantly]
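
The fan-out step inside such a gateway can be sketched with plain channels. This example leaves out the WebSocket transport entirely and only shows per-application subscription and broadcast. The `Hub` type and its policy of dropping updates for slow subscribers are assumptions made for illustration, not a description of the actual gateway.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub tracks which subscribers want updates for which application.
type Hub struct {
	mu   sync.Mutex
	subs map[string][]chan string // app ID -> subscriber channels
}

func NewHub() *Hub { return &Hub{subs: make(map[string][]chan string)} }

// Subscribe registers interest in one application's events. In a real gateway
// the returned channel would be drained by the goroutine writing to the socket.
func (h *Hub) Subscribe(appID string) <-chan string {
	ch := make(chan string, 16)
	h.mu.Lock()
	defer h.mu.Unlock()
	h.subs[appID] = append(h.subs[appID], ch)
	return ch
}

// Broadcast pushes an update to every subscriber of appID and reports how many
// received it. If a subscriber's buffer is full, the update is dropped for
// that subscriber instead of blocking everyone else.
func (h *Hub) Broadcast(appID, update string) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	delivered := 0
	for _, ch := range h.subs[appID] {
		select {
		case ch <- update:
			delivered++
		default: // slow subscriber: skip
		}
	}
	return delivered
}

func main() {
	hub := NewHub()
	dashboard := hub.Subscribe("app_123")
	other := hub.Subscribe("app_456")

	n := hub.Broadcast("app_123", "creating → running")
	fmt.Println("delivered to", n, "subscriber(s):", <-dashboard)
	fmt.Println("updates queued for app_456:", len(other))
}
```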

What this changes

This removes the final layer of perceived latency.

Before

  • Backend finishes → UI polls → user sees update later

After

  • Backend finishes → event emitted → UI updates instantly

Result

  • Deploy progress updates feel real time
  • Scaling actions are visible immediately
  • State transitions (creating → running → scaling) feel continuous

Why this matters

Without this step, the platform would be technically fast but still feel slow.

By pushing events all the way to the UI, we aligned:

  • System speed
  • User perception

The impact

By eliminating waiting between steps, we removed the largest source of latency.

Before vs. after

Operation   Before (loop-driven)   After (event-driven)
Deploy      10–40s                 < 5s
Update      10–40s                 < 4s
Undeploy    ~60s+                  ~60s (grace period)

What changed technically

  • Removed dependency on loop timing for forward progress
  • Reduced end-to-end deploy and update latency from 10–40s to under 5s
  • Maintained correctness via continuous reconciliation

Why this matters

This fundamentally changes how Magic Containers behaves.

  • CI/CD pipelines speed up. Infrastructure is no longer the slowest step
  • Ephemeral workloads become practical. Create → run → destroy flows now complete in seconds
  • Event-driven systems feel natural. Infrastructure now reacts at the same speed as application logic

Tradeoffs and challenges

1. Event ordering

Ensuring correct sequencing across distributed components.

Solution:

  • Idempotent handlers
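
As an illustration of what idempotent means here, the sketch below writes a handler as "ensure N replicas exist" rather than "create N replicas", so a duplicated or replayed event changes nothing. The `Cluster` type and `EnsureReplicas` method are hypothetical stand-ins for real state.

```go
package main

import "fmt"

// Cluster holds the current replica count per application.
type Cluster struct {
	replicas map[string]int
}

func NewCluster() *Cluster { return &Cluster{replicas: make(map[string]int)} }

// EnsureReplicas brings appID up to the desired replica count and returns how
// many replicas it had to create. Calling it again with the same arguments
// creates nothing, which makes redelivered events harmless.
func (c *Cluster) EnsureReplicas(appID string, desired int) int {
	missing := desired - c.replicas[appID]
	if missing <= 0 {
		return 0
	}
	c.replicas[appID] += missing
	return missing
}

func main() {
	c := NewCluster()
	fmt.Println("first delivery created:", c.EnsureReplicas("app_123", 3))
	fmt.Println("duplicate delivery created:", c.EnsureReplicas("app_123", 3))
	fmt.Println("replicas now:", c.replicas["app_123"])
}
```

Because the handler compares against current state instead of blindly applying a change, it behaves the same way whether it is triggered by an event or by the reconciliation loop.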

2. Reliability

Events can fail or be delayed.

Solution:

  • Retry mechanisms
  • Dead-letter queues
  • Control loops as fallback
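
The first two mechanisms can be sketched together: retry a handler a bounded number of times, then park the event in a dead-letter queue. This is a minimal in-memory illustration; `deliver` and `flakyHandler` are hypothetical names, the queue is a plain slice, and backoff between attempts is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

type Event struct{ Type, AppID string }

// deliver tries the handler up to maxAttempts times. If every attempt fails,
// the event is appended to the dead-letter queue for later inspection.
func deliver(ev Event, maxAttempts int, handle func(Event) error, dlq *[]Event) (attempts int, err error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		if err = handle(ev); err == nil {
			return attempts, nil
		}
		// A real consumer would back off here before retrying.
	}
	*dlq = append(*dlq, ev)
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

// flakyHandler returns a handler that fails its first `failures` calls.
func flakyHandler(failures int) func(Event) error {
	calls := 0
	return func(Event) error {
		calls++
		if calls <= failures {
			return errors.New("node unreachable")
		}
		return nil
	}
}

func main() {
	var dlq []Event
	ev := Event{Type: "pod.scheduled", AppID: "app_123"}

	n, err := deliver(ev, 3, flakyHandler(2), &dlq)
	fmt.Println("attempts:", n, "error:", err, "dead-lettered:", len(dlq))

	n, err = deliver(ev, 3, flakyHandler(5), &dlq)
	fmt.Println("attempts:", n, "error:", err, "dead-lettered:", len(dlq))
}
```

Even when an event ends up dead-lettered, the reconciliation loop still picks up the difference on its next pass.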

3. Observability

Async systems are harder to debug.

Solution:

  • Correlation IDs
  • Event tracing across components
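
A correlation ID only helps if every follow-up event inherits it from the event that caused it. The sketch below shows that propagation and a simple trace lookup. The `CorrelationID` field, the `derive` and `trace` helpers, and the `req-a`/`req-b` values are all hypothetical.

```go
package main

import "fmt"

type Event struct {
	Type          string
	AppID         string
	CorrelationID string // hypothetical field: one ID per user action
}

// derive builds a follow-up event that inherits the parent's correlation ID.
func derive(parent Event, eventType string) Event {
	return Event{Type: eventType, AppID: parent.AppID, CorrelationID: parent.CorrelationID}
}

// trace returns the event types recorded for one correlation ID, in log order.
func trace(log []Event, correlationID string) []string {
	var types []string
	for _, ev := range log {
		if ev.CorrelationID == correlationID {
			types = append(types, ev.Type)
		}
	}
	return types
}

func main() {
	deployA := Event{Type: "application.created", AppID: "app_123", CorrelationID: "req-a"}
	deployB := Event{Type: "application.created", AppID: "app_456", CorrelationID: "req-b"}
	provA := derive(deployA, "provisioning.completed")

	// Events from concurrent operations interleave in a shared log.
	log := []Event{
		deployA,
		deployB,
		provA,
		derive(deployB, "provisioning.completed"),
		derive(provA, "replicas.created"),
	}
	fmt.Println("trace for req-a:", trace(log, "req-a"))
}
```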

What’s next

We’re already exploring:

  • Optimizing image download times to reduce startup time
  • Automated builds and updates directly from your GitHub repository
  • Access to recent log history alongside live logs for easier troubleshooting

Final thoughts

Magic Containers started as a loop-driven control plane designed for correctness.

With the introduction of event-driven acceleration, it now reacts immediately to changes, without relying on the next reconciliation cycle.

The result is a system that converges just as reliably, but gets there much faster.