Stack monitoring

Every provisioned stack gets two complementary layers of health monitoring out of the box: an off-box port:22 probe that detects when the host stops answering, and an opt-in on-box agent that reports disk, RAM, and Docker telemetry. Both surface in the stack detail page; the off-box probe is what alerts you when the host is genuinely down.

The two layers

Off-box port monitorOn-box Heartbeat agent
AnswersIs the host reachable right now?What's my disk %, RAM, container count, load?
Where it runsOwnStack's monitoring service, externallyOn the stack itself
Works when the host is deadYes — that's the point.No — the agent dies with the host.
Detect time on a crash~60–180 sWhenever the agent times out (~5 min)
Install requiredNo (auto-provisioned)Yes — install via Stack → Add-ons → Heartbeat

They're not competing tools — they're two layers of the same product. The off-box probe is liveness; the on-box agent is telemetry.

Off-box port monitor

When a stack reaches provisioned, OwnStack creates a port-22 monitor in the Heartbeat service. It probes tcp://<instance_ip>:22 every 60 seconds. Two consecutive failures (≈ 180 s) marks the host as down; the next success marks it recovered.

On a down event:

  1. Heartbeat POSTs an HMAC-signed webhook to the control plane.
  2. The control plane writes a host_unreachable alert against the stack and stamps unreachable_since.
  3. Account users with notify_host_unreachable=true get an email.
  4. The stack list card switches to a red dot; the stack detail's Host status panel flips to Unreachable for X.

On recovery the same flow runs in reverse: alert resolved, last_reachable_at stamped, recovery email sent.

On-box agent (Heartbeat add-on)

Opt-in per stack from Stack → Add-ons → Heartbeat → Install agent. The agent installs over SSH, runs under systemd, and pushes disk/RAM/docker/load every 60 s. The data drives the Health panel on the stack page (disk-use gauge, container counts, alerts at 80% / 90% disk).

The agent doesn't replace the off-box probe — when the host dies the agent dies too. Keep both.

Host actions

The Host status panel exposes provider-level controls for AWS stacks:

ActionWhat it runsWhen the button shows
StartStartInstancesProvider state is stopped / shutting-down
StopStopInstancesProvider state is running (confirms with EBS-billing warning)
RestartRebootInstancesProvider state is running, or the host is unreachable
ConsoleGetConsoleOutputAlways (for AWS stacks with an instance)

Each action is recorded as a StackHostAction row with the before/after provider state, user, and result. The Timeline section on the stack detail lists them alongside alert events.

Auto-recovery (opt-in)

Turn the toggle on in the Host status panel and pick a threshold (default 5 min). If the off-box probe reports the host unreachable for at least that long and AWS still says the instance is running, OwnStack fires a restart on its own, recorded with triggered_by=auto_recovery. It's rate-limited to one auto-recovery per 30 minutes per stack to avoid loops.

Auto-recovery is a sledgehammer.

A reboot fixes a hung kernel and a wedged systemd, but it also drops in-flight requests and forces every container to restart. Use it for hosts you'd rather lose 60 s of traffic on than leave dark for hours; leave it off for stacks where a graceful shutdown matters.

The serial console

For AWS stacks, the Host status panel's "View log" link calls GetConsoleOutput against EC2. That returns the most recent ~64 KB of hypervisor-captured serial output — the boot log, kernel messages, OOM dumps, panic traces. It's what you reach for when the OS is too sick to SSH into. The output is a snapshot, not a stream.

From the CLI

$ ownstack stack health <stack>
=== node-hub-538 ===
  IP:                44.237.78.27
  Provider state:    running

  Host reachability: ● Active (port:22, 60s probe)
  On-box agent:      ○ Not installed

  Active alerts (1):
    • [critical] host_unreachable — TCP timeout (2026-05-10T18:13:25Z)