Distributed & Role-Based Deployment

MoleSignal is a single binary: the same executable (the same image) serves every role, and you choose which roles a given process runs. This lets it act as a sandbox that starts with one command, and scale out horizontally into a multi-role cluster. This page is aimed at operators / SREs. It systematically covers the role breakdown, which roles own which background workers, the data flow, cluster discovery, external dependencies and ports, and three deployment topologies: single-node, Docker Compose, and Kubernetes.

Overview: design philosophy

Single binary

Every role is compiled into the same binary and packaged into the same image. Processes are distinguished purely by configuration ([node].roles); no separate build artifacts are required.

Roles selected by config

[node].roles determines what this process exposes and which foreground responsibilities it carries. The default is ["standalone"], i.e. a single process running every responsibility.

State externalized where possible

Metadata lands in Postgres, data lands in the object store. Aside from the Ingester’s WAL/buffer, most roles are stateless and scale horizontally.

Discovery via Postgres

There is no separate gossip / consensus component. Nodes write themselves into the cluster_nodes table via heartbeat, and other nodes discover live peers by querying that table directly.

When to run single-node vs. split by role:

Single-node standalone
Role-split cluster

Evaluation, development, PoC, low-traffic production.
One process exposes HTTP + gRPC and runs every internal responsibility in-process (ingest, query, compaction, alert evaluation, etc.).
Dependencies are still externalized: Postgres + object store (or a local filesystem backend).

Role enum values use snake_case in TOML / environment variables (standalone, alert_manager), matching the naming convention of the internal implementation. All config examples below use snake_case.

Node role reference

[node].roles is an array (Vec<Role>), with valid values: standalone, router, ingester, querier, compactor, alert_manager. The default is ["standalone"].

The current implementation only uses the first role in the array for foreground dispatch and the PeerRole annotation on cluster heartbeats. A multi-role form like [node].roles = ["ingester", "compactor"] parses successfully, but only the first element currently takes effect. “A single process carrying multiple foreground roles at once” is a design intent that has not yet landed, so do not rely on it in production. If you need to combine responsibilities, use standalone.

Role	Foreground exposure	Stateful / stateless	Scaling characteristics
`standalone`	HTTP + gRPC, running every internal responsibility in-process	Stateful (embeds an Ingester)	Single instance for evaluation; not suitable as a horizontal scaling unit
`router`	HTTP reverse proxy + rate limiting; does not directly expose internal services beyond the business HTTP listener	Stateless (the rate limiter is in-process and ephemeral)	Scales horizontally; just put an L4/L7 LB in front
`ingester`	Carries WAL replay + buffering + periodic flush (the actual work is pre-spawned at process startup)	Stateful (WAL, in-memory buffer, file_meta cache)	Sharded via consistent hashing; scaling must account for the WAL persistent volume and rebalancing
`querier`	Designed to be the distributed scan endpoint	Stateless	Scales horizontally
`compactor`	Background tick loop (pre-spawned); the role body itself idles	Stateless (processes tick by tick)	Single instance recommended (see the high-availability section)
`alert_manager`	Two pre-spawned background loops: evaluation + dispatch; the role body itself idles	Stateful (evaluation state and incident state are in-process)	Single instance recommended

The Querier role is currently a placeholder / incomplete. The role dispatch function currently only logs a single “querier role started (stub)” line and idles; the standalone distributed-scan server implementation has not yet been wired up. The fan-out protocol for distributed queries (see the data flow section) exists in the code and is invoked by the coordinator, but the layer that serves the scan RPC as a standalone querier process is not yet complete. The currently usable form of distributed query is the engine composition inside a standalone process.

Which role carries which background worker

One key fact: nearly all background workers are spawned unconditionally during the unified assembly phase at process startup, not inside each role’s dispatch function. They self-gate via feature flags or configuration. This means: in a standalone process, they all run; in a role-split image, whether they actually “do work” depends on whether that role was injected with the corresponding dependencies and configuration. The table below gives the recommended carrying role and interval.

Background worker	Recommended carrying role	Interval	Config key	Status
heartbeat	All roles	`heartbeat_interval_secs`, default 5s (first beat immediate)	`[cluster].heartbeat_interval_secs`	Wired up
stale node cleanup (sweeper)	All roles	Fixed 60s	None (hardcoded)	Wired up
object store health probe	All roles (blocking probe once at startup)	`health_probe_interval_secs`, default 30s	`[store.object].health_probe_interval_secs`	Wired up
alert evaluation (evaluator)	`alert_manager`	`eval_interval_secs`, default 30s	`[alert_manager].eval_interval_secs`	Wired up
alert dispatch (dispatcher)	`alert_manager`	`dispatch_interval_secs`, default 10s	`[alert_manager].dispatch_interval_secs`	Wired up
compaction	`compactor`	`interval_secs`, default 300s (5 min)	`[compactor].interval_secs`	Wired up
file_meta_dumper (cold partition flush to storage)	`compactor` (same process as compaction)	`interval_secs`, default 3600s (1h)	`[storage.file_meta_dump].interval_secs`; set `enabled=false` to disable	Wired up
scheduled_reports	`alert_manager` (needs rendering dependencies, see below)	Fixed 60s tick; each report’s own cron decides whether it is due	the report’s `cron` (the tick interval is not configurable)	Wired up
search_jobs (async search pool)	The role carrying `AppState` (typically `standalone` / a node exposing query)	Idle poll `idle_poll_secs`, default 2s; cleanup `cleanup_interval_secs`, default 3600s	`[search_jobs].workers` (default 2), etc.	Wired up
ACME issuance / renewal	`router` (TLS entry point)	Issuance `issue_poll_secs` (default 60s); renewal `renewal_retry_secs` (default 21600s / 6h)	`[http.tls].issue_poll_secs` / `renewal_retry_secs`	Currently a placeholder / not enabled

The ACME issuance / renewal loops are currently a stub and are not spawned during the assembly phase. The loop interval fields, per-domain cooldown, and single-issuance method are all present, but the functions that scan for pending domains and renewal domains currently just return success without doing anything, and nothing spawns the runner anywhere. Conclusion: automatic certificate issuance / renewal is not yet in effect. TLS itself additionally requires building with the domain-acme-tls feature; OSS builds do not support TLS.

The compactor and alert_manager loops follow a “pre-spawned” model: they run as independent background tasks during the unified assembly phase, while the role body functions compactor::run() / alert_manager::run() are merely idle placeholders. So setting a process’s [node].roles to compactor or alert_manager results in “that process’s role body idling + the background loops running as usual.”

How to select roles

Roles are configured via [node].roles; they can be overridden with environment variables. Environment variables share the MS_ prefix, with section and field separated by a . (dot) (the MS_ prefix is stripped and the remaining key is split on .). For example: MS_NODE.ROLES, MS_STORE.META.DSN, MS_HTTP.PORT.

TOML
Env var override
A set of role-split processes

[node]
# Unique node identifier; leave empty to have the process generate one
id = "ingester-a1"
# Role array; only the first element currently takes effect
roles = ["ingester"]

[cluster]
# gRPC address (host:port) peers use to interconnect
advertise_addr = "10.0.1.21:5082"
heartbeat_interval_secs = 5
peer_timeout_secs = 15

# Variable names use the dotted form, matching the docker-compose / k8s manifests.
# Note: a dot is not a valid POSIX shell identifier, so you cannot `export MS_NODE.ROLES=…`;
# inject via the container / orchestrator env, or launch with an env prefix:
env 'MS_NODE.ROLES=["ingester"]' \
    'MS_NODE.ID=ingester-a1' \
    'MS_CLUSTER.ADVERTISE_ADDR=10.0.1.21:5082' \
    'MS_HTTP.PORT=5080' \
    'MS_GRPC.PORT=5082' \
    molesignal --config ./conf/config.toml

# One env line per process (the value is the role that process carries):
# Entry router
MS_NODE.ROLES='["router"]'

# Ingest node (needs a persistent volume for the WAL)
MS_NODE.ROLES='["ingester"]'

# Query node
MS_NODE.ROLES='["querier"]'

# Compaction + cold partition flush to storage
MS_NODE.ROLES='["compactor"]'

# Alert evaluation + dispatch + scheduled reports
MS_NODE.ROLES='["alert_manager"]'

A few bootstrap / secret variables are flat single-underscore and live outside Settings, read directly from the environment: MS_CIPHER_KEY, MS_AUTH_JWT_SECRET_OVERRIDE, MS_LICENSE_FILE, MS_COPILOT_<PROVIDER>_API_KEY, etc. All other structured fields use the MS_<SECTION>.<FIELD> dotted form, exactly as the docker-compose and k8s manifests do (e.g. MS_NODE.ROLES, MS_STORE.META.DSN, MS_CLUSTER.ADVERTISE_ADDR).

Data flow

Ingest path

The entry point lands on some Ingester via the Router (rate limiting + consistent hashing); the Ingester first writes the WAL (durably persisted to disk), then writes the in-memory buffer; a background flush loop encodes the buffer into columnar files + a search index by time window or size threshold and uploads it to the object store, finally landing the FileMeta in Postgres and truncating the flushed WAL segments.

                 X-Org-Id            consistent hash( org_id|stream )
 client ──HTTP/gRPC──► Router ──consistent hashing──► Ingester (stateful)
 :5080/:5082          (stateless, rate limit)         │
                       │ 429 + Retry-After             │ 1) write WAL (to disk, configurable fsync strategy)
                       │ when over org QPS             │ 2) write in-memory buffer (accumulated by seq)
                                               ▼
                              background flush loop (triggered by flush_interval_secs / buffer threshold)
                                               │
                          snapshot → columnar files + search-index sidecar
                                               │ upload
                          ┌────────────────────┴────────────────────┐
                          ▼                                          ▼
                  object store (local / s3 / azure / gcs)        Postgres: FileMeta
                  {org}/{type}/{stream}/{date}/{file-id}  (time_range, rows,
                  …a search-index sidecar per file              min/max, object_key)
                                               │
                          after a successful flush, seal + truncate the flushed WAL segments

Key config: [wal].dir, [wal].segment_size_mb, [wal].flush_strategy (batch/none/every_write), [wal].sync_level (data/all); [ingester].buffer_max_mb (default 256), flush_interval_secs (default 30), flush_parallelism (default 4); [router.rate_limit].ingest_qps (default 1000 per org, 0 = unlimited).

The Router rate limits at (org_id, route_class) granularity, returning 429 with a Retry-After header when exceeded. org_id comes from the X-Org-Id request header and defaults to default.

Query path

The query entry point is on a node that exposes HTTP. The engine is wrapped layer by layer according to cluster size: local query engine → (with ≥2 querier peers) distributed engine → (when a remote cluster is specified) federated engine.

 client ──POST /api/v1/query──► QueryService
                                   │  Prefer: respond-async / respond-sync
                                   │  or auto_async_threshold_rows triggers automatic async
                          ┌────────┴─────────┐
                       synchronous execution   async → create a search_jobs row, return 202 + job_id
                          │
       engine composition: query engine (local) → distributed (≥2 querier) → federated (remote cluster)
                          │
            FileMeta partition pruning (org/stream/stream_type/time_range) + full-text (MATCH) pruning
                          │
      distributed sharding: hash(object_key) % peer_count → each shard encoded as a scan request (ticket)
                          │  internal scan RPC (:5082)
            ┌─────────────┼─────────────┐
            ▼             ▼             ▼
        Querier peer   Querier peer   …(each reads columnar files, SELECT * scan, returns result batches)
            └─────────────┼─────────────┘
                          ▼
        coordinator UNION ALL → register an in-memory table → execute the full user SQL (aggregation/filter/projection)
                          ▼
                   QueryResult (columns, rows, scanned_rows, took_ms)

Distributed query shards are fanned out to peers via a hash of object_key; the shard SQL only performs a scan (SELECT * FROM <stream>), while the full aggregation runs on the coordinator to avoid partial/final aggregation inconsistencies. The distributed path is taken only when the cluster has ≥2 querier peers; otherwise it falls back to the local engine with no network hop.

Async search job pipeline

 POST /query (Prefer: respond-async or over threshold)
        │
        ▼  write a search_jobs row (state=Pending, request_json, expires_at = now+7d)
   return 202 + job_id + monitoring URL
        │
        ▼  search_jobs worker pool (default 2 workers)
   claim_next_pending()  ──PG FOR UPDATE SKIP LOCKED──► deserialize QueryRequest
        │
        ▼  QueryService::run() executes
   result encoded as NDJSON, uploaded to object store: {org_id}/search_jobs/{job_id}.ndjson
        │
        ▼  mark_done(result_object_key, rows) / mark_failed(error) on failure
   cleanup loop deletes expired jobs every cleanup_interval_secs (DB + object store)
        │
 client polls /api/v1/query/jobs/{job_id}, downloads the NDJSON once state=Done

Config: [search_jobs].workers (default 2), idle_poll_secs (default 2), cleanup_interval_secs (default 3600); the automatic async threshold [querier].auto_async_threshold_rows (default 50 million rows).

The FOR UPDATE SKIP LOCKED claim semantics are safe across multiple workers / multiple nodes: multiple processes carrying search_jobs can share the same search_jobs table and claim concurrently without executing the same job twice.

Federated / multi-cluster query

Target clusters are specified via ?clusters=local,sf,nyc. After scanning locally, the coordinator issues one internal scan RPC (with a Bearer token) to each enabled remote cluster, then UNION ALLs the batches each cluster returns with the local data and executes the full SQL.

 POST /api/v1/query?clusters=local,sf,nyc
        │  license check: contains a non-"local" target AND no federated_search feature → 403 Forbidden
        ▼
   local scan (single-threaded, for correctness) + fan-out to each enabled remote cluster:
        │   token: env:VAR / cipher_keys:<id> (the latter needs the injected secret store to decrypt)
        │   connect to advertise_addr (http:// or https:// depending on tls_verify)
        ▼
   remote returns → failures (auth/unreachable) recorded in degraded_clusters and processing continues (graceful degradation)
        ▼
   merge all successful batches → execute the full user SQL
   return federation: { scanned_clusters, degraded_clusters, degraded_reason }

Federated query is license-gated: as long as clusters contains a non-local target and the current license lacks the federated_search feature, the HTTP layer returns 403 directly. OSS / community editions fall back to single-cluster. Remote cluster definitions are stored in the Postgres remote_clusters table (advertise_addr, token_secret_ref, tls_verify, enabled); clusters with enabled=false are skipped during fan-out. Remote auth currently supports Bearer token only; tls_verify=false maps to http:// (rather than “https with verification skipped”).

Cluster membership and discovery

Node registration

The heartbeat task periodically upserts (node_id, role, advertise_addr, last_heartbeat_at_micros) into the Postgres cluster_nodes table (primary key node_id, ON CONFLICT DO UPDATE). The first heartbeat fires immediately, subsequent ones at the configured interval.

Heartbeat interval

[cluster].heartbeat_interval_secs, default 5s. advertise_addr defaults to 127.0.0.1:5082, i.e. the gRPC address (host:port) peers use to interconnect.

Liveness window and stale cleanup

The liveness window is controlled by [cluster].peer_timeout_secs, default 15s: the registry only returns nodes with last_heartbeat_at >= now - peer_timeout. Additionally, a sweeper deletes cluster_nodes rows that have not heartbeated for over 5 minutes every 60s.

Peer discovery

There is no gossip / consensus; in distributed mode each role queries the cluster_nodes table directly and filters live peers by role. standalone mode skips the entire discovery flow, returning only itself and never touching cluster_nodes.

Placement algorithm

Router selecting an Ingester: applies consistent hashing over org_id|stream_name then mods, landing deterministically on a particular Ingester. Router selecting a Querier: naive round-robin (now_ns % peer_count), not full consistent hashing. Distributed query sharding: a hash over object_key then mods, fanning out to querier peers.

Distributed scan RPC

The coordinator encodes a scan request (org/stream/sql/file_metas/time_range) into a ticket and sends it to peers over the internal scan RPC; the peer reads columnar files, registers an in-memory table, runs the shard SQL, and returns a result stream. This RPC shares the same port (default 5082) with the gRPC node service and ingest service.

Local scan-RPC calls within the cluster are unauthenticated; only federated / remote-cluster calls use an optional Bearer token.

cluster_nodes table schema: node_id VARCHAR(64) PK, role VARCHAR(32), advertise_addr VARCHAR(255), started_at_micros BIGINT, last_heartbeat_at_micros BIGINT, with indexes on role and last_heartbeat_at_micros.

External dependencies and ports

PostgreSQL

The metadata database: FileMeta, streams, rules, incidents, users, orgs, audit, quotas, certificates, cluster_nodes, search_jobs, remote_clusters, and more. [store.meta]: backend (default sqlite; set to postgres for production), dsn, max_connections (default 16). Migrations are embedded at compile time.

Object store

Columnar files + search-index sidecars. [store.object].backend: local (default, root=./data/objects) / s3 (including MinIO, R2, Alibaba Cloud OSS, overridden via endpoint) / azure / gcs. Credential precedence: env vars > credential file > inline TOML.

No external cache / consensus

Only in-process LRU+TTL caches, the rate limiter, and the async runtime. No Redis / Memcached, no external consensus component.

Rendering dependency (optional)

PNG/PDF rendering for scheduled reports requires headless Chromium and is feature-gated; OSS builds always fall back to an SVG placeholder. The image’s runtime layer already ships with chromium.

Listening ports

Port	Protocol / service	Config key	Notes
5080	HTTP	`[http].bind` (default `0.0.0.0`), `[http].port`	`/api/v1/*`, `/metrics`, `/healthz`, `/readyz`, `/.well-known/acme-challenge/`, etc.
5082	gRPC	`[grpc].bind`, `[grpc].port`, `max_message_size_mb` (default 32MB)	Ingest + scan RPC + cluster heartbeat, three services sharing one port
80	Plaintext HTTP in TLS mode	`[http.tls].plain_port`	Health checks + ACME HTTP-01 challenge + 301 redirect to HTTPS
443	HTTPS in TLS mode (SNI)	`[http.tls].port`	Full routing; requires the `domain-acme-tls` feature

/metrics (GET, Prometheus text 0.0.4) is controlled by [telemetry].metrics_enabled (default true) and exposes metrics such as cache hits / invalidations, object_store_*, wal_*, file_meta_dump_*, tantivy_pruned_files_total, alert_rule_eval_timeout_total.

Health / readiness probe semantics

Probe	Path	200 condition	503 condition	Purpose
Liveness	`GET /healthz`	WAL replay complete (Ingester-relevant only; other roles bypass it) and the object store is not degraded	replay in progress or object store connection lost	Process liveness and dependency health
Readiness	`GET /readyz`	`replay_done = true` (even if the object store is degraded)	replay still in progress	Lets K8s keep admitting read traffic while the write path is degraded

Object store probe at startup (blocking)

startup_ping() synchronously performs a PUT→GET→DELETE (128 bytes) against _health/{uuid}.probe. On failure, the process does not start.

Background periodic probe

The same round-trip runs every [store.object].health_probe_interval_secs (default 30s); only 3 consecutive failures set object_store_degraded=true, and a success resets the counter to zero.

Ingester WAL replay

At startup the Ingester scans the segment files under [wal].dir, replays them into the in-memory buffer by (org, stream_type, stream), forces a single flush once everything is loaded, then sets replay_done=true. /readyz returns 503 until replay completes.

Authentication and config overrides

The JWT secret is auto-bootstrapped by the database on first startup; the old jwt_secret TOML field is deprecated (parsing is retained only for backward compatibility with old configs). To pin a fixed secret, use the environment variable MS_AUTH_JWT_SECRET_OVERRIDE.
API tokens take the form ms_<prefix>_<secret>, with the secret stored as an argon2id hash.
At startup, fields such as store.meta.dsn, wal.dir, http.port, grpc.port, node.id are treated as immutable; changing them at runtime triggers a warning.

Deployment topologies

All topologies share the same image (e.g. molesignal:dev), distinguished by MS_NODE.ROLES. The web frontend is a separate nginx image (e.g. molesignal-web:dev).

Single-node sandbox
Docker Compose
Kubernetes

A single process with all roles combined; the fastest way to get started.

[node]
roles = ["standalone"]

[store.meta]
backend = "postgres"
dsn = "postgres://molesignal:molesignal@localhost:5432/molesignal"

[store.object]
backend = "local"
root = "./data/objects"

[wal]
dir = "./data/wal"

molesignal --config ./conf/config.toml
# HTTP :5080  gRPC :5082

The local backend (store.object.backend=local) is fine for development; use an object store backend for production.

The Compose file lives at deploy/docker/docker-compose.yaml and depends on postgres:5432 (with a health check) and minio:9000, with config mounted (read-only) from ../../conf/config.toml. The image is built by deploy/docker/Dockerfile in three stages (frontend pnpm → Rust release → bookworm-slim runtime layer with chromium).Default standalone profile:

docker compose up
# A single molesignal container, MS_NODE.ROLES=["standalone"]
# Exposes 5080(HTTP) / 5082(gRPC), mounts obs-wal:/data/wal

multirole profile (six roles in separate containers):

docker compose --profile multirole up
# router / ingester / querier / compactor / alert-manager / connector

molesignal-router: exposes 5080, the entry point.
molesignal-ingester: mounts obs-wal:/data/wal for a persistent WAL; the other roles are stateless.
The remaining role containers each carry their corresponding responsibility.

There is also deploy/docker/Dockerfile.web and nginx.conf: the web container reverse-proxies to MS_BACKEND, disables proxy_buffering for /api/v1/query/stream and relaxes read/write timeouts to 600s to support live tail; /assets/* (Vite hash-named) is set to 1-year immutable, and index.html is set to no-store.

The manifests live in deploy/k8s/ and implement the multirole topology. All core roles inject: MS_NODE.ROLES, POD_IP (fieldRef status.podIP), MS_CLUSTER.ADVERTISE_ADDR=$(POD_IP):5082, plus the object store secret, cipher key, and optional license file injected from a Secret.Deployment order: 00-namespace.yaml → 10-configmap.yaml → 20-secret.yaml → the individual role manifests.

Manifest	kind	Replicas	Role	Volumes / Service
`30-router.yaml`	Deployment	2	`router`	Stateless; 5080/http + 5082/grpc; ClusterIP; `readinessProbe GET /api/v1/healthz`
`40-ingester.yaml`	StatefulSet	2	`ingester`	`volumeClaimTemplate` data 20Gi RWO/replica mounted at `/data` (WAL); headless Service (`clusterIP: None`); readiness probe with 10s initial delay (leaving a replay window)
`50-querier.yaml`	Deployment	2	`querier`	Stateless; 5080 + 5082 (scan RPC); ClusterIP; readiness probe 5s
`60-compactor.yaml`	Deployment	1	`compactor`	Stateless; single instance to avoid merge conflicts (a lease-table lock is planned for the future); ClusterIP
`70-alert-manager.yaml`	Deployment	1	`alert_manager`	Carries alert evaluation + scheduled reports (PNG/PDF needs chromium, annotated `molesignal.io/scheduled-reports-renderer: requires-chromium`); ClusterIP
`90-connector.yaml`	Deployment	1	`connector`	Drives pull / receive-push connectors; only 5080/http (no grpc); ClusterIP
`80-web.yaml`	Deployment	2	—	`molesignal-web:dev` nginx, `MS_BACKEND=router:5080`, 8080→Service 80; includes an Ingress (host `molesignal.local`)
`95-ingress.yaml`	Ingress	—	—	`molesignal-router`, host `molesignal.local`, disables buffering for `/api/v1/query/stream`

connector appears in the Compose multirole profile and the k8s manifests, but it is not among the six roles in the [node].roles enum—it is a responsibility process defined separately at the deployment layer. Use it per the existing manifests, and do not conflate it with the [node].roles enum.

Both 80-web.yaml and 95-ingress.yaml define an Ingress for host molesignal.local (one pointing to the web container, one pointing directly at the router). They are likely two alternative topologies (via the web layer vs. directly connecting to the router); deploying both at once will conflict, so pick one. The router backend in 95-ingress.yaml references port 80, while the Service in 30-router.yaml only explicitly defines 5080/5082, so align the Service port or the Ingress backend before going live.

Scaling and high availability

Router — scales horizontally

Stateless; scale replicas with entry traffic. The rate limiter is in-process and ephemeral: under multiple replicas each replica counts independently, so the effective org QPS ceiling is roughly configured value × replica count. When needed, move rate limiting up to a unified gateway or lower the per-replica threshold.

Querier — scales horizontally

Stateless. Distributed scans only trigger with ≥2 peers. Note: the standalone querier’s scan-RPC server is currently a placeholder, so the fully landed form of distributed query is the in-process engine of standalone. The bottleneck is columnar-file reads + query-engine memory.

Ingester — stateful

Use a StatefulSet + a per-replica PVC for a persistent WAL. The Router places via consistent hashing over org|stream; scaling changes the mod result and causes rebalancing, so ensure the WAL is flushed before scaling down. Leave enough of a replay window for the readiness probe (the manifest sets a 10s initial delay).

Compactor / AlertManager — single instance

Both are recommended at a single replica: multiple Compactor instances would produce merge conflicts (a lease-table lock is planned for the future), and AlertManager’s evaluation / incident state is in-process. Reliability comes from fast restarts rather than multiple replicas.

Recommended metrics to monitor: for the write path, watch wal_append_lock_wait_seconds, wal_append_inflight, wal_fsync_errors_total, file_meta_dump_*; for the object store, object_store_operations_total, object_store_errors_total, object_store_op_duration_seconds, object_store_probe_*; for queries, the cache hit rate cache_* and tantivy_pruned_files_total; for alerts, alert_rule_eval_timeout_total. Combine /healthz, /readyz, and the cluster_nodes table to observe member liveness.

Minimal production checklist / validation checklist

External dependencies ready

Postgres reachable, [store.meta].backend = "postgres" with a correct dsn; choose an object store backend of s3/azure/gcs (not local), with credentials injected in the precedence order “env vars > credential file > inline”.

Roles and interconnect addresses

Each process has an explicit MS_NODE.ROLES; set MS_CLUSTER.ADVERTISE_ADDR to a peer-reachable host:5082 (K8s uses $(POD_IP):5082). Confirm only a single foreground role is set (a multi-role array only takes the first element).

Ingester persistence

Run the Ingester as a StatefulSet + PVC mounting the WAL directory; set [wal].flush_strategy / sync_level according to durability requirements; leave enough replay delay for the readiness probe.

Single-instance roles

Keep Compactor and AlertManager at 1 replica each. Confirm [compactor].interval_secs, retention_days, and [storage.file_meta_dump].enabled match expectations.

Entry point and rate limiting

Put an LB in front of the Router; set [router.rate_limit].ingest_qps / query_qps per org; note that rate limiting is approximate under multiple replicas. Disable buffering for /api/v1/query/stream on the Ingress.

Observability and probes

Wire /metrics into Prometheus; point K8s probes at /api/v1/healthz (liveness) and /readyz (readiness). Confirm the object store startup probe passes (otherwise the process will not start).

Security and license

Set MS_AUTH_JWT_SECRET_OVERRIDE when you need a fixed JWT secret; otherwise it is auto-bootstrapped by the DB. Inject a cipher key (do not use the dev all-zeros value in production). Federated query requires the federated_search license feature; OSS does not support TLS/ACME (which requires building with the domain-acme-tls feature), and automatic certificate issuance/renewal is currently a placeholder and not enabled.

Before going live, be clear about the following “current state” boundaries: the standalone querier role is a placeholder (the distributed scan-RPC server is incomplete); a multi-role array only takes effect for the first element; ACME issuance/renewal is a placeholder and not spawned; the distributed consensus WAL term source is still a static placeholder (multi-node consensus is not implemented); federated query and SSO (OIDC only; SAML is not implemented), among others, are license/feature-gated.

​Overview: design philosophy

Single binary

Roles selected by config

State externalized where possible

Discovery via Postgres

​Node role reference

​Which role carries which background worker

​How to select roles

​Data flow

​Ingest path

​Query path

​Async search job pipeline

​Federated / multi-cluster query

​Cluster membership and discovery

​External dependencies and ports

PostgreSQL

Object store

No external cache / consensus

Rendering dependency (optional)

​Listening ports

​Health / readiness probe semantics

​Authentication and config overrides

​Deployment topologies

​Scaling and high availability

Router — scales horizontally

Querier — scales horizontally

Ingester — stateful

Compactor / AlertManager — single instance

​Minimal production checklist / validation checklist

Overview: design philosophy

Node role reference

Which role carries which background worker

How to select roles

Data flow

Ingest path

Query path

Async search job pipeline

Federated / multi-cluster query

Cluster membership and discovery

External dependencies and ports

Listening ports

Health / readiness probe semantics

Authentication and config overrides

Deployment topologies

Scaling and high availability

Minimal production checklist / validation checklist