Overview: design philosophy
Single binary
Every role is compiled into the same binary and packaged into the same image. Processes are distinguished purely by configuration (
[node].roles); no separate build artifacts are required.Roles selected by config
[node].roles determines what this process exposes and which foreground responsibilities it carries. The default is ["standalone"], i.e. a single process running every responsibility.State externalized where possible
Metadata lands in Postgres, data lands in the object store. Aside from the Ingester’s WAL/buffer, most roles are stateless and scale horizontally.
Discovery via Postgres
There is no separate gossip / consensus component. Nodes write themselves into the
cluster_nodes table via heartbeat, and other nodes discover live peers by querying that table directly.- Single-node standalone
- Role-split cluster
- Evaluation, development, PoC, low-traffic production.
- One process exposes HTTP + gRPC and runs every internal responsibility in-process (ingest, query, compaction, alert evaluation, etc.).
- Dependencies are still externalized: Postgres + object store (or a local filesystem backend).
Role enum values use snake_case in TOML / environment variables (
standalone, alert_manager), matching the naming convention of the internal implementation. All config examples below use snake_case.Node role reference
[node].roles is an array (Vec<Role>), with valid values: standalone, router, ingester, querier, compactor, alert_manager. The default is ["standalone"].
| Role | Foreground exposure | Stateful / stateless | Scaling characteristics |
|---|---|---|---|
standalone | HTTP + gRPC, running every internal responsibility in-process | Stateful (embeds an Ingester) | Single instance for evaluation; not suitable as a horizontal scaling unit |
router | HTTP reverse proxy + rate limiting; does not directly expose internal services beyond the business HTTP listener | Stateless (the rate limiter is in-process and ephemeral) | Scales horizontally; just put an L4/L7 LB in front |
ingester | Carries WAL replay + buffering + periodic flush (the actual work is pre-spawned at process startup) | Stateful (WAL, in-memory buffer, file_meta cache) | Sharded via consistent hashing; scaling must account for the WAL persistent volume and rebalancing |
querier | Designed to be the distributed scan endpoint | Stateless | Scales horizontally |
compactor | Background tick loop (pre-spawned); the role body itself idles | Stateless (processes tick by tick) | Single instance recommended (see the high-availability section) |
alert_manager | Two pre-spawned background loops: evaluation + dispatch; the role body itself idles | Stateful (evaluation state and incident state are in-process) | Single instance recommended |
Which role carries which background worker
One key fact: nearly all background workers are spawned unconditionally during the unified assembly phase at process startup, not inside each role’s dispatch function. They self-gate via feature flags or configuration. This means: in astandalone process, they all run; in a role-split image, whether they actually “do work” depends on whether that role was injected with the corresponding dependencies and configuration. The table below gives the recommended carrying role and interval.
| Background worker | Recommended carrying role | Interval | Config key | Status |
|---|---|---|---|---|
| heartbeat | All roles | heartbeat_interval_secs, default 5s (first beat immediate) | [cluster].heartbeat_interval_secs | Wired up |
| stale node cleanup (sweeper) | All roles | Fixed 60s | None (hardcoded) | Wired up |
| object store health probe | All roles (blocking probe once at startup) | health_probe_interval_secs, default 30s | [store.object].health_probe_interval_secs | Wired up |
| alert evaluation (evaluator) | alert_manager | eval_interval_secs, default 30s | [alert_manager].eval_interval_secs | Wired up |
| alert dispatch (dispatcher) | alert_manager | dispatch_interval_secs, default 10s | [alert_manager].dispatch_interval_secs | Wired up |
| compaction | compactor | interval_secs, default 300s (5 min) | [compactor].interval_secs | Wired up |
| file_meta_dumper (cold partition flush to storage) | compactor (same process as compaction) | interval_secs, default 3600s (1h) | [storage.file_meta_dump].interval_secs; set enabled=false to disable | Wired up |
| scheduled_reports | alert_manager (needs rendering dependencies, see below) | Fixed 60s tick; each report’s own cron decides whether it is due | the report’s cron (the tick interval is not configurable) | Wired up |
| search_jobs (async search pool) | The role carrying AppState (typically standalone / a node exposing query) | Idle poll idle_poll_secs, default 2s; cleanup cleanup_interval_secs, default 3600s | [search_jobs].workers (default 2), etc. | Wired up |
| ACME issuance / renewal | router (TLS entry point) | Issuance issue_poll_secs (default 60s); renewal renewal_retry_secs (default 21600s / 6h) | [http.tls].issue_poll_secs / renewal_retry_secs | Currently a placeholder / not enabled |
The compactor and alert_manager loops follow a “pre-spawned” model: they run as independent background tasks during the unified assembly phase, while the role body functions
compactor::run() / alert_manager::run() are merely idle placeholders. So setting a process’s [node].roles to compactor or alert_manager results in “that process’s role body idling + the background loops running as usual.”How to select roles
Roles are configured via[node].roles; they can be overridden with environment variables. Environment variables share the MS_ prefix, with section and field separated by a . (dot) (the MS_ prefix is stripped and the remaining key is split on .). For example: MS_NODE.ROLES, MS_STORE.META.DSN, MS_HTTP.PORT.
- TOML
- Env var override
- A set of role-split processes
A few bootstrap / secret variables are flat single-underscore and live outside
Settings, read directly from the environment: MS_CIPHER_KEY, MS_AUTH_JWT_SECRET_OVERRIDE, MS_LICENSE_FILE, MS_COPILOT_<PROVIDER>_API_KEY, etc. All other structured fields use the MS_<SECTION>.<FIELD> dotted form, exactly as the docker-compose and k8s manifests do (e.g. MS_NODE.ROLES, MS_STORE.META.DSN, MS_CLUSTER.ADVERTISE_ADDR).Data flow
Ingest path
The entry point lands on some Ingester via the Router (rate limiting + consistent hashing); the Ingester first writes the WAL (durably persisted to disk), then writes the in-memory buffer; a background flush loop encodes the buffer into columnar files + a search index by time window or size threshold and uploads it to the object store, finally landing the FileMeta in Postgres and truncating the flushed WAL segments.[wal].dir, [wal].segment_size_mb, [wal].flush_strategy (batch/none/every_write), [wal].sync_level (data/all); [ingester].buffer_max_mb (default 256), flush_interval_secs (default 30), flush_parallelism (default 4); [router.rate_limit].ingest_qps (default 1000 per org, 0 = unlimited).
The Router rate limits at
(org_id, route_class) granularity, returning 429 with a Retry-After header when exceeded. org_id comes from the X-Org-Id request header and defaults to default.Query path
The query entry point is on a node that exposes HTTP. The engine is wrapped layer by layer according to cluster size: local query engine → (with ≥2 querier peers) distributed engine → (when a remote cluster is specified) federated engine.Distributed query shards are fanned out to peers via a hash of
object_key; the shard SQL only performs a scan (SELECT * FROM <stream>), while the full aggregation runs on the coordinator to avoid partial/final aggregation inconsistencies. The distributed path is taken only when the cluster has ≥2 querier peers; otherwise it falls back to the local engine with no network hop.Async search job pipeline
[search_jobs].workers (default 2), idle_poll_secs (default 2), cleanup_interval_secs (default 3600); the automatic async threshold [querier].auto_async_threshold_rows (default 50 million rows).
The
FOR UPDATE SKIP LOCKED claim semantics are safe across multiple workers / multiple nodes: multiple processes carrying search_jobs can share the same search_jobs table and claim concurrently without executing the same job twice.Federated / multi-cluster query
Target clusters are specified via?clusters=local,sf,nyc. After scanning locally, the coordinator issues one internal scan RPC (with a Bearer token) to each enabled remote cluster, then UNION ALLs the batches each cluster returns with the local data and executes the full SQL.
Cluster membership and discovery
Node registration
The heartbeat task periodically upserts
(node_id, role, advertise_addr, last_heartbeat_at_micros) into the Postgres cluster_nodes table (primary key node_id, ON CONFLICT DO UPDATE). The first heartbeat fires immediately, subsequent ones at the configured interval.Heartbeat interval
[cluster].heartbeat_interval_secs, default 5s. advertise_addr defaults to 127.0.0.1:5082, i.e. the gRPC address (host:port) peers use to interconnect.Liveness window and stale cleanup
The liveness window is controlled by
[cluster].peer_timeout_secs, default 15s: the registry only returns nodes with last_heartbeat_at >= now - peer_timeout. Additionally, a sweeper deletes cluster_nodes rows that have not heartbeated for over 5 minutes every 60s.Peer discovery
There is no gossip / consensus; in distributed mode each role queries the
cluster_nodes table directly and filters live peers by role. standalone mode skips the entire discovery flow, returning only itself and never touching cluster_nodes.Placement algorithm
Router selecting an Ingester: applies consistent hashing over
org_id|stream_name then mods, landing deterministically on a particular Ingester. Router selecting a Querier: naive round-robin (now_ns % peer_count), not full consistent hashing. Distributed query sharding: a hash over object_key then mods, fanning out to querier peers.Distributed scan RPC
The coordinator encodes a scan request (org/stream/sql/file_metas/time_range) into a ticket and sends it to peers over the internal scan RPC; the peer reads columnar files, registers an in-memory table, runs the shard SQL, and returns a result stream. This RPC shares the same port (default 5082) with the gRPC node service and ingest service.
Local scan-RPC calls within the cluster are unauthenticated; only federated / remote-cluster calls use an optional Bearer token.
cluster_nodes table schema: node_id VARCHAR(64) PK, role VARCHAR(32), advertise_addr VARCHAR(255), started_at_micros BIGINT, last_heartbeat_at_micros BIGINT, with indexes on role and last_heartbeat_at_micros.
External dependencies and ports
PostgreSQL
The metadata database: FileMeta, streams, rules, incidents, users, orgs, audit, quotas, certificates,
cluster_nodes, search_jobs, remote_clusters, and more. [store.meta]: backend (default sqlite; set to postgres for production), dsn, max_connections (default 16). Migrations are embedded at compile time.Object store
Columnar files + search-index sidecars.
[store.object].backend: local (default, root=./data/objects) / s3 (including MinIO, R2, Alibaba Cloud OSS, overridden via endpoint) / azure / gcs. Credential precedence: env vars > credential file > inline TOML.No external cache / consensus
Only in-process LRU+TTL caches, the rate limiter, and the async runtime. No Redis / Memcached, no external consensus component.
Rendering dependency (optional)
PNG/PDF rendering for scheduled reports requires headless Chromium and is feature-gated; OSS builds always fall back to an SVG placeholder. The image’s runtime layer already ships with chromium.
Listening ports
| Port | Protocol / service | Config key | Notes |
|---|---|---|---|
| 5080 | HTTP | [http].bind (default 0.0.0.0), [http].port | /api/v1/*, /metrics, /healthz, /readyz, /.well-known/acme-challenge/, etc. |
| 5082 | gRPC | [grpc].bind, [grpc].port, max_message_size_mb (default 32MB) | Ingest + scan RPC + cluster heartbeat, three services sharing one port |
| 80 | Plaintext HTTP in TLS mode | [http.tls].plain_port | Health checks + ACME HTTP-01 challenge + 301 redirect to HTTPS |
| 443 | HTTPS in TLS mode (SNI) | [http.tls].port | Full routing; requires the domain-acme-tls feature |
/metrics (GET, Prometheus text 0.0.4) is controlled by [telemetry].metrics_enabled (default true) and exposes metrics such as cache hits / invalidations, object_store_*, wal_*, file_meta_dump_*, tantivy_pruned_files_total, alert_rule_eval_timeout_total.Health / readiness probe semantics
| Probe | Path | 200 condition | 503 condition | Purpose |
|---|---|---|---|---|
| Liveness | GET /healthz | WAL replay complete (Ingester-relevant only; other roles bypass it) and the object store is not degraded | replay in progress or object store connection lost | Process liveness and dependency health |
| Readiness | GET /readyz | replay_done = true (even if the object store is degraded) | replay still in progress | Lets K8s keep admitting read traffic while the write path is degraded |
Object store probe at startup (blocking)
startup_ping() synchronously performs a PUT→GET→DELETE (128 bytes) against _health/{uuid}.probe. On failure, the process does not start.Background periodic probe
The same round-trip runs every
[store.object].health_probe_interval_secs (default 30s); only 3 consecutive failures set object_store_degraded=true, and a success resets the counter to zero.Authentication and config overrides
- The JWT secret is auto-bootstrapped by the database on first startup; the old
jwt_secretTOML field is deprecated (parsing is retained only for backward compatibility with old configs). To pin a fixed secret, use the environment variableMS_AUTH_JWT_SECRET_OVERRIDE. - API tokens take the form
ms_<prefix>_<secret>, with the secret stored as an argon2id hash. - At startup, fields such as
store.meta.dsn,wal.dir,http.port,grpc.port,node.idare treated as immutable; changing them at runtime triggers a warning.
Deployment topologies
All topologies share the same image (e.g.molesignal:dev), distinguished by MS_NODE.ROLES. The web frontend is a separate nginx image (e.g. molesignal-web:dev).
- Single-node sandbox
- Docker Compose
- Kubernetes
A single process with all roles combined; the fastest way to get started.
The local backend (
store.object.backend=local) is fine for development; use an object store backend for production.Scaling and high availability
Router — scales horizontally
Stateless; scale replicas with entry traffic. The rate limiter is in-process and ephemeral: under multiple replicas each replica counts independently, so the effective org QPS ceiling is roughly
configured value × replica count. When needed, move rate limiting up to a unified gateway or lower the per-replica threshold.Querier — scales horizontally
Stateless. Distributed scans only trigger with ≥2 peers. Note: the standalone querier’s scan-RPC server is currently a placeholder, so the fully landed form of distributed query is the in-process engine of standalone. The bottleneck is columnar-file reads + query-engine memory.
Ingester — stateful
Use a StatefulSet + a per-replica PVC for a persistent WAL. The Router places via consistent hashing over
org|stream; scaling changes the mod result and causes rebalancing, so ensure the WAL is flushed before scaling down. Leave enough of a replay window for the readiness probe (the manifest sets a 10s initial delay).Compactor / AlertManager — single instance
Both are recommended at a single replica: multiple Compactor instances would produce merge conflicts (a lease-table lock is planned for the future), and AlertManager’s evaluation / incident state is in-process. Reliability comes from fast restarts rather than multiple replicas.
wal_append_lock_wait_seconds, wal_append_inflight, wal_fsync_errors_total, file_meta_dump_*; for the object store, object_store_operations_total, object_store_errors_total, object_store_op_duration_seconds, object_store_probe_*; for queries, the cache hit rate cache_* and tantivy_pruned_files_total; for alerts, alert_rule_eval_timeout_total. Combine /healthz, /readyz, and the cluster_nodes table to observe member liveness.
Minimal production checklist / validation checklist
External dependencies ready
Postgres reachable,
[store.meta].backend = "postgres" with a correct dsn; choose an object store backend of s3/azure/gcs (not local), with credentials injected in the precedence order “env vars > credential file > inline”.Roles and interconnect addresses
Each process has an explicit
MS_NODE.ROLES; set MS_CLUSTER.ADVERTISE_ADDR to a peer-reachable host:5082 (K8s uses $(POD_IP):5082). Confirm only a single foreground role is set (a multi-role array only takes the first element).Ingester persistence
Run the Ingester as a StatefulSet + PVC mounting the WAL directory; set
[wal].flush_strategy / sync_level according to durability requirements; leave enough replay delay for the readiness probe.Single-instance roles
Keep Compactor and AlertManager at 1 replica each. Confirm
[compactor].interval_secs, retention_days, and [storage.file_meta_dump].enabled match expectations.Entry point and rate limiting
Put an LB in front of the Router; set
[router.rate_limit].ingest_qps / query_qps per org; note that rate limiting is approximate under multiple replicas. Disable buffering for /api/v1/query/stream on the Ingress.Observability and probes
Wire
/metrics into Prometheus; point K8s probes at /api/v1/healthz (liveness) and /readyz (readiness). Confirm the object store startup probe passes (otherwise the process will not start).Security and license
Set
MS_AUTH_JWT_SECRET_OVERRIDE when you need a fixed JWT secret; otherwise it is auto-bootstrapped by the DB. Inject a cipher key (do not use the dev all-zeros value in production). Federated query requires the federated_search license feature; OSS does not support TLS/ACME (which requires building with the domain-acme-tls feature), and automatic certificate issuance/renewal is currently a placeholder and not enabled.