Overview: design philosophy
Single binary
Every role is compiled into the same binary and packaged into the same image. Processes are distinguished purely by configuration (
[node].roles); no separate build artifacts are required.Roles selected by config
[node].roles determines what this process exposes and which foreground responsibilities it carries. The default is ["standalone"], i.e. a single process running every responsibility.State externalized where possible
Metadata lands in Postgres, data lands in the object store. Aside from the Ingester’s WAL/buffer, most roles are stateless and scale horizontally.
Discovery via Postgres
There is no separate gossip / consensus component. Nodes write themselves into the
cluster_nodes table via heartbeat, and other nodes discover live peers by querying that table directly.- Single-node standalone
- Role-split cluster
- Evaluation, development, PoC, low-traffic production.
- One process exposes HTTP + gRPC and runs every internal responsibility in-process (ingest, query, compaction, alert evaluation, etc.).
- Dependencies are still externalized: Postgres + object store (or a local filesystem backend).
Role enum values use snake_case in TOML / environment variables (
standalone, alert_manager), matching the naming convention of the internal implementation. All config examples below use snake_case.Node role reference
[node].roles is an array (Vec<Role>), with valid values: standalone, router, ingester, querier, compactor, alert_manager. The default is ["standalone"].
Multiple roles compose in one process. A form like
[node].roles = ["ingester", "querier"] starts the de-duplicated set of foreground servers each role needs (here: the gRPC server serving ingest + scan), gates the matching background loops, and registers the node under all its roles in cluster_nodes — so peers discover it under each role it carries. standalone is shorthand for “every role in one process”.| Role | Foreground exposure | Stateful / stateless | Scaling characteristics |
|---|---|---|---|
standalone | HTTP + gRPC, running every internal responsibility in-process | Stateful (embeds an Ingester) | Single instance for evaluation; not suitable as a horizontal scaling unit |
router | HTTP reverse proxy + rate limiting; does not directly expose internal services beyond the business HTTP listener | Stateless (the rate limiter is in-process and ephemeral) | Scales horizontally; just put an L4/L7 LB in front |
ingester | Carries WAL replay + buffering + periodic flush (the flush loop is spawned at startup when the role is configured) | Stateful (WAL, in-memory buffer, file_meta cache) | Sharded via consistent hashing; scaling must account for the WAL persistent volume and rebalancing |
querier | gRPC distributed-scan endpoint (Arrow Flight do_get): reads columnar files, runs the shard SQL, streams result batches | Stateless | Scales horizontally |
compactor | Runs the compaction + retention tick loop (spawned only when the role is configured) | Stateless (processes tick by tick) | Single instance recommended (see the high-availability section) |
alert_manager | Runs the evaluation + dispatch tick loops (spawned only when the role is configured) | Stateful (evaluation state and incident state are in-process) | Single instance recommended |
A standalone
querier (or any role set containing querier / ingester) starts the gRPC server, which carries the Arrow Flight scan service. The coordinator discovers queriers via list_role(querier) and fans shards out to them over that RPC. A single-process standalone still works as before; distributed scan-out requires ≥2 querier peers.Which role carries which background worker
Role-specific loops (ingester flush, compaction + file_meta_dumper, alert evaluation/dispatch) are spawned only when the owning role is configured (standalone counts as all roles). A handful of always-on workers (heartbeat, sweeper, object-store probe, MMDB refresh, search-jobs, scheduled-reports) run on every node. The table below gives the owning role and interval.
| Background worker | Recommended carrying role | Interval | Config key | Status |
|---|---|---|---|---|
| heartbeat | All roles | heartbeat_interval_secs, default 5s (first beat immediate) | [cluster].heartbeat_interval_secs | Wired up |
| stale node cleanup (sweeper) | All roles | Fixed 60s | None (hardcoded) | Wired up |
| object store health probe | All roles (blocking probe once at startup) | health_probe_interval_secs, default 30s | [store.object].health_probe_interval_secs | Wired up |
| alert evaluation (evaluator) | alert_manager | eval_interval_secs, default 30s | [alert_manager].eval_interval_secs | Wired up |
| alert dispatch (dispatcher) | alert_manager | dispatch_interval_secs, default 10s | [alert_manager].dispatch_interval_secs | Wired up |
| compaction | compactor | interval_secs, default 300s (5 min) | [compactor].interval_secs | Wired up |
| file_meta_dumper (cold partition flush to storage) | compactor (same process as compaction) | interval_secs, default 3600s (1h) | [storage.file_meta_dump].interval_secs; set enabled=false to disable | Wired up |
| scheduled_reports | alert_manager (needs rendering dependencies, see below) | Fixed 60s tick; each report’s own cron decides whether it is due | the report’s cron (the tick interval is not configurable) | Wired up |
| search_jobs (async search pool) | The role carrying AppState (typically standalone / a node exposing query) | Idle poll idle_poll_secs, default 2s; cleanup cleanup_interval_secs, default 3600s | [search_jobs].workers (default 2), etc. | Wired up |
| ACME issuance / renewal | the TLS-terminating HTTP server (router / standalone) | Issuance issue_poll_secs (default 60s); renewal renewal_retry_secs (default 21600s / 6h) | [http.tls].issue_poll_secs / renewal_retry_secs | Wired up (active when [http.tls].enabled = true) |
ACME issuance / renewal is implemented: the issue loop scans
pending domains and the renewal loop renews active certs within their 30-day window, both via the single-issuance path with a per-domain cooldown. The runner is started by the TLS-terminating HTTP server when [http.tls].enabled = true. TLS/ACME are compiled into every build (no feature flag) and are purely runtime-gated by [http.tls].enabled.How to select roles
Roles are configured via[node].roles; they can be overridden with environment variables. Environment variables share the MS_ prefix, with section and field separated by a . (dot) (the MS_ prefix is stripped and the remaining key is split on .). For example: MS_NODE.ROLES, MS_STORE.META.DSN, MS_HTTP.PORT.
- TOML
- Env var override
- A set of role-split processes
A few bootstrap / secret variables are flat single-underscore and live outside
Settings, read directly from the environment: MS_CIPHER_KEY, MS_AUTH_JWT_SECRET_OVERRIDE, MS_LICENSE_FILE, MS_COPILOT_<PROVIDER>_API_KEY, etc. All other structured fields use the MS_<SECTION>.<FIELD> dotted form, exactly as the docker-compose and k8s manifests do (e.g. MS_NODE.ROLES, MS_STORE.META.DSN, MS_CLUSTER.ADVERTISE_ADDR).Data flow
Ingest path
The entry point lands on some Ingester via the Router (rate limiting + consistent hashing); the Ingester first writes the WAL (durably persisted to disk), then writes the in-memory buffer; a background flush loop encodes the buffer into columnar files + a search index by time window or size threshold and uploads it to the object store, finally landing the FileMeta in Postgres and truncating the flushed WAL segments.[wal].dir, [wal].segment_size_mb, [wal].flush_strategy (batch/none/every_write), [wal].sync_level (data/all); [ingester].buffer_max_mb (default 256), flush_interval_secs (default 30), flush_parallelism (default 4); [router.rate_limit].ingest_qps (default 1000 per org, 0 = unlimited).
The Router rate limits at
(org_id, route_class) granularity, returning 429 with a Retry-After header when exceeded. org_id comes from the X-Org-Id request header and defaults to default.Query path
The query entry point is on a node that exposes HTTP. The engine is wrapped layer by layer according to cluster size: local query engine → (with ≥2 querier peers) distributed engine → (when a remote cluster is specified) federated engine.Distributed query shards are fanned out to peers via a hash of
object_key; the shard SQL only performs a scan (SELECT * FROM <stream>), while the full aggregation runs on the coordinator to avoid partial/final aggregation inconsistencies. The distributed path is taken only when the cluster has ≥2 querier peers; otherwise it falls back to the local engine with no network hop.Async search job pipeline
[search_jobs].workers (default 2), idle_poll_secs (default 2), cleanup_interval_secs (default 3600); the automatic async threshold [querier].auto_async_threshold_rows (default 50 million rows).
The
FOR UPDATE SKIP LOCKED claim semantics are safe across multiple workers / multiple nodes: multiple processes carrying search_jobs can share the same search_jobs table and claim concurrently without executing the same job twice.Federated / multi-cluster query
Target clusters are specified via?clusters=local,sf,nyc. After scanning locally, the coordinator issues one internal scan RPC (with a Bearer token) to each enabled remote cluster, then UNION ALLs the batches each cluster returns with the local data and executes the full SQL.
Cluster membership and discovery
Node registration
The heartbeat task periodically upserts
(node_id, roles, advertise_addr, last_heartbeat_at_micros) into the Postgres cluster_nodes table (primary key node_id, ON CONFLICT DO UPDATE). The node’s full role set is stored comma-joined, so a multi-role node is one row discoverable under each of its roles. The first heartbeat fires immediately, subsequent ones at the configured interval.Heartbeat interval
[cluster].heartbeat_interval_secs, default 5s. advertise_addr defaults to 127.0.0.1:5082, i.e. the gRPC address (host:port) peers use to interconnect.Liveness window and stale cleanup
The liveness window is controlled by
[cluster].peer_timeout_secs, default 15s: the registry only returns nodes with last_heartbeat_at >= now - peer_timeout. Additionally, a sweeper deletes cluster_nodes rows that have not heartbeated for over 5 minutes every 60s.Peer discovery
There is no gossip / consensus; in distributed mode each role queries the
cluster_nodes table directly and filters live peers by role membership (a node matches if its role set contains the requested role). standalone mode skips the entire discovery flow, returning only itself.Placement algorithm
Router selecting an Ingester: applies consistent hashing over
org_id|stream_name then mods, landing deterministically on a particular Ingester. Router selecting a Querier: naive round-robin (now_ns % peer_count), not full consistent hashing. Distributed query sharding: a hash over object_key then mods, fanning out to querier peers.Distributed scan RPC
The coordinator encodes a scan request (org/stream/sql/file_metas/time_range) into a ticket and sends it to peers over the internal scan RPC; the peer reads columnar files, registers an in-memory table, runs the shard SQL, and returns a result stream. This RPC shares the same port (default 5082) with the gRPC node service and ingest service.
Local scan-RPC calls within the cluster are unauthenticated; only federated / remote-cluster calls use an optional Bearer token.
cluster_nodes table schema: node_id VARCHAR(64) PK, role VARCHAR(128) (the comma-joined role set), advertise_addr VARCHAR(255), started_at_micros BIGINT, last_heartbeat_at_micros BIGINT. list_role scans alive rows and matches membership in code, so role lookups don’t depend on the column being indexed.
External dependencies and ports
PostgreSQL
The metadata database: FileMeta, streams, rules, incidents, users, orgs, audit, quotas, certificates,
cluster_nodes, search_jobs, remote_clusters, and more. [store.meta]: backend (default sqlite; set to postgres for production), dsn, max_connections (default 16). Migrations are embedded at compile time.Object store
Columnar files + search-index sidecars.
[store.object].backend: local (default, root=./data/objects) / s3 (including MinIO, R2, Alibaba Cloud OSS, overridden via endpoint) / azure / gcs. Credential precedence: env vars > credential file > inline TOML.No external cache / consensus
Only in-process LRU+TTL caches, the rate limiter, and the async runtime. No Redis / Memcached, no external consensus component.
Rendering dependency (optional)
PNG/PDF rendering for scheduled reports requires headless Chromium and is feature-gated; OSS builds always fall back to an SVG placeholder. The image’s runtime layer already ships with chromium.
Listening ports
| Port | Protocol / service | Config key | Notes |
|---|---|---|---|
| 5080 | HTTP | [http].bind (default 0.0.0.0), [http].port | /api/v1/*, /metrics, /healthz, /readyz, /.well-known/acme-challenge/, etc. |
| 5082 | gRPC | [grpc].bind, [grpc].port, max_message_size_mb (default 32MB) | Ingest + scan RPC + cluster heartbeat, three services sharing one port |
| 80 | Plaintext HTTP in TLS mode | [http.tls].plain_port | Health checks + ACME HTTP-01 challenge + 301 redirect to HTTPS |
| 443 | HTTPS in TLS mode (SNI) | [http.tls].port | Full routing; enabled at runtime by [http.tls].enabled (compiled into every build) |
/metrics (GET, Prometheus text 0.0.4) is controlled by [telemetry].metrics_enabled (default true) and exposes metrics such as cache hits / invalidations, object_store_*, wal_*, file_meta_dump_*, tantivy_pruned_files_total, alert_rule_eval_timeout_total.Health / readiness probe semantics
| Probe | Path | 200 condition | 503 condition | Purpose |
|---|---|---|---|---|
| Liveness | GET /healthz | WAL replay complete (Ingester-relevant only; other roles bypass it) and the object store is not degraded | replay in progress or object store connection lost | Process liveness and dependency health |
| Readiness | GET /readyz | replay_done = true (even if the object store is degraded) | replay still in progress | Lets K8s keep admitting read traffic while the write path is degraded |
Object store probe at startup (blocking)
startup_ping() synchronously performs a PUT→GET→DELETE (128 bytes) against _health/{uuid}.probe. On failure, the process does not start.Background periodic probe
The same round-trip runs every
[store.object].health_probe_interval_secs (default 30s); only 3 consecutive failures set object_store_degraded=true, and a success resets the counter to zero.Authentication and config overrides
- The JWT secret is auto-bootstrapped by the database on first startup; the old
jwt_secretTOML field is deprecated (parsing is retained only for backward compatibility with old configs). To pin a fixed secret, use the environment variableMS_AUTH_JWT_SECRET_OVERRIDE. - API tokens take the form
ms_<prefix>_<secret>, with the secret stored as an argon2id hash. - At startup, fields such as
store.meta.dsn,wal.dir,http.port,grpc.port,node.idare treated as immutable; changing them at runtime triggers a warning.
Deployment topologies
All topologies share the same image (e.g.molesignal:dev), distinguished by MS_NODE.ROLES. The web frontend is a separate nginx image (e.g. molesignal-web:dev).
- Single-node sandbox
- Docker Compose
- Kubernetes
A single process with all roles combined; the fastest way to get started.
The local backend (
store.object.backend=local) is fine for development; use an object store backend for production.Scaling and high availability
Router — scales horizontally
Stateless; scale replicas with entry traffic. The rate limiter is in-process and ephemeral: under multiple replicas each replica counts independently, so the effective org QPS ceiling is roughly
configured value × replica count. When needed, move rate limiting up to a unified gateway or lower the per-replica threshold.Querier — scales horizontally
Stateless. A querier process serves the scan RPC over gRPC; distributed scans only trigger with ≥2 querier peers, otherwise the coordinator falls back to the in-process engine with no network hop. The bottleneck is columnar-file reads + query-engine memory.
Ingester — stateful
Use a StatefulSet + a per-replica PVC for a persistent WAL. The Router places via consistent hashing over
org|stream; scaling changes the mod result and causes rebalancing, so ensure the WAL is flushed before scaling down. Leave enough of a replay window for the readiness probe (the manifest sets a 10s initial delay).Compactor / AlertManager — single instance
Both are recommended at a single replica: multiple Compactor instances would produce merge conflicts (a lease-table lock is planned for the future), and AlertManager’s evaluation / incident state is in-process. Reliability comes from fast restarts rather than multiple replicas.
wal_append_lock_wait_seconds, wal_append_inflight, wal_fsync_errors_total, file_meta_dump_*; for the object store, object_store_operations_total, object_store_errors_total, object_store_op_duration_seconds, object_store_probe_*; for queries, the cache hit rate cache_* and tantivy_pruned_files_total; for alerts, alert_rule_eval_timeout_total. Combine /healthz, /readyz, and the cluster_nodes table to observe member liveness.
Minimal production checklist / validation checklist
External dependencies ready
Postgres reachable,
[store.meta].backend = "postgres" with a correct dsn; choose an object store backend of s3/azure/gcs (not local), with credentials injected in the precedence order “env vars > credential file > inline”.Roles and interconnect addresses
Each process has an explicit
MS_NODE.ROLES; set MS_CLUSTER.ADVERTISE_ADDR to a peer-reachable host:5082 (K8s uses $(POD_IP):5082). A process may carry several roles at once (e.g. ["ingester","querier"]); it will start the de-duplicated server set and register under all of them.Ingester persistence
Run the Ingester as a StatefulSet + PVC mounting the WAL directory; set
[wal].flush_strategy / sync_level according to durability requirements; leave enough replay delay for the readiness probe.Single-instance roles
Keep Compactor and AlertManager at 1 replica each. Confirm
[compactor].interval_secs, retention_days, and [storage.file_meta_dump].enabled match expectations.Entry point and rate limiting
Put an LB in front of the Router; set
[router.rate_limit].ingest_qps / query_qps per org; note that rate limiting is approximate under multiple replicas. Disable buffering for /api/v1/query/stream on the Ingress.Observability and probes
Wire
/metrics into Prometheus; point K8s probes at /api/v1/healthz (liveness) and /readyz (readiness). Confirm the object store startup probe passes (otherwise the process will not start).Security and license
Set
MS_AUTH_JWT_SECRET_OVERRIDE when you need a fixed JWT secret; otherwise it is auto-bootstrapped by the DB. Inject a cipher key (do not use the dev all-zeros value in production). Federated query requires the federated_search license feature. TLS + automatic certificate issuance/renewal are available in every build (no feature flag) and switch on at runtime via [http.tls].enabled.