1. The Problem

In early 2022, we were running our entire observability stack on Elastic Cloud. It was the obvious choice at the time — managed, hands-off, no operational burden. Except it wasn’t hands-off at all.

The cluster was regularly falling over. OOM kills on data nodes, indexing rejections during peak load, CPU spiking to saturation with no clear root cause. We were opening support tickets with Elastic, waiting days for responses, and getting back generic JVM tuning suggestions that changed nothing. Meanwhile the bill sat at $25,000 per month.

At 60,000 events per second ingestion, we needed a platform that could actually hold that load. Elastic Cloud wasn’t it — and their support couldn’t tell us why.

I’d self-hosted Elasticsearch clusters before, earlier in my career. I knew the operational surface area, and I knew that with Kubernetes we could build something more predictable, more tunable, and dramatically cheaper. I proposed the migration internally. The business case was straightforward: $25,000/month on Elastic Cloud versus an estimated $5,000/month self-hosted on OKE. That $20,000 monthly delta adds up to $240,000 a year back in the budget. We got the green light without friction.
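For anyone who wants to rerun the business case with their own numbers, here is the arithmetic as a small sketch. The dollar figures are the ones quoted above; the variable names are just for illustration.

```python
# Monthly figures quoted in the post (USD).
ELASTIC_CLOUD_MONTHLY = 25_000
SELF_HOSTED_MONTHLY = 5_000  # an estimate at proposal time

monthly_saving = ELASTIC_CLOUD_MONTHLY - SELF_HOSTED_MONTHLY
annual_saving = monthly_saving * 12  # a simple sum over 12 months, no compounding
saving_ratio = monthly_saving / ELASTIC_CLOUD_MONTHLY

print(f"Monthly saving: ${monthly_saving:,}")
print(f"Annual saving:  ${annual_saving:,}")
print(f"Cost reduction: {saving_ratio:.0%}")
```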

This post covers how we designed, migrated, and stabilised a 22-node Elasticsearch cluster on Oracle Kubernetes Engine (OKE), processing 60,000 events per second, storing 10.4TB across 7.9 billion documents — for $5,000 a month.


2. Architecture Design

The core design goal was simple: a cluster that could absorb 60,000 events per second without the instability we’d seen on Elastic Cloud, while keeping operational complexity low enough for a small SRE team to own it.

Why OKE

We were already running our broader platform on Oracle Cloud Infrastructure, so OKE was the natural fit. One option we explicitly ruled out early was spreading the cluster across multiple Kubernetes clusters for redundancy. The latency overhead of cross-cluster replication for a single observability stack wasn’t worth it. We wanted one cluster, well-designed, rather than distributed complexity we’d have to operate and debug at 2am.

Node Roles and Sizing

The cluster runs 22 nodes across three distinct roles:

  • 3 dedicated master nodes — cluster state management only, no data, no indexing
  • 3 coordinating nodes — query routing and aggregation, no data storage
  • 16 data/ingest nodes — all indexing and storage

The decision to have dedicated coordinating nodes came from painful experience. Early in the migration we were routing all search and aggregation load directly to data nodes. At peak query load, data nodes were context-switching between serving searches and handling indexing — and losing on both. Introducing dedicated coordinating nodes to absorb query routing immediately stabilised data node CPU.

Every data node runs identical specs:

  • 12 vCPU
  • 64GB RAM
  • 1.5TB OCI block volume

Uniform sizing was a deliberate choice. It simplifies capacity planning, makes horizontal scaling predictable, and eliminates the class of problems where one oversized node becomes a hot spot. We arrived at these numbers through load testing — running progressively higher ingestion rates and watching where JVM heap pressure, CPU, and I/O started degrading. 12 CPU / 64GB / 1.5TB was the point where the cluster stopped complaining.
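To put the sizing in perspective, here is a rough back-of-the-envelope calculation of the aggregate data tier. It is only a sketch: it assumes ingestion is spread evenly across the 16 data nodes and that the 10.4TB figure quoted in this post is the total on-disk footprint. Neither assumption is guaranteed to hold exactly in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataNodeSpec:
    vcpu: int
    ram_gb: int
    disk_tb: float


DATA_NODES = 16
SPEC = DataNodeSpec(vcpu=12, ram_gb=64, disk_tb=1.5)
INGEST_EVENTS_PER_SEC = 60_000
STORED_TB = 10.4  # total data quoted in the post; assumed to be on-disk footprint

total_vcpu = DATA_NODES * SPEC.vcpu
total_ram_gb = DATA_NODES * SPEC.ram_gb
total_disk_tb = DATA_NODES * SPEC.disk_tb

# Assumes a perfectly even spread of ingestion across data nodes.
events_per_node = INGEST_EVENTS_PER_SEC / DATA_NODES
disk_utilisation = STORED_TB / total_disk_tb

print(f"Data tier: {total_vcpu} vCPU, {total_ram_gb}GB RAM, {total_disk_tb}TB disk")
print(f"Events/sec per data node: {events_per_node:.0f}")
print(f"Disk utilisation: {disk_utilisation:.0%}")
```

Uniform node specs are what make this kind of estimate a one-line multiplication rather than a per-node spreadsheet.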

ECK Operator

We used the Elastic Cloud on Kubernetes (ECK) operator to manage the cluster lifecycle. ECK handles rolling upgrades, certificate management, and StatefulSet configuration in a way that would be painful to replicate with raw manifests. For a self-hosted cluster at this scale, it’s the right abstraction — you get the control of self-hosting without reinventing cluster orchestration.

Storage: OCI Block Volumes

We chose OCI block volumes over local NVMe SSDs. Local NVMe gives better raw I/O performance, but it comes with significant operational overhead — manual attachment, node affinity constraints, and data loss risk if a node is terminated. Block volumes are network-attached, independently managed, and trivially reattachable if a pod reschedules. For an observability stack where operational simplicity matters more than absolute I/O throughput, it was the right tradeoff.

We set vpusPerGB: 20. In OCI’s published performance levels, 20 VPUs/GB is the Higher Performance tier (Balanced is 10). Ultra High Performance tiers (30 VPUs/GB and above) exist, but testing showed 20 was sufficient for our indexing and search workload. The cost delta wasn’t justified.


3. SSO on OSS Kibana — Vouch + OKTA

Elasticsearch and Kibana have native SSO support — but only on the paid Elastic Security license. We were running the OSS distribution. For an internal observability platform used by engineers across the organisation, asking everyone to manage a separate Kibana password was not acceptable. We needed SSO, and we needed to build it ourselves.

The solution was Vouch Proxy — an open source authentication proxy that sits in front of a web application and delegates identity verification to an OIDC provider. We paired it with OKTA as the identity provider.

How It Works

The flow is straightforward:

  1. Engineer hits the Kibana URL
  2. Nginx intercepts the request and checks for a valid Vouch session cookie
  3. If no valid cookie exists, Nginx redirects to Vouch
  4. Vouch redirects to OKTA for authentication
  5. Engineer logs in via OKTA (SSO, MFA, the full enterprise flow)
  6. OKTA redirects back to Vouch with an authorization code
  7. Vouch exchanges the code for tokens, validates them, sets a session cookie, and redirects back to Kibana
  8. Nginx sees the valid cookie and proxies the request through

From the engineer’s perspective: they hit the Kibana URL, land on the OKTA login page, authenticate once, and are in. No separate Kibana credentials.
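The gatekeeping decision in steps 2, 3 and 8 boils down to a single branch. The sketch below models it in Python purely for illustration; it is not how Nginx or Vouch are implemented. The cookie name and the in-memory set of valid sessions are assumptions standing in for Vouch's real token validation.

```python
from enum import Enum


class Action(Enum):
    PROXY_TO_KIBANA = "proxy_to_kibana"
    REDIRECT_TO_LOGIN = "redirect_to_login"


def decide(
    cookies: dict[str, str],
    valid_sessions: set[str],
    cookie_name: str = "VouchCookie",  # assumed name, adjust to your config
) -> Action:
    """Return what the proxy should do with an incoming request.

    `valid_sessions` is a stand-in for real token validation.
    """
    token = cookies.get(cookie_name)
    if token is not None and token in valid_sessions:
        return Action.PROXY_TO_KIBANA
    # Missing or invalid cookie: start the login redirect chain.
    return Action.REDIRECT_TO_LOGIN


sessions = {"abc123"}
print(decide({"VouchCookie": "abc123"}, sessions))
print(decide({}, sessions))
```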

The Limitation: Kibana RBAC

This is the honest part. Vouch + OKTA solves authentication — verifying who you are. It does not solve Kibana’s native role-based access control.

If you want to enforce Kibana roles — admin, viewer, editor — those roles live inside Elasticsearch’s security model, which requires the paid license to integrate with an external identity provider. On OSS, role assignment is still tied to internal Kibana users. In practice this meant engineers went through OKTA SSO to get into Kibana and landed on read-only dashboard access by default. If they needed elevated access, there was a second authentication prompt with an internal Kibana username and password.

For our use case this was an acceptable tradeoff. Most engineers needed read-only dashboard access, which the default role covered. Elevated access was limited to a small group comfortable with the two-step flow.

If you need full RBAC via OKTA groups mapping to Kibana roles, you need the Elastic Security license. Vouch gets you SSO on OSS — it does not get you the full identity integration.

The setup is still running in production today.


4. The Migration

The migration had one hard constraint: zero data loss. We had 7 days of retention on the observability stack, which turned out to be the natural mechanism for a clean cutover.

Dual-Write via Logstash

Rather than a snapshot-and-restore or a big-bang cutover, we used Logstash to dual-write — sending every incoming event to both Elastic Cloud and the new self-hosted OKE cluster simultaneously. This gave us:

  • A live parallel target to validate against
  • No risk of data loss during the transition period
  • A natural 7-day window to verify the new cluster was stable before cutting off the old one

The Logstash pipeline configuration was straightforward — a single input feeding two Elasticsearch outputs, one pointing at Elastic Cloud, one at the OKE cluster. We monitored indexing rates, shard health, and query results on both sides to confirm parity.
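The pattern itself is simple enough to sketch in a few lines. The Python below is not our Logstash pipeline; it is a minimal illustration of fan-out plus a parity check, with hypothetical function names, index names, and a hypothetical 0.1% tolerance on per-index document counts.

```python
from typing import Callable, Iterable

Event = dict
Sink = Callable[[Event], None]


def dual_write(events: Iterable[Event], sinks: list[Sink]) -> int:
    """Send every event to every sink; return the number of events seen."""
    count = 0
    for event in events:
        for sink in sinks:
            sink(event)
        count += 1
    return count


def parity_gaps(
    old_counts: dict[str, int],
    new_counts: dict[str, int],
    tolerance: float = 0.001,  # hypothetical: allow 0.1% drift per index
) -> dict[str, tuple[int, int]]:
    """Return indices whose doc counts differ by more than the tolerance."""
    gaps: dict[str, tuple[int, int]] = {}
    for index in sorted(set(old_counts) | set(new_counts)):
        old = old_counts.get(index, 0)
        new = new_counts.get(index, 0)
        if abs(old - new) > max(old, new) * tolerance:
            gaps[index] = (old, new)
    return gaps


old_cluster: list[Event] = []
new_cluster: list[Event] = []
sent = dual_write(({"id": i} for i in range(5)), [old_cluster.append, new_cluster.append])
print(sent, len(old_cluster), len(new_cluster))
print(parity_gaps({"logs-a": 1000, "logs-b": 1000}, {"logs-a": 1000, "logs-b": 990}))
```

In a real check the per-index counts would come from each cluster's own stats rather than hand-written dictionaries.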

Validation Period

We ran both clusters in parallel for the full 7-day retention window, watching for:

  • Indexing rejection rates
  • JVM heap pressure under peak load
  • Query latency on the coordinating nodes
  • Shard allocation health

Once we were satisfied the new cluster was stable under real production load, we updated the Logstash output to point exclusively at OKE and decommissioned the Elastic Cloud cluster.

Timeline

The full migration took 2 months end to end:

  • Month 1 — OKE cluster setup, ECK operator configuration, storage class tuning, initial load testing with synthetic data
  • Month 2 — Performance testing under real workload patterns, dual-write validation, cutover, decommission

The 2-month timeline wasn’t padding. Performance testing under realistic load surfaced tuning requirements we hadn’t anticipated. The coordinating node gap, the storage class configuration, and most of the Elasticsearch settings covered in the next section all came out of this phase.


5. Stability at Scale — The Hard Tuning

Getting the cluster running was straightforward. Getting it stable at 60,000 events per second took the entire second month. Here is every tuning decision that mattered.

vm.max_map_count

initContainers:
  - name: sysctl
    securityContext:
      privileged: true
    command: ["sh", "-c", "sysctl -w vm.max_map_count=262144"]

Elasticsearch uses memory-mapped files for Lucene index segments. The Linux default of 65530 is too low for any serious workload. This is not optional — it is the first thing to check if a new node is behaving erratically on Kubernetes.
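A quick way to confirm the setting actually took effect on a node is to read it back from /proc. The helper below is a small sketch for that check; the function names are made up for this post, and it only reads the value, it does not change it.

```python
from pathlib import Path

REQUIRED_MAX_MAP_COUNT = 262_144
PROC_PATH = Path("/proc/sys/vm/max_map_count")


def max_map_count_ok(raw_value: str, required: int = REQUIRED_MAX_MAP_COUNT) -> bool:
    """Return True if the kernel value meets the required minimum."""
    return int(raw_value.strip()) >= required


def check_host(path: Path = PROC_PATH) -> str:
    """Describe the state of vm.max_map_count on this host."""
    if not path.exists():
        return f"{path} not found (not a Linux host?)"
    raw = path.read_text()
    status = "OK" if max_map_count_ok(raw) else "TOO LOW"
    return f"vm.max_map_count={raw.strip()} ({status})"


print(check_host())
```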

Dedicated Coordinating Nodes

Without them, data nodes context-switch between indexing and serving search aggregations — and lose on both. Introducing 3 dedicated coordinating nodes separated these concerns cleanly. Data node CPU dropped immediately, indexing throughput stabilised. If you are running meaningful search alongside high-throughput indexing, coordinating nodes are not optional.

Storage Class

vpusPerGB: "20"

At 20 VPUs/GB this is the level OCI labels Higher Performance (Balanced is 10). We tested higher performance tiers, and the cost delta was not justified for our workload.

Key Elasticsearch Settings

cluster.routing.allocation.awareness.attributes: k8s_node_name,zone
node.attr.zone: region-1

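Tells Elasticsearch to spread the copies of each shard across different Kubernetes nodes and zones. Without this, two Elasticsearch pods scheduled onto the same Kubernetes node can end up holding the primary and the replica of the same shard, and a single node failure takes out both copies of your data.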

node.store.allow_mmap: true
indices.memory.index_buffer_size: 20%

allow_mmap enables memory-mapped Lucene access. index_buffer_size at 20% — double the default — reduces flush frequency and segment merge pressure at high ingestion rates.

transport.compress: indexing_data
transport.compression_scheme: lz4
http.compression: true

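LZ4 compression applies to indexing data on the inter-node transport layer; http.compression separately compresses HTTP responses (using gzip, for clients that request it). At 60k events/sec the CPU cost of compression is outweighed by the reduction in network I/O.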

search.max_buckets: 300000
search.max_async_search_response_size: 20mb

Default bucket limits are conservative. Deep aggregations across 7 days of observability data hit the default ceiling and produce silent partial results in dashboards. Raise these early.

bootstrap.memory_lock: false

Setting bootstrap.memory_lock to true is generally recommended for Elasticsearch, but on Kubernetes with proper resource limits set at the pod level, enabling it caused startup failures for us. Leave it disabled and let Kubernetes resource management handle memory.

ILM and Shard Sizing

Daily rollover, dynamic shard count from 1 to 12 based on index size, 350GB max at rollover. Keeps individual shards within the recommended range without over-sharding small indices. Steady state: 1504 indices, 6359 shards, zero unassigned, cluster green.
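As an illustration of how a 1 to 12 shard range can fall out of index size, here is a minimal sketch. The exact rule we use is not reproduced here; the 30GB target shard size is an assumption chosen for the example, and it happens to map the 350GB rollover ceiling onto 12 shards.

```python
import math

MIN_SHARDS = 1
MAX_SHARDS = 12
TARGET_SHARD_GB = 30  # assumption for this sketch, not a value from our config


def shard_count(expected_index_gb: float, target_shard_gb: float = TARGET_SHARD_GB) -> int:
    """Pick a primary shard count so each shard lands near the target size."""
    if expected_index_gb <= 0:
        return MIN_SHARDS
    wanted = math.ceil(expected_index_gb / target_shard_gb)
    # Clamp to the 1..12 range described above.
    return max(MIN_SHARDS, min(MAX_SHARDS, wanted))


for size_gb in (5, 100, 350):
    shards = shard_count(size_gb)
    print(f"{size_gb}GB index -> {shards} shards (~{size_gb / shards:.1f}GB each)")
```

Small indices stay at a single shard instead of inheriting a fixed high shard count, which is the over-sharding the range is meant to avoid.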


6. Where We Are Today

Two years after the migration, the cluster is running without drama.

Metric                           Value
Nodes                            22 (3 master, 3 coordinating, 16 data/ingest)
Total data                       10.4TB
Documents                        7.9 billion
Indices                          1,504
Total shards                     6,359
Unassigned shards                0
Peak indexing rate               2,000 docs/sec
Cluster status                   Green
Monthly cost                     $5,000
Annual saving vs Elastic Cloud   $240,000

Owning the stack means owning the problems too. But it also means when something goes wrong, you understand every layer well enough to fix it — rather than waiting on a support ticket that comes back with generic JVM tuning advice.


7. What I’d Do Differently

Start with coordinating nodes. We added them after seeing the problem. The symptoms were subtle enough that it took time to diagnose. Save yourself the investigation — put coordinating nodes in your initial design.

Load test longer than you think you need to. Two months felt like a lot. In hindsight it was the right call. Real production traffic surfaces tuning requirements that synthetic benchmarks miss entirely.

Plan for RBAC from day one. Vouch + OKTA solved authentication cleanly. But Kibana role-based access on OSS is a second problem that the setup does not solve. If fine-grained access control matters to your organisation, either budget for the Elastic Security license or design your user tiers before you go live — retrofitting it is painful.

OCI block volumes were the right call. Local NVMe would have given better raw I/O, but the operational overhead — manual attachment, node affinity constraints, data risk on node termination — would have added meaningful burden for a small SRE team. Simplicity won, and the performance was sufficient.