Cyber Security & Networking/Enterprise Network Operations Group

Building A High Throughput Network Telemetry Platform For Security And Operations Monitoring

Designing and building an open source ClickHouse cluster to capture network telemetry from thousands of routers at 500k rows per second, powering SOC and NOC analytics integrated with Splunk.

The Challenge

Our client operates a large enterprise network estate spanning thousands of routers, switches and edge devices distributed across multiple sites and geographies.

Network telemetry, flow data and device level events are fundamental inputs to both their Security Operations Centre (SOC) for threat detection and incident response, and their Network Operations Centre (NOC) for availability, capacity and performance monitoring.

Their incumbent tooling was struggling to keep up with the sustained ingest rates and retention requirements, whilst the cost of pushing the full firehose of telemetry into Splunk for long term retention had become prohibitive. They needed a high throughput, cost effective analytics tier that could sit alongside Splunk and act as the system of record for raw network telemetry.

Customer Pain Points

The customer was experiencing the following challenges prior to the engagement:

  • Sustained network telemetry ingest rates of around 500,000 rows per second from thousands of network devices, with limited headroom in their existing platform.
  • Prohibitive licensing and storage costs associated with retaining the full raw telemetry stream in Splunk for the required look back windows.
  • Slow and expensive historical search over network data, hampering SOC threat hunting and NOC root cause analysis workflows.
  • A need to keep SOC and NOC analysts working in their existing Splunk dashboards and tooling rather than forcing a switch to a new UI.
  • Strict requirements around resilience, data integrity and on premises or self managed deployment for security sensitive workloads.
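
To put the ingest and retention figures in context, the rough arithmetic can be sketched as below. Only the 500,000 rows per second comes from the list above; the bytes per row, compression ratio and retention period are illustrative assumptions, not figures from the engagement, and the function names are hypothetical.

```python
"""Back-of-envelope sizing for a sustained telemetry stream (illustrative only)."""

ROWS_PER_SECOND = 500_000        # sustained rate quoted in the case study
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

# Assumptions for illustration only; real values depend on the schema and data.
ASSUMED_RAW_BYTES_PER_ROW = 200
ASSUMED_COMPRESSION_RATIO = 10
ASSUMED_RETENTION_DAYS = 365


def rows_per_day(rows_per_second: int) -> int:
    """Number of rows produced in one day at a constant ingest rate."""
    return rows_per_second * SECONDS_PER_DAY


def stored_bytes(rows_per_second: int, raw_bytes_per_row: int,
                 compression_ratio: float, retention_days: int) -> float:
    """Approximate compressed bytes on disk for one copy of the data."""
    raw = rows_per_day(rows_per_second) * raw_bytes_per_row * retention_days
    return raw / compression_ratio


if __name__ == "__main__":
    daily = rows_per_day(ROWS_PER_SECOND)
    total = stored_bytes(ROWS_PER_SECOND, ASSUMED_RAW_BYTES_PER_ROW,
                         ASSUMED_COMPRESSION_RATIO, ASSUMED_RETENTION_DAYS)
    print(f"rows per day: {daily:,}")
    print(f"compressed bytes retained (single copy): {total:,.0f}")
```

At 500,000 rows per second the stream amounts to 43.2 billion rows per day, which is why per-gigabyte licensing and storage costs dominate the retention decision. Replication multiplies the single-copy figure by the number of replicas.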

Our Technical Approach

We took the following approach to this project:

  • Designed and built a self managed open source ClickHouse cluster sized for sustained ingest of 500k rows per second with significant burst headroom and multi year retention.
  • Implemented a streaming ingestion pipeline to collect flow records, device telemetry and event data from thousands of routers and deliver them into ClickHouse with low end to end latency.
  • Modelled the schema with appropriate sort keys, codecs and partitioning to deliver strong compression and fast time bounded queries over the network dataset.
  • Integrated ClickHouse with the customer's existing Splunk dashboards so that SOC and NOC analysts could continue to work in familiar tooling whilst querying the new high throughput backend.
  • Hardened the platform for production with replication, monitoring, alerting and runbooks aligned to the customer's security and operations standards.
  • Upskilled the in house networking, security and platform teams on ClickHouse operations, schema evolution and query patterns.
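
As an illustration of the ingestion step, the sketch below shows one common pattern for feeding ClickHouse at high row rates: buffering rows and writing them in large batches, flushed by size or by age. It is a minimal sketch and not the pipeline built for this engagement. The `BatchBuffer` class, its thresholds and the `flush` callback are all hypothetical; in a real pipeline the callback would wrap a ClickHouse client insert.

```python
import time
from typing import Callable, List, Tuple

Row = Tuple  # one telemetry record; the column layout is left to the caller


class BatchBuffer:
    """Collects rows and hands them to `flush` in batches (hypothetical helper).

    A batch is emitted when it reaches `max_rows`, or when its oldest row
    has waited `max_age_seconds`, whichever comes first.
    """

    def __init__(self, flush: Callable[[List[Row]], None],
                 max_rows: int = 100_000, max_age_seconds: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush_cb = flush
        self._max_rows = max_rows
        self._max_age = max_age_seconds
        self._clock = clock
        self._rows: List[Row] = []
        self._oldest = 0.0

    @property
    def pending(self) -> int:
        """Rows buffered but not yet flushed."""
        return len(self._rows)

    def add(self, row: Row) -> None:
        """Buffer one row, flushing if a size or age threshold is reached."""
        if not self._rows:
            self._oldest = self._clock()  # age is measured from the first row
        self._rows.append(row)
        self.tick()

    def tick(self) -> None:
        """Flush if thresholds are met; call periodically when input is idle."""
        if not self._rows:
            return
        too_big = len(self._rows) >= self._max_rows
        too_old = self._clock() - self._oldest >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Emit whatever is buffered as one batch."""
        if not self._rows:
            return
        batch, self._rows = self._rows, []  # hand over a list we no longer touch
        self._flush_cb(batch)


if __name__ == "__main__":
    # Stand-in sink that only reports batch sizes; a real one would insert
    # the batch into ClickHouse.
    buffer = BatchBuffer(flush=lambda b: print(f"flushed {len(b)} rows"),
                         max_rows=3)
    for i in range(7):
        buffer.add((i, "router-a", 1500))
    buffer.flush()  # drain the remainder on shutdown
```

The age limit bounds end to end latency when traffic is light, whilst the row limit keeps each insert large enough to avoid creating many small parts. Suitable thresholds depend on the workload and would need to be tuned against the real cluster.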

Outcomes

Key outcomes of the project included:

  • Production network telemetry platform capturing data from thousands of routers at a sustained rate of around 500,000 rows per second.
  • Substantial reduction in Splunk ingestion and retention costs by offloading raw network telemetry to ClickHouse whilst preserving the analyst experience.
  • Significantly faster historical search and analytics over network data, accelerating SOC threat hunting and NOC incident investigation workflows.
  • A robust, fully open source ClickHouse deployment with no proprietary lock in, owned and operated by the customer's own teams.
  • Upskilled SOC, NOC and platform engineers with the ClickHouse expertise needed to evolve the platform as new telemetry sources and use cases come online.
CASE_ID: network-device-capture