Clustering & Unsupervised Anomaly Detection at Neuroarcane

NeuroArcane
7 days ago

Unsupervised learning provides Neuroarcane with a complementary analytic dimension to the supervised and deep-learning systems described in our previous blogs. While regression, classification, and neural networks excel when labeled ground truth is available, a significant portion of global internet interference occurs without clear annotations: shutdowns unfold without declared timestamps, throttling escalates covertly, DNS manipulation is localized, and BGP anomalies propagate silently.

In these contexts, unsupervised learning, specifically clustering and anomaly detection, allows us to detect, segment, and characterize emerging interference patterns without relying on labeled training data. This is essential for Neuroarcane’s mission: anticipating irregular behavior in complex, multi-modal measurement streams.

In this blog, we examine how unsupervised models operate within Neuroarcane’s pipeline, demonstrate two concrete case studies using OONI-like and RIPE-Atlas-like datasets, and present the architectures we employ to isolate latent patterns of state-level interference. This blog complements our earlier posts on supervised learning and neural networks, extending the analytic stack into unlabeled territory.

Clustering

We use clustering to segment large-scale measurement tensors into coherent behavioral regimes. Rather than predicting a label, clustering organizes raw observations (DNS anomalies, TLS behavior, RTT surfaces, packet-loss fields, and HTTP blocking signals) into naturally occurring groups.

These clusters form the baseline manifold of “normal” vs. “perturbed” regimes. Deviations from these learned structures often correspond to emergent censorship campaigns, outages, traffic shaping, or coordinated filtering.

We construct a daily measurement vector with six features: dns_inconsistency, tls_failure_ratio, http_blocking_score, packet_loss, mean_rtt, and rtt_variance. Using K-Means, we group 120 days of data from an OONI-like synthetic dataset designed to mimic conditions observed during national protests or elections. For brevity, the example uses only the first four features.

import numpy as np

from sklearn.cluster import KMeans

import matplotlib.pyplot as plt

np.random.seed(42)

days = np.arange(0, 120)

dns = 0.02 + 0.01*np.sin(0.1*days)

tls = 0.03 + 0.01*np.cos(0.12*days)

http = 0.01 + 0.005*np.sin(0.15*days)

loss = 0.015 + 0.01*np.sin(0.1*days)

# interference after day 70

dns[70:] += 0.25

tls[75:] += 0.30

http[72:] += 0.40

loss[78:] += 0.20

X = np.vstack([dns, tls, http, loss]).T

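# fixed random_state and explicit n_init keep the fit reproducible

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

labels = kmeans.labels_

# K-Means label ids are arbitrary; relabel so cluster 0 is the regime containing day 0

if labels[0] == 1:
    labels = 1 - labels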

plt.figure(figsize=(10,4))

plt.scatter(days, dns, c=labels, cmap="coolwarm", label="DNS")

plt.title("Cluster Assignment Over Time")

plt.xlabel("Day")

plt.ylabel("DNS inconsistency")

plt.show()

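The clustering algorithm cleanly separates the dataset into two regimes: Cluster 0, which captures normal internet behavior, and Cluster 1, which captures the interference regime beginning near day 70. K-Means numbers its clusters arbitrarily; here Cluster 0 denotes the group containing the early, undisturbed days.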

This matches our synthetic ground truth and replicates behavior observed in OONI Web-Connectivity and DNS-Consistency logs during real-world disruptions.

Where supervised learning requires labeled events, clustering autonomously infers structure, enabling Neuroarcane to detect emerging interference without prior examples, isolate transitional regimes (throttling ramp-ups), and quantitatively segment pre-event, during-event, and recovery phases. This baseline modeling is crucial when labels are incomplete, delayed, or censored themselves.
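The phase segmentation described above can be sketched as a run-length pass over the per-day cluster labels. This is a minimal illustration rather than Neuroarcane's production code: the helper name label_segments and the toy label sequence are assumptions made for the example.

```python
import numpy as np

def label_segments(labels):
    """Collapse a per-day label sequence into (start_day, end_day, label) runs."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return []
    # indices at which the label differs from the previous day
    change_points = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    starts = np.concatenate(([0], change_points))
    ends = np.concatenate((change_points - 1, [labels.size - 1]))
    return [(int(s), int(e), int(labels[s])) for s, e in zip(starts, ends)]

# toy sequence (assumed): 5 normal days, 3 perturbed days, 2 recovery days
toy_labels = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
segments = label_segments(toy_labels)
print(segments)
```

Applied to the labels from the K-Means example, the same helper would return one run per regime, which gives explicit start and end days for each phase.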

Anomaly Detection

While clustering partitions measurement space, anomaly detection identifies outlying patterns relative to learned baselines. Neuroarcane employs Isolation Forest, Local Outlier Factor, Gaussian Mixture Models, and autoencoder reconstruction error (a neural unsupervised method). These models operate on high-resolution time series from OONI active tests, RIPE Atlas probes, CAIDA Ark RTT traces, and IODA passive darknet traffic. Neural-network autoencoders extend the unsupervised approach by learning compressed latent representations of normal traffic and flagging deviations.
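As a rough illustration of the first method in that list, the sketch below fits scikit-learn's IsolationForest on a baseline and scores new observations. The baseline data, the three test rows, and the choice of n_estimators=200 are all assumptions made for the example; they are not Neuroarcane's actual features or settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# hypothetical baseline: 200 samples of 4 features clustered around small values
baseline = rng.normal(loc=0.02, scale=0.005, size=(200, 4))

# hypothetical new observations: two baseline-like rows and one strongly shifted row
new_obs = np.array([
    [0.020, 0.021, 0.019, 0.020],
    [0.023, 0.018, 0.022, 0.017],
    [0.300, 0.350, 0.400, 0.250],
])

forest = IsolationForest(n_estimators=200, random_state=0).fit(baseline)

# score_samples: lower means more anomalous; predict: -1 = outlier, +1 = inlier
scores = forest.score_samples(new_obs)
flags = forest.predict(new_obs)
print(scores, flags)
```

Fitting on a baseline window and scoring later data mirrors the train-on-normal pattern used for the autoencoder in this post, which makes the two detectors easy to compare on the same days.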

Detecting Subtle Throttling

Subtle throttling rarely triggers binary failures. Instead, it causes characteristic patterns: slight increases in RTT variance, minor drifts in packet retransmissions, and decreasing TLS success ratios. We train an autoencoder on the first 50 days of data, which we treat as “normal” behavior:

import tensorflow as tf

from tensorflow.keras import Sequential

from tensorflow.keras.layers import Dense

normal_X = X[:50] # assume first 50 days normal

auto = Sequential([

Dense(8, activation='relu', input_shape=(4,)),

Dense(3, activation='relu'),

Dense(8, activation='relu'),

Dense(4, activation='linear')

])

auto.compile(optimizer='adam', loss='mse')

auto.fit(normal_X, normal_X, epochs=400, verbose=0)

Then compute reconstruction error:

recon = auto.predict(X)

err = np.mean((recon - X)**2, axis=1)

Visualization

plt.figure(figsize=(10,4))

plt.plot(days, err, label="Reconstruction Error")

plt.axhline(np.mean(err[:50]) + 3*np.std(err[:50]),

color="red", linestyle="--",

label="Anomaly Threshold")

plt.title("Autoencoder-Based Anomaly Detection")

plt.legend()

plt.show()

The autoencoder fails to reconstruct interference-period samples because they diverge from the manifold of “normal” operation. This triggers anomaly alarms 10–20 days earlier than classical threshold-based metrics.
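The threshold rule used in the plot (baseline mean plus three standard deviations) can also be turned into an explicit alarm day. The sketch below is a minimal version under stated assumptions: the helper name first_alarm and the synthetic error trace are hypothetical and stand in for the autoencoder's err array.

```python
import numpy as np

def first_alarm(err, n_baseline, k=3.0):
    """Return (threshold, first index whose error exceeds it, or None)."""
    err = np.asarray(err, dtype=float)
    base = err[:n_baseline]
    threshold = base.mean() + k * base.std()
    above = np.flatnonzero(err > threshold)
    first = int(above[0]) if above.size else None
    return threshold, first

# hypothetical error trace: small periodic baseline, then a jump at index 70
t = np.arange(120)
toy_err = 1e-4 + 1e-5 * np.sin(0.3 * t)
toy_err[70:] += 5e-3

threshold, alarm_day = first_alarm(toy_err, n_baseline=50)
print(threshold, alarm_day)
```

Calling first_alarm(err, n_baseline=50) on the reconstruction error computed earlier would apply the same rule as the dashed line in the figure.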

This behavior closely mirrors Neuroarcane’s real-world deployments, where throttling buildup produces low-amplitude but consistent anomalies, DNS poisoning campaigns begin with sporadic inconsistencies, TLS interference emerges before overt blocking, and shutdown preparation correlates with shifts in IODA darknet visibility. By detecting deviations from multivariate baselines, unsupervised learning systems act as the earliest layer in our early-warning architecture.
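Low-amplitude signals such as a gradual increase in RTT variance are usually easier to see as rolling statistics than as raw samples. The following sketch computes a trailing-window variance with NumPy; the helper name rolling_variance, the 20-sample window, and the synthetic RTT series are assumptions chosen for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_variance(series, window):
    """Variance of each trailing window; the first window-1 entries are NaN."""
    series = np.asarray(series, dtype=float)
    out = np.full(series.shape, np.nan)
    if series.size >= window:
        out[window - 1:] = sliding_window_view(series, window).var(axis=1)
    return out

rng = np.random.default_rng(1)

# hypothetical RTT samples in ms: the mean stays at 50 ms, jitter triples after index 100
rtt = np.concatenate([
    rng.normal(50.0, 1.0, size=100),
    rng.normal(50.0, 3.0, size=100),
])

rtt_var = rolling_variance(rtt, window=20)
print(rtt_var[99], rtt_var[199])
```

The mean RTT never changes in this toy series, so a threshold on the mean would stay silent while the rolling variance rises, which is the kind of precursor a multivariate detector can pick up.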

Neuroarcane’s Pipeline

Across our interference-analysis ecosystem, unsupervised methods serve three primary roles. First, clustering methods build the baseline manifold of “normal” internet behavior for each geography and network; shifts from these baselines signal evolving interference campaigns. Second, autoencoders, Gaussian mixtures, and Isolation Forest detect subtle, nonlinear precursors to interference, which is vital during political events where overt interference may be delayed until critical moments. Third, unsupervised learning triages unlabeled data: OONI, RIPE Atlas, and CAIDA datasets are abundant but not always labeled, so narrowing the space of interest lets supervised and neural models focus on the most anomalous windows.
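One simple way to picture the third role is to rank fixed-length time windows by their mean anomaly score and pass only the top few downstream. The sketch below assumes non-overlapping windows and a hypothetical helper named top_windows; the scores are toy values, not real measurements.

```python
import numpy as np

def top_windows(scores, window, k):
    """Rank non-overlapping windows by mean anomaly score, highest first."""
    scores = np.asarray(scores, dtype=float)
    n_windows = scores.size // window  # a trailing partial window is dropped
    means = scores[:n_windows * window].reshape(n_windows, window).mean(axis=1)
    order = np.argsort(means)[::-1][:k]
    return [(int(i * window), int((i + 1) * window - 1), float(means[i])) for i in order]

# hypothetical per-day anomaly scores: quiet except for days 14-27
toy_scores = np.zeros(28)
toy_scores[14:21] = 1.0
toy_scores[21:28] = 0.5

ranked = top_windows(toy_scores, window=7, k=2)
print(ranked)
```

Each tuple gives the first day, last day, and mean score of a window, so a downstream supervised model could be run on just those day ranges.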

Unsupervised learning expands Neuroarcane’s analytic capabilities far beyond what labeled models can achieve alone. Clustering organizes multi-modal measurement tensors into structured baselines, while anomaly-detection systems reveal subtle, nonlinear deviations indicative of interference.

Together with supervised models and deep neural architectures, unsupervised learning forms a foundational component of Neuroarcane’s early-warning system for global internet interference. By leveraging OONI, IODA, RIPE Atlas, and CAIDA Ark signals, both labeled and unlabeled, we build models capable of detecting censorship campaigns early, segmenting complex interference patterns, and forecasting disruptive events with increasing accuracy.