
Federated model aggregation

Combine the per-machine XGBoost tool-wear models with different aggregation strategies and benchmark the result against the local models and the centralized reference, all from prepared artifacts, with no raw data leaving the shop floor.

Runs in your browser
Every aggregation and metric on this page is recomputed live in your browser from the exported artifacts: the local prediction vectors, model metadata and costs. No data is sent anywhere; the benchmark runs entirely on your device.

Models in the pool

The four models in this pool are built from the public Stiehl et al. milling tool-wear dataset, split by machine tool into three federated sites. Each site trains its own XGBoost regressor on local data only, while the centralized reference trains on the pooled data; all four then predict flank wear V_B in micrometers on the same 4,194-sample held-out test set. Note that these are the actual machines from the dataset, with models trained on the processed data, not the illustrative machines from the overview page.

Machine 1

Local site model

Training rows: 1,218
Trees: 390
Model size: 377.7 KB
Own-machine MAE: 8.82 µm

Machine 2

Local site model

Training rows: 1,856
Trees: 3,000
Model size: 4.51 MB
Own-machine MAE: 2.82 µm

Machine 3

Local site model

Training rows: 1,218
Trees: 1,026
Model size: 1.47 MB
Own-machine MAE: 11.52 µm

Centralized

Reference baseline

Training rows: 4,292
Trees: 1,197
Model size: 2.16 MB
Global MAE: 5.98 µm

The centralized model pools all raw data. It is shown only as a non-sovereign reference for what dropping the data boundary would buy, not as a federated option.

Leakage-safe data split

Every model and every aggregation is judged on the same grouped split: whole tools are assigned to train, validation or test per client, so no window from one tool ever appears in two splits. A simple random split over windows would not give that guarantee.

Client        Train   Validation   Test    Grouping
Machine 1     1,218   1,276        1,218   by tool · 3 groups
Machine 2     1,856   1,856        1,856   by tool · 3 groups
Machine 3     1,218   1,218        1,120   by tool · 3 groups
All clients   4,292   4,350        4,194   seed 42
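A grouped split can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name, the tool labels and the round-robin assignment of tools to splits are assumptions; only the seed of 42 and the "whole tools per split" rule come from the page.

```python
import random

def grouped_split(tool_ids, seed=42):
    """Assign every tool, with all of its windows, to exactly one split."""
    tools = sorted(set(tool_ids))
    rng = random.Random(seed)
    rng.shuffle(tools)
    names = ["train", "validation", "test"]
    # Assumption: tools are dealt round-robin, so three tools give one per split.
    tool_to_split = {tool: names[i % 3] for i, tool in enumerate(tools)}
    return [tool_to_split[t] for t in tool_ids]

# Hypothetical tool label for each signal window of one client.
window_tools = ["T1"] * 4 + ["T2"] * 3 + ["T3"] * 5
window_splits = grouped_split(window_tools)
```

Because the split is decided per tool and only then copied to the windows, two windows of the same tool can never end up on opposite sides of the train/test boundary.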

Aggregation lab

Choose which site models take part, pick an aggregation strategy and set the wear threshold. Each model predicts the flank-wear land width V_B in micrometers; the benchmark compares the aggregated result against three references: owner-local, best-local and centralized.

1. Select participating models

This method was trained by the pipeline across all three sites. Its result is fixed, so the model selector does not apply.

2. Choose an aggregation method

Baselines & references

Weighted prediction averaging

Learned on held-out data

Robust pooling

Safety-oriented & decision-level

Tree-ensemble aggregation

Interactive policy weighting

Input-conditional routing

Federated training (precomputed)

3. Wear-decision threshold

Default: 120 µm. Tools at or above the threshold count as worn. The threshold drives accuracy, recall and the late-wear metrics.


Aggregation methods

The benchmark covers families of sovereign aggregation that range from simple averaging to robust pooling, safety-first decisions and interactive policy weighting.

Baselines & references

Owner-local

Local model artifact

Every sample is scored by its own machine's local model. The sovereignty-respecting floor: each site keeps its own model and shares nothing.


Best-validation selection

Selected local model

Picks the single local model with the lowest validation error and uses it for everyone. Tests whether aggregating beats simply choosing the strongest client.


Weighted prediction averaging

Uniform ensemble

Virtual prediction ensemble

Arithmetic mean of the selected local predictions. The simplest deployable prediction-level ensemble and a prediction-level counterpart to FedAvg's model pooling.


Sample-weighted ensemble

Virtual prediction ensemble

Weights each model by its number of local training samples, on the common assumption that clients with more data should contribute more.
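As a worked example, the sample weights follow directly from the training-row counts shown in the model cards. The sketch below is illustrative only; the function names and the single-sample predictions are hypothetical.

```python
# Training rows per site, taken from the model cards on this page.
train_rows = {"machine_1": 1218, "machine_2": 1856, "machine_3": 1218}

def sample_weights(rows):
    """Weight each site in proportion to its number of training rows."""
    total = sum(rows.values())
    return {site: n / total for site, n in rows.items()}

def weighted_prediction(preds, weights):
    """Blend one prediction per site (in µm) with the given weights."""
    return sum(weights[site] * preds[site] for site in preds)

weights = sample_weights(train_rows)
# Hypothetical predictions for a single test sample.
sample_preds = {"machine_1": 100.0, "machine_2": 110.0, "machine_3": 90.0}
blended = weighted_prediction(sample_preds, weights)
```

With these counts Machine 2 receives 1,856 / 4,292, roughly 43 % of the weight, and the other two sites about 28 % each.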


Validation-weighted ensemble

Virtual prediction ensemble

Weights each model by inverse validation error, with the weights clamped so a single small or noisy client cannot dominate. Uses no test labels.
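One way to picture the clamped inverse-error weighting is the sketch below. The clamp bounds (0.1 and 0.6), the function name and the validation errors are assumptions for illustration; the page does not state the actual limits.

```python
def inverse_error_weights(val_mae, lo=0.1, hi=0.6):
    """Inverse-MAE weights, clamped to [lo, hi] and renormalized to sum to one."""
    raw = [1.0 / e for e in val_mae]
    total = sum(raw)
    normalized = [r / total for r in raw]
    # Assumed clamp: no model may fall below `lo` or rise above `hi` before
    # the final renormalization (which can shift the values slightly again).
    clamped = [min(max(w, lo), hi) for w in normalized]
    clamped_total = sum(clamped)
    return [c / clamped_total for c in clamped]

# Hypothetical validation MAEs in µm for three site models.
weights = inverse_error_weights([9.0, 3.0, 12.0])
```

Only validation errors enter the computation, which is what keeps the method free of test labels.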


Precision-weighted ensemble

Virtual prediction ensemble

Weights each model by the inverse variance of its validation errors, so the most consistent (least erratic) model contributes most. Distinct from the accuracy-based weighting: a model can be biased yet precise, or accurate on average yet erratic. Uses no test labels.
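The distinction between biased-yet-precise and accurate-yet-erratic can be made concrete with a small sketch. The error values and function name are hypothetical; the sketch assumes signed validation errors and the population variance.

```python
from statistics import pvariance

def precision_weights(val_errors):
    """val_errors: one list of signed validation errors per model.

    Each model is weighted by 1 / variance of its errors. A model with
    zero variance would need special handling, which is omitted here.
    """
    inv_var = [1.0 / pvariance(errors) for errors in val_errors]
    total = sum(inv_var)
    return [v / total for v in inv_var]

biased_but_steady = [5.0, 5.2, 4.8, 5.0]        # mean error 5 µm, tiny spread
unbiased_but_erratic = [-3.0, 3.0, -2.0, 2.0]   # mean error 0 µm, large spread
w = precision_weights([biased_but_steady, unbiased_but_erratic])
```

Here the steady model takes almost all the weight even though its mean error is larger, which is exactly how this weighting differs from the accuracy-based one.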


Learned on held-out data

Stacked ensemble

Virtual prediction ensemble

Learns each model's weight by least squares on the held-out validation split (non-negative, summing to one), then applies those weights to the test predictions. The data-driven counterpart of the heuristic weighted ensembles.


Contribution-weighted ensemble

Virtual prediction ensemble

Weights each model by its Shapley-value contribution to the ensemble's accuracy on the held-out validation split, computed exactly over every client coalition. The game-theoretic, fairness-aware counterpart of the heuristic weighted ensembles: a client that does not improve the ensemble earns little weight. Uses no test labels.
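With three clients there are only eight coalitions, so the Shapley values can be enumerated exactly. The sketch below is one possible reading: the value function (MAE reduction of a coalition's uniform blend against a constant mean predictor), the clipping of negative contributions and all data are assumptions, not the page's confirmed definition.

```python
from itertools import combinations
from math import factorial

def mae(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

def coalition_value(members, preds, target, baseline_mae):
    """Assumed value: MAE reduction of the coalition's uniform blend
    relative to a constant predictor. The empty coalition is worth 0."""
    if not members:
        return 0.0
    blend = [sum(preds[m][i] for m in members) / len(members)
             for i in range(len(target))]
    return baseline_mae - mae(blend, target)

def shapley_values(preds, target):
    n = len(preds)
    mean_t = sum(target) / len(target)
    baseline_mae = mae([mean_t] * len(target), target)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1 without client i
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                gain = (coalition_value(subset + (i,), preds, target, baseline_mae)
                        - coalition_value(subset, preds, target, baseline_mae))
                phi += weight * gain
        values.append(phi)
    return values

def shapley_weights(values):
    # Assumption: negative contributions are clipped to zero, then normalized.
    clipped = [max(v, 0.0) for v in values]
    total = sum(clipped)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [c / total for c in clipped]

# Hypothetical validation targets and per-model predictions (µm).
target = [100.0, 110.0, 120.0, 130.0]
preds = [[101.0, 109.0, 121.0, 129.0],
         [95.0, 105.0, 115.0, 125.0],
         [60.0, 150.0, 80.0, 170.0]]
phi = shapley_values(preds, target)
weights = shapley_weights(phi)
```

The third model makes every coalition worse, so its Shapley value is negative and, under the clipping assumption, it earns zero weight.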


Calibrated ensemble

Virtual prediction ensemble

Fits a scale and offset (a·ŷ+b) on the held-out validation split and applies it to the uniform ensemble, correcting the systematic under-prediction that otherwise suppresses late-wear recall.
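The affine fit is ordinary least squares in one variable, which has a closed form. The sketch below uses hypothetical validation values constructed so that the ensemble under-predicts at high wear; the function name is an assumption.

```python
def fit_affine(ensemble_val, target_val):
    """Least-squares scale a and offset b so that a * ŷ + b approximates the target."""
    n = len(target_val)
    mean_x = sum(ensemble_val) / n
    mean_y = sum(target_val) / n
    sxx = sum((x - mean_x) ** 2 for x in ensemble_val)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(ensemble_val, target_val))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Hypothetical validation data: the ensemble lags behind the true wear.
ensemble_val = [50.0, 70.0, 90.0, 110.0]
target_val = [56.0, 80.0, 104.0, 128.0]
a, b = fit_affine(ensemble_val, target_val)
calibrated = a * 116.0 + b  # an uncalibrated 116 µm prediction after rescaling
```

In this made-up case the fit is a = 1.2 and b = -4, so a 116 µm ensemble prediction is lifted to about 135 µm and crosses a 120 µm threshold.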


Isotonic calibration

Virtual prediction ensemble

Fits a free monotonic map from the uniform ensemble to the target on the validation split (pool-adjacent-violators), then applies it to the test ensemble. Captures curved miscalibration that the affine fit cannot.
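A compact pool-adjacent-violators sketch is shown below. It is a minimal version under simplifying assumptions: the fitted map is applied as a step function (no interpolation between validation points), and the function names and data are hypothetical.

```python
from bisect import bisect_right

def pava(y):
    """Pool-adjacent-violators on y (already ordered by x); returns a non-decreasing fit."""
    blocks = []  # each block holds [sum of values, count]
    for value in y:
        blocks.append([value, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit

def fit_isotonic(ensemble_val, target_val):
    order = sorted(range(len(ensemble_val)), key=lambda i: ensemble_val[i])
    xs = [ensemble_val[i] for i in order]
    fitted = pava([target_val[i] for i in order])
    return xs, fitted

def apply_isotonic(model, x_new):
    """Step-function lookup: use the fitted value of the nearest point at or below x_new."""
    xs, fitted = model
    idx = bisect_right(xs, x_new) - 1
    return fitted[max(idx, 0)]

model = fit_isotonic([10.0, 30.0, 20.0, 40.0], [1.0, 2.0, 3.0, 4.0])
```

The two middle points violate monotonicity (3 then 2 once sorted by the ensemble value), so they are pooled to their mean of 2.5.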


Per-client bias correction

Virtual prediction ensemble

Estimates each client model's systematic offset on its own validation rows and subtracts it before owner-local routing. Targets heterogeneous per-site bias rather than one global correction.


Conformal intervals

Virtual prediction ensemble

Keeps the uniform-ensemble point estimate but adds a split-conformal prediction interval from the validation residuals, with a distribution-free coverage target. The only method that quantifies uncertainty instead of returning a single value.
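The split-conformal step reduces to one quantile of the absolute validation residuals. The sketch below assumes a symmetric interval and a miscoverage level alpha of 0.1 (a 90 % target); the page does not state the level it uses, and the residuals are hypothetical.

```python
import math

def conformal_halfwidth(val_residuals, alpha=0.1):
    """Half-width q so that [ŷ - q, ŷ + q] targets 1 - alpha coverage."""
    scores = sorted(abs(r) for r in val_residuals)
    n = len(scores)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few validation points for this coverage level
    return scores[k - 1]

# Hypothetical validation residuals (target minus ensemble prediction, µm).
residuals = [-4.0, 2.0, -1.0, 3.0, 5.0, -2.0, 1.0, -3.0, 6.0, -7.0]
q = conformal_halfwidth(residuals, alpha=0.1)
interval = (100.0 - q, 100.0 + q)  # interval around a 100 µm point estimate
```

The point estimate itself is untouched; only the band around it is added, and its width is the same for every sample in this basic form.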


Robust pooling

Median ensemble

Virtual prediction ensemble

Takes the per-sample median across local predictions, so a single extreme or mis-calibrated model cannot pull the result. Represents the robust family (trimmed mean, winsorized mean, Huber).


Geometric-median ensemble

Virtual prediction ensemble

Finds the single consensus prediction vector that minimizes the total Euclidean distance to every model's prediction vector — the multivariate L1 median, found by Weiszfeld iteration. A model whose whole prediction vector drifts from the consensus is automatically down-weighted. This is the prediction-level analog of Robust Federated Averaging (RFA); unlike the per-sample median it yields one fixed set of client weights.
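The Weiszfeld iteration can be sketched in a few lines. This is an illustrative version, not the page's implementation: the iteration count, the distance floor `eps` and the prediction vectors are assumptions.

```python
import math

def geometric_median_weights(preds, iters=100, eps=1e-9):
    """Weiszfeld iteration over whole prediction vectors.

    Returns one fixed weight per model and the consensus vector.
    """
    n, m = len(preds), len(preds[0])
    weights = [1.0 / n] * n
    center = None
    for _ in range(iters):
        center = [sum(w * p[i] for w, p in zip(weights, preds)) for i in range(m)]
        # Re-weight each model by the inverse of its distance to the consensus;
        # eps guards against division by zero when a model sits on the center.
        inv = [1.0 / max(math.dist(p, center), eps) for p in preds]
        total = sum(inv)
        weights = [v / total for v in inv]
    return weights, center

# Hypothetical prediction vectors (µm): the third model drifts far from the others.
preds = [[100.0, 110.0, 120.0],
         [104.0, 108.0, 123.0],
         [60.0, 75.0, 80.0]]
weights, center = geometric_median_weights(preds)
```

Unlike the per-sample median, the result is a single weight per model that applies to every test sample.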


Safety-oriented & decision-level

Conservative (max)

Virtual prediction ensemble

Takes the highest local prediction per sample. Underestimating a worn tool is more costly than a small early-life error, so a safety-first policy leans high.


Majority threshold vote

Decision-level ensemble

Each model votes worn / not-worn at the threshold and the majority wins. Shows the difference between aggregating wear values and aggregating the replacement decision.
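The gap between aggregating values and aggregating decisions shows up in a tiny example. The predictions below are hypothetical; the 120 µm threshold is the page default.

```python
def majority_vote(preds_per_model, threshold=120.0):
    """Each model votes worn (prediction >= threshold); a strict majority wins."""
    n_models = len(preds_per_model)
    n_samples = len(preds_per_model[0])
    return [
        2 * sum(p[i] >= threshold for p in preds_per_model) > n_models
        for i in range(n_samples)
    ]

# Hypothetical predictions for one tool from three site models (µm).
preds = [[130.0], [125.0], [60.0]]
vote_says_worn = majority_vote(preds)[0]
mean_prediction = sum(p[0] for p in preds) / len(preds)
mean_says_worn = mean_prediction >= 120.0
```

Two of three models call the tool worn, so the vote flags it, while the mean of 105 µm stays below the threshold.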


Tree-ensemble aggregation

Federated tree-bagging committee

Prediction-equivalent committee

Treats each client's full XGBoost ensemble as one committee member. Reported as a prediction-equivalent committee; no native merged booster file is produced.


Interactive policy weighting

Manual policy weights

Virtual prediction ensemble

Set each client's weight by hand and watch the effect on accuracy, fairness and cost in real time. Useful for exploring governance and trust policies.


Cost-aware weighting

Virtual prediction ensemble

Down-weights models that are large or expensive to transfer, connecting aggregation quality to the model-exchange cost of the sovereign data space.


Input-conditional routing

Nearest-expert routing

Virtual prediction ensemble

For every sample, picks the site whose training data it most resembles (nearest centroid in feature space) and uses that model alone. A label-free way to identify the originating site; where the sites are well separated it recovers owner-local accuracy without ever seeing the owner label.


Distance-weighted experts

Virtual prediction ensemble

Blends the site models per sample with weights that fall off with feature-space distance (inverse-distance, normalized). A soft version of routing: it hedges between experts instead of committing to one, trading a little accuracy for robustness when a sample sits between distributions.
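Hard routing and its soft counterpart differ only in how the centroid distances are used. The sketch below assumes Euclidean distance in a two-dimensional feature space; the centroids, the sample and the predictions are hypothetical.

```python
import math

def nearest_expert(x, centroids):
    """Index of the site whose training centroid is closest to x."""
    return min(range(len(centroids)), key=lambda k: math.dist(x, centroids[k]))

def distance_weights(x, centroids, eps=1e-9):
    """Normalized inverse-distance weights; eps avoids division by zero."""
    inv = [1.0 / (math.dist(x, c) + eps) for c in centroids]
    total = sum(inv)
    return [v / total for v in inv]

centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
sample = (1.0, 1.0)
site_preds = [100.0, 130.0, 70.0]  # one prediction per site model (µm)

routed = site_preds[nearest_expert(sample, centroids)]
w = distance_weights(sample, centroids)
blended = sum(wi * p for wi, p in zip(w, site_preds))
```

Routing commits fully to the first site, while the soft blend still gives the two distant sites a small share of the weight.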


Federated training (precomputed)

Histogram-based federated boosting

Trained federated model

Each site sends per-feature gradient histograms instead of raw rows; the server sums them to choose every split, so the trained ensemble matches a pooled-data model while raw traces stay local. Trained by the pipeline and replayed here.
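The reason summed histograms reproduce the pooled-data split is that gradient sums are additive across sites. The sketch below illustrates this for one feature, using the standard second-order split gain of XGBoost-style boosting; the bin assignments, gradients and the regularization value `lam` are assumptions, and whether the pipeline uses exactly this gain is not stated on the page.

```python
def site_histogram(bin_ids, grads, hess, n_bins):
    """Per-bin sums of gradients and hessians for one feature at one site."""
    g_hist, h_hist = [0.0] * n_bins, [0.0] * n_bins
    for b, g, h in zip(bin_ids, grads, hess):
        g_hist[b] += g
        h_hist[b] += h
    return g_hist, h_hist

def sum_histograms(histograms):
    """Server side: add the sites' histograms bin by bin."""
    g_total = [sum(col) for col in zip(*(g for g, _ in histograms))]
    h_total = [sum(col) for col in zip(*(h for _, h in histograms))]
    return g_total, h_total

def best_split(g_hist, h_hist, lam=1.0):
    """Return (gain, last bin of the left child) for the best split."""
    g_all, h_all = sum(g_hist), sum(h_hist)
    best = (0.0, None)
    g_left = h_left = 0.0
    for b in range(len(g_hist) - 1):
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right, h_right = g_all - g_left, h_all - h_left
        gain = 0.5 * (g_left ** 2 / (h_left + lam)
                      + g_right ** 2 / (h_right + lam)
                      - g_all ** 2 / (h_all + lam))
        if gain > best[0]:
            best = (gain, b)
    return best

# Two hypothetical sites; raw rows never leave them, only the histograms do.
site_a = site_histogram([0, 1, 1], [-2.0, 1.0, 3.0], [1.0, 1.0, 1.0], n_bins=3)
site_b = site_histogram([0, 2], [-1.0, 4.0], [1.0, 1.0], n_bins=3)
federated = sum_histograms([site_a, site_b])
pooled = site_histogram([0, 1, 1, 0, 2], [-2.0, 1.0, 3.0, -1.0, 4.0],
                        [1.0] * 5, n_bins=3)
gain, split_bin = best_split(*federated)
```

The federated histogram is identical to the one built from pooled rows, so the server picks the same split without seeing any row.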


Cyclic boosting

Trained federated model

A single boosted ensemble visits the sites in turn, adding trees on each one's data before handing the model on. The result depends on the visit order, so every permutation of the participating sites was trained ahead of time and can be selected for comparison.


Distilled student model

Trained federated model

A single compact model is trained to imitate a teacher ensemble's predictions, yielding one small deployable artifact instead of three. Here the teacher is a uniform committee of the local models — a design choice that shapes the student.


Compare all methods

Each aggregation method is one point. Choose a metric for each axis and the machine test data to score them on. Every method still combines all three site models; only the evaluation set changes.

24 of 24 methods selected

Points are scored at the 120 µm wear threshold over the selected machines' held-out test rows. A method that does not define a chosen metric drops out for that axis; the decision-only vote, for example, has no regression error. Cost metrics — communication size, artifact size, decision trees and latency — are each method's deployment footprint and do not vary with the machines scored; the communication- and artifact-size axes are logarithmic.

Reading the benchmark

The numbers above reward a careful read. Two points matter before you rank one method over another.

Why some methods show 0 % precision, recall and F1

Precision, recall and F1 score the worn / not-worn decision, and a tool only counts as predicted-worn once its predicted wear reaches the threshold. The blending ensembles — the uniform, sample-, validation-, precision-, stacked and contribution-weighted ensembles, the median and the geometric median — systematically under-predict the rare high-wear tools. Their highest prediction across the whole test set sits around 116 µm, just under the default 120 µm line, so they label every tool not-worn. With no predicted-worn tools there are no true positives: recall and F1 are exactly zero, and precision is reported as zero (it is really 0 ÷ 0).

The zero is informative rather than a fault. Averaging minimizes the average error, but it pulls the extremes toward the center, and that is what erases the safety-critical worn / not-worn decision: a single site model does reach 124–141 µm on the worst tools, yet blending it with the lower models drags the result back below the line. Methods that keep one high prediction (owner-local, nearest-expert routing, conservative-max), rescale the output (the calibrated ensembles) or vote on the decision itself (the majority threshold vote) recover a non-zero recall.

To see it move, drag the wear-decision threshold down towards 110 µm and the blending ensembles start to cross it, or compare them with the affine- and isotonic-calibrated ensembles, which lift late-wear recall above zero by rescaling the under-prediction upward.
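The effect can be reproduced with a handful of numbers. The sketch below uses hypothetical values in which the blended predictions peak at 116 µm, and it follows the page's convention of reporting an undefined precision (0 ÷ 0) as zero.

```python
def decision_metrics(y_true, y_pred, threshold=120.0):
    """Precision, recall and F1 of the worn / not-worn decision at a threshold."""
    true_worn = [y >= threshold for y in y_true]
    pred_worn = [p >= threshold for p in y_pred]
    tp = sum(t and p for t, p in zip(true_worn, pred_worn))
    fp = sum((not t) and p for t, p in zip(true_worn, pred_worn))
    fn = sum(t and (not p) for t, p in zip(true_worn, pred_worn))
    precision = tp / (tp + fp) if tp + fp else 0.0  # 0 ÷ 0 reported as zero
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical true wear and blended predictions (µm).
y_true = [90.0, 100.0, 130.0, 140.0]
blended = [85.0, 95.0, 110.0, 116.0]

at_120 = decision_metrics(y_true, blended, threshold=120.0)
at_110 = decision_metrics(y_true, blended, threshold=110.0)
```

At 120 µm no prediction reaches the line, so all three metrics are zero; at 110 µm the same predictions flag both worn tools.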

Small gaps sit inside the noise

Every figure here is a single point estimate from three sites, one held-out test split and one random seed. With only three clients, a gap of a few tenths of a micrometer between two methods can sit inside ordinary run-to-run and split variation. Read the scatter and the small deltas against centralized and best-local as rough tiers rather than a strict ranking: a difference far smaller than the spread between the machines is unlikely to survive a different seed or split.

The offline pipeline puts a figure on this uncertainty with 95 % bootstrap confidence intervals on the headline metrics, resampling the test set about a thousand times. They are computed but kept off this page to keep the comparison readable. The rule of thumb: trust the direction of a large gap, discount a small one.
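A percentile bootstrap of the test MAE can be sketched as below. The resample count of 1,000 and the 95 % level follow the text; the percentile method, the seed and the data are assumptions, since the page does not say which bootstrap variant the pipeline uses.

```python
import random

def bootstrap_mae_ci(y_true, y_pred, n_boot=1000, level=0.95, seed=42):
    """Percentile bootstrap confidence interval for the mean absolute error."""
    rng = random.Random(seed)
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(errors)
    # Resample the test errors with replacement and recompute the MAE each time.
    maes = sorted(sum(rng.choices(errors, k=n)) / n for _ in range(n_boot))
    lower = maes[int(n_boot * (1 - level) / 2)]
    upper = maes[int(n_boot * (1 + level) / 2) - 1]
    return lower, upper

# Hypothetical test targets and predictions (µm).
y_true = [100.0, 110.0, 120.0, 130.0, 140.0]
y_pred = [98.0, 113.0, 115.0, 131.0, 133.0]
lower, upper = bootstrap_mae_ci(y_true, y_pred)
```

If the intervals of two methods overlap heavily, the gap between their point estimates falls into the "discount a small one" category.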

Concepts behind the catalog

Three organizing ideas the methods above quietly rely on. The same three explain what the catalog deliberately leaves out, below.

One global model, or one per client?

Classic federated learning aims for a single global model. But when each site's data differs (non-IID) — here, a different machine — one model rarely serves everyone best. Personalized FL instead tailors a model to each client.

Several methods here are the prediction-level shadow of that idea: owner-local keeps each site on its own model, the routing methods pick or blend experts per sample, per-client bias correction removes each site's own offset, and the manual weights let you mix a local and a shared view. The neural-network recipes (Per-FedAvg, Ditto, FedRep) do not transfer to trees, but the underlying question — globalize or personalize — is the same.

One-shot vs. iterative federation

Textbook federated learning is iterative: clients and server exchange updates over many communication rounds until a global model converges (FedAvg and its variants).

This demonstrator is one-shot: each site trains once, then the models (or their predictions) are combined a single time. One-shot federation is its own research line — cheaper and simpler, at the cost of the back-and-forth refinement that iterative rounds buy. The cost view measures bytes per exchange; the number of rounds is the other half of communication cost, and here it is one.

Why "data stays local" still needs protecting

Keeping raw data on the shop floor is not the whole privacy story: the model updates that are shared can themselves leak information, and the aggregator must be trusted.

Secure aggregation masks updates so the server sees only their sum; differential privacy adds calibrated noise for a formal guarantee; robust aggregation defends against malicious clients. These wrap the training methods (such as the histogram-aggregation model) rather than the prediction combining, so they are described rather than run here. The same concerns also separate horizontal federation (sites hold different samples with the same features, as in this demonstrator) from vertical federation (sites hold different features for the same samples), where encryption like SecureBoost's becomes essential.

Methods not implemented here

The concepts above also mark the edges of this setting. Most of federated aggregation assumes neural-network weights, raw cross-client training, or many communication rounds — none of which fit a one-shot ensemble of XGBoost models combined from stored predictions. These are the notable methods left out, and why.

Parameter-averaging family (FedAvg and friends)

FedAvg (plain parameter averaging)

The canonical baseline: average each client's model weights element by element. A boosted-tree ensemble is a sequence of split decisions, not an averageable weight vector, so averaging XGBoost files yields no meaningful model. The prediction-level ensembles in this demonstrator are the tree-compatible stand-in.

FedProx

Adds a proximal term that anchors each client's local update to the global model to tame client drift under non-IID data — the central problem in this by-machine setting. It still operates on a differentiable weight vector, which boosted-tree split structures do not have.

Server optimizers (FedOpt / FedAdam / SCAFFOLD)

Replace the plain weight average with an adaptive server-side optimizer or a variance-reduction correction. All assume gradients or weights to aggregate, so none apply to tree ensembles.

Matched averaging (FedMA / PFNM)

Before averaging neural networks you must first match neurons across clients, because weights are only defined up to permutation. Trees have no neurons to match — there is nothing to align, which is the deeper reason parameter averaging fails for them.

Weight merging (model soups, Git Re-Basin)

Modern recipes for averaging already-trained networks by aligning their weight symmetries. They need a weight space with structural correspondence, which tree ensembles lack.

Robust & Byzantine-resilient aggregation

Coordinate-wise trimmed mean

Drops the highest and lowest prediction per sample before averaging. With only three sites, dropping one high and one low leaves the middle value, so it collapses exactly to the median ensemble already offered.
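The collapse is easy to check numerically. The sketch below uses hypothetical predictions; `trimmed_mean` is an illustrative helper, not part of the demonstrator.

```python
import statistics

def trimmed_mean(values, trim=1):
    """Drop the `trim` lowest and highest values, then average the rest."""
    ordered = sorted(values)
    kept = ordered[trim:len(ordered) - trim]
    return sum(kept) / len(kept)

# Hypothetical per-sample predictions from three sites (µm).
samples = [[95.0, 120.0, 101.0], [60.0, 58.0, 140.0], [110.0, 110.0, 90.0]]
trimmed = [trimmed_mean(s) for s in samples]
medians = [statistics.median(s) for s in samples]
```

With three values and one trimmed from each end, exactly one value survives, and it is the median by definition.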

Krum / Multi-Krum

Selects the update most agreed with by its neighbors to resist malicious clients. Krum tolerates f Byzantine clients only when n > 2f+2; with n = 3 sites that leaves f = 0, so it degenerates to a near-trivial choice and is conceptual here rather than runnable.

Bulyan

Stacks Krum and the trimmed mean for stronger Byzantine tolerance, requiring even more participants (n ≥ 4f+3). Out of range for three sites.
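The two participant conditions can be turned around to ask how many Byzantine clients a given number of sites could tolerate. The helpers below are a small arithmetic sketch of the inequalities quoted above (2f+2 < n for Krum, n ≥ 4f+3 for Bulyan); the function names are hypothetical.

```python
def max_byzantine_krum(n):
    """Largest f with 2f + 2 < n (requires n >= 3)."""
    if n < 3:
        raise ValueError("Krum's condition cannot hold for fewer than 3 clients")
    return (n - 3) // 2

def max_byzantine_bulyan(n):
    """Largest f with n >= 4f + 3 (requires n >= 3)."""
    if n < 3:
        raise ValueError("Bulyan's condition cannot hold for fewer than 3 clients")
    return (n - 3) // 4

krum_three_sites = max_byzantine_krum(3)
bulyan_three_sites = max_byzantine_bulyan(3)
```

Both return zero for three sites: neither rule can tolerate even one malicious client in this setting, and a first tolerated client needs five sites for Krum and seven for Bulyan.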

Vertical & secure tree federation

SecureBoost (vertical federated trees)

The vertical-FL counterpart: parties hold different features for the same samples and exchange encrypted gradients and hessians so the label-holder can choose splits. In that setting no single party can emit a full prediction vector to combine. This demonstrator is horizontal: each site holds whole samples and scores them on its own, so the vertical protocol does not apply.

Federated random forest (forest merge)

Pools independently grown trees into one forest. Averaging the per-site forest outputs is exactly the tree-ensemble committee already offered, so a separate method would only duplicate it — though, unlike cyclic boosting, a forest merge is order-free.

Privacy-preserving & systems-level aggregation

Secure aggregation

Cryptographic masking so the server learns only the summed update, never any single client's. It wraps the histogram-aggregation training method rather than combining finished predictions, so there is nothing for the browser to recompute. See the privacy concept card above.

Differential privacy (DP-FedAvg / DP-GBDT)

Clips each contribution and adds calibrated noise for a formal privacy guarantee. Also a wrapper on training — applied to the gradient histograms — not to the frozen prediction vectors the browser holds.

Asynchronous aggregation (FedAsync / FedBuff)

Aggregates client updates as they arrive, weighting by staleness, to handle stragglers and scale. A scheduling property of multi-round training; this demonstrator runs a single combine step.

Communication compression (quantization / sparsification)

Shrinks each transmitted update via quantization or sparsification. It changes how many bytes cross the boundary, not how predictions are combined — the cost view above is where this would show up.

Personalization & distillation variants

Personalized FL (Per-FedAvg / pFedMe / Ditto / FedRep)

Learn a tailored model per client instead of one global model. The prediction-level shadows of this idea are already here — owner-local, routing, per-client bias correction, the manual local/global mix — but the neural-network mechanisms themselves do not transfer to trees. See the personalization concept card above.

FedMD (iterated co-distillation)

Clients repeatedly teach one another via soft labels on a shared public set across many rounds. The single-round version is the distilled student already offered; the iterated loop would need cross-client retraining rounds, not a browser recompute.

Learned mixture-of-experts gate

A gating network trained to predict the best expert per input — the learned generalization of the heuristic distance gates in the routing family. It needs the validation-split feature distances (not currently exported) or an in-browser classifier, so it is left as a future computable addition.

Sovereign by design

What ties the whole catalog together — and where its sovereignty stops.

Open any method above and its detail card names what crosses the site boundary: nothing, a single scalar, a prediction vector, validation residuals, gradient histograms, or one model handed from site to site. That column is the spine of the catalog. Every method works on exported artifacts (prediction vectors, validation scores, training sizes, model metadata), never on raw process signals; even the held-out test data this page benchmarks on is a processed export, derived from the signals rather than the signals themselves. No raw data is pooled, which is what lets the whole comparison run in your browser with nothing sent back.

Keeping raw data on the shop floor is where sovereignty starts, not where it ends. The predictions and models a site exports still reach the aggregator and can leak information about the data behind them, so this is a documented boundary, not a privacy guarantee. The secure-aggregation and differential-privacy concepts above are the layer that would close that gap in a real deployment.

Two caveats bound the sovereign set. First, the centralized model trains on the pooled raw data: it shows what dropping the boundary would buy, not a guaranteed best score, and one sovereign method here already edges past it. Second, the three sites are machine partitions of one public dataset standing in for independent operators, so the page shows the mechanism rather than a claim that it carries over to other controllers or materials.
