Federated model aggregation
Combine the per-machine XGBoost tool-wear models with different aggregation strategies and benchmark the result against the local models and the centralized reference, all from prepared artifacts, with no raw data leaving the shop floor.
Models in the pool
The four models in this pool are built from the public Stiehl et al. milling tool-wear dataset, split by machine tool into three federated sites. Each site trains its own XGBoost regressor on local data only, and the centralized reference trains on the pooled data; all four then predict flank wear V_B in micrometers on the same 4,194-sample held-out test set. These are the actual machines from the dataset, with models trained on the processed data, not the illustrative machines from the overview page.
Machine 1
Local site model
- Training rows: 1,218
- Trees: 390
- Model size: 377.7 KB
- Own-machine MAE: 8.82 µm
Machine 2
Local site model
- Training rows: 1,856
- Trees: 3,000
- Model size: 4.51 MB
- Own-machine MAE: 2.82 µm
Machine 3
Local site model
- Training rows: 1,218
- Trees: 1,026
- Model size: 1.47 MB
- Own-machine MAE: 11.52 µm
Centralized
Reference baseline
- Training rows: 4,292
- Trees: 1,197
- Model size: 2.16 MB
- Global MAE: 5.98 µm
The centralized model pools all raw data. It is shown only as a non-sovereign reference for what dropping the data boundary would buy, not as a federated option.
Leakage-safe data split
Every model and every aggregation is judged on the same grouped split. Whole tools are assigned to train, validation or test per client, so no window of the same tool ever leaks across splits. This is a grouped split, not a simple random one.
| Client | Train | Validation | Test | Grouping |
|---|---|---|---|---|
| Machine 1 | 1,218 | 1,276 | 1,218 | by tool · 3 groups |
| Machine 2 | 1,856 | 1,856 | 1,856 | by tool · 3 groups |
| Machine 3 | 1,218 | 1,218 | 1,120 | by tool · 3 groups |
| All clients | 4,292 | 4,350 | 4,194 | seed 42 |
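A grouped split of this kind can be sketched in a few lines. This is a minimal illustration, not the pipeline's code: the function name `grouped_split`, the tool labels and the one-tool-per-split assignment are assumptions; only the by-tool grouping and the seed of 42 come from the table above.

```python
import random
from collections import defaultdict


def grouped_split(tool_ids, seed=42):
    """Assign whole tools to train/validation/test; returns {split: [row indices]}."""
    tools = sorted(set(tool_ids))
    rng = random.Random(seed)
    rng.shuffle(tools)
    # Assumption: with three tools per client, one tool lands in each split.
    names = ["train", "validation", "test"]
    split_of_tool = {tool: names[i % 3] for i, tool in enumerate(tools)}
    rows = defaultdict(list)
    for row, tool in enumerate(tool_ids):
        rows[split_of_tool[tool]].append(row)
    return dict(rows)


# Hypothetical client: three tools with a few signal windows each.
tool_ids = ["T1", "T1", "T2", "T2", "T2", "T3", "T3"]
splits = grouped_split(tool_ids)
```

Because the assignment happens per tool and not per row, every window of a tool shares that tool's split, which is the property a simple random split would break.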
Aggregation lab
Choose which site models take part, pick an aggregation strategy and set the wear threshold. Each model predicts the flank-wear land width V_B in micrometers; the benchmark compares the aggregated result against three references: owner-local, best-local and centralized.
Select participating models
This method was trained by the pipeline across all three sites. Its result is fixed, so the model selector does not apply.
Choose an aggregation method
Baselines & references
Weighted prediction averaging
Learned on held-out data
Robust pooling
Safety-oriented & decision-level
Tree-ensemble aggregation
Interactive policy weighting
Input-conditional routing
Federated training (precomputed)
Wear-decision threshold
Tools at or above the threshold count as worn. It drives accuracy, recall and the late-wear metrics.
3 of 3 models · Histogram-based federated boosting
Aggregation methods
The benchmark covers families of sovereign aggregation that range from simple averaging to robust pooling, safety-first decisions and interactive policy weighting.
Baselines & references
Owner-local
Local model artifact
Every sample is scored by its own machine's local model. The sovereignty-respecting floor: each site keeps its own model and shares nothing.
Best-validation selection
Selected local model
Picks the single local model with the lowest validation error and uses it for everyone. Tests whether aggregating beats simply choosing the strongest client.
Weighted prediction averaging
Uniform ensemble
Virtual prediction ensemble
Arithmetic mean of the selected local predictions. The simplest deployable prediction-level ensemble and a prediction-level counterpart to FedAvg's model pooling.
Sample-weighted ensemble
Virtual prediction ensemble
Weights each model by its number of local training samples, on the common assumption that clients with more data should contribute more.
Validation-weighted ensemble
Virtual prediction ensemble
Weights each model by inverse validation error, with the weights clamped so a single small or noisy client cannot dominate. Uses no test labels.
Precision-weighted ensemble
Virtual prediction ensemble
Weights each model by the inverse variance of its validation errors, so the most consistent (least erratic) model contributes most. Distinct from the accuracy-based weighting: a model can be biased yet precise, or accurate on average yet erratic. Uses no test labels.
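The weighted ensembles in this family differ only in how the weights are chosen. The sketch below assumes hypothetical predictions and validation errors; the clamp bounds `floor` and `cap` are illustrative placeholders, since the page does not state the limits actually used. The training-row counts are the ones from the split table.

```python
def weighted_average(predictions, weights):
    """Blend per-model prediction lists; weights are normalized to sum to one."""
    total = sum(weights)
    norm = [w / total for w in weights]
    n = len(predictions[0])
    return [sum(w * p[i] for w, p in zip(norm, predictions)) for i in range(n)]


def inverse_error_weights(val_errors, floor=0.1, cap=0.6):
    """Inverse-validation-error shares, clamped to [floor, cap] (assumed bounds)."""
    raw = [1.0 / e for e in val_errors]
    share = [r / sum(raw) for r in raw]
    # weighted_average renormalizes the clamped shares afterwards.
    return [min(max(s, floor), cap) for s in share]


# Hypothetical V_B predictions in µm: three models, two test samples.
preds = [[100.0, 40.0], [110.0, 50.0], [90.0, 30.0]]
uniform = weighted_average(preds, [1, 1, 1])
by_size = weighted_average(preds, [1218, 1856, 1218])  # training rows per machine
by_val = weighted_average(preds, inverse_error_weights([8.0, 4.0, 12.0]))
```

A precision-weighted variant would follow the same pattern with the inverse variance of each model's validation errors in place of the inverse mean error.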
Learned on held-out data
Stacked ensemble
Virtual prediction ensemble
Learns each model's weight by least squares on the held-out validation split (non-negative, summing to one), then applies those weights to the test predictions. The data-driven counterpart of the heuristic weighted ensembles.
Contribution-weighted ensemble
Virtual prediction ensemble
Weights each model by its Shapley-value contribution to the ensemble's accuracy on the held-out validation split, computed exactly over every client coalition. The game-theoretic, fairness-aware counterpart of the heuristic weighted ensembles: a client that does not improve the ensemble earns little weight. Uses no test labels.
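With three clients there are only eight coalitions, so exact Shapley values are cheap to enumerate. The sketch below is one possible reading of the description, not the page's implementation. Three choices are assumptions: a coalition's value is the MAE reduction of its uniform ensemble, the empty coalition predicts the validation mean, and negative contributions are clipped to zero before normalizing.

```python
from itertools import combinations
from math import factorial


def mae(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def coalition_value(members, preds, target, baseline):
    """MAE reduction of the coalition's uniform ensemble versus the baseline."""
    if not members:
        return 0.0
    blend = [sum(preds[m][i] for m in members) / len(members) for i in range(len(target))]
    return baseline - mae(blend, target)


def shapley_weights(preds, target):
    n = len(preds)
    mean_target = sum(target) / len(target)
    # Assumption: the empty coalition predicts the validation mean.
    baseline = mae([mean_target] * len(target), target)
    phi = []
    for k in range(n):
        others = [m for m in range(n) if m != k]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                share = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_k = coalition_value(subset + (k,), preds, target, baseline)
                without_k = coalition_value(subset, preds, target, baseline)
                total += share * (with_k - without_k)
        phi.append(total)
    # Assumption: a client that hurts the ensemble earns zero weight.
    clipped = [max(p, 0.0) for p in phi]
    norm = sum(clipped)
    weights = [c / norm for c in clipped] if norm > 0 else [1.0 / n] * n
    return phi, weights


# Hypothetical validation data: one exact model, one offset by 4 µm, one by 12 µm.
target = [0.0, 10.0, 20.0, 30.0]
preds = [target[:], [t + 4.0 for t in target], [t + 12.0 for t in target]]
phi, weights = shapley_weights(preds, target)
```

The Shapley values sum to the value of the full coalition, which gives a quick sanity check on any implementation.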
Calibrated ensemble
Virtual prediction ensemble
Fits a scale and offset (a·ŷ+b) on the held-out validation split and applies it to the uniform ensemble, correcting the systematic under-prediction that otherwise suppresses late-wear recall.
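The affine fit is ordinary least squares with one slope and one intercept. A minimal sketch on hypothetical validation data in which the ensemble compresses the wear range; the function name `fit_affine` and all numbers are illustrative.

```python
def fit_affine(pred_val, y_val):
    """Least-squares fit of y ≈ a * pred + b on the validation split."""
    n = len(pred_val)
    mean_p = sum(pred_val) / n
    mean_y = sum(y_val) / n
    cov = sum((p - mean_p) * (y - mean_y) for p, y in zip(pred_val, y_val))
    var = sum((p - mean_p) ** 2 for p in pred_val)
    a = cov / var
    b = mean_y - a * mean_p
    return a, b


# Hypothetical validation split: true wear spans 20-140 µm, the ensemble only 30-120 µm.
y_val = [20.0, 60.0, 100.0, 140.0]
ens_val = [30.0, 60.0, 90.0, 120.0]
a, b = fit_affine(ens_val, y_val)

# Apply the validation-fitted map to a test prediction just under the 120 µm line.
calibrated = a * 116.0 + b
```

With a slope above one, a prediction that sat just below the threshold is stretched back above it, which is how rescaling can restore late-wear recall.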
Isotonic calibration
Virtual prediction ensemble
Fits a free monotonic map from the uniform ensemble to the target on the validation split (pool-adjacent-violators), then applies it to the test ensemble. Captures curved miscalibration that the affine fit cannot.
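Pool-adjacent-violators can be written compactly. The sketch below fits a non-decreasing step function and looks values up without interpolation; library implementations usually interpolate between the fitted points, so treat the step lookup as a simplifying assumption. Function names and data are hypothetical.

```python
from bisect import bisect_left


def pav_fit(x, y):
    """Pool-adjacent-violators: non-decreasing step fit of y against x."""
    blocks = []  # each block holds [sum of y, count, right-edge x]
    for xi, yi in sorted(zip(x, y)):
        blocks.append([yi, 1, xi])
        # Merge backwards while the previous block's mean exceeds the newest one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, edge = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = edge
    edges = [b[2] for b in blocks]
    levels = [b[0] / b[1] for b in blocks]
    return edges, levels


def pav_predict(edges, levels, x_new):
    """Step lookup: first block whose right edge is >= x; beyond the last edge, the last level."""
    return [levels[min(bisect_left(edges, v), len(levels) - 1)] for v in x_new]


# Hypothetical validation pairs: ensemble prediction vs. measured wear, with one violation.
edges, levels = pav_fit([10.0, 20.0, 30.0, 40.0], [12.0, 30.0, 25.0, 50.0])
mapped = pav_predict(edges, levels, [5.0, 20.0, 35.0, 100.0])
```

The out-of-order pair (30 then 25) is pooled into a single level at their mean, which is what keeps the map monotonic.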
Per-client bias correction
Virtual prediction ensemble
Estimates each client model's systematic offset on its own validation rows and subtracts it before owner-local routing. Targets heterogeneous per-site bias rather than one global correction.
Conformal intervals
Virtual prediction ensemble
Keeps the uniform-ensemble point estimate but adds a split-conformal prediction interval from the validation residuals, with a distribution-free coverage target. The only method that quantifies uncertainty instead of a single value.
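Split-conformal intervals reduce to one quantile of the absolute validation residuals. A minimal sketch with hypothetical residuals; the name `conformal_radius` and the 80 % coverage level are illustrative, and the symmetric absolute-residual score is the textbook choice, not necessarily the one used here.

```python
import math


def conformal_radius(val_residuals, alpha=0.1):
    """Interval half-width from absolute validation residuals, with the (n + 1) correction."""
    scores = sorted(abs(r) for r in val_residuals)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank of the calibration quantile
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return scores[rank - 1]


# Hypothetical validation residuals (true minus predicted wear) in µm.
residuals = [-3.0, 1.0, 4.0, -2.0, 6.0, -1.0, 2.0, 5.0, -4.0, 7.0]
q = conformal_radius(residuals, alpha=0.2)

# The interval wraps the unchanged point estimate.
point_estimate = 100.0
interval = (point_estimate - q, point_estimate + q)
```

The coverage guarantee is marginal and rests on the validation and test residuals being exchangeable, which a by-tool split across different machines may only approximate.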
Robust pooling
Median ensemble
Virtual prediction ensemble
Takes the per-sample median across local predictions, so a single extreme or mis-calibrated model cannot pull the result. Represents the robust family (trimmed mean, winsorized mean, Huber).
Geometric-median ensemble
Virtual prediction ensemble
Finds the single consensus prediction vector that minimizes the total Euclidean distance to every model's prediction vector: the multivariate L1 median, found by Weiszfeld iteration. A model whose whole prediction vector drifts from the consensus is automatically down-weighted. This is the prediction-level analog of Robust Federated Averaging (RFA); unlike the per-sample median it yields one fixed set of client weights.
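Weiszfeld iteration is a reweighted average in which each model's weight is the inverse of its distance to the current consensus. A minimal sketch on hypothetical prediction vectors; the iteration count, the starting point and the `eps` guard are assumptions.

```python
import math


def geometric_median(vectors, iterations=200, eps=1e-9):
    """Weiszfeld iteration; returns the consensus vector and the implied client weights."""
    dim = len(vectors[0])
    # Start from the plain mean of the prediction vectors.
    point = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    weights = [1.0 / len(vectors)] * len(vectors)
    for _ in range(iterations):
        # eps guards against division by zero when the point lands on a vector.
        inv = [1.0 / max(math.dist(v, point), eps) for v in vectors]
        weights = [w / sum(inv) for w in inv]
        point = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return point, weights


# Hypothetical prediction vectors in µm: two models agree, the third drifts high.
vectors = [[100.0, 50.0, 20.0], [102.0, 52.0, 22.0], [140.0, 90.0, 60.0]]
consensus, client_weights = geometric_median(vectors)
```

The returned weights are the single fixed set of client weights the description refers to: the drifting model ends up with the smallest share.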
Safety-oriented & decision-level
Conservative (max)
Virtual prediction ensemble
Takes the highest local prediction per sample. Underestimating a worn tool is more costly than a small early-life error, so a safety-first policy leans high.
Majority threshold vote
Decision-level ensemble
Each model votes worn / not-worn at the threshold and the majority wins. Shows the difference between aggregating wear values and aggregating the replacement decision.
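The difference between aggregating values and aggregating decisions shows up on a handful of numbers. The sketch below uses hypothetical predictions for three models on three samples and the page's default 120 µm threshold; the function names are illustrative.

```python
def conservative_max(preds):
    """Highest local prediction per sample."""
    return [max(column) for column in zip(*preds)]


def majority_worn(preds, threshold=120.0):
    """Each model votes worn at or above the threshold; a strict majority wins."""
    votes = [[p >= threshold for p in column] for column in zip(*preds)]
    return [sum(v) > len(v) / 2 for v in votes]


# Hypothetical V_B predictions in µm: rows are models, columns are samples.
preds = [
    [125.0, 90.0, 130.0],
    [118.0, 95.0, 122.0],
    [100.0, 124.0, 105.0],
]
mean_pred = [sum(column) / len(column) for column in zip(*preds)]
mean_worn = [m >= 120.0 for m in mean_pred]
max_worn = [m >= 120.0 for m in conservative_max(preds)]
vote_worn = majority_worn(preds)
```

On these numbers the mean flags nothing, the maximum flags everything and the vote flags only the last sample, where two of three models cross the line even though their average does not.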
Tree-ensemble aggregation
Federated tree-bagging committee
Prediction-equivalent committee
Treats each client's full XGBoost ensemble as one committee member. Reported as a prediction-equivalent committee; no native merged booster file is produced.
Interactive policy weighting
Manual policy weights
Virtual prediction ensemble
Set each client's weight by hand and watch the effect on accuracy, fairness and cost in real time. Useful for exploring governance and trust policies.
Cost-aware weighting
Virtual prediction ensemble
Down-weights models that are large or expensive to transfer, connecting aggregation quality to the model-exchange cost of the sovereign data space.
Input-conditional routing
Nearest-expert routing
Virtual prediction ensemble
For every sample, picks the site whose training data it most resembles (nearest centroid in feature space) and uses that model alone. A label-free way to identify the originating site; where the sites are well separated it recovers owner-local accuracy without ever seeing the owner label.
Distance-weighted experts
Virtual prediction ensemble
Blends the site models per sample with weights that fall off with feature-space distance (inverse-distance, normalized). A soft version of routing: it hedges between experts instead of committing to one, trading a little accuracy for robustness when a sample sits between distributions.
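Both routing methods need only the per-site training centroids. A minimal sketch with hypothetical two-dimensional centroids; the Euclidean metric, the feature scaling and the `eps` guard are assumptions.

```python
import math


def nearest_expert(sample, centroids):
    """Index of the site whose training centroid is closest to the sample."""
    return min(range(len(centroids)), key=lambda k: math.dist(sample, centroids[k]))


def distance_weights(sample, centroids, eps=1e-9):
    """Normalized inverse-distance weights over the site experts."""
    inv = [1.0 / (math.dist(sample, c) + eps) for c in centroids]
    return [v / sum(inv) for v in inv]


# Hypothetical site centroids in a standardized feature space.
centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
sample = [2.0, 1.0]

hard_choice = nearest_expert(sample, centroids)    # commit to one site's model
soft_weights = distance_weights(sample, centroids)  # hedge across all three
```

The hard router returns one model index, while the soft variant returns weights that can feed a per-sample weighted average of the site predictions.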
Federated training (precomputed)
Histogram-based federated boosting
Trained federated model
Each site sends per-feature gradient histograms instead of raw rows; the server sums them to choose every split, so the trained ensemble matches a pooled-data model while raw traces stay local. Trained by the pipeline and replayed here.
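The server-side step can be illustrated for a single feature at a single tree node. The sketch below sums hypothetical per-site (gradient, hessian) histograms and scans the bin boundaries with the standard second-order split gain used by XGBoost-style boosting; the complexity penalty is omitted, and the bin layout, the `lam` value and the function names are assumptions.

```python
def sum_histograms(site_histograms):
    """Element-wise sum of per-site (gradient, hessian) bins for one feature."""
    n_bins = len(site_histograms[0])
    return [
        (sum(h[b][0] for h in site_histograms), sum(h[b][1] for h in site_histograms))
        for b in range(n_bins)
    ]


def best_split(hist, lam=1.0):
    """Scan bin boundaries; return (gain, index of the last bin on the left side)."""
    def score(g, h):
        return g * g / (h + lam)

    g_total = sum(g for g, _ in hist)
    h_total = sum(h for _, h in hist)
    best = (0.0, None)
    g_left = h_left = 0.0
    for b in range(len(hist) - 1):
        g_left += hist[b][0]
        h_left += hist[b][1]
        gain = 0.5 * (
            score(g_left, h_left)
            + score(g_total - g_left, h_total - h_left)
            - score(g_total, h_total)
        )
        if gain > best[0]:
            best = (gain, b)
    return best


# Hypothetical histograms from two sites: three bins of (sum of gradients, sum of hessians).
site_a = [(-4.0, 2.0), (-1.0, 1.0), (3.0, 2.0)]
site_b = [(-2.0, 1.0), (1.0, 1.0), (5.0, 2.0)]
merged = sum_histograms([site_a, site_b])
gain, boundary = best_split(merged)
```

Because the gain depends only on the summed bins, the server picks the same split it would pick from pooled rows binned the same way, without seeing any row.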
Cyclic boosting
Trained federated model
A single boosted ensemble visits the sites in turn, adding trees on each one's data before handing the model on. The result depends on the visit order, so every permutation of the participating sites was trained ahead of time and can be selected for comparison.
Distilled student model
Trained federated model
A single compact model is trained to imitate a teacher ensemble's predictions, yielding one small deployable artifact instead of three. Here the teacher is a uniform committee of the local models, a design choice that shapes the student.
Compare all methods
Each aggregation method is one point. Choose a metric for each axis and the machine test data to score them on. Every method still combines all three site models; only the evaluation set changes.
24 of 24 methods selected
Points are scored at the 120 µm wear threshold over the selected machines' held-out test rows. A method that does not define a chosen metric drops out for that axis; the decision-only vote, for example, has no regression error. Cost metrics — communication size, artifact size, decision trees and latency — are each method's deployment footprint and do not vary with the machines scored; the communication- and artifact-size axes are logarithmic.
Reading the benchmark
The numbers above reward a careful read. Two points matter before you rank one method over another.
Why some methods show 0 % precision, recall and F1
Precision, recall and F1 score the worn / not-worn decision, and a tool only counts as predicted-worn once its predicted wear reaches the threshold. The blending ensembles (the uniform, sample-weighted, validation-weighted, precision-weighted, stacked and contribution-weighted ensembles, the median and the geometric median) systematically under-predict the rare high-wear tools. Their highest prediction across the whole test set sits around 116 µm, just under the default 120 µm line, so they label every tool not-worn. With no predicted-worn tools there are no true positives: recall and F1 are exactly zero, and precision, which is strictly undefined (0 ÷ 0), is reported as zero.
The zero is informative rather than a fault. Averaging minimizes the average error, but it pulls the extremes toward the center, and that is what erases the safety-critical worn / not-worn decision: a single site model does reach 124–141 µm on the worst tools, yet blending it with the lower models drags the result back below the line. Methods that keep one high prediction (owner-local, nearest-expert routing, conservative-max), rescale the output (the calibrated ensembles) or vote on the decision itself (the majority threshold vote) recover a non-zero recall.
To see it move, drag the wear-decision threshold down towards 110 µm and the blending ensembles start to cross it, or compare them with the affine- and isotonic-calibrated ensembles, which lift late-wear recall above zero by rescaling the under-prediction upward.
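The effect is easy to reproduce on a few numbers. The sketch below scores a hypothetical blended prediction whose maximum sits at 116 µm, first at the default 120 µm threshold and then at 110 µm; the function name and the data are illustrative.

```python
def decision_metrics(pred, truth, threshold=120.0):
    """Precision, recall and F1 for the worn / not-worn decision at a threshold."""
    worn_pred = [p >= threshold for p in pred]
    worn_true = [t >= threshold for t in truth]
    tp = sum(p and t for p, t in zip(worn_pred, worn_true))
    fp = sum(p and not t for p, t in zip(worn_pred, worn_true))
    fn = sum(t and not p for p, t in zip(worn_pred, worn_true))
    precision = tp / (tp + fp) if tp + fp else 0.0  # 0 / 0 is reported as zero
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical test rows in µm: two worn tools, both under-predicted by the blend.
truth = [60.0, 95.0, 130.0, 141.0]
blend = [58.0, 90.0, 112.0, 116.0]

at_default = decision_metrics(blend, truth, threshold=120.0)
at_lowered = decision_metrics(blend, truth, threshold=110.0)
```

At 120 µm no prediction crosses the line and all three metrics are zero; at 110 µm the same predictions recover both worn tools.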
Small gaps sit inside the noise
Every figure here is a single point estimate from three sites, one held-out test split and one random seed. With only three clients, a gap of a few tenths of a micrometer between two methods can sit inside ordinary run-to-run and split variation. Read the scatter and the small deltas against centralized and best-local as rough tiers rather than a strict ranking: a difference far smaller than the spread between the machines is unlikely to survive a different seed or split.
The offline pipeline puts a figure on this uncertainty with 95 % bootstrap confidence intervals on the headline metrics, resampling the test set about a thousand times. They are computed but kept off this page to keep the comparison readable. The rule of thumb: trust the direction of a large gap, discount a small one.
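A percentile bootstrap of this kind fits in a short function. The sketch below resamples test rows with replacement a thousand times and reads off the 2.5th and 97.5th percentiles of MAE. Row-level resampling, the percentile method and the seed are assumptions, since the page does not describe the pipeline's exact procedure.

```python
import random


def bootstrap_mae_ci(pred, truth, n_resamples=1000, level=0.95, seed=42):
    """Percentile bootstrap interval for MAE, resampling test rows with replacement."""
    rng = random.Random(seed)
    errors = [abs(p - t) for p, t in zip(pred, truth)]
    n = len(errors)
    stats = sorted(sum(rng.choices(errors, k=n)) / n for _ in range(n_resamples))
    lower = stats[int((1 - level) / 2 * n_resamples)]
    upper = stats[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper


# Hypothetical test predictions and measured wear in µm.
pred = [100.0, 50.0, 80.0, 120.0, 30.0]
truth = [105.0, 48.0, 90.0, 118.0, 31.0]
ci_low, ci_high = bootstrap_mae_ci(pred, truth)
```

Two methods whose intervals overlap heavily belong to the same rough tier, which is the reading the rule of thumb above recommends.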
Concepts behind the catalog
Three organizing ideas underlie the methods above. The same three explain what the catalog deliberately leaves out, below.
One global model, or one per client?
Classic federated learning aims for a single global model. But when each site's data differs (non-IID) — here, a different machine — one model rarely serves everyone best. Personalized FL instead tailors a model to each client.
Several methods here are the prediction-level shadow of that idea: owner-local keeps each site on its own model, the routing methods pick or blend experts per sample, per-client bias correction removes each site's own offset, and the manual weights let you mix a local and a shared view. The neural-network recipes (Per-FedAvg, Ditto, FedRep) do not transfer to trees, but the underlying question — globalize or personalize — is the same.
One-shot vs. iterative federation
Textbook federated learning is iterative: clients and server exchange updates over many communication rounds until a global model converges (FedAvg and its variants).
This demonstrator is one-shot: each site trains once, then the models (or their predictions) are combined a single time. One-shot federation is its own research line — cheaper and simpler, at the cost of the back-and-forth refinement that iterative rounds buy. The cost view measures bytes per exchange; the number of rounds is the other half of communication cost, and here it is one.
Why "data stays local" still needs protecting
Keeping raw data on the shop floor is not the whole privacy story: the model updates that are shared can themselves leak information, and the aggregator must be trusted.
Secure aggregation masks updates so the server sees only their sum; differential privacy adds calibrated noise for a formal guarantee; robust aggregation defends against malicious clients. These wrap the training methods (such as the histogram-aggregation model) rather than the prediction combining, so they are described rather than run here. The partitioning matters too: horizontal federation (sites hold different samples over the same features, as in this demonstrator) differs from vertical federation (sites hold different features for the same samples), where encryption like SecureBoost's becomes essential.
Methods not implemented here
The concepts above also mark the edges of this setting. Most of federated aggregation assumes neural-network weights, raw cross-client training, or many communication rounds — none of which fit a one-shot ensemble of XGBoost models combined from stored predictions. These are the notable methods left out, and why.
Parameter-averaging family (FedAvg and friends)
FedAvg (plain parameter averaging)
The canonical baseline: average each client's model weights element by element. A boosted-tree ensemble is a sequence of split decisions, not an averageable weight vector, so averaging XGBoost files yields no meaningful model. The prediction-level ensembles in this demonstrator are the tree-compatible stand-in.
FedProx
Adds a proximal term that anchors each client's local update to the global model to tame client drift under non-IID data — the central problem in this by-machine setting. It still operates on a differentiable weight vector, which boosted-tree split structures do not have.
Server optimizers (FedOpt / FedAdam / SCAFFOLD)
Replace the plain weight average with an adaptive server-side optimizer or a variance-reduction correction. All assume gradients or weights to aggregate, so none apply to tree ensembles.
Matched averaging (FedMA / PFNM)
Before averaging neural networks you must first match neurons across clients, because weights are only defined up to permutation. Trees have no neurons to match — there is nothing to align, which is the deeper reason parameter averaging fails for them.
Weight merging (model soups, Git Re-Basin)
Modern recipes for averaging already-trained networks by aligning their weight symmetries. They need a weight space with structural correspondence, which tree ensembles lack.
Robust & Byzantine-resilient aggregation
Coordinate-wise trimmed mean
Drops the highest and lowest prediction per sample before averaging. With only three sites, dropping one high and one low leaves the middle value, so it collapses exactly to the median ensemble already offered.
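The collapse is easy to check. A minimal sketch with hypothetical per-sample predictions from three sites; `trimmed_mean` is an illustrative helper, not part of the page's code.

```python
import statistics


def trimmed_mean(values, trim=1):
    """Drop the `trim` lowest and `trim` highest values, then average the rest."""
    kept = sorted(values)[trim:len(values) - trim]
    return sum(kept) / len(kept)


# Hypothetical predictions for one sample from the three sites, in µm.
site_predictions = [96.0, 141.0, 110.0]
trimmed = trimmed_mean(site_predictions)
median = statistics.median(site_predictions)
```

With three values and one trimmed from each end, only the middle value survives, so the two results coincide for every sample.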
Krum / Multi-Krum
Selects the update that lies closest to its nearest neighbors, to resist malicious clients. Krum needs n > 2f+2 participants to tolerate f malicious ones; with three sites that allows only f = 0, so it degenerates to a near-trivial choice and is conceptual here rather than runnable.
Bulyan
Stacks Krum and the trimmed mean for stronger Byzantine tolerance, requiring even more participants (n ≥ 4f+3). Tolerating a single malicious client would already need seven participants, out of range for three sites.
Vertical & secure tree federation
SecureBoost (vertical federated trees)
The vertical-FL counterpart: parties hold different features for the same samples and exchange encrypted gradients and hessians so the label-holder can choose splits. This demonstrator is horizontal: each site holds whole samples and can emit a full prediction vector on its own. In the vertical setting no single site can, so there is no per-site prediction vector to combine.
Federated random forest (forest merge)
Pools independently grown trees into one forest. Averaging the per-site forest outputs is exactly the tree-ensemble committee already offered, so a separate method would only duplicate it — though, unlike cyclic boosting, a forest merge is order-free.
Privacy-preserving & systems-level aggregation
Secure aggregation
Cryptographic masking so the server learns only the summed update, never any single client's. It wraps the histogram-aggregation training method rather than combining finished predictions, so there is nothing for the browser to recompute. See the privacy concept card above.
Differential privacy (DP-FedAvg / DP-GBDT)
Clips each contribution and adds calibrated noise for a formal privacy guarantee. Also a wrapper on training — applied to the gradient histograms — not to the frozen prediction vectors the browser holds.
Asynchronous aggregation (FedAsync / FedBuff)
Aggregates client updates as they arrive, weighting by staleness, to handle stragglers and scale. A scheduling property of multi-round training; this demonstrator runs a single combine step.
Communication compression (quantization / sparsification)
Shrinks each transmitted update via quantization or sparsification. It changes how many bytes cross the boundary, not how predictions are combined — the cost view above is where this would show up.
Personalization & distillation variants
Personalized FL (Per-FedAvg / pFedMe / Ditto / FedRep)
Learn a tailored model per client instead of one global model. The prediction-level shadows of this idea are already here — owner-local, routing, per-client bias correction, the manual local/global mix — but the neural-network mechanisms themselves do not transfer to trees. See the personalization concept card above.
FedMD (iterated co-distillation)
Clients repeatedly teach one another via soft labels on a shared public set across many rounds. The single-round version is the distilled student already offered; the iterated loop would need cross-client retraining rounds, not a browser recompute.
Learned mixture-of-experts gate
A gating network trained to predict the best expert per input — the learned generalization of the heuristic distance gates in the routing family. It needs the validation-split feature distances (not currently exported) or an in-browser classifier, so it is left as a future computable addition.
Sovereign by design
What ties the whole catalog together — and where its sovereignty stops.
Open any method above and its detail card names what crosses the site boundary: nothing, a single scalar, a prediction vector, validation residuals, gradient histograms, or one model handed from site to site. That column is the spine of the catalog. Every method works on exported artifacts (prediction vectors, validation scores, training sizes, model metadata), never on raw process signals; even the held-out test data this page benchmarks on is a processed export, derived from the signals rather than the signals themselves. No raw data is pooled, which is what lets the whole comparison run in your browser with nothing sent back.
Keeping raw data on the shop floor is where sovereignty starts, not where it ends. The predictions and models a site exports still reach the aggregator and can leak information about the data behind them, so this is a documented boundary, not a privacy guarantee. The secure-aggregation and differential-privacy concepts above are the layer that would close that gap in a real deployment.
Two reference points sit outside the sovereign set. The centralized model trains on the pooled raw data: it shows what dropping the boundary would buy, not a guaranteed best score, and one sovereign method here already edges past it. The three sites are machine partitions of one public dataset standing in for independent operators, so the page shows the mechanism rather than a claim that it carries over to other controllers or materials.