Technical University of Munich | Advisors: Lars Linden & Celine Stauch (LMU Munich)
We present a machine-learning benchmark for non-resonant vector-boson-fusion Higgs-pair production in the semileptonic channel HH → bbWW* → bbqqℓν. From truth-level Monte Carlo samples, we build a selected and matched event-level dataset and represent each event with kinematic features of the main physics objects and their combinations. We then study whether this representation can distinguish different values of the coupling-scan parameter cvv, focusing on the binary separation of the Standard Model reference point cvv = 1 from the remaining sampled classes. We find that this task is learnable with the available feature set: the full representation performs best overall, and several nonlinear models reach similar accuracy. In contrast, scaling and oversampling show mixed effects, and ensemble methods do not clearly improve on the best individual models. These results show that the matched event representation retains useful information for coupling classification in semileptonic VBF Higgs-pair events.
VBF Higgs-pair production is a rare process predicted by the Standard Model whose rate is highly sensitive to the quartic coupling modifier κ2V (the cvv parameter). Since VBF HH production has not yet been observed, measuring or constraining κ2V benefits from ML-assisted discrimination of events produced at the SM point (cvv = 1) from events produced at anomalous coupling values.
We study the semileptonic decay channel HH → bb WW* → bb qq ℓν, producing two b-jets, two light jets, a charged lepton, and a neutrino. Events are processed through a full Delphes detector simulation and jet matching pipeline. The coupling scan covers cvv ∈ {0.0, 0.5, 1.0, 1.5, 2.0, 3.0}.
| κ2V value | Class label | Matched events | Survival fraction |
|---|---|---|---|
| 0.0 | C0.0 | 243 917 | 12.20% |
| 0.5 | C0.5 | 180 456 | 9.02% |
| 1.0 (SM) | C1 | 19 796 | 0.40% |
| 1.5 | C1.5 | 209 702 | 20.97% |
| 2.0 | C2.0 | 204 740 | 20.47% |
| 3.0 | C3.0 | 185 928 | 18.59% |
Because the non-SM classes are nearly indistinguishable from one another (5-class accuracy ≈ 20%, i.e. random guessing), the task is reduced to the binary problem C1 vs. Cnot1.
Each event is described by kinematic four-vector observables of the reconstructed physics objects, grouped by their role in the topology. In total, 52 features are used (after dropping azimuthal angles and jet-level duplicates that provide no discriminating power). The dataset (~1M events, CSV) is publicly available for download.
- VBF tagging jets: kinematics of the two forward tagging jets (pT, η, m, E, ΔR, jet-level).
- W → qq system: system momentum and the individual quark daughters from the W → qq decay.
- Lepton and neutrino: charged-lepton and neutrino four-vector components.
- H → bb system: jet-level and b-jet daughter kinematics of the Higgs decaying to b-quarks.
- Combination features: derived quantities built from multiple objects, such as the HH mass, ΔR distances, and invariant masses.
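The combination features in the last group can be built directly from per-object four-vectors. A minimal sketch (the exact observables and conventions used in the dataset are assumed here, with hypothetical Higgs-candidate kinematics) of the two most common derived quantities, an invariant mass and a ΔR distance:

```python
import numpy as np

def four_vector(pt, eta, phi, m):
    """Build (E, px, py, pz) from collider kinematics (pt, eta, phi, m)."""
    px, py = pt * np.cos(phi), pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    e = np.sqrt(px**2 + py**2 + pz**2 + m**2)
    return np.array([e, px, py, pz])

def invariant_mass(p4a, p4b):
    """m^2 = E^2 - |p|^2 of the summed four-vector."""
    s = p4a + p4b
    m2 = s[0]**2 - s[1]**2 - s[2]**2 - s[3]**2
    return np.sqrt(max(m2, 0.0))

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance sqrt(deta^2 + dphi^2), with dphi wrapped to [-pi, pi]."""
    dphi = (phi1 - phi2 + np.pi) % (2 * np.pi) - np.pi
    return np.hypot(eta1 - eta2, dphi)

# Hypothetical Higgs candidates as (pt, eta, phi, m); not values from the dataset.
h_bb = (150.0, 0.5, 1.0, 125.0)
h_ww = (120.0, -0.3, -2.0, 125.0)
m_hh = invariant_mass(four_vector(*h_bb), four_vector(*h_ww))
dr_hh = delta_r(h_bb[1], h_bb[2], h_ww[1], h_ww[2])
```

All other multi-object invariant masses in the feature set follow the same pattern with different object pairs.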
Select a particle and a property to explore the per-class distributions. C1 (SM, cvv = 1) is shown in red, Cnot1 in blue. Most variables show substantial overlap: the coupling signal is only a small shape distortion. The clearest separation appears in the pseudorapidity (η) of the VBF tagging jets.
All classical ML models are implemented with scikit-learn and run through the same config-driven framework. The PyTorch MLP uses residual connections, batch normalization, and optional squeeze-and-excitation (SE) blocks. DeepSets operates on sets of events rather than single events.
After sweeping all preprocessing combinations we identify the single best configuration per model. Toggle between the chart and a summary table below.
Feature scaling transforms each column to a common numerical range before training. We compare seven strategies: Standard (STD), MinMax, MaxAbs, Robust, Yeo-Johnson, Quantile (normal), and Quantile (uniform). For neural networks and distance-based methods, scaling is critical because otherwise gradient updates are dominated by large-magnitude features. Tree-based models are theoretically invariant to monotonic rescaling but can still be affected in practice through interactions with regularization.
Per-sample (row-wise) normalization scales each individual event vector to unit norm after column-wise scaling. We test four variants: no normalization, L1 norm, L2 norm, and Max norm. The idea is to remove scale differences between events, which can help when the overall magnitude of a feature vector carries no class-discriminating information.
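The column-then-row ordering described above can be expressed as a two-step scikit-learn pipeline. A sketch for the L2 variant (the study's actual pipeline wiring is assumed, not shown in the text):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, Normalizer

# Column-wise scaling first, then row-wise L2 normalization,
# so every event vector ends up on the unit sphere.
pipe = make_pipeline(StandardScaler(), Normalizer(norm="l2"))

X = np.random.default_rng(1).normal(size=(100, 5))  # toy events
Xn = pipe.fit_transform(X)
row_norms = np.linalg.norm(Xn, axis=1)  # all approximately 1
```

Swapping `norm="l2"` for `"l1"` or `"max"`, or dropping the `Normalizer` step, reproduces the other three variants.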
We compare three feature sets: all features (52 variables including combination observables), old features (basic kinematic four-vectors only, no derived quantities), and new features (the combination observables alone). The combination features are derived quantities involving multiple reconstructed objects — such as the HH invariant mass, ΔR distances between systems, and multi-object invariant masses — motivated by their expected sensitivity to the coupling structure.
The Standard Model class (cvv = 1) has only 0.4% matching survival, leading to a severe class imbalance. We compare three strategies: undersampling (reducing the majority class to match the minority), 3× oversampling (SMOTE-style synthetic augmentation of the minority class to 3× its original size), and 10× oversampling.
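A minimal numpy sketch of the oversampling step. Plain sampling with replacement is used here as a simple stand-in for SMOTE, which instead interpolates synthetic neighbours; the function name and signature are illustrative, not from the framework:

```python
import numpy as np

def rebalance(X, y, minority=1, factor=3, rng=None):
    """Oversample the minority class to `factor`x its original size
    by drawing extra copies with replacement (SMOTE stand-in)."""
    rng = np.random.default_rng(rng)
    idx_min = np.flatnonzero(y == minority)
    idx_maj = np.flatnonzero(y != minority)
    extra = rng.choice(idx_min, size=(factor - 1) * len(idx_min), replace=True)
    idx = np.concatenate([idx_maj, idx_min, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Undersampling is the mirror image: draw `len(idx_min)` majority indices without replacement instead of duplicating the minority.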
Ensemble methods combine predictions from multiple individually trained models via majority voting. The idea is that if models make different errors, a vote can cancel individual mistakes and improve overall accuracy. We test diverse combinations of the best-performing models (MLP, GBT, HistGBT, Random Forest, Logistic Regression) and measure both accuracy and macro F1 of the ensemble versus the individual member models.
DeepSets is a permutation-invariant architecture that classifies a set of N events simultaneously rather than one at a time. Each event is encoded independently through a shared MLP, then the encoded representations are aggregated (via max pooling, attention pooling, or self-attention) before a final classification head. Because all N events share the same underlying coupling parameter, using multiple events per prediction provides a statistical averaging effect that makes the coupling signal much clearer.
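The core DeepSets property can be illustrated with a tiny numpy forward pass (the real model is a trained PyTorch network with deeper encoders and attention pooling; the random weights and max pooling here are only a structural sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 52, 16                                # per-event features, hidden width
W_enc = rng.normal(scale=0.1, size=(D, H))   # shared per-event encoder weights
w_head = rng.normal(scale=0.1, size=H)       # classification head weights

def deepsets_logit(events):
    """events: (N, D) array of events sharing one coupling label.
    Encode each event with the same weights, max-pool over the set,
    then score; the output is invariant to the order of the events."""
    h = np.maximum(events @ W_enc, 0.0)      # shared ReLU encoder
    pooled = h.max(axis=0)                   # permutation-invariant pooling
    return float(pooled @ w_head)

x = rng.normal(size=(10, D))                 # a set of N = 10 toy events
logit = deepsets_logit(x)
assert np.isclose(logit, deepsets_logit(x[::-1]))  # order does not matter
```

Because pooling happens before the decision, weak per-event evidence accumulates in the pooled representation rather than being thresholded away event by event.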
| Set size N | Accuracy | Precision | Recall |
|---|---|---|---|
| 1 | 77.58% | 77.57% | 77.56% |
| 2 | 86.09% | 85.22% | 85.21% |
| 3 | 90.27% | 90.48% | 90.47% |
| 5 | 95.51% | 95.28% | 95.28% |
| 10 | 99.11% | 98.82% | 98.82% |
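The trend in the table matches the simplest statistical-averaging intuition: even a naive majority vote over N independent per-event decisions, each correct with the single-event accuracy p ≈ 0.7758, improves rapidly with N. A back-of-the-envelope check (this underestimates DeepSets, which pools features before deciding rather than voting on per-event labels):

```python
from math import comb

def majority_accuracy(p, n):
    """P(majority of n independent per-event decisions is correct),
    given per-event accuracy p; n odd, so ties are impossible."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n + 1) // 2, n + 1))

p = 0.7758  # single-event accuracy from the table
for n in (1, 3, 5, 9):
    print(n, round(majority_accuracy(p, n), 4))
```

The voting curve rises steeply but sits below the measured DeepSets numbers at each set size, consistent with pooled representations extracting more than a per-event hard vote.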
This study demonstrates that binary discrimination of VBF HH events at the Standard Model coupling point from all other coupling values is achievable with classical ML, but is genuinely difficult due to extreme class imbalance and limited signal in single-event representations.
Standard scaling and 3× minority oversampling form a robust preprocessing baseline. Adding combination features (derived multi-object quantities) gives small but consistent improvements. No ensemble combination outperforms the best individual model, suggesting that all methods exploit the same feature correlations.
The most striking result is from DeepSets: classifying sets of 10 events simultaneously reaches 99.1% accuracy, compared to 77.6% for a single event. This suggests that while the coupling signal in a single event is weak, it becomes highly statistically significant when integrated over multiple events — exactly the situation in a real experimental analysis.
All results are based on truth-level simulation without detector effects. No held-out test set was used; all reported numbers come from the validation split. Future work could apply the framework to detector-level data, use larger datasets, or investigate whether the strong DeepSets results hold under more realistic experimental conditions.
@techreport{benchmark2025vbf,
author = {Foerster, Felix and Schneider, Lars and Mesner, Johannes and Linden, Lars and Stauch, Celine},
title = {A Machine-Learning Benchmark for Semileptonic VBF Higgs-Pair
Events in a Coupling Scan},
institution = {Technical University of Munich / LMU Munich},
year = {2025},
url = {https://spatenfe.github.io/vbf_event_classifier/}
}