David Holzmüller
Postdoc in machine learning with Francis Bach & @GaelVaroquaux: neural networks, tabular data, uncertainty, active learning, atomistic ML, learning theory.
https://dholzmueller.github.io
- I got 3rd out of 691 in a tabular Kaggle competition – with only neural networks! 🥉 My solution is short (48 LOC) and relatively general-purpose – I used skrub to preprocess string and date columns, and pytabkit to create an ensemble of RealMLP and TabM models. Link below👇
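A minimal sketch of that kind of pipeline, assuming a regression target (this is not the actual 48-LOC competition code; the pytabkit class names and the n_cv argument are written from memory and may differ, so check the pytabkit README):

```python
# Hedged sketch, not the original solution; pytabkit class names are assumptions.
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import VotingRegressor
from skrub import TableVectorizer                             # encodes string/date columns numerically
from pytabkit import RealMLP_TD_Regressor, TabM_D_Regressor  # assumed names, check the README

pipeline = make_pipeline(
    TableVectorizer(),                                # raw DataFrame -> numeric feature matrix
    VotingRegressor([
        ("realmlp", RealMLP_TD_Regressor(n_cv=5)),    # 5-fold bagged RealMLP (n_cv assumed)
        ("tabm", TabM_D_Regressor(n_cv=5)),           # 5-fold bagged TabM
    ]),
)
# pipeline.fit(X_train_df, y_train) on a raw pandas DataFrame of features,
# then pipeline.predict(X_test_df) for the submission.
```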
- Thanks!
- Solution write-up with additional insights: kaggle.com/competitions... 🥈 2nd place used stacking with diverse models. 🥇 1st place found a larger dataset.
- Excited to have co-contributed the SquashingScaler, which implements the robust numerical preprocessing from RealMLP!
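For intuition, here is a rough numpy sketch of the squashing idea behind that preprocessing (robust quantile scaling followed by a smooth clip so outliers stay bounded); it is not the actual SquashingScaler implementation, and the quantiles and threshold are chosen only for illustration:

```python
# Illustrative sketch of robust-scale-then-squash preprocessing (not the real SquashingScaler).
import numpy as np

def squash_column(x, max_abs=3.0):
    q25, q75 = np.nanquantile(x, [0.25, 0.75])
    scale = q75 - q25
    if scale == 0.0:                                   # guard against (near-)constant columns
        scale = 1.0
    z = (x - np.nanmedian(x)) / scale                  # robust centering and scaling
    return z / np.sqrt(1.0 + (z / max_abs) ** 2)       # smooth clip: outputs stay in (-max_abs, max_abs)

print(squash_column(np.array([0.0, 1.0, 2.0, 1000.0])))  # the outlier stays bounded
```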
- Is it because mathematicians think in terms of the number of assumptions that are satisfied, while physicists think in terms of the number of things that satisfy them?
- 🚨ICLR poster in 1.5 hours, presented by @danielmusekamp.bsky.social: Can active learning help to generate better datasets for neural PDE solvers? We introduce a new benchmark to find out! Featuring 6 PDEs, 6 AL methods, 3 architectures and many ablations - transferability, speed, etc.!
- Poster: iclr.cc/virtual/2025... Hall 3 + Hall 2B #32, 10am Singapore time Paper: arxiv.org/abs/2408.01536
- Practitioners are often sceptical of academic tabular benchmarks, so I am elated to see that our RealMLP model outperformed boosted trees in two 2nd place Kaggle solutions, for a $10,000 forecasting challenge and a research competition on survival analysis.
- Links: www.kaggle.com/competitions... www.kaggle.com/competitions... Link to the repo: github.com/dholzmueller... PS: The newest pytabkit version now includes multiquantile regression for RealMLP and a few other improvements. bsky.app/profile/dhol...
- A new tabular classification benchmark provides another independent evaluation of our RealMLP. RealMLP is the best classical DL model, although some other recent baselines are missing. TabPFN is better on small datasets and boosted trees on larger datasets, though.
- The benchmark is limited to classification with AUC as a metric, which is one of RealMLP’s weaker points. Datasets are from the CC-18 benchmark, and the benchmark uses nested cross-validation unlike many other benchmarks. Link: arxiv.org/abs/2402.039...
- What about work on adaptive learning rates (in the sense of convergence rates, not step sizes) that studies methods with hyperparameter optimization on a holdout set to achieve optimal/good convergence rates simultaneously for different classes of functions? E.g. projecteuclid.org/journals/ann...
- Early stopping on validation loss? This leads to suboptimal calibration and refinement errors—but you can do better! With @dholzmueller.bsky.social, Michael I. Jordan, and @bachfrancis.bsky.social, we propose a method that integrates with any model and boosts classification performance across tasks.
- By the way, I think an intercept in this case is necessary because the logistic regression model does not have an intercept. For more realistic models that can learn an intercept themselves, I think an intercept for TS is probably not very important.
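For concreteness, a small illustrative sketch of binary temperature scaling with an optional intercept, fitted on held-out logits by minimizing the log loss (this is not the code from the paper or benchmark discussed here):

```python
# Illustrative binary temperature scaling with an intercept: p = sigmoid(logit / T + b).
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

def fit_ts_with_intercept(logits_val, y_val):
    def nll(params):
        log_temp, b = params
        p = expit(logits_val / np.exp(log_temp) + b)   # exp keeps the temperature positive
        eps = 1e-12
        return -np.mean(y_val * np.log(p + eps) + (1 - y_val) * np.log(1 - p + eps))
    res = minimize(nll, x0=np.zeros(2), method="Nelder-Mead")
    log_temp, b = res.x
    return np.exp(log_temp), b

temp, bias = fit_ts_with_intercept(np.array([-2.0, 0.5, 1.5, 3.0]), np.array([0, 0, 1, 1]))
```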
- The library offers the same for XGBoost and LightGBM. Plus, the library includes some of the best tabular DL models like RealTabR, TabR, RealMLP, and TabM that could also be interesting to try. (ModernNCA is also very good but not included.)
- Finally, if you just want to have the best performance for a given (large) time budget, AutoGluon combines many tabular models. It does not include some of the latest models (yet), but has a very good CatBoost, for example, and will likely outperform individual models.
- Interesting! Would be cool to have these datasets on OpenML as well so they are easy to use in tabular benchmarks. Here are some more recommendations for stronger tabular baselines: 1. For CatBoost and XGBoost, you'd want at least early stopping to select the best iteration.
- Using my library github.com/dholzmueller... you could, for example, use CatBoost_TD_Regressor(n_cv=5), which will use better default parameters for regression, train five models in a cross-validation setup, select the best iteration for each, and ensemble them.
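To make the two options concrete, a hedged sketch with synthetic data: plain CatBoost with early stopping on a held-out split, next to the pytabkit one-liner quoted above:

```python
# Sketch: early stopping for CatBoost vs. the pytabkit cross-validated ensemble (synthetic data).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = CatBoostRegressor(iterations=2000, verbose=False)
model.fit(X_tr, y_tr, eval_set=(X_val, y_val),
          early_stopping_rounds=100, use_best_model=True)   # keep the best iteration

# pytabkit alternative, as in the post above: better regression defaults,
# 5-fold cross-validation training, per-fold best-iteration selection, and ensembling.
from pytabkit import CatBoost_TD_Regressor
model_cv = CatBoost_TD_Regressor(n_cv=5)
model_cv.fit(X, y)
```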
- I think Dirichlet scaling (or the binary version Beta scaling) also includes an intercept but I'm not sure. In my experience it's very slow, though, and not better than temperature scaling at least on smaller datasets (~1K-10K calibration samples).
- github.com/EFS-OpenSour... has some calibration methods like this implemented, but their temperature scaling MLE version has a bug where it doesn't optimize, so I didn't include it in our benchmark.
- I couldn't find them in @dholzmueller.bsky.social's github.com/dholzmueller...
- There is an adapter for Dirichlet scaling, which is basically regularized matrix scaling. (Except that matrix scaling can exploit shifts in the logits, which a true post-hoc calibration method like Dirichlet scaling can't IIUC).
- The first independent evaluation of our RealMLP is here! On a recent 300-dataset benchmark with many baselines, RealMLP takes a shared first place overall. 🔥 Importantly, RealMLP is also relatively CPU-friendly, unlike other SOTA DL models (including TabPFNv2 and TabM). 🧵 1/
- The benchmark: arxiv.org/abs/2407.00956 RealMLP: github.com/dholzmueller... 5/ bsky.app/profile/dhol...
- In case anyone is wondering about the name RealMLP, it is motivated by the “Real MVP” meme (which probably also inspired the RealNVP method). 6/6
- When including more baselines, RealMLP’s average rank slightly improves to make it the top-performing method overall, with a fifth place on binary classification, first place on multi-class, and second place on regression. 3/
- It is surprising how many DL methods perform worse than the simple MLP baseline by Gorishniy, @puhsu.bsky.social et al. This highlights the benchmarking problems in the field (and potentially the difficulty in using many of these models correctly). The situation is slowly improving. 4/
- Some caveats: All DL models are trained with a batch size of 1024, while we recommend using 256 for RealMLP on medium-sized datasets. Other choices (selection of datasets, not using bagging, choice of metrics, search spaces for baselines) can of course also influence results. 2/
- Join us on 27 Feb in Amsterdam for the ELLIS workshop on Representation Learning and Generative Models for Structured Data ✨ sites.google.com/view/rl-and-... Inspiring talks by @eisenjulian.bsky.social, @neuralnoise.com, Frank Hutter, Vaishali Pal, TBC. We welcome extended abstracts until 31 Jan!
- Is "classical" supervised tabular learning also part of the workshop?
- by "classical" I mean deep learning models, just for supervised learning
- I'll present our paper in the afternoon poster session, 4:30-7:30 pm, in East Exhibit Hall A-C, poster 3304!
- We wrote a benchmark paper with many practical insights on (the benefits of) active learning for training neural PDE solvers. 🚀 I was happy to be a co-advisor on this project - most of the credit goes to Daniel and Marimuthu.
- I'll be at #NeurIPS2024 next week to present this paper (Thu afternoon) as well as a workshop paper on active learning for neural PDE solvers. Let me know if you'd like to chat about tabular data, uncertainty, active learning, etc.!
- If you have train+validation data, should you refit on the whole data with the stopping epoch found on the train-validation split? In the quoted paper, we did an experiment including 5-fold ensembles on 5-fold cross-validation splits (bagging) and with refitting. (short 🧵)
- It is reassuring that the best (average or individual) stopping epoch from bagging works well for RealMLP in the refitting setting, where no validation set is available. It would be interesting to see if this holds up in the non-iid setting with time-based splits. 3/
- We also have results for LightGBM with our tuned default hyperparameters (LGBM-TD), but they are somewhat similar and the behavior might depend on the “subsample” hyperparameter (which is related to bagging). 4/
- The result? Refitting is a bit better, but only if you fit an ensemble during refitting. But: it’s slower, you don’t get validation scores for the refitted models, the result might change with more folds, and tuning the hyperparameters on the CV scores may favor bagging. 2/
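A hedged sketch of the two options compared in this thread, using LightGBM with early stopping as a stand-in model (the RealMLP experiments in the paper are more involved): either bag the early-stopped fold models, or refit once on all data with the stopping point found via cross-validation.

```python
# Sketch: bagging early-stopped CV models vs. refitting on all data (synthetic stand-in).
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_trval, X_test, y_trval, y_test = train_test_split(X, y, random_state=0)

fold_models, best_iters = [], []
for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X_trval):
    m = lgb.LGBMRegressor(n_estimators=5000, learning_rate=0.05)
    m.fit(X_trval[tr], y_trval[tr], eval_set=[(X_trval[va], y_trval[va])],
          callbacks=[lgb.early_stopping(100, verbose=False)])
    fold_models.append(m)
    best_iters.append(m.best_iteration_)

# Option (a), bagging: average the predictions of the early-stopped fold models.
pred_bagged = np.mean([m.predict(X_test) for m in fold_models], axis=0)

# Option (b), refitting: one model on all train+validation data with the average stopping point.
refit = lgb.LGBMRegressor(n_estimators=int(np.mean(best_iters)), learning_rate=0.05)
refit.fit(X_trval, y_trval)
pred_refit = refit.predict(X_test)
```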
- No problem, and thanks for thinking of me 🙂
- Not really...
- PyTabKit 1.1 is out! - Includes TabM and provides a scikit-learn interface - some baseline NN parameter names are renamed (removed double-underscores) - other small changes, see the readme. github.com/dholzmueller...
- Yes
- One thing I learned from this project is that accuracy is a quite noisy metric. With small validation sets (~1K samples), hyperparameter opt. using AUROC instead of accuracy can yield better accuracy on the test set. We also did some experiments on metrics for early stopping. 🧵
- For early stopping on boosted trees, using accuracy as the val metric requires high patience. Brier loss yields similar test accuracy for high patience but is less sensitive to patience. Cross-entropy (the train metric) is even less sensitive but not as good for test accuracy. 2/
- A reason for the different sensitivities may also be that val metrics that are more similar to the train loss are more likely to decrease monotonically, and therefore have less risk of stopping too early. For regression with MSE we found little sensitivity to the patience. 3/
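The patience logic in question, as a small illustrative sketch with made-up numbers: a noisy, plateau-prone validation metric like accuracy needs a larger patience to avoid stopping well before the best epoch.

```python
# Illustrative patience-based early stopping on a per-epoch validation metric.
def early_stopping_epoch(val_metric_per_epoch, patience, higher_is_better=False):
    best_epoch, best_val, epochs_since_best = 0, None, 0
    for epoch, v in enumerate(val_metric_per_epoch):
        improved = best_val is None or (v > best_val if higher_is_better else v < best_val)
        if improved:
            best_epoch, best_val, epochs_since_best = epoch, v, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                                  # stop; keep the best epoch seen so far
    return best_epoch

accs = [0.70, 0.74, 0.74, 0.73, 0.74, 0.76, 0.77, 0.77, 0.78]  # made-up validation accuracies
print(early_stopping_epoch(accs, patience=2, higher_is_better=True))  # 1: stopped too early
print(early_stopping_epoch(accs, patience=5, higher_is_better=True))  # 8: higher patience helps
```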
- I did a small test with TabM-mini and 5-fold bagging, only default parameters with numerical embeddings. It seems that it's roughly comparable with RealMLP. But then maybe RealMLP can benefit more from additional ensembling or the two could be combined. A fair comparison with ensembling is hard.
- By demand, I've created the final starter pack in my ML Personality Starter Pack Series. I'm uncertain who belongs in this starter pack, so if you think you'd fit better in the Grumpy ML or Unreasonably Upbeat ML starter packs, let me know. (Self) nominations welcome: go.bsky.app/5Suyk58
- I think I would fit here 🙂
- In our experiments, the last-layer and RP features perform similarly well. But we find that predictive uncertainties computed with RP features are often more strongly correlated with the actual error than the last-layer uncertainties. 14/
- Overall, we find that the considered batch active learning methods typically outperform conventional active learning methods. Out of the former methods, LCMD is more tailored towards average-case performance, while MaxDet and MaxDist are better for worst-case performance. 15/
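For intuition, a hedged sketch of the MaxDist idea as greedy farthest-point selection in some feature space; the actual method in the paper operates on (sketched) kernel-induced distances, so take this only as the basic selection rule.

```python
# Sketch of MaxDist-style batch selection: greedily pick pool points farthest from
# everything already labeled or already selected (plain farthest-point sampling).
import numpy as np

def maxdist_select(pool_feats, train_feats, batch_size):
    # distance of each pool point to its nearest labeled point
    d = np.min(np.linalg.norm(pool_feats[:, None] - train_feats[None], axis=-1), axis=1)
    selected = []
    for _ in range(batch_size):
        i = int(np.argmax(d))                          # farthest remaining pool point
        selected.append(i)
        d = np.minimum(d, np.linalg.norm(pool_feats - pool_feats[i], axis=-1))
    return selected

rng = np.random.default_rng(0)
batch = maxdist_select(rng.normal(size=(1000, 16)), rng.normal(size=(50, 16)), batch_size=10)
```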
- We updated our deep batch active learning paper and added BAIT, another selection method from the literature. For regression, BAIT is related to Bayesian V-optimal design and greedily selects points to minimize the average posterior variance on the train and pool set.
- When combined with our sketched full-gradient kernel, BAIT is competitive on all metrics, especially in the first few steps. In the end, LCMD still performs slightly better for RMSE and MAE.
- Coauthors: @ViktorZaverkin, @GroupKaestner, Ingo Steinwart. Funding: @SimTechStuttga2, @Uni_Stuttgart, @studienstiftung
- In our previous work, we showed that the finite-width Neural Tangent Kernel (NTK) can be somewhat efficiently computed for fully-connected NNs. This is no longer true for atomistic NNs. However, the randomly projected finite-width NTK (RP) is still efficiently computable. 13/
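A rough PyTorch sketch of the random-projection idea (illustrative only, not the paper's implementation): the finite-width NTK is the inner product of per-sample parameter gradients, and projecting those gradients with a random matrix yields a cheap low-dimensional approximation of that kernel.

```python
# Sketch: randomly projected gradient features approximating the finite-width NTK.
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.SiLU(), torch.nn.Linear(32, 1))
params = list(model.parameters())
n_params = sum(p.numel() for p in params)
proj = torch.randn(n_params, 256) / 256 ** 0.5          # random projection, E[P P^T] = identity

def grad_features(x):
    out = model(x[None]).squeeze()                       # scalar prediction for one sample
    grads = torch.autograd.grad(out, params)
    return torch.cat([g.reshape(-1) for g in grads])     # flattened parameter gradient

X = torch.randn(16, 8)
phi = torch.stack([grad_features(x) for x in X])         # exact gradient features (16, n_params)
ntk_exact = phi @ phi.T                                  # finite-width NTK on these samples
phi_rp = phi @ proj                                      # randomly projected features (16, 256)
ntk_rp = phi_rp @ phi_rp.T                               # RP approximation of the NTK
print(torch.linalg.norm(ntk_exact - ntk_rp) / torch.linalg.norm(ntk_exact))
```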