David Holzmüller
Postdoc in machine learning with Francis Bach & @GaelVaroquaux: neural networks, tabular data, uncertainty, active learning, atomistic ML, learning theory.
https://dholzmueller.github.io
- I got 3rd out of 691 in a tabular Kaggle competition – with only neural networks! 🥉 My solution is short (48 LOC) and relatively general-purpose – I used skrub to preprocess string and date columns, and pytabkit to create an ensemble of RealMLP and TabM models. Link below👇
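A minimal sketch of that kind of pipeline, assuming a regression target (this is not the actual 48-LOC competition code; the pytabkit class names and the n_cv argument are written from memory and may differ, so check the pytabkit README):

```python
# Hedged sketch, not the original solution; pytabkit class names are assumptions.
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import VotingRegressor
from skrub import TableVectorizer                             # encodes string/date columns numerically
from pytabkit import RealMLP_TD_Regressor, TabM_D_Regressor  # assumed names, check the README

pipeline = make_pipeline(
    TableVectorizer(),                                # raw DataFrame -> numeric feature matrix
    VotingRegressor([
        ("realmlp", RealMLP_TD_Regressor(n_cv=5)),    # 5-fold bagged RealMLP (n_cv assumed)
        ("tabm", TabM_D_Regressor(n_cv=5)),           # 5-fold bagged TabM
    ]),
)
# pipeline.fit(X_train_df, y_train) on a raw pandas DataFrame of features,
# then pipeline.predict(X_test_df) for the submission.
```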
- Thanks!
- Solution write-up with additional insights: kaggle.com/competitions... 🥈 2nd place used stacking with diverse models. 🥇 1st place found a larger dataset.
- Excited to have co-contributed the SquashingScaler, which implements the robust numerical preprocessing from RealMLP!
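For intuition, here is a rough numpy sketch of the squashing idea behind that preprocessing (robust quantile scaling followed by a smooth clip so outliers stay bounded); it is not the actual SquashingScaler implementation, and the quantiles and threshold are chosen only for illustration:

```python
# Illustrative sketch of robust-scale-then-squash preprocessing (not the real SquashingScaler).
import numpy as np

def squash_column(x, max_abs=3.0):
    q25, q75 = np.nanquantile(x, [0.25, 0.75])
    scale = q75 - q25
    if scale == 0.0:                                   # guard against (near-)constant columns
        scale = 1.0
    z = (x - np.nanmedian(x)) / scale                  # robust centering and scaling
    return z / np.sqrt(1.0 + (z / max_abs) ** 2)       # smooth clip: outputs stay in (-max_abs, max_abs)

print(squash_column(np.array([0.0, 1.0, 2.0, 1000.0])))  # the outlier stays bounded
```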
- Is it because mathematicians think in terms of the number of assumptions that are satisfied, while physicists think in terms of the number of things that satisfy them?
- 🚨ICLR poster in 1.5 hours, presented by @danielmusekamp.bsky.social: Can active learning help to generate better datasets for neural PDE solvers? We introduce a new benchmark to find out! Featuring 6 PDEs, 6 AL methods, 3 architectures and many ablations - transferability, speed, etc.!
- Poster: iclr.cc/virtual/2025... Hall 3 + Hall 2B #32, 10am Singapore time Paper: arxiv.org/abs/2408.01536
- Practitioners are often sceptical of academic tabular benchmarks, so I am elated to see that our RealMLP model outperformed boosted trees in two 2nd place Kaggle solutions, for a $10,000 forecasting challenge and a research competition on survival analysis.
- Links: www.kaggle.com/competitions... www.kaggle.com/competitions... Link to the repo: github.com/dholzmueller... PS: The newest pytabkit version now includes multiquantile regression for RealMLP and a few other improvements. bsky.app/profile/dhol...
- A new tabular classification benchmark provides another independent evaluation of our RealMLP. RealMLP is the best classical DL model, although some other recent baselines are missing. TabPFN is better on small datasets and boosted trees on larger datasets, though.
- The benchmark is limited to classification with AUC as a metric, which is one of RealMLP’s weaker points. Datasets are from the CC-18 benchmark, and the benchmark uses nested cross-validation unlike many other benchmarks. Link: arxiv.org/abs/2402.039...
- What about work on adaptive learning rates (in the sense of convergence rates, not step sizes) that studies methods with hyperparameter optimization on a holdout set to achieve optimal/good convergence rates simultaneously for different classes of functions? E.g. projecteuclid.org/journals/ann...
- Early stopping on validation loss? This leads to suboptimal calibration and refinement errors—but you can do better! With @dholzmueller.bsky.social, Michael I. Jordan, and @bachfrancis.bsky.social, we propose a method that integrates with any model and boosts classification performance across tasks.
- By the way, I think an intercept in this case is necessary because the logistic regression model does not have an intercept. For more realistic models that can learn an intercept themselves, I think an intercept for TS is probably not very important.
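For concreteness, a small illustrative sketch of binary temperature scaling with an optional intercept, fitted on held-out logits by minimizing the log loss (this is not the code from the paper or benchmark discussed here):

```python
# Illustrative binary temperature scaling with an intercept: p = sigmoid(logit / T + b).
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

def fit_ts_with_intercept(logits_val, y_val):
    def nll(params):
        log_temp, b = params
        p = expit(logits_val / np.exp(log_temp) + b)   # exp keeps the temperature positive
        eps = 1e-12
        return -np.mean(y_val * np.log(p + eps) + (1 - y_val) * np.log(1 - p + eps))
    res = minimize(nll, x0=np.zeros(2), method="Nelder-Mead")
    log_temp, b = res.x
    return np.exp(log_temp), b

temp, bias = fit_ts_with_intercept(np.array([-2.0, 0.5, 1.5, 3.0]), np.array([0, 0, 1, 1]))
```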
- The library offers the same for XGBoost and LightGBM. Plus, the library includes some of the best tabular DL models like RealTabR, TabR, RealMLP, and TabM that could also be interesting to try. (ModernNCA is also very good but not included.)
- Finally, if you just want to have the best performance for a given (large) time budget, AutoGluon combines many tabular models. It does not include some of the latest models (yet), but has a very good CatBoost, for example, and will likely outperform individual models.
- Interesting! Would be cool to have these datasets on OpenML as well so they are easy to use in tabular benchmarks. Here are some more recommendations for stronger tabular baselines: 1. For CatBoost and XGBoost, you'd want at least early stopping to select the best iteration.
- Using my library github.com/dholzmueller... you could, for example, use CatBoost_TD_Regressor(n_cv=5), which will use better default parameters for regression, train five models in a cross-validation setup, select the best iteration for each, and ensemble them.
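To make the two options concrete, a hedged sketch with synthetic data: plain CatBoost with early stopping on a held-out split, next to the pytabkit one-liner quoted above:

```python
# Sketch: early stopping for CatBoost vs. the pytabkit cross-validated ensemble (synthetic data).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = CatBoostRegressor(iterations=2000, verbose=False)
model.fit(X_tr, y_tr, eval_set=(X_val, y_val),
          early_stopping_rounds=100, use_best_model=True)   # keep the best iteration

# pytabkit alternative, as in the post above: better regression defaults,
# 5-fold cross-validation training, per-fold best-iteration selection, and ensembling.
from pytabkit import CatBoost_TD_Regressor
model_cv = CatBoost_TD_Regressor(n_cv=5)
model_cv.fit(X, y)
```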
- I think Dirichlet scaling (or the binary version Beta scaling) also includes an intercept but I'm not sure. In my experience it's very slow, though, and not better than temperature scaling at least on smaller datasets (~1K-10K calibration samples).
- github.com/EFS-OpenSour... has some calibration methods like this implemented, but their temperature scaling MLE version has a bug where it doesn't optimize, so I didn't include it in our benchmark.
- I couldn't find them in @dholzmueller.bsky.social's github.com/dholzmueller...
- There is an adapter for Dirichlet scaling, which is basically regularized matrix scaling. (Except that matrix scaling can exploit shifts in the logits, which a true post-hoc calibration method like Dirichlet scaling can't IIUC).
- The first independent evaluation of our RealMLP is here! On a recent 300-dataset benchmark with many baselines, RealMLP takes a shared first place overall. 🔥 Importantly, RealMLP is also relatively CPU-friendly, unlike other SOTA DL models (including TabPFNv2 and TabM). 🧵 1/
- The benchmark: arxiv.org/abs/2407.00956 RealMLP: github.com/dholzmueller... 5/ bsky.app/profile/dhol...
- In case anyone is wondering about the name RealMLP, it is motivated by the “Real MVP” meme (which probably also inspired the RealNVP method). 6/6
- When including more baselines, RealMLP’s average rank slightly improves to make it the top-performing method overall, with a fifth place on binary classification, first place on multi-class, and second place on regression. 3/
- It is surprising how many DL methods perform worse than the simple MLP baseline by Gorishniy, @puhsu.bsky.social et al. This highlights the benchmarking problems in the field (and potentially the difficulty in using many of these models correctly). The situation is slowly improving. 4/
- Some caveats: All DL models are trained with a batch size of 1024, while we recommend using 256 for RealMLP on medium-sized datasets. Other choices (selection of datasets, not using bagging, choice of metrics, search spaces for baselines) can of course also influence results. 2/
- Join us on 27 Feb in Amsterdam for the ELLIS workshop on Representation Learning and Generative Models for Structured Data ✨ sites.google.com/view/rl-and-... Inspiring talks by @eisenjulian.bsky.social, @neuralnoise.com, Frank Hutter, Vaishali Pal, TBC. We welcome extended abstracts until 31 Jan!
- Is "classical" supervised tabular learning also part of the workshop?
- by "classical" I mean deep learning models, just for supervised learning
- I'll present our paper in the afternoon poster session, 4:30-7:30 pm, in East Exhibit Hall A-C, poster 3304!
- We wrote a benchmark paper with many practical insights on (the benefits of) active learning for training neural PDE solvers. 🚀 I was happy to be a co-advisor on this project - most of the credit goes to Daniel and Marimuthu.
- I'll be at #NeurIPS2024 next week to present this paper (Thu afternoon) as well as a workshop paper on active learning for neural PDE solvers. Let me know if you'd like to chat about tabular data, uncertainty, active learning, etc.!
- If you have train+validation data, should you refit on the whole data with the stopping epoch found on the train-validation split? In the quoted paper, we did an experiment including 5-fold ensembles on 5-fold cross-validation splits (bagging) and with refitting. (short 🧵)
- It is reassuring that the best (average or individual) stopping epoch from bagging works well for RealMLP in the refitting setting, where no validation set is available. It would be interesting to see if this holds up in the non-iid setting with time-based splits. 3/
- We also have results for LightGBM with our tuned default hyperparameters (LGBM-TD), but they are somewhat similar and the behavior might depend on the “subsample” hyperparameter (which is related to bagging). 4/
- The result? Refitting is a bit better, but only if you fit an ensemble during refitting. But: it’s slower, you don’t get validation scores for the refitted models, the result might change with more folds, and tuning the hyperparameters on the CV scores may favor bagging. 2/
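A hedged sketch of the two options compared in this thread, using LightGBM with early stopping as a stand-in model (the RealMLP experiments in the paper are more involved): either bag the early-stopped fold models, or refit once on all data with the stopping point found via cross-validation.

```python
# Sketch: bagging early-stopped CV models vs. refitting on all data (synthetic stand-in).
import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_trval, X_test, y_trval, y_test = train_test_split(X, y, random_state=0)

fold_models, best_iters = [], []
for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X_trval):
    m = lgb.LGBMRegressor(n_estimators=5000, learning_rate=0.05)
    m.fit(X_trval[tr], y_trval[tr], eval_set=[(X_trval[va], y_trval[va])],
          callbacks=[lgb.early_stopping(100, verbose=False)])
    fold_models.append(m)
    best_iters.append(m.best_iteration_)

# Option (a), bagging: average the predictions of the early-stopped fold models.
pred_bagged = np.mean([m.predict(X_test) for m in fold_models], axis=0)

# Option (b), refitting: one model on all train+validation data with the average stopping point.
refit = lgb.LGBMRegressor(n_estimators=int(np.mean(best_iters)), learning_rate=0.05)
refit.fit(X_trval, y_trval)
pred_refit = refit.predict(X_test)
```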
- No problem, and thanks for thinking of me 🙂
- Not really...
- PyTabKit 1.1 is out! - Includes TabM and provides a scikit-learn interface - some baseline NN parameter names are renamed (removed double-underscores) - other small changes, see the readme. github.com/dholzmueller...
- Yes
- One thing I learned from this project is that accuracy is a quite noisy metric. With small validation sets (~1K samples), hyperparameter opt. using AUROC instead of accuracy can yield better accuracy on the test set. We also did some experiments on metrics for early stopping. 🧵
- For early stopping on boosted trees, using accuracy as the val metric requires high patience. Brier loss yields similar test accuracy for high patience but is less sensitive to patience. Cross-entropy (the train metric) is even less sensitive but not as good for test accuracy. 2/
- A reason for the different sensitivities may also be that val metrics that are more similar to the train loss are more likely to decrease monotonically, and therefore have less risk of stopping too early. For regression with MSE we found little sensitivity to the patience. 3/
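The patience logic in question, as a small illustrative sketch with made-up numbers: a noisy, plateau-prone validation metric like accuracy needs a larger patience to avoid stopping well before the best epoch.

```python
# Illustrative patience-based early stopping on a per-epoch validation metric.
def early_stopping_epoch(val_metric_per_epoch, patience, higher_is_better=False):
    best_epoch, best_val, epochs_since_best = 0, None, 0
    for epoch, v in enumerate(val_metric_per_epoch):
        improved = best_val is None or (v > best_val if higher_is_better else v < best_val)
        if improved:
            best_epoch, best_val, epochs_since_best = epoch, v, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                                  # stop; keep the best epoch seen so far
    return best_epoch

accs = [0.70, 0.74, 0.74, 0.73, 0.74, 0.76, 0.77, 0.77, 0.78]  # made-up validation accuracies
print(early_stopping_epoch(accs, patience=2, higher_is_better=True))  # 1: stopped too early
print(early_stopping_epoch(accs, patience=5, higher_is_better=True))  # 8: higher patience helps
```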
- I did a small test with TabM-mini and 5-fold bagging, only default parameters with numerical embeddings. It seems that it's roughly comparable with RealMLP. But then maybe RealMLP can benefit more from additional ensembling or the two could be combined. A fair comparison with ensembling is hard.
- By demand, I've created the final starter pack in my ML Personality Starter Pack Series. I'm uncertain who belongs in this starter pack, so if you think you'd fit better in the Grumpy ML or Unreasonably Upbeat ML starter packs, let me know. (Self) nominations welcome: go.bsky.app/5Suyk58
- I think I would fit here 🙂
- In our experiments, the last-layer and RP features perform similarly well. But we find that predictive uncertainties computed with RP features are often more strongly correlated with the actual error than the last-layer uncertainties. 14/
- Overall, we find that the considered batch active learning methods typically outperform conventional active learning methods. Out of the former methods, LCMD is more tailored towards average-case performance, while MaxDet and MaxDist are better for worst-case performance. 15/
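For intuition, a hedged sketch of the MaxDist idea as greedy farthest-point selection in some feature space; the actual method in the paper operates on (sketched) kernel-induced distances, so take this only as the basic selection rule.

```python
# Sketch of MaxDist-style batch selection: greedily pick pool points farthest from
# everything already labeled or already selected (plain farthest-point sampling).
import numpy as np

def maxdist_select(pool_feats, train_feats, batch_size):
    # distance of each pool point to its nearest labeled point
    d = np.min(np.linalg.norm(pool_feats[:, None] - train_feats[None], axis=-1), axis=1)
    selected = []
    for _ in range(batch_size):
        i = int(np.argmax(d))                          # farthest remaining pool point
        selected.append(i)
        d = np.minimum(d, np.linalg.norm(pool_feats - pool_feats[i], axis=-1))
    return selected

rng = np.random.default_rng(0)
batch = maxdist_select(rng.normal(size=(1000, 16)), rng.normal(size=(50, 16)), batch_size=10)
```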
- We updated our deep batch active learning paper and added BAIT, another selection method from the literature. For regression, BAIT is related to Bayesian V-optimal design and greedily selects points to minimize the average posterior variance on the train and pool set.
- When combined with our sketched full-gradient kernel, BAIT is competitive on all metrics, especially in the first few steps. In the end, LCMD still performs slightly better for RMSE and MAE.
- Coauthors: @ViktorZaverkin, @GroupKaestner, Ingo Steinwart. Funding: @SimTechStuttga2, @Uni_Stuttgart, @studienstiftung
- In our previous work, we showed that the finite-width Neural Tangent Kernel (NTK) can be somewhat efficiently computed for fully-connected NNs. This is no longer true for atomistic NNs. However, the randomly projected finite-width NTK (RP) is still efficiently computable. 13/
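A rough PyTorch sketch of the random-projection idea (illustrative only, not the paper's implementation): the finite-width NTK is the inner product of per-sample parameter gradients, and projecting those gradients with a random matrix yields a cheap low-dimensional approximation of that kernel.

```python
# Sketch: randomly projected gradient features approximating the finite-width NTK.
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.SiLU(), torch.nn.Linear(32, 1))
params = list(model.parameters())
n_params = sum(p.numel() for p in params)
proj = torch.randn(n_params, 256) / 256 ** 0.5          # random projection, E[P P^T] = identity

def grad_features(x):
    out = model(x[None]).squeeze()                       # scalar prediction for one sample
    grads = torch.autograd.grad(out, params)
    return torch.cat([g.reshape(-1) for g in grads])     # flattened parameter gradient

X = torch.randn(16, 8)
phi = torch.stack([grad_features(x) for x in X])         # exact gradient features (16, n_params)
ntk_exact = phi @ phi.T                                  # finite-width NTK on these samples
phi_rp = phi @ proj                                      # randomly projected features (16, 256)
ntk_rp = phi_rp @ phi_rp.T                               # RP approximation of the NTK
print(torch.linalg.norm(ntk_exact - ntk_rp) / torch.linalg.norm(ntk_exact))
```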