Mind the Data, Measuring the Performance Gap Between Tree Ensembles and Deep Learning on Tabular Data
2024 (English). In: Advances in Intelligent Data Analysis XXII: Proceedings, Part I / [ed] Ioanna Miliou; Nico Piatkowski; Panagiotis Papapetrou, Heidelberg: Springer Berlin/Heidelberg, 2024, Vol. 14641, p. 65-76. Conference paper, Published paper (Refereed)
Abstract [en]
Recent machine learning studies on tabular data show that ensembles of decision tree models are more efficient and performant than deep learning models such as Tabular Transformer models. However, as we demonstrate, these studies are limited in scope and do not paint the full picture. In this work, we focus on how two dataset properties, namely dataset size and feature complexity, affect the empirical performance comparison between tree ensembles and Tabular Transformer models. Specifically, we employ a hypothesis-driven approach and identify situations where Tabular Transformer models are expected to outperform tree ensemble models. Through empirical evaluation, we demonstrate that given large enough datasets, deep learning models perform better than tree models. This gets more pronounced when complex feature interactions exist in the given task and dataset, suggesting that one must pay careful attention to dataset properties when selecting a model for tabular data in machine learning – especially in an industrial setting, where larger and larger datasets with less and less carefully engineered features are becoming routinely available. © The Author(s)
Place, publisher, year, edition, pages
Heidelberg: Springer Berlin/Heidelberg, 2024. Vol. 14641, p. 65-76
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 14641
Keywords [en]
Gradient boosting, Tabular data, Tabular Transformers
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:hh:diva-53352
DOI: 10.1007/978-3-031-58547-0_6
Scopus ID: 2-s2.0-85192227414
ISBN: 9783031585463 (print)
OAI: oai:DiVA.org:hh-53352
DiVA, id: diva2:1865781
Conference
22nd International Symposium on Intelligent Data Analysis, IDA 2024, Stockholm, Sweden, April 24–26, 2024
2024-06-05. Bibliographically approved