Published March 3, 2026 | Version v1
Dataset Open

High-quality, high-information datasets for universal atomistic machine learning

  • 1. École Polytechnique Fédérale de Lausanne
  • 2. University of Cambridge
  • 3. Max Planck Institute for the Structure and Dynamics of Matter

* Contact person

Description

The quality, consistency, and information content of training data often determine the practical value of machine-learning models for atomistic simulations. Yet many widely used electronic-structure databases are assembled with materials screening, rather than robust force-field learning, as their primary goal; are limited in scope to a specific class of chemical compounds; or employ inconsistent DFT functionals and settings. Here we introduce MAD-1.5, a highly curated dataset designed explicitly for training broadly applicable atomistic models across the periodic table at a high level of theory. MAD-1.5 extends the MAD dataset with targeted enrichment strategies that widen the coverage of chemical space to 102 elements while keeping the total number of configurations compact. All structures are computed with a single, standardized all-electron DFT workflow using the r2SCAN meta-GGA functional and consistent convergence settings, ensuring uniformity across chemically heterogeneous systems. The dataset encompasses molecules, clusters, bulk crystals, surfaces, and low-dimensional structures, and its quality and consistency are further enhanced by removing outliers identified through uncertainty quantification. We demonstrate the accuracy that can be achieved with the proposed dataset by training PET-MAD-1.5, a generally applicable r2SCAN interatomic potential that covers 102 elements of the periodic table and achieves exceptional levels of benchmark accuracy and stability in challenging simulation protocols.

Files (684.6 MiB)

MD5 checksum                          Size
md5:24bf9cb988bc197477cde6b910e29f40  1.5 KiB
md5:e6db1965a093007a892f0f7033590131  291.6 MiB
md5:38803179ed2bc815eb786dea410383b9  4.6 MiB
md5:4ee0ece5c0206ec446a4d5e9b8476539  37.8 MiB
md5:59311df14fe3f40a018835593fbbb19b  312.9 MiB
md5:468f8dba968e7001113717edb94caef4  37.7 MiB
md5:3b4a7a1f87c0a8151dbd6fca65bdf656  2.7 KiB
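
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library. The function names and the local file name are hypothetical (the record's file names are not shown in this listing), so substitute the path of the file you actually downloaded and the checksum of the matching entry.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file against a checksum written as 'md5:<hex>' or as plain hex."""
    expected = listed_checksum.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


if __name__ == "__main__":
    # Hypothetical local file name; the actual file names are not given here.
    local_file = Path("downloaded_file.bin")
    # Checksum of the 291.6 MiB entry in the listing above.
    listed = "md5:e6db1965a093007a892f0f7033590131"
    if local_file.exists():
        print("OK" if matches_listing(local_file, listed) else "MISMATCH")
    else:
        print(f"{local_file} not found; download it first")
```

Reading in chunks keeps memory use flat for the larger archives (around 300 MiB each). MD5 is adequate here for detecting accidental corruption, though it should not be relied on as protection against deliberate tampering.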

References

Preprint (Paper in which the dataset is described)
High-quality, high-information datasets for universal atomistic machine learning, C. Malosso, F. Bigi, P. Pegolo, J.W. Abbott, P. Loche, M. Rossi, M. Ceriotti, A. Mazitov, arXiv (2026), doi: 10.48550/arXiv.2603.02089