Published June 26, 2025 | Version v1
Dataset Open

Massive Atomic Diversity: a compact universal dataset for atomistic machine learning

  • 1. Laboratory of Computational Science and Modeling, Institut des Matériaux, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
  • 2. PSI Center for Scientific Computing, Theory and Data, 5232 Villigen PSI, Switzerland
  • 3. National Centre for Computational Design and Discovery of Novel Materials (MARVEL), 5232 Villigen PSI, Switzerland
  • 4. BASF SE, Carl-Bosch-Strasse 38, 67056 Ludwigshafen, Germany

Description

The development of machine-learning models for atomic-scale simulations has benefited tremendously from the large databases of materials and molecular properties computed over the past two decades using electronic-structure calculations. More recently, these databases have made it possible to train “universal” models that aim to make accurate predictions for arbitrary atomic geometries and compositions. Many of these databases, however, were themselves built for materials discovery: they primarily sample stable, or at least plausible, structures, and aim for the most accurate prediction for each compound, e.g. by adjusting the calculation details to the material at hand. Here we introduce a dataset designed specifically to train models that can provide reasonable predictions for arbitrary structures, and that therefore follows a different philosophy. Starting from relatively small sets of stable structures, the dataset is built to contain “massive atomic diversity” (MAD) by aggressively distorting these configurations, with near-complete disregard for the stability of the resulting configurations. The electronic-structure details, on the other hand, are chosen to maximize consistency rather than to obtain the most accurate prediction for a given structure or to minimize computational effort. The MAD dataset we present here, despite containing fewer than 100k structures, has already been shown to enable training universal interatomic potentials that are competitive with models trained on traditional datasets with two to three orders of magnitude more structures. We describe in detail the philosophy and construction of the MAD dataset. We also introduce a low-dimensional structural latent space that allows us to compare it with other popular datasets, and that can also be used as a general-purpose materials cartography tool.
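
To make the idea of distorting a stable configuration concrete, the sketch below applies a random symmetric strain to the cell and then displaces every atom by Gaussian noise. This is only an illustration under assumed conventions: the function name `distort`, the parameters `sigma` and `max_strain`, and their default values are hypothetical, and the description above does not state that the dataset was generated this way.

```python
import numpy as np


def distort(positions, cell, sigma=0.5, max_strain=0.2, seed=0):
    """Strain a periodic cell and rattle its atoms (illustrative only).

    positions: (N, 3) Cartesian coordinates, one atom per row.
    cell:      (3, 3) lattice vectors, one vector per row.
    sigma:     standard deviation of the Gaussian displacements (assumed value).
    max_strain: bound on each random strain component (assumed value).
    """
    rng = np.random.default_rng(seed)
    positions = np.asarray(positions, dtype=float)
    cell = np.asarray(cell, dtype=float)

    # Random symmetric strain tensor, so the deformation contains no rotation.
    eps = rng.uniform(-max_strain, max_strain, size=(3, 3))
    eps = 0.5 * (eps + eps.T)
    deformation = np.eye(3) + eps

    # Apply the same affine map to lattice vectors and atoms, which keeps
    # fractional coordinates unchanged before the rattling step.
    new_cell = cell @ deformation
    new_positions = positions @ deformation

    # Rattle: independent Gaussian displacement on every Cartesian component.
    new_positions = new_positions + rng.normal(scale=sigma, size=positions.shape)
    return new_positions, new_cell
```

Large values of `sigma` or `max_strain` quickly produce unphysical geometries, which is in keeping with sampling configurations without regard for their stability.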

Files

Files (16.4 GiB)

MD5 checksum                            Size
md5:f223f85f790c009f4808bf445351ce84    2.0 KiB
md5:14a9079a5141b77afd467d64d7ab4ba6    5.8 MiB
md5:5703460edacad3e1e7b9b424faac2e03    6.2 MiB
md5:74cc63b6feb3d1ec58578cb970f37ac6    4.9 MiB
md5:920c27332ee3ef4a8346460946102406    483.1 KiB
md5:268025e576e35c611f38a391ff0cfe83    119.1 MiB
md5:72c96e8be0cece6770ab56fc753a437c    181.2 MiB
md5:923732790755a64d5b9b875e7ae7aece    1.1 GiB
md5:7c732b634ef59ebf8621c7cae1bdea04    9.5 GiB
md5:89f0da3e1aa770e692854e30aaff06af    341.9 MiB
md5:e29a600fa9dcfbeb97b04c56c85667ee    3.6 GiB
md5:2ec23c10d75d108f4e3d0bf96ee80f5c    1.2 GiB
md5:edc8edf1da3afada556c0cfd48ffe9a9    52.1 MiB
md5:caf11272502510567a01fbfc835724c1    4.8 MiB
md5:07afed6eea7a2c62ced8cbd034b9b1b5    29.2 MiB
md5:e43bfbda733388f68f1b9c7c60752556    232.6 MiB
md5:549216b081309099c9d85c2a8a782e44    29.0 MiB
md5:86ab7fabb435854620d5973cda797571    12.2 KiB
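
Since several of the files are multiple gigabytes, it may be worth checking a download against its listed checksum before using it. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_listing` are hypothetical, and the local path in the usage comment is a placeholder because the listing does not give file names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use bounded for multi-GiB files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file with a checksum written as in the listing ('md5:<hex>')."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Usage (placeholder path; substitute the file you downloaded):
# ok = matches_listing("path/to/downloaded_file",
#                      "md5:f223f85f790c009f4808bf445351ce84")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.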

References

Preprint (Preprint in which the MAD dataset is described)
A. Mazitov, S. Chorna, G. Fraux, M. Bercx, G. Pizzi, S. De, and M. Ceriotti, arXiv preprint arXiv:2506.19674 (2025), doi: 10.48550/arXiv.2506.19674