Massive Atomic Diversity (MAD) Dataset

This data record contains the calculation data of the MAD dataset, a lightweight dataset for atomistic machine learning, specifically designed for creating universal interatomic potentials (originally published in Ref. 1). The data is distributed through the provenance graphs of the DFT calculations obtained with AiiDA, which can be used to reproduce the dataset and to study the details of the DFT calculations. In addition, the record contains human-readable versions of the main dataset components, as well as interactive visualizations of the dataset.

Table of Contents

  1. Dataset Overview
  2. MAD Benchmark Overview
  3. Data Format and Usage
  4. DFT Calculations Details
  5. References

Dataset Overview

The dataset consists of 95,595 structures covering 85 elements (atomic numbers 1 to 86, excluding astatine) and contains 8 diverse subsets that span the organic and inorganic domains.

  1. MC3D (33,596 structures, 738,484 atoms): Bulk crystals from Materials Cloud 3D crystals database
  2. MC3D-rattled (30,044 structures, 599,675 atoms): Rattled analogs of MC3D crystals, with Gaussian noise added to atomic positions
  3. MC3D-random (2,800 structures, 25,095 atoms): Artificial structures from MC3D, with randomized atomic species from 85 elements
  4. MC3D-surfaces (5,589 structures, 205,185 atoms): Surface slabs generated from MC3D, cleaved along random low-index crystallographic planes
  5. MC3D-clusters (9,071 structures, 44,829 atoms): Nanoclusters (2-8 atoms), cut from MC3D and MC3D-rattled environments
  6. MC2D (2,676 structures, 43,225 atoms): Two-dimensional crystals from Materials Cloud 2D database, with 2D periodic boundary conditions
  7. SHIFTML-molcrys (8,578 structures, 852,044 atoms): Curated molecular crystals from Cambridge Structural Database, with 3D periodic boundary conditions
  8. SHIFTML-molfrags (3,241 structures, 72,120 atoms): Neutral molecular fragments from SHIFTML dataset, with 3D periodic boundary conditions

MAD Benchmark Overview

The MAD dataset is distributed together with an accompanying benchmark for assessing atomistic machine learning models. This benchmark was originally created in Ref. 1 and contains samples from the following datasets, recomputed with the MAD DFT settings (and, where needed, with the MPtrj dataset DFT settings):

  1. MAD
  2. MPtrj from Ref. 2
  3. WBM (AKA Matbench) from Ref. 3
  4. Alexandria from Ref. 4
  5. MD22 from Ref. 5
  6. OC2020 from Ref. 6
  7. SPICE from Ref. 7

Chemiscope Visualizations

The data record is accompanied by Chemiscope visualizations of several figures from the dataset paper (Ref. XXX). These contain the landmark points for the SketchMap projections of the dataset features computed with the PET-MAD model, as well as the projections of the overall MAD dataset and the MAD Benchmark subsets. Additionally, these files contain the structure and energy information for each structure.

Data Format and Usage

Dataset structure files

All the subsets are split into train, validation and test sets in an 80:10:10 ratio, and stored in three main files in extxyz format: mad-train.xyz, mad-val.xyz and mad-test.xyz. All structures are accompanied by DFT-calculated energies (in eV), forces (in eV/Å) and, for periodic structures, stresses (in eV/ų).

The structure files can be read using the ase.io.read function from the ASE Python package:

from ase.io import read

train_atoms = read('mad-train.xyz', index=':') # List[ase.Atoms]
atoms = train_atoms[0]

energy = atoms.get_potential_energy() # in eV
forces = atoms.get_forces() # in eV/Å
stress = atoms.get_stress() # in eV/ų (only for periodic structures)

Each ase.Atoms object records its subset and its train/val/test split in the atoms.info dictionary:

subset = atoms.info['subset'] # MC3D, MC3D-rattled, etc.
split = atoms.info['split'] # train, val or test

MAD Benchmark files

All the structures from the MAD Benchmark are stored in the mad-benchmark directory in two files, using extxyz format: mad-bench-mad-settings.xyz and mad-bench-mptrj-settings.xyz.

The first file contains all the structures recomputed with the MAD dataset DFT settings; the second contains the structures computed with the MPtrj dataset DFT settings. Since some of the subsets (such as MPtrj, Matbench and Alexandria) were already compatible with the MPtrj settings, only the OC2020, SPICE and MD22 subsets were recomputed with them. The origin metadata can be found in the atoms.info dictionary of the ase.Atoms objects:

origin_dataset = atoms.info['dataset'] # Alexandria, MD22, MPtrj, etc.
origin_subset = atoms.info['subset'] # MC3D, MC3D-rattled, etc. (only for MAD structures)
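The same metadata can be used, for instance, to tally benchmark structures per origin dataset. A sketch with hypothetical toy dictionaries standing in for the atoms.info entries read from mad-bench-mad-settings.xyz:

```python
from collections import Counter

# Hypothetical atoms.info dictionaries (toy data, not the real contents)
infos = [
    {'dataset': 'MAD', 'subset': 'MC3D'},
    {'dataset': 'MPtrj'},
    {'dataset': 'MAD', 'subset': 'MC2D'},
    {'dataset': 'Alexandria'},
]

# Count structures per origin dataset
counts = Counter(info['dataset'] for info in infos)
print(counts.most_common(1))  # [('MAD', 2)]
```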

Chemiscope visualization files

The corresponding Chemiscope visualization files are stored as gzipped JSON files:

  1. mad-landmarks.chemiscope.json.gz - contains the landmark points for the SketchMap projections of the dataset features computed with the PET-MAD model (Figure 5 from Ref. XXX)
  2. mad-subsets.chemiscope.json.gz - contains the full dataset visualization, performed using SketchMap projections of the PET-MAD model features (Figure 6 from Ref. XXX)
  3. mad-bench.chemiscope.json.gz - contains the visualization of the MAD Benchmark performed with the PET-MAD model and SketchMap (Figure 7 from Ref. XXX)

These files can be visualized with the Materials Cloud Archive interactive app, or using the chemiscope Python package directly from a Jupyter notebook:

import chemiscope

chemiscope.show_input('mad-landmarks.chemiscope.json.gz')

AiiDA provenance graphs

Additionally, this data record contains the provenance graphs of the DFT calculations obtained with AiiDA, stored in the AiiDA archive format. The core archives of the MAD dataset are stored in the root folder to facilitate the interactive apps available in the Materials Cloud Archive. The auxiliary archives are stored in mad-aiida-aux.zip and contain additional calculations done for the MAD Benchmark, as well as some auxiliary calculations.

  1. Root directory:
    • mad-mc3d.aiida - provenance graphs of the MC3D subset
    • mad-mc3d-rattle.aiida - provenance graphs of the MC3D-rattled subset
    • mad-mc3d-random.aiida - provenance graphs of the MC3D-random subset
    • mad-mc3d-surfaces.aiida - provenance graphs of the MC3D-surfaces subset
    • mad-mc3d-clusters.aiida - provenance graphs of the MC3D-clusters subset
    • mad-mc2d.aiida - provenance graphs of the MC2D subset
    • mad-shiftml-molcrys.aiida - provenance graphs of the SHIFTML-molcrys subset
    • mad-shiftml-molfrags.aiida - provenance graphs of the SHIFTML-molfrags subset
  2. mad-aiida-aux.zip content:
    • dimers.aiida - provenance graph of the dimer curves calculations
    • single-atoms.aiida - provenance graph of the isolated atoms calculations
    • alexandria-sample.aiida - provenance graph of the Alexandria dataset sample, recomputed with MAD dataset settings
    • mptrj-sample.aiida - provenance graph of the MPtrj dataset sample, recomputed with MAD dataset settings
    • matbench-sample.aiida - provenance graph of the WBM (AKA Matbench) dataset sample, recomputed with MAD dataset settings
    • MD22-sample.aiida - provenance graph of the MD22 sample, recomputed with MAD dataset settings
    • OC2020-sample.aiida - provenance graph of the OC2020 sample, recomputed with MAD dataset settings
    • BTO.aiida - provenance graph of the recomputed BaTiO3 dataset from Ref. 8, used in PET-MAD paper calculations
    • GaAs.aiida - provenance graph of the GaAs dataset from Ref. 9, used in PET-MAD paper calculations
    • HEA25S-sample.aiida - provenance graph of the recomputed HEA25S sample from Ref. 10, used in PET-MAD paper calculations
    • LiPS.aiida - provenance graph of the recomputed Li3PS4 dataset from Ref. 11, used in PET-MAD paper calculations
    • auxiliary.aiida - provenance graph of the auxiliary calculations used to determine suitable DFT parameters for the MAD dataset: cutoff and k-point convergence studies, vacuum-size convergence studies, etc.

In order to read and process the AiiDA provenance graphs, you need to install the AiiDA package and then use the verdi archive import command to import the archives into a local AiiDA database. Please refer to the AiiDA documentation for installation instructions and further details. Once the AiiDA package is installed, you can import an archive as follows:

verdi archive import mad-mc3d.aiida

Alternatively, you can pass the URL of the archive file on the Materials Cloud Archive directly:

verdi archive import https://archive.materialscloud.org/record/XXXXX/mad-mc3d.aiida

where XXXXX is the Materials Cloud Archive record ID.

DFT Calculations Details

All structures were calculated using:

  • Quantum ESPRESSO v7.2 code
  • PBEsol exchange-correlation functional
  • SSSP Efficiency v1.2 pseudopotentials
  • Energy cutoff: 110 Ry for wavefunctions, 1320 Ry for charge density
  • k-point sampling: Gamma-centered, with a line density of 0.125 Å⁻¹
  • Energy convergence threshold for the SCF calculation: 1e-8 Ry
  • Smearing width for the Marzari-Vanderbilt-DeVita-Payne cold smearing function: 0.01 Ry
  • Vacuum size for non-periodic dimensions: 25 Å
  • Coulomb potential truncation for 2D structures: Sohier-Calandra-Mauri method
  • Coulomb potential truncation for 0D structures: Martyna-Tuckerman method
  • The isolated-atom energy baseline is subtracted from the structure energies
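The settings listed above map onto a pw.x input roughly as follows. This is a sketch, not the exact production input: the k-point grids were generated per structure from the 0.125 Å⁻¹ line density, and the isolated/2D systems additionally use assume_isolated ('mt' for the Martyna-Tuckerman and '2D' for the Sohier-Calandra-Mauri truncation).

```
&SYSTEM
  ecutwfc     = 110        ! wavefunction cutoff, Ry
  ecutrho     = 1320       ! charge-density cutoff, Ry
  occupations = 'smearing'
  smearing    = 'mv'       ! Marzari-Vanderbilt cold smearing
  degauss     = 0.01       ! smearing width, Ry
/
&ELECTRONS
  conv_thr = 1.0d-8        ! SCF energy convergence threshold, Ry
/
```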

References

  1. Mazitov, Arslan, et al. PET-MAD, a universal interatomic potential for advanced materials modeling. arXiv preprint arXiv:2503.14118 (2025)
  2. Deng, B., et al. CHGNet as a pretrained universal neural network potential for charge-informed atomistic modelling. Nat Mach Intell 5, 1031–1041 (2023)
  3. Wang, HC., Botti, S. & Marques, M.A.L. Predicting stable crystalline compounds using chemical similarity. npj Comput Mater 7, 12 (2021)
  4. J. Schmidt, et al. Machine-Learning-Assisted Determination of the Global Zero-Temperature Phase Diagram of Materials. Adv. Mater. 2023, 35, 2210788
  5. Stefan Chmiela et al., Accurate global machine learning force fields for molecules with hundreds of atoms. Sci. Adv. 9, eadf0873 (2023)
  6. Lowik Chanussot, et al., Open Catalyst 2020 (OC20) Dataset and Community Challenges, ACS Catalysis 2021 11 (10), 6059-6072
  7. Eastman, P., et al. SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials. Sci Data 10, 11 (2023)
  8. Lorenzo Gigli, et al., Thermodynamics and dielectric response of BaTiO₃ by data-driven modeling, Materials Cloud Archive, 2022.88 (2022)
  9. Giulio Imbalzano, Michele Ceriotti, Modeling the Ga/As binary system across temperatures and compositions from first principles, Materials Cloud Archive, 2021.226 (2021)
  10. Arslan Mazitov, et al., Surface segregation in high-entropy alloys from alchemical machine learning: dataset HEA25S, Materials Cloud Archive, 2024.43 (2024)
  11. Lorenzo Gigli, Davide Tisi, Federico Grasselli, Michele Ceriotti, Mechanism of charge transport in lithium thiophosphate, Materials Cloud Archive, 2024.41 (2024)