The Massive Atomic Diversity (MAD) - 1.5 dataset

Data

The structures and targets that make up the dataset, computed at the r2SCAN and PBE levels of theory, are provided in extended XYZ format. The header line of each structure stores the energies (the total energy under the key energy and the atomization energy under the key atomization_energy), the stress (stress), and the lattice parameters (Lattice). Atom types (species), positions (pos), and forces (forces) are stored as space-separated per-atom entries. Cartesian coordinates, energies, forces, and stresses are given in Å, eV, eV/Å, and eV/Å^3, respectively. The header also stores the name of the subset to which each structure belongs (subset) and the numeric index of the structure within that subset (frame_id); together, these uniquely identify each structure in the dataset.
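As a rough illustration of the layout described above, the following standard-library Python sketch parses a single frame. The example frame is made up for illustration (its numbers are not taken from the dataset), and the helper names parse_header, as_matrix, and parse_frame are hypothetical. It assumes the per-atom columns appear in the order species, pos, forces, and that Lattice and stress each hold nine numbers; an established extended XYZ reader is likely the better choice for real work.

```python
import shlex

# Made-up two-atom frame for illustration only; the numbers are NOT from the dataset.
EXAMPLE_FRAME = """2
Lattice="5.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 5.0" Properties=species:S:1:pos:R:3:forces:R:3 energy=-10.5 atomization_energy=-4.2 stress="0.01 0.0 0.0 0.0 0.01 0.0 0.0 0.0 0.01" subset=MC3D frame_id=7
H 0.0 0.0 0.0 0.1 0.0 0.0
H 0.0 0.0 0.74 -0.1 0.0 0.0
"""


def parse_header(comment_line):
    """Split a key=value header line into a dict of raw strings."""
    header = {}
    # shlex keeps quoted values such as Lattice="..." together as one token.
    for token in shlex.split(comment_line):
        key, _, value = token.partition("=")
        header[key] = value
    return header


def as_matrix(value):
    """Reshape nine space-separated numbers into 3 rows (assumed layout)."""
    numbers = [float(x) for x in value.split()]
    return [numbers[i:i + 3] for i in range(0, 9, 3)]


def parse_frame(text):
    """Parse one frame, assuming columns: species, pos (3), forces (3)."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    header = parse_header(lines[1])
    atoms = [line.split() for line in lines[2:2 + n_atoms]]
    return {
        # (subset, frame_id) identifies the structure; an integer index is assumed.
        "key": (header["subset"], int(header["frame_id"])),
        "energy": float(header["energy"]),                          # eV
        "atomization_energy": float(header["atomization_energy"]),  # eV
        # Lattice and stress may be absent for non-periodic structures.
        "lattice": as_matrix(header["Lattice"]) if "Lattice" in header else None,
        "stress": as_matrix(header["stress"]) if "stress" in header else None,
        "species": [a[0] for a in atoms],
        "pos": [[float(x) for x in a[1:4]] for a in atoms],     # Å
        "forces": [[float(x) for x in a[4:7]] for a in atoms],  # eV/Å
    }


frame = parse_frame(EXAMPLE_FRAME)
print(frame["key"], frame["energy"], frame["species"])
```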

r2SCAN

The MAD-1.5 dataset is computed in its entirety at the r2SCAN level of theory.

Specifically, the record contains:

  • mad-1.5-r2scan-train.xyz -> the training split used in training PET-MAD-1.5 models

  • mad-1.5-r2scan-val.xyz -> the validation split used in training PET-MAD-1.5 models

  • mad-1.5-r2scan-test.xyz -> the test split used in evaluating PET-MAD-1.5 models

  • mad-1.5-r2scan-llpr-rejected.xyz -> the 8244 structures rejected in the LLPR-uncertainty-based cleaning step
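Because subset and frame_id together identify a structure, they can be used to compare the files listed above, for example to see whether two splits share any structures. The sketch below is one possible way to do this with the standard library only. The function names iter_frame_keys and shared_keys are hypothetical, and the code assumes every frame header carries both keys, with an integer frame_id.

```python
import shlex
from pathlib import Path


def iter_frame_keys(lines):
    """Yield (subset, frame_id) for each frame in an iterable of extended XYZ lines."""
    lines = iter(lines)
    for count_line in lines:
        if not count_line.strip():
            continue  # tolerate blank lines between frames
        n_atoms = int(count_line)
        # Header line: key=value tokens, with quoted values kept whole by shlex.
        tokens = shlex.split(next(lines, ""))
        header = dict(tok.partition("=")[::2] for tok in tokens)
        for _ in range(n_atoms):
            next(lines, None)  # skip the per-atom lines
        yield header["subset"], int(header["frame_id"])


def shared_keys(lines_a, lines_b):
    """Return the set of (subset, frame_id) keys present in both inputs."""
    return set(iter_frame_keys(lines_a)) & set(iter_frame_keys(lines_b))


if __name__ == "__main__":
    train = Path("mad-1.5-r2scan-train.xyz")
    test = Path("mad-1.5-r2scan-test.xyz")
    # Only runs when the files from this record are in the working directory.
    if train.exists() and test.exists():
        with train.open() as fa, test.open() as fb:
            print("frames shared by train and test:", len(shared_keys(fa, fb)))
```

Both functions accept any iterable of lines, so an open file handle can be passed directly and the files are streamed rather than loaded whole.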

PBE

Additionally, we provide structures, energies, forces, and stresses (for periodic structures) for the MAD-1 subsets (MC3D, MC3D-rattled, MC3D-random, MC3D-surface, MC3D-cluster, MC2D, ShiftML-molcrys, ShiftML-molfrags) plus the monomers and MC3D-random-extended subsets from MAD-1.5, computed with the PBE functional but with all other DFT settings kept consistent with the r2SCAN calculations.

Targets computed with PBE were used in model training but were assigned to output heads separate from those used for the r2SCAN targets, which was found to help training and to result in lower force errors (more details can be found in the supporting preprint). The cross-validation splits used in training were consistent with the r2SCAN splits above.

Specifically, we provide the file:

  • mad-1.5-pbe.xyz -> a subset of the MAD-1.5 dataset computed with the PBE functional.
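To see which subsets the PBE file covers, the number of frames per subset can be tallied from the subset key in the frame headers. This is a minimal sketch: the name count_subsets is hypothetical, and the exact strings stored under subset (for example "monomers") are an assumption based on the subset names listed above.

```python
import re
from collections import Counter
from pathlib import Path

# Matches subset=NAME or subset="NAME" in a frame header line.
SUBSET_RE = re.compile(r'(?:^|\s)subset="?([^"\s]+)"?')


def count_subsets(lines):
    """Count frames per subset from an iterable of extended XYZ lines."""
    counts = Counter()
    for line in lines:
        match = SUBSET_RE.search(line)
        if match:  # per-atom lines and atom-count lines do not match
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    pbe = Path("mad-1.5-pbe.xyz")
    # Only runs when the file from this record is in the working directory.
    if pbe.exists():
        with pbe.open() as handle:
            for name, n_frames in sorted(count_subsets(handle).items()):
                print(name, n_frames)
```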

Supporting files

Other files made available in this data archive:

  • control.in -> the base FHIaims input file containing the DFT settings used to compute the MAD-1.5 dataset. These settings are used together with the FHIaims "tight" basis sets and grids for all calculations. The FHIaims 2020 species defaults are used unmodified for all 102 elements, with a few exceptions: for a subset of lanthanide and actinide elements (Pr–Yb, Pu–No, excluding Ce), we removed the confined 5d/6d functions because, although physically motivated, they were found to hinder SCF convergence in non-spin-polarized calculations.