This data record contains the input data, the training subsets, and the processed calculation results required to reproduce the results of the PET-MAD paper (Ref. [1]).
> [!NOTE]
> In order to avoid duplication of data, the Massive Atomic Diversity (MAD) dataset itself is not included in this data record. More data and details on the contents of the dataset can be found here, and in the accompanying pre-print [2].
The data record is structured as follows:
data/:
datasets/: contains the raw datasets in the extxyz format used for training the problem-specific models presented in the paper, and for evaluating the PET-MAD model. Namely, the datasets are:
- BaTiO3: for predicting the dielectric response of BaTiO3. Originally taken from Ref. [3].
- GaAs: for predicting the melting point of GaAs. Originally taken from Ref. [4].
- HEA25S-subset: for predicting surface segregation in the CoCrFeMnNi alloy. Originally sampled from Ref. [5].
- Li3PS4: for predicting the ionic conductivity of lithium thiophosphate. Originally taken from Ref. [6].
- succinic-acid: for predicting the chemical shieldings in NMR spectroscopy. Originally taken from Ref. [7].
- water: for predicting the heat capacity with Nuclear Quantum Effects. Originally taken from Ref. [8].
- mad-bench: benchmark dataset used for evaluating PET-MAD, as well as other universal MLIPs. Contains small subsets of the MAD [2], MPtrj [9], WBM [10], Alexandria [11], OC2020 [12], SPICE [13], and MD22 [14] datasets. All the data is provided in two DFT flavors: MPtrj-compatible (mad-bench-mptrj-settings.xyz) and MAD-compatible (mad-bench-mad-settings.xyz).

All the datasets except for mad-bench are provided with the train-val-test splits used in the paper, and thus consist of the corresponding train.xyz, val.xyz, and test.xyz files.
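Since every dataset folder follows the same train/val/test layout, the expected file paths can be built programmatically. A minimal sketch (the helper `split_paths` is not part of the data record, just an illustration of the layout):

```python
from pathlib import Path

def split_paths(dataset_dir):
    """Return the expected train/val/test extxyz paths for a dataset folder."""
    d = Path(dataset_dir)
    return {split: d / f"{split}.xyz" for split in ("train", "val", "test")}

# e.g. the BaTiO3 dataset from the data record
paths = split_paths("data/datasets/BaTiO3")
```
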
eval/: contains the evaluation results of all the universal MLIPs considered in the paper on the
mad-bench dataset. Each model's predictions are stored in a separate extxyz file, with the
naming convention mad-bench-<model-name>-predictions.xyz. The models considered are: PET-MAD,
MACE-MP-0-L [15], MatterSim-5M [16], SevenNet-l3i5 [17],
and ORB-v2 [18].
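Once the predicted and reference energies are extracted from these files (e.g. via `ase.io.read` and `get_potential_energy()`), the prediction errors can be computed along these lines. This is only a sketch; the energy values below are placeholders, not data from the record:

```python
def mean_abs_error(pred, ref):
    """Mean absolute error between two equal-length sequences of energies (eV)."""
    assert len(pred) == len(ref)
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)

# placeholder energies (eV); in practice these would come from
# atoms.get_potential_energy() on each structure in the predictions
# file and the corresponding reference file
pred_energies = [-10.1, -20.3, -5.2]
ref_energies = [-10.0, -20.0, -5.0]
mae = mean_abs_error(pred_energies, ref_energies)
```
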
figures_data/: contains the processed data required for reproducing the figures in the paper. This data is meant to be used together with the Jupyter notebooks in the notebooks/ folder, which contain the plotting routines.
models/: contains the trained PET-MAD model, as well as the problem-specific bespoke (pet-bespoke.pt)
and LoRA-finetuned (pet-lora.pt) models for each of the six materials considered in the paper.
Each model is saved in the TorchScript format, which ensures reproducibility and portability.
notebooks/: contains the Jupyter notebooks used for generating the figures in the paper. Each notebook contains the plotting routines, and uses the processed data in the `figures_data/` folder.
figures/: contains the generated figures in PDF, SVG and PNG formats.
inputs/: contains the input files used for training and evaluating the PET-MAD model, as well as performing the problem-specific simulations for selected materials: BaTiO3, GaAs, CoCrFeMnNi, Li3PS4, succinic acid, and water.
All the structures in the datasets and eval folders are stored in the extxyz format with the accompanying DFT-calculated energies (in eV), forces (in eV/Å) and, in the case of periodic structures, stresses (in eV/ų).
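Note that ASE reports stress in eV/ų, while tabulated elastic data is often given in GPa. The conversion factor is 1 eV/ų ≈ 160.2177 GPa; a small sketch (the helper below is illustrative, not part of the data record):

```python
EV_PER_A3_IN_GPA = 160.21766208  # 1 eV/Å^3 expressed in GPa

def stress_to_gpa(stress_ev_a3):
    """Convert stress components from eV/Å^3 (ASE convention) to GPa."""
    return [s * EV_PER_A3_IN_GPA for s in stress_ev_a3]
```
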
The structure files can be read using the ase.io.read function from the ASE Python
package:
```python
from ase.io import read

train_atoms = read('data/datasets/BaTiO3/train.xyz', index=':')  # List[ase.Atoms]
atoms = train_atoms[0]
energy = atoms.get_potential_energy()  # in eV
forces = atoms.get_forces()            # in eV/Å
stress = atoms.get_stress()            # in eV/Å³ (only for periodic structures)
```
> [!WARNING]
> This section is only provided for the reproducibility of the results in the paper. For practical applications, we recommend using the `pet-mad` package, which provides a more user-friendly interface for loading and using the PET-MAD model. More details can be found here.
All the models in the models folder are stored in the TorchScript format, and can be loaded using the metatensor-torch package. Please note that the models we provide use the custom neighbor-list C++ extension from the pet-neighbors-convert package, which needs to be dynamically linked before loading the models. This can be done as follows:
```python
import pet_neighbors_convert  # dynamically links the custom C++ extension
from metatensor.torch.atomistic.ase_calculator import MetatensorCalculator

calc = MetatensorCalculator('models/BaTiO3/pet-mad-bespoke.pt')  # load the model
```
After loading the calculator object, it can be used to perform energy and force calculations on atomic structures.
```python
from ase.io import read

atoms = read('data/datasets/BaTiO3/train.xyz', index=':')[0]  # ase.Atoms
atoms.calc = calc
energy = atoms.get_potential_energy()  # in eV
forces = atoms.get_forces()            # in eV/Å
stress = atoms.get_stress()            # in eV/Å³ (only for periodic structures)
```
All the structures from the MAD Benchmark are stored in the mad-benchmark directory in two files in the extxyz format: mad-bench-mad-settings.xyz and mad-bench-mptrj-settings.xyz. The first file contains all the structures recomputed with the MAD dataset DFT settings, while the second contains a set of structures computed with the MPtrj dataset DFT settings. Since some of the subsets (such as MPtrj, Matbench, and Alexandria) were already compatible with the MPtrj dataset settings, only the OC2020, SPICE, and MD22 subsets were recomputed with MPtrj settings. The origin metadata can be found in the atoms.info dictionary of the ase.Atoms objects:
```python
origin_dataset = atoms.info['dataset']  # Alexandria, MD22, MPtrj, etc.
origin_subset = atoms.info['subset']    # MC3D, MC3D-rattled, etc. in the case of the MAD dataset
```
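This metadata makes it easy to tally the per-dataset composition of the benchmark. A minimal sketch (the labels below are placeholders; in practice one would collect `atoms.info['dataset']` over all structures read from one of the benchmark files):

```python
from collections import Counter

# placeholder origin labels; in practice:
#   labels = [a.info['dataset'] for a in read('mad-bench-mad-settings.xyz', index=':')]
labels = ['MPtrj', 'MAD', 'MAD', 'MD22', 'MPtrj']
counts = Counter(labels)  # number of benchmark structures per origin dataset
```
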