This data record contains the input data, the training subsets, and the processed calculation results required to reproduce the results of the PET-MAD paper (Ref. [1]).
> [!NOTE]
> In order to avoid duplication of data, the Massive Atomic Diversity (MAD) dataset itself is not included in this data record. More data and details on the contents of the dataset can be found here, and in the accompanying pre-print [2].
The data record is structured as follows:
data/:
datasets/: contains the raw datasets in the extxyz format used for training the problem-specific models presented in the paper, and for evaluating the PET-MAD model. Namely, the datasets are:
- BaTiO3: for predicting the dielectric response of BaTiO3. Originally taken from Ref. [3].
- GaAs: for predicting the melting point of GaAs. Originally taken from Ref. [4].
- HEA25S-subset: for predicting surface segregation in the CoCrFeMnNi alloy. Originally sampled from Ref. [5].
- Li3PS4: for predicting the ionic conductivity of lithium thiophosphate. Originally taken from Ref. [6].
- succinic-acid: for predicting the chemical shieldings in NMR spectroscopy. Originally taken from Ref. [7].
- water: for predicting the heat capacity with Nuclear Quantum Effects. Originally taken from Ref. [8].
- mad-bench: benchmark dataset used for evaluating PET-MAD, as well as other universal MLIPs. Contains small subsets of the MAD [2], MPtrj [9], WBM [10], Alexandria [11], OC2020 [12], SPICE [13], and MD22 [14] datasets. All the data is provided in two DFT flavors: MPtrj-compatible (mad-bench-mptrj-settings.xyz) and MAD-compatible (mad-bench-mad-settings.xyz).

All the datasets except for mad-bench are provided with the train-val-test splits used in the paper, and thus consist of the corresponding train.xyz, val.xyz, and test.xyz files.
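Since every dataset folder follows the same train/val/test layout, the expected file paths can be built programmatically. A minimal sketch (the helper `split_paths` is not part of the data record, just an illustration of the layout):

```python
from pathlib import Path

def split_paths(dataset_dir):
    """Return the expected train/val/test extxyz paths for a dataset folder."""
    d = Path(dataset_dir)
    return {split: d / f"{split}.xyz" for split in ("train", "val", "test")}

# e.g. the BaTiO3 dataset from the data record
paths = split_paths("data/datasets/BaTiO3")
```
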
eval/: contains the evaluation results of all the universal MLIPs considered in the paper on the
mad-bench dataset. Each model's predictions are stored in a separate extxyz file, with the
naming convention mad-bench-<model-name>-predictions.xyz. The models considered are: PET-MAD,
MACE-MP-0-L [15], MatterSim-5M [16], SevenNet-l3i5 [17],
and ORB-v2 [18].
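Once the predicted and reference energies are extracted from these files (e.g. via `ase.io.read` and `get_potential_energy()`), the prediction errors can be computed along these lines. This is only a sketch; the energy values below are placeholders, not data from the record:

```python
def mean_abs_error(pred, ref):
    """Mean absolute error between two equal-length sequences of energies (eV)."""
    assert len(pred) == len(ref)
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)

# placeholder energies (eV); in practice these would come from
# atoms.get_potential_energy() on each structure in the predictions
# file and the corresponding reference file
pred_energies = [-10.1, -20.3, -5.2]
ref_energies = [-10.0, -20.0, -5.0]
mae = mean_abs_error(pred_energies, ref_energies)
```
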
figures_data/: contains the processed data required for reproducing the figures in the paper. This data is meant to be used together with the Jupyter notebooks in the notebooks/ folder, which contain the plotting routines.
models/: contains the trained PET-MAD model, as well as the problem-specific bespoke (pet-bespoke.pt)
and LoRA-finetuned (pet-lora.pt) models for each of the six materials considered in the paper.
Each model is saved in the TorchScript format, which ensures reproducibility and portability.
notebooks/: contains the Jupyter notebooks used for generating the figures in the paper. Each notebook contains the plotting routines, and uses the processed data in the `figures_data/` folder.
figures/: contains the generated figures in PDF, SVG and PNG formats.
inputs/: contains the input files used for training and evaluating the PET-MAD model, as well as performing the problem-specific simulations for selected materials: BaTiO3, GaAs, CoCrFeMnNi, Li3PS4, succinic acid, and water.
All the structures in the datasets and eval folders are stored in the extxyz format with the accompanying DFT-calculated energies (in eV), forces (in eV/Å) and, in the case of periodic structures, stresses (in eV/ų).
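Note that ASE reports stress in eV/ų, while tabulated elastic data is often given in GPa. The conversion factor is 1 eV/ų ≈ 160.2177 GPa; a small sketch (the helper below is illustrative, not part of the data record):

```python
EV_PER_A3_IN_GPA = 160.21766208  # 1 eV/Å^3 expressed in GPa

def stress_to_gpa(stress_ev_a3):
    """Convert stress components from eV/Å^3 (ASE convention) to GPa."""
    return [s * EV_PER_A3_IN_GPA for s in stress_ev_a3]
```
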
The structure files can be read using the ase.io.read function from the ASE Python
package:
```python
from ase.io import read

train_atoms = read('data/datasets/BaTiO3/train.xyz', index=':')  # List[ase.Atoms]
atoms = train_atoms[0]
energy = atoms.get_potential_energy()  # in eV
forces = atoms.get_forces()            # in eV/Å
stress = atoms.get_stress()            # in eV/Å³ (only for periodic structures)
```
> [!WARNING]
> This section is only provided for the reproducibility of the results in the paper. For practical applications, we recommend using the `pet-mad` package, which provides a more user-friendly interface for loading and using the PET-MAD model. More details can be found here.
All the models in the models folder are stored in the TorchScript format, and can be loaded using the metatensor-torch package. Please note that the models we provide use the custom neighbor-list C++ extension from the pet-neighbors-convert package, which needs to be dynamically linked before loading the models. This can be done as follows:
```python
import pet_neighbors_convert  # dynamically links the custom C++ extension
from metatensor.torch.atomistic.ase_calculator import MetatensorCalculator

calc = MetatensorCalculator('models/BaTiO3/pet-mad-bespoke.pt')  # load the model
```
After loading the calculator object, it can be used to perform energy and force calculations on atomic structures.
```python
from ase.io import read

atoms = read('data/datasets/BaTiO3/train.xyz', index=':')[0]  # ase.Atoms
atoms.calc = calc
energy = atoms.get_potential_energy()  # in eV
forces = atoms.get_forces()            # in eV/Å
stress = atoms.get_stress()            # in eV/Å³ (only for periodic structures)
```
All the structures from the MAD Benchmark are stored in the mad-benchmark directory in two files in the extxyz format: mad-bench-mad-settings.xyz and mad-bench-mptrj-settings.xyz. The first file contains all the structures recomputed with the MAD dataset DFT settings, while the second contains a set of structures computed with the MPtrj dataset DFT settings. Since some of the subsets (such as MPtrj, Matbench, and Alexandria) were already compatible with the MPtrj dataset settings, only the OC2020, SPICE, and MD22 subsets were recomputed with MPtrj settings. The origin metadata can be found in the atoms.info dictionary of the ase.Atoms objects:
```python
origin_dataset = atoms.info['dataset']  # Alexandria, MD22, MPtrj, etc.
origin_subset = atoms.info['subset']    # MC3D, MC3D-rattled, etc. in the case of the MAD dataset
```
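This metadata makes it easy to tally the per-dataset composition of the benchmark. A minimal sketch (the labels below are placeholders; in practice one would collect `atoms.info['dataset']` over all structures read from one of the benchmark files):

```python
from collections import Counter

# placeholder origin labels; in practice:
#   labels = [a.info['dataset'] for a in read('mad-bench-mad-settings.xyz', index=':')]
labels = ['MPtrj', 'MAD', 'MAD', 'MD22', 'MPtrj']
counts = Counter(labels)  # number of benchmark structures per origin dataset
```
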