This data record contains the calculation data of the MAD dataset, a lightweight dataset for atomistic machine learning specifically designed for creating universal interatomic potentials (originally published in Ref. 1). The data is distributed as provenance graphs of the DFT calculations obtained with AiiDA, which can be used to reproduce the dataset and to study the details of the DFT calculations. In addition, the record contains human-readable versions of the main dataset components, as well as interactive visualizations of the dataset.
The dataset consists of 95595 structures of 85 elements (with atomic numbers ranging from 1 to 86, except Astatine) and contains 8 diverse subsets that cover organic and inorganic domains.
The MAD dataset is distributed together with an accompanying benchmark for assessing atomistic machine learning models. This benchmark was originally created in Ref. 1 and contains samples from the following datasets, recomputed with MAD DFT settings (and MPtrj dataset DFT settings, if needed):
The data record is accompanied by the Chemiscope visualizations of a few Figures from the dataset paper (Ref. XXX), which contain the information on the landmark points for the SketchMap projections of the dataset features computed with the PET-MAD model, as well as the projections of the overall MAD dataset and the MAD Benchmark subsets. Additionally, these files contain the structure and energy information for each structure.
All the subsets are split into train, validation and test sets with 80:10:10 ratios and stored in three main files in extxyz format: mad-train.xyz, mad-val.xyz and mad-test.xyz. All the structures are accompanied by DFT-calculated energies (in eV), forces (in eV/Å) and, for periodic structures, stresses (in eV/ų).
The structure files can be read using the ase.io.read function from the ASE Python package:
from ase.io import read
train_atoms = read('mad-train.xyz', index=':') # List[ase.Atoms]
atoms = train_atoms[0]
energy = atoms.get_potential_energy() # in eV
forces = atoms.get_forces() # in eV/Å
stress = atoms.get_stress() # in eV/ų (only for periodic structures)
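Since stresses are provided only for periodic structures, it can help to guard the call to get_stress(). The following is a minimal sketch, assuming that periodicity is flagged on all three axes of atoms.pbc and that frames without full periodicity carry no stress. The helper name stress_in_gpa is hypothetical and not part of the dataset or of ASE. It also converts the result to GPa.

```python
# 1 eV/Å^3 expressed in GPa (follows from the exact SI value of the elementary charge)
EV_PER_A3_TO_GPA = 160.2176634


def stress_in_gpa(atoms):
    """Return the Voigt stress in GPa, or None for non-periodic structures.

    `atoms` is expected to behave like an ase.Atoms object: it exposes `pbc`
    (three booleans) and `get_stress()` (six components in eV/Å^3).
    """
    if not all(atoms.pbc):
        return None  # assumption: no stress is stored for non-periodic frames
    return [component * EV_PER_A3_TO_GPA for component in atoms.get_stress()]
```

For example, stress_in_gpa(train_atoms[0]) returns a list of six values for a bulk crystal and None for an isolated cluster or molecule, under the assumptions stated above.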
Each resulting ase.Atoms object stores its subset and its train/val/test split in the atoms.info dictionary:
subset = atoms.info['subset'] # MC3D, MC3D-rattled, etc.
split = atoms.info['split'] # train, val or test
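The metadata above makes it easy to tally or filter structures. The sketch below assumes only the 'subset' and 'split' keys described in this record. The function names are hypothetical helpers, and they accept any iterable of ase.Atoms-like objects.

```python
from collections import Counter


def count_by_subset_and_split(frames):
    """Count structures per (subset, split) pair.

    `frames` is any iterable of ase.Atoms-like objects whose `info`
    dictionary carries the 'subset' and 'split' keys.
    """
    return Counter((f.info["subset"], f.info["split"]) for f in frames)


def select_subset(frames, subset):
    """Return the structures that belong to the given subset."""
    return [f for f in frames if f.info["subset"] == subset]
```

With the list returned by read('mad-train.xyz', index=':'), calling count_by_subset_and_split(train_atoms) gives the number of training structures in each subset, and select_subset(train_atoms, 'MC3D') keeps only the MC3D structures.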
All the structures from the MAD Benchmark are stored in the mad-benchmark directory in two files in extxyz format: mad-bench-mad-settings.xyz and mad-bench-mptrj-settings.xyz.
The first file contains all the structures recomputed with the MAD dataset DFT settings; the second file contains a set of structures computed with the MPtrj dataset DFT settings. Since some of the subsets (MPtrj, Matbench and Alexandria) were already compatible with the MPtrj dataset settings, only the OC2020, SPICE and MD22 subsets were recomputed with MPtrj settings. The origin metadata can be found in the atoms.info dictionary of the ase.Atoms objects:
origin_dataset = atoms.info['dataset'] # Alexandria, MD22, MPtrj, etc.
origin_subset = atoms.info['subset'] # MC3D, MC3D-rattled, etc. in the case of MAD dataset
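The 'dataset' key allows benchmark errors to be reported per origin dataset. The sketch below is illustrative only: the choice of metric (mean absolute error of the energy per atom) and the function name are assumptions, not the evaluation protocol of Ref. 1. The predicted energies are assumed to come from whatever model is being assessed.

```python
from collections import defaultdict


def energy_mae_by_dataset(frames, predicted_energies):
    """Mean absolute error of the energy per atom (eV/atom), per origin dataset.

    `frames` are ase.Atoms-like objects with reference energies attached;
    `predicted_energies` holds one total energy in eV per frame.
    """
    errors = defaultdict(list)
    for atoms, predicted in zip(frames, predicted_energies):
        reference = atoms.get_potential_energy()
        per_atom_error = abs(predicted - reference) / len(atoms)
        errors[atoms.info["dataset"]].append(per_atom_error)
    return {name: sum(values) / len(values) for name, values in errors.items()}
```

The frames would typically be read from mad-bench-mad-settings.xyz or mad-bench-mptrj-settings.xyz, depending on the DFT settings the model was trained on.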
The corresponding Chemiscope visualization files are stored as gzipped JSON files:
mad-landmarks.chemiscope.json.gz - contains the information on the landmark points for the SketchMap projections of the dataset features computed with the PET-MAD model (Figure 5 from Ref. XXX)
mad-subsets.chemiscope.json.gz - contains the full dataset visualization obtained using SketchMap projections of the PET-MAD model features (Figure 6 from Ref. XXX)
mad-bench.chemiscope.json.gz - contains the visualization of the MAD Benchmark performed with the PET-MAD model and SketchMap (Figure 7 from Ref. XXX)
These files can be visualized with the Materials Cloud Archive interactive app, or with the chemiscope Python package directly from a Jupyter notebook:
import chemiscope
chemiscope.show_input('mad-landmarks.chemiscope.json.gz')
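Outside a notebook, the same files can be inspected with the Python standard library alone, since they are gzipped JSON. This is a minimal sketch. The helper names are hypothetical, and the 'structures' top-level key is assumed from the chemiscope input format rather than confirmed for these specific files.

```python
import gzip
import json


def load_chemiscope_json(path):
    """Load a gzipped chemiscope JSON file into a Python dictionary."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def summarize(data):
    """Return the top-level sections and the number of stored structures."""
    return {
        "sections": sorted(data),
        # assumption: structures live under the 'structures' key
        "n_structures": len(data.get("structures", [])),
    }
```

For instance, summarize(load_chemiscope_json('mad-landmarks.chemiscope.json.gz')) lists which sections the file contains before it is loaded into a viewer.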
Additionally, this data record contains the provenance graphs of the DFT calculations obtained with AiiDA and stored in the aiida archive format. The core archives of the MAD dataset are stored in the root folder to facilitate the interactive apps available in the Materials Cloud Archive. The auxiliary archives are stored in mad-aiida-aux.zip and contain additional calculations done for the MAD Benchmark, as well as some auxiliary calculations.
mad-mc3d.aiida - provenance graphs of the MC3D subset
mad-mc3d-rattle.aiida - provenance graphs of the MC3D-rattled subset
mad-mc3d-random.aiida - provenance graphs of the MC3D-random subset
mad-mc3d-surfaces.aiida - provenance graphs of the MC3D-surfaces subset
mad-mc3d-clusters.aiida - provenance graphs of the MC3D-clusters subset
mad-mc2d.aiida - provenance graphs of the MC2D subset
mad-shiftml-molcrys.aiida - provenance graphs of the SHIFTML-molcrys subset
mad-shiftml-molfrags.aiida - provenance graphs of the SHIFTML-molfrags subset
mad-aiida-aux.zip content:
dimers.aiida - provenance graph of the dimer curves calculations
single-atoms.aiida - provenance graph of the isolated atoms calculations
alexandria-sample.aiida - provenance graph of the Alexandria dataset sample, recomputed with MAD dataset settings
mptrj-sample.aiida - provenance graph of the MPtrj dataset sample, recomputed with MAD dataset settings
matbench-sample.aiida - provenance graph of the WBM dataset sample (AKA Matbench), recomputed with MAD dataset settings
MD22-sample.aiida - provenance graph of the MD22 sample, recomputed with MAD dataset settings
OC2020-sample.aiida - provenance graph of the OC2020 sample, recomputed with MAD dataset settings
BTO.aiida - provenance graph of the recomputed BaTiO3 dataset from Ref. 8, used in PET-MAD paper calculations
GaAs.aiida - provenance graph of the GaAs dataset from Ref. 9, used in PET-MAD paper calculations
HEA25S-sample.aiida - provenance graph of the recomputed HEA25S sample from Ref. 10, used in PET-MAD paper calculations
LiPS.aiida - provenance graph of the recomputed Li3PS4 dataset from Ref. 11, used in PET-MAD paper calculations
auxiliary.aiida - provenance graph of the auxiliary calculations used for finding proper DFT parameters for the MAD dataset, cutoff and k-point convergence studies, vacuum size convergence studies, etc.
To read and process the AiiDA provenance graphs, you need to install the AiiDA package and then use the verdi archive import command to import the archives into the local AiiDA database. Please refer to the AiiDA documentation for installation instructions and more details.
Once the AiiDA package is installed, you can use the following command to import the archives:
verdi archive import mad-aiida/core/mad-ideal.aiida
Optionally, you can directly provide the link to the archive file in the Materials Cloud Archive:
verdi archive import https://archive.materialscloud.org/record/XXXXX/mad-aiida/core/mad-ideal.aiida
where XXXXX is the Materials Cloud Archive record ID.
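When several archives need to be imported, the command can be repeated from a short script. The sketch below wraps the verdi archive import command shown above with the standard subprocess module. The function names and the dry_run switch are hypothetical conveniences, and the archive paths or URLs are supplied by the caller.

```python
import subprocess


def build_import_commands(archives):
    """Return one `verdi archive import` argument list per archive path or URL."""
    return [["verdi", "archive", "import", str(archive)] for archive in archives]


def import_archives(archives, dry_run=True):
    """Import the archives one by one; with dry_run=True only print the commands."""
    for command in build_import_commands(archives):
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the loop at the first failed import
            subprocess.run(command, check=True)
```

Running import_archives([...]) with the default dry_run=True prints the commands for review, and passing dry_run=False executes them in the active AiiDA profile.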
All structures were calculated using: