Datasets

The tarball datasets.tar.gz contains all data, including XYZ files, property CSV files, extended XYZ files (for MACE), and dataset splits. It includes three datasets: TM-GSspin+, tmPHOTO, and OctaKulik. For a detailed description, see README_DATA.md.
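Before extracting a large archive, it can help to see what it holds. The following is a minimal sketch using only the Python standard library; the function name summarize_tarball is hypothetical, and the extensions it reports depend on the archive, whose layout is documented in README_DATA.md.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_tarball(path):
    """Count the regular files in a tar archive, grouped by file extension."""
    counts = Counter()
    # "r:*" lets tarfile detect the compression (gzip for .tar.gz) itself.
    with tarfile.open(path, "r:*") as tar:
        for member in tar:
            if member.isfile():
                suffix = PurePosixPath(member.name).suffix or "(no extension)"
                counts[suffix] += 1
    return counts
```

Calling summarize_tarball("datasets.tar.gz") returns a Counter keyed by extension, which gives a quick check that the download is complete before unpacking it.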

Molecular representations

The tarballs representation_TM-GSspinPlus.tar.gz, representation_tmPHOTO.tar.gz, and representation_OctaKulik.tar.gz contain NumPy arrays of molecular representations used in this work.

For OctaKulik:

HOMO_LUMO_gap/: contains representations for HOMO, LUMO, and HOMO-LUMO gap prediction, built from both low-spin and high-spin geometries with their corresponding spin states
splitting/: contains representations for spin splitting, built from low-spin geometries in the low-spin state

The subdirectory cMBDF_MODA_PC3_MAOC/ contains cMBDF, MODA, and PC3-MAOC representations discussed in the Supporting Information.
Files named refcode-{dataset}.txt provide the refcode ordering for the corresponding NumPy arrays.
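Because the refcode files define the row order of the arrays, a lookup from refcode to row index is a convenient first step. The sketch below assumes that each refcode file lists one refcode per line and that the first axis of the matching array follows the same order; both are assumptions to check against the files themselves. The function names are hypothetical.

```python
from pathlib import Path


def read_refcodes(path):
    """Read refcodes from a text file (assumed layout: one refcode per line)."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def index_by_refcode(refcodes, array):
    """Map each refcode to its row index after checking that the lengths agree.

    `array` can be any sequence with a length, such as a NumPy array whose
    first axis runs over molecules. A repeated refcode keeps its last index.
    """
    if len(refcodes) != len(array):
        raise ValueError(f"{len(refcodes)} refcodes but {len(array)} rows")
    return {refcode: i for i, refcode in enumerate(refcodes)}
```

With NumPy installed, the array itself would be loaded with numpy.load, and a single molecule's representation is then array[index[refcode]], where index is the dictionary returned by index_by_refcode.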

MACE

The tarball MACE.tar.gz contains trained MACE models for intensive property prediction, along with SLURM job scripts and logs for all three datasets.

Models

For each dataset, models are organized by type:

MACE_equivariant/: Equivariant MACE (max_L = 2) with model="MACE"
MACE_invariant/: Invariant MACE (max_L = 0) with model="MACE"
AtomicDipolesMACE/: Equivariant dipole MACE (max_L = 2) with model="AtomicDipolesMACE"
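The three variants differ only in two settings, which can be written down as a small lookup table. This is an illustrative sketch: the dictionary and the helper variant_args are hypothetical, and the assumption that the job scripts pass these settings as --model and --max_L should be checked against the .job files.

```python
# Settings per model directory, copied from the list above.
MODEL_VARIANTS = {
    "MACE_equivariant": {"model": "MACE", "max_L": 2},
    "MACE_invariant": {"model": "MACE", "max_L": 0},
    "AtomicDipolesMACE": {"model": "AtomicDipolesMACE", "max_L": 2},
}


def variant_args(directory):
    """Return the (assumed) command-line arguments for one model directory."""
    settings = MODEL_VARIANTS[directory.rstrip("/")]
    return [f"--model={settings['model']}", f"--max_L={settings['max_L']}"]
```

For example, variant_args("MACE_invariant/") gives the arguments for the invariant model, which makes it easy to confirm that a job script matches the directory it sits in.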

File types

.job: SLURM job scripts (update the local path to your source-built MACE installation and the paths passed to --train_file and --test_file)
.out: training and evaluation logs (optional, for reference)
*_stagetwo.model: final trained models used for reported results and inference
Files with embedding in the name include charge and spin embeddings
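Updating the --train_file and --test_file paths in many job scripts by hand is error-prone, so a small text substitution can help. The sketch below assumes each option appears in the script as "--option value" or "--option=value", with the value optionally quoted; the actual layout of the .job files may differ. The helper set_option is hypothetical.

```python
import re
import shlex


def set_option(script_text, option, value):
    """Replace the value that follows `option` in the text of a job script."""
    # Match the option, a separator ("=" or spaces/tabs), then a quoted or bare value.
    pattern = re.compile(
        re.escape(option) + r"""(=|[ \t]+)("[^"]*"|'[^']*'|\S+)"""
    )
    new_text, count = pattern.subn(
        lambda m: option + m.group(1) + shlex.quote(value), script_text
    )
    if count == 0:
        raise ValueError(f"{option} not found in script")
    return new_text
```

Read a .job file into a string, apply set_option once per option, and write the result to a new file so that the original script is kept for reference.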

3DMol

The tarball 3DMol.tar.gz contains trained 3DMol models and logs for each dataset.

File types

*best_checkpoint.pt: checkpoint of the best-performing model
*.log: training and evaluation log files
Files with emb in the name include charge and spin embeddings
Files with global in the name correspond to the global variant (full molecule)
Files with local in the name correspond to the local variant (metal center only)
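The naming conventions above can be turned into a small helper that labels each file. This is a sketch based only on the substrings listed here; the function name and the example file name in the usage note are hypothetical, and plain substring matching may need tightening if other parts of a file name happen to contain "emb", "global", or "local".

```python
from pathlib import Path


def describe_3dmol_file(filename):
    """Classify a 3DMol file by the naming conventions described above."""
    name = Path(filename).name
    if name.endswith("best_checkpoint.pt"):
        kind = "checkpoint"
    elif name.endswith(".log"):
        kind = "log"
    else:
        kind = "other"
    if "global" in name:
        variant = "global"  # full molecule
    elif "local" in name:
        variant = "local"  # metal center only
    else:
        variant = None
    # "emb" marks models trained with charge and spin embeddings.
    return {"kind": kind, "variant": variant, "embeddings": "emb" in name}
```

For a hypothetical file name such as demo_global_emb_best_checkpoint.pt, the helper reports a checkpoint of the global variant with embeddings.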

Reproducibility

To reproduce results, rerun the command shown after COMMAND> in the corresponding log file.
Update the following arguments as needed:
- --splitter: text or NumPy file containing test set indices
- --dataset: dataset loader path adjusted to your local setup
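The rerun step can be sketched as follows: pull the recorded command out of a log and swap in local values for --splitter and --dataset. The sketch assumes that the log holds a single line containing COMMAND> followed by the full command, which should be verified against the actual logs. Both function names are hypothetical.

```python
import shlex


def extract_command(log_text, marker="COMMAND>"):
    """Return the command recorded after `marker` as a list of arguments."""
    for line in log_text.splitlines():
        if marker in line:
            return shlex.split(line.split(marker, 1)[1])
    raise ValueError(f"no line containing {marker!r} found")


def override_argument(argv, flag, value):
    """Return a copy of argv with the value of `flag` replaced, or appended."""
    result = list(argv)
    for i, token in enumerate(result):
        if token == flag and i + 1 < len(result):
            result[i + 1] = value  # form: --flag value
            return result
        if token.startswith(flag + "="):
            result[i] = f"{flag}={value}"  # form: --flag=value
            return result
    return result + [flag, value]
```

After both arguments are overridden, shlex.join(argv) turns the list back into a single shell command that can be pasted into a terminal or a job script.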