The tarball data.tar.gz contains all data, including XYZ files, property CSV files, extended XYZ files (for MACE), and dataset splits. It includes three datasets: TM-GSspinPlus, tmPHOTO, and Octa-MK.
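Before extracting, it can help to confirm which top-level entries the archive actually holds. The following is a minimal sketch using Python's standard tarfile module; the helper name top_level_entries is hypothetical, and whether the three dataset directories sit at the archive root or under a wrapping folder is an assumption you should check against the output.

```python
import tarfile
from pathlib import PurePosixPath


def top_level_entries(tar_path):
    """Return the sorted first path components of all members in a .tar.gz archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        names = tar.getnames()
    tops = set()
    for name in names:
        # Drop "." and "/" so that "./A/x" and "A/x" map to the same entry.
        parts = [p for p in PurePosixPath(name).parts if p not in (".", "/")]
        if parts:
            tops.add(parts[0])
    return sorted(tops)
```

Calling top_level_entries("data.tar.gz") lists the top-level names without unpacking anything to disk.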
The TM-GSspinPlus directory contains:
0-xyz/: DFT-optimized geometries with only hydrogen atoms optimized in the lowest-spin state (singlet or doublet)
1-extended_xyz/: Extended XYZ files for MACE with 10 train/test splits for 10-fold cross-validation (CV)
2-cv10-splits/: 10-fold CV index files (indices correspond to row numbers starting from 0 in TM-GSspinPlus_property.csv)
TM-GSspinPlus_property.csv: Dataset properties, including:
  refcode: Cambridge Structural Database (CSD) refcode
  total_charge: Total molecular charge
  multiplicity: Ground-state spin multiplicity
  splitting: Vertical spin-splitting energy (kcal/mol)
  HOMO: HOMO energy (eV)
  LUMO: LUMO energy (eV)
  gap: HOMO-LUMO gap (eV)
  dipole_moment_Debye: Dipole moment magnitude (Debye)
The tmPHOTO directory contains:
0-xyz/: GFN2-xTB optimized geometries (singlet state)
1-extended_xyz/: Extended XYZ files for MACE with 10 train/test splits for 10-fold CV
2-cv10-splits/: 10-fold CV index files (indices correspond to row numbers starting from 0 in tmPHOTO_property.csv)
tmPHOTO_property.csv: Dataset properties, including:
  refcode: CSD refcode
  total_charge: Total molecular charge
  multiplicity: Spin multiplicity used in DFT computations (all singlet, 1)
  HOMO: HOMO energy (Hartree)
  LUMO: LUMO energy (Hartree)
  gap: HOMO-LUMO gap (Hartree)
  dipole_moment_Debye: Dipole moment magnitude (Debye)
The Octa-MK directory contains:
0-xyz/: DFT-optimized geometries in low-spin (*_ls.xyz) or high-spin (*_hs.xyz) states
1-extended_xyz/: Extended XYZ files for MACE
  HOMO_LUMO_gap/: 10 train/test splits for HOMO, LUMO, and HOMO-LUMO gap
  splitting/: 10 train/test splits for spin-splitting energy; low-spin optimized geometries and low-spin multiplicities are used
  train_valid/: Train/validation split from the reference paper
  LS correspond to spin-splitting energy targets
2-dataset_splits/: contains two subdirectories
  HOMO_LUMO_gap/: indices corresponding to rows (0-based) in Octa-MK_property_HOMO_LUMO_gap.csv
  splitting/: indices corresponding to rows (0-based) in Octa-MK_property_splitting.csv
  2-cv10-splits/ (10-fold CV indices in this work) and 3-train-valid-meyer/ (train/validation indices from the reference paper)
Octa-MK_train_valid_merged_clean.csv:
The training data and
validation data from the reference paper were merged into a single dataset.
Refcodes were assigned in this work for convenience. For example, the complex
cr_3_[O-]#[C+]_[O-]#[C+]_[O-]#[C+]_[O-]#[C+]_[O-]#[C+]_[O-]#[C+] in the original training set
is assigned the refcode train_0116, with geometries train_0116_ls.xyz (low-spin optimized)
and train_0116_hs.xyz (high-spin optimized).
This file shares several columns with Octa-MK_property_splitting.csv, including multiplicity, low_spin, and high_spin.
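The refcode naming convention described above (train_0116 maps to train_0116_ls.xyz and train_0116_hs.xyz) can be sketched as a small helper. This is an illustrative sketch only: the function names are hypothetical, and the default xyz_dir assumes the archive was extracted so that the geometries sit directly in Octa-MK/0-xyz, which may not match your layout.

```python
from pathlib import Path


def geometry_filenames(refcode):
    """Return the (low-spin, high-spin) geometry file names for an Octa-MK refcode."""
    return f"{refcode}_ls.xyz", f"{refcode}_hs.xyz"


def geometry_paths(refcode, xyz_dir="Octa-MK/0-xyz"):
    """Join both file names onto xyz_dir (assumed extraction location, adjust as needed)."""
    low, high = geometry_filenames(refcode)
    base = Path(xyz_dir)
    return base / low, base / high
```

For example, geometry_filenames("train_0116") gives the two file names quoted in the text, and geometry_paths adds the directory prefix you pass in.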
Octa-MK_property_HOMO_LUMO_gap.csv:
  refcode: Refcode assigned in this work
  total_charge: Total molecular charge
  multiplicity: Spin multiplicity used in the computation
  HOMO: HOMO energy (eV)
  LUMO: LUMO energy (eV)
  gap: HOMO-LUMO gap (eV)
Octa-MK_property_splitting.csv:
  refcode: Refcode assigned in this work
  total_charge: Total molecular charge
  multiplicity: Spin multiplicity of the energetically preferred spin state (low_spin if splitting > 0, high_spin if splitting < 0)
  splitting: Adiabatic spin-splitting energy (kcal/mol)
  low_spin: Spin multiplicity of the low-spin state
  high_spin: Spin multiplicity of the high-spin state
Note: To prepare extended XYZ files for MACE, all energy values are converted to eV.
The following unit conversions are used: