This repository contains all the reference quantum mechanical and machine learning outputs for four molecular datasets and the periodic graphene system, as well as Jupyter notebooks to reproduce the figures in the manuscript D. Suman et al., arXiv:2504.01187 (2025).
The molecular datasets are QM7, QM9, a polyalkene-polyacene series, and a polyenoic acid series. For QM7 we selected a subset of 1000 molecules by farthest point sampling (FPS), and for QM9 we randomly selected 200 molecules containing only C, H, N and O atoms.
The data folder includes computed Fock, overlap and density matrices, along with other derived properties, for two different basis sets, STO-3G and def2-TZVP, as well as the outputs of the different trained models.
For each dataset, the folder is structured as follows:
├── DATASET
│   ├── DATASET.xyz
│   ├── ml
│   │   ├── ridge
│   │   │   ├── seed_250
│   │   │   ├── seed_42
│   │   │   └── seed_87
│   │   ├── target_def2-tzvp
│   │   │   ├── eig
│   │   │   ├── eig_dip
│   │   │   ├── eig_dip_pol
│   │   │   ├── eig_dip_pol_mbo
│   │   │   └── property_model
│   │   └── target_sto-3g
│   │       ├── eig
│   │       ├── eig_dip
│   │       ├── eig_dip_pol
│   │       ├── eig_dip_pol_mbo
│   │       └── property_model
│   └── reference
│       ├── def2-tzvp
│       │   ├── orbitals.hickle
│       │   ├── dip.hickle
│       │   ├── eva.hickle
│       │   ├── fock.hickle
│       │   ├── gap.hickle
│       │   ├── mbo.hickle
│       │   ├── overlap.hickle
│       │   └── pol.hickle
│       └── sto-3g
│           ├── orbitals.hickle
│           ├── dip.hickle
│           ├── eva.hickle
│           ├── fock.hickle
│           ├── gap.hickle
│           ├── mbo.hickle
│           ├── overlap.hickle
│           └── pol.hickle
Each folder includes the molecular geometries in DATASET.xyz, the machine learning outputs under the ml directory, and the reference QM calculations in the reference folder.
The ml directory contains a ridge subfolder with ridge regression models trained to target the Hamiltonian matrices in the STO-3G basis using different random seeds. Each of these seed folders consists of the best saved model and the predicted properties, along with the test indices. The ml directory also includes two subdirectories, target_def2-tzvp and target_sto-3g, which correspond to indirect ML models that target properties derived from a Hamiltonian in two different bases, namely def2-TZVP and STO-3G. Each of these contains subfolders for models that target different properties: eig (eigenvalues), eig_dip (eigenvalues + dipole moments), eig_dip_pol (adding polarizability), and eig_dip_pol_mbo (further adding Mayer bond orders), along with a property_model directory that holds predictions from models trained to predict the dipole moment and polarizability directly. Each of these subfolders has the same content as the ridge folder.
The reference directory includes the QM data computed with the def2-tzvp and sto-3g basis sets, with files such as eva.hickle (eigenvalues), dip.hickle (dipole moments), pol.hickle (polarizabilities), fock.hickle (Fock matrices), gap.hickle (HOMO-LUMO gaps), mbo.hickle (Mayer bond orders), and overlap.hickle (overlap matrices), as well as the corresponding basis.yaml for each basis set. All model and reference data are stored as hickle files (.hickle).
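As an illustration of the layout described above, the following minimal sketch builds the path to one reference file and loads it. It assumes that the hickle Python package is installed and that the archive has been unpacked into a local data directory. The helper names reference_path and load_reference are hypothetical, and "qm7" is only an assumed value for the DATASET folder name. The structure of the loaded object is not documented here, so the sketch only prints its type.

```python
from pathlib import Path

# Basis-set folder names and file stems as listed in the directory tree.
BASES = ("sto-3g", "def2-tzvp")
PROPERTIES = ("orbitals", "dip", "eva", "fock", "gap", "mbo", "overlap", "pol")


def reference_path(root, dataset, basis, prop):
    """Return the path root/DATASET/reference/BASIS/PROP.hickle."""
    if basis not in BASES:
        raise ValueError(f"unknown basis {basis!r}, expected one of {BASES}")
    if prop not in PROPERTIES:
        raise ValueError(f"unknown property {prop!r}, expected one of {PROPERTIES}")
    return Path(root) / dataset / "reference" / basis / f"{prop}.hickle"


def load_reference(root, dataset, basis, prop):
    """Load one reference file with hickle (imported lazily)."""
    import hickle  # third-party package, assumed to be installed

    return hickle.load(str(reference_path(root, dataset, basis, prop)))


if __name__ == "__main__":
    # "data" and "qm7" are placeholders for the local root and DATASET name.
    path = reference_path("data", "qm7", "sto-3g", "fock")
    print(path)
    if path.exists():
        fock = load_reference("data", "qm7", "sto-3g", "fock")
        print(type(fock))
```

Validating the basis and property names up front means a typo fails with a clear message rather than with a missing-file error further down.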
The data folder also contains data for the periodic example in the graphene folder, which holds the reference data in the SZV and DZVP bases as well as the ML output for the test structure.
The trained models are saved in the trained_models folder, which has a structure similar to that of the data folder.
The QM calculations that produce the Fock and overlap matrices and the other relevant derived properties at the B3LYP/def2-TZVP and B3LYP/STO-3G levels of theory were run with PySCF. The script used to run those calculations is available in the scripts folder as pyscf_ref_calc.py.
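For orientation, a single-molecule B3LYP calculation of this kind can be sketched with PySCF as follows. This is not the repository's pyscf_ref_calc.py, whose settings are not reproduced here. The helper name run_reference and the keys of the returned dictionary are assumptions chosen to mirror the reference file names, and the gap is computed for a closed-shell molecule only.

```python
def run_reference(atom, basis="sto-3g"):
    """Run a restricted B3LYP calculation and collect matrices and properties."""
    from pyscf import dft, gto  # third-party package, assumed to be installed

    # Build the molecule; `atom` is a PySCF geometry string.
    mol = gto.M(atom=atom, basis=basis, verbose=0)

    # Restricted Kohn-Sham DFT with the B3LYP functional.
    mf = dft.RKS(mol)
    mf.xc = "b3lyp"
    mf.kernel()

    n_occ = mol.nelectron // 2  # closed shell: number of doubly occupied orbitals
    return {
        "fock": mf.get_fock(),        # Fock matrix in the atomic-orbital basis
        "overlap": mf.get_ovlp(),     # overlap matrix
        "eva": mf.mo_energy,          # orbital eigenvalues
        "gap": mf.mo_energy[n_occ] - mf.mo_energy[n_occ - 1],  # HOMO-LUMO gap
        "dip": mf.dip_moment(unit="Debye", verbose=0),         # dipole moment
    }
```

For example, run_reference("O 0 0 0; H 0 0.757 0.587; H 0 -0.757 0.587", basis="def2-tzvp") would run the same calculation for an illustrative water geometry in the larger basis.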
We also provide the Python script for training and evaluating the model, in the scripts folder as model_train.py, along with the virtual environment YAML file for installing all the packages needed to run the model. This script shows an example of training the eig_dip_pol model, but the loss function can easily be modified to include or exclude properties. We also have a cookbook example that we encourage interested users to try out at https://atomistic-cookbook.org/examples/hamiltonian-qm7/hamiltonian-qm7.html
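The idea of adding or removing properties from the loss can be sketched as a weighted sum of per-property errors. The following is a hypothetical, dependency-free illustration and not the loss implemented in model_train.py: the function names, the use of mean squared error, and the default unit weights are all assumptions. The only detail taken from the description above is that model names such as eig_dip_pol list the targeted properties separated by underscores.

```python
def targets_from_name(model_name):
    """Split a model name such as 'eig_dip_pol' into its target properties."""
    return model_name.split("_")


def mse(pred, ref):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred)


def combined_loss(pred, ref, targets, weights=None):
    """Weighted sum of per-property MSEs over the selected targets.

    Properties without an entry in `weights` get a weight of 1.0.
    """
    weights = weights or {}
    return sum(weights.get(t, 1.0) * mse(pred[t], ref[t]) for t in targets)


if __name__ == "__main__":
    # Toy numbers, for illustration only.
    pred = {"eig": [-0.5, 0.1], "dip": [0.0, 0.0, 1.0], "pol": [5.0]}
    ref = {"eig": [-0.4, 0.1], "dip": [0.0, 0.0, 1.2], "pol": [5.0]}
    # Dropping "pol" from the model name removes that term from the loss.
    for name in ("eig_dip_pol", "eig_dip"):
        print(name, combined_loss(pred, ref, targets_from_name(name)))
```

Keeping the target list separate from the error function means that including or excluding a property changes only the list passed to combined_loss.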
Jupyter notebooks that reproduce the most relevant figures and tables in the paper are provided in the paper_figures_tables folder and are prefixed by the figure numbers. They use the saved datasets described above.