This repository contains all the reference quantum mechanical and machine learning outputs for four molecular datasets and the periodic graphene system, as well as Jupyter notebooks to reproduce the figures in the manuscript D. Suman et al., arXiv:2504.01187 (2025).

Hamiltonian and other derived electronic observables dataset repository

The molecular datasets are QM7, QM9, a polyalkene-polyacene series and a polyenoic acid series. For QM7 we selected a subset of 1000 molecules by farthest point sampling (FPS); for QM9 we randomly selected 200 molecules containing only C, H, N and O atoms.
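As an illustration of how an FPS subselection works, here is a minimal sketch in plain Python. It assumes each molecule is represented by a fixed-length feature vector and uses Euclidean distance; the function name, the toy features and the choice of starting point are hypothetical and are not taken from the scripts in this repository.

```python
import math


def farthest_point_sampling(points, n_select, start=0):
    """Greedily pick n_select indices so that the chosen points are far apart."""
    if not 0 < n_select <= len(points):
        raise ValueError("n_select must be between 1 and len(points)")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < n_select:
        # The next pick is the point farthest from the current selection.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Toy 2D "features" standing in for per-molecule descriptors (assumption).
features = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.0, 2.0)]
subset = farthest_point_sampling(features, 3)
print(subset)
```

In this toy example the near-duplicate point at index 1 is skipped, which is the behaviour that makes FPS useful for picking a diverse subset.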

The data folder includes the computed Fock, overlap and density matrices, along with other derived properties, for two basis sets, STO-3G and def2-TZVP, as well as the outputs of the different trained models:

  • ridge : symmetry-adapted ridge regression model that predicts Hamiltonian matrix elements in the STO-3G basis.
  • target_sto-3g : indirect models that target derived properties, and property-surrogate models that target the dipole moment and polarizability, computed in the STO-3G basis.
  • target_def2-tzvp : indirect models that target derived properties, and property-surrogate models that target the dipole moment and polarizability, computed in the def2-TZVP basis.

For each dataset, the folder is structured as follows:

├── DATASET
│   ├── DATASET.xyz
│   ├── ml
│   │   ├── ridge
│   │   │   ├── seed_250
│   │   │   ├── seed_42
│   │   │   └── seed_87
│   │   ├── target_def2-tzvp
│   │   │   ├── eig
│   │   │   ├── eig_dip
│   │   │   ├── eig_dip_pol
│   │   │   ├── eig_dip_pol_mbo
│   │   │   └── property_model
│   │   └── target_sto-3g
│   │       ├── eig
│   │       ├── eig_dip
│   │       ├── eig_dip_pol
│   │       ├── eig_dip_pol_mbo
│   │       └── property_model
│   └── reference
│       ├── def2-tzvp
│       │   ├── orbitals.hickle
│       │   ├── dip.hickle
│       │   ├── eva.hickle
│       │   ├── fock.hickle
│       │   ├── gap.hickle
│       │   ├── mbo.hickle
│       │   ├── overlap.hickle
│       │   └── pol.hickle
│       └── sto-3g
│           ├── orbitals.hickle
│           ├── dip.hickle
│           ├── eva.hickle
│           ├── fock.hickle
│           ├── gap.hickle
│           ├── mbo.hickle
│           ├── overlap.hickle
│           └── pol.hickle

Each folder includes molecular geometries in DATASET.xyz, machine learning outputs under the ml directory, and reference QM calculations in the reference folder.

The ml directory contains a ridge subfolder with ridge regression models trained to target the Hamiltonian matrices in the STO-3G basis using different random seeds. Each seed folder contains the best saved model and the predicted properties, along with the test indices. The ml directory also includes two subdirectories, target_def2-tzvp and target_sto-3g, which correspond to indirect ML models that target properties derived from a Hamiltonian in the def2-TZVP and STO-3G bases, respectively. Each of these contains subfolders for models that target different properties: eig (eigenvalues), eig_dip (eigenvalues + dipole moments), eig_dip_pol (adding polarizability), and eig_dip_pol_mbo (further adding Mayer bond orders), along with a property_model directory that holds predictions from models trained to directly predict the dipole moment and polarizability. Each of these subfolders has the same content as the ridge seed folders.

The reference directory includes the QM data computed with the def2-tzvp and sto-3g basis sets, with files like eva.hickle (eigenvalues), dip.hickle (dipole moments), pol.hickle (polarizabilities), fock.hickle (Fock matrices), gap.hickle (HOMO-LUMO gaps), mbo.hickle (Mayer bond orders), and overlap.hickle (overlap matrices), as well as the corresponding basis.yaml for each basis set. All model data and reference data are stored as hickle files (.hickle).
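The layout described above can be navigated with a small helper like the following sketch. It assumes the top-level folder is called data and that a dataset folder is named qm7; both names, as well as the helper functions themselves, are illustrative and should be adapted to the actual folder names. The hickle package is imported only when a file is actually loaded, and the structure of the returned object is not assumed here.

```python
from pathlib import Path

# Basis-set folders and reference file stems as listed in the folder tree.
BASES = ("sto-3g", "def2-tzvp")
REFERENCE_FILES = ("orbitals", "dip", "eva", "fock", "gap", "mbo", "overlap", "pol")


def reference_path(data_root, dataset, basis, name):
    """Build the path to one reference file, e.g. DATASET/reference/sto-3g/fock.hickle."""
    if basis not in BASES:
        raise ValueError(f"unknown basis {basis!r}, expected one of {BASES}")
    if name not in REFERENCE_FILES:
        raise ValueError(f"unknown reference file {name!r}")
    return Path(data_root) / dataset / "reference" / basis / f"{name}.hickle"


def load_reference(data_root, dataset, basis, name):
    """Load one reference file with hickle (third-party, imported lazily)."""
    import hickle

    return hickle.load(str(reference_path(data_root, dataset, basis, name)))


# "data" and "qm7" are placeholder names for the data root and dataset folder.
path = reference_path("data", "qm7", "sto-3g", "fock")
print(path.as_posix())
```

With the dataset downloaded, load_reference("data", "qm7", "sto-3g", "fock") would then return the stored Fock matrices for that dataset and basis.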

The data folder also contains data for the periodic example in the graphene folder, which holds the reference data in the SZV and DZVP bases as well as the ML output for the test structure.

The trained models are saved in the trained_models folder, which has a structure similar to that of the data folder.

Data generation and Training scripts

The QM calculations that produce the Fock and overlap matrices and the other derived properties at the B3LYP/def2-TZVP and B3LYP/STO-3G levels of theory were run with PySCF. The script to run those calculations can be found in the scripts folder as pyscf_ref_calc.py.
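For orientation, a calculation of this kind can be sketched with PySCF as follows. This is not a copy of pyscf_ref_calc.py: the function names, the returned dictionary keys (chosen to mirror the reference file names) and the restriction to closed-shell molecules are assumptions made for illustration, and polarizabilities and Mayer bond orders are left out.

```python
def homo_lumo_gap(mo_energy, mo_occ):
    """Lowest unoccupied minus highest occupied orbital energy."""
    occupied = [e for e, n in zip(mo_energy, mo_occ) if n > 0]
    virtual = [e for e, n in zip(mo_energy, mo_occ) if n == 0]
    if not occupied or not virtual:
        raise ValueError("need at least one occupied and one virtual orbital")
    return min(virtual) - max(occupied)


def run_reference(atom, basis="sto-3g"):
    """Run a restricted B3LYP calculation and collect matrices and properties."""
    from pyscf import dft, gto  # third-party, imported lazily

    mol = gto.M(atom=atom, basis=basis)
    mf = dft.RKS(mol)  # closed-shell Kohn-Sham (assumption)
    mf.xc = "b3lyp"
    mf.kernel()
    return {
        "fock": mf.get_fock(),
        "overlap": mf.get_ovlp(),
        "density": mf.make_rdm1(),
        "eva": mf.mo_energy,
        "gap": homo_lumo_gap(mf.mo_energy, mf.mo_occ),
        "dip": mf.dip_moment(),
    }


# The gap helper works on any list of orbital energies and occupations.
example_gap = homo_lumo_gap([-0.5, -0.3, 0.1, 0.4], [2, 2, 0, 0])
```

Calling run_reference("O 0 0 0; H 0 0.76 0.59; H 0 -0.76 0.59", basis="def2-tzvp") would run the same workflow in the larger basis; the water geometry here is only a placeholder input.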

We also provide the Python script for training and evaluating the model in the scripts folder as model_train.py, along with the virtual environment yaml file to install all the packages needed to run the model. This script shows an example of training the eig_dip_pol model, but the loss function can easily be modified to include or exclude properties. We also have a cookbook example that we encourage interested users to try out at https://atomistic-cookbook.org/examples/hamiltonian-qm7/hamiltonian-qm7.html
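To illustrate what including or excluding a property in the loss can look like, here is a minimal sketch of a weighted multi-property loss in plain Python. It is not the loss implemented in model_train.py: the function names, the mean-squared-error form, the weights and the toy numbers are all assumptions, and properties are represented as flat lists for simplicity.

```python
def mse(pred, target):
    """Mean squared error between two equally long flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def combined_loss(pred, target, weights):
    """Weighted sum of per-property MSEs over the properties named in weights."""
    return sum(w * mse(pred[name], target[name]) for name, w in weights.items())


# Toy predictions and targets keyed by property name (hypothetical values).
pred = {"eig": [-0.50, 0.10], "dip": [0.0, 0.0, 1.0], "pol": [5.0]}
target = {"eig": [-0.40, 0.10], "dip": [0.0, 0.0, 1.0], "pol": [7.0]}

# Leaving a property out of the weights drops it from the loss.
loss_eig = combined_loss(pred, target, {"eig": 1.0})
loss_eig_pol = combined_loss(pred, target, {"eig": 1.0, "pol": 0.5})
```

Switching between targets such as eig, eig_dip and eig_dip_pol then amounts to changing which keys appear in the weights dictionary.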

Jupyter notebooks - Figures and Tables

The Jupyter notebooks in the paper_figures_tables folder reproduce the most relevant figures and tables in the paper. Each notebook is prefixed by the corresponding figure number and uses the saved datasets described above.