Dataset of hydrocarbons

This dataset contains the molecules analyzed in "Electronic excited states from physically-constrained machine learning" by Edoardo Cignoni, Divya Suman, Jigyasa Nigam, Lorenzo Cupellini, Benedetta Mennucci, and Michele Ceriotti [1]. For every conformation of each molecule, the Fock matrix and the overlap matrix are available, together with the definition of the orbitals associated with each atom. The Fock and overlap matrices are computed with B3LYP/def2-TZVP and B3LYP/STO-3G, and are stored in the b3lypg_def2tzvp and b3lypg_sto3g subfolders of each molecule's folder. The coordinates are reported as an xyz file.

For each molecule, the folder is structured as follows:

MOLECULE_NAME
│
├── b3lypg_def2tzvp
│   ├── focks.npy : Fock matrices in B3LYP/def2-TZVP basis
│   ├── orbs.json : Orbitals in B3LYP/def2-TZVP basis
│   └── ovlps.npy : Overlap matrices in B3LYP/def2-TZVP basis
├── b3lypg_sto3g
│   ├── focks.npy : Fock matrices in B3LYP/STO-3G basis
│   ├── orbs.json : Orbitals in B3LYP/STO-3G basis
│   └── ovlps.npy : Overlap matrices in B3LYP/STO-3G basis
│
└── MOLECULE_NAME.xyz : N distorted coordinates for MOLECULE_NAME
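The layout above suggests a simple way to read one level of theory for one molecule. The following is a minimal sketch, not part of the dataset: the helper name load_level is hypothetical, and the array shapes and the internal structure of orbs.json are not described here, so they should be inspected after loading rather than assumed.

```python
import json
from pathlib import Path

import numpy as np


def load_level(molecule_dir, level="b3lypg_sto3g"):
    """Load Fock matrices, overlap matrices and orbital definitions.

    molecule_dir: path to a MOLECULE_NAME folder.
    level: subfolder name, "b3lypg_sto3g" or "b3lypg_def2tzvp".
    """
    folder = Path(molecule_dir) / level
    # Assumption: the .npy files hold plain numeric arrays. If they were
    # saved as object arrays, np.load would need allow_pickle=True.
    focks = np.load(folder / "focks.npy")
    ovlps = np.load(folder / "ovlps.npy")
    # orbs.json is parsed as generic JSON; its schema is not assumed here.
    with open(folder / "orbs.json") as handle:
        orbs = json.load(handle)
    return focks, ovlps, orbs
```

For example, load_level("ethane", "b3lypg_def2tzvp") would read the def2-TZVP data, provided the folder is named after the molecule. Checking focks.shape and type(orbs) first is a safe way to learn the actual layout.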

For the molecules in the training set (ethane, ethene, butadiene, hexane, hexatriene, isoprene, styrene) there are 1000 conformations. For octatetraene and decapentaene there are 100 conformations. For the polyalkenes (dodecahexaene, tetradecaheptaene, hexadecaoctaene, octadecanonaene, eicosadecaene) the optimized geometry at B3LYP/6-31G(d) is reported. For the aromatic molecules (benzene, azulene, naphthalene, biphenyl) the optimized geometry at B3LYP/6-31G(d) is reported as well.
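The number of conformations in a MOLECULE_NAME.xyz file can be checked with a small reader. The sketch below assumes the conformations are stored as concatenated frames in the standard XYZ layout (atom count, comment line, then one "symbol x y z" line per atom); the function name read_xyz_frames is hypothetical.

```python
import numpy as np


def read_xyz_frames(path):
    """Return a list of (symbols, coords) tuples, one per XYZ frame."""
    with open(path) as handle:
        lines = handle.read().splitlines()

    frames = []
    i = 0
    while i < len(lines):
        # Skip blank lines between frames or at the end of the file.
        if not lines[i].strip():
            i += 1
            continue
        n_atoms = int(lines[i].split()[0])
        # Line i + 1 is the comment line; atom lines follow it.
        atom_lines = lines[i + 2 : i + 2 + n_atoms]
        if len(atom_lines) != n_atoms:
            raise ValueError(f"truncated frame starting at line {i + 1}")
        symbols = [line.split()[0] for line in atom_lines]
        coords = np.array(
            [[float(x) for x in line.split()[1:4]] for line in atom_lines]
        )
        frames.append((symbols, coords))
        i += 2 + n_atoms
    return frames
```

With this reader, len(read_xyz_frames("ethane.xyz")) would be expected to match the number of conformations stated above, if the file follows the assumed layout.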

For anthracene, only the coordinates are reported (i.e., no Fock/overlap matrices). The coordinates of anthracene are extracted from a DFTB3/3OB simulation in the NVT ensemble, as described in Ref. [1], for a total of 100,000 configurations. Alongside the coordinates, there are two folders, ml and tzvp. The tzvp folder contains the spectral density (sds), the vibronic spectrum (vib), and the spectrum with the added disorder (spec) for the 10 ps chunk of the MD trajectory between 10 ps and 20 ps. The ml folder contains the same files for all the 10 ps windows of the MD trajectory, together with the same quantities averaged over the windows (avg in the file name).

The QM calculations to obtain the Fock and overlap matrices at the B3LYP/def2-TZVP and B3LYP/STO-3G levels of theory were run with PySCF. The script used to run those calculations can be found in the halex GitHub repository published alongside Ref. [1] (full URL: https://github.com/ecignoni/halex/blob/main/scripts/run_pyscf_spherical.py).

The same script is provided inside the folder for reference (run_pyscf_spherical.py).
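As an illustration of how one Fock/overlap pair from the dataset can be used, the sketch below solves the generalized eigenvalue problem F C = S C e by symmetric (Löwdin) orthogonalization, using NumPy only. It assumes real symmetric matrices and a positive-definite overlap, and it is a generic textbook procedure, not the workflow of Ref. [1]; the function name orbital_energies is hypothetical.

```python
import numpy as np


def orbital_energies(fock, overlap):
    """Solve F C = S C e for one (fock, overlap) pair.

    Returns the eigenvalues in ascending order and the coefficient
    matrix C, normalized so that C.T @ S @ C is the identity.
    """
    # Build S^(-1/2) from the eigendecomposition of the overlap matrix.
    s_vals, s_vecs = np.linalg.eigh(overlap)
    s_inv_sqrt = (s_vecs / np.sqrt(s_vals)) @ s_vecs.T
    # Transform the Fock matrix to the orthogonalized basis and diagonalize.
    fock_ortho = s_inv_sqrt @ fock @ s_inv_sqrt
    energies, coeffs_ortho = np.linalg.eigh(fock_ortho)
    # Back-transform the eigenvectors to the original, non-orthogonal basis.
    coeffs = s_inv_sqrt @ coeffs_ortho
    return energies, coeffs
```

If focks.npy and ovlps.npy are indexed by conformation along the first axis (an assumption to verify on the loaded arrays), orbital_energies(focks[0], ovlps[0]) would give the orbital energies of the first conformation.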

Jupyter notebooks - Figures

A set of Jupyter notebooks reproduces the most relevant figures in the paper (supporting information excluded). These notebooks are contained in the paper_figures folder. They use the dataset described above and reproduce the figures end-to-end (i.e., they load the trained model parameters, perform the prediction, ...). The trained model is contained in the folder paper_figures/trained_model.

References

[1] Cignoni, E., Suman, D., Nigam, J., Cupellini, L., Mennucci, B., & Ceriotti, M. (2023). Electronic excited states from physically-constrained machine learning. arXiv preprint arXiv:2311.00844.