eQM7 dataset
############


Description
===========

The electron QM7 (eQM7) dataset was created to train and validate polarizable (machine learning) force fields on non-equilibrium configurations of small molecules. It contains 6868 molecules composed of hydrogen, carbon, nitrogen and oxygen. For each molecule, 500 perturbed configurations are constructed using normal mode sampling, torsion sampling, dimer sampling and homogeneous electric fields. Energies, forces and Foster-Boys centers are computed using density functional theory (DFT) with the PBE0 functional and the aug-cc-pVTZ basis set in the ab-initio quantum chemistry code Psi4.

The eQM7 dataset is described and used in the eMLP paper:
M. Cools-Ceuppens, J. Dambre, T. Verstraelen, Modeling electronic response properties with an explicit-electron machine learning potential (in preparation)


Extended xyz file format
========================

All the xyz files in this archive are stored in the so-called extended XYZ file format (https://wiki.fysik.dtu.dk/ase/ase/io/formatoptions.html#extxyz). An example for methane is given below:

    10
    Properties=species:S:1:pos:R:3:Z:I:1:force:R:3 energy=-1101.0905991080365 efield="-0.000858 -0.002183 0.003142"
    C	0.980058	-0.028496	-0.057220	6	1.938766	2.759616	5.318995
    H	2.099262	-0.001267	0.006583	1	-0.937064	-0.131486	-0.241118
    H	0.631799	1.035951	0.006578	1	0.192833	-0.930122	-0.241735
    H	0.623938	-0.532116	0.889808	1	0.282564	0.395127	-1.128097
    H	0.655604	-0.487301	-0.869118	1	-1.477150	-2.093187	-3.708097
    Es	0.980092	-0.028444	-0.057141	99	0.000000	0.000000	0.000000
    Es	1.721255	-0.013024	-0.027028	99	0.000000	0.000000	0.000000
    Es	0.759978	-0.339039	-0.613127	99	0.000000	0.000000	0.000000
    Es	0.745957	0.676872	-0.027036	99	0.000000	0.000000	0.000000
    Es	0.742541	-0.363540	0.556230	99	0.000000	0.000000	0.000000
    
The positions of the Foster-Boys centers are stored as if they were einsteinium atoms (Es) with atomic number 99. Note that all Foster-Boys centers are stored, for both core and valence electrons. The first and fifth columns give the element symbol and the atomic number of each entry. The second to fourth columns specify the positions of the atoms (or Foster-Boys centers) in angstrom, and the last three columns are the forces in eV/angstrom. In the comment line, the energy is given in eV. The keyword 'efield' defines the homogeneous electric field in atomic units.
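
The column layout described above can be read with a few lines of standard-library Python. The block below is only a minimal sketch, assuming every frame follows exactly the layout of the methane example; the names ``parse_frame``, ``FB_CENTER_Z`` and ``AU_FIELD_IN_V_PER_ANGSTROM`` are illustrative and not part of the dataset. The field conversion factor is the CODATA value of one atomic unit of electric field, added here for convenience. A full extended XYZ reader such as the one in ASE is likely the better choice for real work.

```python
import re

# 1 atomic unit of electric field in V/angstrom (CODATA value, rounded).
# This constant is an addition for illustration, not part of the dataset.
AU_FIELD_IN_V_PER_ANGSTROM = 51.4220674763

# Foster-Boys centers are stored as einsteinium (Es), atomic number 99.
FB_CENTER_Z = 99

# The methane example from above, with whitespace-separated columns.
METHANE = """\
10
Properties=species:S:1:pos:R:3:Z:I:1:force:R:3 energy=-1101.0905991080365 efield="-0.000858 -0.002183 0.003142"
C 0.980058 -0.028496 -0.057220 6 1.938766 2.759616 5.318995
H 2.099262 -0.001267 0.006583 1 -0.937064 -0.131486 -0.241118
H 0.631799 1.035951 0.006578 1 0.192833 -0.930122 -0.241735
H 0.623938 -0.532116 0.889808 1 0.282564 0.395127 -1.128097
H 0.655604 -0.487301 -0.869118 1 -1.477150 -2.093187 -3.708097
Es 0.980092 -0.028444 -0.057141 99 0.000000 0.000000 0.000000
Es 1.721255 -0.013024 -0.027028 99 0.000000 0.000000 0.000000
Es 0.759978 -0.339039 -0.613127 99 0.000000 0.000000 0.000000
Es 0.745957 0.676872 -0.027036 99 0.000000 0.000000 0.000000
Es 0.742541 -0.363540 0.556230 99 0.000000 0.000000 0.000000
"""


def parse_frame(text):
    """Parse a single frame laid out like the methane example."""
    lines = text.strip().splitlines()
    n_entries = int(lines[0])
    # key=value pairs of the comment line; values may be double-quoted.
    pairs = re.findall(r'(\w+)=(?:"([^"]*)"|(\S+))', lines[1])
    info = {key: quoted or bare for key, quoted, bare in pairs}
    rows = []
    for line in lines[2:2 + n_entries]:
        fields = line.split()
        rows.append({
            "symbol": fields[0],
            "position": tuple(float(x) for x in fields[1:4]),  # angstrom
            "Z": int(fields[4]),
            "force": tuple(float(x) for x in fields[5:8]),  # eV/angstrom
        })
    return {
        "energy_eV": float(info["energy"]),
        "efield_au": tuple(float(x) for x in info["efield"].split()),
        # Split the entries into real atoms and Foster-Boys centers.
        "nuclei": [r for r in rows if r["Z"] != FB_CENTER_Z],
        "centers": [r for r in rows if r["Z"] == FB_CENTER_Z],
    }


frame = parse_frame(METHANE)
print(len(frame["nuclei"]), "atoms and",
      len(frame["centers"]), "Foster-Boys centers")

# Electric field converted from atomic units to V/angstrom.
efield_v_per_angstrom = [c * AU_FIELD_IN_V_PER_ANGSTROM
                         for c in frame["efield_au"]]
print(efield_v_per_angstrom)
```

Filtering on the atomic number rather than on the symbol keeps the split between atoms and centers independent of how a reader capitalizes the species column.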


Files in this archive
=====================

eQM7.tar.gz: 
------------
The full eQM7 dataset, containing 3,434,000 ab-initio calculations (6868 molecules with 500 configurations each). The archive contains 6868 directories, one per molecule. The index of each molecule is the index of the same molecule in the original QM7 paper (https://doi.org/10.1103/PhysRevLett.108.058301, http://quantum-machine.org/datasets/). Each directory holds four extended xyz files, named after the perturbation technique used to generate the data. Together, these four files contain the 500 ab-initio calculations for that molecule. More information can be found in the eMLP paper.
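
A downloaded copy can be checked against these numbers by counting the frames per molecule directory. The following is a rough sketch, not a tool shipped with the dataset: it assumes that the extended xyz files carry an ``.xyz`` suffix and that each file sits directly inside its molecule directory. The function names ``count_frames`` and ``frames_per_directory`` are hypothetical.

```python
import io
import posixpath
import tarfile
from collections import Counter


def count_frames(lines):
    """Count the frames in an extended xyz file given as an iterable of lines."""
    lines = iter(lines)
    n_frames = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between frames
        n_entries = int(line)
        # Skip the comment line and the n_entries atom/center lines.
        for _ in range(n_entries + 1):
            next(lines)
        n_frames += 1
    return n_frames


def frames_per_directory(archive_path):
    """Return a Counter mapping each directory in the archive to its frame count."""
    counts = Counter()
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            # Assumption: the data files are regular files ending in ".xyz".
            if member.isfile() and member.name.endswith(".xyz"):
                raw = archive.extractfile(member)
                with io.TextIOWrapper(raw, encoding="utf-8") as handle:
                    directory = posixpath.dirname(member.name)
                    counts[directory] += count_frames(handle)
    return counts
```

Calling ``frames_per_directory("eQM7.tar.gz")`` should, if the assumptions above hold, return 6868 entries with a count of 500 each. Any directory with a different count would point to a truncated download or to a layout that differs from the one assumed here.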

hessians.tar.gz:
----------------
An archive containing the Hessians and optimized geometries for each of the 6868 molecules in the eQM7 dataset. Each molecule directory contains two files. The first file, 'minimum.xyz', is the extended xyz file containing the optimized geometry and the Foster-Boys centers. The second file, 'hessian_data.npz', is a zipped NumPy archive (https://numpy.org/doc/stable/reference/generated/numpy.load.html#numpy.load) containing arrays with the positions, masses, gradient, elements, Hessian and energy of the same optimized geometry. All quantities are in atomic units, except for the positions, which are stored in angstrom.
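
Because the extended xyz files use eV and angstrom while 'hessian_data.npz' uses atomic units, comparing the two requires a unit conversion. The sketch below shows one way to do this. The constants are CODATA 2018 values and the function names are illustrative; neither comes from the dataset. The functions only multiply by a scalar, so they accept plain floats as well as NumPy arrays.

```python
# CODATA 2018 values (rounded); added for illustration, not part of the dataset.
HARTREE_IN_EV = 27.211386245988
BOHR_IN_ANGSTROM = 0.529177210903


def energy_to_ev(energy_hartree):
    """Convert an energy from hartree to eV."""
    return energy_hartree * HARTREE_IN_EV


def gradient_to_ev_per_angstrom(gradient_au):
    """Convert a gradient from hartree/bohr to eV/angstrom."""
    return gradient_au * (HARTREE_IN_EV / BOHR_IN_ANGSTROM)


def forces_from_gradient(gradient_au):
    """Forces in eV/angstrom are minus the converted gradient."""
    return -gradient_to_ev_per_angstrom(gradient_au)


def hessian_to_ev_per_angstrom2(hessian_au):
    """Convert a Hessian from hartree/bohr^2 to eV/angstrom^2."""
    return hessian_au * (HARTREE_IN_EV / BOHR_IN_ANGSTROM ** 2)


# One atomic unit of each quantity, expressed in the units of the xyz files.
print(energy_to_ev(1.0))
print(gradient_to_ev_per_angstrom(1.0))
print(hessian_to_ev_per_angstrom2(1.0))
```

The exact array names inside the archive are not listed here. After ``data = numpy.load("hessian_data.npz")``, the attribute ``data.files`` lists them.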

reference_hessians.tar.gz:
--------------------------
An archive containing the Hessians and optimized geometries of the reference molecules for the eMLP. The archive is structured in the same way as hessians.tar.gz.