Data for: Differentiable sampling of molecular geometries with uncertainty-based adversarial attacks

This is the dataset for the publication "Differentiable sampling of molecular geometries with uncertainty-based adversarial attacks", by D. Schwalbe-Koda, A.R. Tan, R. Gómez-Bombarelli. The repository contains the simulation data used to train the neural network potentials, as well as the adversarial attacks. The contents of the datasets are:

  • zeolite.json: contains all structures, with their DFT energies and forces, for both unloaded zeolite frameworks and zeolite-molecule pairs.
  • ala2.json: contains all geometries, energies and forces for alanine dipeptide calculated with the OPLS force field.
  • ammonia.json: contains all geometries, energies and forces for the ammonia molecule, as calculated with DFT.

In all datasets, energies are given in kcal/mol and forces in kcal/mol/Å. Positions and lattice parameters are given in Å. A detailed description of the columns in these datasets is shown below.
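
For users who prefer eV and eV/Å, the conversion can be sketched as follows. The factor 1 eV = 23.060548 kcal/mol is a standard constant and is not stored in the datasets; the function names are illustrative and not part of the repository.

```python
# 1 eV expressed in kcal/mol (standard conversion factor, not from the dataset).
KCAL_PER_MOL_PER_EV = 23.060548

def energy_to_ev(energy_kcal_mol):
    """Convert an energy from kcal/mol to eV."""
    return energy_kcal_mol / KCAL_PER_MOL_PER_EV

def forces_to_ev_per_angstrom(forces_kcal_mol_ang):
    """Convert a list of per-atom force vectors from kcal/mol/Å to eV/Å."""
    return [[f / KCAL_PER_MOL_PER_EV for f in row] for row in forces_kcal_mol_ang]
```

Only the energy unit changes in the force conversion, since lengths are already in Å.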

Zeolite dataset

All DFT energies and forces were calculated with VASP 5.4.4, using PBE-D3 (see Methods). The dataset contains the following information:

  • dataset_name: name of the method used to create the pose. Possible values:
    • MD: geometry corresponding to a single frame sampled from an MD trajectory.
    • random_displacement: geometry created by randomly displacing the atoms (see Methods of the manuscript)
    • adv_attack: adversarial attack on the ground state geometry
    • NNMD: geometry created by performing an NN-based MD simulation, then calculating the DFT energies/forces on a few sampled frames
    • ground_state: geometry obtained by optimizing the pose of the zeolite and its organic structure-directing agent (OSDA) using DFT simulations.
  • zeolite: IZA code of the zeolite framework
  • molecule: SMILES string of the OSDA docked inside of the zeolite (None if we only have a pure-silica framework)
  • loading: number of OSDAs docked in that particular pose of the zeolite
  • num_atoms: number of atoms of the geometry
  • nxyz: atomic number (first column) and xyz coordinates (in Å, second to fourth columns) of each atom
  • lattice: lattice matrix (in Å) of the structure
  • energy: DFT energy of the configuration (in kcal/mol)
  • forces: DFT forces calculated for each atom (in kcal/mol/Å)
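
As a rough illustration of how these fields fit together, the sketch below splits nxyz into atomic numbers and positions and checks that the array shapes agree. It assumes each entry is available as a dictionary keyed by the field names above and that the lattice is a 3×3 matrix; the actual layout of zeolite.json should be checked after loading. The helper names and the toy record are made up for this example and do not come from the dataset.

```python
import numpy as np

def split_nxyz(nxyz):
    """Split an (N, 4) nxyz array into atomic numbers and positions (in Å)."""
    arr = np.asarray(nxyz, dtype=float)
    atomic_numbers = arr[:, 0].astype(int)  # first column
    positions = arr[:, 1:4]                 # second to fourth columns
    return atomic_numbers, positions

def shapes_are_consistent(record):
    """Check that num_atoms, nxyz, forces and lattice agree in shape."""
    numbers, positions = split_nxyz(record["nxyz"])
    forces = np.asarray(record["forces"], dtype=float)
    lattice = np.asarray(record["lattice"], dtype=float)
    return (
        len(numbers) == record["num_atoms"]
        and forces.shape == positions.shape
        and lattice.shape == (3, 3)  # assumption: 3x3 lattice matrix
    )

# Toy record with made-up numbers, only to illustrate the layout.
toy = {
    "num_atoms": 2,
    "nxyz": [[14, 0.0, 0.0, 0.0], [8, 1.6, 0.0, 0.0]],
    "forces": [[0.1, 0.0, 0.0], [-0.1, 0.0, 0.0]],
    "lattice": [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]],
}
numbers, positions = split_nxyz(toy["nxyz"])
print(numbers, positions.shape, shapes_are_consistent(toy))
```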

Alanine dipeptide dataset

All energies and forces were calculated using the OPLS force field, as implemented in OpenMM (see Methods).

  • nxyz: atomic number (first column) and xyz coordinates (in Å, second to fourth columns) of each atom
  • energy: energy of the configuration (in kcal/mol) as computed with OPLS
  • forces: forces calculated for each atom (in kcal/mol/Å) as computed with OPLS
  • phi: collective variable phi, in degrees
  • psi: collective variable psi, in degrees
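
The phi and psi columns can be used to see which regions of the Ramachandran plane are covered by the data. A minimal sketch, assuming both angles lie in the range [-180, 180] degrees (this convention is an assumption and should be verified against the data); the function name and bin count are illustrative.

```python
import numpy as np

def ramachandran_histogram(phi_deg, psi_deg, bins=36):
    """Count configurations on a regular (phi, psi) grid, in degrees."""
    edges = np.linspace(-180.0, 180.0, bins + 1)
    counts, _, _ = np.histogram2d(phi_deg, psi_deg, bins=[edges, edges])
    return counts, edges
```

Here counts[i, j] is the number of geometries whose phi falls in bin i and psi in bin j; empty bins point to regions of the collective-variable space that are not sampled.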

Ammonia dataset

All energies and forces were calculated using BP86-D3, as implemented in ORCA (see Methods).

  • dataset_name: name of the dataset used to train each model. Possible values:
    • gen1: geometries generated from Hessian displacements of the ammonia molecule
    • gen2: gen1 plus adversarial attacks from the first generation of NN potentials
    • gen3: gen2 plus adversarial attacks from the second generation of NN potentials
    • random: geometries generated by randomly displacing atomic coordinates by 0.1 and 0.3 Å.
  • nxyz: atomic number (first column) and xyz coordinates (in Å, second to fourth columns) of each atom
  • energy: energy of the configuration (in kcal/mol)
  • forces: forces calculated for each atom (in kcal/mol/Å)
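
To inspect or select subsets by dataset_name, something along these lines may be useful. It assumes the entries have been loaded as an iterable of dictionaries keyed by the field names above; the file layout is not specified here, so this is a sketch rather than a loader for ammonia.json. The helper names are hypothetical.

```python
from collections import Counter

def count_by_dataset(records):
    """Count how many geometries carry each dataset_name label."""
    return Counter(r["dataset_name"] for r in records)

def select_datasets(records, names):
    """Keep only the records whose dataset_name is in `names`."""
    wanted = set(names)
    return [r for r in records if r["dataset_name"] in wanted]
```

For example, select_datasets(records, ["random"]) would return only the randomly displaced geometries.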