This is the dataset for "Structure Determination of an Amorphous Drug through Large-Scale NMR Predictions"
Authors: Manuel Cordova, Martins Balodis, Albert Hofstetter, Federico Paruzzo, Sten O. Nilsson Lill, Emma S. E. Eriksson, Pierrick Berruyer, Bruno Simões de Almeida, Michael J. Quayle, Stefan T. Norberg, Anna Svensk Ankarberg, Staffan Schantz, Lyndon Emsley
The dataset is divided into three main subfolders described below.
This directory contains all experimental and computational data used.
This directory contains the CIF files for the 10 lowest energy CSP candidates considered for NMR crystallography. It also contains the structure determined by X-ray crystallography (file "AZ16A226.cif"), and the CIF files of the tautomers of candidate #1 considered (in the subfolder "Tautomers").
This directory contains NumPy arrays of the formation energies (Ef) of intermolecular complexes. "Ec_" refers to the energy of the probe (central) molecule, "Ee_" to the energy of the environment, and "Et_" to the energy of the intermolecular complex. "Ef_" refers to the formation energy defined as Et - Ec - Ee, and "Ef_[...]_2" refers to the formation energy defined as Et - Ee.
The file names follow the convention "[type of energy]_[probe/central molecule](_2)_w_[water content].npy"
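As a minimal sketch of how the naming convention and the two formation energy definitions fit together, the snippet below rebuilds both formation energies from the component arrays. The helper names, the probe label "mol" and the water label "4" used in the usage note are hypothetical placeholders; the actual labels should be read from the directory listing. The sketch also assumes that the Et, Ec and Ee arrays for a given probe and water content share the same shape.

```python
import os
import numpy as np


def energy_file(kind, probe, water, variant_2=False):
    """Build a file name of the form "[type]_[probe](_2)_w_[water].npy"."""
    suffix = "_2" if variant_2 else ""
    return "{}_{}{}_w_{}.npy".format(kind, probe, suffix, water)


def formation_energies(directory, probe, water):
    """Return (Ef, Ef_2) recomputed from the stored Et, Ec and Ee arrays."""
    def load(kind):
        return np.load(os.path.join(directory, energy_file(kind, probe, water)))

    Et, Ec, Ee = load("Et"), load("Ec"), load("Ee")
    Ef = Et - Ec - Ee  # definition used for the "Ef_" files
    Ef_2 = Et - Ee     # definition used for the "Ef_[...]_2" files
    return Ef, Ef_2
```

For example, formation_energies(directory, "mol", "4") would read "Et_mol_w_4.npy", "Ec_mol_w_4.npy" and "Ee_mol_w_4.npy" from the given directory, if files with those (hypothetical) labels exist.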
The numpy arrays are sorted as follows:
The "Sample_input" directory contains the inputs for the DFTB computation of the formation energy of the first molecule of frame 0 of the first simulation at 4% water. The remaining inputs can be generated using the script "Make_environments.ipynb". Note that this script generates XYZ structure files, which then need to be converted into DFTB+ input files corresponding to the files in the "Sample_input" directory.
This directory contains XYZ structure files required to compute the formation energies. The directory was left empty in order to reduce the size of the dataset. The XYZ files can be generated using the script "Make_environments.ipynb".
This directory contains XYZ structure files of the H-bonding motifs identified in the MD structures. The directory was left empty in order to reduce the size of the dataset. The XYZ files can be generated using the script "AZD5718_pattern_recognition.ipynb" followed by "Sort_patterns.ipynb". The first script generates the structure files for each simulation snapshot considered, and the second script sorts the files by snapshot.
This directory contains XYZ structure files of the H-bonding motifs identified in the MD structures that yield a chemical shift above 11 ppm. The directory was left empty in order to reduce the size of the dataset. The XYZ files can be generated using the script "AZD5718_pattern_recognition_11_ppm.ipynb" followed by "Sort_patterns_11_ppm.ipynb". The first script generates the structure files for each simulation snapshot considered, and the second script sorts the files by snapshot.
This directory contains input and output files for chemical shift computation of the CSP candidates, tautomers and X-ray structure. The subfolder "positional_variance" contains input and output files for chemical shift computations of the perturbed structures used for determining the positional uncertainty of candidate #1.
This directory contains PDB structure files of snapshots of the MD simulations. Each simulation is contained in a subfolder whose name follows the convention "amcell-[water content]_[simulation number]_last100ns".
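The subfolder naming convention can be split back into its two fields with a regular expression. The sketch below is only illustrative: the function name is hypothetical, and it assumes that neither the water content label nor the simulation number contains an underscore, which is not stated in this README.

```python
import re

# Assumption: the water-content label and the simulation number contain no underscore.
_SIMULATION_DIR = re.compile(r"^amcell-(?P<water>[^_]+)_(?P<simulation>[^_]+)_last100ns$")


def parse_simulation_dir(name):
    """Return (water content, simulation number) as strings for one subfolder name."""
    match = _SIMULATION_DIR.match(name)
    if match is None:
        raise ValueError("Unexpected simulation directory name: {!r}".format(name))
    return match.group("water"), match.group("simulation")
```

Both fields are returned as strings so that whatever labelling the subfolders actually use is preserved unchanged.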
This directory contains predicted proton, carbon and nitrogen chemical shieldings for all MD snapshots. The predictions are contained in subdirectories named according to their corresponding MD simulation. Predicted shifts and errors are stored in Numpy arrays within these directories.
This directory contains all NMR experiments run on the crystalline and amorphous samples of AZD5718. The files "Crystalline.txt" and "Amorphous.txt" contain processed 1H spectra that are used to determine shielding-to-shift conversion parameters. This is done in the "MD_Shifts.ipynb" script.
This directory contains all figures generated from the Python scripts.
This directory contains intermolecular complex formation energies as a function of the H-bonding partner of each NH group. The figures are generated using the script "Extract_formation_energies.ipynb".
This directory contains the bar plot of the number of instances of the most often occurring H-bonding motifs that yield a predicted chemical shift above 11 ppm. The figure is generated using the script "Pattern_statistics_11ppm.ipynb".
This directory contains the simulated 1H spectra of crystalline and amorphous AZD5718 (all water contents) using the shifts predicted by ShiftML. The figures are generated using the script "MD_Shifts.ipynb".
This directory contains the simulated 1H spectra of the NH groups of AZD5718 involved in the most often occurring H-bonding motifs identified. The spectra extracted from simulations at the different water contents are found in the files "N[index of N]_w_[water content].pdf". The spectra extracted from all simulations are found in the files "N[index of N].pdf". The figure "Statistics.pdf" contains the number of H-bonding partners of each NH group as a function of the water content. These numbers are normalized by the total number of each NH group in the simulations in the file "Statistics_norm.pdf".
This directory contains all scripts used to analyze the data. All Python notebook scripts require the following Python libraries:
Python 3 (3.7.9)
Numpy (v. 1.19.2)
Atomic Simulation Environment (ASE) (v. 3.18.0)
Scipy (v. 1.5.2)
Matplotlib (v. 3.3.2)
This script identifies the different H-bonding motifs present in the simulation snapshots, extracts the molecules involved in the motifs and saves the corresponding XYZ file in the directory "Data/H_Bonding_Patterns/".
This script identifies the different H-bonding motifs that yield a predicted chemical shift above 11 ppm present in the simulation snapshots, extracts the molecules involved in the motifs and saves the corresponding XYZ file in the directory "Data/H_Bonding_Patterns_11_ppm/".
This script computes the formation energies from the DFTB output files. It should be run after running the "Make_environments.ipynb" script and computing the DFTB energy of each structure generated. The DFTB computations should be contained in the directory "Data/DFTB_D3H5/" and each computation should be run in an individual directory named "amcell-[water content]/amcell-[water content]_[simulation number]_last100ns_frame[frame number]_ind_[index of the molecule]/". The DFTB+ output should be redirected to a file named "dftb.out" in that directory.
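To check that the DFTB computations are laid out as expected before running the notebook, the directory convention above can be written as a small path builder. This is a sketch with hypothetical function names; the argument values in the usage note are placeholders, and the exact formatting of the water content and indices should be taken from the existing directories.

```python
import os


def dftb_output_path(water, simulation, frame, molecule, root="Data/DFTB_D3H5"):
    """Build the expected location of "dftb.out" for one computation."""
    top = "amcell-{}".format(water)
    run = "amcell-{}_{}_last100ns_frame{}_ind_{}".format(water, simulation, frame, molecule)
    return os.path.join(root, top, run, "dftb.out")


def missing_outputs(paths):
    """Return the subset of expected output files that do not exist yet."""
    return [path for path in paths if not os.path.isfile(path)]
```

For instance, dftb_output_path("4", 1, 0, 0) points to "dftb.out" inside "amcell-4/amcell-4_1_last100ns_frame0_ind_0/" under "Data/DFTB_D3H5", and missing_outputs can be applied to a list of such paths to find computations that still need to be run.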
This script extracts the molecular environment around each molecule in the MD snapshots. The extracted XYZ structure files are saved in the directory "Data/Environments/".
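The selection criterion used by the notebook is not described in this README. Purely as an illustration of what extracting an environment can look like, the sketch below keeps every molecule that has at least one atom within a distance cutoff of the central molecule. The function name, the 5.0 Å default cutoff, the per-atom molecule labels and the neglect of periodic boundary conditions are all assumptions made for this example.

```python
import numpy as np


def environment_molecules(positions, molecule_ids, central, cutoff=5.0):
    """Return ids of molecules with any atom within `cutoff` of the central molecule.

    positions    : (n_atoms, 3) array of Cartesian coordinates
    molecule_ids : (n_atoms,) array giving the molecule each atom belongs to
    central      : id of the probe (central) molecule
    Periodic images are ignored in this sketch.
    """
    positions = np.asarray(positions, dtype=float)
    molecule_ids = np.asarray(molecule_ids)
    centre = positions[molecule_ids == central]
    # Distance from every atom to every atom of the central molecule
    diff = positions[:, None, :] - centre[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    close_atoms = dist.min(axis=1) <= cutoff
    neighbours = np.unique(molecule_ids[close_atoms])
    return [int(m) for m in neighbours if m != central]
```

The atoms of the returned molecules, together with those of the central molecule, would then form one environment structure to be written to an XYZ file.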
This script optimizes the scaling parameters for the shielding-to-shift conversion of the ShiftML predicted shieldings such that the simulated 1H spectra of crystalline and amorphous AZD5718 best match the experimental spectra. It also plots the simulated 1H spectra.
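The idea of fitting conversion parameters against a spectrum can be sketched as follows. This is not the notebook's implementation: it assumes a linear conversion (shift = offset + slope * shielding, with an ideal slope of -1), Gaussian line shapes of a fixed hypothetical width, and a sum-of-squares mismatch between normalized spectra. All function and parameter names are illustrative.

```python
import numpy as np


def shieldings_to_shifts(shieldings, slope, offset):
    """Assumed linear conversion from shieldings to shifts (both in ppm)."""
    return offset + slope * np.asarray(shieldings, dtype=float)


def simulate_spectrum(shifts, ppm, width=0.3):
    """Sum one Gaussian line per shift on the `ppm` grid, normalized to a maximum of 1."""
    shifts = np.asarray(shifts, dtype=float)
    ppm = np.asarray(ppm, dtype=float)
    lines = np.exp(-0.5 * ((ppm[:, None] - shifts[None, :]) / width) ** 2)
    spectrum = lines.sum(axis=1)
    return spectrum / spectrum.max()


def mismatch(params, shieldings, ppm, experimental, width=0.3):
    """Sum of squared differences between simulated and experimental spectra."""
    slope, offset = params
    shifts = shieldings_to_shifts(shieldings, slope, offset)
    simulated = simulate_spectrum(shifts, ppm, width)
    return float(np.sum((simulated - np.asarray(experimental, dtype=float)) ** 2))
```

A function such as mismatch could be handed to a general-purpose minimizer, for example scipy.optimize.minimize, with the experimental intensities interpolated onto the same ppm grid.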
This script plots the simulated 1H spectra of NH groups involved in the different H-bonding motifs identified. It also plots the number/fraction of each NH group bonded to each H-bonding partner as a function of the water content. This script should be run after the "AZD5718_pattern_recognition.ipynb" and "Sort_patterns.ipynb" scripts are run.
This script extracts the number of each H-bonding motif yielding a predicted shift above 11 ppm identified (including secondary neighbours) and plots the occurrences of the most often identified ones. This script should be run after the "AZD5718_pattern_recognition_11_ppm.ipynb" and "Sort_patterns_11_ppm.ipynb" scripts are run.
This script compares the intermolecular complex formation energies of the different H-bonding motifs identified. This script should be run after the "Extract_formation_energies.ipynb" and "AZD5718_pattern_recognition.ipynb" scripts are run.
This script gathers all identical H-bonding motifs identified into a single directory. This script should be run after the "AZD5718_pattern_recognition.ipynb" script.
This script gathers all identical H-bonding motifs yielding a chemical shift above 11 ppm identified into a single directory. This script should be run after the "AZD5718_pattern_recognition_11_ppm.ipynb" script.
This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.