MaterialsCloud data entry for: Score-based diffusion models for accurate crystal-structure inpainting and reconstruction of hydrogen positions

This data entry contains the data and code to reproduce the results presented in the manuscript: Score-based diffusion models for accurate crystal-structure inpainting and reconstruction of hydrogen positions. The first part covers the machine-learning-based inpainting approach using score-based diffusion models, while the second part focuses on a purely DFT-based reconstruction approach that was discussed as a reference, mainly in the SI.

Installing AiiDA

If you don't have AiiDA installed yet, follow the quick installation guide: https://aiida.readthedocs.io/projects/aiida-core/en/stable/installation/guide_quick.html#quick-installation-guide

Results for: XtalPaint – Score-based diffusion models for crystal structure inpainting

Importing the AiiDA archive to inspect the data

To import the data into your AiiDA profile, just run:

verdi archive import Hydrogen-inpainting.aiida

Example code to inspect the inpainting WorkGraphs

import pandas as pd
from aiida import orm, load_profile

load_profile()

# The WorkGraph compares the predicted structures against the reference structures
analysis_workgraph = orm.load_node("670d8abf-713a-4a29-bfeb-c711622ecf41")

# This WorkGraph performs structural relaxations using MLIPs
relaxation_workgraph = orm.load_node("6f3f3f7e-5f4b-4e2b-9f13-3e1f3e4c5d6a")

# The analyses were performed at different stages: directly after the inpainting by the diffusion model,
# after a constrained relaxation of only the hydrogen positions, 
# and after a full relaxation of all atomic positions.
analysis_workgraph.outputs.evaluation.keys()
#   dict_keys(['inpainting', 'inpainted_constrained_relaxation', 'pre_relaxed_inpainted_full_relaxation'])

# Get the RMSD per sample
df_rmsd = pd.DataFrame(
    analysis_workgraph.outputs.evaluation.inpainting.rmsd_individual.get_dict().items(),
    columns=['keys', 'rmsd']
)

# Get the structural match per sample
df_match = pd.DataFrame(
    analysis_workgraph.outputs.evaluation.inpainting.match_individual.get_dict().items(),
    columns=['keys', 'match']
)

# The previous examples and results are based on an older version of the API.
# Newer versions directly output pandas `DataFrame` objects and drop the
# `individual_<metric>` and `agg_<metric>` namespaces; see the XtalPaint
# documentation for more details.


# Inspecting the relaxation results

# Again, the relaxations were performed at different stages and can be accessed
# by changing the level `inpainted_constrained_relaxation` below.
# Moreover, the relaxation outputs contain the initial energies and forces, 
# the final energies and forces, as well as the relaxed structures.
relaxation_workgraph.outputs.inpainted_constrained_relaxation.keys()
#   dict_keys(['initial_forces', 'final_energies', 'final_forces', 'structures', 'initial_energies'])

df_final_energies = relaxation_workgraph.outputs.inpainted_constrained_relaxation.final_energies.value

# Since we always work with multiple structures, we work with the 
# `BatchedStructuresData` AiiDA data type. More details in the documentation: https://github.com/psi-lms/XtalPaint

print(
    relaxation_workgraph.outputs.inpainted_constrained_relaxation.structures.get_structures(
        keys='2e5756e3_77f4_4e83_83bb_fca85f2dfec4_sample_6'
    )
)
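
Because `df_rmsd` and `df_match` share the `keys` column, they can be combined into a single per-sample table. The following is a minimal sketch that uses small stand-in frames with made-up values in place of the archive data; the helper name `combine_metrics` is hypothetical and not part of XtalPaint.

```python
import pandas as pd


def combine_metrics(df_rmsd: pd.DataFrame, df_match: pd.DataFrame) -> pd.DataFrame:
    """Join the per-sample RMSD and match tables on their shared `keys` column."""
    # An inner join keeps only the samples that appear in both tables.
    return df_rmsd.merge(df_match, on="keys", how="inner")


# Stand-in data (made-up keys and values) with the same columns as the frames above.
demo_rmsd = pd.DataFrame(
    {"keys": ["a_sample_0", "a_sample_1", "b_sample_0"], "rmsd": [0.05, 0.40, 0.10]}
)
demo_match = pd.DataFrame(
    {"keys": ["a_sample_0", "a_sample_1"], "match": [True, False]}
)

demo_combined = combine_metrics(demo_rmsd, demo_match)
print(demo_combined)
```

With the archive imported, the same call would be `combine_metrics(df_rmsd, df_match)`.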

The keys above follow the scheme <uuid>_sample_<number>, where <uuid> is the UUID of the original structure from which the hydrogen positions were removed (you can find these structures in the MC3D PBE-v1 database: https://www.materialscloud.org/explore/mc3d-pbe-v1/), and <number> is an index that distinguishes multiple samples per structure. In the keys, every - in the UUID is replaced with _.
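
To look up the original structure for a given sample, the key scheme can be inverted. The following sketch recovers the dashed UUID and the sample index from a key; the helper name `split_sample_key` is hypothetical and not part of XtalPaint, and the regular expression assumes lowercase hexadecimal UUIDs as in the example key above.

```python
import re
import uuid

# <uuid>_sample_<number>, with the dashes of the UUID replaced by underscores.
_KEY_PATTERN = re.compile(
    r"^(?P<uuid>[0-9a-f]{8}(?:_[0-9a-f]{4}){3}_[0-9a-f]{12})_sample_(?P<index>\d+)$"
)


def split_sample_key(key: str) -> tuple[str, int]:
    """Return (original UUID with dashes, sample index) for a sample key."""
    match = _KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"Key does not follow <uuid>_sample_<number>: {key!r}")
    original_uuid = match.group("uuid").replace("_", "-")
    # Raises ValueError if the string is not a well-formed UUID.
    uuid.UUID(original_uuid)
    return original_uuid, int(match.group("index"))


print(split_sample_key("2e5756e3_77f4_4e83_83bb_fca85f2dfec4_sample_6"))
```

For a table such as `df_rmsd`, the helper can be applied to the whole key column with `df_rmsd["keys"].map(split_sample_key)`.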

Example code to inspect the related DFT calculations

First, you need to import the corresponding archive:

verdi archive import Hydrogen-inpainting-DFT-stability-validation.aiida

from aiida import orm, load_profile
load_profile()

dft_data = orm.QueryBuilder().append(
    orm.Group, filters={'label': {'like': 'H-inpainting/DFT-validation-stability/workflows/DFT/%'}},
    tag='group', project='label'
).append(
    orm.Node, with_group='group', filters={'attributes.exit_status': 0},
    project='*', tag='pw-relax'
).append(
    orm.StructureData,
    with_outgoing='pw-relax', tag='structure',
    filters={
        'or': [
            {'extras.inpainted-key': {'like': '2e5756e3_77f4_4e83_83bb_fca85f2dfec4%'}},
            {'extras.reference-key': {'like': '2e5756e3_77f4_4e83_83bb_fca85f2dfec4'}}
        ]
    }
).append(
    orm.Dict, with_incoming='pw-relax', project='attributes', tag='parameters'
)

# Calling `dft_data.all()` returns a nested list; each entry contains:
#   the label of the group of the DFT calculations (either `H-inpainting/DFT-validation-stability/workflows/<*>/references` or `H-inpainting/DFT-validation-stability/workflows/<*>/samples`)
#   the actual DFT calculation (`PwRelaxWorkChain`, see https://github.com/aiidateam/aiida-quantumespresso)
#   the dictionary containing the parsed output parameters of the calculation.
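
Since the group label ends in either `references` or `samples`, the query rows can be separated into the reference calculation and the inpainted samples. The sketch below assumes rows shaped as `[group label, workchain, parameters]`, as described above; the helper name `split_by_group` is hypothetical, and the demo rows use placeholder strings and dictionaries instead of real AiiDA nodes and parsed outputs.

```python
from collections import defaultdict


def split_by_group(rows):
    """Group (workchain, parameters) pairs by the last component of the group label."""
    grouped = defaultdict(list)
    for label, workchain, parameters in rows:
        # 'H-inpainting/.../references' -> 'references', '.../samples' -> 'samples'
        kind = label.rsplit("/", 1)[-1]
        grouped[kind].append((workchain, parameters))
    return dict(grouped)


# Placeholder rows standing in for the result of `dft_data.all()`.
prefix = "H-inpainting/DFT-validation-stability/workflows/DFT"
demo_rows = [
    [f"{prefix}/references", "<reference workchain>", {"placeholder": 0}],
    [f"{prefix}/samples", "<sample workchain 1>", {"placeholder": 1}],
    [f"{prefix}/samples", "<sample workchain 2>", {"placeholder": 2}],
]

demo_grouped = split_by_group(demo_rows)
print({kind: len(entries) for kind, entries in demo_grouped.items()})
```

With a loaded profile and the imported archive, the call would be `split_by_group(dft_data.all())`.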

How to run the code for your own structures

The code repository and its documentation show examples of how to run our inpainting models. The repository supports execution with and without AiiDA. Further details can be found here: https://github.com/psi-lms/XtalPaint

Results for: DFT-based reconstruction approach

To run the AiiDA WorkChains yourself, please refer to the repository: https://github.com/psi-lms/aiida-hydrogen-restorer

Importing the AiiDA archive to inspect the data

As in the previous section, the WorkChains analyzed in the SI of the manuscript can be loaded by importing the AiiDA archive:

verdi archive import DFT-based-missing-hydrogen.aiida

The following code snippet shows an example of how to retrieve the results of a WorkChain from the AiiDA database:

from aiida import orm, load_profile

load_profile()

# Starting the queries from the final structures that were optimized with DFT
dft_restore = orm.QueryBuilder().append(
    orm.Group, 
    filters={'label': 'structures/opt/allexcode0_withnew3d'}, 
    tag='group'
).append(
    orm.StructureData, with_group='group', tag='StDat',
).append(
    orm.WorkChainNode, 
    filters={'attributes.exit_status': 0, 'attributes.process_label': 'RestoreHydrogenWorkChain'}, 
    with_outgoing='StDat', project='*'
)

# Get one example result
example_result = dft_restore.first()[0]

# Check the initial structure (hydrogens removed, one can find the original structures 
# via the uuid defined in the extras: `example_result.inputs.structure.base.extras.all`)
example_result.inputs.structure.get_pymatgen()

# Final structure with restored hydrogen positions
example_result.outputs.final_structure.get_pymatgen()

# Peak positions and values of the electrostatic potential
(
    example_result.outputs.all_peaks.get_array('peak_positions'),
    example_result.outputs.all_peaks.get_array('peak_values')
)

Recreating the figures in the manuscript

The data-collection-and-figures.zip archive contains Jupyter notebooks to reproduce the figures presented in the manuscript. The data-collection/ subdirectory provides further Jupyter notebooks to collect and process the data from the AiiDA archives. Running them is not mandatory, as all the processed data is already available as JSON files in the data-collection/data subdirectory.

The structure of the folder is as follows:

data-collection/: Jupyter notebooks to collect and process the data from the AiiDA archives. Note that, due to license restrictions of the original experimental databases, we only provide the AiiDA archives for the calculations based on the DFT and DFT-20-40 datasets. For the calculations based on the experimental structures, we only provide the processed data as JSON files in the data-collection/data subdirectory.
- data/: Processed data as JSON files. This folder contains further subfolders, each of them related to one or multiple figures in the paper. Subfolders corresponding to multiple figures separate the different figures by underscores in their name, e.g., S1_S2 for the combined data of figures S1 and S2.
- 1a_b_S1_S2_S4_get_data.ipynb: Collect the data for figures 1a, 1b, S1, S2, and S4. Uses the Hydrogen-inpainting.aiida archive to compare different inpainting approaches.
- 1c_get_data.ipynb: Collect the data for figure 1c, which compares the trajectories of different inpainting approaches. Uses the Hydrogen-inpainting.aiida archive.
- 2_4_get_data.ipynb: Collect the data for figures 2 and 4 in the main text. Relies on the Hydrogen-inpainting.aiida and Hydrogen-inpainting-DFT-stability-validation.aiida archives, and summarizes the energetic and structural agreement of the presented inpainting approach.
- 3_get_data.ipynb: Collect the data for figure 3 in the main text. Uses the Hydrogen-inpainting-DFT-stability-validation.aiida archive to compare the stability of the predictions and reference.
- S5_get_data.ipynb: Collect the data for figure S5 in the SI. Uses the Hydrogen-inpainting.aiida archive to compare different relaxation strategies.
- S6_S7_S8.py: Here, we only provide an example script showing the parameters to run the MLIP relaxations and provide the aggregated data in the data/S6_S7_S8/ subfolder. This data is used to create figures S6, S7, and S8 in the SI.
- S9_S10_get_data.ipynb: Collect the data for figures S9 and S10 in the SI. Uses the Hydrogen-inpainting.aiida and Hydrogen-inpainting-DFT-stability-validation.aiida archives to analyze the energetic ranking obtained with the NequIP MLIP and DFT.
- S11_get_data.ipynb: Collect the data for figure S11 in the SI. Example script to generate the PETMAD features to measure structural similarity. Again, we only show data based on the DFT dataset.
- S12_get_data.ipynb and S12_S13_get_data.ipynb: Collect the data for figures S12 and S13 in the SI. S12_get_data.ipynb only collects the data related to the pinball method. These notebooks use the DFT-based-missing-hydrogen.aiida archive to analyze the performance of the DFT-based hydrogen reconstruction approach. They also require the lowdimfinder.py script, which is available here: https://github.com/epfl-theos/tool-ml-layer-finder/blob/master/compute/utils/lowdimfinder.py
main/: Subfolders referring to the figures in the main text. Each of them contains a Jupyter notebook to recreate the respective figure from the processed data in the data-collection/data/ subfolder.
SI/: Subfolders referring to the figures in the SI. Each of them contains a Jupyter notebook to recreate the respective figure from the processed data in the data-collection/data/ subfolder.
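
To find the processed JSON files that belong to a given figure, the subfolder naming convention of data-collection/data can be used. The following sketch assumes that every underscore-separated token in a subfolder name is a figure label (as in S1_S2); the helper name `index_processed_data` is hypothetical, and the demo builds a temporary folder with a placeholder file because the actual JSON file names are not listed here.

```python
import json
import tempfile
from pathlib import Path


def index_processed_data(data_dir):
    """Map each figure label to the JSON files stored in its subfolder."""
    index = {}
    subfolders = sorted(p for p in Path(data_dir).iterdir() if p.is_dir())
    for subfolder in subfolders:
        files = sorted(subfolder.glob("*.json"))
        # 'S1_S2' holds the combined data of figures S1 and S2.
        for figure in subfolder.name.split("_"):
            index[figure] = files
    return index


# Demo on a temporary directory that mimics the layout with a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "S1_S2"
    demo_dir.mkdir()
    (demo_dir / "example.json").write_text(json.dumps({"placeholder": 1}))
    demo_index = index_processed_data(tmp)
    demo_names = {fig: [f.name for f in files] for fig, files in demo_index.items()}

print(demo_names)
```

After extracting the archive, the call would be `index_processed_data("data-collection/data")`, with the path adjusted to the extraction location.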