About the repository
====================

This repository contains data files to reproduce the analysis contained in the article "Adsorbate chemical environment-based machine learning framework for heterogeneous catalysis" by Pushkar Ghanekar, Siddharth Deshpande, and Jeffrey Greeley. 

The files in this repository are in a format that can be used by the software package "ACE_GCN" at https://gitlab.com/jgreeley-group/ace_gcn. 

To access the processed datasets in this directory, use the `pickle` module to load the relevant features:

```python
import pickle
# 'pickle_file.pkl' is a placeholder for one of the extracted .pkl files
with open('pickle_file.pkl', 'rb') as f:
    data = pickle.load(f)
```

To save space, most of the data is stored as tar.bz2 archives. These can be compressed and uncompressed on your local cluster or machine using the following bash functions:

```bash
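# Compress a folder into <folder_name>.tar.bz2
function compress_tar(){
    local folder_name=$1
    tar -jcvf "$folder_name.tar.bz2" "$folder_name"
}

# Extract an archive: uncompress_tar <folder_name>.tar.bz2
function uncompress_tar(){
    local archive_name=$1
    tar -jxvf "$archive_name"
}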
```
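
On machines without a bash shell, the same extraction can be done from Python with the standard-library `tarfile` module. The snippet below is a minimal sketch and is not part of the repository; the function name `extract_archive` and its arguments are hypothetical.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar.bz2 archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:bz2" opens a bzip2-compressed tar archive for reading.
    with tarfile.open(archive_path, "r:bz2") as tar:
        names = tar.getnames()
        tar.extractall(path=dest)
    return names
```

For example, `extract_archive("Pt3Sn_NO/raw_files/4NO-cont-pos-out.tar.bz2", "Pt3Sn_NO/raw_files")` should unpack that archive next to the original file. As with any tar archive, extract only files from a source you trust.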


Directory structure 
=====================

Soure_Data
├── Fig3
│   ├── NO5_POSCAR_CONTCAR_NO1234_model.csv
│   ├── NO6_POSCAR_CONTCAR_NO12345_model.csv
│   ├── train_NO123_full_sumNO4.csv
│   └── val_NO123_full_sumNO4.csv
├── Fig4
│   ├── Pt221_4OH_Train_123Pt221_Pt100.csv
│   ├── Pt221_5OH_w4OH.csv
│   └── Pt221_6OH_w45OH.csv
├── Fig5.py
├── Pt3Sn_NO
│   ├── pkls
│   │   ├── 4NO
│   │   │   ├── 4NO_OHE_6_pkls.tar.bz2
│   │   │   └── id_prop_4NO_CONTCAR_POSCAR.csv
│   │   ├── 5NO
│   │   │   ├── 5NO_OHE_6_pkls.tar.bz2
│   │   │   └── id_prop_5NO_CONTCAR_POSCAR.csv
│   │   ├── 6NO
│   │   │   ├── 6NO_OHE_6_pkls.tar.bz2
│   │   │   └── id_prop_6NO_CONTCAR_POSCAR.csv
│   │   └── Pt3Sn_NO_1_6_processed.pkl.tar.bz2
│   └── raw_files
│       ├── 4NO-cont-pos-out.tar.bz2
│       ├── 5NO-cont-pos-out.tar.bz2
│       ├── 6NO-cont-pos-out.tar.bz2
│       └── raw_converged_123456NO_PtSn.tar.bz2
├── Pt_OH
│   ├── pkls
│   │   ├── Pt100
│   │   │   ├── Pt100_CN_OHE_6_Hbonds_pkls.tar.bz2
│   │   │   └── id_prop_Pt100.csv
│   │   ├── Pt100_Pt221_12345OH.pkl.tar.bz2
│   │   └── Pt221
│   │       ├── 221_123OH_CN_OHE_6_Hbonds_pkls.tar.bz2
│   │       ├── 221_3OH_UNIQUE_CN_OHE_6_Hbonds_pkls.tar.bz2
│   │       ├── 221_4OH_TOP_SITE_DFT_PKLS.tar.bz2
│   │       ├── 221_5OH_TOP_SITE_DFT_PKLS.tar.bz2
│   │       ├── Pt221_456OH_guess
│   │       │   ├── POSCAR_Pt221_5OH_TOP_SITE_EXHAUST_GUESS_PKLS.tar.bz2
│   │       │   ├── POSCAR_Pt221_6OH_TOP_SITE_EXHAUST_GUESS_PKLS.tar.bz2
│   │       │   └── POSCAR_Pt_221_4OH_TOP_SITE_DFT_PKLS.tar.bz2
│   │       ├── id_prop_1OH.csv
│   │       ├── id_prop_2OH.csv
│   │       ├── id_prop_3OH_most_stable.csv
│   │       ├── id_prop_OH4_TOP_DFT.csv
│   │       ├── id_prop_OH5_TOP_DFT.csv
│   │       └── id_prop_UNIQUE_3OH.csv
│   └── raw_files
│       ├── Pt100_123OH_raw_converged.tar.bz2
│       ├── Pt221_123OH.tar.bz2
│       ├── Pt221_4OH_GUESS_SITES.tar.bz2
│       ├── Pt221_4OH_TOP_SITES_DFT.tar.bz2
│       ├── Pt221_5OH_GUESS_SITES.tar.bz2
│       ├── Pt221_5OH_TOP_SITES_DFT.tar.bz2
│       └── Pt221_6OH_GUESS_SITES.tar.bz2
├── README.txt
└── processing_scripts
    ├── binding_distance_utils.py
    ├── data_coverage_OHE_CN_HBonding.py
    └── make_graph_objects_dask.py

Description of files
====================

The files provided in this directory broadly fall into three categories:
1) Atom position files (POSCARs) and trajectory files (OUTCARs) for the Pt3Sn/NO and Pt/OH examples. These files are stored as tar.bz2 archives and can be extracted by running `tar -jxvf <archive_name>` on the bash command line.
2) Graph objects: processed atom position files abstracted into graph objects through the surf graph algorithm, ready to be read by the ACE-GCN code. These files end in '_pkl/.pkl' and are compressed with tar.bz2.
3) NumPy processed objects: atom objects ready for training and prediction using the ACE-GCN model. A simple example is provided for Pt/OH (Pt_OH/pkls/Pt100_Pt221_12345OH.pkl.tar.bz2) and Pt3Sn/NO (Pt3Sn_NO/pkls/Pt3Sn_NO_1_6_processed.pkl.tar.bz2).
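
The pickled objects can also be read directly from a tar.bz2 archive without unpacking it to disk first. The snippet below is a minimal sketch under a few assumptions: the helper name `load_pickles_from_archive` is hypothetical, the archive members of interest are assumed to end in `.pkl`, and unpickling the graph objects may require the ACE_GCN package and its dependencies to be importable.

```python
import pickle
import tarfile


def load_pickles_from_archive(archive_path):
    """Return {member name: unpickled object} for every .pkl file in a .tar.bz2 archive."""
    objects = {}
    with tarfile.open(archive_path, "r:bz2") as tar:
        for member in tar.getmembers():
            # Skip directories and anything that is not a .pkl file (assumed suffix).
            if not (member.isfile() and member.name.endswith(".pkl")):
                continue
            # extractfile returns a binary file-like object for regular files.
            with tar.extractfile(member) as fh:
                objects[member.name] = pickle.load(fh)
    return objects
```

For example, `load_pickles_from_archive("Pt3Sn_NO/pkls/Pt3Sn_NO_1_6_processed.pkl.tar.bz2")` would return a dictionary keyed by the file names stored in that archive. Only unpickle files from a source you trust, since `pickle.load` can execute arbitrary code.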

Fig files
=========

The Fig directories contain comma-separated-value (CSV) files with the raw data generated by ACE-GCN and used to plot the figures. All plots were made using the `matplotlib` and `seaborn` packages.
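
The column layout of these CSV files is not documented here, so it can help to inspect a file before plotting it. The snippet below is a minimal sketch using only the Python standard library; the helper name `read_csv_columns` is hypothetical, and it assumes that each file starts with a header row.

```python
import csv


def read_csv_columns(csv_path):
    """Read a CSV file with a header row into {column name: list of string values}."""
    with open(csv_path, newline="") as fh:
        reader = csv.DictReader(fh)
        # fieldnames is taken from the first row of the file (assumed header).
        columns = {name: [] for name in (reader.fieldnames or [])}
        for row in reader:
            for name in columns:
                columns[name].append(row[name])
    return columns
```

For example, `list(read_csv_columns("Fig3/train_NO123_full_sumNO4.csv"))` lists the column names of that file. The values are returned as strings and need to be converted (for instance with `float`) before plotting.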

Pt3Sn_NO
=========

pkls: processed graph objects for the relaxed and unrelaxed guess configurations.  
raw_files: optimized atomic trajectories for 1-6 NO* configurations on Pt3Sn(111). This brute-force enumeration was used for training the ACE-GCN model. The directory also contains the initial guess structures used for the 4/5/6 NO* cases.

Pt_OH
=========

pkls: compressed graph objects for the unrelaxed and optimized atomic positions of the Pt100 and Pt221 configurations.  
raw_files: atom position and trajectory files.

To read the pickle files, use the following snippet.
Make sure to pass the id_prop file, which is the look-up dataset for reading the graph objects.

```python
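import numpy as np
from processing_scripts.binding_distance_utils import generate_dataset_list

# directory_path, id_prop_file, and pickle_path are placeholders for your local paths
graph_dataset = np.array(generate_dataset_list(directory_path, id_prop_file, pickle_path), dtype=object)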
```