This Materials Cloud entry contains the data we used to study the diversity of metal-organic framework (MOF) databases. The contents of each folder are explained below:


####################################################################################################
1. structures/

The tarballs contain:
EQeq_structures.tar -- the structures with EQeq charges for the following databases:
    ARABG-DB
    BW-20K
    CoRE2019
    CoREDDEC

ToBaCCo_structures.tar -- the structures from the ToBaCCo database
hMOF_structures.tar -- the structures from the hMOF database
BWDB_structures.tar -- the structures from the BWDB database

####################################################################################################
2. features_labels/

This folder contains the chemical and geometric features of the MOFs from all the databases that were studied (files named DB_alldescriptors.csv).
For the databases that were used to validate the features, the gas adsorption properties, the maximum positive charge (MPC), and the minimum negative charge (MNC) are also included (files named DB_alldata.csv).


####################################################################################################
3. feature_importance/

Each folder contains the relative feature importance for the different properties, computed with random forest models for each database.

The file names are formatted as: method_set_RF_GeoChem_prop.json

    method: sklearn (Gini), SHAP (Shapley), and permutation
    set: train or test
    prop: 
        gas: CH4 or CO2
        pressure: KH (Henry regime), LP (low pressure), and HP (high pressure)
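
The naming rule above can be split back into its parts with a short script. This is only a sketch: the function name and the example file name are hypothetical, and because the exact spelling of the prop part (how gas and pressure are combined) is not specified here, it is returned as an unparsed string.

```python
import re

# Pattern for method_set_RF_GeoChem_prop.json. The prop part is kept as an
# opaque string because its exact spelling is an assumption, not documented.
_PATTERN = re.compile(
    r"^(?P<method>[^_]+)_(?P<set>train|test)_RF_GeoChem_(?P<prop>.+)\.json$"
)


def parse_importance_name(filename):
    """Split a feature-importance file name into method, set, and prop."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return match.groupdict()


# Hypothetical file name, for illustration only.
parts = parse_importance_name("SHAP_test_RF_GeoChem_CO2_LP.json")
print(parts)
```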


####################################################################################################
4. unsupervised_labels/

The down-selected structures used for the dimensionality reduction (PCA and t-SNE), with their features, labels, and coordinates.


####################################################################################################
5. diversity_metrics/

These files contain the features used for computing the three diversity metrics, i.e., variety, balance, and disparity.
kms and fps refer to the k-means and farthest point sampling approaches for binning the space.
The column for the kms method with 1000 bins was used for the figure in the main text, and the columns with 1000 +/- 200 bins were used for the sensitivity analysis in the SI.
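
As an illustration of how bin assignments can be turned into metrics, the sketch below computes two of the three quantities from a list of bin labels. The definitions are assumptions made for this example, not taken from this entry: variety is the fraction of bins that are occupied, and balance is Pielou's evenness (Shannon entropy divided by the logarithm of the number of bins). Disparity is not covered. The function name and the toy labels are hypothetical.

```python
import numpy as np


def variety_and_balance(bin_labels, n_bins):
    """Return (variety, balance) for integer bin labels in [0, n_bins)."""
    counts = np.bincount(np.asarray(bin_labels), minlength=n_bins)
    occupied = counts[counts > 0]
    # Assumed definition: variety = fraction of bins holding at least one structure.
    variety = occupied.size / n_bins
    # Assumed definition: balance = Shannon entropy normalized by log(n_bins).
    p = occupied / occupied.sum()
    entropy = -np.sum(p * np.log(p))
    balance = entropy / np.log(n_bins) if n_bins > 1 else 0.0
    return variety, balance


# Toy labels: eight structures spread evenly over four bins.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
variety, balance = variety_and_balance(labels, n_bins=4)
print(variety, balance)
```

With these definitions, a database that fills every bin evenly scores 1 on both metrics, while one that puts all structures in a single bin has a low variety and a balance of 0.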


####################################################################################################
6. diverse_set/

This folder contains the data used for training the model on the diverse set and comparing its performance with that of models trained on each database.

BW20K_testdata.csv and CoRE2019_testdata.csv are the test sets that were set aside for testing the model performance.
All the remaining data from CoRE2019 and BW20K are gathered in the all_traindata.csv file. A subset of 7000 structures was selected from all_traindata.csv using the indices in sel_inds_subspace_7000_outof_24904.txt, which contains the indices of the diverse set selected by the MaxMin algorithm.
The CoRE2019_traindata.csv and BW20K_traindata.csv files contain the training data for each database.
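
For readers unfamiliar with MaxMin selection, a minimal greedy version is sketched below. It is a generic illustration under stated assumptions (Euclidean distance, a fixed starting point, a hypothetical function name, toy data) and is not necessarily the implementation that produced the index file.

```python
import numpy as np


def maxmin_select(features, n_select, start=0):
    """Greedy MaxMin: repeatedly pick the point farthest from the chosen set."""
    X = np.asarray(features, dtype=float)
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = np.linalg.norm(X - X[start], axis=1)
    while len(selected) < n_select:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # Update nearest-selected distances with the newly added point.
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected


# Toy feature matrix, for illustration only.
points = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.0, 0.9], [5.0, 5.0]])
picked = maxmin_select(points, n_select=3)
print(picked)
```

The provided index file would presumably be applied as row positions into all_traindata.csv; whether those indices are 0-based is not stated here and should be checked before use.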


####################################################################################################
7. timeline/

The structures in the CoRE-MOF database and the year each structure was deposited in the CSD.
The timeline_topdistance.csv file lists, for each year, the structures with the highest distance to those of the preceding years.



####################################################################################################
8. ForceField/

The force field parameters from UFF that were used in computing gas adsorption properties.

####################################################################################################
9. EDA/

Exploratory data analysis of the databases and the diversity maps.

####################################################################################################
10. codes/

Some of the scripts that were used in the study.