This Materials Cloud entry contains the data we used to study the diversity of metal-organic framework databases. In the following, the contents of each folder are described:

####################################################################################################
1. structures/

The tarballs contain:

EQeq_structures.tar -- the structures with EQeq charges for the following databases:
    ARABG-DB
    BW-20K
    CoRE2019
    CoREDDEC
ToBaCCo_structures.tar -- the structures from the ToBaCCo database
hMOF_structures.tar -- the structures from the hMOF database
BWDB_structures.tar -- the structures from the BWDB database

####################################################################################################
2. features_labels/

The chemical and geometric features of the MOFs from all the databases that were studied (files named DB_alldescriptors.csv).
The gas adsorption properties, maximum positive charge (MPC), and minimum negative charge (MNC) are included for those databases that were used to validate the features (files named DB_alldata.csv).

####################################################################################################
3. feature_importance/

Each folder contains the relative feature importance for the different properties, computed with random forest models for each database.
The file names are formatted as:

method_set_RF_GeoChem_prop.json

method: sklearn (Gini), SHAP (Shapley), and permutation
set: train or test
prop:
    gas: CH4 or CO2
    pressure: KH (Henry regime), LP (low pressure), and HP (high pressure)

####################################################################################################
4. unsupervised_labels/

The down-selected structures for the dimensionality reduction (PCA and t-SNE) with their features, labels, and coordinates.

####################################################################################################
5. diversity_metrics/

These files contain the features used for computing the three diversity metrics, i.e., variety, balance, and disparity.
kms and fps refer to the k-means and farthest point sampling approaches for binning the space.
The column for the kms method with 1000 bins was used for the figure in the main text, and the columns for 1000 +/- 200 bins were used for the sensitivity analysis in the SI.

####################################################################################################
6. diverse_set/

This folder contains the data used for training the model on the diverse set and comparing its performance with models trained on each database.
BW20K_testdata.csv and CoRE2019_testdata.csv are the test sets that were set aside for testing the model performance.
All the remaining data from CoRE2019 and BW20K are gathered in the all_traindata.csv file.
A subset of 7000 structures was selected from all_traindata.csv using the indices in sel_inds_subspace_7000_outof_24904.txt, which contains the indices of the diverse set selected by the MaxMin algorithm.
CoRE2019_traindata.csv and BW20K_traindata.csv contain the training data for each database.

####################################################################################################
7. timeline/

The structures in the CoRE-MOF database and the year in which each structure was deposited in the CSD.
timeline_topdistance.csv contains, for each year, the information on the structures with the highest distance to the preceding years.

####################################################################################################
8. ForceField/

The force field parameters from UFF that were used in computing the gas adsorption properties.

####################################################################################################
9. EDA/

Exploratory data analysis of the databases and the diversity maps.

####################################################################################################
10. codes/

Some of the scripts that were used in the study.
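
For readers unfamiliar with the MaxMin selection mentioned under diverse_set/, the following is a minimal sketch of a greedy MaxMin (farthest point) selection. It is not the script used in the study: the function name maxmin_select, the Euclidean distance, the choice of starting point, and the toy data are all assumptions made for illustration. It returns row indices, similar in spirit to the indices stored in sel_inds_subspace_7000_outof_24904.txt.

```python
import math


def maxmin_select(points, k, start=0):
    """Greedy MaxMin selection (hypothetical helper, not from the study).

    Starting from `start`, repeatedly add the point whose distance to its
    nearest already-selected point is largest. Assumes Euclidean distance
    and at least k distinct points.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    selected = [start]
    # min_dist[i] = distance from point i to its nearest selected point
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # The next pick is the point farthest from the current selection
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Update nearest-selected distances with the newly added point
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy one-dimensional feature space (made-up values, not from the data set)
toy = [(0.0,), (1.0,), (2.0,), (10.0,)]
chosen = maxmin_select(toy, 3)
print(chosen)
```

With real data, each row of the feature table would take the place of one tuple in `toy`, and the returned indices could then be used to pick the corresponding rows of the training table.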