In this repository, we provide all the data and code required to reproduce the results reported in "A universal machine learning model for the electronic density of states" [https://arxiv.org/abs/2508.17418]. The repository is split into several compressed tar.gz files:
Data
Compressed as Data.tar.gz, which comprises 5 subfolders:
External_samples: Contains data from samples extracted from external datasets, used to evaluate the generalizability of PET-MAD-DOS in the paper.
GaAs: Contains data used to train and evaluate the GaAs bespoke and LoRA models.
HEA25S: Contains data used to train and evaluate the high-entropy alloy (HEA) bespoke and LoRA models.
LiPS: Contains data used to train and evaluate the lithium thiophosphate (LiPS) bespoke and LoRA models.
MAD: Contains data from the MAD dataset, used to train and evaluate the PET-MAD-DOS model.

Inside each subfolder, we find the extracted DFT data in .xyz form. Each structure stores the following values in its .info dictionary:
number of electrons: Number of electrons in the structure (excluding electrons in pseudopotentials).
gap: HOMO-LUMO gap.
DOS: Density of states (DOS), obtained via Gaussian smearing of the eigenvalues with sigma = 0.3 eV. The energy axis is the same for all structures, with (lower_bound, upper_bound, interval) set to (-149.6456 eV, 80.6528 eV, 0.05 eV). The DOS has units of states/eV. To see how the energy axis is initialized, look at lines 120 to 128 in denoise_predictions.py.
mask: Integer mask, where 1 marks the regions where the DOS is reliable and 0 marks the regions where it is unreliable due to the energy cutoff in the underlying DFT calculation. It shares the same energy axis as the DOS.
pks: Optional field containing an identifying integer for the structure. It can be used to cross-reference the structure against the MAD dataset.

MD-Trajectories
GaAs: Compressed as GaAs_bespoke_trajectory.tar.gz and GaAs_universal_trajectory.tar.gz. Both contain the same set of trajectories for GaAs in different phases, except that the bespoke trajectory is run with a bespoke GaAs MLIP while the universal trajectory is run with the PET-MAD universal MLIP.
HEA25S: Contains the MD trajectory of the HEA system at different temperatures, compressed as HEA_MD.tar.gz.
LiPS: Contains the MD trajectories of the LiPS system in different phases. Each trajectory is split into 8 parts due to the large number of snapshots. Compressed as LiPS_alpha.tar.gz, LiPS_beta.tar.gz, and LiPS_gamma.tar.gz for the respective phases.

Bespoke_Models, LoRA_Models
Each model is compressed as {system}_{type}.tar.gz. For example, the GaAs bespoke model is GaAs_bespoke.tar.gz. Inside each file, we find 3 subfolders:
Denoising: Contains denoising_model.pt, a model that predicts the Fermi level of the predicted DOS spectra.
Model: Contains the base model (without UQ). The base model can be exported with mtt export checkpoint.ckpt after installing metatrain, and then run with metatrain/metatomic.
UQ: Contains the files required to perform UQ evaluation. The path to the .pt files is a required input for the UQ_Evaluate.py and UQ_LoRA_Evaluate.py scripts.

PETMADDOS
tar.gz file:
bandgap_model.pt: A model that predicts the bandgap from the predicted DOS spectra.

Metatrain
Compressed as Metatrain.tar.gz. The file path entries in these files need to be edited to reflect the correct locations before the files can be used. To use them, run mtt train hypers.yaml. For more information, we refer the reader to the metatrain documentation [https://docs.metatensor.org/metatrain/latest/index.html].

Scripts
Compressed as Scripts.tar.gz. We refer the reader to the argparse sections in the .py files for the necessary inputs.
prepare_data.py: Takes as input a .xyz file (containing the 'DOS' and 'mask' fields in the .info dictionary of each structure) and outputs a .xyz file with the DOS and mask fields processed for model training.
train_UQ.py: Calibrates the last-layer prediction rigidity ensemble on a trained model.
UQ_Evaluate.py, UQ_LoRA_Evaluate.py: Evaluate the UQ model on an input .xyz file to output the ensemble quantities.
denoise_prediction.py: Runs the denoising algorithm on an input DOS prediction.
get_gap_CNN.py/get_gap_physical.py: Calculate the bandgap from the DOS, based on either a CNN or a physical interpretation of the input DOS.

To run most scripts and training workflows, one requires the metatrain [https://github.com/metatensor/metatrain], metatomic [https://github.com/metatensor/metatomic], and metatensor [https://github.com/metatensor/metatensor] packages.
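
As an illustration of the DOS and mask fields described under Data, the following is a minimal sketch of how a Gaussian-smeared DOS on the shared energy axis could be built and how the mask could be used to restrict an error metric to the reliable region. It is not the code used in the repository. The endpoint handling of the grid, the normalization (one state per eigenvalue, with no spin or k-point weights), the function names, the example eigenvalues, and the cutoff energy are all assumptions made for illustration; the authoritative grid construction is in the denoising script referenced above.

```python
import math

# Energy axis parameters and smearing width quoted in the Data section (eV).
LOWER, UPPER, STEP = -149.6456, 80.6528, 0.05
SIGMA = 0.3


def energy_axis(lower=LOWER, upper=UPPER, step=STEP):
    """Evenly spaced grid starting at `lower` and not exceeding `upper`.

    Assumption: the exact endpoint convention of the published axis may differ.
    """
    n_points = int(math.floor((upper - lower) / step)) + 1
    return [lower + i * step for i in range(n_points)]


def smeared_dos(eigenvalues, grid, sigma=SIGMA):
    """Sum one normalized Gaussian per eigenvalue (units: states/eV).

    Assumption: each eigenvalue contributes exactly one state.
    """
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    dos = []
    for energy in grid:
        total = 0.0
        for eps in eigenvalues:
            total += norm * math.exp(-0.5 * ((energy - eps) / sigma) ** 2)
        dos.append(total)
    return dos


def masked_rmse(predicted, reference, mask):
    """RMSE over grid points where mask == 1 (the reliable region)."""
    kept = [(p - r) ** 2 for p, r, m in zip(predicted, reference, mask) if m == 1]
    if not kept:
        raise ValueError("mask excludes every grid point")
    return math.sqrt(sum(kept) / len(kept))


grid = energy_axis()
eigenvalues = [-5.0, -4.2, 1.3]  # hypothetical eigenvalues (eV)
dos = smeared_dos(eigenvalues, grid)

# Hypothetical reliability cutoff: the DOS is trusted only up to 5 eV.
cutoff = 5.0
mask = [1 if energy <= cutoff else 0 for energy in grid]

# A "prediction" that differs from the reference only in the masked-out region.
perturbed = [d + (0.5 if m == 0 else 0.0) for d, m in zip(dos, mask)]

integrated_states = sum(dos) * STEP  # should recover the number of eigenvalues
error_in_reliable_window = masked_rmse(perturbed, dos, mask)
```

Under these assumptions, integrating the DOS over the energy axis recovers the number of eigenvalues, and differences that fall entirely in the masked-out region do not contribute to the error.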