In this repository, we provide all the data and code required to reproduce the results reported in "A universal machine learning model for the electronic density of states" [https://arxiv.org/abs/2508.17418]. The repository is split into several compressed tar.gz files:

Data
- Compressed as Data.tar.gz, which comprises 5 subfolders:
  - External_samples: Contains data from samples extracted from external datasets, used to evaluate the generalizability of PET-MAD-DOS in the paper
  - GaAs: Contains data used to train and evaluate the GaAs bespoke and LoRA models
  - HEA25S: Contains data used to train and evaluate the high entropy alloys (HEA) bespoke and LoRA models
  - LiPS: Contains data used to train and evaluate the Lithium Thiophosphate (LiPS) bespoke and LoRA models
  - MAD: Contains data from the MAD dataset, used to train and evaluate the PET-MAD-DOS model.
- Inside each subfolder, the extracted DFT data are stored in .xyz format. Each structure carries the following values in its .info dictionary:
  - number of electrons : number of electrons in the structure (excluding electrons in pseudopotentials)
  - gap : HOMO-LUMO gap
  - DOS : Density of States (DOS), obtained via Gaussian smearing of the eigenvalues with sigma = 0.3 eV. The energy axis is the same for all structures, with (lower_bound, upper_bound, interval) set to (-149.6456 eV, 80.6528 eV, 0.05 eV). The DOS has units of states/eV. To see how the energy axis is initialized, look at lines 120 to 128 in denoise_predictions.py.
  - mask: Integer mask, where 1 represents the regions where the DOS is reliable and 0 represents the regions where the DOS is unreliable due to the energy cutoff in the underlying DFT calculation. It shares the same energy axis as the DOS.
  - pks : Optional field containing an identifying integer for the structure. It can be used to cross-reference the structure against the MAD dataset.
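
The DOS and mask fields described above can be illustrated with a minimal, standard-library sketch. It is not the code from the repository scripts: the function names (energy_axis, gaussian_dos, masked_integral) are hypothetical, the endpoint convention of the energy axis is an assumption (the upper bound is excluded here), and each eigenvalue is given unit weight, ignoring any spin or k-point weighting used in the actual workflow.

```python
import math

# Values quoted in the DOS field description above.
E_MIN, E_MAX, DE = -149.6456, 80.6528, 0.05  # eV
SIGMA = 0.3  # eV


def energy_axis(lower=E_MIN, upper=E_MAX, step=DE):
    """Uniform energy grid; ASSUMPTION: the upper bound itself is excluded."""
    n_points = int(math.ceil((upper - lower) / step))
    return [lower + i * step for i in range(n_points)]


def gaussian_dos(eigenvalues, axis, sigma=SIGMA):
    """Sum of normalized Gaussians centred on each eigenvalue (states/eV).

    ASSUMPTION: every eigenvalue has unit weight.
    """
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((e - ev) / sigma) ** 2) for ev in eigenvalues)
        for e in axis
    ]


def masked_integral(dos, mask, step=DE):
    """Integrate the DOS only where the mask is 1 (the reliable region)."""
    return step * sum(d for d, m in zip(dos, mask) if m == 1)


axis = energy_axis()
dos = gaussian_dos([-5.0, 0.0, 3.0], axis)  # three made-up eigenvalues in eV
mask = [1 if e < 1.5 else 0 for e in axis]   # made-up mask: unreliable above 1.5 eV
print(len(axis), masked_integral(dos, mask))
```

Integrating the smeared DOS over the full axis recovers the number of eigenvalues, while the masked integral counts only the states in the region flagged as reliable.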
MD-Trajectories
- Contains the converged portion of the MD trajectories in 3 categories:
  - GaAs: Compressed as GaAs_bespoke_trajectory.tar.gz and GaAs_universal_trajectory.tar.gz. Both archives contain trajectories for the same GaAs phases; the bespoke trajectories were run with a bespoke GaAs MLIP, while the universal trajectories were run with the PET-MAD universal MLIP.
  - HEA25S: Contains the MD trajectory of the HEA system at different temperatures, compressed as HEA_MD.tar.gz
  - LiPS: Contains the MD trajectories of the LiPS system in different phases. Each trajectory is split into 8 parts because of the large number of snapshots. Compressed as LiPS_alpha.tar.gz, LiPS_beta.tar.gz, and LiPS_gamma.tar.gz for the respective phases.
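
The trajectory archives listed above can be unpacked with Python's standard tarfile module. The following is a minimal sketch; the helper name unpack_archive and the destination folder are hypothetical, and the internal layout of each archive is not assumed.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return its sorted member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter when this Python version provides it.
    extract_kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, mode="r:gz") as tar:
        names = sorted(tar.getnames())
        tar.extractall(dest, **extract_kwargs)
    return names


# Example usage (uncomment once the archive has been downloaded):
# members = unpack_archive("LiPS_alpha.tar.gz", "MD-Trajectories/LiPS_alpha")
# print(members[:5])
```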
Bespoke_Models, LoRA_Models
- Each file is named after the material system on which the bespoke/LoRA model was fitted, using the format {system}_{type}.tar.gz. For example, the GaAs bespoke model is GaAs_bespoke.tar.gz. Inside each file, there are 3 subfolders:
  - Denoising: Contains denoising_model.pt, a model that predicts the Fermi level of the predicted DOS spectra
  - Model: Contains the base model (without UQ). After installing metatrain, the base model can be exported with mtt export checkpoint.ckpt and then used with metatrain/metatomic
  - UQ: Contains the files required to perform UQ evaluation. The path to the .pt files is a required input for the UQ_Evaluate.py and UQ_LoRA_Evaluate.py scripts
PETMADDOS
- It has the same format as the subfolders of the Bespoke/LoRA models, except that the tar.gz file contains one additional subfolder:
  - Bandgap: Contains bandgap_model.pt, a model that predicts the bandgap of the predicted DOS spectra.
Metatrain
- Contains the hypers used for LoRA finetuning and for training the PET-MAD-DOS/bespoke models, compressed as Metatrain.tar.gz. The entries need to be edited to reflect the correct filepaths before the files can be used. To use these files, run mtt train hypers.yaml. For more information, we refer the reader to the metatrain documentation [https://docs.metatensor.org/metatrain/latest/index.html]
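
For scripted workflows, the metatrain command-line calls quoted in this description (mtt train hypers.yaml and mtt export checkpoint.ckpt) can be wrapped in Python. This is a minimal sketch: the helper names mtt_command and run_mtt are hypothetical, and no options beyond the two invocations stated in the text are assumed.

```python
import shutil
import subprocess


def mtt_command(subcommand, target):
    """Build the argument list for the two mtt calls mentioned in this description."""
    if subcommand not in ("train", "export"):
        raise ValueError("only 'train' and 'export' are covered by this sketch")
    return ["mtt", subcommand, str(target)]


def run_mtt(subcommand, target):
    """Run mtt if it is installed; raises if metatrain's CLI is not on PATH."""
    if shutil.which("mtt") is None:
        raise RuntimeError("mtt not found; install metatrain first")
    return subprocess.run(mtt_command(subcommand, target), check=True)


# Example usage, after editing the filepaths inside the hypers file:
# run_mtt("train", "hypers.yaml")
# run_mtt("export", "checkpoint.ckpt")
```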
Scripts
- Contains 7 scripts used for different parts of the machine learning workflow, compressed as Scripts.tar.gz. We refer the reader to the argparse sections in the .py files for the necessary inputs.
  - prepare_data.py: Takes as input a .xyz file (containing the DOS and mask fields in the .info dictionary of each structure) and outputs a .xyz file with the DOS and mask fields processed correctly for model training
  - train_UQ.py: Calibrates the last-layer prediction rigidity ensemble on a trained model.
  - UQ_Evaluate.py, UQ_LoRA_Evaluate.py: Evaluates the UQ model on an input .xyz to output the ensemble quantities
  - denoise_prediction.py: Runs the denoising algorithm on an input DOS prediction
  - get_gap_CNN.py, get_gap_physical.py: Calculate the bandgap from the DOS, using either a CNN or a physical interpretation of the input DOS.

Running most scripts and training workflows requires the metatrain [https://github.com/metatensor/metatrain], metatomic [https://github.com/metatensor/metatomic], and metatensor [https://github.com/metatensor/metatensor] packages.