Published April 25, 2022 | Version v1
Dataset Open

cell2mol: encoding chemistry to interpret crystallographic data

  • 1. Laboratory for Computational Molecular Design (LCMD), Institute of Chemical Sciences and Engineering (ISIC), École Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland
  • 2. National Centre for Computational Design and Discovery of Novel Materials (MARVEL), École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
  • 3. National Center for Competence in Research-Catalysis (NCCR-Catalysis), École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland

* Contact person

Description

The creation and maintenance of crystallographic data repositories is one of the greatest data-related achievements in chemistry. Platforms such as the Cambridge Structural Database host what is likely the most diverse collection of synthesizable molecules. If properly mined, they could serve as the basis for large-scale exploration of new regions of chemical space using quantum chemistry (QC). However, it is currently difficult to retrieve all the information needed for QC from the available structural data alone, especially for transition metal complexes. To address this shortcoming, we present cell2mol, a software package that interprets crystallographic data and retrieves the connectivity and total charge of molecules, including the oxidation state (OS) of metal atoms. We show that cell2mol outperforms other popular methods at assigning the metal OS while offering a much more comprehensive interpretation of the unit cell, and we make publicly available reliable QC-ready databases totaling 31k transition metal complexes and 13k ligands, encompassing unparalleled chemical diversity.

This record contains the aforementioned database of crystallographic structures after interpretation with the cell2mol software. The database spans 8 transition metals (Fe, Mn, Ru, Re, Cr, Co, Ni, Cu; labeled 1 to 8) and contains over 31,000 different transition metal complexes and 13,000 unique ligands. It also contains the interpreted contents of the entire unit cells, expressed as discrete chemical species with well-defined charges and connectivities. Details can be found in the README.txt file, and an example script illustrating usage is provided. The cell2mol code is available at https://github.com/lcmd-epfl/cell2mol.

Files

Files (2.0 GiB)

Checksum Size
md5:17ee7cb88107c895d5e825947983ccb0 1.4 KiB
md5:ac0717456a8633227e9f1e6c53e02e1d 191.8 MiB
md5:70dd11a0a0bf9ef434ba88d10029737e 96.7 MiB
md5:3663da5064d8efe38e3e8220fca3f2f6 180.6 MiB
md5:30523799374c68b5b59d255d400479a2 66.5 MiB
md5:c6ac99b42df49980673e23ca93da713f 54.6 MiB
md5:91ca67ca07e54358f5f45c21c045dae4 240.3 MiB
md5:65a3009dd2086021ceb67f87691b2209 321.7 MiB
md5:bb537bfd4ed1f53cfddaea9189033afb 471.4 MiB
md5:062f1aeae1f005b2cdbb4e98c2afd151 2.3 KiB
md5:c72132628a592f17c5007010278f9efd 64.6 MiB
md5:8a10a0c191968ae9b96a4c84312c6b5b 5.1 MiB
md5:4420d7ebe2531570fd578d12381c03a7 2.3 KiB
md5:958c375933934ab05042f7466cb14646 397.4 MiB
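
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `matches_record` and the path in the usage comment are illustrative assumptions, not part of the record or of cell2mol.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so that large archives (hundreds of MiB)
    do not have to fit in memory at once.
    """
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, listed_checksum):
    """Compare a local file against a checksum written as 'md5:<hex>'."""
    expected = listed_checksum.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Hypothetical usage: "downloaded_file" is a placeholder path, not a file
# name taken from the record.
# matches_record("downloaded_file", "md5:17ee7cb88107c895d5e825947983ccb0")
```

A result of `False` from `matches_record` suggests the file should be downloaded again before use.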

References

Journal reference (paper in which the data and algorithms are introduced, described, and discussed)
S. Vela, R. Laplaza, Y. Cho, C. Corminboeuf, npj Comput. Mater. 8, 188 (2022), doi: 10.1038/s41524-022-00874-9