Dataset Collection

Quantum Mechanics

  • QM7/QM7b: Electronic properties (atomization energy, HOMO/LUMO, etc.) determined using ab initio density functional theory (DFT). [Regression; 3D Coordinates]

  • QM8: Electronic spectra and excited-state energies of small molecules, calculated by multiple quantum mechanical methods. [Regression; 3D Coordinates]

  • QM9: Geometric, energetic, electronic and thermodynamic properties of DFT-modelled small molecules. [Regression; 3D Coordinates]

Physical Chemistry

  • ESOL: Water solubility data (log solubility in mol per litre) for common organic small molecules. [Regression]

  • FreeSolv: Experimental and calculated hydration free energies of small molecules in water. [Regression]

  • Lipophilicity: Experimental results for the octanol/water distribution coefficient (logD at pH 7.4). [Regression]

Biophysics

  • PCBA: Selected from PubChem BioAssay; consists of biological activities of small molecules measured by high-throughput screening. [Classification]

  • MUV: Subset of PubChem BioAssay obtained by applying a refined nearest-neighbor analysis, designed for validation of virtual screening techniques. [Classification]

  • HIV: Experimentally measured abilities to inhibit HIV replication. [Classification]

  • PDBbind: Binding affinities for biomolecular complexes; structures of both proteins and ligands are provided. [Regression; 3D Coordinates]

  • BACE: Quantitative (IC50) and qualitative (binary label) binding results for a set of inhibitors of human β-secretase 1 (BACE-1). [Regression; Classification]

Physiology

  • BBBP: Binary labels of blood-brain barrier penetration (permeability). [Classification]

  • Tox21: Qualitative toxicity measurements on 12 biological targets, including nuclear receptors and stress response pathways. [Classification]

  • ToxCast: Toxicology data for a large library of compounds based on in vitro high-throughput screening, covering over 600 tasks. [Classification]

  • SIDER: Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes. [Classification]

  • ClinTox: Qualitative data on drugs approved by the FDA and drugs that failed clinical trials for toxicity reasons. [Classification]

Dataset Details

a All MoleculeNet datasets are split into training, validation and test subsets following an 80/10/10 ratio. Different splitting methods are recommended depending on each dataset's contents. For details of the splitting methods, please refer to the paper.
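
As an illustration of the 80/10/10 ratio, the following is a minimal sketch of a purely random split over sample indices. The function name `random_split` and its parameters are hypothetical and are not taken from MoleculeNet; random splitting is only one option, and the recommended method for a given dataset may differ.

```python
import random


def random_split(n_samples, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle indices 0..n_samples-1 and cut them into train/valid/test lists.

    Hypothetical helper for illustration only. Whatever is left after the
    training and validation cuts goes to the test subset.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_train = int(frac_train * n_samples)
    n_valid = int(frac_valid * n_samples)

    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


train_idx, valid_idx, test_idx = random_split(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The three index lists are disjoint and together cover every sample, so they can be used directly to select rows from a dataset held in a list or array.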

b Different classification and regression metrics are recommended based on previous work and each dataset's contents:
          ROC-AUC: Area under the receiver operating characteristic curve
          PRC-AUC: Area under the precision-recall curve
          RMSE: Root-mean-square error
          MAE: Mean absolute error
    For details of the metrics, please refer to the paper.
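
To make the metric definitions concrete, here is a minimal pure-Python sketch of RMSE, MAE and ROC-AUC. The function names are hypothetical, and this is not the evaluation code used for the benchmark. ROC-AUC is computed through its pairwise-ranking interpretation (the probability that a randomly chosen positive is scored above a randomly chosen negative), which is quadratic in the number of samples and only suitable for small examples. PRC-AUC is left out because its value depends on the interpolation scheme chosen.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


def roc_auc(labels, scores):
    """ROC-AUC for binary labels (0/1) via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive has the higher score; tied pairs count as one half.
    Assumes at least one positive and one negative label.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Lower is better for RMSE and MAE, while ROC-AUC ranges from 0 to 1, with 0.5 corresponding to random ranking.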