Models and Featurizations

Model: Logistic regression (classification; deterministic featurization)
Standard classification model that applies the logistic function to a weighted linear combination of the input features.

Featurization: Circular fingerprint (deterministic)
The molecule is decomposed into segments of variable size, each originating from a heavy atom (C, N, or O). Every segment is assigned a unique identifier, and the identifiers are hashed together into a fixed-length binary fingerprint.
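The hashing step above can be sketched in a few lines. This is a toy illustration of folding identifiers into a fixed-length bit vector; the segment identifiers below are made-up placeholders, and real circular fingerprints derive them from actual atom environments.

```python
import hashlib

def fold_identifiers(identifiers, n_bits=64):
    """Hash a set of segment identifiers into a fixed-length binary fingerprint.

    Toy sketch of the folding step: each segment identifier (here, an
    arbitrary string) sets one bit of an n_bits-wide vector.
    """
    fp = [0] * n_bits
    for ident in identifiers:
        digest = hashlib.sha1(ident.encode()).digest()
        bit = int.from_bytes(digest[:4], "big") % n_bits  # map identifier to a bit position
        fp[bit] = 1
    return fp

# Hypothetical identifiers for atom-centred segments of a small molecule
fp = fold_identifiers(["C-sp3-r0", "O-sp3-r0", "C-sp3-r1:O"], n_bits=16)
```

Note that distinct identifiers may collide on the same bit once folded, which is why fingerprint length is a tunable trade-off between size and collision rate.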
Model: Random forest (classification and regression; deterministic featurization)
Standard classification and regression method based on an ensemble of decision trees, each trained on a different subsampled version of the original dataset.
Featurization: Grid Featurizer (deterministic)
Initially built for PDBbind, the Grid Featurizer relies on the detailed structure of a protein-ligand pair to summarize intermolecular forces. It incorporates fingerprints of both the protein and the ligand, as well as an enumeration of salt bridges, hydrogen bonds, and other interactions.

Model: Refined K-nearest neighbour classifier (classification; deterministic featurization)
Building on the hypothesis that compounds with similar substructures have similar functionality, it makes predictions by combining the labels of the top K compounds most similar to the sample.
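The top-K similarity vote can be sketched as follows. This is a minimal majority-vote version using Tanimoto similarity over binary fingerprints, not the exact refined-KNN weighting; the fingerprints and labels are illustrative.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprints (bit lists)."""
    on_a = {i for i, v in enumerate(a) if v}
    on_b = {i for i, v in enumerate(b) if v}
    union = on_a | on_b
    return len(on_a & on_b) / len(union) if union else 0.0

def knn_predict(query, library, k=3):
    """Predict a binary label by majority vote over the top-k most similar compounds.

    `library` is a list of (fingerprint, label) pairs.
    """
    ranked = sorted(library, key=lambda fl: tanimoto(query, fl[0]), reverse=True)
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)

# Toy library: two actives (label 1) sharing bits with the query, two inactives
library = [([1, 1, 0, 0], 1), ([1, 0, 1, 0], 1), ([0, 0, 1, 1], 0), ([0, 1, 0, 1], 0)]
pred = knn_predict([1, 1, 1, 0], library, k=3)  # two of the three nearest are active
```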
Featurization: Symmetry functions (3D coordinates; deterministic)
Symmetry functions are another encoding of Cartesian coordinates, designed to preserve the rotational and permutational symmetries of the system. They introduce a series of radial and angular symmetry functions with different distance and angle cutoffs.

Model: Multitask network (classification and regression; deterministic featurization)
Standard neural-network prediction method designed for multitask settings. Input features are processed through multiple shared fully connected layers and then fed into separate linear classifiers or regressors, one per task. On a single-task dataset it reduces to a vanilla neural network.
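A radial symmetry function of the kind described above can be sketched as follows. This follows the standard Behler-Parrinello form (a Gaussian of neighbour distance times a cosine cutoff); the parameter values are illustrative, not taken from any particular benchmark.

```python
import math

def cutoff(r, r_c):
    """Cosine cutoff: smoothly switches contributions off beyond r_c."""
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0) if r < r_c else 0.0

def radial_symmetry(distances, eta, r_s, r_c):
    """One radial symmetry function value for a central atom.

    Sums Gaussian-weighted contributions from neighbours at the given
    distances; eta (width), r_s (shift), and r_c (cutoff) are the tunable
    distance parameters mentioned above. Because the result depends only on
    interatomic distances, it is invariant to rotations and to permuting
    identical neighbours.
    """
    return sum(math.exp(-eta * (r - r_s) ** 2) * cutoff(r, r_c) for r in distances)

# Neighbours at 1.0, 1.5, and 4.0 distance units; the last lies beyond the cutoff
g = radial_symmetry([1.0, 1.5, 4.0], eta=4.0, r_s=1.0, r_c=3.0)
```

A full featurization evaluates a whole family of such functions (varying eta, r_s, r_c, plus angular analogues) to build each atom's descriptor vector.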
Featurization: Coulomb Matrix (3D coordinates; deterministic)
The Coulomb Matrix encodes nuclear charges and the corresponding Cartesian coordinates into a matrix, with diagonal elements representing nuclear charges and off-diagonal elements representing Coulomb repulsions between pairs of nuclei.

Model: Bypass multitask network (classification and regression; deterministic featurization)
A modified version of the multitask network designed for uncorrelated tasks. It keeps the multitask structure but adds "bypass" layers that connect the input features directly to each individual task, increasing explanatory power when samples contain unrelated variations.
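The Coulomb Matrix can be computed directly from charges and coordinates. The sketch below assumes the standard definition (diagonal 0.5 * Z**2.4, off-diagonal Z_i * Z_j / distance, in atomic units); the H2 geometry is an illustrative example.

```python
import math

def coulomb_matrix(charges, coords):
    """Build a Coulomb matrix from nuclear charges and Cartesian coordinates.

    Assumes the standard definition: M[i][i] = 0.5 * Z_i**2.4 (a fit to
    atomic energies) and M[i][j] = Z_i * Z_j / |R_i - R_j| (Coulomb
    repulsion), with coordinates in atomic units.
    """
    n = len(charges)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                m[i][j] = 0.5 * charges[i] ** 2.4
            else:
                m[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return m

# H2 with a 1.4 bohr bond length
m = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0)])
```

The matrix is symmetric by construction; in practice rows/columns are usually sorted or otherwise canonicalized so the representation does not depend on atom ordering.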
Model: Deep tensor neural network (regression; 3D coordinates)
An adaptable, learnable extension of the Coulomb Matrix approach. Nuclear charges (atom types) are mapped to feature vectors, which are iteratively updated based on the distance matrix and neighbouring atoms. The final state of each atom's feature vector is mapped to an output, and the outputs are summed to predict molecular properties.
Featurization: Graph convolution featurizer (variable)
The molecule is represented by a neighbour list and a set of initial feature vectors, one per atom. Each feature vector summarizes the atom's local chemical environment, including its atom type, hybridization type, and valence structure.

Model: Graph convolutional model (classification and regression; variable featurization)
A learnable version of the circular fingerprint that replaces fixed hash functions with differentiable network layers. Graph convolutional models treat molecules as undirected graphs, with atoms as nodes and bonds as edges. Each convolutional layer extends the feature vector of a central atom by applying a convolutional function (a network layer) to the atom itself and its neighbours (the nodes connected to it by edges).
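One convolutional step of this kind can be sketched as follows. This is a minimal scalar version: each atom combines its own feature with the sum over its neighbours through two fixed weights and a ReLU, whereas real models use feature vectors and learned weight matrices.

```python
def graph_conv_layer(features, neighbours, w_self, w_neigh):
    """One graph convolution step on scalar atom features.

    `features[i]` is atom i's current feature; `neighbours[i]` lists the
    atoms bonded to it. Each atom's new feature combines its own value
    (weight w_self) with the sum over its neighbours (weight w_neigh),
    followed by a ReLU.
    """
    out = []
    for i, h in enumerate(features):
        agg = w_self * h + w_neigh * sum(features[j] for j in neighbours[i])
        out.append(max(0.0, agg))  # ReLU non-linearity
    return out

# Three-atom chain: atom 1 is bonded to atoms 0 and 2 (illustrative features)
features = [1.0, 2.0, 1.0]
neighbours = [[1], [0, 2], [1]]
h1 = graph_conv_layer(features, neighbours, w_self=0.5, w_neigh=0.25)
```

Stacking such layers grows each atom's receptive field one bond per layer, which is what makes the result a learnable analogue of a radius-r circular fingerprint.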
Model: Directed acyclic graph (DAG) model (classification and regression; variable featurization)
An alternative graph-based method that applies to directed graphs. The model regards each molecule as a set of directed acyclic graphs, each rooted at a different atom. Results from all possible graphs of a molecule are calculated and averaged to yield molecule-level properties.
Model: Weave model (classification and regression; variable featurization)
A similar adaptive graph-based model that treats molecules as undirected graphs. Instead of convolving locally (over a central atom and its neighbours), it applies global convolutions to the central atom and all other atoms in the molecule, together with their corresponding pair features.

Featurization: Weave featurizer (variable)
Using the same per-atom feature vectors as the graph convolution featurizer, the Weave featurizer expands the neighbour list into a matrix of pair feature vectors, each representing the connectivity and distance between a pair of atoms.
Model: Message passing neural network (classification and regression; variable featurization)
The message passing neural network (MPNN) is a generalized graph-based model. Its prediction process is separated into two phases: a message passing phase (an edge-dependent neural network) and a readout phase (a seq2seq model for sets).
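The two phases can be sketched as follows. This is a heavily simplified scalar version: fixed edge weights stand in for the edge-dependent message network, a fixed averaging rule stands in for the learned update function, and a plain sum stands in for the set-to-sequence readout model.

```python
def message_passing_phase(features, edges, steps=2):
    """Message passing: at each step, every atom receives edge-weighted
    messages from its neighbours and updates its state.

    `edges` maps a directed pair (i, j) to an edge weight, standing in for
    the edge-dependent neural network.
    """
    h = list(features)
    for _ in range(steps):
        new_h = []
        for i in range(len(h)):
            msg = sum(w * h[j] for (a, j), w in edges.items() if a == i)
            new_h.append(0.5 * h[i] + 0.5 * msg)  # toy update function
        h = new_h
    return h

def readout_phase(h):
    """Readout: reduce the final atom states to one molecular prediction."""
    return sum(h)

# Two bonded atoms with symmetric unit edge weights (illustrative values)
edges = {(0, 1): 1.0, (1, 0): 1.0}
y = readout_phase(message_passing_phase([1.0, 3.0], edges, steps=1))
```

The appeal of the MPNN framing is that graph convolutions, Weave, and DTNN all become special cases once the message, update, and readout functions are chosen appropriately.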