Cookies on this website

We use cookies to ensure that we give you the best experience on our website. If you click 'Accept all cookies' we'll assume that you are happy to receive all cookies and you won't see this message again. If you click 'Reject all non-essential cookies' only necessary cookies providing core functionality such as security, network management, and accessibility will be enabled. Click 'Find out more' for information on how to change your cookie settings.

Artificial intelligence, trained via machine learning or computational statistics algorithms, holds much promise for the improvement of small molecule drug discovery. However, structure-activity data are high dimensional with low signal-to-noise ratios and proper validation of predictive methods is difficult. It is poorly understood which, if any, of the currently available machine learning algorithms will best predict new candidate drugs. 25 publicly available molecular datasets were extracted from ChEMBL. Neural nets, random forests, support vector machines (regression) and ridge regression were then fitted to the structure-activity data. A new validation method, based on quantile splits on the activity distribution function, is proposed for the construction of training and testing sets. Model validation based on random partitioning of available data favours models which overfit and `memorize' the training set, namely random forests and deep neural nets. Partitioning based on quantiles of the activity distribution correctly penalizes models which can extrapolate onto structurally different molecules outside of the training data. This approach favours more constrained models, namely ridge regression and support vector regression. In addition, our new rank-based loss functions give considerably different results from mean squared error highlighting the necessity to define model optimality with respect to the decision task at hand. Model performance should be evaluated from a decision theoretic perspective with subjective loss functions. Data-splitting based on the separation of high and low activity data provides a robust methodology for determining the best extrapolating model. Simpler, traditional statistical methods such as ridge regression outperform state-of-the-art machine learning methods in this setting.


Publication Date



stat.AP, stat.AP