Nowadays, machine learning algorithms are successfully employed for classification, regression, clustering, and dimensionality reduction of large, often high-dimensional, input datasets. In fact, machine learning has demonstrated superhuman abilities in numerous fields, such as playing Go, driving cars, and classifying images. As a result, large parts of our daily life, for example image and speech recognition, web searches, fraud detection, email/spam filtering, credit scoring, and many more, are powered by machine learning algorithms.
These large-scale simulations and calculations, together with experimental high-throughput studies, are producing an enormous amount of data, enabling the application of machine learning methods to materials science.
googol ( $ 10^{100} $ ): an astronomically large number.
In order to produce significant results in materials science, one must not only play to the strengths of machine learning techniques but also apply the lessons already learned in other fields.
It is of course possible to use machine learning methods as a simple fitting procedure for small, low-dimensional datasets. However, this does not play to their strengths and will not allow us to replicate the success that machine learning methods have had in other fields: they thrive on very large amounts of data.
The research landscape has quickly transformed.
Every machine learning application has to consider the aspects of overfitting and underfitting. The reason for underfitting usually lies either in the model, which lacks the ability to express the complexity of the data, or in the features, which do not adequately describe the data. This inevitably leads to a high training error. On the other hand, an overfitted model interprets part of the noise in the training data as relevant information, therefore failing to reliably predict new data. Usually, an overfitted model contains more free parameters than the number required to capture the complexity of the training data. In order to avoid overfitting, it is essential to monitor during training not only the training error but also the error on the validation set. Once the validation error stops decreasing, a machine learning model can start to overfit. This problem is also discussed as the bias-variance trade-off in machine learning.70,71 In this context, bias is the error arising from wrong assumptions in the trained model, while high variance is the error resulting from excessive sensitivity to noise in the training data. As such, underfitted models possess high bias, while overfitted models have high variance.
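The underfitting/overfitting behavior described above can be sketched with a simple polynomial-regression toy problem: a low-degree polynomial underfits (high training error, high bias), while a very high-degree one overfits the noise (low training error, high validation error, high variance). The target function, noise level, and degree choices below are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Synthetic 1-D regression problem: a smooth target plus Gaussian noise.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 30))
x_val = np.sort(rng.uniform(-1, 1, 30))
target = lambda x: np.sin(2 * np.pi * x)
y_train = target(x_train) + rng.normal(0, 0.2, x_train.size)
y_val = target(x_val) + rng.normal(0, 0.2, x_val.size)

def errors(degree):
    """Fit a polynomial of the given degree to the training data
    and return (training RMSE, validation RMSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    rmse = lambda x, y: np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))
    return rmse(x_train, y_train), rmse(x_val, y_val)

# Degree 1 underfits, degree 5 is adequate, degree 25 overfits.
for degree in (1, 5, 25):
    train_err, val_err = errors(degree)
    print(f"degree {degree:2d}: train RMSE {train_err:.3f}, "
          f"val RMSE {val_err:.3f}")
```

Monitoring both columns makes the trade-off visible: the training error decreases monotonically with model capacity, whereas the validation error eventually rises again, which is the signal used in practice to stop adding capacity (or training time).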
The figure illustrates what the test set and the validation set are each used for.