Active and Transfer Learning for the Machine Learning prediction of material properties

Authors: Noah Hoffmann

Ref.: Master thesis, Martin-Luther University of Halle-Wittenberg (2022)

Abstract: With recent advances, Machine Learning (ML) has found its way into materials science. Decision Trees, Neural Networks, Graph Neural Networks, Support Vector Machines, and similar models allow for very fast predictions. Although such models can be very accurate, they require large amounts of training data and are not as accurate as DFT, especially on completely new data. A typical cycle of a high-throughput ML search therefore starts with some initial DFT calculations, which are used to train an ML model. This model in turn predicts the properties of interest for a large number of new materials. One can then survey the predictions for interesting materials and calculate those accurately with DFT, while at the same time expanding the training data of the ML model to make it more precise. This ever-expanding training set can, however, lead to problems. First, the initial data set may be biased: it might contain mainly one material class, such as perovskites, or cover only certain elements. This bias becomes a problem when exploring new structures and compositions outside the training distribution. Second, although larger data sets often lead to better ML models, they also increase the required training time and can introduce redundant information. This raises the question of whether one can significantly reduce the training set size while still achieving similar errors or, alternatively, whether the ML model can choose which new training data is the most beneficial. This problem is known as active learning. Active learning is a well-studied topic for classification problems, with applications in natural language processing and image recognition. There are also studies of regression problems, including working examples on smaller data sets; we want to know whether this approach can be extended to larger data sets.
Furthermore, we investigate the usability of active learning and Gaussian processes on an out-of-distribution data set, i.e., data with a different distribution than the training set.
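
To make the active-learning cycle described above concrete, the following is a minimal sketch of uncertainty sampling with a Gaussian process, written with NumPy only. It is not the method of the thesis: the one-dimensional pool, the RBF kernel with a fixed length scale, the query budget, and the function `expensive_label` (a hypothetical stand-in for a DFT calculation) are all assumptions chosen for illustration.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)


def gp_predict(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    mean = k_qt @ np.linalg.solve(k_tt, y_train)
    # The RBF kernel has prior variance 1 at every point.
    var = 1.0 - np.einsum("ij,ji->i", k_qt, np.linalg.solve(k_tt, k_qt.T))
    return mean, np.sqrt(np.clip(var, 0.0, None))


def expensive_label(x):
    """Hypothetical stand-in for an accurate but costly DFT calculation."""
    return np.sin(3.0 * x)


def active_learning(pool, n_initial=3, n_queries=10, seed=0):
    """Grow the training set by always labeling the most uncertain point."""
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(pool), size=n_initial, replace=False))
    for _ in range(n_queries):
        x_train = pool[labeled]
        _, std = gp_predict(x_train, expensive_label(x_train), pool)
        std[labeled] = -np.inf  # never re-query an already labeled point
        labeled.append(int(np.argmax(std)))
    return labeled


# Candidate "materials" are points on a line in this toy setting.
pool = np.linspace(-2.0, 2.0, 200)
chosen = active_learning(pool)

# Evaluate the final model on the whole pool.
mean, _ = gp_predict(pool[chosen], expensive_label(pool[chosen]), pool)
rmse = float(np.sqrt(np.mean((mean - expensive_label(pool)) ** 2)))
print(f"labeled {len(chosen)} of {len(pool)} points, RMSE = {rmse:.4f}")
```

In this toy setting the model labels only 13 of the 200 candidates. Selecting by predictive standard deviation is one common acquisition rule among several; whether such a rule scales to large materials data sets is exactly the question the abstract raises.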

A second issue with the datasets commonly used today is the quality of the data. Until very recently, all large-scale databases relied on the PBE functional. However, since the development of the PBE functional 25 years ago, an enormous amount of research effort has been invested in developing better functionals. Two products of this research are the PBEsol and SCAN functionals, which produce better geometries and thermodynamic stabilities, respectively. The recent publication of a large dataset of PBEsol and SCAN calculations allows us to train models for more accurate predictions. However, as these datasets are still more than one order of magnitude smaller than the available PBE data, we investigate the effectiveness of transfer learning on them.
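
The idea behind transfer learning from a large low-fidelity dataset to a small high-fidelity one can be sketched with a deliberately simple linear model. Everything here is an assumption for illustration: the synthetic data, the sample sizes, the names `w_low` and `w_high` (stand-ins for PBE-like and SCAN-like targets), and the use of ridge regression pulled towards the pretrained weights in place of fine-tuning a neural network. It is not the procedure used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 20

# Hypothetical ground-truth weights: the high-fidelity target is a small
# perturbation of the low-fidelity one, so the two tasks are related.
w_low = rng.normal(size=n_features)
w_high = w_low + 0.1 * rng.normal(size=n_features)


def make_data(n_samples, weights, noise=0.01):
    """Draw random features and noisy linear targets."""
    x = rng.normal(size=(n_samples, n_features))
    y = x @ weights + noise * rng.normal(size=n_samples)
    return x, y


def ridge_towards(x, y, w_prior, strength=1.0):
    """Minimise ||x w - y||^2 + strength * ||w - w_prior||^2."""
    gram = x.T @ x + strength * np.eye(x.shape[1])
    return w_prior + np.linalg.solve(gram, x.T @ (y - x @ w_prior))


def rmse(w, x, y):
    return float(np.sqrt(np.mean((x @ w - y) ** 2)))


x_low, y_low = make_data(2000, w_low)    # large, cheap dataset
x_high, y_high = make_data(10, w_high)   # small, accurate dataset
x_test, y_test = make_data(500, w_high)  # held-out accurate data

zeros = np.zeros(n_features)
w_pre = ridge_towards(x_low, y_low, zeros)        # pretraining
w_scratch = ridge_towards(x_high, y_high, zeros)  # no transfer
w_transfer = ridge_towards(x_high, y_high, w_pre) # start from w_pre

err_scratch = rmse(w_scratch, x_test, y_test)
err_transfer = rmse(w_transfer, x_test, y_test)
print(f"scratch: {err_scratch:.3f}, transfer: {err_transfer:.3f}")
```

With only 10 high-fidelity samples for 20 features, the model trained from scratch cannot determine all weights, while the transferred model only has to learn the small correction between the two fidelities. The benefit depends entirely on how closely the two tasks are related, which is an assumption built into this synthetic example.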