As the introduction of machine learning methods to materials science is still recent, many published applications are quite basic in nature and complexity. Often they involve fitting models to extremely small training sets, or even applying machine learning methods to composition spaces that could be mapped out completely in a few hundred CPU hours. It is of course possible to use machine learning methods as a simple fitting procedure for small, low-dimensional datasets. However, this does not play to their strengths and will not allow us to replicate the success that machine learning methods have had in other fields.
In this section, we briefly introduce and discuss the most prevalent algorithms used in materials science. We start with linear- and kernel-based regression and classification methods. We then introduce variable selection and extraction algorithms that are also largely based on linear methods. Concerning fully non-linear models, we discuss decision-tree-based methods, such as random forests (RFs) and extremely randomized trees, as well as neural networks. For the latter, we start with simple fully connected feed-forward networks and convolutional networks, and continue with more complex applications in the form of variational autoencoders (VAEs) and generative adversarial networks (GANs).
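To make the distinction between these model families concrete, the following minimal sketch fits a kernel ridge regression model and a random forest to the same data using scikit-learn. The descriptors, target, and hyperparameters are synthetic placeholders chosen purely for illustration and are not taken from any of the works discussed below.

```python
# Minimal sketch (synthetic data): two of the model families discussed in this
# section -- kernel ridge regression and a random forest -- fitted to
# placeholder "composition descriptors" and compared by cross-validated MAE.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 10))                            # placeholder descriptors
y = X @ rng.normal(size=10) + 0.1 * np.sin(5 * X[:, 0])    # placeholder target property

models = {
    "KRR": KernelRidge(kernel="rbf", alpha=1e-3, gamma=1.0),
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(f"{name}: cross-validated MAE = {-scores.mean():.3f}")
```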
A recent study by Kim et al.237 used the same method for the discovery of quaternary Heusler compounds and identified 53 new stable structures. The model was trained on different datasets (the complete open quantum materials database,80 only the quaternary Heusler compounds, etc.). For the prediction of Heusler compounds, it was found that the accuracy of the model also benefited from the inclusion of other prototypes in the training set. It has to be noted that studies with such large datasets are not feasible with kernel-based methods (e.g., KRR or SVMs) due to their unfavorable computational scaling.
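Where this scaling comes from is easy to see for standard kernel ridge regression (a textbook result, not specific to the study of ref. 237): the prediction is a kernel expansion over all N training points, and fitting the expansion coefficients requires solving a dense N x N linear system,

```latex
\hat{y}(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i\, k(\mathbf{x}, \mathbf{x}_i),
\qquad
\boldsymbol{\alpha} = \left(\mathbf{K} + \lambda \mathbf{I}\right)^{-1} \mathbf{y},
\qquad
K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j),
```

so training costs O(N^3) time and O(N^2) memory, and every prediction requires O(N) kernel evaluations. Tree ensembles and mini-batch-trained neural networks avoid this bottleneck, which is why they are preferred for datasets of this size.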
It is clear that compound prediction via machine learning can greatly reduce the cost of high-throughput studies, by at least a factor of ten, through a preselection of materials.1 Naturally, the limitations of stability prediction based on the distance to the convex hull have to be taken into consideration when working with DFT data. While studies based on experimental data can have some advantage in accuracy, this advantage is limited to crystal structures that have already been thoroughly studied, e.g., perovskites, for which a high number of experimentally stable structures is already known. For the majority of crystal structures, the number of known experimentally stable systems is extremely small, and consequently studies based on ab initio data will definitely prevail over those based on experimental data. Once again, a major problem is the lack of benchmark datasets, preventing a quantitative comparison between most approaches. This is true even for work on the same structural prototype. Considering, for example, perovskites, we notice that three groups predicted distances to the convex hull.33,56,99 However, as the underlying composition spaces and datasets are completely different, it is hardly possible to compare them.
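For readers unfamiliar with this stability criterion, the sketch below computes the energy above the convex hull for a hypothetical binary A-B system from formation energies per atom. Real high-throughput studies work with multi-component hulls (e.g., via pymatgen's PhaseDiagram), but the geometric idea is the same; all numbers here are invented for illustration.

```python
# Minimal sketch (invented data): energy above the convex hull for a binary
# A-B system.  Points are (x_B, formation energy per atom in eV); the end
# members are pinned at 0 eV by construction.
import numpy as np
from scipy.spatial import ConvexHull

points = np.array([
    [0.00,  0.000],
    [0.25, -0.100],
    [0.50, -0.300],   # hypothetical stable compound AB
    [0.75, -0.100],
    [1.00,  0.000],
])

hull = ConvexHull(points)
# Keep only the lower part of the hull: facets whose outward normal points
# downwards in energy (negative second component).
lower = [s for s, eq in zip(hull.simplices, hull.equations) if eq[1] < 0]

def energy_above_hull(x, e_f):
    """Distance (eV/atom) of a candidate at composition x to the lower hull."""
    for simplex in lower:
        (x1, e1), (x2, e2) = points[simplex]
        lo, hi = min(x1, x2), max(x1, x2)
        if lo <= x <= hi and hi > lo:
            e_hull = e1 + (e2 - e1) * (x - x1) / (x2 - x1)
            return e_f - e_hull
    return 0.0  # compositions outside [0, 1] are not expected here

print(energy_above_hull(0.40, -0.180))   # ~0.06 eV/atom above the hull
```

A candidate is usually kept for further (DFT) scrutiny if this distance falls below some tolerance, often of the order of the expected model error.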
Zhuo et al.289 tried to circumvent the problems stemming from the different theoretical methods by directly predicting experimental band gaps. Their approach started with a classification of the materials as either metal or non-metal using SVM classifiers, followed by a prediction of the band gap with SVM regressors. The performance of the resulting models in predicting experimental band gaps lies somewhere between basic functionals (like the PBE) and hybrid functionals. The error turns out to be comparable to that of, e.g., refs. 40 and 41. However, Zhuo et al. improved upon those earlier machine learning results, as their error is measured with respect to experiment instead of DFT calculations. While there have been earlier attempts at using experimental band gap training data (e.g., ref. 290), the dataset used by Zhuo et al. includes >6000 band gaps, dwarfing all previous datasets.
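A schematic version of such a two-step pipeline is sketched below. The features, labels, and hyperparameters are synthetic placeholders and do not correspond to those used by Zhuo et al.; only the overall structure (SVM classifier followed by SVM regressor) mirrors their approach.

```python
# Schematic two-step band gap pipeline: classify metal vs. non-metal with an
# SVM classifier, then regress the band gap of the non-metals with an SVM
# regressor.  All data below are synthetic placeholders.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 20))                         # placeholder features
is_metal = X[:, 0] + 0.5 * X[:, 1] > 0                  # placeholder labels
gaps = np.where(is_metal, 0.0, np.abs(X[:, 2]) + 0.5)   # placeholder gaps (eV)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
reg = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))

clf.fit(X, is_metal)
reg.fit(X[~is_metal], gaps[~is_metal])                  # train the regressor on non-metals only

def predict_gap(x):
    """Zero gap for predicted metals, regressed gap (clipped at 0) otherwise."""
    x = np.atleast_2d(x)
    metal = clf.predict(x)
    return np.clip(np.where(metal, 0.0, reg.predict(x)), 0.0, None)

print(predict_gap(X[:5]))
```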
Naturally, neural networks will never reach the algorithmic transparency of linear models. However, representative datasets, a good knowledge of the training process, and a comprehensive validation of the model can usually overcome this obstacle. Furthermore, if we consider the possibilities for post hoc explanations or the decomposability of neural networks, they are actually far more interpretable than their reputation might suggest.
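As one concrete example of a post hoc explanation, the short sketch below computes an input-gradient saliency for a small feed-forward network: the features to which the prediction is most sensitive receive the largest gradient magnitudes. The network and input are placeholders, not a model from the reviewed literature.

```python
# Minimal post hoc explanation sketch: input-gradient saliency for a small
# feed-forward regression network (placeholder architecture and input).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
x = torch.randn(1, 10, requires_grad=True)   # one placeholder descriptor vector

y = model(x).sum()                           # scalar prediction
y.backward()                                 # gradient of the prediction w.r.t. the input

saliency = x.grad.abs().squeeze()
print(saliency)   # larger values = input features the prediction is most sensitive to
```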
In other cases that are characterized by a lack of data, several strategies are very promising. First of all, one can consider surrogate-based optimization (active learning), which allows researchers to optimize the results achieved with a limited experimental or computational budget. Surrogate-based optimization allows us to partly compensate for the limited accuracy of the machine learning models while nevertheless arriving at sufficiently good design results. As the use of such optimal design algorithms is still confined to relatively few studies with small datasets, much future work can be foreseen in this direction. A second strategy to overcome the limited data available in materials science is transfer learning. While it has already been applied with success in chemistry,489 wider applications in solid-state materials informatics are still missing. A last strategy to handle the small datasets that are so common in materials science was discussed by Zhang et al. in ref. 77. A crude estimate of the property essentially shifts the problem from predicting the property itself to predicting the error of the crude model with respect to the higher-fidelity training data. Up to now, this strategy has mostly been used for the prediction of band gaps, as datasets of different fidelity (DFT, GW, or experimental) are openly available. Moreover, the use of crude estimators allows researchers to benefit from decades of work and expertise that went into classical (non-machine-learning) models. If the lower-fidelity data are not available for all materials, it is also possible to use a co-kriging approach that still profits from the crude estimator but does not require it for every prediction.292
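The crude-estimator idea can be written down in a few lines. The sketch below uses synthetic placeholder data and a placeholder low-fidelity estimate; it illustrates the general construction (learn the correction, not the property), not the specific workflow of ref. 77.

```python
# Minimal sketch of the crude-estimator strategy: learn the correction between
# a cheap low-fidelity estimate (e.g., a PBE band gap) and a high-fidelity
# reference (e.g., a GW or experimental gap).  All data are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.uniform(size=(300, 8))                 # placeholder descriptors
gap_crude = 1.0 + 2.0 * X[:, 0]                # placeholder low-fidelity gaps (eV)
gap_ref = gap_crude + 0.8 + 0.3 * X[:, 1]      # placeholder high-fidelity gaps (eV)

# Learn the (hopefully smoother) correction instead of the property itself.
delta_model = RandomForestRegressor(n_estimators=200, random_state=0)
delta_model.fit(X, gap_ref - gap_crude)

def predict_high_fidelity(X_new, gap_crude_new):
    """Crude estimate plus the learned correction."""
    return gap_crude_new + delta_model.predict(X_new)

print(predict_high_fidelity(X[:3], gap_crude[:3]))
```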
The majority of early machine learning applications in solid-state materials science employed straightforward and simple-to-use algorithms, such as linear and kernel models and decision trees. Now that these proofs of concept exist for a variety of applications, we expect research to follow two different directions. The first will be the continuation of present research, the development of more sophisticated machine learning methods, and their application in materials science. Here, one of the major problems is the lack of benchmark datasets and standards. In chemistry, a number of such datasets already exist, such as the QM7 dataset,490,491 the QM8 dataset,491,492 the QM7b dataset,493,494 etc. These are absolutely essential to measure progress in features and algorithms. While we discussed countless machine learning studies in this review, definitive quantitative comparisons between the different works were mostly impossible, impeding the evaluation of progress and thereby progress itself. It has to be noted that there has been one recent competition for the prediction of formation energies and band gaps.495 In our opinion, this is a very important step in the right direction. Unfortunately, the dataset used in this competition was extremely small and specific, putting the generalizability of the results to larger and more diverse datasets into doubt.
Another major challenge relates to the propagation of the uncertainties at each step of the methodology, from the global forcings to the global climate and from regional climate to impacts at the ecosystem level, considering local disturbances and local policy effects. The risks for natural and human systems are the result of complex combinations of global and local drivers, which makes quantitative uncertainty analysis difficult. Such analyses are partly done using multimodel approaches, such as multi-climate and multi-impact models (Warszawski et al., 2013, 2014; Frieler et al., 2017)37. In the case of crop projections, for example, the majority of the uncertainty is caused by variation among crop models rather than by downscaling outputs of the climate models used (Asseng et al., 2013)38. Error propagation is an important issue for coupled models. Dealing correctly with uncertainties in a robust probabilistic model is particularly important when considering the potential for relatively small changes to affect the already small signal associated with 0.5°C of global warming (Supplementary Material 3.SM.1). The computation of an impact per unit of climatic change, based either on models or on data, is a simple way to present the probabilistic ecosystem response while taking into account the various sources of uncertainty (Fronzek et al., 2011)39.
In summary, in order to assess risks at 1.5°C and higher levels of global warming, several things need to be considered. Projected climates under 1.5°C of global warming differ depending on temporal aspects and emission pathways. Considerations include whether global temperature is (i) temporarily at this level (i.e., is a transient phase on its way to higher levels of warming), (ii) arrives at 1.5°C, with or without overshoot, after stabilization of greenhouse gas concentrations, or (iii) is at this level as part of a long-term climate equilibrium (complete only after several millennia). Assessments of impacts of 1.5°C of warming are generally based on climate simulations for these different possible pathways. Most existing data and analyses focus on transient impacts (i). Fewer data are available for dedicated climate model simulations that are able to assess pathways consistent with (ii), and very few data are available for the assessment of changes at climate equilibrium (iii). In some cases, inferences regarding the impacts of a further warming of 0.5°C above present-day temperatures (i.e., 1.5°C of global warming) can also be drawn from observations of similar-sized changes (0.5°C) that have occurred in the past, such as during the last 50 years. However, impacts can only be partly inferred from these types of observations, given the strong possibility of non-linear changes, as well as lag effects for some climate variables (e.g., sea level rise, snow and ice melt). For the impact models, three challenges are noted concerning the coupling procedure: (i) the bias correction of the climate model, which may modify the simulated response of the ecosystem; (ii) the necessity to downscale the climate model outputs to a scale pertinent for the ecosystem without losing the physical consistency of the downscaled climate fields; and (iii) the necessity to develop an integrated study of the uncertainties.