Why big data and compute are not necessarily the path to big materials science

  • Naohiro Fujinuma, Brian DeCost, Jason Hattrick-Simpers & Samuel E. Lofland

    Communications Materials, volume 3, Article number: 59 (2022)

Abstract

Applied machine learning has rapidly spread throughout the physical sciences. In fact, machine learning-based data analysis and experimental decision-making have become commonplace. Here, we reflect on the ongoing shift in the conversation from proving that machine learning can be used, to how to effectively implement it for advancing materials science. In particular, we advocate a shift from a big data and large-scale computations mentality to a model-oriented approach that prioritizes the use of machine learning to support the ecosystem of computational models and experimental measurements. We also recommend an open conversation about dataset bias to stabilize productive research through careful model interrogation and deliberate exploitation of known biases. Further, we encourage the community to develop machine learning methods that connect experiments with theoretical models to increase scientific understanding rather than incrementally optimizing materials. Moreover, we envision a future of radical materials innovations enabled by computational creativity tools combined with online visualization and analysis tools that support active outside-the-box thinking within the scientific knowledge feedback loop.

Introduction

Since Frank Rosenblatt created the Perceptron1, machine learning (ML) applications have been used to emulate human intelligence. The field has grown immensely with the advent of ever more powerful and ever smaller computers, combined with the development of robust statistical analyses. These advances allowed Deep Blue to beat Grandmaster Garry Kasparov in chess and Watson to win the game show Jeopardy! The technology has since progressed to more practical applications such as advanced manufacturing and common tasks we now expect from our phones, like image and speech recognition. The future of ML promises to obviate much of the tedium of everyday life by assuming responsibility for ever more complex processes, e.g., autonomous driving.

When it comes to scientific applications, our perspective is that current ML methods are just another component of the scientific modeling toolbox, with a somewhat different profile of representational basis, parametrization, computational complexity, and data/sample efficiency. Fully embracing this view will help the materials and chemistry communities to overcome perceived limitations and at the same time evaluate and deploy these techniques with the same level of rigor and introspection as any physics-based modeling methodology. Toward this end, in this essay we identify four areas in which we, as materials researchers, can clarify our thinking to enable a vibrant and productive community of scientific ML practitioners:

Maintain perspective on resources required

The recent high-profile successes in mainstream ML applications enabled by internet-scale data and massive computation2,3 have spurred two lines of discussion in the materials community that are worth examining more closely. The first is an uncritical and limiting preference for large-scale data and computation, under the assumption that successful ML is unrealistic for materials scientists whose datasets are orders of magnitude smaller than those at the forefront of the publicity surrounding deep learning. The second is a tendency to dismiss brute-force ML systems as unscientific. While there is some validity to both of these viewpoints, there are opportunities in materials research both for productive and creative ML work with small datasets and for the “go big or go home” brute-force approach.

Molehills of data (or compute) are sometimes better than mountains

A common sentiment in the contemporary deep-learning community is that the most reliable means of improving the performance of a deep-learning system is to amass ever larger datasets and apply raw computational power. This sometimes can encourage the fallacy that large-scale data and computation are fundamental requirements for success with ML methods. This can lead to needlessly deploying massively overparameterized models when simpler ones may be more appropriate4, and it limits the scope of applied ML research in materials by biasing the set of problems people are willing to consider addressing. There are many examples of productive, creative ML work with small datasets in materials research that counter this notion5,6.

In the small-data regime, high-quality data with informative features often trump excessive computational power applied to massive data with weakly correlated features. A promising approach is to exploit the bias-variance trade-off by performing more rigorous feature selection or crafting a more physically motivated model form7. Alternatively, it may be wise to reduce the scope of the ML task by restricting the material design space, or to use ML to solve a smaller chunk of the problem at hand. ML tools for exploratory analysis with appropriate features can help us comprehend much higher dimensional spaces even at an early stage of the research, offering a bird’s-eye view of the target. For example, cluster analysis can help researchers identify representative groups in large high-throughput datasets, making the process of formulating hypotheses more tractable.
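
As a concrete illustration, this clustering step can be sketched with scikit-learn. The data here are purely synthetic stand-ins (two hypothetical material families described by two features); the feature values and cluster count are illustrative assumptions, not from any real high-throughput study.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for a high-throughput materials library: each row is a
# sample described by two measured features (values are illustrative only).
rng = np.random.default_rng(0)
family_a = rng.normal(loc=[3.9, 1.2], scale=0.05, size=(50, 2))
family_b = rng.normal(loc=[4.2, 2.8], scale=0.05, size=(50, 2))
X = np.vstack([family_a, family_b])

# Partition the library into k representative groups; each cluster center is
# a candidate "representative sample" to examine in detail when formulating
# hypotheses about the dataset.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
representatives = km.cluster_centers_
print(representatives.round(2))
```

In practice, the number of clusters and the feature set would be chosen from domain knowledge, and the measured samples nearest each center would be pulled for closer inspection.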

There are also specific ML disciplines aimed at addressing the well-known issues of small datasets, dataset bias, noise, incomplete featurization, and over-generalization, and there has been some effort to develop tools to address them. Data augmentation and other regularization strategies can allow even small datasets to be treated with large deep-learning models. Another common approach is transfer learning, where a proxy model is trained on a large dataset and adapted to a related task with fewer data points8,9,10. Chen et al.11 showed that multi-fidelity graph networks could use comparatively inexpensive low-fidelity calculations to bolster the accuracy of ML predictions for expensive high-fidelity calculations. Finally, active learning methods are now being explored in many areas of materials research, where surrogate models are initialized on small datasets and updated as predictions are used to guide the acquisition of new data, often in a manner that balances exploration with optimization12. Generally, a solid understanding of the uncertainty in the data is critical for success with these strategies, but ML systems can lead us to some insights or serve as a guide for optimization that might otherwise be intractable.
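
A minimal active-learning loop of the kind described above can be sketched as follows, assuming a Gaussian-process surrogate and an upper-confidence-bound acquisition rule; the one-dimensional `measure` function is a hypothetical stand-in for an expensive experiment, and all parameter values are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def measure(x):
    # Hypothetical stand-in for an expensive experiment or simulation.
    return np.sin(3 * x) + 0.5 * x

candidates = np.linspace(0, 2, 201).reshape(-1, 1)  # discretized design space
X = np.array([[0.1], [1.9]])                        # tiny initial dataset
y = measure(X).ravel()

for _ in range(8):
    # Refit the surrogate, then pick the candidate whose upper confidence
    # bound is largest, balancing the predicted mean (optimization) against
    # the predictive uncertainty (exploration).
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-6)
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(mu + 2.0 * sigma)]
    X = np.vstack([X, [x_next]])
    y = np.append(y, measure(x_next))

print(float(X[np.argmax(y), 0]))  # best design point measured so far
```

The weighting of the uncertainty term controls the exploration/optimization balance the text refers to; a quantified understanding of measurement uncertainty would enter through the `alpha` (noise) term of the surrogate.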

We assert that the materials community would generally benefit from taking a more model-oriented approach to applied ML, in contrast to the popular prediction-oriented approach that many method-development papers take. With the current prediction-oriented application of ML to the physical sciences, the primary intent of the model is to obtain property predictions, often for screening or optimization workflows. We propose that the community would be better served to instead use ML as a means to generate scientific understanding, using, for instance, inference techniques to quantify physical constants from experiments. To achieve the goals of scientific discovery and knowledge generation, predictive ML must often play a supporting role within a larger ecosystem of computational models and experimental measurements. It can be productive to reassess13 the predictive tasks we are striving to address with ML methods; more carefully thought out applications may provide more benefit than simply collecting larger datasets and training higher capacity models.

Massive computation can be useful but is not everything

On the other hand, characterizing brute computation as “unscientific” can lead to missed opportunities to meaningfully accelerate and enable new kinds or scales of scientific inquiry14. Even without investment in massive datasets or specialized ML models, there is evidence that simply increasing the scale of computation applied can help compensate for small datasets. For example, ref. 15 shows that simply by increasing the number of training iterations, large object-detection and segmentation models trained from random initialization can match the performance of the conventional transfer learning approach. In many cases, advances enabled in this way do not directly contribute to scientific discovery or development, but they absolutely change the landscape of feasible scientific research by lowering the barrier to exploration and increasing the scale and automation of data analysis.

A perennial challenge in organic chemistry is predicting the structure of proteins, but recent advances in learned potential methods16 have provided paradigm-shifting improvements in performance made possible by sheer computational power. In addition, massive computation can enable new scientific applications through scalable automated data analysis systems. Recent examples include phase identification in electron backscatter diffraction17 and X-ray diffraction18, and local structural analysis via extended X-ray absorption fine structure19,20. These ML systems leverage extensive precomputation through the generation of synthetic training data and the training of models; this makes online data analysis possible, removing barriers to more adaptive experiments enabled by real-time decision making.

In light of the potential value of large-scale computation in advancing fundamental science, the materials field should make computational efficiency21 an evaluation criterion alongside accuracy and reproducibility22. Comparison of competing methods with equal computational budgets can provide insight into which methodological innovations actually contribute to improved performance (as opposed to simply boosting model capacity) and can provide context for the feasibility of various methods to be deployed as online data analysis tools. Careful design and interpretation of benchmark tasks and performance measures are needed for the community to avoid chasing arbitrary targets that do not meaningfully facilitate scientific discovery and development of novel and functional materials.

Openly assess dataset bias

Acknowledging dataset bias

It is widely accepted that materials datasets are distinct from the datasets used to train and validate ML systems for more “mainstream” applications in a number of ways. While some of this is hyperbole, there are some genuine differences that have a large impact on the overall outlook for ML in materials research. For instance, there is a community-wide perception that all ML problems involve data on the scale of the classic image recognition and spam/ham problems. While there are over 140,000 labeled structures in the Materials Project Database23 and the MNIST24 dataset contains about half that amount, other popular ML benchmark datasets are much more modest in size. For instance, the Iris Dataset contains only 50 samples of each of three species of Iris and is treated as a standard dataset for evaluating a host of clustering and classification algorithms. As noted above, dataset size is not necessarily the major hurdle for the materials science community in terms of developing and deploying ML systems; however, the data, input representation, and task must each be carefully considered.

Viewed as a monolithic dataset, the materials literature is an extremely heterogeneous multiview corpus with a significant fraction of missing entries. Even if this dataset were accessible in a coherent digital form, its diversity and deficiencies would pose substantial hurdles to its suitability for ML-driven science. Most research papers narrowly focus on a single or a small handful of material instances, address only a small subset of potentially relevant properties and characterization modalities, and often fail to adequately quantify measurement uncertainties. Perhaps most importantly, there is a strong systemic bias toward positive results25. All of these factors negatively impact the generalization potential of ML systems.

Two aspects of publication bias play a particularly large role: domain bias and selection bias (Fig. 1a). Domain bias results when training datasets do not adequately cover the input space. For example, ref. 26 recently demonstrated that the “tried and true” method of selecting reagents following previous successes artificially constrained the range of chemical space searched, providing the ML with a distorted view of the viable parameter space. Severe domain bias can lead to overly optimistic estimates of the performance of ML systems27,28 or in the worst case even render them unusable for real-world scientific application29,30.

Fig. 1: Impact of datasets and feature sets in implementing ML for materials research.

a Materials literature with a heterogeneous dataset due to domain bias and selection bias. Domain bias results when training datasets do not adequately cover the research space. Selection bias arises when external factors, such as the perceived questionability or inexplicability of results, restrict the likelihood that a data point is included in the dataset; such data can be experimental, theoretical, or computational. b Holistic description of the synthesis, composition, microstructure, and macrostructure of materials, which are related to material properties and performance. Identifying a sufficient feature space with essential variables such as synthesis parameters requires careful observation and lateral thinking.

Selection bias arises when some external factor influences the likelihood of a data point’s inclusion in the dataset. In scientific research, a major source of such selection bias is the large number of unreported failures (Fig. 1a). For instance, the Landolt-Bornstein collection of ternary amorphous alloys lists 71% of the alloys as glass formers, while the actual occurrence of glass-forming compounds is estimated to be about 5%31. This dataset imbalance skews the prior probability of glass formation, further complicating the already challenging task of learning from imbalanced datasets. Schrier et al.32 reported on how incorporating failed experiments into ML models can actually improve the overall predictive power of a model.
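
For probabilistic classifiers, the effect of this kind of selection bias can be partially counteracted by prior correction: re-weighting a model's predicted odds by the ratio of the believed field-wide prior to the prior in the biased training collection. A minimal sketch using the glass-forming figures quoted above (71% reported vs. roughly 5% estimated); the helper function and its defaults are illustrative, not from the cited works.

```python
def correct_prior(p_biased, prior_train=0.71, prior_true=0.05):
    """Rescale a probability predicted under the biased training prior
    (prior_train) so that it reflects the believed real-world prior
    (prior_true), via simple odds re-weighting."""
    odds = p_biased / (1.0 - p_biased)
    ratio = (prior_true / (1.0 - prior_true)) / (prior_train / (1.0 - prior_train))
    corrected = odds * ratio
    return corrected / (1.0 + corrected)

# A prediction that merely matches the biased base rate is mapped back to
# the believed real-world base rate:
print(round(correct_prior(0.71), 3))  # → 0.05
```

This does not repair domain bias (the unexplored regions are still unexplored), but it keeps the skewed base rate from being mistaken for physics.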

Furthermore, the annotations or targets used to train ML systems do not necessarily represent true physical ground truth. As an example, in the field of metallic glasses the full width at half maximum (FWHM) of the strongest diffraction peak at low wavevector is often used to categorize thin-film material as metallic glass, nanocrystalline, or crystalline. Across the literature, the FWHM value used as the threshold to distinguish between the first two classes varies from 0.4 to 0.7 Å−1 (with associated uncertainties) depending upon the research group. Although compendia invariably capture the label ascribed to the samples, they almost universally omit the threshold used for the classification, the uncertainty in the measurement of the FWHM, and the associated synthesis and characterization metadata. Comprehensive studies often report only reduced summaries of the datasets presented and include full details only for a subset of “representative data”. These shortcomings are common across the primary materials science literature. Given that even experts can reasonably disagree on the interpretation of experimental results, the lack of access to primary datasets prevents detailed model critique, posing a substantial impediment to model validation29,33. The push for creating F.A.I.R. (Findable, Accessible, Interoperable, and Reusable34) datasets with human/computer-readable data structures notwithstanding, most of the data and metadata for materials that have ever been made and studied have been lost to time.
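
The label ambiguity described above is easy to quantify. In this sketch with made-up FWHM values, the same samples receive different labels depending on which published threshold (0.4 or 0.7 Å−1) is adopted:

```python
import numpy as np

# Hypothetical FWHM values (in 1/Angstrom) of the first strong diffraction
# peak for five thin-film samples (illustrative numbers only).
fwhm = np.array([0.35, 0.45, 0.55, 0.65, 0.75])

def is_metallic_glass(fwhm, threshold):
    # Broad peak => amorphous; the threshold itself is a modeling choice
    # that varies between research groups.
    return fwhm > threshold

labels_low = is_metallic_glass(fwhm, 0.4)    # one group's convention
labels_high = is_metallic_glass(fwhm, 0.7)   # another group's convention
disagreement = float(np.mean(labels_low != labels_high))
print(disagreement)  # → 0.6: three of the five "ground truth" labels flip
```

Any compendium that records the labels but not the thresholds (or the FWHM uncertainties) silently mixes these conventions into a single training target.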

Systematic errors in datasets are not restricted to experimental results alone. Theoretical predictions from high-throughput density functional theory (DFT) databases, for example, are a valuable resource for predicted material (meta-)stability, crystal structures, and physical properties, but DFT computations contain several underlying assumptions that are responsible for known systematic errors, e.g., in calculated band gaps. DFT experts are well aware of these limitations and their implications for model building; however, scientists unfamiliar with the field may not be able to reasonably draw conclusions about the potential viability of a model’s predictions given these limitations. The discrepancy between DFT and experimental data will grow as systems become increasingly complex, a longstanding trend in applied materials science. A heterogeneous model, in particular, may carry large uncertainty depending on the complexity of the input structure, and often little to no information is provided about the structure or the rationale for choosing it.

Finally, even balanced datasets with quantified uncertainties are not guaranteed to generate predictive models if the features used to describe the materials and/or how they are made are not sufficiently descriptive. Holistically describing the synthesis, composition, microstructure, and macrostructure of existing materials in terms of their properties and performance (Fig. 1b) is a challenging problem, and the feature set used (e.g., microstructure two-point correlations, compositional descriptors and radial distribution functions for functional materials, and calculated physical properties) is largely community driven. This presupposes that we know and can measure the relevant features during our experiments. Often, identifying the parameters that strongly influence materials synthesis and the structural aspects highly correlated to function is a matter of scientific inquiry in and of itself. For example, identifying the importance of temperature in cross-linking rubber or the effect of moisture on the reproducible growth of super-dense, vertically aligned single-walled carbon nanotubes requires careful observation and lateral thinking to connect seemingly independent or unimportant variables. If these parameters (or covariate features, e.g., chemical vapor deposition system pump curves) are not captured from the outset, then there is no hope of algorithmically discovering a causal model, and weakly predictive models are likely to be the best-case output.

There is no silver bullet that will solve the issue of dataset bias, but there are several concrete steps that can be taken to begin addressing it. For instance, as a community we can commit to re-balancing the data pool against selection bias by including in our supplementary material one failed (or subpar) result for every successful result in the main text. Domain bias is best addressed by first acknowledging its existence and then encouraging researchers (possibly through funding) to spend time exploring outside of the well-known regions within their respective fields (perhaps resulting in additional data points to address selection bias). In terms of the need to capture all relevant material features, we accept that (happily) new insights will constantly crop up, and when they do, public datasets should be updated to contain the newly important features. Even if the new field is left empty for historical records, its existence will draw attention to its relevance for model builders. Finally, individuals applying ML in their research should analyze and discuss sources of bias in the data used to train and evaluate models and their potential impact on reported results.

Productivity in spite of dataset bias

Bias in historical and as-collected datasets should be acknowledged, but it does not entirely preclude their use to train ML models targeted toward scientific inquiry. Instead, one can continue to gain productive insights from ML by taking the appropriate approach and thinking analytically about the results of the model.

Especially with small datasets, it is important to characterize the extent of dataset bias and perform careful model performance analysis to obtain realistic estimates of the generalization of ML models. Rauer and Bereau28 provide compelling examples of these effects of dataset bias by comparing the empirical distribution in chemical space of three similar molecular property datasets. Dataset bias can cause common measures of a model’s generalization ability to become overconfident; typically generalization ability is measured through cross-validation, where a portion of the data is withheld from the training data. Recent research in the chemical and materials informatics literature has focused on developing dataset debiasing techniques that aim to find cross-validation splits that more faithfully serve as a check against overfitting. For example, the Asymmetric Validation Embedding method27 quantifies the bias of a dataset split by using a nearest-neighbor model to memorize the training data. If the nearest-neighbor lookup can achieve a good validation accuracy, then the training and validation sets are deemed to be too similar. Searching for cross-validation splits that minimize this bias metric can improve the robustness of the benchmark, but the Asymmetric Validation Embedding metric is specific to classification tasks. In contrast, leave-one-cluster-out cross-validation35 is more general, using only distances in the input space to define cross-validation groups to reduce information leakage between folds. Extending these kinds of debiasing methods to additional material classification and prediction tasks will have an outsized impact on applied artificial intelligence for practical scientific advances and discoveries because by nature these goals depend on excellent generalization and extrapolation performance.
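
A split in the spirit of leave-one-cluster-out cross-validation can be sketched with scikit-learn by deriving the cross-validation groups from clusters in the input space; the data here are synthetic, and the clustering choices are illustrative rather than those of ref. 35.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))  # stand-in for material descriptors

# Derive CV groups from input-space clusters so that near-duplicate samples
# never straddle a train/validation boundary (reducing information leakage).
groups = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(X)

logo = LeaveOneGroupOut()
for train_idx, val_idx in logo.split(X, groups=groups):
    # Each fold withholds one entire cluster from training.
    assert set(groups[val_idx]).isdisjoint(groups[train_idx])

print(logo.get_n_splits(X, groups=groups))  # → 5, one fold per cluster
```

Validation error estimated this way probes extrapolation to unseen regions of the input space, which is closer to the real deployment scenario than a random split.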

One method for maintaining “good” features and models is to adopt active human intervention in the ML loop. For example, we have recently demonstrated that Random Forest models tuned to aggressively maximize only cross-validation accuracy may produce low-quality, unreliable feature rankings36. Carefully tracking which features (and data points) the model most depends on for its predictions allows a researcher to ensure that the model is capturing physically relevant trends, identify new potential insight into material behavior, and spot possible outliers. Similarly, when physics-based models are used to generate features and training data for ML models, subsequent comparison of new predictions to theory-based results offers the opportunity to improve both models37. The preceding examples are all human-initiated post-hoc investigations of model outputs. Kusne et al.38 recently demonstrated the inverse example, in which the ML model can request expert input, such as performing a measurement or calculation, that is expected to lower predictive uncertainties.
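
One simple interrogation of this kind is permutation importance, which measures how much a fitted model's score degrades when each feature is shuffled in turn. In this sketch (synthetic data, with signal deliberately placed in the first feature), the check confirms the model relies on the one physically meaningful input:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Toy data: only feature 0 carries signal; the other three are pure noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

rf = RandomForestRegressor(n_estimators=100, random_state=2).fit(X, y)

# Shuffle each column and record the drop in score; a ranking that
# contradicts physical expectations is a red flag for the model or the data.
result = permutation_importance(rf, X, y, n_repeats=10, random_state=2)
print(int(np.argmax(result.importances_mean)))  # → 0: the informative feature
```

On real datasets the same diagnostic, scored on held-out data, helps distinguish physically relevant trends from artifacts the model has latched onto.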

Dimensionality reduction tools and latent space models are useful for assessing the general distribution of a dataset. Visualizations from such models can illustrate potential bias and unequal distributions of a dataset by inspecting the internal structure/distribution and the true dimensionality. For instance, ref. 39 used principal component analysis as a method for investigating the role of dataset bias by examining the density of data points with scores plots. Gomez-Bombarelli et al.40 have used variational autoencoders to identify sparsely sampled regions in the parameter space by pushing them toward the outside of the latent space distribution. They demonstrated that variational autoencoders can highlight when the model is incapable of recognizing certain classes, indicating the data is outside of the distribution that the model was trained on. A holistic analysis helps gain knowledge about both the ML models and the datasets and thus may lead to more effective research steps.
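
An imbalance of the kind discussed above shows up directly in a principal-component scores plot. A sketch with synthetic data, in which one "family" of materials is deliberately under-sampled (the family locations and the cutoff of 6.0 in PC space are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

# Two synthetic material families; the second is heavily under-represented,
# mimicking selection bias in a literature-derived dataset.
rng = np.random.default_rng(3)
common = rng.normal(loc=0.0, scale=1.0, size=(190, 5))
rare = rng.normal(loc=6.0, scale=1.0, size=(10, 5))
X = np.vstack([common, rare])

scores = PCA(n_components=2).fit_transform(X)

# The under-sampled family appears as a thin, isolated island far from the
# densely populated bulk along the first principal component.
isolated = int(np.sum(np.abs(scores[:, 0]) > 6.0))
print(isolated)  # → 10: exactly the under-represented samples
```

Inspecting point density in the scores plot, rather than only the scatter itself, is what exposes which regions of the design space the dataset actually covers.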

A culture of careful model criticism is also important for robust applied ML research41. A narrow focus on benchmark tasks can lead to false incremental progress, where, over time, models begin overfitting to a particular test dataset and then lack generalizability beyond the initial dataset. Ref. 42 demonstrated that a broad range of computer vision models suffer from this effect by developing extended test sets for the CIFAR-10 and ImageNet datasets extensively used in the community for model development. This can make it difficult to reason about exactly which methodological innovations truly contribute to generalization performance. Because many aspects of ML research are empirical, carefully designed experiments are needed to separate genuine improvements from statistical effects, and care is needed to avoid post-hoc rationalization (Hypothesizing After the Results are Known (HARK)43).

Historical dataset bias is both unavoidable and unresolvable, but once identified, such bias need not constrain the search for new materials, even in directions that directly contradict it44. For instance, ref. 26 identified anthropogenic biases in the design of amine-templated metal oxides: a small number of amine complexes accounted for the vast majority of the literature. Their solution was to perform 548 randomly generated experiments, both to demonstrate that a global maximum had not been reached and to erode the systematic data bias their models had learned. This is not to say that such an approach is a panacea for dataset or feature-set bias, as these experiments are still designed by scientists carrying their own biases (e.g., using only amines) and may suffer from uncaptured (but important!) features. Of course, the question remains how best to remove human bias from the experimental pipeline.
存在历史数据集偏差既不可避免又无法解决,但一旦识别出这种偏差,就不一定会限制对新材料的搜索,方向与偏差直接相矛盾44。例如,参考文献 26 确定了胺模板金属氧化物设计中的人为偏见,因为绝大多数文献都使用了少量的胺配合物。他们的解决方案是执行 548 个随机生成的实验,以证明尚未达到全局最大值,同时也要削弱他们的模型观察到的系统数据偏差。这并不是说这种方法是解决数据集或特征集偏差的灵丹妙药,因为此类实验仍然由带有自身偏见的科学家设计(例如,仅使用胺),并且可能会受到未捕获(但很重要)特征的影响。当然,一个问题仍然是如何从实验流程中最好地消除人为偏见。
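A minimal sketch of why randomly generated experiments erode this kind of bias (synthetic candidate space; the counts and the 90 % reuse rate are purely illustrative, loosely modeled on the amine example above):

```python
import numpy as np

rng = np.random.default_rng(1)
n_candidates, n_experiments = 40, 100

# Anthropogenic bias: 90 % of historical experiments reuse 3 popular candidates
# (cf. the small set of amines dominating the literature).
popular = np.array([0, 1, 2])
biased = np.where(rng.random(n_experiments) < 0.9,
                  rng.choice(popular, n_experiments),
                  rng.integers(0, n_candidates, n_experiments))

# Randomly generated experiments sample the whole candidate space.
random_design = rng.integers(0, n_candidates, n_experiments)

# Coverage: how many distinct candidates each strategy actually tests.
biased_coverage = np.unique(biased).size
random_coverage = np.unique(random_design).size
```

With the same experimental budget, the random design visits roughly three times as many distinct candidates, which is what allows it to both probe for missed optima and supply the model with counter-examples to the historical bias.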

One potential path forward is the deployment of automated systems that perform the final selection of the experiment to be run and manage data acquisition, in effect attacking the small-dataset problem by using automation to fill in the cracks. Using these tools and adopting objective functions that permit random or maximum-expected-improvement exploration may help researchers avoid biasing their research toward particular solutions, allowing them to focus on higher-level problem formulation and hypothesis specification. Currently, model prototyping is often done in notebook computing environments, which are convenient for exploring new ideas but make it easy to create unsustainable software. More accessible tools for exploring new ideas while maintaining traceability, reproducibility, flexibility, interactivity, and integration with laboratory equipment will help researchers focus on goal setting, intuition and insights for featurization, and data curation. This is analogous to ML life-cycle management45, which is used in industrial settings to ensure traceability of predictions to specific model formulations.
一种可能的前进道路是部署自动化系统,这些系统执行要执行的实验的最终选择并管理数据采集,在功能上通过使用自动化来填补裂缝来解决小型数据集问题。使用这些工具并采用允许随机或最大预期改进探索的目标函数可能有助于研究人员避免将研究偏向于特定解决方案,从而使他们能够更多地关注更高级别的问题表述和假设规范。目前,模型原型设计通常在笔记本计算环境中完成,这便于探索新想法,但很容易创建不可持续的软件。在保持可追溯性、可重复性、灵活性、交互性以及与实验室设备集成的同时,用于探索新想法的更易于使用的工具将帮助研究人员专注于目标设定、特征化的直觉和见解以及数据管理。这类似于 ML 生命周期管理45,用于工业环境,以确保预测的可追溯性到特定模型公式。
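As one hedged sketch of the experiment-selection step, the following implements maximum-expected-improvement acquisition over a Gaussian process surrogate (scikit-learn and SciPy assumed available; the objective function is a synthetic stand-in for an expensive measurement):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(3)

def objective(x):
    # Stand-in for an expensive measurement of a material property.
    return np.sin(3 * x) + 0.5 * x

# A few initial experiments over the accessible parameter range [0, 3].
X = rng.uniform(0, 3, 5).reshape(-1, 1)
y = objective(X).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-6)
gp.fit(X, y)

# Expected improvement over the best observation so far, evaluated on a grid.
grid = np.linspace(0, 3, 301).reshape(-1, 1)
mu, sigma = gp.predict(grid, return_std=True)
best = y.max()
z = (mu - best) / np.maximum(sigma, 1e-12)
ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

next_x = grid[np.argmax(ei)]  # the experiment the system would run next
```

In a closed-loop system, `next_x` would be passed to the instrument, the new measurement appended to `(X, y)`, and the surrogate refit, repeating until the budget is exhausted.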

Keep sight of the goal 关注目标

While the implementation of ML in materials science is often focused on a push for better accuracy and faster calculations, these are not always the only objectives, or even the most important ones. For the ML novice, it is helpful to keep the scientific aim at the forefront when selecting a model and designing training and validation procedures. Consider the trade-off between accuracy and discovery. If one is optimizing machine-learned potentials fit to DFT data46,47, then design may center on the accuracy of predicting material characteristics against an existing benchmark set, and this may lead to better predictions for other known compounds. On the other hand, one may want to sacrifice accuracy for exploratory studies. The aforementioned high-accuracy model may fail to predict the novel combination of physical properties of an undiscovered compound. In fact, even if the phase had been recently identified and included in the training set, the model may not be trustworthy, because benchmark datasets are inherently lacking whenever new science appears.
虽然 ML 在材料科学中的实施通常侧重于推动更高的准确性和更快的计算,但这些并不总是唯一的目标,甚至不是最重要的目标。对于 ML 新手,记住在选择模型然后设计训练和验证程序时将科学目标放在首位是很有帮助的。考虑准确性和发现之间的权衡。如果正在优化用于 DFT46,47 的赝势,那么与现有基准集相比,设计可能以预测材料特性的准确性为中心,这可能会对其他已知化合物产生更好的预测。另一方面,人们可能想为了探索性研究而牺牲准确性。上述高精度模型可能无法预测未被发现的化合物的物理性质的新组合。事实上,即使该阶段最近已被识别并包含在训练集中,但每当出现新科学时,该模型也可能不值得信赖,因为该模型本身就缺乏基准数据集。

There are clearly cases where ML is the obvious choice to accelerate research, but there can be concerns about the suitability of ML to answer the relevant question. Many applied studies focus only on the physical or chemical properties of materials and often fail to include parameters relating to their fundamental utility, such as reproducibility, scalability, stability, productivity, safety, or cost48. While humans may not be able to find correlations or patterns in high-dimensional spaces, we have rich and diverse background knowledge and heuristics, and we have only just begun the difficult work of inventing ways to build this knowledge into ML systems. In addition, for domains with small datasets, limited features, and a strong need for higher-level inference rather than a surrogate model, ML should not necessarily be the default approach. A more traditional approach may be faster given the error of ML models trained on small samples, and heuristics can play a role even with larger datasets49.
显然,在某些情况下,ML 是加速研究的明显选择,但人们可能会担心 ML 是否适合回答相关问题。许多应用研究只关注材料的物理或化学特性,往往不包括与其基本用途相关的参数,如可重复性、可扩展性、稳定性、生产率、安全性或成本48。虽然人类可能无法在高维空间中找到相关性或模式,但我们拥有丰富多样的背景知识和启发式方法;我们才刚刚开始发明将这些知识构建到 ML 系统中的艰难工作。此外,对于数据集较小、功能有限且强烈需要更高级别推理而不是替代模型的领域,ML 不一定是默认方法。由于 ML 模型中与样本量相关的错误,更传统的方法可能更快,并且启发式方法甚至可以在较大的数据集中发挥作用49。

One alternative is to employ a hybrid method, which may bring a Bayesian methodology to the analysis50 or use ML to guide the work through selective intervention51. ML is only a means to model data, and a good fit to the dataset is no guarantee that the model will be useful, since it may have little to no relationship to the actual science: it merely emulates apparent correlations between the features and the targets (Fig. 2). To provide some insight into this issue, Lundberg and Lee52 developed Shapley additive explanations, based on game theory, to assess the impact of each feature on ML predictions.
一种选择是采用混合方法,其中可能包括贝叶斯分析方法50,或者可以使用 ML 通过选择性干预来指导工作51。ML 只是对数据进行建模的一种手段,与数据集的良好拟合并不能保证该模型有用,因为它可能与实际科学几乎没有关系,因为它试图模拟特征和目标之间的明显相关性(图 2)。为了对这个问题提供一些见解,Lee 和 Lundberg52 开发了基于博弈论的 Shapley 加法解释,以评估每个特征对 ML 预测的影响。
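To make the idea concrete, here is a brute-force computation of exact Shapley values for a toy three-feature linear model. Production analyses would use the shap library; the baseline-substitution value function below is one common convention, not the only one, and the linear model is chosen so the result is easy to verify by hand:

```python
import itertools
import math
import numpy as np

# Toy "model": linear, so the exact Shapley values are known (w_i * x_i here).
w = np.array([2.0, -1.0, 0.5])
f = lambda v: float(v @ w)

x = np.array([1.0, 2.0, 3.0])         # instance to explain
baseline = np.array([0.0, 0.0, 0.0])  # reference input ("feature absent")

def value(subset):
    # Model output when only the features in `subset` take their true values;
    # the rest are set to the baseline.
    z = baseline.copy()
    for i in subset:
        z[i] = x[i]
    return f(z)

n = len(x)
phi = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for k in range(n):
        for s in itertools.combinations(others, k):
            # Shapley weight for a coalition of size k.
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            phi[i] += weight * (value(s + (i,)) - value(s))

# Local accuracy: Shapley values sum to the difference between the explained
# prediction and the baseline prediction.
```

For nonlinear models the same weighted average over coalitions applies, but the sum over subsets grows exponentially, which is why approximations such as those in the shap library are used in practice.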

Fig. 2: Comparison of theoretical and ML Models of the Hall-Petch effect.

Hall-Petch 效应的理论模型和ML模型的比较


The success of a given ML model may have little or no relationship to the actual physical processes, as the model is merely interpolating between observations. For example, a Gaussian process model can "capture" the changeover in the behavior of the flow stress in metals from being dependent on grain boundary density in large-grained metals78 to being dominated by grain boundary sliding in nanocrystalline alloys79, even though the model is unaware of either mechanism. However, outside the range of acquired data, the lack of encoded scientific understanding results in rapidly increasing uncertainties, even in well-calibrated systems. Code for reproducing this figure is available at https://github.com/usnistgov/ml-materials-reflections80.
给定 ML 模型的成功可能与实际物理过程几乎没有关系,因为该模型只是在观察之间进行插值。例如,高斯过程模型可以“捕获”金属中流动应力行为的变化,从依赖于大晶粒金属的晶界密度78 转变为由纳米晶合金的晶界滑动主导79,即使该模型不知道这两种机制。然而,在获取的数据范围之外,缺乏编码科学理解会导致不确定性迅速增加,即使在校准良好的系统中也是如此。https://github.com/usnistgov/ml-materials-reflections80 处提供了用于重现此图的代码。
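The behavior in the figure can be illustrated without the repository code. The following sketch (synthetic Hall-Petch-like data, not the authors' dataset; scikit-learn assumed available) fits a Gaussian process to a flow-stress trend with a changeover at small grain size and compares predictive uncertainty inside and far outside the measured range:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(5)

# Synthetic flow-stress trend vs. log grain size: Hall-Petch strengthening at
# large grains, softening (grain boundary sliding) in the nanocrystalline
# regime, with the two branches joined at the changeover.
log_d = np.linspace(0.5, 3.0, 25)  # sampled grain sizes (log10 nm)
stress = np.where(log_d > 1.2,
                  1.0 + 2.0 / np.sqrt(10 ** log_d),  # Hall-Petch branch
                  1.5 - 1.0 * (1.2 - log_d))         # inverse Hall-Petch branch
stress += rng.normal(0, 0.02, log_d.size)

gp = GaussianProcessRegressor(kernel=RBF(1.0) + WhiteKernel(1e-3))
gp.fit(log_d.reshape(-1, 1), stress)

# Predict inside and far outside the measured range.
_, std_in = gp.predict(np.array([[1.5]]), return_std=True)
_, std_out = gp.predict(np.array([[6.0]]), return_std=True)
# std_out greatly exceeds std_in: with no encoded physics, the uncertainty
# grows rapidly once the model must extrapolate beyond the data.
```

The model "captures" the changeover purely by interpolation, exactly as described in the caption, while the extrapolated uncertainty reflects only the kernel prior, not any mechanistic knowledge.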

A corollary is that any ML prediction, especially one made from a small dataset, may be unphysical. Again, we stress that this does not imply that we should never use ML for small datasets. As demonstrated by ref. 53, non-negative matrix factorization can be constrained to provide predictions only within physical spaces. In any case, we need to employ ML tools judiciously and understand their limitations in the context of our scientific goals. For instance, while most ML models are reasonably good at interpolation54, ML is not nearly as robust when used for extrapolation, although this can be mitigated to some extent by including rigorous statistical analyses of the predictions55.
一个推论是,任何 ML 预测,尤其是在使用小型数据集时,都可能是非物理的。我们再次强调,这并不意味着我们永远不应该将 ML 用于小型数据集。如参考文献 53 所示,非负矩阵分解可以限制为仅在物理空间内提供预测。无论如何,我们都需要明智地使用 ML 工具,并了解它们在我们的科学目标背景下的局限性。例如,虽然大多数 ML 模型在插值方面都相当出色54,但 ML 在用于外推时并不那么健壮,尽管这可以通过对预测进行严格的统计分析来在一定程度上缓解55。

A discussion of errors and failure modes, though often lacking or limited, can help one understand the bounds of validity of any ML analysis. An honest discourse includes not only principled estimates of model performance and detailed studies of predictive failure modes but also notes on how reproducible the results are within and across research groups. Explanation of model failure modes is required to validate the use of ML for any application.
对错误和失败模式的讨论可以帮助人们了解任何 ML 分析的有效性界限,尽管它通常缺乏或有限。诚实的论述不仅包括对模型性能的原则估计和对预测失效模式的详细研究,还包括研究小组内部和研究小组之间结果的可重复性。要验证 ML 对任何应用程序的使用,都需要对模型失效模式进行说明。

Finally, one of the biggest potential pitfalls, even for large, well-curated datasets, is losing sight of the goal by focusing on the accuracy of the model rather than using it to learn new science. There is a particular risk of the community spending disproportionate effort incrementally optimizing models to overfit benchmark tasks42, which may not themselves represent meaningful scientific endeavors. We note that in the case of the MatBench benchmark dataset and ML challenge56, many of the top-performing models are neural networks. While these models have impressive predictive capability, their interpretability (and thus their ability to inform scientific progress) is limited. This is also the case for the Open Catalyst Challenge57.
最后,即使对于大型、精心策划的数据集,也可能发生的最大潜在陷阱之一是,如果人们专注于模型的准确性而不是使用它来学习新的科学,可能会忽视目标。社区花费不成比例的努力逐步优化模型以过度拟合基准任务42 存在一个特别的风险,这本身可能甚至可能真正代表有意义的科学努力。我们注意到,在 MatBench 基准数据集和 ML 挑战赛 56 的情况下,许多表现最好的模型都是神经网络。虽然这些模型具有令人印象深刻的预测能力,但它们的可解释性(以及它们为科学进步提供信息的能力)是有限的。Open Catalyst Challenge57 也是如此。

The objective should not be to identify the one algorithm that is good at everything but rather to develop a more focused effort that addresses a specific research question. For ML to reach its true potential to transform research and not just serve as a tool to expedite materials discovery and optimization, it needs to help provide a means to connect experimental and theoretical results instead of simply serving as a convenient vehicle to describe them.
目标不应该是确定一种什么都擅长的算法,而是开发一种更集中的工作来解决特定的研究问题。为了让 ML 发挥其真正的潜力来改变研究,而不仅仅是作为加速材料发现和优化的工具,它需要帮助提供一种将实验和理论结果联系起来的方法,而不仅仅是作为描述它们的便捷工具。

Dream big enough for radical innovation 梦想远大,勇于创新

To date, ML has made its presence felt in materials science mainly in three applications: (1) automating data analysis that used to be done manually; (2) serving as lead generation in a materials-screening funnel, as illustrated by the Open Quantum Materials Database and the Materials Project; and (3) optimizing existing materials, processes, and devices in a broadly incremental manner. While these applications are critically important to the field, radical innovation historically has often been accomplished outside the context of these three general research frameworks, driven by human interest or serendipity along with stubborn trial and error. For instance, graphene was first isolated during the Friday night experiments in which Geim and Novoselov would try out experimental science not necessarily linked to their day jobs. Escobar et al.58 discovered that peeling adhesive tape can emit enough x-rays to produce images. Shirakawa59 discovered a conductive polyacetylene film after accidentally mixing doping materials at a concentration a thousand times too high.
迄今为止,ML 在材料科学中的影响力主要集中在三个方面:(1) 自动化过去手动完成的数据分析;(2) 在材料筛选漏斗中充当潜在客户生成,由 Open Quantum Materials Database 和 Materials Project 说明;(3) 以广泛的增量方式优化现有材料、工艺和设备。虽然这些应用在该领域至关重要,但历史上激进的创新通常是在这三个一般研究框架的背景之外完成的,由人类的兴趣或偶然性以及顽固的试验和错误驱动。例如,石墨烯首次被分离出来是在周五晚上的实验中,当时 Geim 和 Novoselov 会尝试不一定与他们日常工作相关的实验科学。Escobar 等人58 发现剥离的胶带可以发射足够的 X 射线来产生图像。Shirakawa59 通过不小心将浓度高出一千倍的掺杂材料混合,发现了导电聚乙炔膜。

Design research has argued that every radical innovation investigated was achieved without careful analysis of a person's or even a society's needs60. If this is the case, an ultimate question about ML deployment in materials science is whether ML can help humans make the startling discovery of "novel" materials and, eventually, new science. New science often relies on a discrete discovery, possibly outside the context of any existing theory, which is noticeably different from current ML applications that tackle well-defined problems like chess and Jeopardy!.
设计研究认为,所调查的每一项激进创新都是在没有仔细分析个人甚至社会需求的情况下完成的60。如果是这样的话,那么关于材料科学中 ML 部署的终极问题是,ML 能否帮助人类发现“新颖”材料并最终发现新科学?新科学通常依赖于可能在现有理论背景之外的离散发现,这与当前解决国际象棋和 Jeopardy! 等问题的 ML 应用程序明显不同。

According to a categorization proposed in design research60, one can position research by its scientific and applicational familiarity (Fig. 3a). Incremental areas (blue region) offer easier data acquisition and interpretation of results but may hinder new discovery. In contrast, an unexplored area is more likely to provide such unexpected results but presents a huge risk of wasting research resources because of the inherent uncertainty. Self-aware resource allocation and inter-area feedback will be needed to balance novelty against the probability of successful research outcomes. Although there is currently a lack of ML methods that can directly navigate the radical change/radical application region to discover new science, we expect that there are methodologies that can harness ML to increase the chance of radical discovery.
根据设计研究中提出的分类 60,人们可以根据科学和应用的熟悉程度来定位他们的研究(图 3a)。在这里,增量区域(蓝色区域)可以更轻松地获取数据和解释结果,但可能会阻碍新的发现。相比之下,未开发的区域更有可能提供这种意想不到的结果,但由于固有的不确定性,存在浪费研究资源的巨大风险。需要自我意识的资源分配和区域间反馈来平衡新颖性与研究结果成功的可能性。尽管目前缺乏可以直接在激进变化/激进应用领域中导航以发现新科学的 ML 方法,但我们预计有一些方法可以利用 ML 来增加激进发现的机会。

Fig. 3: Use of outside-the-box thinking in advancing scientific research with ML.

利用创新思维推动机器学习的科学研究

a Conceptual research domain defined by a scientific concept and an applicational goal where the arrows represent a radical shift in research driven by outside-the-box thinking and/or creative artificial intelligence (AI). b Machine-learning-involved research loop in conjunction with possible generalization and outside-the-box thinking pathways. Blue arrows illustrate research flows in an incremental domain, green arrows show knowledge-based new research steps, and orange arrows illustrate radical shifts based on new hypotheses and generalizations in the loop.
由科学概念和应用目标定义的概念研究领域,其中箭头代表由开箱即用的思维和/或创造性人工智能 (AI) 驱动的研究的根本转变。b 涉及机器学习的研究循环与可能的泛化和开箱即用的思维途径相结合。蓝色箭头表示增量领域的研究流程,绿色箭头表示基于知识的新研究步骤,橙色箭头表示基于循环中新假设和概括的根本转变。

Active outside-the-box exploration driven by ML-assisted knowledge acquisition 由 ML 辅助知识获取驱动的主动开箱即用探索

Human interests motivate outside-the-box research that may lead to a radical discovery, and these interests are fostered by theoretical or experimental knowledge acquisition. Therefore, applied ML and automated research systems may contribute to discrete discovery by accelerating the knowledge feedback loop (Fig. 3b). Such an ML-involved research loop can include the proposal of hypotheses, theoretical and experimental examination, knowledge extraction, and generalization, which may create opportunities for radical thinking. Analysis and online visualization tools can help better interpret the results and mechanisms of ML-involved research, facilitating new hypotheses and generalization through knowledge extraction. Such interactive analysis and visualization can be implemented at various steps of the research loop, such as feature selection, ML model investigation, and ML interpretation.
人类的兴趣激发了跳出框框的研究,这可能会导致激进的发现,而这些兴趣是通过理论或实验知识的获取来培养的。因此,任何应用的 ML 和自动化研究系统都可以通过加速知识反馈循环来促进离散发现(图 3b)。这种涉及机器学习的研究循环可以包括假设的提出、理论和实验检验、知识提取和泛化,这可能会带来激进思考的机会。分析和在线可视化工具可以帮助更好地解释涉及 ML 的研究的结果和机制,从而通过知识提取促进新的假设和泛化。这种交互式分析/可视化可以在研究循环的各个步骤中实现,例如特征选择、ML 模型研究和 ML 解释。

For ML to play a meaningful role in expediting this loop, one should also maintain exploratory curiosity at each step and be inspired or guided by any outputs while attentively participating in the loop. In addition, at the very beginning of proof-of-concept research, whether within a current research loop or in an outside-the-box search, concerns about reproducibility should not prevent the attempt at new ideas, because the scientific community needs to integrate conflicting observations and ideas into a coherent theory61.
为了让 ML 在加速这个循环中发挥有意义的作用,一个人还应该在每一步都保持探索性的好奇心,并在专心参与循环的同时受到任何输出的启发或指导。此外,在概念验证研究的一开始,无论是在当前的研究循环中还是在开箱即用的搜索中,对可重复性的恐惧都不应该阻止对新想法的尝试,因为科学界需要将相互冲突的观察和想法整合到一个连贯的理论中61。

One can harken back to Delbrück's principle of limited sloppiness62, which reminds us that our experimental designs sometimes test unintended questions and that hidden selectivity requires attention to abnormality. In this context, ML may help us notice anomalies, or even hidden variables, through a rigorous statistical procedure, leading to new pieces of knowledge and outside-the-box exploration. For instance, ref. 63 used automated experiments and statistical analysis to clarify the effect of trace water (a hidden variable) on the crystal/domain growth of halide perovskites (an important property), an effect that had often been communicated only in intra-lab conversation. Since such correlation analysis can only shed light on the domain spanned by the input features, researchers still need to record comprehensive experimental data and metadata to feed these analyses, possibly regardless of their initial interests. Also, an unbiased and flexible scientific attitude grounded in observation may be crucial to reformulating a question after finding an abnormality.
人们可以回想起德尔布吕克的有限草率原则62,它提醒我们,我们的实验设计有时会测试无意的问题,而隐藏的选择性需要关注异常。在这种情况下,ML 可以通过严格的统计程序帮助我们注意到异常甚至隐藏的变量,从而获得新的知识和跳出框框的探索。例如,参考文献 63 使用自动实验和统计分析来阐明微量水(一个隐藏变量)对卤化物钙钛矿(一种重要特性)的晶体/畴生长的影响,这通常只在实验室内部对话中传达。由于这种相关性分析只能阐明输入特征的领域,因此研究人员仍然需要包含数据和元数据的综合实验记录,这可能无论他们最初的兴趣如何。此外,基于观察的公正和灵活的科学态度对于在发现异常后改革问题可能至关重要。
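A hedged sketch of how a rigorous statistical procedure can expose such a hidden variable: fit a model on the recorded features only, then check whether the residuals correlate with a candidate metadata field. All data here are synthetic, and "humidity" plays the role of the trace water in the example above:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200

# Recorded process variable and an unrecorded one (e.g., trace water in the
# glovebox, logged only in the lab-notebook metadata).
temperature = rng.uniform(60, 120, n)
humidity = rng.uniform(0, 1, n)  # hidden variable, not in the model

# The outcome depends on both.
grain_size = 0.05 * temperature + 2.0 * humidity + rng.normal(0, 0.1, n)

# Fit using the recorded feature only (least squares on temperature).
A = np.column_stack([temperature, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, grain_size, rcond=None)
residuals = grain_size - A @ coef

# A strong residual correlation with a candidate metadata field flags the
# hidden variable and prompts a follow-up controlled experiment.
r = np.corrcoef(residuals, humidity)[0, 1]
```

The same residual-versus-metadata screen can be run automatically over every field in the experimental record, which is why comprehensive data and metadata capture matters even for variables no one initially considers important.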

Deep generative inverse design to assist in creating material concepts 深度生成逆向设计,协助创建材料概念

Functionality-oriented inverse design64 is an emerging approach for searching chemical spaces65 for small molecules and possibly solid-state compounds66. Here, generative models simultaneously learn how to map existing materials to a small set of key "latent" variables and how to generate "new" materials from those variables. One can then optimize a material by finding latent variables that should maximize the property of interest and generating a new material from those coordinates. Novel compounds likely to have desired properties can then be sampled from the generative model67. While design spaces, such as the 166 billion molecules mapped by chemical-space projects68, are far beyond human capability to comprehend, ML may distill patterns connecting functionalities and compound structures spanning the space. This approach can be a critical step toward conceptualizing materials design from desired functionalities and further accelerating the ML-driven research loop. One application of such inverse design is to create a property-first optimization loop that includes defining a desired property, proposing a material and structure for that property, validating the results with (automated) experiments, and refining the model.
面向功能的逆向设计64 是一种新兴的方法,用于在化学空间中65 搜索小分子和可能的固态化合物66。在这里,生成模型同时学习如何将现有材料映射到一组几个关键变量,以及如何从这些关键的“潜在”变量生成“新”材料。然后,可以通过找到应该最大化属性的潜在变量,然后从这些坐标生成新材料来优化材料。然后可以从生成模型中采样可能具有所需特性的新型化合物67。虽然设计空间(例如化学空间项目绘制的 1660 亿个分子)68 远远超出了人类全面理解它们的能力,但 ML 可以提炼出连接整个空间的功能和复合结构的模式。这种方法可能是根据所需功能概念化材料设计并进一步加速 ML 驱动的研究循环的关键步骤。这种逆向设计的一个应用是创建一个属性优先的优化循环,其中包括定义所需的属性、为该属性提出材料和结构、通过(自动)实验验证结果以及优化模型。
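A minimal sketch of such a property-first loop, with PCA standing in for the encoder/decoder pair of a VAE and a linear surrogate standing in for the property model (all data synthetic; a real implementation would use a trained generative model and validate the decoded candidate experimentally):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)

# Toy "materials": 6-dimensional descriptors with most variance (and the
# property) concentrated in the first two directions.
X = rng.normal(size=(300, 6))
X[:, 0] *= 3.0
X[:, 1] *= 2.0
prop = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.05, 300)

# Stand-in for the encoder/decoder pair of a VAE.
pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)

# Surrogate property model in latent space.
model = Ridge().fit(Z, prop)

# Property-first step: search the latent space for the most promising
# coordinates, then decode a candidate "material" from them.
grid = np.stack(np.meshgrid(np.linspace(-3, 3, 61),
                            np.linspace(-3, 3, 61)), axis=-1).reshape(-1, 2)
z_best = grid[np.argmax(model.predict(grid))]
candidate = pca.inverse_transform(z_best.reshape(1, -1))[0]  # descriptors to test next
```

The decoded `candidate` closes the loop: it is the proposal handed to (automated) experiments, whose results then refine both the latent representation and the surrogate.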

While these generative methods may begin to approach creativity, they still explicitly aim to learn an empirical distribution from the available data. Therefore, extrapolation outside the current distribution of known materials is not guaranteed to be productive. For instance, these methods would probably not generate a carbon nanotube given only pre-nanotube-era structures for training, nor generate ordered superlattices if there are none in the training data. In addition, these huge datasets are mainly constructed from simulation, and we need to be careful about the gap between simulated and actual experimental data, as discussed previously. Still, a new concept extracted from inverse design may inspire researchers to jump into a new discrete subfield of materials design by actively interpreting the abstracted property-structure relationship.
虽然这些生成方法可能开始接近创造力,但它们仍然明确地旨在根据可用数据学习经验分布。因此,不能保证在已知材料的当前分布之外进行外推是有效的。例如,如果训练数据中没有,这些方法可能不会生成仅用于训练的前纳米管时代结构的碳纳米管,也不会生成有序的超晶格。此外,这些庞大的数据集主要是基于仿真构建的,如前所述,我们需要注意仿真和实际实验数据之间的差距。尽管如此,从逆向设计中提取的新概念可能会激发研究人员通过积极解释抽象的属性-结构关系来跳入材料设计的新离散子领域。

Creative artificial intelligence for materials science 材料科学的创意人工智能

The essence of scientific creativity is the production of new ideas, questions, and connections69. The era of artificial intelligence as an innovative investigator in this sense has yet to arrive. However, since human creativity rests on actively learning and connecting the dots highlighted by our curiosity, it may be possible for machine "learning" to become as creative as humans and thereby reach radical innovation.
科学创造力的本质是产生新的想法、问题和联系69。从这个意义上说,人工智能作为创新研究者的时代尚未到来。然而,由于人类的创造力是通过积极学习和连接我们的好奇心所突出的点来捕捉的,因此机器“学习”有可能像人类一样具有创造力,以实现激进的创新。

While conventional supervised natural language processing70 has required large hand-labeled datasets for training, a recent unsupervised learning study71 indicates the possibility of extracting knowledge from the literature without human intervention to identify relevant content, capturing preliminary materials science concepts such as the underlying structure of the periodic table and structure-property relationships. This was demonstrated by encoding the latent knowledge of the literature into information-dense word embeddings, which recommended some materials for a specific application ahead of their actual discovery. Since the amount of existing literature is too massive for human cognition, such generative artificial intelligence systems may be useful for suggesting a specific design or concept given appropriately defined functionalities.
虽然传统的监督自然语言处理70需要大型手动标记数据集进行训练,但最近的一项无监督学习研究71 表明,可以在没有人工干预的情况下从文献中提取知识,以识别相关内容并捕获初步的材料科学概念,例如元素周期表的底层结构和结构-性质关系。这种无监督学习是通过将潜在文献编码为信息密集的词嵌入来证明的,这为人类发现之前的特定应用推荐了一些材料。由于目前存在的文献数量对于人类认知来说太大了,因此这种生成式人工智能系统可能有助于在适当定义的功能下提出特定的设计或概念。
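The underlying idea can be sketched at toy scale: build word-word co-occurrence counts over a few mock "abstracts" and factor them into low-dimensional embeddings. The cited study used far larger corpora and a skip-gram model; this SVD-of-co-occurrence version is only a stand-in, and the four-line corpus is invented for illustration:

```python
import numpy as np

# Toy corpus standing in for a literature collection.
corpus = [
    "LiCoO2 cathode battery electrode",
    "LiFePO4 cathode battery electrode",
    "Bi2Te3 thermoelectric seebeck",
    "PbTe thermoelectric seebeck",
]
tokens = sorted({w for line in corpus for w in line.split()})
index = {w: i for i, w in enumerate(tokens)}

# Word-word co-occurrence counts within each "abstract".
C = np.zeros((len(tokens), len(tokens)))
for line in corpus:
    ws = line.split()
    for a in ws:
        for b in ws:
            if a != b:
                C[index[a], index[b]] += 1

# Low-rank embedding via truncated SVD.
U, S, _ = np.linalg.svd(C)
emb = U[:, :2] * S[:2]

def similarity(a, b):
    va, vb = emb[index[a]], emb[index[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

# Materials that share application contexts end up close in the embedding
# space even though they never co-occur directly.
```

Here `LiCoO2` and `LiFePO4` never appear in the same "abstract", yet their shared context words place them together, which is the mechanism by which literature embeddings can suggest a material for an application before anyone has tried it.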

Beyond latent variable optimization, one may consider computational creativity, which is used to model imagination in fields such as the arts72, music73, and gaming. This endeavor may start with finding a vector space in which novelty can be measured as a distance74. A novelty-oriented algorithm searches the space for a set of distant new objects that is as diverse as possible, maximizing novelty instead of an objective function75. Since measuring distance in an exploratory space carries its own bias, the deep learning novelty explorer (DeLeNox) was recently proposed76 as a means of dynamically changing the distance functions to improve diversity. These approaches could be applied to materials science to diversify research directions and to help us pose and consider novel materials and ideas, though measuring novelty may be subjective and is perhaps the most challenging task for the community, and one always needs to be mindful of ethical and physical materials constraints.
除了潜在变量优化之外,人们还可以考虑计算创造力,它用于模拟艺术 72、音乐 73 和游戏等领域的想象力。这项工作可能从找到一个向量空间来测量新奇性作为距离开始74。一种面向新颖性的算法在空间中搜索一组遥远的新对象,这些对象尽可能多样化,以最大限度地提高新颖性,而不是目标函数75。由于测量距离和探索空间会存在一些偏差,因此最近提出了深度学习新奇浏览器 (DeLeNox) 76 作为动态更改距离函数以提高多样性的方法。这些方法可以应用于材料科学,以多样化研究方向,并帮助我们提出和考虑新颖的材料和想法,尽管衡量新颖性可能是主观的,并且对社区来说最具挑战性,并且需要始终注意道德和物理材料的限制。
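A minimal novelty-search sketch (not DeLeNox itself, which learns its distance function; the 2-D descriptor space and archive below are invented for illustration): candidates are scored by their distance to the nearest members of an archive, and the most novel one is kept at each step instead of the one maximizing an objective function:

```python
import numpy as np

rng = np.random.default_rng(4)

def novelty(x, archive, k=3):
    # Novelty = mean distance to the k nearest members of the archive.
    d = np.linalg.norm(archive - x, axis=1)
    return np.sort(d)[:k].mean()

# Start from a small archive of "known designs" in a 2-D descriptor space.
archive = rng.normal(0, 0.3, size=(5, 2))

# Novelty-oriented search: at each step, propose random candidates and keep
# the one farthest from everything seen so far.
for _ in range(30):
    candidates = rng.uniform(-3, 3, size=(50, 2))
    scores = [novelty(c, archive) for c in candidates]
    archive = np.vstack([archive, candidates[int(np.argmax(scores))]])

# The archive spreads far beyond the initial cluster of known designs.
spread = np.linalg.norm(archive - archive.mean(axis=0), axis=1).max()
```

Each accepted point changes the distance landscape for the next iteration, so the search keeps pushing outward rather than converging, which is the behavior that makes novelty search a candidate for diversifying research directions.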

Outlook 展望

Machine learning has been effective at expediting a variety of tasks, and the initial stage of its implementation in materials research has already confirmed its great promise for accelerating science and discovery77. To realize that full potential, we need to tailor its usage to answer well-defined questions while keeping in perspective the limits of the resources required and the bounds of meaningful interpretation of the resulting analyses. Eventually, we may be able to develop ML algorithms that consistently lead us to new breakthroughs. In the meantime, a complementary team of humans, ML, and robots has already begun to advance materials science.
机器学习在加速各种任务方面已经很有效,其在材料研究中实施的初始阶段已经证实了它在加速科学和发现方面的巨大前景77。为了实现这一全部潜力,我们需要调整其用途以回答明确定义的问题,同时保持所需资源的局限性和对结果分析的有意义解释的界限。最终,我们或许能够开发出 ML 算法,始终如一地引领我们取得新的突破。与此同时,一个由人类、ML 和机器人组成的互补团队已经开始推进材料科学。

References

  1. Rosenblatt, F. Perceptron simulation experiments. Proc. IRE 48, 301–309 (1960).

  2. Brown, T. et al. Language models are few-shot learners. Adv. Neural Inform. Proc. Syst. 33, 1877–1901 (2020).

  3. Deng, J. et al. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (IEEE, 2009).

  4. D’Amour, A. et al. Underspecification presents challenges for credibility in modern machine learning. arXiv preprint arXiv:2011.03395 (2020).

  5. Hattrick-Simpers, J. R., Choudhary, K. & Corgnale, C. A simple constrained machine learning model for predicting high-pressure-hydrogen-compressor materials. Mol. Syst. Design Eng. 3, 509–517 (2018).

  6. Xue, D. et al. Accelerated search for materials with targeted properties by adaptive design. Nat. Commun. 7, https://doi.org/10.1038/ncomms11241 (2016).

  7. Childs, C. M. & Washburn, N. R. Embedding domain knowledge for machine learning of complex material systems. MRS Commun. 9, 806–820 (2019).

  8. Yamada, H. et al. Predicting materials properties with little data using shotgun transfer learning. ACS Centr. Sci. 5, 1717–1730 (2019).

  9. Hoffmann, J. et al. Machine learning in a data-limited regime: augmenting experiments with synthetic data uncovers order in crumpled sheets. Sci. Adv. 5, eaau6792 (2019).

  10. Goetz, A. et al. Addressing materials’ microstructure diversity using transfer learning. npj Comput. Mater. 8, 1–13 (2022).

  11. Chen, C., Zuo, Y., Ye, W., Li, X. & Ong, S. P. Learning properties of ordered and disordered materials from multi-fidelity data. Nat. Comput. Sci. 1, 46–53 (2021).

  12. Lookman, T., Balachandran, P. V., Xue, D. & Yuan, R. Active learning in materials science with emphasis on adaptive sampling using uncertainties for targeted design. npj Comput. Mater. 5, https://doi.org/10.1038/s41524-019-0153-8 (2019).

  13. Bartel, C. J. et al. A critical examination of compound stability predictions from machine-learned formation energies. npj Comput. Mater. 6 (2020). https://doi.org/10.1038/s41524-020-00362-y. Bartel et al. show that compound stability prediction on the basis of regression models for formation energy cannot be taken at face value.

  14. Holm, E. A. In defense of the black box. Science 364, 26–27 (2019).

  15. He, K., Girshick, R. & Dollár, P. Rethinking imagenet pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4918–4927. https://doi.org/10.1109/ICCV.2019.00502 (2019).

  16. Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).

  17. Kaufmann, K., Zhu, C., Rosengarten, A. S. & Vecchio, K. S. Deep neural network enabled space group identification in EBSD. Microscopy Microanaly. 26, 447–457 (2020).

  18. Maffettone, P. M. et al. Crystallography companion agent for high-throughput materials discovery. Nat. Comput. Sci. 1, 290–297 (2021).

  19. Timoshenko, J. et al. Linking the evolution of catalytic properties and structural changes in copper–zinc nanocatalysts using operando EXAFS and neural-networks. Chem. Sci. 11, 3727–3736 (2020).

  20. Schmeide, K. et al. Technetium immobilization by chukanovite and its oxidative transformation products: Neural network analysis of EXAFS spectra. Sci. Total Environ. 770, 145334 (2021).

  21. Schwartz, R., Dodge, J., Smith, N. A. & Etzioni, O. Green AI. Commun. ACM 63, 54–63 (2020).

  22. Pineau, J. et al. Improving reproducibility in machine learning research: a report from the neurips 2019 reproducibility program. J. Mach. Learning Res. 22 (2021). This report summarizes common sources of computational irreproducibility in machine learning research and assesses the impact of a reproducibility checklist on improving quality and transparency of research.

  23. Jain, A. et al. The Materials Project: a materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).

  24. Grother, P. J. & Flanagan, P. A. NIST special database 19: Handprinted forms and characters database, National Institute of Standards and Technology. https://doi.org/10.18434/T4H01C (1995).

  25. Dwan, K. et al. Systematic review of the empirical evidence of study publication bias and outcome reporting bias. PLoS ONE 3, e3081 (2008).

  26. Jia, X. et al. Anthropogenic biases in chemical reaction data hinder exploratory inorganic synthesis. Nature 573, 251–255 (2019). This work illustrates how follow-on-study bias influences the exploration of subsequent chemical studies across an entire field and shows that more time spent performing “bad” experiments enriches our overall understanding of how inorganic synthesis works.

  27. Wallach, I. & Heifets, A. Most ligand-based classification benchmarks reward memorization rather than generalization. J. Chem. Inform. Modeling 58, 916–932 (2018).

  28. Rauer, C. & Bereau, T. Hydration free energies from kernel-based machine learning: compound-database bias. J. Chem. Phys. 153, 014101 (2020).

  29. Griffiths, R.-R., Schwaller, P. & Lee, A.A. Dataset bias in the natural sciences: a case study in chemical reaction prediction and synthesis design (2021).

  30. Cubuk, E. D., Sendek, A. D. & Reed, E. J. Screening billions of candidates for solid lithium-ion conductors: a transfer learning approach for small data. J. Chem. Phys. 150, 214701 (2019).

  31. Kawazoe, Y., Carow-Watamura, U. & Yu, J.-Z. (eds.) Physical Properties of Ternary Amorphous Alloys. Part 2: Systems from B-Be-Fe to Co-W-Zr (Springer Berlin Heidelberg, 2011). https://doi.org/10.1007/978-3-642-13850-8.

  32. Raccuglia, P. et al. Machine-learning-assisted materials discovery using failed experiments. Nature 533, 73–76 (2016).

  33. Hattrick-Simpers, J. R. et al. An open combinatorial diffraction dataset including consensus human and machine learning labels with quantified uncertainty for training new machine learning models. Integr. Mater. Manufact. Innovat. 10, 311–318 (2021).

  34. Wilkinson, M. D. et al. The FAIR guiding principles for scientific data management and stewardship. Scientific Data 3, https://doi.org/10.1038/sdata.2016.18 (2016).

  35. Meredig, B. et al. Can machine learning identify the next high-temperature superconductor? examining extrapolation performance for materials discovery. Mol. Syst. Design Eng. 3, 819–825 (2018).

  36. Lei, K., Joress, H., Persson, N., Hattrick-Simpers, J. R. & DeCost, B. Aggressively optimizing validation statistics can degrade interpretability of data-driven materials models. J. Chem. Phys. 155, 054105 (2021).

  37. Liu, N. et al. Interactive human–machine learning framework for modelling of ferroelectric–dielectric composites. J. Mater. Chem. C 8, 10352–10361 (2020).

  38. Kusne, A. G. et al. On-the-fly closed-loop materials discovery via bayesian active learning. Nat. Commun. 11, https://doi.org/10.1038/s41467-020-19597-w (2020).

  39. Breuck, P.-P. D., Evans, M. L. & Rignanese, G.-M. Robust model benchmarking and bias-imbalance in data-driven materials science: a case study on MODNet. J. Phys.: Condensed Matter 33, 404002 (2021).

  40. Gomez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Centr. Sci. 4, 268–276 (2018).

  41. Lipton, Z. C. & Steinhardt, J. Troubling trends in machine learning scholarship: some ml papers suffer from flaws that could mislead the public and stymie future research. Queue 17, 45–77 (2019).

  42. Recht, B., Roelofs, R., Schmidt, L. & Shankar, V. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, 5389–5400 (PMLR, 2019).

  43. Gencoglu, O. et al. HARK side of deep learning - from grad student descent to automated machine learning. CoRR abs/1904.07633. http://arxiv.org/abs/1904.07633 (2019).

  44. Nguyen, T. N. et al. Learning catalyst design based on bias-free data set for oxidative coupling of methane. ACS Catal. 11, 1797–1809 (2021).

  45. John, M. M., Olsson, H. H. & Bosch, J. Towards MLOps: a framework and maturity model. In Proc. 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 1–8 (2021).

  46. Behler, J. & Parrinello, M. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Rev. Lett. 98, https://doi.org/10.1103/physrevlett.98.146401 (2007).

  47. Bartók, A. P., Payne, M. C., Kondor, R. & Csányi, G. Gaussian approximation potentials: the accuracy of quantum mechanics, without the electrons. Phys. Rev. Lett. 104, https://doi.org/10.1103/physrevlett.104.136403 (2010).

  48. Olivetti, E. A. & Cullen, J. M. Toward a sustainable materials system. Science 360, 1396–1398 (2018). Discusses materials research in a more general context than simply material properties.

  49. George, J. & Hautier, G. Chemist versus machine: Traditional knowledge versus machine learning techniques. Trends Chem. 3, 86–95 (2021). Discussion of tradeoffs of conventional research compared to AI-assisted techniques and how the two can be synergistically merged.

  50. Gelman, A., Carlin, J. B., Stern, H. S. & Rubin, D. B. Bayesian data analysis (Chapman and Hall/CRC, 1995).

  51. Hutchinson, M. L. et al. Overcoming data scarcity with transfer learning. arXiv preprint arXiv:1711.05099 (2017).

  52. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 30 (2017).

  53. Maffettone, P. M., Daly, A. C. & Olds, D. Constrained non-negative matrix factorization enabling real-time insights of in situ and high-throughput experiments. Appl. Phys. Rev. 9, 041410 (2021).

  54. Friedman, J. H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2017).

  55. Tran, K. et al. Methods for comparing uncertainty quantifications for material property predictions. Mach. Learn.: Sci. Technol. 1, 025006 (2020).

  56. Dunn, A., Wang, Q., Ganose, A., Dopp, D. & Jain, A. Benchmarking materials property prediction methods: the matbench test set and automatminer reference algorithm. npj Comput. Mater. 6, 1–10 (2020).

  57. Chanussot, L. et al. Open Catalyst 2020 (OC20) dataset and community challenges. ACS Catal. 11, 6059–6072 (2021).

  58. Sanderson, K. Sticky tape generates x-rays. Nature https://doi.org/10.1038/news.2008.1185 (2008).

  59. Guo, X. Conducting polymers forward. Nat. Mater. 19, 921–921 (2020).

  60. Norman, D. A. & Verganti, R. Incremental and radical innovation: Design research vs. technology and meaning change. Design Issues 30, 78–96 (2014).

  61. Redish, A. D., Kummerfeld, E., Morris, R. L. & Love, A. C. Opinion: Reproducibility failures are essential to scientific inquiry. Proc. Natl Acad. Sci. 115, 5042–5046 (2018).

  62. Yaqub, O. Serendipity: Towards a taxonomy and a theory. Res. Policy 47, 169 (2018).

  63. Nega, P. W. et al. Using automated serendipity to discover how trace water promotes and inhibits lead halide perovskite crystal formation. Appl. Phys. Lett. 119, 041903 (2021).

  64. Zunger, A. Inverse design in search of materials with target functionalities. Nat. Rev. Chem. 2, https://doi.org/10.1038/s41570-018-0121 (2018).

  65. Kirkpatrick, P. & Ellis, C. Chemical space. Nature 432, 823–823 (2004).

  66. Ren, Z. et al. An invertible crystallographic representation for general inverse design of inorganic crystals with targeted properties. Matter 5, 314–335 (2022).

  67. Sanchez-Lengeling, B. & Aspuru-Guzik, A. Inverse molecular design using machine learning: generative models for matter engineering. Science 361, 360–365 (2018).

  68. Reymond, J.-L. The chemical space project. Acc. Chem. Res. 48, 722–730 (2015).

  69. Lehmann, J. & Gaskins, B. Learning scientific creativity from the arts. Palgrave Commun. 5, https://doi.org/10.1057/s41599-019-0308-8 (2019).

  70. Krallinger, M., Rabal, O., Lourenço, A., Oyarzabal, J. & Valencia, A. Information retrieval and text mining technologies for chemistry. Chem. Rev. 117, 7673–7761 (2017).

  71. Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98 (2019). Unsupervised learning was demonstrated by encoding latent literature into information-dense word embeddings, which recommended some materials for a specific application by capturing materials science concepts.

  72. Ellis, K. et al. DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning. CoRR abs/2006.08381 https://arxiv.org/abs/2006.08381 (2020).

  73. Briot, J., Hadjeres, G. & Pachet, F. Deep learning techniques for music generation - A survey. CoRR abs/1709.01620 http://arxiv.org/abs/1709.01620 (2017).

  74. Berns, S. & Colton, S. Bridging generative deep learning and computational creativity. In Proc. 11th International Conference on Computational Creativity, 406–409 (2020).

  75. Lehman, J. & Stanley, K. O. Abandoning objectives: evolution through the search for novelty alone. Evol. Comput. 19, 189–223 (2011). A novelty-oriented algorithm for finding an instance that differs significantly from previous ones outperformed the objective-based search in some tasks, suggesting that some problems are best solved by methods that ignore the objective.

  76. Liapis, A., Martinez, H. P., Togelius, J. & Yannakakis, G. N. Transforming exploratory creativity with DeLeNoX. CoRR abs/2103.11715 https://arxiv.org/abs/2103.11715 (2021).

  77. Baker, N. et al. Workshop report on basic research needs for scientific machine learning: Core technologies for artificial intelligence. Tech. Rep., USDOE Office of Science, Washington, DC (United States) https://doi.org/10.2172/1478744 (2019).

  78. Cordero, Z. C., Knight, B. E. & Schuh, C. A. Six decades of the Hall–Petch effect – a survey of grain-size strengthening studies on pure metals. Int. Mater. Rev. 61, 495–512 (2016).

  79. Trelewicz, J. R. & Schuh, C. A. The Hall–Petch breakdown in nanocrystalline metals: a crossover to glass-like deformation. Acta Materialia 55, 5948–5958 (2007).

  80. Fujinuma, N., DeCost, B., Hattrick-Simpers, J. & Lofland, S. ml-materials-reflections: v0.1. https://doi.org/10.5281/zenodo.6522627 (2022).

Author information

Authors and Affiliations

  1. Department of Chemical Engineering, Rowan University, 201 Mullica Hill Rd, Glassboro, NJ, USA

    Naohiro Fujinuma

  2. Sekisui Chemical Co., Ltd, 2-4-4 Nishitemma, Kita-ku, Osaka, 530-8565, Japan

    Naohiro Fujinuma

  3. Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, Gaithersburg, MD, USA

    Brian DeCost

  4. Department of Materials Science and Engineering, University of Toronto, 27 King’s College Cir, Toronto, ON, Canada

    Jason Hattrick-Simpers

  5. Department of Physics and Astronomy, Rowan University, 201 Mullica Hill Rd, Glassboro, NJ, USA

    Samuel E. Lofland

Contributions

N.F. Conceptualization (lead), Visualization, Writing (original draft), Writing (review & editing). B.D.C. Conceptualization, Visualization, Writing (original draft), Writing (review & editing). J.H-S. Conceptualization, Writing (original draft), Writing (review & editing). S.L. Conceptualization, Writing (original draft), Writing (review & editing).

Corresponding author

Correspondence to Brian DeCost.

Ethics declarations

Competing interests

The authors declare no competing interests.
