The 3rd Forum of Materials Genome Engineering

6-21. Data-driven formula discovery in materials informatics: algorithms and the program

Sheng Sun*, Tong-Yi Zhang

Materials Genome Institute, Shanghai University, Shanghai, China

Abstract: The rapid development of materials informatics (MI) provides a new paradigm for designing new materials and optimizing existing ones. MI uses machine learning (ML) techniques to build models from data, and then uses those models to predict, for example, the combinations of elements or processes that give a desired property. Traditional ML techniques, which were developed mainly for complex tasks with hundreds or even thousands of features, are usually used to find approximate solutions in materials design. When ML is applied to materials as a specific domain, the modeling effort could be significantly reduced and model quality greatly improved if we take into account existing knowledge accumulated over hundreds of years. The best expression of knowledge in materials and mechanics is the mathematical equation, which plays a central role in understanding, calculating and predicting materials and their phenomena and processes. From the ML point of view, existing equations are extremely sparse models, i.e., each equation includes only a very limited number of variables and terms. This extreme sparsity should be retained when building ML models in MI. It makes methods that are otherwise infeasible in ML, such as L₀ regularization (an NP-hard problem), among the best choices in MI, provided that a highly efficient program is available. In the present work, we provide such a program, which runs in parallel on supercomputer clusters and searches for the best-fitting mathematical equations for given data, starting from the simplest candidates and proceeding gradually to those of greater complexity. The program conducts an exhaustive search of candidate expressions using L₀ regularization, which, in our experience with the complexity of existing equations, can find solutions for most tasks.
For tasks with too many features, the program can instead perform sparse regression with L₁ regularization (LASSO) or symbolic regression with evolutionary computation, according to the user's settings.
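
The simplest-first exhaustive search described above can be illustrated with a small sketch. This is not the authors' program: the function name `l0_search`, the candidate term library, the stopping tolerance and the synthetic data are all assumptions made for illustration. It fits every subset of candidate terms by ordinary least squares, smallest subsets first, and stops at the first complexity level that reproduces the data.

```python
from itertools import combinations

import numpy as np


def l0_search(X, y, names, max_terms=3, tol=1e-8):
    """Fit every subset of candidate terms, sparsest subsets first.

    Returns (terms, coefficients, rmse) for the best subset at the first
    complexity level whose RMSE drops below tol, or the best subset with
    at most max_terms terms if no level reaches tol.
    (Hypothetical helper, not part of any published program.)
    """
    best = None
    for k in range(1, max_terms + 1):            # complexity level: number of terms
        for idx in combinations(range(X.shape[1]), k):
            A = X[:, list(idx)]
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rmse = float(np.sqrt(np.mean((A @ coef - y) ** 2)))
            if best is None or rmse < best[2]:
                best = ([names[i] for i in idx], coef, rmse)
        if best[2] < tol:                        # simplest adequate formula found
            break
    return best


# Synthetic, noise-free data generated from y = 3*x1*x2 - 0.5*x2^2 (assumed example).
rng = np.random.default_rng(0)
x1 = rng.uniform(1.0, 2.0, 50)
x2 = rng.uniform(1.0, 2.0, 50)
y = 3.0 * x1 * x2 - 0.5 * x2**2

# Assumed library of candidate terms built from the two raw features.
names = ["1", "x1", "x2", "x1*x2", "x1^2", "x2^2"]
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

terms, coef, rmse = l0_search(X, y, names)
print(terms, coef, rmse)
```

Each subset fit is independent of the others, which is why a search of this kind can be distributed across many processes. The number of subsets grows combinatorially with the size of the term library, so the approach relies on the formulas being very sparse.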

Keywords: Materials informatics; Sparse modeling; Formula construction; Parallel program

Data-driven intelligent formula derivation: machine learning algorithms and a program for materials informatics

Sheng Sun*, Tong-Yi Zhang

Materials Genome Institute, Shanghai University, Shanghai 200444, China

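Abstract: Guided by the concept of materials genome engineering, materials informatics (MI), an interdisciplinary field spanning materials science, information science and computer science, has developed rapidly and become an effective tool for guiding the research, development and design of new materials. Recent years have seen many studies that design and develop new materials using various machine learning (ML) techniques. MI uses rapidly developing ML techniques to build models from data, and uses those models to predict and discover new materials or to optimize material properties. However, most existing MI work directly applies various ML techniques to build complex models and seek approximate solutions, whereas these techniques were developed mainly for tasks with hundreds or thousands of features. By carefully analyzing the characteristics of the materials domain and drawing on existing knowledge and experience in materials science and engineering, the ML modeling process for MI should be greatly simplified, and model quality substantially improved. In traditional materials science, and indeed in all of natural science, the best form in which to express a model is the mathematical formula: it summarizes our understanding of the phenomena and mechanisms in materials and is the basis of numerical calculation and simulation at all scales. From the ML point of view, the most notable feature of a formula is its extreme sparsity, i.e., each formula contains very few variables and terms. We believe that formula models should be used wherever possible when building ML models in MI, and that the extreme sparsity of formula models will still hold. This extreme sparsity makes ML methods that are infeasible in other domains, such as L₀ regularization (an NP-hard problem), feasible here, provided that a high-performance computing program is available that can search exhaustively, in parallel, for the global optimum. In this talk we provide such a program. It runs in parallel on supercomputer clusters and searches formulas from simple to complex forms, finding for the given data the formula model that is globally optimal and simplest in form. The program also integrates the currently known algorithms for searching formula models. For example, for the small number of tasks that have too many features, the program can, according to the user's settings, use L₁ regularization (LASSO) or evolutionary computation to seek an approximate solution, or first use L₁ regularization with a small penalty to build a subset of candidate features and then apply exhaustive search to find the optimal solution.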
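
The two-stage strategy mentioned at the end of this abstract (a lightly penalized L₁ step to build a candidate subset, followed by exhaustive search) might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: `lasso_cd` is a plain coordinate-descent LASSO written here for self-containment, `exhaustive_l0` is a hypothetical helper, and the penalty `lam=0.05`, the 20 random features and the target formula are made up for the example.

```python
from itertools import combinations

import numpy as np


def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1 (illustrative)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]           # residual without feature j
            rho = X[:, j] @ resid / n
            # Soft-thresholding sets weakly correlated coefficients exactly to zero.
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta


def exhaustive_l0(X, y, candidates, max_terms=2, tol=1e-8):
    """Least-squares fit of every subset of `candidates`, sparsest first."""
    best = None
    for k in range(1, max_terms + 1):
        for idx in combinations(candidates, k):
            A = X[:, list(idx)]
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rmse = float(np.sqrt(np.mean((A @ coef - y) ** 2)))
            if best is None or rmse < best[2]:
                best = (idx, coef, rmse)
        if best is not None and best[2] < tol:
            break
    return best


# Assumed synthetic task: 20 features, only features 3 and 7 enter the formula.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
y = 2.0 * X[:, 3] - 1.5 * X[:, 7]

# Stage 1: a small L1 penalty keeps a reduced set of candidate features.
beta = lasso_cd(X, y, lam=0.05)
candidates = [int(j) for j in np.flatnonzero(beta)]

# Stage 2: exhaustive search restricted to the surviving candidates.
selected, coef, rmse = exhaustive_l0(X, y, candidates)
print(candidates, selected, coef, rmse)
```

The L₁ step shrinks the coefficients it keeps, so its own fit is only approximate. Refitting the surviving candidates by unpenalized least squares in the second stage removes that bias while keeping the search space small.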

Keywords: Materials informatics; Sparse modeling; Formula model; Parallel program

Brief Introduction of Speaker

Sheng Sun

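Ph.D., associate research fellow and doctoral supervisor at the Materials Genome Institute, Shanghai University. Received a Ph.D. from the Hong Kong University of Science and Technology in 2011, and from 2012 to 2014 was a research assistant in the Department of Mechanical and Aerospace Engineering at the same university. Current research interests: 1) first-principles and continuum multiscale simulation of solid-liquid interfaces in electrochemical systems for battery and corrosion problems; 2) mechanical and thermodynamic modeling of electrochemical/mechanical multi-field coupling in battery and corrosion problems; 3) mechanics informatics and materials informatics. Principal investigator of two projects funded by the National Natural Science Foundation of China, and participant in two projects under the Ministry of Science and Technology key special program "Key Technologies and Supporting Platforms for Materials Genome Engineering".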

Email: mgissh@shu.edu.cn