The 4th Forum of Materials Genome Engineering

S-4-06 An AI Framework and Program for Rule-based Formula Discovery from Small and Noised Data

Sheng Sun*, Tong-Yi Zhang

Materials Genome Institute, Shanghai University, Shanghai 200444, China

ABSTRACT: Materials informatics (MI) is developing rapidly and has become an effective tool for designing and developing new materials. MI uses fast-advancing machine learning (ML) techniques to build models from data, and then uses those models to predict and discover materials with new and/or optimized properties. One obstacle in MI is the scarcity of materials data, whereas most ML techniques are developed for big-data tasks with hundreds or even thousands of features. Combining ML with the expert knowledge accumulated in the materials field over several centuries promises to simplify ML models in MI and improve their quality.

In accordance with the developing paradigm of science and technology, the best ML model in natural science should be a mathematical equation. Equations play the central role in understanding the phenomena and mechanisms of materials, and they serve as the foundation for numerical calculations and simulations at various scales. Compared with other ML models, mathematical equations are simple and explainable. They are also extremely sparse: a formula contains only a very limited number of variables, terms, and primitive functions. Knowledge of existing equations gives us a chance to shrink the search space of candidate equations for given data.

In the present work, we propose an AI framework with a parallel program to perform rule-based formula discovery. Starting from the key variables that users identify for the system under study, which together define an implicit equation relating them, the program automatically performs dimensional analysis and constructs a feature space using elementary functions and operators, including multiplication and division. Candidate equations made up of features from this space are constructed exhaustively and evaluated using the pseudo-inverse matrix. Finally, the program recommends solutions by balancing the accuracy of each formula against its complexity. The program also provides the LASSO algorithm and evolutionary computation as alternative methods for finding approximate solutions when the search space of candidate equations is too large for exhaustive search.
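The exhaustive search step can be illustrated with a minimal sketch. It is not the authors' program: the feature library (each variable, its square, pairwise ratios and products), the function names `build_features`, `exhaustive_search`, and `recommend`, and the scoring rule (relative error plus a per-term penalty) are all assumptions made for illustration. Dimensional analysis and parallel execution are omitted.

```python
import itertools
import numpy as np

def build_features(X, names):
    """Build a small hypothetical feature library from the key variables."""
    feats = {}
    for i, n in enumerate(names):
        feats[n] = X[:, i]
        feats[f"{n}^2"] = X[:, i] ** 2
    # Ratios are order dependent, so use ordered pairs.
    for i, j in itertools.permutations(range(len(names)), 2):
        feats[f"{names[i]}/{names[j]}"] = X[:, i] / X[:, j]
    # Products are symmetric, so use unordered pairs.
    for i, j in itertools.combinations(range(len(names)), 2):
        feats[f"{names[i]}*{names[j]}"] = X[:, i] * X[:, j]
    return feats

def exhaustive_search(feats, y, max_terms=2):
    """Fit every linear combination of up to max_terms features."""
    results = []
    for k in range(1, max_terms + 1):
        for combo in itertools.combinations(list(feats), k):
            A = np.column_stack([feats[c] for c in combo])
            coef = np.linalg.pinv(A) @ y  # least-squares solution via pseudo-inverse
            rmse = float(np.sqrt(np.mean((A @ coef - y) ** 2)))
            results.append((combo, coef, rmse))
    return results

def recommend(results, y, penalty=0.05):
    """Pick the candidate with the best accuracy/complexity trade-off.

    The score (relative RMSE plus a penalty per term) is an assumed rule,
    not the one used in the original program.
    """
    scale = float(np.std(y))
    return min(results, key=lambda r: r[2] / scale + penalty * len(r[0]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(1.0, 2.0, size=(50, 2))
    y = 2.0 * X[:, 0] / X[:, 1] + 3.0 * X[:, 0] * X[:, 1]  # hypothetical target
    terms, coef, rmse = recommend(exhaustive_search(build_features(X, ["a", "b"]), y), y)
    print(terms, coef, rmse)
```

The number of candidates grows combinatorially with the size of the feature library and the number of terms allowed, which is why an approximate method becomes attractive for large search spaces.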

Preliminary tests showed that the program can complete a search task with 50 data points and 10⁹ candidate equations in several hours on a single node (24 cores) of a supercomputer cluster. Two tests on pre-set equations showed that the equations can be successfully rediscovered from only 50 data points, generated from the pre-set equations, with 5% Gaussian-distributed errors. Tests on experimental data also demonstrated the success of the framework: the equation found from the (training) data on creep of nano-twinned Cu successfully describes the (testing) data on stress relaxation of nano-twinned Cu, nano-grained Cu, and coarse-grained Cu. The AI-discovered equation performs slightly better than the expert-knowledge-based equation in both the training and testing cases.
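A synthetic test of this kind might be set up as in the sketch below. The pre-set equation, the variable ranges, and the reading of "5% Gaussian errors" as multiplicative noise with a relative standard deviation of 0.05 are all assumptions; the abstract does not state the actual equations or the noise model. The helper names `make_noisy_data` and `fit_terms` are hypothetical.

```python
import numpy as np

def make_noisy_data(n=50, rel_noise=0.05, seed=0):
    """Generate n points from a hypothetical pre-set equation with relative Gaussian noise."""
    rng = np.random.default_rng(seed)
    a = rng.uniform(1.0, 2.0, n)
    b = rng.uniform(1.0, 2.0, n)
    y_true = 2.0 * a / b + 3.0 * a * b  # assumed pre-set equation
    y = y_true * (1.0 + rel_noise * rng.standard_normal(n))
    return a, b, y

def fit_terms(columns, y):
    """Least-squares fit of a linear combination of the given feature columns.

    Returns the coefficients and the root-mean-square relative residual.
    """
    A = np.column_stack(columns)
    coef = np.linalg.pinv(A) @ y
    rel_resid = (A @ coef - y) / y
    return coef, float(np.sqrt(np.mean(rel_resid ** 2)))

if __name__ == "__main__":
    a, b, y = make_noisy_data()
    coef_true, err_true = fit_terms([a / b, a * b], y)  # correct functional form
    coef_wrong, err_wrong = fit_terms([a ** 2], y)      # deliberately misspecified form
    print("true form: ", coef_true, err_true)
    print("wrong form:", coef_wrong, err_wrong)
```

With the correct functional form, the relative residual should sit near the injected noise level, while a misspecified form leaves a clearly larger residual. That gap is what lets a search rank the pre-set equation above its competitors even with only 50 points.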

Keywords: materials informatics; symbolic regression; sparse regression; evolutionary computation; parallel program

Brief Introduction of Speaker

Sheng Sun

Sheng Sun is an associate professor at the Materials Genome Institute at Shanghai University, China. He earned his PhD degree in bioengineering from The Hong Kong University of Science and Technology in 2011. His current research focuses on materials informatics by combining multiscale simulations, mathematical modeling, and machine learning.