An AI Framework and Program for Rule-based Formula Discovery from Small and Noised Data
Sheng Sun*, Tong-Yi Zhang
Materials Genome Institute, Shanghai University, Shanghai 200444, China
ABSTRACT: Materials informatics (MI) is developing rapidly and has become an effective tool for designing and developing new materials. MI uses fast-evolving machine learning (ML) techniques to build models from data, and then uses those models to predict and discover materials with new and/or optimized properties. One obstacle in MI is the scarcity of materials data, whereas most ML techniques were developed for big-data tasks with hundreds or even thousands of features. Combining ML with the expert knowledge accumulated in the materials field over several centuries promises to simplify ML models in MI and improve their quality.
In accordance with the developing paradigm of science and technology, the best ML models in natural science should be mathematical equations, which play the central role in understanding the phenomena and mechanisms of materials and serve as the foundation for numerical calculations and simulations at various scales. Compared with other ML models, mathematical equations are simple and explainable. In addition, they are extremely sparse: a formula contains only a very limited number of variables, terms, and primitive functions. Knowledge of existing equations gives us a chance to shrink the search space of candidate equations for given data.
In the present work, we propose an AI framework with a parallel program to perform rule-based formula discovery. Starting from the key variables that users identify for the system under study, which are related by an implicit equation, the program automatically performs dimensional analysis and constructs a feature space using elementary functions and operators, including multiplication and division.
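To make the dimensional-analysis step concrete, the following is a minimal sketch and not the authors' program. It assumes each variable carries a vector of dimension exponents over (mass, length, time) and that features are monomials built only by multiplication and division; the variable names, the `DIMENSIONS` table, and the helper functions are all hypothetical.

```python
from itertools import product

# Hypothetical variables with dimension exponents over (mass, length, time).
DIMENSIONS = {
    "F": (1, 1, -2),   # force
    "m": (1, 0, 0),    # mass
    "a": (0, 1, -2),   # acceleration
    "v": (0, 1, -1),   # velocity
}

def monomial_dimension(exponents, names):
    """Dimension of prod(x_i ** e_i): the exponent-weighted sum of dimensions."""
    return tuple(
        sum(e * DIMENSIONS[n][k] for e, n in zip(exponents, names))
        for k in range(3)
    )

def build_features(names, target_dim, max_power=2):
    """List monomials in `names` (integer powers in -max_power..max_power)
    whose dimension equals `target_dim`. Negative powers stand for division."""
    features = []
    for exps in product(range(-max_power, max_power + 1), repeat=len(names)):
        if any(exps) and monomial_dimension(exps, names) == target_dim:
            features.append(dict(zip(names, exps)))
    return features

# Which monomials in m, a, v have the dimension of a force?
features = build_features(["m", "a", "v"], DIMENSIONS["F"])
print(features)
```

Filtering by dimension before any fitting is what keeps the feature space small: most products and quotients of the input variables are discarded because they cannot appear in a dimensionally consistent equation.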
Candidate equations made up of features from the feature space are exhaustively constructed and evaluated using the pseudo-inverse matrix. Finally, the program recommends solutions by balancing the accuracy of each formula against its complexity.
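The exhaustive search can be illustrated with a short sketch under stated assumptions: candidate equations are linear combinations of a few feature columns, coefficients come from the Moore-Penrose pseudo-inverse, and candidates are ranked by RMSE plus a per-term penalty. The scoring rule, the `penalty` value, the explicit target `y`, and the toy feature set are illustrative choices, not details taken from the program described here.

```python
from itertools import combinations
import numpy as np

def exhaustive_search(features, y, max_terms=2, penalty=0.01):
    """Fit y ~ sum_j c_j * features[:, j] for every subset of up to
    `max_terms` columns and rank subsets by RMSE + penalty * n_terms."""
    results = []
    n_features = features.shape[1]
    for k in range(1, max_terms + 1):
        for subset in combinations(range(n_features), k):
            A = features[:, list(subset)]
            coef = np.linalg.pinv(A) @ y            # least-squares coefficients
            rmse = np.sqrt(np.mean((A @ coef - y) ** 2))
            results.append((rmse + penalty * k, subset, coef))
    results.sort(key=lambda r: r[0])                # best (lowest) score first
    return results

rng = np.random.default_rng(0)
x1, x2 = rng.uniform(1.0, 2.0, size=(2, 50))        # 50 data points
# Hypothetical feature space: columns are x1, x2, x1*x2, x1/x2, x1**2.
features = np.column_stack([x1, x2, x1 * x2, x1 / x2, x1 ** 2])
y_true = 3.0 * x1 * x2 + 0.5 * x1 / x2              # pre-set equation
y = y_true * (1.0 + 0.05 * rng.standard_normal(50)) # 5% Gaussian noise

best_score, best_subset, best_coef = exhaustive_search(features, y)[0]
print(best_subset, best_coef)
```

Each subset is fitted independently of the others, so the loop parallelizes naturally across cores. The number of subsets grows combinatorially with the size of the feature space and the number of terms allowed.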
In addition, the program provides the LASSO algorithm and evolutionary computation as alternative methods for finding approximate solutions when the search space of candidate equations is too large to search exhaustively.
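As an illustration of the sparse-regression alternative, here is a minimal LASSO sketch solved by cyclic coordinate descent in NumPy. It minimizes (1/(2n))·||y − Xw||² + alpha·||w||₁. The solver, the `alpha` value, and the synthetic data are assumptions made for this example; the evolutionary alternative is not sketched.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; exact zero inside the band."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=500):
    """Minimize (1/(2n)) * ||y - X w||^2 + alpha * ||w||_1 by cyclic updates."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    col_norm = (X ** 2).sum(axis=0) / n_samples
    residual = y - X @ w
    for _ in range(n_iter):
        for j in range(n_features):
            residual += X[:, j] * w[j]              # remove feature j's contribution
            rho = X[:, j] @ residual / n_samples    # correlation with partial residual
            w[j] = soft_threshold(rho, alpha) / col_norm[j]
            residual -= X[:, j] * w[j]              # add the updated contribution back
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))                    # 50 data points, 8 candidate features
y = 2.0 * X[:, 1] - 1.5 * X[:, 4] + 0.05 * rng.standard_normal(50)

w = lasso_coordinate_descent(X, y, alpha=0.1)
selected = np.flatnonzero(w)                        # features kept by the L1 penalty
print(selected, w[selected])
```

The L1 penalty drives most coefficients to exactly zero, so one fit over the whole feature space proposes a short list of terms without enumerating subsets. The surviving coefficients are shrunk toward zero, so refitting the selected terms by least squares is a common follow-up.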
Preliminary tests showed that the program can complete a search task with 50 data points and 10^9 candidate equations in several hours on a single node (24 cores) of a supercomputer cluster. Two tests on pre-set equations showed that the equations can be successfully rediscovered from only 50 data points, generated from the pre-set equations, with 5% Gaussian-distributed errors. Tests on experimental data also demonstrated the success of the framework: the equation found from the (training) data on creep of nano-twinned Cu successfully describes the (testing) data on stress relaxation of nano-twinned Cu, nano-grain Cu, and coarse-grain Cu. The AI-discovered equation is slightly better than the expert-knowledge-based equation in both the training and testing cases.
Keywords: materials informatics; symbolic regression; sparse regression; evolutionary computation; parallel program
Sheng Sun is an associate professor at the Materials Genome Institute at Shanghai University, China. He earned his PhD degree in bioengineering from The Hong Kong University of Science and Technology in 2011. His current research focuses on materials informatics by combining multiscale simulations, mathematical modeling, and machine learning.