Pengcheng Xu, Xiaobo Ji, Minjie Li, Wencong Lu
Shanghai University, Shanghai, 200444, China
Abstract: Machine learning primarily optimizes the performance of computer programs using data or prior experience. In contrast to experimental trial-and-error methods, machine learning allows for the rapid extraction of patterns and trends from available data without a deep understanding of the underlying physical mechanisms, guiding the development of materials. Data serves as the cornerstone of machine learning models, directly influencing their performance. However, materials science faces data-related challenges, and specific quantitative indicators for data size are seldom defined. When exploring causal relationships, models constructed with small samples offer simpler uncertainty assessments than models built on large datasets. However, small samples often lead to data and model imbalance, overfitting, or underfitting because of the small data scale and excessively high or low feature dimensionality. This abstract summarizes approaches for handling small samples in materials science, focusing on data sources, algorithms, and machine learning strategies. Regarding data sources, advancements in natural language processing and text mining enable the automatic extraction of data from publications. The development of materials databases facilitates the convenient collection of fragmented materials data. Additionally, high-throughput techniques can rapidly generate a large quantity of high-quality data through experimental or computational methods. Machine learning models rely not only on data but also on algorithms, and certain algorithms are inherently suitable for modeling with small samples. Algorithms suitable for small datasets include support vector machines, Gaussian process regression, random forests, XGBoost, gradient boosting decision trees, and symbolic regression. Imbalance learning, which primarily addresses classification tasks, arises from an uneven distribution of sample sizes across classes, particularly when samples in the minority class are limited.
In terms of sample size, minority samples in imbalanced data can be categorized as absolute minority samples and relative minority samples. Absolute minority samples refer to a scarcity of data in the minority class, which limits the information contained in the data and makes it challenging for classifiers to capture information from minority class samples. Relative minority samples mean that the minority class makes up only a small proportion of the data compared with the majority class, which blurs the boundaries of the minority class and reduces the classifier's ability to identify its samples. Active learning selects samples for labeling from a large pool of unlabeled data, so that a small labeled set represents as much information as possible. Active learning thus facilitates big data analysis and processing with small data. Transfer learning acquires knowledge from a given source domain and learning task, then adjusts the parameters of the pre-trained model with small samples in the target domain to enhance predictive accuracy. Due to the inherent characteristics of the materials field, most materials data for machine learning are expected to remain in the small sample stage. It is essential to consider both data and algorithms, building a comprehensive database or integrating existing technologies to increase sample size. Continuous efforts are required to develop small sample modeling algorithms and machine learning strategies. In the future, small sample machine learning is expected to contribute to better handling of valuable yet limited experimental data, accelerating materials design and discovery.
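As an illustration of the active learning strategy described above, the following is a minimal sketch and not the authors' implementation. It assumes a one-dimensional candidate pool, a hand-written Gaussian process with a squared-exponential kernel, and a hypothetical function `toy_property` standing in for an expensive experiment or calculation. At each round, the candidate with the largest predictive variance is labeled next.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at x_query."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query)
    solved = np.linalg.solve(k_tt, k_tq)          # K^-1 k*
    mean = solved.T @ y_train
    var = 1.0 - np.sum(k_tq * solved, axis=0)     # prior variance of this kernel is 1
    return mean, np.clip(var, 0.0, None)

def toy_property(x):
    """Hypothetical stand-in for an expensive measurement (an assumption)."""
    return np.sin(6.0 * x) + 0.5 * x

def active_learning(n_rounds=8, seed=0):
    rng = np.random.default_rng(seed)
    pool = np.linspace(0.0, 1.0, 101)             # unlabeled candidates
    labeled = [int(i) for i in rng.choice(len(pool), size=3, replace=False)]
    for _ in range(n_rounds):
        x_l = pool[labeled]
        _, var = gp_posterior(x_l, toy_property(x_l), pool)
        var[labeled] = -1.0                       # never re-select a labeled point
        labeled.append(int(np.argmax(var)))       # query the most uncertain candidate
    return pool, labeled

pool, labeled = active_learning()
mean, var = gp_posterior(pool[labeled], toy_property(pool[labeled]), pool)
rmse = float(np.sqrt(np.mean((mean - toy_property(pool)) ** 2)))
print(f"labeled {len(labeled)} of {len(pool)} candidates, RMSE = {rmse:.3f}")
```

Maximum predictive variance is only one possible query criterion. Acquisition functions that balance uncertainty against the predicted property value are common alternatives when the goal is optimization rather than model accuracy.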
Keywords: Small Sample Machine Learning; Materials Design; Imbalance Learning
References:
[1] P. Xu, X. Ji, M. Li, W. Lu, npj Comput Mater 2023, 9 (1), 42.
[2] C. Yang, C. Ren, Y. Jia, G. Wang, M. Li, W. Lu, Acta Materialia 2022, 222, 117431.
Minjie Li is an associate professor in the Department of Chemistry at Shanghai University. She received her Ph.D. in Organic Chemistry from the University of Science and Technology of China in 2007. She was a visiting scholar with Prof. Alan Aspuru-Guzik at the Vector Institute for Artificial Intelligence at the University of Toronto. She has been honored with the Pujiang Talent Program. Her research focuses on the design and performance optimization of molecules and materials. She has led 9 national and provincial-level projects, published over 50 SCI papers, authored 2 research monographs and 2 textbooks, and been granted more than 10 national invention patents and software copyrights.