Machine Learning for a Small
Number of Labeled Samples
Xuemei Pu*, Yuanyuan Jiang, Yunhao Xie, Renling Hu,
Ling Hu, Liying Wang
College of Chemistry, Sichuan University,
Chengdu 610064
ABSTRACT:
Artificial intelligence, especially
machine learning, has shown promising merits and application
potential in the biomedical and chemical fields. However, compared with
genetic and proteomic data, some material domains suffer from a
small number of labeled samples, which restricts the application of machine
learning in these fields. To address this limitation, it is essential to
explore machine learning strategies suited to small labeled datasets. This report
introduces two of our recent studies, on cocrystal screening and on the rapid
design of energetic materials with high heat of explosion, by means of active
learning and transfer learning strategies.
1) To develop a machine learning-based prediction model with
high accuracy and good generalization for cocrystal screening, we built a new
dataset of 7881 cocrystal samples and proposed a hierarchical representation
of cocrystal samples that combines molecular graphs with manually selected
molecular descriptors. More importantly, we constructed a novel deep learning
framework based on a graph neural network. The model achieves an accuracy of
97.87% on the independent test set, remarkably surpassing the classic graph neural
network model and traditional machine learning models. In addition, by
designing a transfer learning strategy, the deep learning-based model is
further extended to cocrystal fields with very limited data, such as
energetic cocrystals, for which the accuracy on the independent test set is as high
as 95%.
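The transfer idea in 1) can be illustrated with a minimal sketch: pretrain on a large source dataset, then fine-tune the pretrained weights on a small target dataset. This is not the authors' graph neural network; a plain logistic regression is swapped in so the example stays self-contained, and the synthetic data, the function names (`train_logreg`, `accuracy`) and all hyperparameters are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, w=None, lr=0.1, steps=500):
    """Gradient-descent logistic regression; passing `w` gives a warm start."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    w = np.zeros(Xb.shape[1]) if w is None else w.copy()
    for _ in range(steps):
        grad = Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
        w -= lr * grad
    return w

def accuracy(w, X, y):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean((sigmoid(Xb @ w) > 0.5) == y))

# Synthetic stand-ins (assumption): a large "source" task and a related
# "target" task for which only 30 labeled samples are available.
rng = np.random.default_rng(0)
w_src = rng.normal(size=8)
w_tgt = w_src + 0.3 * rng.normal(size=8)  # related but shifted decision rule

X_src = rng.normal(size=(5000, 8))
y_src = (X_src @ w_src > 0).astype(float)
X_tgt = rng.normal(size=(30, 8))
y_tgt = (X_tgt @ w_tgt > 0).astype(float)
X_test = rng.normal(size=(1000, 8))
y_test = (X_test @ w_tgt > 0).astype(float)

# Step 1: pretrain on the large source dataset.
w_pre = train_logreg(X_src, y_src)
# Step 2: fine-tune on the small target dataset, starting from the
# pretrained weights, with a smaller learning rate and fewer steps.
w_ft = train_logreg(X_tgt, y_tgt, w=w_pre, lr=0.05, steps=50)
# Baseline: train on the small target dataset from scratch.
w_scratch = train_logreg(X_tgt, y_tgt)

print("fine-tuned:", accuracy(w_ft, X_test, y_test))
print("from scratch:", accuracy(w_scratch, X_test, y_test))
```

Comparing the two printed accuracies shows how much the pretrained starting point helps on this toy task; the outcome on real cocrystal data would depend on how closely the source and target domains are related.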
2) We introduced an adaptive design strategy to accelerate the design
of energetic materials with high heat of explosion. The strategy is composed of
a molecular generator, a machine learning model, a selector and quantum
mechanical (QM) verification. Starting from 88 initial labeled samples and guided by
the prediction uncertainty, the third iteration finds the compound with the
highest heat of explosion in an unexplored search space containing ca.
ninety thousand samples. The adaptive strategy could be extended to other
properties of energetic compounds and to other fields.
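The loop in 2) can be sketched as follows, under stated assumptions. A fixed pool of random feature vectors stands in for the molecular generator, a bootstrap ensemble of ridge regressions stands in for the machine learning model and its uncertainty estimate, an upper-confidence-bound rule (`mean + kappa * std`) stands in for the selector, and a cheap `oracle` function stands in for QM verification. None of these choices, names or parameter values comes from the original work; only the 88 initial samples and the three iterations mirror the text.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-2):
    """Closed-form ridge regression with a bias column."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ y)

def predict_with_uncertainty(X_train, y_train, X_query, rng, n_models=20):
    """Bootstrap ensemble: mean prediction and its spread as uncertainty."""
    Xq = np.hstack([X_query, np.ones((len(X_query), 1))])
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y_train), size=len(y_train))  # resample
        preds.append(Xq @ fit_ridge(X_train[idx], y_train[idx]))
    preds = np.asarray(preds)
    return preds.mean(axis=0), preds.std(axis=0)

def adaptive_design(X_pool, oracle, n_init=88, n_iter=3, batch=10,
                    kappa=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Initial labeled set, drawn at random from the candidate pool.
    labeled = [int(i) for i in rng.choice(len(X_pool), size=n_init,
                                          replace=False)]
    y = {i: oracle(X_pool[i]) for i in labeled}
    for _ in range(n_iter):
        unlabeled = np.array([i for i in range(len(X_pool)) if i not in y])
        y_train = np.array([y[i] for i in labeled])
        mean, std = predict_with_uncertainty(
            X_pool[labeled], y_train, X_pool[unlabeled], rng)
        # Selector: favor high predicted value and high uncertainty.
        picks = unlabeled[np.argsort(mean + kappa * std)[-batch:]]
        for i in picks:
            # "Verification": label the selected candidates with the oracle.
            y[int(i)] = oracle(X_pool[int(i)])
            labeled.append(int(i))
    best = max(y, key=y.get)
    return best, y[best], labeled

# Toy usage (assumption): 2000 candidates with a hidden linear property.
rng = np.random.default_rng(1)
X_pool = rng.normal(size=(2000, 5))
w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
oracle = lambda x: float(x @ w_true)

best_idx, best_value, labeled = adaptive_design(X_pool, oracle)
print(best_idx, best_value, len(labeled))
```

With these settings only 118 of the 2000 candidates are ever labeled. The `kappa` parameter controls the balance between exploiting high predictions and exploring uncertain regions.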
Keywords: Deep learning; Limited dataset; Active learning; Transfer learning
Dr. Xuemei Pu is a professor in the College of Chemistry, Sichuan University, and a member of the Computational Chemistry Professional Committee of the Chinese Chemical Society. In recent years, she has carried out a series of research projects in the fields of functional materials and biomedicine, supported by multiple grants from the National Natural Science Foundation of China. She has co-authored more than 100 SCI papers and applied for 10 patents (4 granted) and two computer software copyrights.