S-4-14 Machine Learning for A Small Amount of Samples Labeled

Xuemei Pu*, Yuanyuan Jiang, Yunhao Xie, Renling Hu, Ling Hu, Liying Wang

College of Chemistry, Sichuan University, Chengdu 610064

 

ABSTRACT: Artificial intelligence, and machine learning in particular, has shown considerable promise in the biomedical and chemical fields. However, in contrast to genetic and proteomic data, data in some materials domains include only a small number of labeled samples, which restricts the application of machine learning in these fields. Exploring machine learning strategies suited to small labeled datasets is therefore essential. This report introduces two of our recent studies, on cocrystal screening and on the rapid design of energetic materials with high heat of explosion, by means of transfer learning and active learning strategies.

1)     To build a machine learning model for cocrystal screening with high accuracy and good generalization, we constructed a new dataset of 7881 cocrystal samples and proposed a hierarchical representation of each sample that combines molecular graphs with manually selected molecular descriptors. More importantly, we constructed a novel deep learning framework based on a graph neural network. The model reaches an accuracy of 97.87% on the independent test set, markedly surpassing both the classic graph neural network model and traditional machine learning models. In addition, by designing a transfer learning strategy, we extended the deep learning model to cocrystal subfields with very limited data, such as energetic cocrystals, where the accuracy on the independent test set is as high as 95%.

2)     We introduced an adaptive design strategy to accelerate the design of energetic materials with high heat of explosion. The strategy consists of a molecular generator, a machine learning model, a selector, and quantum mechanical (QM) verification. Starting from 88 labeled initial samples and guided by the prediction uncertainty, the strategy finds the compound with the highest heat of explosion in the third iteration, within an unexplored search space of ca. ninety thousand samples. The adaptive strategy could be extended to other properties of energetic compounds and to other fields.
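
The transfer learning idea in study 1) can be illustrated with a minimal sketch. It is not the authors' graph neural network. It assumes a tiny one-hidden-layer classifier written in plain Python, hypothetical toy data (`make_data`), and one common transfer recipe: pretrain on a large source set, then freeze the hidden layer and fine-tune only the output head on a small target set. All names, sizes, and learning rates are illustrative assumptions.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def forward(params, x):
    """Return hidden activations and predicted probability for one sample."""
    W, v, b = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]
    return h, sigmoid(sum(vj * hj for vj, hj in zip(v, h)) + b)

def train(params, data, epochs, lr, freeze_hidden=False):
    """Plain SGD on cross-entropy loss; the input parameters are not mutated."""
    W, v, b = [row[:] for row in params[0]], params[1][:], params[2]
    for _ in range(epochs):
        for x, y in data:
            h, p = forward((W, v, b), x)
            err = p - y  # gradient of the loss with respect to the logit
            if not freeze_hidden:
                # Update the hidden layer first, using the current head weights.
                for j, row in enumerate(W):
                    g = err * v[j] * (1.0 - h[j] ** 2)
                    for i in range(len(row)):
                        row[i] -= lr * g * x[i]
            for j in range(len(v)):
                v[j] -= lr * err * h[j]
            b -= lr * err
    return W, v, b

def accuracy(params, data):
    """Fraction of samples whose thresholded prediction matches the label."""
    return sum((forward(params, x)[1] > 0.5) == (y == 1) for x, y in data) / len(data)

def make_data(rng, n, rule):
    """Hypothetical toy data: random 2-D points labelled by `rule`."""
    pts = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    return [(x, 1 if rule(x) else 0) for x in pts]

rng = random.Random(0)
# Large source task and a small, related target task (both are assumptions).
source = make_data(rng, 200, lambda x: x[0] + x[1] > 0)
target = make_data(rng, 12, lambda x: x[0] + 0.5 * x[1] > 0.2)

init = ([[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(4)],
        [rng.uniform(-0.5, 0.5) for _ in range(4)], 0.0)
pretrained = train(init, source, epochs=30, lr=0.1)
# Fine-tune only the output head; the hidden layer learned on the source is kept.
finetuned = train(pretrained, target, epochs=50, lr=0.05, freeze_hidden=True)
print("source accuracy:", accuracy(pretrained, source))
print("target accuracy:", accuracy(finetuned, target))
```

Freezing the hidden layer is only one option. With slightly more target data, fine-tuning all layers at a small learning rate is a common alternative.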

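
The adaptive design loop in study 2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The candidate pool is a hypothetical 2-D grid standing in for generated molecules, `oracle` is a toy function standing in for QM verification, the surrogate is a bootstrap ensemble of k-nearest-neighbour regressors, and the selector ranks candidates by predicted mean plus `kappa` times the ensemble spread (an upper-confidence-bound rule chosen here for illustration).

```python
import random
import statistics

def oracle(x):
    """Hypothetical stand-in for the QM verification step (toy function)."""
    return -(x[0] - 0.7) ** 2 - (x[1] - 0.2) ** 2

def knn_predict(train, x, k=3):
    """Average label of the k nearest labelled samples (squared Euclidean distance)."""
    nearest = sorted(train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def ensemble_predict(train, x, rng, n_models=10):
    """Mean and spread of predictions from bootstrap-resampled kNN models."""
    preds = []
    for _ in range(n_models):
        boot = [rng.choice(train) for _ in train]
        preds.append(knn_predict(boot, x))
    return statistics.mean(preds), statistics.pstdev(preds)

def adaptive_design(pool, n_initial=8, n_iterations=3, batch=5, kappa=1.0, seed=0):
    """Iteratively label the candidates with the highest mean + kappa * spread."""
    rng = random.Random(seed)
    pool = list(pool)  # work on a copy so the caller's pool is untouched
    rng.shuffle(pool)
    labelled = [(x, oracle(x)) for x in pool[:n_initial]]
    unlabelled = pool[n_initial:]
    history = [max(y for _, y in labelled)]  # best label known after each round
    for _ in range(n_iterations):
        scored = []
        for x in unlabelled:
            mean, std = ensemble_predict(labelled, x, rng)
            scored.append((mean + kappa * std, x))
        scored.sort(key=lambda t: t[0], reverse=True)
        chosen = [x for _, x in scored[:batch]]
        # "Verify" the selected candidates and add them to the labelled set.
        labelled.extend((x, oracle(x)) for x in chosen)
        unlabelled = [x for x in unlabelled if x not in chosen]
        history.append(max(y for _, y in labelled))
    return labelled, history

# Hypothetical search space: a 21 x 21 grid of 2-D points.
pool = [(i / 20, j / 20) for i in range(21) for j in range(21)]
labelled, history = adaptive_design(pool)
print(len(labelled), history)
```

A larger `kappa` favours exploration of uncertain regions, while `kappa=0` reduces the selector to pure exploitation of the surrogate's predictions.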
 

Keywords: Deep learning; Limited dataset; Active learning; Transfer learning

Brief Introduction of Speaker
Xuemei Pu

Dr. Xuemei Pu is a professor in the College of Chemistry, Sichuan University, and a member of the Computational Chemistry Professional Committee of the Chinese Chemical Society. In recent years, she has carried out a series of studies on functional materials and biomedicine, supported by multiple grants from the National Natural Science Foundation of China. She has co-authored more than 100 SCI papers and applied for 10 patents (4 granted) and two computer software copyrights.