EXTENDED ABSTRACT: The release of ChatGPT has triggered a global wave of attention to artificial intelligence (AI), and AI for science is becoming a hot topic in the scientific community. When we consider how to harness AI to accelerate scientific research in our field, the question of whether scientific data are continuously supplied, sufficiently large in scale, and highly available becomes unavoidable. In materials science, the existing ecology of scattered, small-scale, and non-standardized scientific data management cannot reliably support the widespread application of AI in this domain. Creating an AI-ready materials science data ecosystem will support the broad and deep participation of AI in materials science research. Here, we analyze the governance needs of materials science data from the perspective of reliable AI model construction, at the data sample level and the dataset level respectively. We also discuss the key barriers that constrain the realization of these needs in the current data ecology, together with possible remedies, and call on scientific data stakeholders to collaborate in building an AI-ready scientific data ecosystem. Although these discussions are based on materials science, they are equally applicable to scientific data governance in most other areas of the natural sciences. Figure 1 shows an AI-ready scientific data ecosystem from a holistic perspective [1]. To address the opportunities and challenges of AI for science, scientific data stakeholders need to adjust how scientific data are generated, collected, stored, and shared. These actions include emphasizing and supporting data management plans, establishing and using domain-specific data standards, improving research facilities for data generation and collection, promoting data sharing in the form of publications, and establishing comprehensive data-sharing platforms for segmented research communities.
These actions address AI's requirements for the accuracy, completeness, consistency, discoverability, and accessibility of scientific data samples and, more importantly, provide a sustainable channel for obtaining high-quality domain-specific scientific data. When selecting datasets, researchers should also follow general dataset-level requirements, such as sufficient data volume, comprehensive features, diverse and uniformly distributed samples, and professional labelling, to ensure the reliability of AI models. In this way, a new data ecosystem community will be established, enabling AI to participate broadly and deeply in scientific research in this field.
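As a rough illustration of the dataset-level requirements above, the following Python sketch checks feature completeness and label balance for a small set of records. It is a minimal sketch under stated assumptions, not a method from reference [1]: the function name `dataset_report`, the field names, and all sample values are hypothetical and serve only to make the checks concrete.

```python
from collections import Counter


def dataset_report(records, required_fields, label_field):
    """Summarize completeness and label balance for a list of dict records.

    Hypothetical helper for illustration only.
    """
    n = len(records)
    if n == 0:
        raise ValueError("dataset is empty")

    # Completeness: fraction of records with a non-missing value per field.
    completeness = {
        field: sum(1 for r in records if r.get(field) is not None) / n
        for field in required_fields
    }

    # Label balance: rarest label count divided by most common label count
    # (1.0 means the labels are uniformly represented).
    counts = Counter(
        r[label_field] for r in records if r.get(label_field) is not None
    )
    balance = min(counts.values()) / max(counts.values()) if counts else 0.0

    return {
        "n_samples": n,
        "completeness": completeness,
        "label_counts": dict(counts),
        "balance": balance,
    }


# Made-up records; identifiers, values, and labels are placeholders.
records = [
    {"sample_id": "S1", "hardness_hv": 320.0, "label": "class_a"},
    {"sample_id": "S2", "hardness_hv": None, "label": "class_a"},
    {"sample_id": "S3", "hardness_hv": 280.0, "label": "class_b"},
    {"sample_id": "S4", "hardness_hv": 250.0, "label": "class_c"},
]

report = dataset_report(records, ["sample_id", "hardness_hv", "label"], "label")
print(report)
```

In practice, acceptable thresholds for completeness and balance would depend on the research community and the model being built; the sketch only reports the quantities and does not judge them.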
Keywords: Artificial intelligence, Data ecology, Data standard, Materials genome engineering
REFERENCES:
[1] Yongchao Lu, Hong Wang, Lanting Zhang, et al. Scientific Data, 2024, in press. https://doi.org/10.1038/s41597-024-03821-z
Lanting Zhang is currently a Professor of Materials Science & Engineering at Shanghai Jiao Tong University (SJTU) and Deputy Director of the Materials Genome Initiative Center (MaGIC) of SJTU. He is also the group leader of the High-Performance Metallic Materials Laboratory in the School of Materials Science and Engineering (SMSE). He received his BSc, MSc, and PhD in materials science from SJTU in 1991, 1994, and 1997, respectively. His current research interests include high-throughput characterization of materials, the development of rare-earth permanent magnetic materials for traction motors, and heat-resistant steels for turbines and ultra-supercritical power plants. He is now leading a National Key Research and Development Program project on high-throughput characterization of combinatorial materials chips. He is an associate editor of the Journal of Alloys and Metallurgical Systems and has published over 100 peer-reviewed journal articles in journals including Acta Materialia, Scripta Materialia, PRB, JAP, Intermetallics, and JALCOM.