Materials Big Data Governance and Applications

Yanjing Su1,2, Xue Jiang1, Dezhen Xue3, Lei Zhang1
1 University of Science and Technology Beijing, Beijing, 100083, China;
2 Suzhou Laboratory, Jiangsu, 215123, China;
3 Xi'an Jiaotong University, Shaanxi, 710049, China;
ABSTRACT: Materials science is undergoing a paradigm shift from experience-driven to data-intelligence-driven research. High-throughput experiments, computational simulations, scientific literature, and intelligent manufacturing have generated massive amounts of materials data. However, these data are often multi-source and heterogeneous, semantically inconsistent, and of uneven quality, making it difficult to integrate and utilize them efficiently. As a result, their full potential in new materials design and discovery remains underexploited. Therefore, it is imperative to establish a systematic materials big data governance framework and explore data-driven application pathways for materials research and development, thereby propelling materials science into a new era of intelligence and autonomous innovation.
To address the above challenges, this report focuses on four key aspects: materials big data governance and platform development, automatic data extraction and annotation, machine learning–driven materials design, and materials foundation models and scientific intelligence. In terms of data governance, our team has independently developed a large-scale materials database technology that enables efficient storage, management, and utilization of materials big data, laying a solid foundation for subsequent data-driven applications. In data extraction and annotation, we have developed specialized extraction techniques for materials science literature, constructed large-scale materials literature datasets, and successfully applied them across multiple domains. To further enhance data quality and knowledge representation, we proposed a multimodal scientific data extraction and annotation framework designed to achieve human–AI collaborative annotation and knowledge-oriented transformation of textual and visual scientific data.
In the area of machine learning, our work centers on efficiently exploring complex design spaces.
We have established a systematic methodology that spans materials cognition, decision optimization, and generative design. By integrating techniques such as active learning, reinforcement learning, multi-objective optimization, multitask learning, symbolic regression, and descriptor generation and screening, we have achieved efficient materials design and performance optimization from multiple perspectives. In the domain of materials foundation models and scientific intelligence, our team has developed a steel materials design foundation model to tackle key challenges in intelligent design across different steel grades and processing routes, providing critical support for paradigm transformation in materials R&D. This foundation model is built upon SteelBERT, our independently developed pre-trained language model for the steel domain, which possesses accurate knowledge encoding capabilities. On this basis, we established an end-to-end predictive framework that takes chemical composition and processing route as inputs and delivers high-precision quantitative predictions of mechanical properties. By fine-tuning the model with small-scale experimental data, we successfully developed high-strength, high-toughness, and sulfide-resistant oil casing products for deep oil and gas wells. Furthermore, for typical property curves such as corrosion polarization curves and fatigue curves, we combined the ability of diffusion models to generate from conditional probability distributions with the knowledge representation capabilities of large materials language models, achieving accurate prediction of materials’ corrosion behavior. Building upon this, we proposed an innovative dual-agent process sequence search strategy and a reverse process generation framework, enabling intelligent inverse design of processing routes through large-model-driven knowledge reasoning.
By deeply integrating steel materials knowledge bases, domain-specific model libraries, and general foundation model architectures, this steel materials design foundation model has evolved into an intelligent design agent capable of task understanding and autonomous decision-making. It can accurately identify user intent, automatically plan task workflows, and invoke appropriate models to complete design objectives, thereby providing a unified support platform for data-driven knowledge mining, structure–property relationship prediction, dynamic process modeling, and novel materials design in the steel domain.
KEYWORDS: Data Governance; Large-scale Database; Machine Learning; Materials Foundation Model; Materials Intelligent Agent
Brief Introduction of Speaker
Yanjing Su

Yanjing Su is a Ph.D. supervisor, a Professor at the University of Science and Technology Beijing (USTB), and a Principal Researcher at Suzhou Laboratory. He concurrently serves as a member of the Expert Committee for the Ministry of Industry and Information Technology’s 13th Five-Year Plan Key Programs “Key Technologies and Supporting Platforms for Materials Genome Engineering” and “Fundamental Manufacturing Technologies and Critical Components”; an expert for the Ministry of Science and Technology’s 14th Five-Year Plan Key R&D Program “New Rare Earth Materials”; and a member of the Steering Expert Committee for the National Natural Science Foundation of China’s Major Research Program “Interpretable and Generalizable Next-Generation Artificial Intelligence Technologies”. His primary research focuses on materials big data and artificial intelligence. He has led several national key R&D projects under the 13th and 14th Five-Year Plans, as well as sub-projects of major national initiatives, and has participated in the construction of the National Materials Big Data Center. He established the first integrated materials database demonstration system that unifies data acquisition, database management, and machine learning, and has developed both a stainless steel intelligent R&D platform and a pre-trained large foundation model for steel materials.