Data Factory - Transformative materials data generation infrastructure

Hong Wang
MGI Center, School of Materials Science and Engineering and Zhangjiang Institute for
Advanced Study, Shanghai Jiao Tong University, Shanghai, 200240, China 

ABSTRACT: The core of materials genetic engineering lies in data + artificial intelligence (AI). For a long time, data have been generated and collected as the result of experiments or calculations conducted by individual researchers pursuing specific goals. As a result, the data are multi-source and heterogeneous, small in scale, discretely distributed, single-parameter, and non-standardized, which makes them unsuitable for AI. At present, the lack of an adequate data supply is a major bottleneck in materials genetic engineering and limits the widespread adoption of the data-driven mode. It is therefore necessary to establish dedicated data generation facilities in a transformative way, so that data production becomes an organized activity that provides AI-ready materials data with homology, scale, decentralization, and comprehensiveness. The data factory presented in this paper is such a transformative data generation infrastructure: based on high-throughput technology, it is capable of producing data in a standardized manner, like an industrial production line. A computational "data factory" can be a platform with a variety of high-throughput computational software and hardware capable of generating large amounts of comprehensive materials data through batch calculations. An experimental "data factory" can be a systematic high-throughput platform for comprehensive preparation and characterization based on large-scale scientific facilities such as synchrotron radiation sources and neutron sources, or an experimental facility integrating in-situ preparation and multi-parameter characterization methods.
The "data factory" will bring a series of revolutionary changes to the data generation process. First, comprehensive materials data will be deliberately generated in large quantities for broader, long-term goals, rather than being confined to decentralized, purpose-specific experiments or calculations. Second, the "data factory" transforms data generation from an individual activity into an organized social activity. Third, such an organized effort will transform the social nature of data from private property into a public resource. As a result, the quality, consistency, and comprehensiveness of data will improve, and data sharing will become simpler. The outcome is a wealth of AI-ready materials data that enables AI technologies to reach their full potential.
Keywords: Data Factory, Data-Driven, AI-ready Data, High-Throughput Experiments, High-Throughput Computing. 

Brief Introduction of Speaker
Dr. Hong Wang

Dr. Hong Wang is a "Zhiyuan" Chair Professor and Director of the Materials Genome Initiative Center at Shanghai Jiao Tong University. He currently serves as the Chairman of the Materials Genome Engineering Field Committee of the Chinese Standards of Testing and Materials (CSTM). He received his Ph.D. in materials science and engineering from the University of Illinois at Urbana-Champaign in 1994 and worked for multinational companies such as SONY, Panasonic, and Guardian Industries Corp. in the United States for 16 years, studying thin-film materials and their applications in semiconductors, flat-panel displays, and energy-efficient building glass. Later, he joined the China Building Materials Research Institute as the deputy director of the State Key Laboratory of Green Building Materials. He has been on the faculty of Shanghai Jiao Tong University since 2016. His current research focuses on the theory of materials genome engineering, high-throughput material preparation and characterization techniques, and the application of AI in materials research.