The 3rd Forum of Materials Genome Engineering

6-15. The Construction of the High-Throughput Materials Simulation Environment

Xiangfei Meng*, Geng Li, Xiaoqian Zhu, Xiaodong Jian

National Supercomputer Center in Tianjin, Tianjin 300457, China

Abstract: A high-throughput computing (HTC) infrastructure is key to accelerating materials discovery in materials genome engineering. Computer scientists and materials researchers have made several efforts to build HTC environments. For instance, in America, the Materials Project runs at the National Energy Research Scientific Computing Center (NERSC), and a collaboration between Cray and Duke University developed Automatic-FLOW for Materials Discovery (AFLOW). In Europe, the Novel Materials Discovery (NOMAD) Repository, the largest repository for computational materials science worldwide, is built on the Barcelona Supercomputing Center. The National Supercomputer Center in Tianjin, home of the Tianhe series supercomputers, has constructed an HTC infrastructure for materials that features automatic workflows, high concurrency, and multi-scale calculation, named the computational platform of China Materials Genome Engineering (CNMGE).

To meet the requirements of materials HTC, we designed an environment that integrates supercomputing, cloud computing, and big-data management, and developed a materials HTC system and a data management system. The core of the materials HTC system is a set of automatic calculation workflows organized by the functional properties of materials. According to the material scale, we divide material calculations into four parts: microscopic, mesoscopic, macroscopic, and cross-scale calculation. Each part is further divided into several computing function workflows. Each computing function workflow consists of the following steps: generating input files, configuring input files, running HTC to generate raw data, running analysis programs to produce organized data, filtering the data, storing the material results in a database automatically, and letting users query the data through an API. To date, several basic materials calculation workflows have been implemented on our platform.
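The workflow steps listed above can be sketched as a simple pipeline in Python. This is only an illustration under stated assumptions, not the CNMGE implementation: the names WorkflowContext, run_workflow, and every step function are hypothetical, the "calculation" is a placeholder formula rather than a real simulation code, and the database is an in-memory dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# The four scale categories named in the abstract.
SCALES = ("microscopic", "mesoscopic", "macroscopic", "cross-scale")


@dataclass
class WorkflowContext:
    """State handed from one step to the next (hypothetical structure)."""
    scale: str
    structures: List[str]
    inputs: List[dict] = field(default_factory=list)
    raw: List[dict] = field(default_factory=list)
    organized: List[dict] = field(default_factory=list)
    database: Dict[str, dict] = field(default_factory=dict)


def generate_inputs(ctx: WorkflowContext) -> None:
    # Step 1: one input record per structure.
    ctx.inputs = [{"structure": s} for s in ctx.structures]


def configure_inputs(ctx: WorkflowContext) -> None:
    # Step 2: attach calculation settings (placeholder settings only).
    for job in ctx.inputs:
        job["settings"] = {"scale": ctx.scale}


def run_htc(ctx: WorkflowContext) -> None:
    # Step 3: stand-in for the HTC run. The "energy" is a made-up
    # placeholder (-1.0 times the name length), not a physical result.
    ctx.raw = [
        {"structure": job["structure"], "energy": -1.0 * len(job["structure"])}
        for job in ctx.inputs
    ]


def analyze(ctx: WorkflowContext) -> None:
    # Step 4: organize raw data, here by sorting on the placeholder energy.
    ctx.organized = sorted(ctx.raw, key=lambda r: r["energy"])


def filter_results(ctx: WorkflowContext) -> None:
    # Step 5: keep records below an arbitrary illustrative threshold.
    ctx.organized = [r for r in ctx.organized if r["energy"] < -2.0]


def store(ctx: WorkflowContext) -> None:
    # Step 6: store results, keyed by structure name.
    for record in ctx.organized:
        ctx.database[record["structure"]] = record


def query(ctx: WorkflowContext, name: str) -> Optional[dict]:
    # Step 7: user-facing lookup, standing in for the query API.
    return ctx.database.get(name)


STEPS = [generate_inputs, configure_inputs, run_htc, analyze, filter_results, store]


def run_workflow(scale: str, structures: List[str]) -> WorkflowContext:
    if scale not in SCALES:
        raise ValueError(f"unknown scale: {scale}")
    ctx = WorkflowContext(scale=scale, structures=list(structures))
    for step in STEPS:
        step(ctx)
    return ctx
```

Keeping each step as a separate function operating on a shared context makes it straightforward to swap in a different analysis or filter for another material property without touching the rest of the chain.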

The materials data management system is built on two database engines, MySQL and MongoDB. Its core functions are computation control, results analysis, data dissemination, and data validation. In addition, it interacts with the web service to upload and download files, display graphical results, and query data. The system contains three databases: a software information library, an interatomic potential library, and a material property database.
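As a rough illustration of how the three libraries and the validation function might fit together, the sketch below uses Python's built-in sqlite3 module as a stand-in for the relational engine. The table layout, column names, and the functions create_libraries, validate, store_property, and query_property are all assumptions made for this example; the actual CNMGE schema is not described in the abstract.

```python
import sqlite3


def create_libraries(conn: sqlite3.Connection) -> None:
    """Create three illustrative tables, one per library named in the text."""
    conn.executescript(
        """
        CREATE TABLE software (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            version TEXT
        );
        CREATE TABLE potential (
            id INTEGER PRIMARY KEY,
            elements TEXT NOT NULL,
            kind TEXT
        );
        CREATE TABLE property (
            id INTEGER PRIMARY KEY,
            material TEXT NOT NULL,
            name TEXT NOT NULL,
            value REAL NOT NULL,
            software_id INTEGER REFERENCES software(id)
        );
        """
    )


def validate(record: dict) -> bool:
    # Minimal validation: a non-empty material and property name,
    # and a numeric value.
    return (
        bool(record.get("material"))
        and bool(record.get("name"))
        and isinstance(record.get("value"), (int, float))
    )


def store_property(conn: sqlite3.Connection, record: dict, software_id: int) -> bool:
    """Insert a validated record; return False if validation rejects it."""
    if not validate(record):
        return False
    conn.execute(
        "INSERT INTO property (material, name, value, software_id) "
        "VALUES (?, ?, ?, ?)",
        (record["material"], record["name"], float(record["value"]), software_id),
    )
    return True


def query_property(conn: sqlite3.Connection, material: str) -> list:
    """Return (name, value) pairs stored for one material."""
    cursor = conn.execute(
        "SELECT name, value FROM property WHERE material = ? ORDER BY name",
        (material,),
    )
    return cursor.fetchall()
```

Linking each property record to the software entry that produced it is one possible way to keep results traceable to their source.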

We have completed the overall framework and a prototype platform with several practical functions, which is open for users to test and operate. Next, we expect to make progress in developing workflows for more material properties, in multi-scale calculation, and in a material data sharing service supported by open access and incentive mechanisms.

Keywords: high-throughput calculation; Tianhe series supercomputers; automatic workflow; materials data management system

Construction of a High-Throughput Materials Computing Platform

Xiangfei Meng*, Geng Li, Xiaoqian Zhu, Xiaodong Jian

National Supercomputer Center in Tianjin

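Abstract: A high-throughput materials computing platform combines efficient materials research and development with high-performance computing. It is an important foundational component of materials genome engineering and a shared, frontier vehicle for innovation in international new-materials research, and international teams have already begun developing and exploring such platforms. Their salient feature is that they are built on supercomputing systems in response to the current needs of efficient materials R&D; examples include the Materials Project at NERSC, Automatic-Flow built by Cray, and NOMAD, built on the Barcelona Supercomputing Centre. Relying on the national supercomputing center, on China's independently developed petascale (P-class) supercomputing platform, and on the exascale (E-class) platform now under development, we are building the "China Materials High-Throughput Computing Platform (CNMGE)", which provides the automatic workflows, high concurrency, and multi-scale capability that high-throughput materials computing requires.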

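To meet the requirements of a high-throughput materials computing platform systematically, we designed and implemented an environment that integrates supercomputing, cloud computing, and big data, and on this basis developed a materials computing system with high-throughput automatic workflows and a data application management system. The core of building the platform is constructing high-throughput automatic computing workflows based on the functional properties of materials. According to computational scale, we divide materials computing functions into four parts: microscopic, mesoscopic, macroscopic, and cross-scale calculation. Each part is further divided into workflows that implement several computing functions. Each functional workflow comprises: producing input structure files from external databases or structure-generation software, configuring input files, generating data through HPC calculation, storing the raw data, producing organized data with data analysis programs, screening the data using materials knowledge, and automatically building the materials database through an API so that users can query and call the data. At present, high-throughput computing of the basic functions of commonly used materials calculation software has been implemented.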

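Beyond carrying out large-scale, efficient concurrent materials calculations, a high-throughput computing platform must also efficiently acquire and manage the data associated with calculations and the result data. By integrating non-relational and relational databases, we developed a materials data management and application system, which provides control of parallel computation, analysis of results, data transfer, and data validation. Front-end users can interact with the data management system for access, file download, result display, and data query. The databases of the data management system currently include: (1) a material property database; (2) a potential function database; (3) a software database.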

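To date, the main framework and main functions of the high-throughput computing platform have largely been implemented, and the platform can be demonstrated and opened to some application users. The next steps focus on: (1) integrating more materials calculation workflows; (2) integrating a rich set of cross-scale automatic-workflow simulations; (3) developing a results management system and establishing an open fund to encourage teams working on materials simulation, materials databases, and related areas to carry out new-materials R&D and innovation and to upload and share data.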

Keywords: high-throughput computing; Tianhe series supercomputers; automatic workflow; materials data management system

Brief Introduction of Speaker

Xiangfei Meng

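Doctor of Science. Dr. Meng is a professor-level senior engineer and Director of the Application R&D Department at the National Supercomputer Center in Tianjin; chief engineer of the National Development and Reform Commission's National and Local Joint Laboratory for "Big Data Processing Technology and Application"; a standing member of the China Computer Federation (CCF) Technical Committee on High Performance Computing; vice chair of the Smart Healthcare Branch of the Chinese Association for Artificial Intelligence; and vice chair of the Tumor Artificial Intelligence Committee of the China Anti-Cancer Association. Dr. Meng is responsible for application technology R&D and cooperation for the "Tianhe" series supercomputers, with research interests that include high-performance computing technology and the construction and application of platforms that integrate supercomputing with big data and artificial intelligence. As first or principal contributor, Dr. Meng has received several first- and second-class provincial and ministerial science and technology progress awards, and has been recognized as a State Council Special Allowance expert, a recipient of the China Youth May Fourth Medal, and a leading talent of an innovation team under the Tianjin 131 Talent Program. Dr. Meng has led projects under the national 13th Five-Year Plan key R&D program, the National Natural Science Foundation of China, and the national high-tech service industry program.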

Tel: 022-65375551; Email: mengxf@nscc-tj.cn