The 3rd Forum of Materials Genome Engineering

4-24. Data augmentation in microscopic images for material data mining

Haiyou Huang*, Boyuan Ma, Xiaoyan Wei, Chuni Liu, Xiaojuan Ban*, Hao Wang, Weihua Xue, Yanjing Su

University of Science and Technology Beijing, Beijing, 100083, P.R. China

Abstract: There has been considerable interest over the last few years in accelerating the process of materials design and discovery. Over the past decade, accelerated discovery based on databases, computation, mathematics and information science has produced a growing number of successes in materials science. In general, the larger the database, the more accurate the machine learning model. However, in much materials research, especially on new materials, we still face a shortage of high-quality data, because experimental (real) data are slow or difficult to collect and computational data have low accuracy.

Material microstructure data are an important type of material data for establishing the intrinsic relationships among composition, structure, processing and properties, which are fundamental to material design. Quantitative analysis of microstructures is therefore essential for controlling the properties and performance of metals and alloys. One of the most important steps in this analysis is microscopic image processing with computational algorithms and tools. For example, image segmentation, which assigns a label to every pixel of the original image, is commonly used to extract significant information from microscopic images in material structure characterization. At present, machine learning models are commonly used to segment microscopic images. However, creating large datasets with pixel-wise semantic labels is known to be very challenging because of the human effort and expertise required to label microscopic images. As a result, large datasets with pixel-wise labels for image segmentation are rarely available.
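To make the notion of pixel-wise labels concrete, the following minimal sketch compares a predicted label map with a hand-labeled one. The two metrics (pixel accuracy and intersection over union) are common choices for segmentation and are an assumption here; the abstract does not say which metric the authors used. The tiny arrays and the convention that label 1 marks a grain boundary are hypothetical.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label equals the ground-truth label."""
    total = sum(len(row) for row in truth)
    correct = sum(p == t
                  for pred_row, truth_row in zip(pred, truth)
                  for p, t in zip(pred_row, truth_row))
    return correct / total


def iou(pred, truth, label):
    """Intersection over union for one label (e.g. 1 = grain boundary)."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += (p == label and t == label)
            union += (p == label or t == label)
    # If the label is absent from both maps, treat the match as perfect.
    return inter / union if union else 1.0


# Hypothetical 3x4 label maps: 0 = grain interior, 1 = grain boundary.
truth = [[0, 0, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 1, 0]]
pred = [[0, 0, 1, 0],
        [0, 0, 1, 0],
        [0, 1, 1, 0]]

accuracy = pixel_accuracy(pred, truth)
boundary_iou = iou(pred, truth, label=1)
```

Every pixel of `truth` has to be assigned by an expert, which is why labeled datasets of this kind are expensive to build.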

Materials simulation is another way to obtain microstructures. However, because of theoretical approximations and simplifications in the modeling process, simulated images are too idealized to look realistic, which makes it difficult to apply simulated image data directly in a real microscopic image processing system.
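The keywords name Monte Carlo Potts models as the simulation approach, but the abstract gives no implementation details. The sketch below is therefore only an illustration under stated assumptions: a 2D periodic lattice, zero-temperature Metropolis updates, and hypothetical parameter values (`size`, `n_states`, `sweeps`). It shows why simulated data come with labels for free: the grain-boundary mask is computed directly from the simulated grain map.

```python
import random

OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # 4-connected neighbourhood


def potts_grain_growth(size=32, n_states=20, sweeps=10, seed=0):
    """Return a size x size grid of grain IDs after simple Potts grain growth."""
    rng = random.Random(seed)
    grid = [[rng.randrange(n_states) for _ in range(size)] for _ in range(size)]
    for _ in range(sweeps * size * size):
        i, j = rng.randrange(size), rng.randrange(size)
        nbrs = [grid[(i + di) % size][(j + dj) % size] for di, dj in OFFSETS]
        old = grid[i][j]
        new = rng.choice(nbrs)  # propose the orientation of a random neighbour
        # Local energy = number of unlike neighbours.
        e_old = sum(n != old for n in nbrs)
        e_new = sum(n != new for n in nbrs)
        if e_new <= e_old:  # zero-temperature rule: never raise the energy
            grid[i][j] = new
    return grid


def interface_energy(grid):
    """Count unlike right/down neighbour pairs (periodic boundaries)."""
    size = len(grid)
    return sum((grid[i][j] != grid[(i + 1) % size][j])
               + (grid[i][j] != grid[i][(j + 1) % size])
               for i in range(size) for j in range(size))


def boundary_mask(grid):
    """Pixel-wise label: 1 where a pixel differs from its right or down neighbour."""
    size = len(grid)
    return [[int(grid[i][j] != grid[(i + 1) % size][j]
                 or grid[i][j] != grid[i][(j + 1) % size])
             for j in range(size)] for i in range(size)]


grains = potts_grain_growth(size=16, n_states=10, sweeps=5, seed=1)
labels = boundary_mask(grains)  # label map obtained without manual annotation
```

The resulting images have exact labels but clean, noise-free boundaries, which is the realism gap described above.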

To reduce the time spent acquiring and labeling image data, we present a data augmentation method that uses image style transfer to fuse the pixel-level labels of simulated image data with the image style of real image data. The flowchart is shown in Figure 1. Experimental results show that a model trained on synthetic image data plus 35% of the real image data outperforms a model trained on all of the real image data, which reduces the burden of acquiring and labeling images from experiments. We also believe that this strategy can readily be applied to other tasks, even outside materials data mining.
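The mixed training set described above (all synthetic images plus a fraction of the real ones) could be assembled roughly as follows. This is a sketch only: the function name `build_training_set`, the file names, the dataset sizes and the random subsampling are all assumptions, since the abstract reports only the 35% figure and not how the real subset was chosen.

```python
import random


def build_training_set(real, synthetic, real_fraction=0.35, seed=0):
    """Combine all synthetic samples with a random fraction of the real ones."""
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_real = round(real_fraction * len(real))
    real_subset = rng.sample(real, n_real)  # sample without replacement
    mixed = list(synthetic) + real_subset
    rng.shuffle(mixed)  # avoid feeding all synthetic samples first
    return mixed


# Hypothetical (image, label) file pairs.
real = [("real_%03d.png" % i, "real_%03d_label.png" % i) for i in range(100)]
synthetic = [("syn_%03d.png" % i, "syn_%03d_label.png" % i) for i in range(200)]

train = build_training_set(real, synthetic, real_fraction=0.35)
n_real_used = sum(name.startswith("real_") for name, _ in train)
```

With these hypothetical sizes, only 35 of the 100 real images need manual labels; the synthetic pairs inherit their labels from the simulation.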

Keywords: Data Mining; Data Augmentation; Image Style Transfer; Microscopic Images; Monte Carlo Potts Models

Figure 1. Flowchart of the data augmentation strategy for material images

Figure 1. Schematic of the data augmentation strategy for material images based on style transfer

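Data augmentation of material images based on style transfer

Haiyou Huang*, Boyuan Ma, Xiaoyan Wei, Chuni Liu, Xiaojuan Ban*, Hao Wang, Weihua Xue, Yanjing Su

University of Science and Technology Beijing, Beijing, 100083, China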

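Abstract: Over the past few years, there has been great interest in accelerating the process of materials design and discovery [1]. Training machine learning models on large datasets to rapidly predict material properties and to optimize compositions and process parameters has become an important route to accelerating materials research and development, and has produced a growing number of successes. In general, the larger the database, the more accurate the machine learning model. However, in much materials research, especially on new materials, high-quality data remain scarce because experimental (real) data are slow or difficult to collect and computational data have low accuracy.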

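Material microstructure data are an important class of material data and one of the data foundations for establishing the intrinsic composition-process-microstructure-property relationships of materials. Quantitative analysis of material microstructure is the key technique for obtaining high-quality microstructure data. One of its most important steps is microscopic image processing, which extracts the significant information in a microstructure. In particular, image segmentation outputs pixel-level image labels and is a highly accurate method for the quantitative analysis of microstructure images. At present, machine learning models are usually used for image segmentation. However, annotating experimentally acquired microstructure images with pixel-level labels takes a great deal of labor and time, so it is usually not possible to build the large-scale datasets needed to train such models.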

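Microstructure simulation is another route to obtaining material microstructures. Three-dimensional microstructure models built with computer simulation readily yield simulated image data with pixel-level labels. However, because of limits on computing power and modeling techniques, the content of these simulated images is often too simple and differs considerably from complex real image data.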

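To address the difficulty of acquiring real images and the high cost of annotating pixel-level labels, this work proposes a data augmentation method that uses image style transfer to fuse the label information in simulated data with the image style information in experimental data, rapidly generating synthetic images with pixel-level labels. The technical route is shown in Figure 1. The results show that supplementing real images with the synthetic images significantly improves image processing performance. For example, a model trained on the synthetic images plus 35% of the real images achieves higher recognition accuracy than a model trained on all of the real images. The method can significantly reduce the burden of acquiring and annotating real images experimentally, and shows the potential of data augmentation in materials data mining tasks.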

Keywords: Data Mining; Data Augmentation; Style Transfer; Material Images; Monte Carlo Potts Simulation

Brief Introduction of Speaker

Haiyou Huang

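Doctor of Engineering, associate research fellow, master's supervisor. He received his bachelor's degree in materials physics from the School of Materials Science and Engineering, University of Science and Technology Beijing, in March 2001, and his doctorate in materials physics and chemistry from the same school in March 2007. From March 2007 to April 2009 he was a postdoctoral researcher in the Department of Mechanical Engineering, Hong Kong University of Science and Technology. Since April 2009 he has taught at the Institute for Advanced Materials and Technology, University of Science and Technology Beijing. He has published 60 papers in journals including Applied Physics Letters, APL Materials and Scripta Materialia. He holds 5 granted invention patents, has contributed to 3 books, and received the Second Prize of the Natural Science Award of the Ministry of Education in 2017.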

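His main research interests include materials genome engineering databases and big-data technology, data-driven development of new materials, and shape memory alloys.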

Email: huanghy@mater.ustb.edu.cn