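Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology – Understanding Language, Understanding Society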

One Long Paper from HIT-SCIR Accepted at KDD 2019

May 11, 2019

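ACM SIGKDD (the International Conference on Knowledge Discovery and Data Mining, KDD for short) is a top international conference in data mining. It will be held from August 4 to 8, 2019, in Anchorage, Alaska, USA. The conference has run for more than twenty years since 1995 and is highly selective, with an acceptance rate of no more than 20% each year. KDD 2019 has two tracks: the Research track and the Applied Data Science track. The Research track received about 1,200 submissions, of which about 110 were accepted as oral papers and 60 as poster papers, for an acceptance rate of only 14%, nearly 4 percentage points lower than last year.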

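This year KDD also tightened its submission requirements. It adopted double-blind review for the first time, so every submission must strictly follow the submission guidelines and must not contain author names or affiliations.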

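The Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology (HIT-SCIR) has one long paper accepted at KDD 2019. Brief information about the paper and its abstract follow: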

• The Role of “Condition”: A Novel Scientific Knowledge Graph Representation and Construction Model

Authors: Tianwen Jiang, Tong Zhao, Bing Qin, Ting Liu, Nitesh V. Chawla, Meng Jiang

Affiliations: Harbin Institute of Technology, University of Notre Dame

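Summary: Conditions play a vital role in the observations, hypotheses, and statements of scientific research. However, existing knowledge graphs built from scientific literature (SciKGs) represent factual knowledge as a flat, concept-based relational network, just as general-domain knowledge graphs do. They do not consider the conditions under which a fact holds, and so lose an important basis for inference and exploration. In this work, we propose a new SciKG representation with a three-layer structure. The first layer consists of concept nodes, attribute nodes, and the attaching links from attributes to concepts. The second layer represents fact tuples and condition tuples. Each tuple is a node named by its relation and linked to its subject and object, which are concept or attribute nodes in the first layer. The third layer contains statement-sentence nodes that can be traced back to the original paper and its authors. Each statement node links to fact tuples or condition tuples in the second layer. Inspired by recent work that treats open information extraction (OpenIE) as a sequence labeling task, we design a semi-supervised multi-input multi-output (MIMO) sequence labeling model. The model learns the complex dependencies between sequence tags from multiple signals and generates output sequences, from which fact and condition tuples are extracted. It includes a self-training module with multiple strategies, which exploits massive scientific data to improve performance when annotated data is limited. Experiments on a dataset of 140 million sentences show that our model outperforms existing methods and that the SciKG we construct gives a good understanding of scientific statements.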

Abstract: Conditions play an essential role in scientific observations, hypotheses, and statements. Unfortunately, existing scientific knowledge graphs (SciKGs) represent factual knowledge as a flat relational network of concepts, the same as KGs in the general domain, without considering the conditions under which the facts are valid, which loses important contexts for inference and exploration. In this work, we propose a novel representation of SciKG, which has three layers. The first layer has concept nodes, attribute nodes, as well as the attaching links from attribute to concept. The second layer represents both fact tuples and condition tuples. Each tuple is a node of the relation name, connecting to the subject and object that are concept or attribute nodes in the first layer. The third layer has nodes of statement sentences traceable to the original paper and authors. Each statement node connects to a set of fact tuples and/or condition tuples in the second layer. Inspired by a recent work that considers open information extraction as a sequence labeling task, we design a semi-supervised Multi-Input Multi-Output (MIMO) sequence labeling model that learns complex dependencies between the sequence tags from multiple signals and generates output sequences for fact and condition tuples. It has a new self-training module of multiple strategies to leverage the massive scientific data for better performance when manual annotation is limited. Experiments on a data set of 141M sentences show that our model outperforms existing methods and the SciKGs we constructed provide a good understanding of the scientific statements.
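
To make the three-layer structure described in the abstract more concrete, here is a minimal Python sketch of how such a graph might be held in memory. It is an illustration under assumptions, not the authors' implementation: the class names (`Concept`, `Attribute`, `RelationTuple`, `Statement`), the `kind` field, and the example sentence with its entities are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class Concept:
    """Layer 1: a concept node."""
    name: str


@dataclass(frozen=True)
class Attribute:
    """Layer 1: an attribute node, linked to the concept it attaches to."""
    name: str
    concept: Concept


Node = Union[Concept, Attribute]


@dataclass(frozen=True)
class RelationTuple:
    """Layer 2: a fact or condition tuple, i.e. a relation node
    connected to a subject and an object from layer 1."""
    kind: str  # "fact" or "condition" (hypothetical encoding)
    subject: Node
    relation: str
    obj: Node

    def __post_init__(self):
        if self.kind not in ("fact", "condition"):
            raise ValueError(f"unknown tuple kind: {self.kind!r}")


@dataclass
class Statement:
    """Layer 3: a statement sentence, traceable to its paper and authors,
    connected to the layer-2 tuples extracted from it."""
    sentence: str
    paper: str
    authors: List[str]
    tuples: List[RelationTuple] = field(default_factory=list)

    def facts(self) -> List[RelationTuple]:
        return [t for t in self.tuples if t.kind == "fact"]

    def conditions(self) -> List[RelationTuple]:
        return [t for t in self.tuples if t.kind == "condition"]


def build_example() -> Statement:
    # Invented sentence and entities, for illustration only.
    drug = Concept("drug X")
    gene = Concept("gene Y")
    cells = Concept("cells")
    hypoxia = Concept("hypoxia")
    expression = Attribute("expression", concept=gene)

    return Statement(
        sentence="Drug X reduces the expression of gene Y in cells under hypoxia.",
        paper="(hypothetical paper)",
        authors=["(hypothetical author)"],
        tuples=[
            RelationTuple("fact", drug, "reduces", expression),
            RelationTuple("condition", cells, "under", hypoxia),
        ],
    )


if __name__ == "__main__":
    stmt = build_example()
    for t in stmt.tuples:
        print(t.kind, t.subject.name, t.relation, t.obj.name)
```

Keeping the condition tuple next to the fact tuple under one statement node is what separates this layout from a flat relational network: a query for what drug X does can also return the context in which the claim was made.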