Digital Humanities Research (数字人文研究) ›› 2022, Vol. 2 ›› Issue (4): 74-92.

• 攻玉以石 •

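A Data-driven Approach to Studying Changing Vocabularies in Historical Newspaper Collections

Simon Hengchen (corresponding author), Department of Swedish, University of Gothenburg, Sweden. Email: simon.hengchen@gu.se
Ruben Ros, PhD candidate, Centre for Contemporary and Digital History, University of Luxembourg
Jani Marjanen, Department of Digital Humanities, University of Helsinki, Finland
Mikko Tolonen, Department of Digital Humanities, University of Helsinki, Finland
Fang Huakang (方华康, translator), master's student, School of Humanities, Shanghai Normal University. Email: fhk1819436860@163.com

  • Online: 2022-11-08 Published: 2023-03-06
  • Funding:
    This research was supported by the European Union's Horizon 2020 research and innovation programme under grant 770299 (NewsEye), with computing resources provided by CSC-IT Center for Science Ltd. S. H. was funded by the Computational Lexical Semantic Change Detection project, supported by the Swedish Research Council (2019-2022; dnr 2018-01184).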

Abstract: Nation and nationhood are among the most frequently studied concepts in the field of intellectual history. At the same time, the word 'nation' and its historical usage are very vague. The aim of this article is to develop a data-driven method, using dependency parsing and neural word embeddings, to clarify some of the vagueness in the evolution of this concept. To this end, we propose the following two-step method. First, using linguistic processing, we create a large set of words pertaining to the topic of nation. Second, we train diachronic word embeddings and use them to quantify the strength of the semantic similarity between these words and thereby create meaningful clusters, which are then aligned diachronically. To illustrate the robustness of the method across languages, time spans, and large datasets, we apply it to the entirety of five historical newspaper archives in Dutch, Swedish, Finnish, and English. To our knowledge, there have so far been no large-scale comparative studies of this kind that purport to grasp long-term developments in as many as four different languages in a data-driven way. A particular strength of the method we describe in this article is that, by design, it is not limited to the study of nationhood, but extends to other research questions and is reusable in different contexts.
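As a rough illustration of the second step (quantifying semantic similarity between words and grouping them into clusters per time slice), the following minimal sketch may help. It is not the authors' implementation: the vectors are hand-made toy values standing in for trained diachronic embeddings, the period labels and the 0.8 threshold are hypothetical, and the grouping rule (connected components over a similarity threshold) is an assumption chosen for simplicity, since the abstract does not name a clustering algorithm. The diachronic alignment of clusters is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def threshold_clusters(vectors, threshold):
    """Group words into connected components of the graph that links
    every pair of words whose cosine similarity is >= threshold.
    (Illustrative grouping rule, not taken from the article.)"""
    words = sorted(vectors)
    unvisited = set(words)
    clusters = []
    for start in words:
        if start not in unvisited:
            continue
        unvisited.remove(start)
        stack, members = [start], [start]
        while stack:
            current = stack.pop()
            linked = [w for w in sorted(unvisited)
                      if cosine(vectors[current], vectors[w]) >= threshold]
            for w in linked:
                unvisited.remove(w)
                stack.append(w)
                members.append(w)
        clusters.append(sorted(members))
    return sorted(clusters)

# Hypothetical toy embeddings for two time slices (assumed values,
# standing in for vectors trained on each period's newspaper text).
TOY_EMBEDDINGS = {
    "period_1": {
        "nation": [1.0, 0.1, 0.0],
        "people": [0.9, 0.2, 0.0],
        "state": [0.0, 1.0, 0.1],
        "government": [0.1, 0.9, 0.0],
    },
    "period_2": {
        "nation": [0.5, 0.8, 0.0],
        "people": [1.0, 0.1, 0.0],
        "state": [0.1, 1.0, 0.0],
        "government": [0.0, 0.9, 0.1],
    },
}

def clusters_by_period(embeddings, threshold=0.8):
    """Cluster the vocabulary separately within each time slice."""
    return {period: threshold_clusters(vectors, threshold)
            for period, vectors in embeddings.items()}

if __name__ == "__main__":
    for period, clusters in clusters_by_period(TOY_EMBEDDINGS).items():
        print(period, clusters)
```

In this toy data, "nation" groups with "people" in the first slice and with "state" and "government" in the second, which is the kind of shift in a word's neighbourhood that a comparison across time slices is meant to surface.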

Key words:

digital humanities, data-driven, historical newspapers, vocabulary change

CLC Number: