Chinese Journal of Organic Chemistry ›› 2025, Vol. 45 ›› Issue (6): 2189-2198. DOI: 10.6023/cjoc202409001

ARTICLES

An Automatic Identification and Extraction Method for Organic Chemistry Literature Data Based on Language Expression Pattern and Natural Language Processing

Weiming Chen*, Jingfang Dai, Yingyong Li, Junhong Zhou, Ben Gao, Yingli Zhao, Tingjun Xu, Xiaosong Xue*

  1. State Key Laboratory of Fluorine and Nitrogen Chemistry and Advanced Materials, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, Shanghai 200032
  • Received: 2024-09-01 Revised: 2024-11-10 Published: 2024-12-18
  • Contact: Weiming Chen, Xiaosong Xue
  • Supported by:
    Science Communication Project of Chinese Academy of Sciences in 2020; National Key R&D Program of China (2021YFF0701700); National Natural Science Foundation of China (22122104); National Natural Science Foundation of China (22193012); National Natural Science Foundation of China (21933004); Chinese Academy of Sciences Pilot Project (XDB0590000); Youth Team Programme of the Chinese Academy of Sciences for Stable Support of Basic Research Areas (YSBR-052); Youth Team Programme of the Chinese Academy of Sciences for Stable Support of Basic Research Areas (YSBR-095)

Journal literature is an important source of scientific data. For a long time, such data were identified and extracted by manual indexing. With advances in information technology and artificial intelligence, automatic identification and extraction of scientific data from journal literature is gradually becoming feasible. This paper studies a method for automatically identifying and extracting chemical data from journal articles using language expression patterns and rule-based natural language processing (NLP). The method was applied to 3275 experimental research articles published in the Chinese Journal of Organic Chemistry over the 10 years from 2013 to 2022, and more than 30 kinds of chemical data were extracted, including product characteristics, synthetic reaction parameters, physical property data, and spectral data. The extracted data were used to build the corresponding databases, which now support a knowledge service for the Chinese Journal of Organic Chemistry. A performance test on all 422 articles published in the Chinese Journal of Organic Chemistry in 2022 showed identification and extraction accuracies of 100% for optical rotation data, 99.85% for melting point data, 99.55% for fluorine nuclear magnetic resonance (19F NMR) spectra, 99.80% for carbon nuclear magnetic resonance (13C NMR) spectra, 99.47% for material form data, and 98.76% for product names (58 of the 4665 extracted product names were problematic). Product names are currently identified by excluding irrelevant content based on the local context, and the accuracy of product name recognition is expected to improve if a method that recognizes systematic and semi-systematic nomenclature is adopted. In principle, automatic identification and extraction based on language expression patterns and NLP is not tied to a particular discipline and should be applicable to scientific data in general.

Key words: chemical data, identification and extraction, language expression pattern, natural language processing
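
To illustrate the general idea of extraction by language expression patterns, the following is a minimal Python sketch, not the authors' implementation. The abstract does not give the actual rules, so the three regular expressions, the function name `extract_properties`, the output field names, and the sample sentence are all hypothetical and cover only a few common phrasings of material form, melting point, and optical rotation in an experimental paragraph.

```python
import re

# Hypothetical expression patterns, written for illustration only.
# The rules used in the paper are not given in the abstract.

# Material form, e.g. "White solid", "colorless oil".
FORM_PATTERN = re.compile(
    r"(?P<form>(?:white|yellow|colorless|pale yellow)\s+(?:solid|oil|liquid|powder))",
    re.IGNORECASE,
)

# Melting point, e.g. "m.p. 132-134 °C" or "mp: 98 °C".
MP_PATTERN = re.compile(
    r"\bm\.?\s?p\.?:?\s*(?P<low>\d+(?:\.\d+)?)\s*"
    r"(?:[-–~]\s*(?P<high>\d+(?:\.\d+)?))?\s*°\s?C",
    re.IGNORECASE,
)

# Optical rotation, e.g. "[α]D20 +15.6" or "[α]20D −8.2".
ROTATION_PATTERN = re.compile(
    r"\[α\]\s*(?:D\d*|\d*D)?\s*[:=]?\s*"
    r"(?P<sign>[+\-–−])?\s*(?P<value>\d+(?:\.\d+)?)"
)


def extract_properties(paragraph: str) -> dict:
    """Return the properties found in one experimental paragraph."""
    record = {}

    match = FORM_PATTERN.search(paragraph)
    if match:
        record["form"] = match.group("form").lower()

    match = MP_PATTERN.search(paragraph)
    if match:
        low = float(match.group("low"))
        # A single value is stored as a zero-width range.
        high = float(match.group("high")) if match.group("high") else low
        record["melting_point_c"] = (low, high)

    match = ROTATION_PATTERN.search(paragraph)
    if match:
        value = float(match.group("value"))
        # Treat hyphen, en dash, and the Unicode minus sign as negative.
        if match.group("sign") in ("-", "–", "−"):
            value = -value
        record["optical_rotation"] = value

    return record


if __name__ == "__main__":
    sample = "White solid, 85% yield. m.p. 132-134 °C; [α]D20 +15.6 (c 1.0, CHCl3)."
    print(extract_properties(sample))
```

A rule set of this kind is easy to audit, because every extracted value can be traced back to the pattern that matched it. Its coverage, however, depends entirely on how many phrasings the patterns anticipate.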