[关键词]
[摘要]
针对核心专利识别准确率低的问题,重构指标体系;针对传统核心专利识别方法处理不平衡数据效果欠佳,提出重采样技术与集成算法的组合模型。首先,在传统指标构建基础上加入专利发明人相关指标;其次,使用合成少数类过采样算法(SMOTE)增加少数类样本解决数据不平衡问题,采用局部离群因子(LOF)算法对新生成样本进行降噪处理,并与自适应集成算法(Adaboost)组合成SMOTE-LOF-Adaboost模型。最后,以智慧芽专利数据库中2012-2016年共22077条光伏领域专利数据为例,使用SVM、Adaboost、SMOTE-Adaboost、SMOTE-LOF-Adaboost进行实证分析,结果显示SMOTE-LOF-Adaboost模型AUC均值0.977 6,Recall均值0.986 0,F1均值0.960 7均优于其他三种模型,且各指标的标准差更小,表明SMOTE-LOF-Adaboost模型不仅提高核心专利预测的准确性,并且有更高的模型稳定性。
[Keywords]
[Abstract]
To address the low accuracy of core patent identification, this paper reconstructs the indicator system; to address the poor performance of traditional core patent identification methods on imbalanced data, it proposes a model that combines resampling techniques with an ensemble algorithm. First, inventor-related indicators are added to the traditional indicator set. Second, the Synthetic Minority Over-sampling Technique (SMOTE) is used to generate additional minority-class samples and so relieve the data imbalance, the Local Outlier Factor (LOF) algorithm is used to denoise the newly generated samples, and both steps are combined with the Adaptive Boosting (Adaboost) algorithm to form the SMOTE-LOF-Adaboost model. Finally, 22,077 photovoltaic patent records from 2012 to 2016 in the Patsnap patent database are used for an empirical comparison of SVM, Adaboost, SMOTE-Adaboost, and SMOTE-LOF-Adaboost. The SMOTE-LOF-Adaboost model achieves a mean AUC of 0.9776, a mean Recall of 0.9860, and a mean F1 score of 0.9607, all better than those of the other three models, and the standard deviation of each metric is smaller. This indicates that the SMOTE-LOF-Adaboost model not only improves the accuracy of core patent prediction but is also more stable.
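The resampling stage described in the abstract (SMOTE oversampling followed by LOF denoising of the new samples) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the neighbour count `k`, the LOF cut-off `threshold=1.5`, and the choice to score synthetic samples against the minority class only are all assumptions made for illustration.

```python
import math
import random


def k_nearest(idx, points, k):
    """Return sorted (distance, index) pairs for the k nearest neighbours of points[idx]."""
    dists = [(math.dist(points[idx], p), j) for j, p in enumerate(points) if j != idx]
    return sorted(dists)[:k]


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        _, j = rng.choice(k_nearest(i, minority, k))
        gap = rng.random()  # position on the segment between the two samples
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(minority[i], minority[j])))
    return synthetic


def lof_scores(points, k=3):
    """Local Outlier Factor of every point; values well above 1 suggest outliers."""
    neigh = [k_nearest(i, points, k) for i in range(len(points))]
    k_dist = [n[-1][0] for n in neigh]  # distance to the k-th neighbour
    lrd = []  # local reachability density of each point
    for n in neigh:
        reach = [max(k_dist[j], d) for d, j in n]
        lrd.append(len(reach) / max(sum(reach), 1e-12))
    return [sum(lrd[j] for _, j in n) / (len(n) * lrd[i]) for i, n in enumerate(neigh)]


def smote_lof(minority, n_new, k=3, threshold=1.5, seed=0):
    """Oversample with SMOTE, then drop synthetic samples whose LOF exceeds threshold.

    Assumption: LOF is computed on minority + synthetic samples only, and
    the original minority samples are always kept.
    """
    synthetic = smote(minority, n_new, k, seed)
    scores = lof_scores(list(minority) + synthetic, k)
    kept = [p for p, s in zip(synthetic, scores[len(minority):]) if s <= threshold]
    return list(minority) + kept
```

In a full pipeline the denoised minority samples would be merged with the majority class and passed to a boosting classifier such as scikit-learn's `AdaBoostClassifier`; that step is omitted here to keep the sketch dependency-free.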
[CLC number]
G306
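[Funding]
Key Research and Development Program of Zhejiang Province, "Research and Development of Key Internet-Based Detection Technologies for the New Luminescent Materials Industry Chain" (2021C01027); Zhejiang Provincial Natural Science Foundation project, "Mechanisms of Collective Intelligence Emergence in Crowd Innovation Communities Based on Knowledge Openness" (LY20G01008)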