Text Similarity Computation (文本相似度计算.doc)

About 16,700 characters
About 29 pages
Published 2018-02-03, Jiangxi

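Text Similarity Computation System

Abstract

In Chinese information processing, text similarity computation is widely used in information retrieval, machine translation, automatic question answering, text mining, and related fields. It is a basic yet critical problem that has long been both a focus and a difficulty of research. The goal of this graduation project is to implement text similarity computation with two methods.

This paper follows a traditional design approach. The first method is the cosine algorithm, which is easy to understand and whose results are easy to interpret. It computes the similarity between texts quickly, and its result (between 0 and 1) indicates the degree of similarity. Because cosine computation is built on the vector space model, the texts must first be converted into vector space model form, and that conversion requires term weighting. Before the vector space model is built, stop words must be removed from the text and feature selection must be performed. The second method is the BM25 algorithm, which this paper implements with the most basic loops, in order to observe whether, and by how much, the inverted index used in the cosine algorithm improves efficiency.

The main work of the system is stop-word removal, text feature selection, and weighting. After weighting, the cosine algorithm computes text similarity; after feature selection, BM25 computes similarity. To improve the system's efficiency, the program makes extensive use of containers, as well as inner-product and inverted-index algorithms.

Keywords: text similarity; cosine; BM25; containers

Text Similarity Algorithm Research

Abstract

In Chinese information processing, text similarity computation is widely used in information retrieval, machine translation, automatic question answering, text mining, and other areas. It is an essential and important problem that has long been both a research hotspot and a difficulty. Currently, most text similarity algorithms are based on the vector space model (VSM). However, these methods suffer from high dimensionality and sparseness. Moreover, they do not effectively handle the natural language phenomena present in text data, namely synonymy and polysemy. These problems disturb the efficiency and accuracy of text similarity algorithms and degrade the performance of text similarity computation.

This paper adopts a new idea: it introduces semantic similarity computation into traditional text similarity computation to improve the performance of text similarity algorithms. It discusses the existing text similarity algorithms and semantic text computation in depth and gives a Chinese text similarity algorithm based on semantic similarity. An online information management system, used to manage students' graduation design papers, supplies the papers on which the algorithm computes similarity in order to validate it.

This text similarity computing system's main work is stop-word removal, text feature selection, and weighting; after weighting, using the cosine algorithm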
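
The pipeline described in the abstract (stop-word removal, weighting, cosine similarity through an inverted index, and BM25 computed with plain loops) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the whitespace tokenizer, the stop-word list, the sample corpus, the smoothed IDF formula, the BM25 IDF variant, the parameters k1 = 1.5 and b = 0.75, and all function names are assumptions made for the example. Chinese text would additionally need a word segmenter, and the feature-selection step is omitted here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stop-word list and corpus, for illustration only.
STOP_WORDS = {"the", "a", "is", "of", "on"}
DOCS = [
    "the cat sat on the mat",
    "a cat is on the mat",
    "the stock market fell sharply",
]

def tokenize(text):
    """Whitespace tokenization plus stop-word removal (an assumed stand-in for segmentation)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def idf_table(token_docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1 (assumed variant)."""
    n = len(token_docs)
    df = Counter(t for doc in token_docs for t in set(doc))
    return {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}

def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector, normalized to unit length."""
    tf = Counter(tokens)
    vec = {t: c * idf[t] for t, c in tf.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def build_inverted_index(vectors):
    """Map each term to the (doc_id, weight) pairs of the documents that contain it."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for t, w in vec.items():
            index[t].append((doc_id, w))
    return index

def cosine_scores(query_vec, index, n_docs):
    """Cosine as an inner product of unit vectors, visiting only postings of query terms."""
    scores = [0.0] * n_docs
    for t, qw in query_vec.items():
        for doc_id, dw in index.get(t, []):
            scores[doc_id] += qw * dw
    return scores

def bm25_scores(query_tokens, token_docs, k1=1.5, b=0.75):
    """Okapi BM25 with plain loops over every document (no index)."""
    n = len(token_docs)
    avgdl = sum(len(d) for d in token_docs) / n
    df = Counter(t for doc in token_docs for t in set(doc))
    scores = []
    for doc in token_docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue  # term absent from this document
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            length_norm = 1 - b + b * len(doc) / avgdl
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * length_norm)
        scores.append(score)
    return scores

token_docs = [tokenize(d) for d in DOCS]
idf = idf_table(token_docs)
vectors = [tfidf_vector(d, idf) for d in token_docs]
index = build_inverted_index(vectors)

# Use the first document as the query and score it against the whole corpus.
query = tokenize(DOCS[0])
cos = cosine_scores(tfidf_vector(query, idf), index, len(DOCS))
bm25 = bm25_scores(query, token_docs)
print([round(s, 3) for s in cos])
print([round(s, 3) for s in bm25])
```

With the inverted index, the cosine scorer only visits documents that share at least one term with the query, whereas the loop-based BM25 scorer inspects every document. That difference is the kind of efficiency gap the abstract sets out to observe. Note that BM25 scores are not bounded to the 0 to 1 range, so the two score lists are comparable only as rankings.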