preDict: Commerce-Experts' blazing fast spell correction fuzzy search library based on SymSpell

preDict-master.zip
Description
# [search|hub](https://www.searchhub.io) preDict (CE) Community Edition

## search|hub enables blazing fast, language-independent spell correction at scale

Some basics about the spell correction problem:

* [A closer look into the spell correction problem — Part 1](https://medium.com/@searchhub.io/a-closer-look-into-the-spell-correction-problem-part-1-a6795bbf7112)
* [A closer look into the spell correction problem — Part 2 — introducing preDict](https://medium.com/@searchhub.io/a-closer-look-into-the-spell-correction-problem-part-2-introducing-predict-8993ecab7226)
* [A closer look into the spell correction problem - Part 3 — the bells and whistles](https://medium.com/@searchhub.io/a-closer-look-into-the-spell-correction-problem-part-3-the-bells-and-whistles-19697a34011b)

---

### Edit Distance

preDict is based on the spell correction fuzzy search library [SymSpell](https://github.com/wolfgarbe/symspell), with a few customizations and optimizations:

* The fundamental beauty of SymSpell is the Symmetric Delete spelling correction algorithm, which reduces the complexity of edit candidate generation and dictionary lookup for a given edit distance. It is six orders of magnitude faster than the standard approach (deletes + transposes + replaces + inserts) and language independent.
* Additionally, only deletes are required, no transposes + replaces + inserts: transposes, replaces, and inserts of the input phrase are transformed into deletes of the dictionary term. Replaces and inserts are expensive and language dependent: e.g. Chinese has 70,000 Unicode Han characters!

### preDict customizations

Our main goal was to increase accuracy whilst maintaining the incredible speed. To that end:

* We replaced the Damerau-Levenshtein implementation with a weighted Damerau-Levenshtein implementation, where each operation (delete, insert, swap, replace) can have a different edit weight.
* We added customization "hooks" that are used to rerank the top-k results (candidate list). The results are then reordered based on a combined proximity score:
  * added a phonetic proximity algorithm (Eudex)
  * added a prefix proximity algorithm (Jaro-Winkler)
  * added a fragment proximity algorithm (Dice coefficient)
  * added keyboard distance to get a dynamic replacement weight (since letters close to each other are more likely to be replaced)
  * do some query normalization before search

## Benchmark Results

Run on Windows 10 with an Intel(R) Core(TM) i7-6700 CPU (2.60 GHz) and Java(TM) 1.8.0_121:

```
Benchmark                            Mode    Cnt    Score     Error    Units
SearchHub with PreDict EE *          thrpt   200   96,019   ± 1,007   ops/ms
PreDict CE                           thrpt   200   82,116   ± 1,149   ops/ms
Original SymSpell (Port)             thrpt   200   68,105   ± 0,977   ops/ms
Lucene (FST Based FuzzySuggester)    thrpt     5   17,588   ± 0,690   ops/ms
Lucene (Fuzzy Field-Search)          thrpt   200    0,749   ± 0,017   ops/ms
```

## Quality Results

Based on data we collected over a few months. The test data is attached to the comparison project and can be changed. Changes to the data will, of course, change the results, but the differences shouldn't be that dramatic.
```
Benchmark                            Accuracy   TP     TN    Fail-Rate    FN     FP
SearchHub with PreDict EE *          98,87%     7452   355    1,13%        52     37
PreDict CE                           90,04%     6937   320    9,96%       496    307
Original SymSpell (Port)             88,87%     6842   321   11,13%       591    306
Lucene (Fuzzy Field-Search)          88,87%     6803   360   11,13%       630    267
Lucene (FST based FuzzySuggester)    78,96%     5883   481   21,04%      1550    146
```

\* SearchHub with PreDict EE represents our commercial offering, https://www.searchhub.io. This offering is a search-platform-independent, AI-powered search query intelligence API that contains our concept of controlled precision reduction and PreDict EE (Enterprise Edition), which is capable of handling language-agnostic term decomposition and disambiguation.
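
The Symmetric Delete idea described in the "Edit Distance" section above can be made concrete with a short sketch: both dictionary terms (at index time) and the input term (at query time) are expanded into the variants reachable by deletes only, so inserts, replaces, and transpositions in the input can be matched through shared delete-variants. The Java sketch below is only an illustration under assumed names; `deleteVariants` is not part of the SymSpell or preDict API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class SymmetricDeleteSketch {

    // All strings reachable from `term` by removing up to `maxEditDistance`
    // characters. A SymSpell-style index stores these variants for every
    // dictionary term and intersects them with the variants of the input term.
    static Set<String> deleteVariants(String term, int maxEditDistance) {
        Set<String> variants = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(term);
        for (int round = 0; round < maxEditDistance; round++) {
            Deque<String> next = new ArrayDeque<>();
            for (String s : frontier) {
                for (int i = 0; i < s.length(); i++) {
                    String shorter = s.substring(0, i) + s.substring(i + 1);
                    if (variants.add(shorter)) {
                        next.add(shorter); // expand each new variant only once
                    }
                }
            }
            frontier = next;
        }
        return variants;
    }

    public static void main(String[] args) {
        // Missing letter: the typo "huse" is itself a delete-variant of "house".
        System.out.println(deleteVariants("house", 1).contains("huse"));   // true
        // Extra letter: "house" is a delete-variant of the typo "housse".
        System.out.println(deleteVariants("housse", 1).contains("house")); // true
    }
}
```

Because only deletions are ever generated, the candidate set stays small and no language-specific alphabet is needed, which is what makes the approach fast and language independent.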
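
Similarly, the weighted Damerau-Levenshtein mentioned under "preDict customizations" can be sketched as an optimal-string-alignment distance whose per-operation costs are parameters instead of a uniform 1. The class, method name, and weight values below are illustrative assumptions, not preDict's actual implementation or API.

```java
import java.util.Locale;

// Minimal sketch (not preDict's actual implementation): a weighted
// Damerau-Levenshtein distance in its "optimal string alignment" form,
// where every operation carries its own cost instead of a uniform cost of 1.
public class WeightedDamerauLevenshteinSketch {

    static double distance(String a, String b,
                           double deleteWeight, double insertWeight,
                           double replaceWeight, double swapWeight) {
        a = a.toLowerCase(Locale.ROOT);
        b = b.toLowerCase(Locale.ROOT);
        int n = a.length(), m = b.length();
        double[][] d = new double[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i * deleteWeight; // delete all of a
        for (int j = 0; j <= m; j++) d[0][j] = j * insertWeight; // insert all of b
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double replaceCost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : replaceWeight;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + deleteWeight,   // delete
                                 d[i][j - 1] + insertWeight),  // insert
                        d[i - 1][j - 1] + replaceCost);        // replace or match
                // adjacent transposition ("swap")
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + swapWeight);
                }
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        // Illustrative weights: a swap is cheaper than a replace, which is
        // cheaper than an insert or delete.
        System.out.println(distance("shirt", "shrit", 1.0, 1.0, 0.9, 0.7)); // 0.7 (one swap)
        System.out.println(distance("shirt", "short", 1.0, 1.0, 0.9, 0.7)); // 0.9 (one replace)
    }
}
```

Under a scheme like this, the keyboard-distance signal from the customization list could be used to lower the replace weight for characters on adjacent keys, making likely typos cheaper than arbitrary substitutions.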