# [search|hub]() preDict (CE) Community Edition

## search|hub enables blazing-fast, language-independent spell correction at scale

Some basics about the spell correction problem:

* [A closer look into the spell correction problem — Part 1]()
* [A closer look into the spell correction problem — Part 2 — introducing preDict]()
* [A closer look into the spell correction problem — Part 3 — the bells and whistles]()

---

### Edit Distance

preDict is based on the spell-correction fuzzy-search library [SymSpell](), with a few customizations and optimizations:

* The fundamental beauty of SymSpell is the Symmetric Delete spelling-correction algorithm, which reduces the complexity of edit-candidate generation and dictionary lookup for a given edit distance. It is six orders of magnitude faster than the standard approach (deletes + transposes + replaces + inserts) and language independent.
* Additionally, only deletes are required, no transposes + replaces + inserts: transposes + replaces + inserts of the input phrase are transformed into deletes of the dictionary term. Replaces and inserts are expensive and language dependent; e.g. Chinese has 70,000 Unicode Han characters!

### preDict customizations

Our main goal was to increase accuracy whilst maintaining the incredible speed by adding:

* We replaced the Damerau-Levenshtein implementation with a weighted Damerau-Levenshtein implementation, where each operation (delete, insert, swap, replace) can have a different edit weight.
* We added some customization "hooks" that are used to rerank the top-k results (candidate list).

The results are then reordered based on a combined proximity:

* added a phonetic proximity algorithm (Eudex)
* added a prefix proximity algorithm (Jaro-Winkler)
* added a fragment proximity algorithm (Dice coefficient)
* added keyboard distance to get a dynamic replacement weight (since letters close to each other are more likely to be replaced)
* do some query normalization before search

## Benchmark Results

Run on Windows 10 with an Intel(R) Core(TM) i7-6700 CPU (2.60GHz) and Java(TM) 1.8.0_121:

```
Benchmark                           Mode  Cnt   Score    Error  Units
SearchHub with PreDict EE *        thrpt  200  96,019  ± 1,007  ops/ms
PreDict CE                         thrpt  200  82,116  ± 1,149  ops/ms
Original SymSpell (Port)           thrpt  200  68,105  ± 0,977  ops/ms
Lucene (FST Based FuzzySuggester)  thrpt    5  17,588  ± 0,690  ops/ms
Lucene (Fuzzy Field-Search)        thrpt  200   0,749  ± 0,017  ops/ms
```

## Quality Results

Based on data we collected over a few months. The test data is attached to the comparison project and can be changed. Changes to the data will, of course, change the results, but the differences shouldn't be that dramatic.

```
Benchmark                          Accuracy    TP   TN  Fail-Rate    FN   FP
SearchHub with PreDict EE *          98,87%  7452  355      1,13%    52   37
PreDict CE                           90,04%  6937  320      9,96%   496  307
Original SymSpell (Port)             88,87%  6842  321     11,13%   591  306
Lucene (Fuzzy Field-Search)          88,87%  6803  360     11,13%   630  267
Lucene (FST based FuzzySuggester)    78,96%  5883  481     21,04%  1550  146
```

\* SearchHub with PreDict EE represents our commercial offering. This offering is a search-platform-independent, AI-powered search query intelligence API containing our concept of controlled precision reduction and PreDict EE (Enterprise Edition), which is capable of language-agnostic term decomposition and disambiguation.
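The Symmetric Delete idea described above can be sketched in a few lines: every dictionary term is indexed under all of its delete-variants, and at query time only the delete-variants of the input are generated and looked up. This is an illustrative sketch (restricted to edit distance 1 for brevity), not preDict's actual code:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of Symmetric Delete lookup: precompute delete-variants of every
// dictionary term, then intersect them with the delete-variants of the query.
// A shared variant means the query and the term are within edit distance 1
// (covering deletes, inserts, replaces, and adjacent transposes).
public class SymmetricDeleteSketch {

    private final Map<String, Set<String>> index = new HashMap<>();

    // The term itself plus every string reachable by deleting one character.
    static Set<String> deleteVariants(String term) {
        Set<String> variants = new HashSet<>();
        variants.add(term);
        for (int i = 0; i < term.length(); i++) {
            variants.add(term.substring(0, i) + term.substring(i + 1));
        }
        return variants;
    }

    void addTerm(String term) {
        for (String v : deleteVariants(term)) {
            index.computeIfAbsent(v, k -> new HashSet<>()).add(term);
        }
    }

    // Candidate dictionary terms within edit distance 1 of the query.
    Set<String> lookup(String query) {
        Set<String> candidates = new HashSet<>();
        for (String v : deleteVariants(query)) {
            candidates.addAll(index.getOrDefault(v, Collections.emptySet()));
        }
        return candidates;
    }
}
```

Note how no inserts or replaces are ever generated at query time, which is why the approach stays fast regardless of alphabet size.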
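A weighted Damerau-Levenshtein distance, as mentioned in the customizations, extends the classic dynamic program by giving each operation its own cost. The class and parameter names below are illustrative, not preDict's API:

```java
// Illustrative weighted Damerau-Levenshtein: each operation (delete, insert,
// replace, adjacent swap) contributes its own configurable weight.
public class WeightedDamerauLevenshtein {

    final double deleteW, insertW, replaceW, swapW;

    WeightedDamerauLevenshtein(double deleteW, double insertW,
                               double replaceW, double swapW) {
        this.deleteW = deleteW;
        this.insertW = insertW;
        this.replaceW = replaceW;
        this.swapW = swapW;
    }

    double distance(String a, String b) {
        double[][] d = new double[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i * deleteW;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j * insertW;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                double sub = d[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : replaceW);
                d[i][j] = Math.min(sub,
                        Math.min(d[i - 1][j] + deleteW, d[i][j - 1] + insertW));
                // Adjacent transposition ("swap") with its own weight.
                if (i > 1 && j > 1 && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + swapW);
                }
            }
        }
        return d[a.length()][b.length()];
    }
}
```

Lowering the swap weight, for example, makes the common typo "teh" rank closer to "the" than a same-length replacement error would.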
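Of the reranking signals listed above, the fragment-proximity score (Dice coefficient) is easy to sketch: compare the character-bigram multisets of the query and the candidate. Again, this is a sketch of the general technique, not preDict's implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Bigram Dice coefficient as a fragment-proximity score: 2 * |shared bigrams|
// divided by the total bigram count of both strings, yielding a value in [0, 1].
public class DiceCoefficient {

    static Map<String, Integer> bigrams(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    static double dice(String a, String b) {
        Map<String, Integer> ba = bigrams(a);
        Map<String, Integer> bb = bigrams(b);
        int shared = 0, total = 0;
        for (Map.Entry<String, Integer> e : ba.entrySet()) {
            shared += Math.min(e.getValue(), bb.getOrDefault(e.getKey(), 0));
            total += e.getValue();
        }
        for (int count : bb.values()) total += count;
        return total == 0 ? 0.0 : 2.0 * shared / total;
    }
}
```

Because it looks at fragments rather than positions, such a score tolerates reorderings and truncations that pure edit distance penalizes heavily, which is why it is useful as one component of a combined proximity.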