erl_fuzzy_match

Category: Other
Development tool: Erlang
File size: 0KB
Downloads: 0
Upload date: 2015-02-11 16:52:23
Uploader: sh-1993
Description: Erlang Fuzzy String Matcher

File list:
ebin/
src/
test/
Emakefile
LICENSE
Makefile
rebar.config

# Erlang Fuzzy String Matcher

The Fuzzy String Matcher for Erlang pulls together a handful of algorithms to make fuzzy string matching available as a library to your Erlang programs.

This little piece of code was inspired by the Python code presented in http://www.sportshacker.net/posts/fuzzy_string_matching.html

## Overview

The fuzzy matcher runs as a node-local, named `gen_server` that keeps a dictionary of translations. If a string is not found in the lookup dictionary, it is sent through a list of string matching algorithms (abbreviations, Levenshtein, tokenized). A valid match is added as a new entry in the dictionary. If no match is found, the string is taken as a new entry in the dictionary, matching itself.

An example, using English Premier League football team names ('14/'15 season):

```Erlang
1> Teams = [
     <<"Arsenal">>, <<"Aston Villa">>, <<"Burnley">>, <<"Chelsea">>,
     <<"Crystal Palace">>, <<"Everton">>, <<"Hull City">>,
     <<"Leicester City">>, <<"Liverpool">>, <<"Manchester City">>,
     <<"Manchester United">>, <<"Newcastle United">>,
     <<"Queens Park Rangers">>, <<"Southampton">>, <<"Stoke City">>,
     <<"Sunderland">>, <<"Swansea City">>, <<"Tottenham Hotspur">>,
     <<"West Bromwich Albion">>, <<"West Ham United">>
   ].
[<<"Arsenal">>,<<"Aston Villa">>,<<"Burnley">>,
 <<"Chelsea">>,<<"Crystal Palace">>,<<"Everton">>,
 <<"Hull City">>,<<"Leicester City">>,<<"Liverpool">>,
 <<"Manchester City">>,<<"Manchester United">>,
 <<"Newcastle United">>,<<"Queens Park Rangers">>,
 <<"Southampton">>,<<"Stoke City">>,<<"Sunderland">>,
 <<"Swansea City">>,<<"Tottenham Hotspur">>,
 <<"West Bromwich Albion">>,<<"West Ham United">>]
2> OddNames = [
     {<<"Saints">>, <<"Southampton">>},
     {<<"Spurs">>, <<"Tottenham Hotspur">>}
   ].
[{<<"Saints">>,<<"Southampton">>},
 {<<"Spurs">>,<<"Tottenham Hotspur">>}]
3> erl_fuzzy_match:start_link(premier, dict:from_list(OddNames), Teams).
{ok,<0.36.0>}
4> erl_fuzzy_match:translate(premier, <<"Chelsea">>).
<<"Chelsea">>
5> erl_fuzzy_match:translate(premier, <<"Manchester Utd">>).
<<"Manchester United">>
6> erl_fuzzy_match:translate(premier, <<"West Ham">>).
<<"West Ham United">>
7> erl_fuzzy_match:translate(premier, <<"Newcastle">>).
<<"Newcastle United">>
8> erl_fuzzy_match:translate(premier, <<"QPR">>).
<<"Queens Park Rangers">>
```

But some caution may still be required (!):

```Erlang
9> erl_fuzzy_match:translate(premier, <<"W.B.A">>).
<<"W.B.A">>
10> erl_fuzzy_match:translate(premier, <<"Bristol City">>).
<<"Hull City">>
```

## Starting And Stopping

A fuzzy matcher is started by one of the `start_link` functions:

```Erlang
start_link(Name)
start_link(Name, Dict, Canon)
```

where `Name` is an atom, `Dict` a mapping from UTF-8 binary strings to UTF-8 binary strings ([`dict(binary(),binary())`](http://www.erlang.org/doc/man/dict.html)), and `Canon` a list of UTF-8 binary strings.

Multiple fuzzy matchers may run at the same time. The name supplied at startup selects which fuzzy matcher the other API calls operate on.

A list of known canonical strings may be provided when starting the fuzzy matcher; otherwise it is initially empty. This can be done by providing an empty dictionary and a list of canonical UTF-8 binary strings, a pre-populated dictionary of known matches (from UTF-8 binary strings to UTF-8 binary strings), or a mixture of the two.

Under normal operations it is usually desirable to enter new, unmatched strings into the list of canonical strings and into the dictionary.
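
The seeding options described above need nothing beyond the standard `dict` and `lists` modules. The sketch below is a hypothetical helper: the module name `fuzzy_seed_sketch` and the function `seed/2` are illustrative and not part of erl_fuzzy_match. It builds the dictionary from alias pairs and, purely as a convenience assumption rather than something the library is known to require, appends any alias target that is missing from the canonical list.

```Erlang
-module(fuzzy_seed_sketch).
-export([seed/2]).

%% Hypothetical helper (not part of erl_fuzzy_match).
%% Aliases: [{Alias, Canonical}] pairs of UTF-8 binaries.
%% Canon:   list of canonical UTF-8 binaries.
%% Returns {Dict, Canon1} suitable as the Dict and Canon arguments
%% of start_link/3.
-spec seed([{binary(), binary()}], [binary()]) ->
          {dict:dict(binary(), binary()), [binary()]}.
seed(Aliases, Canon) ->
    Dict = dict:from_list(Aliases),
    Targets = [C || {_Alias, C} <- Aliases],
    %% Assumption: every alias target should also be a canonical string,
    %% so targets not yet listed are appended (sorted, without duplicates).
    Missing = lists:usort(Targets) -- Canon,
    {Dict, Canon ++ Missing}.
```

With the shell bindings from the Overview, `{Dict, Canon} = fuzzy_seed_sketch:seed(OddNames, Teams)` yields values that could then be passed to `erl_fuzzy_match:start_link(premier, Dict, Canon)`. Passing `[]` as the alias list corresponds to the "empty dictionary plus canonical list" case.
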
If extending the initial set of canonical strings and mappings is *not* desired, the option `[fixed]` may be provided via the additional start function:

```Erlang
start_link(Name, Dict, Canon, [fixed])
```

A fuzzy matcher may be stopped explicitly via the `stop` function:

```Erlang
stop(Name)
```

A fuzzy matcher may also be started as part of a supervision tree, in which case the normal supervision shutdown sequence will lead to graceful termination of the fuzzy matcher.

## String Translation

A subject string may be fuzzily matched and translated into canonical form using

```Erlang
translate(Name, S)
translate(Name, S, Matchers)
```

where `Name` is an atom (the name given to the fuzzy matcher when it was started), `S` is a UTF-8 binary string to translate, and `Matchers` is an optional list of functions to use as the fuzzy matching algorithms.

A matcher function must obey the type

```Erlang
Matcher = fun(S :: binary(), [Canon :: binary()]) -> undefined | {ok, Name :: binary()}
```

that is, it takes the subject string as a UTF-8 binary and a list of canonical UTF-8 binary strings, and returns either the atom `undefined` if no match was found or `{ok, Canonical}` if `Canonical` was found as a suitable match for the subject.

If a list of matcher functions is provided, they are tried in the order given in the list. If a matcher function returns `undefined`, the next one in the list is tried.

The default matcher functions are:

- exact match (via the dictionary)
- abbreviation match (using [this](http://stackoverflow.com/a/7331558) algorithm)
- Levenshtein distance between subject and canonical string is less than or equal to two (2)
- all tokens (separated by whitespace) of subject and canonical matched according to exact match or Levenshtein

If no matcher function finds a match, the `translate/2,3` functions return the original subject string.
Additionally, if the start option `[fixed]` was *not* provided, the subject string is entered into the list of canonical strings.

## Data Access

The list of matcher functions is returned by

```Erlang
matchers(Name)
```

The current lookup dictionary is returned by

```Erlang
dict(Name)
```

The current list of canonical strings is returned by

```Erlang
canonicals(Name)
```

## Possible Extensions

- Configurability of the Levenshtein distance threshold.
- Upfront configurability of the default matcher sequence.
- Upfront configurability of externally defined matchers.
- Lowercase strings before matching.
- Strip out punctuation characters before matching.
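
Several of these extensions can already be approximated from outside the library, because `translate/3` accepts caller-supplied matcher functions. The sketch below is one possible matcher obeying the contract from String Translation; it is not the library's own implementation. The module name `fuzzy_matcher_sketch` and its functions are hypothetical, lowercasing is ASCII-only, the punctuation set is an arbitrary choice, and the input is assumed to be valid UTF-8.

```Erlang
-module(fuzzy_matcher_sketch).
-export([matcher/1, normalise/1, levenshtein/2]).

%% Returns a fun with the shape fun(S, Canon) -> undefined | {ok, Name}.
%% MaxDist is the configurable Levenshtein threshold.
matcher(MaxDist) when is_integer(MaxDist), MaxDist >= 0 ->
    fun(S, Canon) ->
        NS = normalise(S),
        Scored = [{levenshtein(NS, normalise(C)), C} || C <- Canon],
        %% keysort is stable: on equal distance the earlier canonical wins.
        case lists:keysort(1, Scored) of
            [{D, Best} | _] when D =< MaxDist -> {ok, Best};
            _ -> undefined
        end
    end.

%% Lowercase ASCII letters and drop a small, arbitrary set of punctuation.
%% Returns a list of code points; assumes Bin is valid UTF-8.
normalise(Bin) when is_binary(Bin) ->
    Chars = unicode:characters_to_list(Bin, utf8),
    [lower(C) || C <- Chars, not is_punct(C)].

lower(C) when C >= $A, C =< $Z -> C + 32;
lower(C) -> C.

is_punct(C) -> lists:member(C, ".,;:!?'\"-_()").

%% Classic row-by-row dynamic programming edit distance on two lists.
levenshtein(S, T) ->
    Row0 = lists:seq(0, length(T)),
    lists:last(lists:foldl(fun(SC, Prev) -> next_row(SC, T, Prev) end,
                           Row0, S)).

%% The first cell of a new row is the first cell of the previous row plus one.
next_row(SC, T, [P0 | _] = Prev) ->
    next_row(SC, T, Prev, [P0 + 1]).

%% Prev starts at the diagonal cell; the head of Acc is the cell to the left.
next_row(_SC, [], _Prev, Acc) ->
    lists:reverse(Acc);
next_row(SC, [TC | T], [Diag | [Up | _] = PrevRest], [Left | _] = Acc) ->
    Cost = if SC =:= TC -> 0; true -> 1 end,
    Cell = lists:min([Up + 1, Left + 1, Diag + Cost]),
    next_row(SC, T, PrevRest, [Cell | Acc]).
```

A call such as `erl_fuzzy_match:translate(premier, S, [fuzzy_matcher_sketch:matcher(2)])` would then use this matcher. Whether a caller-supplied list replaces or supplements the default matchers is not stated above, so that should be checked against the source before relying on it.
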
