StringMix

Category: Artificial Intelligence / Neural Networks / Deep Learning
Development tool: C#
File size: 26KB
Downloads: 0
Upload date: 2017-04-28 21:21:44
Uploader: sh-1993
Description: StringMix, a humble .NET string tokenizer, classifier tagger, and transformer

File list:
LICENSE (1067, 2017-04-29)
src (0, 2017-04-29)
src\StringMix.Test (0, 2017-04-29)
src\StringMix.Test\Properties (0, 2017-04-29)
src\StringMix.Test\Properties\AssemblyInfo.cs (1404, 2017-04-29)
src\StringMix.Test\StringMix.Test.csproj (4516, 2017-04-29)
src\StringMix.Test\TaggerTest.cs (9569, 2017-04-29)
src\StringMix.Test\V2.cs (7330, 2017-04-29)
src\StringMix (0, 2017-04-29)
src\StringMix\Extensions.cs (11008, 2017-04-29)
src\StringMix\Internal (0, 2017-04-29)
src\StringMix\Internal\AbstractBaseMatcher.cs (3102, 2017-04-29)
src\StringMix\Internal\IMatcher.cs (694, 2017-04-29)
src\StringMix\Internal\ITransformer.cs (558, 2017-04-29)
src\StringMix\Internal\PatternMaker.cs (3425, 2017-04-29)
src\StringMix\Internal\RegexMatcher.cs (1443, 2017-04-29)
src\StringMix\Internal\Tagger.cs (3805, 2017-04-29)
src\StringMix\Model (0, 2017-04-29)
src\StringMix\Model\LexiconEntry.cs (1996, 2017-04-29)
src\StringMix\Model\MatchSet.cs (1497, 2017-04-29)
src\StringMix\Model\Pattern.cs (867, 2017-04-29)
src\StringMix\Model\StringMixOptions.cs (2027, 2017-04-29)
src\StringMix\Model\TaggedToken.cs (969, 2017-04-29)
src\StringMix\Properties (0, 2017-04-29)
src\StringMix\Properties\AssemblyInfo.cs (1376, 2017-04-29)
src\StringMix\StringMix.csproj (2912, 2017-04-29)
src\StringMix\StringMix.sln (1457, 2017-04-29)

# StringMix

String tokenizer, part-of-string tagger, and transformation library for .NET

---

### What:
This library performs the following feats:

1. Simple tokenizing
2. Given a set of definitions, or **lexicon**, apply a category, meaning, or type to each token using **tag** concepts
3. Summarize a list of tokens into **pattern**s that are concatenations of the tags applied to each token
4. Allow for expressing match criteria that patterns should be tested against
5. For patterns that have matched, allow for transformation into another concrete .NET class using a **transformer**
6. Provide a simple means to substitute key portions of the pipeline. In cases where a more specialized tokenizer is desired, that component can be created outside the library and injected in.

### Why:
To parse, categorize, and transform relatively small strings. Consider inputs like:

- Fred Flintstone
- Flintstone, Fred
- Fred and Wilma Flintstone
- Fred Flintstone;Wilma Flintstone
- Flintstone, Fred and Wilma
- Flintstone, Fred;Flintstone, Wilma

Handling these unstructured strings as proper names is tough as they stand. But if they are broken up into word parts (tokenizing) and tags are applied to those parts (categorizing), they become much easier to process, and even to turn into instances of another class.

### Terms:
- **Tokenizing**: The process of separating an input string into its component parts.
- **Token**: In most tokenizing schemes (including the default for this library), it's roughly analogous to a word.
- **`LexiconEntry`**: An identification of expected input and the tags that should be applied to it if found. Ex: Fred, F [for First Name]. Each lexicon entry can define more than one tag that could apply. When processing names, "Thomas" could be either a first or last name, so both the "F" and "L" tags could be applied.
- **Lexicon**: A collection of lexicon entries. Together, they describe all of the expected values that tokens could take and what their meaning is or could be.
- **Tag**: A placeholder for a token in a pattern; a simple descriptor for that token within the collection of tokens. Which tags get applied to which tokens is controlled by the lexicon. Once a string is tokenized, a `PatternMaker` component walks over the tokens and attempts to match each one to an entry in the lexicon. When a token matches, the tag(s) from the lexicon are applied to the patterns already identified. Since lexicon entries can have more than one tag assigned, a single input string might have several different patterns that apply.
- **Pattern**: A sequence of tags that represents the meaning or type of each token in the original input string. Patterns summarize this meaning and allow tokens of a given type or meaning to be selected, rearranged, or processed in a meaningful way.
- **Patterns**: When the lexicon contains entries with more than one tag assigned, a single input string can result in lists of tokens that are summarized by more than one pattern. Consider a lexicon that tags "Sarah" with "F" and "Thomas" with "F" and "L", and the input "Sarah Thomas". The patterns that summarize this input are "FF" and "FL".

### Other concepts:
Identification is the first step, and this readme covers it fairly well. More advanced topics are covered in the [wiki](https://github.com/JasonKoopmans/StringMix/wiki).

### To consider:
- The tokenizer is a simple wrapper over `String.Split()`. `StringSplitOptions.Separators`, which can be passed into the `.Tokenize()` method, contains the characters that the input is split on by default. This property can be overridden for desired results.
- Proper nouns like 'Eiffel Tower' would yield two tokens with this library. There is currently no way to post-process the tokens, apply rules to join them, and re-pattern. The initial focus of the library is on simpler, single tokens. Pull requests are welcome.
- At present, it makes the most sense for tags to be defined as single characters. The library could be enhanced to use multicharacter tags to be more descriptive, or to offer subclassing of tags [ex: Verb-Transitive (VT)], as is common in some other AI-focused part-of-speech (POS) tagging. Since Regex is the heart of the matching and extraction scheme, those efforts are much simpler with single-character representations in patterns.

### How:
The primary API for the library is accessible via extension methods that attach to the string class as well as to the other model classes contained within the library. Several of the internal components of the library are extensible and replaceable through overloads of these extension methods.

    // Define a lexicon
    List<LexiconEntry> lexicon = new List<LexiconEntry>();
    lexicon.Add(new LexiconEntry() {
        Value = "Fred",
        Tags = new List<string> { "F" } // For FirstName
    });
    lexicon.Add(new LexiconEntry() {
        Value = "Wilma",
        Tags = new List<string> { "F" } // For FirstName
    });
    lexicon.Add(new LexiconEntry() {
        Value = "Flintstone",
        Tags = new List<string> { "L" } // For LastName
    });

    // To tokenize:
    var tokens = "Fred Flintstone Wilma Flintstone".Tokenize(lexicon);

    // Find matches
    MatchSet matches = tokens.Match(@"^FLFL$");

    // Transform a MatchSet to an object (even collections of objects).
    // NameTransformer is not defined in this snippet, and the generic arguments
    // were lost in the source; `Name` below is a placeholder type.
    var names1 = matches.Transform(new NameTransformer<List<Name>>());

    /*
    : Results :
    names1[0] = {First=Fred, Last=Flintstone};
    names1[1] = {First=Wilma, Last=Flintstone}
    */

    // Transforming with a convenience method, straight from a string
    var names = "Fred Flintstone Wilma Flintstone".Transform(lexicon, "^FLFL$", new NameTransformer<List<Name>>());

    /*
    : Results :
    names[0] = {First=Fred, Last=Flintstone};
    names[1] = {First=Wilma, Last=Flintstone}
    */

For other uses, check out the [wiki](https://github.com/JasonKoopmans/StringMix/wiki).
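
The "Patterns" term and the note on single-character tags can be illustrated without the library itself. The following is a minimal sketch, not StringMix's implementation: `PatternSketch`, `BuildPatterns`, the dictionary-based lexicon, and the `"?"` tag for unknown tokens are all hypothetical names and choices made for this example. It expands a multi-tag lexicon into every candidate pattern, then uses a regex over the pattern to map positions back to tokens.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical illustration only: none of these names come from StringMix.
public static class PatternSketch
{
    // Expands each token's candidate tags into every possible pattern string.
    // Tokens missing from the lexicon get the placeholder tag "?" (an assumption).
    public static List<string> BuildPatterns(
        IEnumerable<string> tokens,
        IDictionary<string, string[]> lexicon)
    {
        var patterns = new List<string> { "" };
        foreach (var token in tokens)
        {
            string[] tags = lexicon.TryGetValue(token, out var found)
                ? found
                : new[] { "?" };
            // Cross every pattern built so far with every tag of this token.
            patterns = patterns
                .SelectMany(p => tags.Select(t => p + t))
                .ToList();
        }
        return patterns;
    }

    public static void Main()
    {
        var lexicon = new Dictionary<string, string[]>
        {
            ["Sarah"] = new[] { "F" },
            ["Thomas"] = new[] { "F", "L" },
        };
        string[] tokens = "Sarah Thomas".Split(' ');

        List<string> patterns = BuildPatterns(tokens, lexicon);
        Console.WriteLine(string.Join(",", patterns)); // FF,FL

        // With single-character tags, index i in the pattern is token i,
        // so a regex capture position maps straight back to the token list.
        var rule = new Regex("^(F)(L)$");
        foreach (var pattern in patterns)
        {
            Match m = rule.Match(pattern);
            if (!m.Success) continue;
            string first = tokens[m.Groups[1].Index];
            string last = tokens[m.Groups[2].Index];
            Console.WriteLine($"First={first}, Last={last}");
        }
    }
}
```

Because each tag occupies exactly one character, a capture group's `Index` doubles as the token index. With multicharacter tags that correspondence would need an extra offset table, which is one way to read the README's remark that single-character tags keep the regex-based scheme simple.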
