MMRTokenizerXL

Category: Feature extraction
Development tool: Others
File size: 0KB
Downloads: 0
Upload date: 2023-09-05 13:44:39
Uploader: sh-1993
Description: no intro
(Tokenizer for Myanmar Unicode Pyidaungsu Font (Visual Order))

File list:
LICENSE (21284, 2023-09-05)
images/ (0, 2023-09-05)
images/MMRManipulator_sorting.gif (767667, 2023-09-05)
images/MMRTokenizerXL.png (232917, 2023-09-05)
images/formulas_for_wordcount_exploding.png (72962, 2023-09-05)
images/mmrtokenizer_newFuncs_v1.2.png (79139, 2023-09-05)

# MMRTokenizerXL
## 1.Tokenizer for Myanmar Unicode Pyidaungsu Font (Visual Order)
![MMRTokenizerXL](https://github.com/images/MMRTokenizerXL.png)
## 2.The Journey
As usual, while I was haunting my usual Facebook Excel groups, I came across somebody asking how to sort Myanmar names, written in a Myanmar font, by the last name. As all Myanmar people know, our names do not have a first name/last name structure; they are more like a jumble of nice words joined together to make a beautiful name or to follow Myanmar astrological beliefs.\
The main problem with this naming system concerns something else. It is the way we write our language.\
We don't really need to use white space to separate each word or ?syllable. I am not a linguist, so I don't know what it is called.\
Anyway, the real problem is that I found it hard to find the word break points, I mean the exact point where one word of a name ends and the next word starts. Let's consider a name in Myanmar/Burmese, e.g. (=Khin Moh Moh Aung), which is a female's name.\
In this case, even though we spell it in English as shown inside the parentheses, the Myanmar words do NOT have any white space between them. They just don't have to.\
If we want to separate/tokenize/split the name in English, we could easily do so in VBA using the SPLIT function or with some formulas in the Excel worksheet UI.\
But we have no way of achieving the same goal with a Myanmar font, namely the Pyidaungsu font, because it is not easy to programmatically identify where the consonants are using VBA.
I am very interested in Natural Language Processing (NLP), but it is very hard for me to understand the papers written by NLP specialists because I don't have the basics of linguistics and I was never formally trained in programming languages, especially VBA, which is what I currently and mostly write in.\
However, I persevered and read through many papers, but I got nowhere.\
Then I found some GitHub pages where people had written about a Python package, and they showed an image of tokenized Myanmar words, and I was hooked. Now that there was this question about how to sort Myanmar names using the last word of the name, I realized that it would be best to get the last consonant out and sort by that, much like we do with English names.\
So I tried to get the consonant of the last word from a name. It was very hard for me because I didn't know how the Pyidaungsu font stitches together the parts of a word. For example, for the same name, , (a beautiful name for a girl btw), we actually type it using Windows 10's Burmese Keyboard (Visual Order) in Pyidaungsu font like (, , , , , , , , , , , , , , , ) because we need to type the way we spell as well as the way we see/read it. There is another Burmese Keyboard using Phonetic Order, but I am NOT using it and I don't develop code for it because typing on it is pretty hard.\
Through trial and error, converting each typed part into Unicode values using AscW and ChrW, I realized that the developers of the Pyidaungsu font decided to put together a system (albeit an incomplete one, more on this later) which, in a way, places the consonant in front of every other appendage/diacritic? of a Burmese word spelling. What I mean is that for the last part of the name, , they placed the consonant in front of the other stuff. And they bind them together like , , , , , and if that is a uniform method, I could come up with a UDF for tokenization quite easily!
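The AscW exploration described above can be reproduced with a short helper. This is only a minimal sketch and not part of the released UDFs: `DumpCodePoints` is a hypothetical name, and the consonant range it marks (4096 up to, but not including, 4129 + 9) is borrowed from the constants used in the tokenizer code in this README.

```VBA
Option Explicit

'Hypothetical helper (not one of the released UDFs): lists the Unicode
'code point of every character in a string and appends "*" to the ones
'that fall inside the consonant range tested by the tokenizer.
Function DumpCodePoints(source As String) As String
    Const kagyi As Long = 4096   'U+1000, same value as in the UDF modules
    Const ah As Long = 4129      'U+1021; the UDFs test "< ah + 9"
    Dim i As Long
    Dim code As Long
    Dim result As String

    For i = 1 To Len(source)
        'AscW returns a signed Integer, so mask it into the range 0..65535
        code = AscW(Mid$(source, i, 1)) And &HFFFF&
        If Len(result) > 0 Then result = result & " "
        result = result & CStr(code)
        If code >= kagyi And code < ah + 9 Then result = result & "*"
    Next i

    DumpCodePoints = result
End Function
```

For example, typing `? DumpCodePoints(ChrW(4096) & ChrW(4141))` in the Immediate window prints `4096* 4141`: the first character is inside the consonant range and the second (a vowel sign) is not.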
So I tried to loop from the right-most part of the name and bam, the first obstacle hit me, right in the face.\
The under the , the is actually a consonant but it must be combined with the Athat/diacritic? or , to create the whole word. So I had to ignore the first consonant that comes after the Athat.\
The real problem begins here, because Myanmar words have other parts that come between the Athat and the consonant!!! And the number of things that can come between them is variable and can be more than 2 or 3. I wrote "incomplete" above because in some cases like, , we type it like , , , so I was thinking that the Athat would come second last and the would come after it, and I was totally wrong. In reality, the Athat comes rightmost and then the comes after that, which wreaked havoc on my algorithm, and I had to spend hours trying to fix that.\
Therefore, I had to come up with a way to filter through to reach the consonant of the last word in a name. And I did.\
The algorithm is neither graceful nor very cool, but it works! I was like, OK, I can already get the consonant of the last word in a name. Why not get all the consonants using this algorithm?\
By this time, I could sort a column using the consonant of the last word of a name!\
I expanded and improved my code to be able to extract all the consonants of a name!!! The idea now was that all the components of each word of a name must lie between the consonants, because the Pyidaungsu font was made to combine words like that! Thank you, Pyidaungsu font dev team!!\
So, I just had to put a separator/delimiter just before each consonant and then BAM! The name is tokenized! Then I found that using the consonant of the last word of a name for sorting is not very accurate because, in the Burmese language, we have a system for laying out the alphabetical order of words, which we call .
This topic was taught in the first year of my bachelor's degree and I failed this subject, the Myanmar language!\
I tried sorting with the consonant extraction UDF and compared it to sorting by only the consonant of the last word. And I found that the difference is quite noticeable.\
This drove me to finish the whole Tokenization UDF, because the Burmese sorting system uses not just consonants but also the other parts that build a word, like those I already mentioned above, for example, or or or or etc., and many more of those things/diacritics?.\
Enough said. I wish to admit that I used just string manipulation functions in this UDF rather than NLP methods. I am not employing ML or AI in any of this.
## 3.The UDFs
The source code of the UDFs may be released as plain text.\
There are currently 4 UDFs in the .bas and .xlsm files upon release.
1. [MMRTokenizer](https://github.com/4R3B3LatH34R7/MMRTokenizerXL#11mmrtokenizer)
2. [MMRManipulator](https://github.com/4R3B3LatH34R7/MMRTokenizerXL#12mmrmanipulator)
3. [getMMRConsonants](https://github.com/4R3B3LatH34R7/MMRTokenizerXL#13getmmrconsonants)
4. [MMRParser](https://github.com/4R3B3LatH34R7/MMRTokenizerXL#14mmrparser)
### 3.1.MMRTokenizer
MMRTokenizer is designed mainly for tokenizing Myanmar words without additional bells and whistles, as this UDF is meant to feed further processing with NLP methods rather than general everyday use.\
If extra functionality is required, users are encouraged to use the [MMRManipulator UDF](https://github.com/4R3B3LatH34R7/MMRTokenizerXL#mmrmanipulator).\
Since this UDF is mainly intended for NLP-related usage, its users are expected to be able to edit the VBA source code directly to change the separator to their liking, so no switching arguments are included for that purpose.
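Because the tokenizer output is a plain string delimited by `|`, further processing can start from VBA's standard `Split` function. As a minimal sketch of the original sorting use case, the hypothetical helper below (`LastToken` is an illustrative name, not one of the released UDFs) returns the last token of an already tokenized string.

```VBA
Option Explicit

'Hypothetical helper (not one of the released UDFs): returns the last token
'of a string that has already been tokenized, for example by MMRTokenizer.
Function LastToken(tokenized As String, Optional separator As String = "|") As String
    Dim parts() As String

    'Nothing to split on: hand the input back unchanged
    If Len(tokenized) = 0 Or Len(separator) = 0 Then
        LastToken = tokenized
        Exit Function
    End If

    parts = Split(tokenized, separator)
    LastToken = parts(UBound(parts))
End Function
```

With the tokenizer module in the same workbook, a helper column could hold something like `=LastToken(MMRTokenizer(A1))` and be used as the sort key. Whether Excel's sort order on that column matches the Burmese dictionary order is a separate question that this sketch does not address.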
Users can directly copy the UDF code below instead of downloading the .xlsm or .bas modules from the [Releases Section](https://github.com/4R3B3LatH34R7/MMRTokenizerXL/releases).
```VBA
Option Explicit
'**********************************************************************************************************************************
'*Users of the following VBA code are not allowed to share the code commercially without written approval from the developer.     *
'*Any commercial distribution of the code herein requires acknowledgement, consent and approval from the author.                  *
'*The developer of the code holds complete and thorough copyrights, however, no authorization is required for educational and     *
'*humanitarian uses, in which case, this whole declaration section must be included wheresoever the code herein is placed.        *
'*Failure to comply with above declarations shall be liable to the full extent of the law.                                        *
'*The VBA code provided herewith has no guarantee whatsoever with it and any untoward effect(s) that occur(s) shall not be held   *
'*liable to the developer and it is taken as a legally binding fact that the user(s) of said code must have agreed to this        *
'*disclaimer, in order to use it.                                                                                                 *
'*Contact info can be found at https://github.com/4R3B3LatH34R7                                                                   *
'**********************************************************************************************************************************
'Can place the constants in each function if only some functions were required
Public Const kagyi = 4096       'U+1000
Public Const ah = 4129          'U+1021, +9 to include ou
Public Const athat = 4154       'U+103A
Public Const shiftF = 4153      'U+1039, for typing something under something
Public Const witecha = 4140     'U+102C
Public Const moutcha = 4139     'U+102B

'Return a tokenized Myanmar String
Function MMRTokenizer(target As Range) As String
    Dim ch As String
    Dim returnString As String
    Dim charCounter As Integer
    Dim previousChIsAthat As Boolean
    Dim shiftFfound As Boolean
    Dim previousCharAt As Long

    If target.Cells.CountLarge > 1 Then MMRTokenizer = ">1Cell!": Exit Function
    returnString = "": previousChIsAthat = False: shiftFfound = False: previousCharAt = Len(target.Value) + 1
    If target.CountLarge = 1 Then
        If target.Value <> "" Then
            'Scan the text from right to left, one character at a time
            For charCounter = Len(target.Value) To 1 Step -1
                ch = Mid(target.Value, charCounter, 1)
                If AscW(ch) <> shiftF Then
                    If Not shiftFfound Or AscW(ch) = athat Then
                        If AscW(ch) <> athat Then
                            If AscW(ch) >= kagyi And AscW(ch) < ah + 9 Then
                                If Not previousChIsAthat Then
                                    'This consonant starts a token: cut from here up to the previous cut point
                                    returnString = Mid(target.Value, charCounter, previousCharAt - charCounter) & IIf(Len(returnString) > 0, "|", "") & returnString
                                    previousCharAt = charCounter
                                Else
                                    'First consonant met after an athat is skipped
                                    previousChIsAthat = False
                                End If
                            Else
                                If AscW(ch) = witecha Or AscW(ch) = moutcha Then
                                    previousChIsAthat = False
                                End If
                            End If
                        Else
                            previousChIsAthat = True
                            If shiftFfound Then shiftFfound = False
                        End If
                    Else
                        shiftFfound = False
                        If previousChIsAthat Then previousChIsAthat = False
                    End If
                Else
                    shiftFfound = True
                End If
            Next charCounter
        End If
    End If
    MMRTokenizer = returnString
End Function
```
### 3.2.MMRManipulator
This is a tool that grew out of being able to tokenize Myanmar/Burmese words typed using the Burmese Visual Order Keyboard in Windows; deviating from this would result in lesser
performance.\
It can be used to tokenize the words for any purpose, like sorting, counting, replacing...etc...with the sky being the limit for the users' imagination.\
With argument switch(es), users can change the tokenization character to anything, any text, any string, even nothing!\
The users can also reverse the whole Myanmar sentence word by word, with the first word becoming the last word and vice versa...\
The result of using this tool can be found in the [photo](https://github.com/images/MMRTokenizerXL.png) under the right-most column (Column F). If cell A1 contains , then calling from inside cell B1 like, =MMRTokenizer(A1), shall return ||.

The following short .gif shows how to use MMRManipulator to sort Myanmar names, words and sentences in reverse.
![MMRManipulator](https://github.com/images/MMRManipulator_sorting.gif)
The process is simple in that users just need to use the UDF to reverse the range containing Myanmar words.\
The UDF only requires the first argument, out of the existing 5:
1. the first argument, the target range containing the target text string, is essential.
2. the second argument defines the left or starting wrapper and can be any text/string, for example, "(" or "<" or "{" or "\[".
3. the third argument defines the separator/delimiter for the output of the UDF, which can be anything from "" (vbNullString in VBA) or a blank cell (in Excel) to a space or any word in Burmese or English. Since no check is performed on this argument's validity, it can be quite powerful and dangerous at the same time.
4. the fourth argument defines the right or ending wrapper and can be any text/string, including but not limited to ")" or ">" or "}" or "]".
5. the fifth argument is a boolean which acts as a switch for reversing the output of the UDF.

So, if cell A1 contains "" and we call the UDF from inside cell B1 as follows (let "->" denote "returns"):
1. =MMRManipulator(A1) -> ||
2. =MMRManipulator(A1,,"@",,) -> @@
3. =MMRManipulator(A1,,"",,TRUE) -> (please note that "" is not a space but denotes nothing)

Apart from the target cell reference (the 1st argument), the remaining 4 arguments are optional; thus, calling like =MMRManipulator(A1) or =MMRManipulator(A1,,,,) is legitimate and will return || anyway.\
The reason behind including starting and ending wrappers is to make the output more in line with lists in other programming languages like Python.\
For example, calling the UDF as =MMRManipulator(A1,"""",",",CHAR(34)) will return "","","". Please note here that the double quote character must be passed as 4 double quotes or as CHAR(34), because a double quote inside a string literal has to be doubled. Another good example would be calling like =MMRManipulator(A1,"(","-",")",TRUE), which would return something like ()-()-().

Users can directly copy the UDF code below instead of downloading the .xlsm or .bas modules from the [Releases Section](https://github.com/4R3B3LatH34R7/MMRTokenizerXL/releases).
```VBA
Option Explicit
'**********************************************************************************************************************************
'*Users of the following VBA code are not allowed to share the code commercially without written approval from the developer.     *
'*Any commercial distribution of the code herein requires acknowledgement, consent and approval from the author.                  *
'*The developer of the code holds complete and thorough copyrights, however, no authorization is required for educational and     *
'*humanitarian uses, in which case, this whole declaration section must be included wheresoever the code herein is placed.        *
'*Failure to comply with above declarations shall be liable to the full extent of the law.                                        *
'*The VBA code provided herewith has no guarantee whatsoever with it and any untoward effect(s) that occur(s) shall not be held   *
'*liable to the developer and it is taken as a legally binding fact that the user(s) of said code must have agreed to this        *
'*disclaimer, in order to use it.                                                                                                 *
'*Contact info can be found at https://github.com/4R3B3LatH34R7                                                                   *
'**********************************************************************************************************************************
'Can place the constants in each function if only some functions were required
Public Const kagyi = 4096       'U+1000
Public Const ah = 4129          'U+1021, +9 to include ou
Public Const athat = 4154       'U+103A
Public Const shiftF = 4153      'U+1039, for typing something under something
Public Const witecha = 4140     'U+102C
Public Const moutcha = 4139     'U+102B

'Return tokenized words using user-selectable optional separator and ability to reverse the Myanmar word string
Function MMRManipulator( _
    target As Range, _
    Optional lWrapper As String = "", _
    Optional separator As String = "|", _
    Optional rWrapper As String = "", _
    Optional reversed As Boolean = False) As String

    Dim ch As String
    Dim returnString As String
    Dim charCounter As Integer
    Dim previousChIsAthat As Boolean
    Dim shiftFfound As Boolean
    Dim previousCharAt As Long 'Long, as in MMRTokenizer, so Len(...) + 1 cannot overflow
    Const defaultSeparator As String = "|"

    If target.Cells.CountLarge > 1 Then MMRManipulator = ">1Cell!": Exit Function
    returnString = "": previousChIsAthat = False: shiftFfound = False: previousCharAt = Len(target.Value) + 1
    If target.CountLarge = 1 Then
        If target.Value <> "" Then
            'Scan the text from right to left, one character at a time
            For charCounter = Len(target.Value) To 1 Step -1
                ch = Mid(target.Value, charCounter, 1)
                If AscW(ch) <> shiftF Then
                    If Not shiftFfound Or AscW(ch) = athat Then
                        If AscW(ch) <> athat Then
                            If AscW(ch) >= kagyi And AscW(ch) < ah + 9 Then
                                If Not previousChIsAthat Then
                                    'Prepend the new token, or append it when reversed = True
                                    returnString = IIf(reversed, returnString, Mid(target.Value, charCounter, previousCharAt - charCounter)) & _
                                                   IIf(Len(returnString) > 0, defaultSeparator, "") & _
                                                   IIf(reversed, Mid(target.Value, charCounter, previousCharAt - charCounter), returnString)
                                    previousCharAt = charCounter
                                Else
                                    'First consonant met after an athat is skipped
                                    previousChIsAthat = False
                                End If
                            Else
                                If AscW(ch) = witecha Or AscW(ch) = moutcha Then
                                    previousChIsAthat = False
                                End If
                            End If
                        Else
                            previousChIsAthat = True
                            If shiftFfound Then shiftFfound = False
                        End If
                    Else
                        shiftFfound = False
                        If previousChIsAthat Then previousChIsAthat = False
                    End If
                Else
                    shiftFfound = True
                End If
            Next charCounter
            If InStr(returnString, defaultSeparator) > 0 Then 'check for names like may??
                returnString = Replace(returnString, defaultSeparator, separator)
            End If
            'Wrap every token with the left and right wrappers
            returnString = lWrapper & Join(Split(returnString, separator), rWrapper & separator & lWrapper) & rWrapper
        End If
    End If
    MMRManipulator = returnString
End Function
```
### 3.3.getMMRConsonants
This UDF was designed in the earlier stages of development of MMRTokenizer to help me identify, check and confirm the location of Myanmar consonants in a cell containing Myanmar word(s).\
There are altogether 4 possible arguments that can be passed when calling it.
1. target range (required)
2. reversed order (optional with default=false)
3. last character only (optional with default=false)
4. location of consonants (optional with default=false)

Apart from the target range, the rest are optional.\
The arguments are pretty obvious and I ... ...
