ziglyph

Category: Pattern recognition (vision/speech, etc.)
Development tool: Zig
File size: 1620KB
Downloads: 0
Upload date: 2023-03-27 22:16:42
Uploader: sh-1993
Description: Unicode text processing for the Zig programming language.

File list:
LICENSE (1077, 2023-08-16)
build.zig (579, 2023-08-16)
build.zig.zon (52, 2023-08-16)
src (0, 2023-08-16)
src\akcompress.zig (7946, 2023-08-16)
src\ascii.zig (15095, 2023-08-16)
src\autogen (0, 2023-08-16)
src\autogen\blocks.zig (60290, 2023-08-16)
src\autogen\case_folding.zig (65417, 2023-08-16)
src\autogen\derived_combining_class.zig (10280, 2023-08-16)
src\autogen\derived_core_properties.zig (357665, 2023-08-16)
src\autogen\derived_east_asian_width.zig (76124, 2023-08-16)
src\autogen\derived_general_category.zig (118075, 2023-08-16)
src\autogen\derived_normalization_props.zig (230587, 2023-08-16)
src\autogen\derived_numeric_type.zig (7787, 2023-08-16)
src\autogen\emoji_data.zig (38622, 2023-08-16)
src\autogen\grapheme_break_property.zig (41903, 2023-08-16)
src\autogen\hangul_syllable_type.zig (22854, 2023-08-16)
src\autogen\lower_map.zig (36884, 2023-08-16)
src\autogen\prop_list.zig (50715, 2023-08-16)
src\autogen\sentence_break_property.zig (79898, 2023-08-16)
src\autogen\title_map.zig (37398, 2023-08-16)
src\autogen\upper_map.zig (37302, 2023-08-16)
src\autogen\word_break_property.zig (43566, 2023-08-16)
src\category (0, 2023-08-16)
src\category\letter.zig (7051, 2023-08-16)
src\category\mark.zig (527, 2023-08-16)
src\category\number.zig (2449, 2023-08-16)
src\category\punct.zig (1835, 2023-08-16)
src\category\symbol.zig (1049, 2023-08-16)
src\collator (0, 2023-08-16)
src\collator\Collator.zig (22133, 2023-08-16)
src\data (0, 2023-08-16)
src\data\license (0, 2023-08-16)
src\data\license\UnicodeLicenseAgreement.html (3195, 2023-08-16)
src\data\license\standard_styles.css (8678, 2023-08-16)
... ...

# ziglyph

Unicode text processing for the Zig Programming Language.

## In-Depth Articles on Unicode Processing with Zig and Ziglyph

The [Unicode Processing with Zig](https://zig.news/dude_the_builder/series/6) series of articles on ZigNEWS covers important aspects of Unicode in general and, in particular, how to use this library to process Unicode text. Note that the examples in that series predate Zig v0.11, so changes may be necessary to make them work.

## Status

This is pre-1.0 software. Although breaking changes are less frequent with each minor version release, they will still occur until we reach 1.0.

## Zig Version

The main branch follows Zig's master branch, which is the latest dev version of Zig. There will also be branches and tags that work with the previous two (2) stable Zig releases.

## Integrating Ziglyph in your Project

### Zig Package Manager

In a `build.zig.zon` file, add the following to the dependencies object. Currently only `tar.gz` URLs are supported.

```zig
.ziglyph = .{
    .url = "https://github.com/jecolon/ziglyph/archive/refs/heads/main.tar.gz",
    .hash = "1220400c10661a8b8c88bed1c195b0f71fc2636a044fd312cc0d9b2b46232424b258",
}
```

If you get a hash mismatch error, update the hash field with whatever hash the compiler tells you it found.

Then in your `build.zig` file, add the following to the `exe` section for the executable where you wish to have Ziglyph available.

```zig
const ziglyph = b.dependency("ziglyph", .{
    .target = target,
    .optimize = optimize,
});
// for exe, lib, tests, etc.
exe.addModule("ziglyph", ziglyph.module("ziglyph"));
```

Now in the code, you can import components like this:

```zig
const ziglyph = @import("ziglyph");
const letter = @import("ziglyph").letter; // or const letter = ziglyph.letter;
const number = @import("ziglyph").number; // or const number = ziglyph.number;
```

### Using the `ziglyph` Namespace

The `ziglyph` namespace provides convenient access to the most frequently used functions related to Unicode code points and strings.

```zig
const std = @import("std");
const expect = std.testing.expect;
const expectEqual = std.testing.expectEqual;
const ziglyph = @import("ziglyph");

test "ziglyph namespace" {
    const z = 'z';
    try expect(ziglyph.isLetter(z));
    try expect(ziglyph.isAlphaNum(z));
    try expect(ziglyph.isPrint(z));
    try expect(!ziglyph.isUpper(z));
    const uz = ziglyph.toUpper(z);
    try expect(ziglyph.isUpper(uz));
    try expectEqual(uz, 'Z');

    // String toLower, toTitle, and toUpper.
    // Each result is caller-owned, so each gets its own defer.
    const allocator = std.testing.allocator;
    const lower = try ziglyph.toLowerStr(allocator, "AbC123");
    defer allocator.free(lower);
    try expect(std.mem.eql(u8, "abc123", lower));

    const upper = try ziglyph.toUpperStr(allocator, "aBc123");
    defer allocator.free(upper);
    try expect(std.mem.eql(u8, "ABC123", upper));

    const title = try ziglyph.toTitleStr(allocator, "thE aBc123 moVie. yes!");
    defer allocator.free(title);
    try expect(std.mem.eql(u8, "The Abc123 Movie. Yes!", title));
}
```

### Category Namespaces

Namespaces for frequently used Unicode General Categories are available. See [ziglyph.zig](src/ziglyph.zig) for a full list of all components.

```zig
const std = @import("std");
const expect = std.testing.expect;
const expectEqual = std.testing.expectEqual;
const letter = @import("ziglyph").letter;
const punct = @import("ziglyph").punct;

test "Category namespaces" {
    const z = 'z';
    try expect(letter.isLetter(z));
    try expect(!letter.isUpper(z));
    try expect(!punct.isPunct(z));
    try expect(punct.isPunct('!'));
    const uz = letter.toUpper(z);
    try expect(letter.isUpper(uz));
    try expectEqual(uz, 'Z');
}
```

## Normalization

In addition to the basic functions to detect and convert code point case, the `Normalizer` struct provides code point and string normalization methods. All normalization forms are supported (NFC, NFKC, NFD, NFKD).

```zig
const std = @import("std");
const testing = std.testing;
const Normalizer = @import("ziglyph").Normalizer;

test "normalizeTo" {
    const allocator = std.testing.allocator;
    var normalizer = try Normalizer.init(allocator);
    defer normalizer.deinit();

    // Canonical Composition (NFC)
    const input_nfc = "Complex char: \u{03D2}\u{0301}";
    const want_nfc = "Complex char: \u{03D3}";
    var got_nfc = try normalizer.nfc(allocator, input_nfc);
    defer got_nfc.deinit();
    try testing.expectEqualSlices(u8, want_nfc, got_nfc.slice);

    // Compatibility Composition (NFKC)
    const input_nfkc = "Complex char: \u{03A5}\u{0301}";
    const want_nfkc = "Complex char: \u{038E}";
    var got_nfkc = try normalizer.nfkc(allocator, input_nfkc);
    defer got_nfkc.deinit();
    try testing.expectEqualSlices(u8, want_nfkc, got_nfkc.slice);

    // Canonical Decomposition (NFD)
    const input_nfd = "Complex char: \u{03D3}";
    const want_nfd = "Complex char: \u{03D2}\u{0301}";
    var got_nfd = try normalizer.nfd(allocator, input_nfd);
    defer got_nfd.deinit();
    try testing.expectEqualSlices(u8, want_nfd, got_nfd.slice);

    // Compatibility Decomposition (NFKD)
    const input_nfkd = "Complex char: \u{03D3}";
    const want_nfkd = "Complex char: \u{03A5}\u{0301}";
    var got_nfkd = try normalizer.nfkd(allocator, input_nfkd);
    defer got_nfkd.deinit();
    try testing.expectEqualSlices(u8, want_nfkd, got_nfkd.slice);

    // String comparisons.
    try testing.expect(try normalizer.eql(allocator, "foé", "foe\u{0301}"));
    try testing.expect(try normalizer.eql(allocator, "foϓ", "fo\u{03D2}\u{0301}"));
    try testing.expect(try normalizer.eqlCaseless(allocator, "Foϓ", "fo\u{03D2}\u{0301}"));
    try testing.expect(try normalizer.eqlCaseless(allocator, "FOÉ", "foe\u{0301}")); // foÉ == foé
    // Note: eqlIdentifiers is not a method, it's just a function in the Normalizer namespace.
    try testing.expect(try Normalizer.eqlIdentifiers(allocator, "Foé", "foé")); // Unicode Identifiers caseless match.
}
```

## Collation (String Ordering)

One of the most common operations in string processing is sorting and order comparison. The Unicode Collation Algorithm was developed to address this area. The `Collator` struct implements the algorithm, allowing for proper sorting and order comparison of Unicode strings.

```zig
const std = @import("std");
const Collator = @import("ziglyph").Collator;

test "Collation" {
    var c = try Collator.init(std.testing.allocator);
    defer c.deinit();

    // Ascending / descending sort
    var strings = [_][]const u8{ "def", "xyz", "abc" };
    var want = [_][]const u8{ "abc", "def", "xyz" };

    std.mem.sort([]const u8, &strings, c, Collator.ascending);
    try std.testing.expectEqualSlices([]const u8, &want, &strings);

    want = [_][]const u8{ "xyz", "def", "abc" };
    std.mem.sort([]const u8, &strings, c, Collator.descending);
    try std.testing.expectEqualSlices([]const u8, &want, &strings);

    // Caseless sorting
    strings = [_][]const u8{ "def", "Abc", "abc" };
    want = [_][]const u8{ "Abc", "abc", "def" };

    std.mem.sort([]const u8, &strings, c, Collator.ascendingCaseless);
    try std.testing.expectEqualSlices([]const u8, &want, &strings);

    want = [_][]const u8{ "def", "Abc", "abc" };
    std.mem.sort([]const u8, &strings, c, Collator.descendingCaseless);
    try std.testing.expectEqualSlices([]const u8, &want, &strings);

    // Caseless / markless sorting: all three strings compare equal at the
    // base level, so the order is left unchanged in both directions.
    strings = [_][]const u8{ "ábc", "Abc", "abc" };
    want = [_][]const u8{ "ábc", "Abc", "abc" };

    std.mem.sort([]const u8, &strings, c, Collator.ascendingBase);
    try std.testing.expectEqualSlices([]const u8, &want, &strings);

    std.mem.sort([]const u8, &strings, c, Collator.descendingBase);
    try std.testing.expectEqualSlices([]const u8, &want, &strings);
}
```

### Tailoring with allkeys.txt

You can tailor the sorting of Unicode text by modifying the sort element weights found in [allkeys.txt.gz](src/data/uca/allkeys.txt.gz). Uncompress the file with `gunzip` and modify it as needed. To prepare the file for use with Ziglyph, you need to process and compress the data as follows. In the commands below, `$ZIGLYPH_DIR` stands for the path to your Ziglyph checkout and `$WORK_DIR` for a working directory of your choice.

```sh
$ cd $ZIGLYPH_DIR/src
$ zig build-exe -O ReleaseSafe akcompress.zig
$ mkdir $WORK_DIR
$ mv akcompress $WORK_DIR/
$ cp data/uca/allkeys.txt.gz $WORK_DIR/
$ cp data/uca/allkeys-diffs.txt.gz data/uca/allkeys-diffs.txt.gz.bak
$ cd $WORK_DIR
$ gunzip allkeys.txt.gz
$ vim allkeys.txt # <- Modify the file
$ ./akcompress
$ gzip -9 allkeys-diffs.txt
$ cp allkeys-diffs.txt.gz $ZIGLYPH_DIR/src/data/uca/
```

Now when you use the `Collator`, it will reflect the sort element weights you modified.

## Text Segmentation (Grapheme Clusters, Words, Sentences)

Ziglyph has iterators to traverse text as Grapheme Clusters (what most people recognize as *characters*), Words, and Sentences. All of these text segmentation functions adhere to the Unicode Text Segmentation rules, which may surprise you in terms of what's included and excluded at each break point. Test before assuming any results!

```zig
const std = @import("std");
const testing = std.testing;
const Grapheme = @import("ziglyph").Grapheme;
const GraphemeIterator = Grapheme.GraphemeIterator;
const Sentence = @import("ziglyph").Sentence;
const SentenceIterator = Sentence.SentenceIterator;
const ComptimeSentenceIterator = Sentence.ComptimeSentenceIterator;
const Word = @import("ziglyph").Word;
const WordIterator = Word.WordIterator;

test "GraphemeIterator" {
    const input = "H\u{0065}\u{0301}llo";
    var iter = GraphemeIterator.init(input);

    const want = &[_][]const u8{ "H", "\u{0065}\u{0301}", "l", "l", "o" };

    var i: usize = 0;
    while (iter.next()) |grapheme| : (i += 1) {
        try testing.expect(grapheme.eql(input, want[i]));
    }

    // Need your grapheme clusters at compile time?
    comptime {
        var ct_iter = GraphemeIterator.init(input);
        var j = 0;
        while (ct_iter.next()) |grapheme| : (j += 1) {
            try testing.expect(grapheme.eql(input, want[j]));
        }
    }
}

test "SentenceIterator" {
    const allocator = std.testing.allocator;
    const input =
        \\("Go.") ("He said.")
    ;
    var iter = try SentenceIterator.init(allocator, input);
    defer iter.deinit();

    // Note the space after the closing right parenthesis is included as part
    // of the first sentence.
    const s1 =
        \\("Go.") 
    ;
    const s2 =
        \\("He said.")
    ;
    const want = &[_][]const u8{ s1, s2 };

    var i: usize = 0;
    while (iter.next()) |sentence| : (i += 1) {
        try testing.expectEqualStrings(sentence.bytes, want[i]);
    }

    // Need your sentences at compile time?
    @setEvalBranchQuota(2_000);

    comptime var ct_iter = ComptimeSentenceIterator(input){};
    const n = comptime ct_iter.count();
    var sentences: [n]Sentence = undefined;
    comptime {
        var ct_i: usize = 0;
        while (ct_iter.next()) |sentence| : (ct_i += 1) {
            sentences[ct_i] = sentence;
        }
    }

    // Zig 0.11+ for-loop syntax: the index comes from the 0.. range.
    for (sentences, 0..) |sentence, j| {
        try testing.expect(sentence.eql(want[j]));
    }
}

test "WordIterator" {
    const input = "The (quick) fox. Fast! ";
    var iter = try WordIterator.init(input);

    const want = &[_][]const u8{ "The", " ", "(", "quick", ")", " ", "fox", ".", " ", "Fast", "!", " " };

    var i: usize = 0;
    while (iter.next()) |word| : (i += 1) {
        try testing.expectEqualStrings(word.bytes, want[i]);
    }

    // Need your words at compile time?
    @setEvalBranchQuota(2_000);

    comptime {
        var ct_iter = try WordIterator.init(input);
        var j = 0;
        while (ct_iter.next()) |word| : (j += 1) {
            try testing.expect(word.eql(want[j]));
        }
    }
}
```

## Code Point and String Display Width

When working with environments in which text is rendered in a fixed-width font, such as terminal emulators, it's necessary to know how many cells (or columns) a particular code point or string will occupy. The `display_width` namespace provides functions to do just that.

```zig
const std = @import("std");
const expectEqual = std.testing.expectEqual;
const expectEqualSlices = std.testing.expectEqualSlices;
const dw = @import("ziglyph").display_width;

test "Code point / string widths" {
    // The width methods take a second parameter of value .half or .full to determine the width of
    // ambiguous code points as per the Unicode standard. .half is the most common case.

    // Note that codePointWidth returns an i3 because code points like backspace have width -1.
    try expectEqual(dw.codePointWidth('é', .half), 1);
    try expectEqual(dw.codePointWidth('😊', .half), 2);
    try expectEqual(dw.codePointWidth('统', .half), 2);

    const allocator = std.testing.allocator;

    // strWidth returns usize because it can never be negative, regardless of the code points it contains.
    try expectEqual(try dw.strWidth("Hello\r\n", .half), 5);
    try expectEqual(try dw.strWidth("\u{1F476}\u{1F3FF}\u{0308}\u{200D}\u{1F476}\u{1F3FF}", .half), 2);
    try expectEqual(try dw.strWidth("Héllo 🇵🇷", .half), 8);
    try expectEqual(try dw.strWidth("\u{26A1}\u{FE0E}", .half), 1); // Text sequence
    try expectEqual(try dw.strWidth("\u{26A1}\u{FE0F}", .half), 2); // Presentation sequence

    // padLeft, center, padRight
    const right_aligned = try dw.padLeft(allocator, "w😊w", 10, "-");
    defer allocator.free(right_aligned);
    try expectEqualSlices(u8, "------w😊w", right_aligned);

    const centered = try dw.center(allocator, "w😊w", 10, "-");
    defer allocator.free(centered);
    try expectEqualSlices(u8, "---w😊w---", centered);

    const left_aligned = try dw.padRight(allocator, "w😊w", 10, "-");
    defer allocator.free(left_aligned);
    try expectEqualSlices(u8, "w😊w------", left_aligned);
}
```

## Word Wrap

If you need to wrap a string to a specific number of columns according to Unicode Word boundaries and display width, you can use the `wrap` function in the `display_width` namespace. You can also specify a threshold value indicating how close a word boundary can be to the column limit and still trigger a line break.

```zig
const std = @import("std");
const testing = std.testing;
const dw = @import("ziglyph").display_width;

test "display_width wrap" {
    const allocator = testing.allocator;
    const input = "The quick brown fox\r\njumped over the lazy dog!";
    const got = try dw.wrap(allocator, input, 10, 3);
    defer allocator.free(got);
    const want = "The quick\n brown \nfox jumped\n over the\n lazy dog\n!";
    try testing.expectEqualStrings(want, got);
}
```
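
To check how a wrapped string actually fits the column limit, it can help to print each resulting line next to its display width. The following is a minimal sketch, not part of the library's own examples. It assumes `dw.wrap` and `dw.strWidth` behave as in the tests above, and it uses `std.mem.splitScalar` from the Zig standard library (available from Zig 0.11). The test name and the sample text are illustrative only, and no expected output is shown because it depends on the library's break rules.

```zig
const std = @import("std");
const dw = @import("ziglyph").display_width;

test "inspect wrapped lines and their display widths" {
    const allocator = std.testing.allocator;

    // Same call shape as the wrap example: 10 columns, threshold of 3.
    const wrapped = try dw.wrap(allocator, "The quick brown fox jumped over the lazy dog!", 10, 3);
    defer allocator.free(wrapped);

    // wrap separates lines with '\n', so split on that byte and
    // report the display width of each line in cells.
    var lines = std.mem.splitScalar(u8, wrapped, '\n');
    while (lines.next()) |line| {
        const width = try dw.strWidth(line, .half);
        std.debug.print("{d:>2} | {s}\n", .{ width, line });
    }
}
```

Printing rather than asserting keeps the sketch valid regardless of exactly where the library places each break, which is in line with the advice to test before assuming any results.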
