tsPEG

所属分类:操作系统开发
开发工具:TypeScript
文件大小:0KB
下载次数:0
上传日期:2023-06-29 20:34:01
上 传 者sh-1993
说明:  TypeScript的PEG分析器生成器
(PEG Parser Generator for TypeScript)

文件列表:
.eslintignore (51, 2023-12-01)
.eslintrc.js (1684, 2023-12-01)
CHANGELOG.md (3653, 2023-12-01)
LICENSE (16725, 2023-12-01)
assets/ (0, 2023-12-01)
assets/calc.png (27476, 2023-12-01)
assets/computed.png (25744, 2023-12-01)
assets/example1.png (46375, 2023-12-01)
assets/no_eof_symb.png (14175, 2023-12-01)
assets/sum.png (18994, 2023-12-01)
assets/with_eof_symb.png (19464, 2023-12-01)
demos/ (0, 2023-12-01)
demos/calculator/ (0, 2023-12-01)
demos/calculator/cli.ts (480, 2023-12-01)
demos/calculator/eval.ts (776, 2023-12-01)
demos/calculator/grammar.peg (877, 2023-12-01)
demos/calculator/tsconfig.json (125, 2023-12-01)
demos/calculator_with_computed_properties/ (0, 2023-12-01)
demos/calculator_with_computed_properties/cli.ts (441, 2023-12-01)
demos/calculator_with_computed_properties/grammar.peg (1477, 2023-12-01)
demos/calculator_with_computed_properties/tsconfig.json (125, 2023-12-01)
demos/json_parser/ (0, 2023-12-01)
demos/json_parser/grammar.peg (2036, 2023-12-01)
demos/json_parser/test.ts (448, 2023-12-01)
demos/json_parser/tsconfig.json (125, 2023-12-01)
demos/lookahead/ (0, 2023-12-01)
demos/lookahead/grammar.peg (1270, 2023-12-01)
demos/lookahead/test.ts (1151, 2023-12-01)
demos/lookahead/tsconfig.json (125, 2023-12-01)
gen-tests.sh (400, 2023-12-01)
... ...

![NPM Build Status](https://github.com/EoinDavey/tsPEG/actions/workflows/npm.yaml/badge.svg) # tsPEG : A PEG Parser Generator for TypeScript *tsPEG* is a PEG Parser generator for TypeScript. *tsPEG* takes in an intuitive description of a grammar and outputs a fully featured parser that takes full advantage of the TypeScript type system. ## Installation `tspeg` can be installed by running ``` npm install -g tspeg ``` ## Features - Fully featured PEG support, more powerful than CFGs. - Infinite lookahead parsing, no restrictions. - Regex based lexing, implicit tokenisation in your grammar specification. - Tight typing, generates classes for all production rules, differentiable using discriminated unions. - Memoisation (packrat parsing) support for guaranteed `O(n)` parse times. ## Demos The `demos` directory in the git repo contains several example grammars with explanations. These include 2 variants on a "calculator" which can evaluate expressions like `10 * (2-4)` and a fully functional JSON parser. ## CLI Usage The CLI invocation syntax is as follows `tspeg [output-file]` This generates a TypeScript ES6 module that exports a parse function, as well as classes that represent your AST. If `[output-file]` is omitted the parser is written to STDOUT. Run `tspeg --help` to see usage and available flags. Flags supported: - `--enable-memo`: Enable [memoisation](#memoisation); Get better performance at the expense of increased memory usage. - `--num-enums`: Use numeric enums instead of strings for AST kinds. Slightly reduces memory footprint of syntax trees. - `--regex-flags`: Add additional flags to all generated regex expressions. For example: Set `--regex-flags=u` to enable [Unicode property escapes.](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Unicode_Property_Escapes) ## Parser Usage The generated module exports a `parse` function that accepts an input string, and returns a `ParseResult` object like this: ```TypeScript class ParseResult { ast: START | null; errs: SyntaxErr[]; } export function parse(input: string): ParseResult { ... } ``` If the `errs` field is non-empty, then syntax errors were found, otherwise the AST is stored in the `ast` field. (`START` will be replaced with the type of your starting grammar rule). ## Grammar Syntax *tsPEG* grammars are specified with a simple syntax, similar to the classic EBNF syntax. Grammars are composed of a sequence of rules, each rule defines what text that it should match, using regex literals, names of rules, and powerful [operators](#operators) like `|` for choice. Each rule is defined by a name, followed by a `:=` sign and then a "rule description". Rule descriptions are easiest seen by example here: ``` hello := 'Hello ' helloWorld := hello 'World' helloChoice := hello 'Mars' | helloWorld ``` - The first line defines a rule `hello` which matches the string 'Hello ' directly. - The next line defines the rule `helloWorld`, this rule first matches our first rule `hello`, then matches the string 'World', i.e it matches the string "Hello World". - The third line uses the `|` operator to make a choice between two options. 1. First this rule will try to match the left hand side of the operator: `hello 'Mars'`, which as before will first match the rule `hello`, then the regex literal `Mars`. 2. If this left hand side fails, we move on to trying the right hand side, which is just a reference to the `helloWorld` rule. In practice this means it either matches "Hello World" or "Hello Mars" *tsPEG* **starts parsing with the first rule in the grammar** so to match the `helloChoice` rule we should either move it to the start or defined a `start` rule that points to it like this: ``` start := helloChoice hello := 'Hello ' helloWorld := hello 'World' helloChoice := hello 'Mars' | helloWorld ``` Putting this grammar into a file called "grammar.peg" and running `tspeg grammar.peg parser.ts` we get our parser for this simple grammar in the file "parser.ts". We can compile this (target at least es2015) and load it into NodeJS to test it. ![Example 1](assets/example1.png) As you can see we get no errors for 'Hello World', and 'Hello Mars', this means the match was successful. But when we try 'Hello Jupiter' we get do get an error in the error list. It lists the location of the error, and the expected matches at that location, namely it expected to have a regex match for 'Mars' or 'World'. You can see more about errors in the [Syntax Error section.](#syntax-errors) Notably the `ast` field of the parse result is empty for 'Hello World' and 'Hello Mars', it only contains a `kind` field that tell us which rule was matched. ([See more on kinds](#kind-checking)). This is because we haven't told *tsPEG* what elements of the grammar we would like to be returned. When we write a rule definition, we can specify the fields we'd like to save in the AST by assigning them with an `=` sign. For example, say we want to match a string that's the sum of 2 numbers, e.g. "2+3", "123+456", we'd like to save both side of the sum in our AST, we can do this by writing a grammar like the following. *Note that we had to write '\\+' to escape the '+' symbol as the '+' symbol has special meaning in regex*. ``` sum := left=num '\+' right=num num := '[0-9]+' ``` Running *tsPEG* on this input and testing the parser in node we can see that the parser result has saved the left and right numbers of our sum. Note that we didn't have to assign the result of the `[0-9]+` rule in the `num` rule. This is because rules that are just references to other rules are saved implicitly. ![Sum example](assets/sum.png) Now that we have saved the numbers of our sum, it's easy to write a program to compute the result of adding the numbers. Each rule that assigns results to the AST is exported as a class or interface with the same name as the rule, so we can just import the `sum` interface from the parser to write our function. For this grammar we get an interface like: ```TypeScript interface sum { left: string; right: string; } ``` Allowing us to write our calculator function like: ```TypeScript import { sum } from "./parser"; export function add(ast: sum): number { return parseInt(ast.left) + parseInt(ast.right); } ``` Calling this file "calc.ts" we can use it to calculate our sums: ![Calc screenshot](assets/calc.png) We can also use [computed properties](#computed-properties) to calculate the result of the sum during the parsing process. ## Operators *tsPEG* supports a set of powerful operators to build complex grammars. The only operator we have seen so far is the `|` operator, which allows us to make choices between two rule expressions, however there are many more. - The `?` operator is used to make a match optional. For example in the rule ``` rule := 'I ' 'really '? 'love tsPEG' ``` The match for the 'really ' string has been marked optional, so this rule can match either "I really love tsPEG", or "I love tsPEG". - The `+` operator matches 1 or more copies of the match it's applied to, for example: ``` rule := 'It\'s a ' 'long '+ 'way away' ``` This rule matches "It's a long way away", "It's a long long way away", etc. for any amount of "long"s that's not 0. When this operator is used, a list of results is attached to the AST. - The `*` operator is the same as the `+` operator but it allows zero matches. - You can specify a specific number of matches by adding square brackets notation: `rule := 'Hello'[2]` will match exactly "HelloHello" (2 matches). You can specify only a lower bound: `rule := 'World'[3, ]` will match any sequence like "WorldWorldWorldWorld" with at least 3 matches (but will not match less). You can specify both an upper and lower bound: `rule := 'tsPEG'[5, 10]` to limit the number of possible matches. - The `!` is called the "*negative lookahead*" operator, and it does exactly what it says on the tin. This operator inverts the result of the match, meaning you can specify a rule by what it should not match. For example the rule ``` rule := 'The banned word is ' !'Macbeth' '[a-zA-Z]+' ``` This will match the phrase "The banned word is X" for any value of X, except when X is 'Macbeth' - The `&` operator is called the "*positive lookahead*" operator, this operator will change a match so that it will test for the match, and fail if it doesn't work, but it will not consume the input. This allows you to lookahead at what comes next in the string, but not to consume it. ## Sub-rules Inline sub-rules can be specified in *tsPEG*. These use the `{` and `}` brackets for grouping, and allow you to write smaller rules inline in a larger rule. For example ``` rule := 'start' {some optional part}? 'finish' ``` We want the part in the middle to be optional, so we wrap it in `{` and `}` and apply the `?` operator. The `{}` brackets create a sub-rule. Note that if you want to assign the subrules values to the tree you do need to assign a name to it e.g. `rule := 'start' middle={some optional part}? 'finish'`. ## Syntax Errors A `SyntaxErr` object is composed of two fields, a `pos` field with the position of the error, and `expmatches` which contains a list of expected matches. It also supports a `toString` method to return a readable text error message. (In production though it is recommended to use a more ad-hoc custom method). Matches can be either a `RegexMatch`, or an `EOFMatch`. These can be disambiguated by checking the `kind` field (not the same as the AST `kind` fields). Each type contains a field "negated" denoting whether the match was expected to be successful or unsuccessful (e.g. if the `!` operator was used). `RegexMatch` objects are for attempts at matching regex literals, these are the most common. `EOFMatch` objects are for attempts at matching the `EOF` (`$`) match (See the [section on EOF matching](#eof-matching)). The definitions for each are below: See details of the `PosInfo` object in the [Position Tracking section.](#position-tracking). ```TypeScript interface RegexMatch { kind: "RegexMatch"; negated: boolean; literal: string; } type EOFMatch = { kind: "EOF"; negated: boolean }; type MatchAttempt = RegexMatch | EOFMatch; interface PosInfo { overallPos: number; line: number; offset: number; } class SyntaxErr { public pos: PosInfo; public expmatches: MatchAttempt[]; public toString(): string { // ... } } ``` ## Comments C-style comments can be added to grammar spec files. Anything on a line after a // is a comment, and is ignored by the generator. ## Regex Modifiers You can add one of 4 regex modifiers to any regex literal to change their behaviour. These modifiers are: - 'i': Ignore case: Makes regex case insensitive. - 'm': Multiline: Make it so that `^` and `$` match start/end of individual lines and `.` does not match newlines. - 's': Single Line: Make it so that `^` and `$` match start/end of input and `.` *does* match newlines. - 'u': Unicode Support: Enable [Unicode property escapes.](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Unicode_Property_Escapes). To add a regex modifier simply type it immediately after the regex literal: e.g. `'a'i` matches both 'a' and 'A'. You can specify multiple modifiers e.g. `'^select$'im` enables case insensitivity and multiline mode. ## Computed Properties As well as assigning parsing results to variables and storing them on the AST, *tsPEG* also allows you to create **computed properties**, which are fields on the AST that are computed when the parser is run. Computed properties are added to a rule by appending a new expression after the rule description like ``` . = { } ``` Returning to our sum example from earlier we had a grammar to match strings like "2+3", "40+200" etc. ``` sum := left=num '\+' right=num num := '[0-9]+' ``` We can add computed properties to this to compute the value of this sum at parse time, instead of writing our own function to do it after. First we add a computed property to `num` to store the value of the number: ``` sum := left=num '\+' right=num num := literal='[0-9]+' .value = number { return parseInt(this.literal); } ``` As you can see we've assigned the text match to the field `literal`, and added a computed property called `value` which is the `literal` field parsed as an integer. Now we can add a computed property to `sum` to do the arithmetic, this is very simple ``` sum := left=num '\+' right=num .sum = number { return this.left.value + this.right.value } num := literal='[0-9]+' .value = number { return parseInt(this.literal); } ``` We use the computed property of the `num` rule to compute our `sum` property. Let's see this in action! Let's try and parse the string "240+100". ![Computed property](assets/computed.png) As you can see the AST has a field called `sum` with the correct value of 340 in it. The `left` and `right` also have their computed `value` properties of 240 and 100. ## Header The introduction of computed properties means that now you might want to import some types or functions into your parser. Luckily you can specify a header in the file that will be inserted directly into the generated parser. Anything at the top of the grammar file between two lines of three dashes `---` will be inserted straight into the generated parser, allowing you to write grammars like ``` --- import { myFunc, myType } from "./mypackage"; --- rule := hello='Hello World' .value = myType { return myFunc(this.hello); } ``` Another possible use for the header is for suppressing linting errors as in [#51](https://github.com/EoinDavey/tsPEG/issues/51). I would recommend excluding any tsPEG generated files from linting directly through the linter config (e.g. `.eslintignore`) but if you choose you can instead disable specific linting rules using the header. ## Kind Checking *tsPEG* uses [discriminated unions](https://www.typescriptlang.org/docs/handbook/advanced-types.html#discriminated-unions) to distinguish between types of AST nodes. Each AST node type has a field called `kind` which contains some value from the `ASTKinds` enum. This `kind` field can be used to differentiate between AST results. ### Example ``` Choice := Word | Int Word := word='[a-z]+' Int := val='[0-9]+' ``` When writing a function to process the parse tree for this grammar, you can check if `kind === ASTKinds.Word` or `kind === ASTKinds.Int` to see which rule was matched. ```TypeScript import { Choice, Parser, ASTKinds } from "./parser"; function makeChoice(c: Choice) { if(c.kind === ASTKinds.Word) { console.log('Matched word ', c.word) } else { console.log('Matched int ', c.val) } } ``` ### Kind names The names of the ASTKinds enum entries vary. - For simples rules like `Rule := name='regex'` the kind will be `ASTKinds.Rule`. - For rules with multiple choices such as `Rule := choiceA='regexA' | choiceB='regexB'` the `kind` will either be one of `ASTKinds.Rule_1` or `ASTKinds.Rule_2` depending on which rule was matched. In general they are of the form `ASTKinds._N` for the `N`th choice. - Rules that directly reference a different rule like `rule := otherrule` don't get their own AST type, so inherit the `kind` from the other rule. - Sub-rules are given the kind `ASTKinds._$N` for the `N`th subrule of rule `ParentRule`. For example in the rule `Rule := sub={ name='regex' }` the `kind` for the sub-rule is `ASTKinds.Rule_$0`. - If in doubt it's simple to inspect the generated parser file to find what the correct kind name is. The compiler will also be sure to tell you when you're wrong. ## Position tracking tsPEG supports a special match expression "`@`" for storing the positions of matches. Using `@` you can store the current position of the parser. The `@` expression returns a `PosInfo` object: ```TypeScript export interface PosInfo { // overallPos is the index of the input string readonly overallPos: number; // line is the line of the input readonly line: number; // offset is the number of characters from the // start of the line readonly offset: number; } ``` ### Example Returning to our addition of two numbers from earlier, which matches "12+56", "123+456" etc. ``` sum := left=num '\+' right=num num := '[0-9]+' ``` If we want to store the location of the "+" operator we can use the @ match like this: ``` sum := left=num pluspos=@ '\+' right=num num := '[0-9]+' ``` This stores the position of the parser into the AST before the '+' symbol, the generated interface for the AST looks like: ```TypeScript interface sum { left: string pluspos: PosInfo right: string } ``` ## EOF Matching tsPEG supports another special match expression `$` that matches the end of the input (traditionally referred to as `EOF` for End-Of-File). The `$` match will succeed only if we are at the end of the input. This can be used to ensure that all of the input is consumed. ### Example Consider this grammar: ``` hello := 'Hello World' ``` This grammar will successfully match the string "Hello World" as expected, *but* it will also match the string "Hello World, Goodbye Moon". It will match the "Hello World" at the start of sentence and not try to go any further. ![No EOF Symbol](assets/no_eof_symb.png) To fix this issue we can use the `$` symbol to require that the parser only succeeds if the EOF has been reached. ``` hello := 'Hello World' $ ``` If the EOF is not matched by the parser you will get an `EOFMatch` object in your expected matches in the returned `SyntaxErr`. ![With EOF Symbol](assets/with_eof_symb.png) ## Memoisation If your generated parser speed is slow, it might be due to excessive backtracking being required to parse your input. Memoisation can be enabled to solve this problem. By passing the flag `--enable-memo` to tsPEG, your generated parser will cache partial parsing results as it goes, which can drastically improve parse times, at the expense of increasing the parsers memory usage.

近期下载者

相关文件


收藏者