Foreword
With lexical analysis finished, we now have a token stream. The next step is to implement a parser that turns the input token stream into an abstract syntax tree (AST). Unlike the lexer, however, the parser cannot simply be written by hand straight away; some background knowledge is needed first.
Earlier blog posts have already covered this background.
The complete code for the project is in C2j-Compiler.
What is parsing?
If lexical analysis is seen as combining characters into words and outputting a stream of words, then parsing can be seen as checking whether those words conform to the grammar. Lexical analysis validates words with regular expressions or hand-written matching, whereas syntax analysis uses a context-free grammar (CFG).
A formal grammar G = (N, Σ, P, S) is context-free if every production rule in P has the form V -> w, where V ∈ N and w ∈ (N ∪ Σ)*. It is called "context-free" because the nonterminal V can always be replaced by the string w, regardless of the context in which V appears. A formal language is context-free if it is generated by a context-free grammar.
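As a rough illustration of this definition, the Java sketch below stores a grammar as its four components N, Σ, P and S, and checks that every production has a single nonterminal on the left and only known symbols on the right. The class names `Production` and `ContextFreeGrammar` and the sample grammar E -> E + n | n are made up for this example; they are not taken from C2j-Compiler.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper type: one production V -> w.
final class Production {
    final String head;       // V: a single nonterminal
    final List<String> body; // w: a sequence over N ∪ Σ (empty list means ε)

    Production(String head, String... body) {
        this.head = head;
        this.body = Arrays.asList(body);
    }
}

// Hypothetical helper type: the four components of G = (N, Σ, P, S).
final class ContextFreeGrammar {
    final Set<String> nonterminals;     // N
    final Set<String> terminals;        // Σ
    final List<Production> productions; // P
    final String start;                 // S

    ContextFreeGrammar(Set<String> n, Set<String> sigma, List<Production> p, String s) {
        this.nonterminals = n;
        this.terminals = sigma;
        this.productions = p;
        this.start = s;
    }

    // True when S ∈ N and every production has V ∈ N and w ∈ (N ∪ Σ)*.
    boolean isWellFormed() {
        if (!nonterminals.contains(start)) return false;
        for (Production p : productions) {
            if (!nonterminals.contains(p.head)) return false;
            for (String symbol : p.body) {
                if (!nonterminals.contains(symbol) && !terminals.contains(symbol)) return false;
            }
        }
        return true;
    }

    // Sample grammar (an assumption for this sketch): E -> E + n | n
    static ContextFreeGrammar sample() {
        Set<String> n = new HashSet<>(Arrays.asList("E"));
        Set<String> sigma = new HashSet<>(Arrays.asList("+", "n"));
        List<Production> p = Arrays.asList(
            new Production("E", "E", "+", "n"),
            new Production("E", "n"));
        return new ContextFreeGrammar(n, sigma, p, "E");
    }

    public static void main(String[] args) {
        System.out.println(sample().isWellFormed()); // prints true
    }
}
```

Note that the check never looks at what surrounds a nonterminal; the single-symbol left-hand side is exactly what makes the grammar context-free.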
Backus-Naur Form (BNF)
Backus-Naur Form (also known as Backus Normal Form, BNF) is a notation for describing context-free grammars.
Look at an example:
S -> AB
A -> aA | ε
B -> b | bB
Here S, A and B are called nonterminals: symbols that can be expanded into new symbols by derivation. They are the same nonterminals defined earlier in the Token class. a and b are called terminals: symbols that cannot be expanded any further by derivation. ε denotes the empty string.
Each line above is a production rule (also called a derivation rule) and describes how a nonterminal can be rewritten.
S is the start symbol.
A string consisting only of terminal symbols is called a sentence.
For example, these three productions show that bbb conforms to the grammar: S => AB => B => bB => bbB => bbb.
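To make "conforms to the grammar" concrete, here is a minimal sketch of a recognizer for the example grammar, written in recursive-descent style with one method per nonterminal. The class name `AbRecognizer` is hypothetical, and this is only an illustration of checking a sentence against productions; it is not the LALR(1) approach the project uses.

```java
// Hypothetical recognizer for: S -> AB, A -> aA | ε, B -> b | bB
final class AbRecognizer {
    private final String input;
    private int pos = 0;

    private AbRecognizer(String input) {
        this.input = input;
    }

    // True when the whole string is a sentence of the grammar.
    static boolean accepts(String s) {
        AbRecognizer r = new AbRecognizer(s);
        return r.parseS() && r.pos == s.length();
    }

    // S -> A B
    private boolean parseS() {
        parseA();
        return parseB();
    }

    // A -> a A | ε
    private void parseA() {
        if (peek() == 'a') {
            pos++;
            parseA();
        }
        // otherwise A -> ε: consume nothing
    }

    // B -> b | b B
    private boolean parseB() {
        if (peek() != 'b') return false;
        pos++;
        if (peek() == 'b') return parseB(); // B -> b B
        return true;                        // B -> b
    }

    // Next character, or '\0' at end of input.
    private char peek() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    public static void main(String[] args) {
        System.out.println(accepts("bbb")); // prints true
        System.out.println(accepts("aab")); // prints true
        System.out.println(accepts("ba"));  // prints false
    }
}
```

The grammar generates zero or more a's followed by one or more b's, so a single character of lookahead is enough to choose between the alternatives of A and B.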
Several methods of parsing
As mentioned before, parsing methods fall into two kinds: top-down and bottom-up.
I took notes on these methods while learning them earlier, so I won't repeat them here:
Recursive descent and LL(1) parsing
Bottom-up parsing
Here I'll briefly cover the parsing method used in this project, LALR(1), which is a bottom-up parsing algorithm.
Bottom-up parsing
Bottom-up parsing corresponds to building a parse tree for the input string: it starts from the leaf nodes and works upward, step by step, to the root through shift and reduce operations.
Bottom-up parsing needs a stack to store the symbols being parsed. Take the following grammar as an example:
0. statement -> expr
1. expr -> expr + factor
2. | factor
3. factor -> ( expr )
4. | NUM
Parsing 1 + 2:
stack | input | action |
---|---|---|
null | 1 + 2 | |
NUM | + 2 | Read a character, convert it into the corresponding token and push it onto the stack; this is called a shift |
factor | + 2 | By the production factor -> NUM, pop NUM and push factor; this is called a reduce |
expr | + 2 | Reduce again, this time by expr -> factor. Because expr has two productions, the parser must look ahead at the input to decide whether to shift or reduce; this is the LA (look-ahead) in LALR |
expr + | 2 | shift |
expr + NUM | null | shift |
expr + factor | null | reduce by factor -> NUM |
expr | null | reduce by expr -> expr + factor |
statement | null | reduce by statement -> expr |
At this point the stack has been reduced to the start symbol and the input is empty, which means parsing succeeded.
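The trace above can be reproduced with a small sketch. It assumes the input has already been tokenized into the names NUM, +, ( and ), and it decides when to reduce by naively matching the top of the stack against every production body, longest first. That works for this tiny grammar but is not the LALR(1) state machine; the class name `NaiveShiftReduce` is hypothetical. Production 0 is applied only once the input is exhausted, which stands in for the look-ahead decision in the table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Naive shift-reduce sketch (an illustration, not the project's parser).
final class NaiveShiftReduce {
    // Each row is {head, body...}; longer bodies come first so that
    // "expr + factor" is preferred over "factor".
    private static final String[][] PRODUCTIONS = {
        {"expr", "expr", "+", "factor"}, // 1. expr -> expr + factor
        {"factor", "(", "expr", ")"},    // 3. factor -> ( expr )
        {"expr", "factor"},              // 2. expr -> factor
        {"factor", "NUM"},               // 4. factor -> NUM
    };

    // Parses the tokens and returns the stack contents after every step.
    static List<String> trace(List<String> tokens) {
        List<String> stack = new ArrayList<>();
        List<String> steps = new ArrayList<>();
        int next = 0;
        boolean accepted = false;
        while (!accepted) {
            if (!reduceOnce(stack)) {
                if (next < tokens.size()) {
                    stack.add(tokens.get(next++));   // shift
                } else if (stack.equals(Arrays.asList("expr"))) {
                    stack.set(0, "statement");       // 0. statement -> expr
                    accepted = true;
                } else {
                    throw new IllegalArgumentException("syntax error, stack: " + stack);
                }
            }
            steps.add(String.join(" ", stack));
        }
        return steps;
    }

    // Replaces the first production body that matches the top of the
    // stack with its head; returns false when nothing matches.
    private static boolean reduceOnce(List<String> stack) {
        for (String[] p : PRODUCTIONS) {
            int from = stack.size() - (p.length - 1);
            if (from < 0) continue;
            List<String> top = stack.subList(from, stack.size());
            if (top.equals(Arrays.asList(p).subList(1, p.length))) {
                top.clear();     // pop the body (subList is a view of the stack)
                stack.add(p[0]); // push the head
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String step : trace(Arrays.asList("NUM", "+", "NUM"))) {
            System.out.println(step);
        }
    }
}
```

For the tokens NUM + NUM, the printed stack contents follow the same sequence as the stack column of the table, from NUM down to statement.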
So the key to implementing bottom-up parsing is deciding whether to shift or reduce given the current stack.
- Brute force: search all productions of the grammar for one that matches the symbols on the stack
- Build a state machine that uses the stack state after each push or pop to decide whether to reduce
So the next task is to build a finite-state automaton that can guide the parser's actions.
Summary
In fact, the background knowledge here amounts to understanding what parsing does and roughly how it is done.
Syntax analysis checks whether the input token stream conforms to the grammar. The algorithm used for this step is bottom-up, that is, it works from the leaf nodes up to the root of the tree.
Also on my GitHub blog: https://dejavudwh.cn/