Explain Input Buffering. What is the drawback of using one buffer scheme and explain how it is overcome?
We often have to look one or more characters beyond the next lexeme, before we can be sure that we have the right lexeme.
For example, the symbols '<' or '=' could be the beginning of '<<', '<=', ==', etc.
Hence we need two buffer schemes to handle the large lookahead safely. The buffer pairs consist of a lexeme begin pointer (lbp) and a forward pointer (fp).
The lexeme begin pointer marks the beginning of the current lexeme, and the forward pointer scans ahead until a pattern match is found.
This concept is called Input Buffering.
When using a single buffer scheme, if the end of the current file/block is reached before the expression ends, the lexical analyzer might return an unexpected token.
For example, for input string E = M * C ** 2
... | ... | E | = | M | * | C | * | EOF | ... | ... |
lbp ↑ | fp ↑ |
When EOF is reached in the middle of the expression, the lexical analyzer emits '*' as a token; reloads the input buffer, emits the next '*' as another token.
Two '*'s are returned instead of '**', hence invalidating the expression.
This drawback is overcome by using two buffer schemes.
... | E | = | M | * | C | * | EOF | ... |
lbp ↑ |
... | * | 2 | EOF | ... |
fp ↑ |
When we use two buffers, when the input ends prematurely, the other buffer is loaded with the next file’s contents. Hence, the forward pointer moves to the next file to process the continuing input. The lexical analyzer hence emits '**' as the token.