THE FUNCTIONAL TREATMENT OF PARSING

by Rene Leermakers


FOREWORD

Formal-language theory, theoretical linguistics and computational
linguistics have shared roots in the 1950s, with the seminal work of
Kleene, Chomsky, Miller and Bar-Hillel on regular languages and
phrase-structure grammars. However, various social, cultural and
technological factors have since then conspired to split those
disciplines and weaken their understanding and appreciation of each
other.  Efficiency considerations and the fact that programming
languages are human artifacts may partly justify the focus on
deterministic languages and parsers in the theory of context-free
parsing. However, natural languages are highly ambiguous and thus
non-deterministic, making much of that theory seem irrelevant to
natural-language parsing.  It has thus been difficult to convince the
computational linguist of the importance of context-free-parsing
theory, if not for specific algorithms, then for concepts and
techniques essential to the rigorous analysis of natural-language
parsers.  Looking in the other direction, the formal-language theorist
is mostly unaware of the special problems of natural-language parsing,
and thus not only misses a potentially rich area for new research but
also fails to appreciate the efforts of computational linguists. For
these reasons, the publication of "The Functional Treatment of
Parsing" is doubly welcome.

For the computational linguist, Rene Leermakers's book brings out
the relevance of the theory of context-free parsing to
natural-language parsing.  His innovative use of functional notation
makes algorithms and their derivation less mysterious, and eliminates
much of the need for the laborious inductive proofs of correctness
found in other parsing theory texts.  In addition, the functional
approach ties well with the widespread acquaintance of current and
recent students with the functional programming paradigm through
languages such as Scheme and ML.  Delicate data structure issues in
parsing are clearly located in elegant abstractions representing
nondeterminism and result reuse.

For the computer scientist, "The Functional Treatment of Parsing"
offers a fresh and unified perspective on a variety of parsing
algorithms, some well-known and some less so. This new perspective
offers much simpler proofs of correctness and computational
complexity, and eliminates the artificial distinction between
stack-based and tabular parsers. The use of equational reasoning
rather than special-purpose inductive proofs to relate algorithms to
their specifications is an excellent application of an approach to
program derivation and verification that has received strong support
in the works of Boyer, Moore, Dijkstra and Gries. From a more
practical angle, Leermakers's approach provides a theoretical basis
for the parsing component of interactive language-development
environments, for which the standard deterministic parsing methods
have been proven unwieldy.

Parsing theory has many subtleties, requiring attentive and thoughtful
study. While the present book does not excuse the student from those
obligations, it will provide ample rewards to readers at all levels.
In addition to a self-contained and elegant treatment of all the main
ideas of context-free parsing, it brings out the underlying unity of
the subject as no other book I know of, and offers a wealth of
conceptual and technical riches, of which I particularly enjoyed the
application of Lambek types to the analysis of grammatical covers and
attribute grammars.

There have been increasing signs in the research literature of a
long-overdue convergence between formal-language theory and
computational linguistics, in particular in the area of context-free
parsing. "The Functional Treatment of Parsing" not only
demonstrates that convergence for the first time in book form, but
also revives context-free parsing theory as an interesting and
relevant topic for computational linguists and computer scientists
alike.

Fernando C.N. Pereira.


FOREWORD by Fernando Pereira                                  i
PREFACE                                                     vii
1 CONTEXT-FREE GRAMMAR                                        1
2 BUNCH NOTATION                                              7
   2.1 Bunches                                                8
   2.2 Algorithmic Interpretation                            12
3 GRAMMAR INTERPRETATIONS                                    15
   3.1 The natural interpretation                            15
   3.2 Derivation                                            20
   3.3 The Lambek types                                      23
   3.4 Recognition functions                                 26
   3.5 Generation                                            28
   3.6 Summary of interpretations                            29
4 RECURSIVE DESCENT                                          33
   4.1 The functional interpretation                         33
   4.2 Termination                                           35
   4.3 Complexity and memoization                            35
   4.4 Look ahead                                            38
   4.5 Error recovery                                        42
5 GRAMMAR TRANSFORMATIONS                                    45
   5.1 Making grammars bilinear                              45
   5.2 Recursive descent for E(G)                            48
   5.3 Partial elimination of left recursion                 49
   5.4 Recursive descent for F(G)                            56
6 RECURSIVE ASCENT                                           61
   6.1 The algorithm                                         62
   6.2 Termination                                           65
   6.3 A variant that works with strings                     66
   6.4 Complexity                                            67
   6.5 EBNF grammars                                         69
7 PARSE FOREST                                               73
   7.1 Informal introduction                                 73
   7.2 The grammar E'(G)                                     75
   7.3 Forest for bilinear grammars                          76
   7.4 The set Q                                             80
   7.5 Standard Earley parser                                83
   7.6 Earley versus Earley                                  84
8 ATTRIBUTE GRAMMARS                                         87
   8.1 Notational conventions                                88
   8.2 Attribute functions                                   90
   8.3 Example                                               92
   8.4 Function graphs                                       96
   8.5 Attribute grammar parser                              99
   8.6 Direct attribute evaluation                          100
9 LR PARSERS                                                111
   9.1 LR(0) recognizer                                     111
   9.2 The deterministic case                               116
   9.3 Implementation with stacks                           119
   9.4 Some variants                                        122
   9.5 Look ahead                                           124
   9.6 Attributes                                           126
   9.7 Continuations                                        127
   9.8 Error recovery                                       130
   9.9 The methods by Lang and Tomita                       133
   9.10 Evaluation w.r.t. standard approaches               134
   9.11 Earley versus LR                                    135
10 SOME NOTES
   10.1 Context-free grammars                               139
   10.2 Names                                               140
   10.3 Bunches                                             140
   10.4 Functional programming                              141
   10.5 Grammar transformations                             141
   10.6 Memo-functions                                      142
   10.7 Parse forests                                       142
   10.8 Earley                                              143
   10.9 Attribute grammars                                  143
   10.10 Natural language                                   144
   10.11 Other applications                                 144
   10.12 LR parsing                                         144
   10.13 EBNF                                               145
   10.14 Conclusion                                         145


PREFACE

The theory of parsing with respect to context-free grammars is one of the old
and established parts of computer science. The first contributions were
rather theoretical treatises within automata theory. Later on, contributions 
to the field came from people who were interested in practical applications 
such as compiler construction and natural language processing. 
The programming language community developed a vast amount of knowledge about 
deterministic (LL(k), PLR(k), LR(k), LALR(k), operator precedence, 
recursive descent ...) parsers. For analyzing natural language, these parsers 
are not useful. Instead, for the latter purpose, a large number of general 
parsing algorithms have been used. Among them are the algorithm attributed 
to Cocke, Younger and Kasami, and the Earley, chart, Sheil, and Tomita parsers.
The scientific communities of computational linguistics and compiler theory 
are rather different. It is my experience that the average researcher in either 
area underestimates the problems at the other side of the fence. One professor 
of computer science, for instance, once confided to me that it escaped him why 
``beautiful compiler construction tools like parser generators (he mentioned a 
specific one) are not being used for natural language processing''.
This professor had never realized that there are differences between
artificial languages and natural languages that have grave consequences
for parsing, such as the fact that natural language sentences are syntactically 
ambiguous.

This book might help bring both parsing communities closer together. In any
case, it brings together the techniques that are used at either side.
The current state of parsing theory reflects the status quo regarding the
two contributing communities. It suffers from a dichotomy that does not befit 
a mature field of knowledge. Parsers in compilers are typically implemented as 
deterministic push-down automata, the central data structure of which is a 
so-called parse stack. General parsing algorithms are mostly tabular, which 
means that a parse matrix is the central data structure.
In this essay, by contrast, deterministic and general parsing algorithms
are treated in a unified fashion. This is accomplished by adopting a 
functional formulation of parsing theory. In the new theory, factors 
that distinguish various parsing algorithms, such as stacks and parse 
matrices, are banned. Stacks are replaced by recursive functions, and 
parse matrices by memoizing functions (functions that remember past invocations). 
In the main course, some deficient parts of the existing body of knowledge 
are identified and repaired. A notable example of such a deficiency in the 
standard theory is the absence of simple functional implementations of LR
parsers. 

Many books on parsing theory are dedicated to the study of many classes of
grammars (such as LL(k), LR(k), ...). Such a class is determined 
by the requirement that a corresponding parser behaves deterministically. In 
this essay, whether or not a parser is deterministic is considered to be a 
marginal question. Our emphasis is on algorithms, not on grammar classes. 
Our main topic is theory, but the application of the theory is never far away, 
and the results are of direct practical relevance. Parsing theory is presented 
in a mathematical way, but the style of mathematics is not too rigorous. The
correctness of most algorithms is formally established, with proofs of 
calculational nature.

The theory presented in this essay leads to a new technique for implementing
parsers, which has been coined "recursive ascent parsing". The history 
of recursive ascent parsing is quite interesting. The standard formulation 
of LR parsing uses the concepts of automata theory. Looking for efficient 
implementations, Pennello came up with a technique borrowed from efficient 
recursive descent implementations, and in this way created, in a hidden way, 
the first recursive ascent LR parser. Based on this work, Roberts presented 
an implementation at a higher level of abstraction. Independently from Roberts 
and Pennello, Kruseman Aretz and Barnard and Cordy almost simultaneously 
proposed very similar ideas, with motivations of a theoretical kind. Their 
starting point was the standard LR parser.

Automata         This book's 
 Theory            theory
   |                 |
   |                 |
   V                 V
Standard          Recursive
LR parser -------> ascent
   |               parsing
   |                 |^
   |                 |
   ------------>  Penello

Figure 1

Figure 1 is a pictorial representation of this history, and displays the work 
reported here at the level of automata theory. More precisely, our theory is 
the functional equivalent of the theory of nondeterministic pushdown automata 
(NPDA), the class of automata that recognizes exactly the context-free
languages. The basic idea is simple but extraordinary powerful: a NPDA
state is `implemented' as a function. Then, state transitions correspond to 
function calls, and stack pops to function returns.

The expert reader may be surprised when he finds his favorite parsing algorithm 
described in this book. LR parsers, for example, are defined not only without 
parse stacks but also without parsing tables. The functional approach to 
parsing requires a new way of thinking, and in this respect it may sometimes 
be an advantage to be unacquainted with standard approaches. 

People that love functional programming will enjoy the new applications of 
this style of writing algorithms. For others, the same style may be a 
(hopefully temporary) stumbling block. For that reason, every now and then 
an example is given in an imperative style, that can be translated into any 
imperative programming language without much ado. The reader is adviced to 
actually do so whenever the level of abstraction becomes a problem.

My hope is that this essay will be of use to students, teachers, scientists 
and programmers. Parsing theory does not need heavy mathematics and only some 
standard mathematics skills are presupposed. The book is self-contained.
References to the literature are avoided in the content chapters. The last 
chapter is contemplative and contains many bibliographic notes. The results are 
presented without long explanatory elaborations, but this could make the book a 
bit terse for students. As an encouragement, it may be said that working through 
this book is a quick way to become an expert on parsing theory.
The book can be used to teach parsing, but it also contains interesting topics 
to touch upon in courses on formal language theory, compiler construction,
functional programming, computational linguistics, or program derivation.

This book was written to explain the new theory of chapters 6 to 9. The earlier
chapters are a prelude, providing background knowledge that is useful for 
building up some intuition about recursive ascent parsing algorithms. We start 
with an introduction to context-free grammars. Various interpretations of
context-free grammars are given.  One such interpretation involves a mapping 
from grammar symbols to multiple-valued functions, and leads directly to the 
recursive descent parsing method. As this method has some limitations, we are 
led to study grammar transformations as a way to adapt arbitrary grammars to 
these limitations. 
Historically, many results of chapters 6 and 9 have been obtained by grammar 
transformations. However, the chapters 6 to 9 do not depend on all of the 
preceding text, and in particular they can be understood without knowing about 
grammar transformations. One may therefore skip chapter 5 on first reading, 
if one is not interested in what is `behind it all'. 

In practical applications, context-free grammars are often used in conjunction 
with some form of attribute evaluation. Chapter 8 contains a functional 
treatment of attribute grammars, in such a way that parsing with respect to 
attribute grammars is a simple extension to pure context-free parsing. Two 
attribute grammar formalisms are presented. One is typically useful in compiler 
technology, where attributes are used for semantic purposes. The other is
meant for applications in which attributes play an important syntactic role, 
as in natural language parsing.

I am indebted to my family, for letting me work so many nights, and to 
various colleagues at Philips Research for having been instrumental. 
Jan Landsbergen stimulated me to express my views in the form of an essay. 
Frans Kruseman Aretz invented recursive ascent parsing and through his 
stimulating, if critical, comments contributed to this work in its early stages.
Lex Augusteijn profoundly influenced both my ideas and the way they appear in 
this book.


Rene Leermakers, February 1993.


-------------------------------------ORDER FORM------------------------------

Ref:  ftpser

Please send me: 
The Functional Treatment of Parsing, by Rene Leermakers
_____copy(ies) HB, ISBN 0-7923-9376-7, $79.95

  Payment enclosed to the amount of ___________________________

* Please invoice me 

* Please charge my credit card 

  Name of Card Holder: ______________________________________  

  Card. no.: ________________________________________________

  Expiry Date:______________________________________________

     Am. Ex.*          Visa*           Diners Club*           Mastercard*

Delivery address: 

Name: ___________________________________________________________________

Address: ________________________________________________________________

         ________________________________________________________________

         ________________________________________________________________

         ________________________________________________________________


Date:________________     Signature:_______________________________

To be sent to:


Outside North America                         In USA and Canada

KLUWER ACADEMIC PUBLISHERS GROUP              KLUWER ACADEMIC PUBLISHERS 
Order Dept.                                   Order Dept
P.O. Box 322                                  101 Philip Drive
3300 AH Dordrecht, The Netherlands            Norwell, 02016 MA
Tel: +31-78-524400                            Tel: 617-871-6600
Fax +31-78-524474.                            Fax: 617-871-6528
email:  vanderlinden@wkap.nl                  email: kluwer@world.std.com

Orders from individuals accompanied by payment or authorization to
charge a credit card account will ensure prompt delivery. Postage and
handling charges will be absorbed by the Publisher on all such orders.
Payment will be accepted in any convertible currency. Please check the
rate of exchange at your bank. For sales within the Netherlands please
add 6% VAT (BTW). Prices are subject to change without notice.

* Delete those that do not apply.