THE FUNCTIONAL TREATMENT OF PARSING by Rene Leermakers FOREWORD Formal-language theory, theoretical linguistics and computational linguistics have shared roots in the 1950s, with the seminal work of Kleene, Chomsky, Miller and Bar-Hillel on regular languages and phrase-structure grammars. However, various social, cultural and technological factors have since then conspired to split those disciplines and weaken their understanding and appreciation of each other. Efficiency considerations and the fact that programming languages are human artifacts may partly justify the focus on deterministic languages and parsers in the theory of context-free parsing. However, natural languages are highly ambiguous and thus non-deterministic, making much of that theory seem irrelevant to natural-language parsing. It has thus been difficult to convince the computational linguist of the importance of context-free-parsing theory, if not for specific algorithms, then for concepts and techniques essential to the rigorous analysis of natural-language parsers. Looking in the other direction, the formal-language theorist is mostly unaware of the special problems of natural-language parsing, and thus not only misses a potentially rich area for new research but also fails to appreciate the efforts of computational linguists. For these reasons, the publication of "The Functional Treatment of Parsing" is doubly welcome. For the computational linguist, Rene Leermakers's book brings out the relevance of the theory of context-free parsing to natural-language parsing. His innovative use of functional notation makes algorithms and their derivation less mysterious, and eliminates much of the need for the laborious inductive proofs of correctness found in other parsing theory texts. In addition, the functional approach ties well with the widespread acquaintance of current and recent students with the functional programming paradigm through languages such as Scheme and ML. Delicate data structure issues in parsing are clearly located in elegant abstractions representing nondeterminism and result reuse. For the computer scientist, "The Functional Treatment of Parsing" offers a fresh and unified perspective on a variety of parsing algorithms, some well-known and some less so. This new perspective offers much simpler proofs of correctness and computational complexity, and eliminates the artificial distinction between stack-based and tabular parsers. The use of equational reasoning rather than special-purpose inductive proofs to relate algorithms to their specifications is an excellent application of an approach to program derivation and verification that has received strong support in the works of Boyer, Moore, Dijkstra and Gries. From a more practical angle, Leermakers's approach provides a theoretical basis for the parsing component of interactive language-development environments, for which the standard deterministic parsing methods have been proven unwieldy. Parsing theory has many subtleties, requiring attentive and thoughtful study. While the present book does not excuse the student from those obligations, it will provide ample rewards to readers at all levels. In addition to a self-contained and elegant treatment of all the main ideas of context-free parsing, it brings out the underlying unity of the subject as no other book I know of, and offers a wealth of conceptual and technical riches, of which I particularly enjoyed the application of Lambek types to the analysis of grammatical covers and attribute grammars. There have been increasing signs in the research literature of a long-overdue convergence between formal-language theory and computational linguistics, in particular in the area of context-free parsing. "The Functional Treatment of Parsing" not only demonstrates that convergence for the first time in book form, but also revives context-free parsing theory as an interesting and relevant topic for computational linguists and computer scientists alike. Fernando C.N. Pereira. FOREWORD by Fernando Pereira i PREFACE vii 1 CONTEXT-FREE GRAMMAR 1 2 BUNCH NOTATION 7 2.1 Bunches 8 2.2 Algorithmic Interpretation 12 3 GRAMMAR INTERPRETATIONS 15 3.1 The natural interpretation 15 3.2 Derivation 20 3.3 The Lambek types 23 3.4 Recognition functions 26 3.5 Generation 28 3.6 Summary of interpretations 29 4 RECURSIVE DESCENT 33 4.1 The functional interpretation 33 4.2 Termination 35 4.3 Complexity and memoization 35 4.4 Look ahead 38 4.5 Error recovery 42 5 GRAMMAR TRANSFORMATIONS 45 5.1 Making grammars bilinear 45 5.2 Recursive descent for E(G) 48 5.3 Partial elimination of left recursion 49 5.4 Recursive descent for F(G) 56 6 RECURSIVE ASCENT 61 6.1 The algorithm 62 6.2 Termination 65 6.3 A variant that works with strings 66 6.4 Complexity 67 6.5 EBNF grammars 69 7 PARSE FOREST 73 7.1 Informal introduction 73 7.2 The grammar E'(G) 75 7.3 Forest for bilinear grammars 76 7.4 The set Q 80 7.5 Standard Earley parser 83 7.6 Earley versus Earley 84 8 ATTRIBUTE GRAMMARS 87 8.1 Notational conventions 88 8.2 Attribute functions 90 8.3 Example 92 8.4 Function graphs 96 8.5 Attribute grammar parser 99 8.6 Direct attribute evaluation 100 9 LR PARSERS 111 9.1 LR(0) recognizer 111 9.2 The deterministic case 116 9.3 Implementation with stacks 119 9.4 Some variants 122 9.5 Look ahead 124 9.6 Attributes 126 9.7 Continuations 127 9.8 Error recovery 130 9.9 The methods by Lang and Tomita 133 9.10 Evaluation w.r.t. standard approaches 134 9.11 Earley versus LR 135 10 SOME NOTES 10.1 Context-free grammars 139 10.2 Names 140 10.3 Bunches 140 10.4 Functional programming 141 10.5 Grammar transformations 141 10.6 Memo-functions 142 10.7 Parse forests 142 10.8 Earley 143 10.9 Attribute grammars 143 10.10 Natural language 144 10.11 Other applications 144 10.12 LR parsing 144 10.13 EBNF 145 10.14 Conclusion 145 PREFACE The theory of parsing with respect to context-free grammars is one of the old and established parts of computer science. The first contributions were rather theoretical treatises within automata theory. Later on, contributions to the field came from people who were interested in practical applications such as compiler construction and natural language processing. The programming language community developed a vast amount of knowledge about deterministic (LL(k), PLR(k), LR(k), LALR(k), operator precedence, recursive descent ...) parsers. For analyzing natural language, these parsers are not useful. Instead, for the latter purpose, a large number of general parsing algorithms have been used. Among them are the algorithm attributed to Cocke, Younger and Kasami, and the Earley, chart, Sheil, and Tomita parsers. The scientific communities of computational linguistics and compiler theory are rather different. It is my experience that the average researcher in either area underestimates the problems at the other side of the fence. One professor of computer science, for instance, once confided to me that it escaped him why ``beautiful compiler construction tools like parser generators (he mentioned a specific one) are not being used for natural language processing''. This professor had never realized that there are differences between artificial languages and natural languages that have grave consequences for parsing, such as the fact that natural language sentences are syntactically ambiguous. This book might help bring both parsing communities closer together. In any case, it brings together the techniques that are used at either side. The current state of parsing theory reflects the status quo regarding the two contributing communities. It suffers from a dichotomy that does not befit a mature field of knowledge. Parsers in compilers are typically implemented as deterministic push-down automata, the central data structure of which is a so-called parse stack. General parsing algorithms are mostly tabular, which means that a parse matrix is the central data structure. In this essay, by contrast, deterministic and general parsing algorithms are treated in a unified fashion. This is accomplished by adopting a functional formulation of parsing theory. In the new theory, factors that distinguish various parsing algorithms, such as stacks and parse matrices, are banned. Stacks are replaced by recursive functions, and parse matrices by memoizing functions (functions that remember past invocations). In the main course, some deficient parts of the existing body of knowledge are identified and repaired. A notable example of such a deficiency in the standard theory is the absence of simple functional implementations of LR parsers. Many books on parsing theory are dedicated to the study of many classes of grammars (such as LL(k), LR(k), ...). Such a class is determined by the requirement that a corresponding parser behaves deterministically. In this essay, whether or not a parser is deterministic is considered to be a marginal question. Our emphasis is on algorithms, not on grammar classes. Our main topic is theory, but the application of the theory is never far away, and the results are of direct practical relevance. Parsing theory is presented in a mathematical way, but the style of mathematics is not too rigorous. The correctness of most algorithms is formally established, with proofs of calculational nature. The theory presented in this essay leads to a new technique for implementing parsers, which has been coined "recursive ascent parsing". The history of recursive ascent parsing is quite interesting. The standard formulation of LR parsing uses the concepts of automata theory. Looking for efficient implementations, Pennello came up with a technique borrowed from efficient recursive descent implementations, and in this way created, in a hidden way, the first recursive ascent LR parser. Based on this work, Roberts presented an implementation at a higher level of abstraction. Independently from Roberts and Pennello, Kruseman Aretz and Barnard and Cordy almost simultaneously proposed very similar ideas, with motivations of a theoretical kind. Their starting point was the standard LR parser. Automata This book's Theory theory | | | | V V Standard Recursive LR parser -------> ascent | parsing | |^ | | ------------> Penello Figure 1 Figure 1 is a pictorial representation of this history, and displays the work reported here at the level of automata theory. More precisely, our theory is the functional equivalent of the theory of nondeterministic pushdown automata (NPDA), the class of automata that recognizes exactly the context-free languages. The basic idea is simple but extraordinary powerful: a NPDA state is `implemented' as a function. Then, state transitions correspond to function calls, and stack pops to function returns. The expert reader may be surprised when he finds his favorite parsing algorithm described in this book. LR parsers, for example, are defined not only without parse stacks but also without parsing tables. The functional approach to parsing requires a new way of thinking, and in this respect it may sometimes be an advantage to be unacquainted with standard approaches. People that love functional programming will enjoy the new applications of this style of writing algorithms. For others, the same style may be a (hopefully temporary) stumbling block. For that reason, every now and then an example is given in an imperative style, that can be translated into any imperative programming language without much ado. The reader is adviced to actually do so whenever the level of abstraction becomes a problem. My hope is that this essay will be of use to students, teachers, scientists and programmers. Parsing theory does not need heavy mathematics and only some standard mathematics skills are presupposed. The book is self-contained. References to the literature are avoided in the content chapters. The last chapter is contemplative and contains many bibliographic notes. The results are presented without long explanatory elaborations, but this could make the book a bit terse for students. As an encouragement, it may be said that working through this book is a quick way to become an expert on parsing theory. The book can be used to teach parsing, but it also contains interesting topics to touch upon in courses on formal language theory, compiler construction, functional programming, computational linguistics, or program derivation. This book was written to explain the new theory of chapters 6 to 9. The earlier chapters are a prelude, providing background knowledge that is useful for building up some intuition about recursive ascent parsing algorithms. We start with an introduction to context-free grammars. Various interpretations of context-free grammars are given. One such interpretation involves a mapping from grammar symbols to multiple-valued functions, and leads directly to the recursive descent parsing method. As this method has some limitations, we are led to study grammar transformations as a way to adapt arbitrary grammars to these limitations. Historically, many results of chapters 6 and 9 have been obtained by grammar transformations. However, the chapters 6 to 9 do not depend on all of the preceding text, and in particular they can be understood without knowing about grammar transformations. One may therefore skip chapter 5 on first reading, if one is not interested in what is `behind it all'. In practical applications, context-free grammars are often used in conjunction with some form of attribute evaluation. Chapter 8 contains a functional treatment of attribute grammars, in such a way that parsing with respect to attribute grammars is a simple extension to pure context-free parsing. Two attribute grammar formalisms are presented. One is typically useful in compiler technology, where attributes are used for semantic purposes. The other is meant for applications in which attributes play an important syntactic role, as in natural language parsing. I am indebted to my family, for letting me work so many nights, and to various colleagues at Philips Research for having been instrumental. Jan Landsbergen stimulated me to express my views in the form of an essay. Frans Kruseman Aretz invented recursive ascent parsing and through his stimulating, if critical, comments contributed to this work in its early stages. Lex Augusteijn profoundly influenced both my ideas and the way they appear in this book. Rene Leermakers, February 1993. -------------------------------------ORDER FORM------------------------------ Ref: ftpser Please send me: The Functional Treatment of Parsing, by Rene Leermakers _____copy(ies) HB, ISBN 0-7923-9376-7, $79.95 Payment enclosed to the amount of ___________________________ * Please invoice me * Please charge my credit card Name of Card Holder: ______________________________________ Card. no.: ________________________________________________ Expiry Date:______________________________________________ Am. Ex.* Visa* Diners Club* Mastercard* Delivery address: Name: ___________________________________________________________________ Address: ________________________________________________________________ ________________________________________________________________ ________________________________________________________________ ________________________________________________________________ Date:________________ Signature:_______________________________ To be sent to: Outside North America In USA and Canada KLUWER ACADEMIC PUBLISHERS GROUP KLUWER ACADEMIC PUBLISHERS Order Dept. Order Dept P.O. Box 322 101 Philip Drive 3300 AH Dordrecht, The Netherlands Norwell, 02016 MA Tel: +31-78-524400 Tel: 617-871-6600 Fax +31-78-524474. Fax: 617-871-6528 email: vanderlinden@wkap.nl email: kluwer@world.std.com Orders from individuals accompanied by payment or authorization to charge a credit card account will ensure prompt delivery. Postage and handling charges will be absorbed by the Publisher on all such orders. Payment will be accepted in any convertible currency. Please check the rate of exchange at your bank. For sales within the Netherlands please add 6% VAT (BTW). Prices are subject to change without notice. * Delete those that do not apply.