Date: Fri, 4 Mar 94 16:29:21 MST
Message-Id: <9403042329.AA06052@NMSU.Edu>
From: lexical@crl.nmsu.edu (Consortium for Lexical Research)
To: clrlist@crl.nmsu.edu
Subject: CLR Newsletter No. 11
Cc: lexical@crl.nmsu.edu

Reply-To: lexical@crl.nmsu.edu
********************************************************************

Greetings;

Enclosed is the Consortium for Lexical Research Newsletter No.  11.
This special edition is being mailed to a large compiled list of
researchers in linguistics and computational linguistics.

If you do not want to receive additional mailings, no response is
necessary.  This is a one-time mailing and you will not be sent any
future issues.

If you would like to be put on our mailing list and receive a CLR
newsletter every two months, please send your request to:
lexical@crl.nmsu.edu.

Hoping you enjoy the newsletter,

Katherine Mitchell
Consortium for Lexical Research

*********************************************************************


*************************************************
Consortium for Lexical Research Newsletter 11
February 28, 1994
*************************************************


From the Computing Research Laboratory 
New Mexico State University

Edited by: Jim Cowie and Katherine Mitchell 

Contributions and inquiries to:
	lexical@nmsu.edu

FTP address for accessing materials:
	clr.nmsu.edu [128.123.1.12].

*************************************************

Introduction

This newsletter discusses machine readable dictionaries made available
through CLR. In addition, some recently acquired parsers are described.
The next newsletter will describe wordlists stored in our archives,
and a new service to CLR members, called Resources, which will
centralize information on ftp sites, organizations, projects,
publications, etc. of interest to the natural language processing
research community.

For more information on the Consortium, please ftp to our site and get
a copy of our catalog. It is available in plain ASCII as `catalog' or
in a postscript version as `catalog.ps'. Answers to any questions about
the archives or about becoming a member of CLR can be obtained by
emailing lexical@nmsu.edu.

This newsletter is distributed in plain ASCII text and in postscript
format. To obtain a copy please ftp to clr.nmsu.edu.  The directory is
CLR/newsletter and the files to get are news11.txt and news11.ps.

*************************************************

Contents 

1.	Using FTP and Changes in our FTP site 

2.	Machine Readable Dictionaries

3.	Recent Acquisitions 

4.	CLR Membership 

*************************************************

Using Anonymous FTP 

Materials stored in the CLR are constantly being updated and new
acquisitions are available. If you are interested in learning what
these items are, you are welcome to ftp our catalog.

Anonymous ftp allows non-members access to the catalogs and some
unrestricted data files.  Here are the steps for using Anonymous FTP.
It is recommended that you get the file README.clr.site for an
introduction to using our archives.


>ftp clr.nmsu.edu (or ftp 128.123.1.12)

login: anonymous 

password: type in your email address; for example: rose@ed.ac.uk

ftp> cd CLR

ftp> binary (it is very important to set the binary mode when you are
downloading software programs. Failure to do this can cause poor data
transfer and problems with the software when you use it) 

ftp> get README.clr.site (or get catalog) 

ftp> quit
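
For readers who prefer to script their transfers, the session above
might be sketched in Python roughly as follows, using the standard
ftplib module. The host and directory are the ones given in this
newsletter; the function name and the ftp_factory argument are
illustrative assumptions, and the sketch has not been run against the
CLR server.

```python
from ftplib import FTP

CLR_HOST = "clr.nmsu.edu"  # host given in this newsletter
CLR_DIR = "CLR"            # all CLR materials live under this directory


def fetch_clr_file(remote_name, email, ftp_factory=FTP):
    """Return the bytes of one file from the CLR archive via anonymous FTP.

    ftp_factory is a hypothetical hook so that the function can be
    exercised without a network connection; by default it is ftplib.FTP.
    """
    chunks = []
    ftp = ftp_factory(CLR_HOST)             # > ftp clr.nmsu.edu
    try:
        ftp.login("anonymous", email)       # password: your email address
        ftp.cwd(CLR_DIR)                    # ftp> cd CLR
        # retrbinary always transfers in binary mode (ftp> binary)
        ftp.retrbinary("RETR " + remote_name, chunks.append)
    finally:
        ftp.quit()                          # ftp> quit
    return b"".join(chunks)
```

Calling fetch_clr_file("README.clr.site", "rose@ed.ac.uk") and writing
the returned bytes to a local file opened in "wb" mode would correspond
to the `get' step above.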

Members of CLR use a login name and a special password that they are
assigned. Members can access certain directories that non-members are
unable to use.


Changes at our FTP site

The archives have been reorganized and all CLR materials are now under
the directory CLR. Within the CLR area there are reserved materials
filed in the directories named members-only and MUC5, and freely
available materials under the directories multiling, lexica, and
tools. There are new README files, and the file README.clr.site will
answer basic questions about the Consortium and accessing its
materials through ftp.


*************************************************
MACHINE READABLE DICTIONARIES
************************************************

Large dictionaries with full semantic information are not freely
available. The first section below describes the MRDs which CLR
distributes, along with their costs.  Sample electronic files are
available for some of these dictionaries; please email us if you would
like to see electronic samples. The Consortium has very good working
relationships with these publishers, and will facilitate the paperwork
and expedite delivery of the materials.

There are a variety of freely available dictionaries which have
pronunciation information, some syntactic information, or some
features information, but none with full definitions of headwords. As
a service to CLR members we are gathering these dictionaries and
trying to build a complete centralized archive of what is available.


****Dictionaries with Semantic Information: not freely available****


1)  Collins English Dictionary 

Collins English Dictionary, 3rd Edition, published in 1991. A Revised
3rd Edition will be issued later this year. The CED3 contains 180,000
references, 190,000 numbered definitions, 14,000 new entries and
updated entries from the previous edition, and 16,000 biographical and
geographical entries; it has 3.5 million words of text. It is a very
extensive resource, in many ways an encyclopedia as well as a
dictionary. The vocabulary of science, technology, and other
specialist areas is well covered. Older versions of the CED are fairly
obsolete. The CED3 is primarily a British English dictionary, though
it contains many American English spellings and meanings.

Format: The machine readable dictionary is supplied on tape. It is in
ASCII text format and contains the typesetter's codes.

Cost: 2,000 pounds sterling for academic research; more for corporate
research.  

Instructions: Harper Collins Publishers has an application form that
is required; upon its approval a contract is drawn up; upon completion
of the contract and payment of the fees, the tape is provided.

Sample: an electronic sample is available from CLR.


2)   Collins COBUILD English Language Dictionary

Collins COBUILD English Language Dictionary is a Learner's Dictionary,
designed for instruction in the English language. COBUILD is developed
from analysis of the Collins Bank of English, a corpus of more than
200 million words gathered from a wide range of spoken and written
sources. The dictionary concentrates on contemporary, everyday,
non-specialist English and uses example sentences taken from real
written or spoken language. It has over 70,000 references and over
90,000 examples.

The Format, Cost, Instructions, and Samples are the same as above.


3)   Collins Bilingual Dictionaries

Harper Collins publishes a line of bilingual dictionaries for several
languages which come in different sizes. The Gem series editions have
about 40,000 references, and the Concise series editions have over
100,000. The Large editions typically have over 200,000 references and
over 400,000 translations.  

German-English bilingual: Gem, Concise, Large 

Italian-English bilingual: Gem, Concise, Large 

Spanish-English bilingual: Gem, Concise, Large 

French-English bilingual: Gem, 

Greek-English bilingual: Gem 

Hindi-English bilingual: Gem

Malay-English bilingual: Gem 

Portuguese-English bilingual: Gem

Russian-English bilingual: Gem 

Format: The machine readable dictionaries are supplied on tape.
They are in ASCII text format and contain the typesetter's codes.

Cost: For academic research, the Gem costs 1,000 pounds sterling, the
Concise 1,250 pounds, and the Large 2,000 pounds. Corporate research
pricing is higher.

Instructions: The same as those listed above for the Collins English
dictionaries.

Samples: electronic samples are available for the Spanish-English and
the German-English bilinguals in their Large editions. The samples are
for the complete letter "N" in both languages.


4)   Longman Dictionary of Contemporary English

LDOCE is available electronically as the second edition published in
1987 or as the first edition from 1978. The first edition is available
as a typesetting tape or as a LISP version, and has semantic
information which was not included in the second edition. For example,
it has Box Codes, which hierarchically specify abstract or concrete,
with concrete branching to animate or inanimate, animate branching to
plant, human, or animal, and so on. Also marked are Subject Field
Codes, which indicate domains such as Economics, Entertainment, or
Basketball. LDOCE is a Learner's Dictionary; the first edition has
approximately 45,000 entries, and the second edition about 56,000
(including phrasals).
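
To give a feel for how a hierarchy like the Box Codes can be used in
a program, here is a minimal Python sketch covering only the branches
named above. The labels are descriptive placeholders of our own, not
the actual code symbols used on the LDOCE tape, and the function
names are likewise illustrative.

```python
# Parent links for the part of the hierarchy described above.
# The labels are descriptive placeholders, not LDOCE's own code symbols.
BOX_PARENT = {
    "abstract": None,
    "concrete": None,
    "animate": "concrete",
    "inanimate": "concrete",
    "plant": "animate",
    "human": "animate",
    "animal": "animate",
}


def ancestors(label):
    """Return the broader categories above label, nearest first."""
    chain = []
    parent = BOX_PARENT[label]
    while parent is not None:
        chain.append(parent)
        parent = BOX_PARENT[parent]
    return chain


def subsumes(general, specific):
    """True if general is the same as, or broader than, specific."""
    return general == specific or general in ancestors(specific)
```

A check such as subsumes("animate", "human") is the kind of test a
parser might apply when a verb requires an animate subject.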

Format: Both the 1978 and 1987 editions are supplied as typesetting
tapes. The 1978 edition is also available as a LISP version on tape.

Cost: 1,000 pounds sterling for academic research; more for corporate
research.  

Instructions: Longman has an application form that is required; upon
its approval a contract is drawn up; upon completion of the contract
and payment of the fees, the tape is provided.


5)   Roget's Thesaurus 

The original 1911 Roget's Thesaurus is freely available. A 1991
American English edition of the thesaurus, published by Harper
Collins, is also available; for academic research purposes it costs
750 pounds sterling.  Please write to inquire about the Collins
version.  You can ftp to the directory below to pick up the freely
available 1911 version.


Ftp Directory: lexica/roget_1911/ 


****Dictionaries in English: freely available ****

The following is a list of freely available dictionaries, none of
which contain full definitions or full semantic information. All have
some accompanying documentation that helps explain the codes used and
the syntactic information available. Each of these carries strict
copyright restrictions reserving it for academic research and
excluding it from incorporation into any commercial applications.


1)   Collins English Dictionary Prolog FactBase 

Developed by Dr. Ed Fox and Dr. Robert Vance at Virginia Tech. Using
the original 1974 Collins English Dictionary, a set of Prolog facts
was derived and a set of relation files created, one file for each
"relation to the headword" identified in the structure of the
dictionary. Examples of these files are: HEADWORD, headword entry;
ALSO_CALLED, headword is also commonly called this; CATEGORY, semantic
label of headword; POS, part of speech; PAST, past form of headword.
The facts use Edinburgh standard Prolog syntax.
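
For those without a Prolog system at hand, simple facts of this kind
can also be read from other languages. The Python sketch below
assumes, purely for illustration, that each fact has the two-argument
shape relation(headword, value). and that the relation names appear
in lower case; the actual layout of the FactBase files may differ, so
please consult the accompanying documentation. The function name and
sample facts are invented.

```python
import re
from collections import defaultdict

# Matches a simple two-argument fact such as:  pos(abandon, verb).
# This shape is an assumption, not the documented FactBase format.
FACT = re.compile(r"^\s*([a-z_]\w*)\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)\.\s*$")


def load_facts(lines):
    """Group two-argument facts by relation name, then by headword."""
    relations = defaultdict(lambda: defaultdict(list))
    for line in lines:
        match = FACT.match(line)
        if match:  # skip blank lines, comments, and anything more complex
            relation, headword, value = match.groups()
            relations[relation][headword].append(value)
    return relations
```

With facts loaded this way, relations["pos"]["abandon"] would list
every part of speech recorded for the headword `abandon'.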

Ftp Directory: CLR/members-only/lexica/CED.prolog 


2)   The MRC Psycholinguistic Database 

Originally prepared by Max Coltheart for a Medical Research Council
grant as a database for psycholinguistic use.  The file has 150,837
headwords with 26 linguistic properties, although few words have
information on every property.  Properties include
number of phonemes and syllables, measures of frequency,
pronunciation, and part of speech, to name a few.

Ftp Directory: CLR/members-only/lexica/MRC.psycholing/ 


3)   The On-Line Dictionary of Computing 

A glossary of programming languages, architecture, networks, domain
theory, mathematics, etc.  Copyright Denis Howe 1993, freely
available for research use.

Ftp Directory: CLR/members-only/lexica/OLDC/ 


4)   The Oxford Advanced Learner's Dictionary of Current
English: Mitton's version

This is a version of the 1974, 3rd edition OALDCE, prepared by Roger
Mitton of the University of London specifically for use in computer
applications. The dictionary contains no definitions; the spelling,
pronunciation, and syntactic information from the original are
retained. It has 35,000 headwords, about 2,500 added proper names, and
an added section created by Dr. Mitton which has over 68,000 derived
inflected forms.

Ftp Directory: CLR/members-only/lexica/OALDCE/ 


5)   WordNet 1.4 

Developed by Professor George Miller and his group at Princeton,
WordNet is an on-line lexical reference system designed as a semantic
network.  English nouns, verbs, and adjectives are organized into
synonym sets, each representing one underlying lexical concept.
Different relations link the synonym sets: synonymy, antonymy,
meronymy, and hyponymy.  WordNet has brief definitions, and has the
advantage of having been conceived and built explicitly for use in
computer applications. CLR houses versions 1.2, 1.3, and 1.4.
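
The idea of synonym sets joined by typed relations can be sketched in
a few lines of Python. This is our own toy illustration, not
WordNet's file format or access software; the class, the function,
and the link name "hypernym" (the upward direction of the hyponymy
relation) are assumptions made for the example.

```python
class Synset:
    """One lexical concept: a set of synonymous words plus typed links."""

    def __init__(self, words, gloss=""):
        self.words = frozenset(words)
        self.gloss = gloss
        self.links = {}  # relation name -> list of Synset objects

    def link(self, relation, other):
        """Record that this synset stands in `relation` to `other`."""
        self.links.setdefault(relation, []).append(other)


def hypernym_chain(synset):
    """Follow the first 'hypernym' link upward, collecting the words
    of each broader synset in turn."""
    chain = []
    current = synset
    while current.links.get("hypernym"):
        current = current.links["hypernym"][0]
        chain.append(sorted(current.words))
    return chain
```

Antonymy or meronymy would be recorded the same way, simply by using
a different relation name in the call to link.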

Ftp Directory: CLR/lexica/wordnet/ 


****Pronunciation Dictionaries ****

The Carnegie Mellon University Dictionary contains about 100,000 words
and their phonetic transcriptions. The phoneset lists 26 phones that
were used. Robert Weide and Peter Jansen from CMU generated the
dictionary from a variety of sources including the UCLA Shoup
dictionary. A second source for pronunciation is Chuck Wooster's (ICSI
Berkeley) database of 6100 words from TIMIT and their most common
pronunciations.

Homophones is not a pronunciation dictionary, but rather is a list of
words that sound the same but are spelled differently. The list of
homophones was provided by Evan Antworth from the Summer Institute of
Linguistics.

Ftp Directory: CLR/members-only/lexica/CMU-Dict.0.1/ 

Ftp Directory: CLR/lexica/TIMIT/ 

Ftp Directory: CLR/lexica/homophones/


****Dictionaries: not English****


1)   EDICT

This is a public domain Japanese/English dictionary intended
originally for use with MOKE (Mark's Own Kanji Editor) but used today
in a large number of packages. It uses EUC code for Kanji and Kana.
EDICT has over 30,000 entries, and entries carry markers such as
transitive or intransitive verb, idiomatic expression, person name,
etc. EDICT was started by Mark Edwards, but has since been developed
by James Breen.

Ftp Directory: CLR/lexica/edictj 


2)   JDDICT

A Japanese to German dictionary entered by Helmut Goldstein. The
dictionary has over 11,000 Japanese words and 22,000 German
translations.

Ftp Directory: CLR/lexica/jddict/ 


3)  The Japanese Morphological Dictionary 

This was made freely available by ICOT, and comes with both the
dictionary and a search program to access it.  The documentation is
extensive, but it is all in Japanese.

Ftp Directory: CLR/lexica/jmorphdict/ 


4)   Russian - English On-Line Dictionary 

This is an on-line dictionary for MS DOS developed by Leon Ungier.

Ftp Directory: CLR/multiling/russian/


****************************************************
RECENT ACQUISITIONS
****************************************************

Below are some new additions to the CLR archives. The acronym
dictionary is mentioned because it is in keeping with this
newsletter's theme. The other materials are parsers and grammar
systems.  

---------------------------------------------------

ACRONYM DICTIONARY 

Ftp Directory: members-only/lexica/wordlists/acronyms/ 

An ascii text file of a very comprehensive list of acronyms; over 3300
entries. A wide variety of domains are covered, including business,
science, medicine, government, and more. A brief sample from the
letter `N':

NAS National Academy of Sciences; NAS National Advanced Systems;
NASA National (US) Aeronautics and Space Administration [Space];
NASDA National (Japan) Space Development Agency [Space]; NASM
National (US) Air and Space Museum [Space]; NASP National (US)
AeroSpace Plane [Space]; NATO North Atlantic Treaty Organization.
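
Entries laid out as in the sample above are easy to load into a
lookup table. The Python sketch below parses the sample as it is
printed here (semicolon-separated entries, with an optional domain in
square brackets); the layout of the actual file in the archive may
differ, and the function name is our own.

```python
import re
from collections import defaultdict

# A shortened copy of the sample printed in this newsletter.
SAMPLE = (
    "NAS National Academy of Sciences; NAS National Advanced Systems; "
    "NASA National (US) Aeronautics and Space Administration [Space]; "
    "NATO North Atlantic Treaty Organization."
)

# acronym, expansion, and an optional trailing [Domain] tag
ENTRY = re.compile(r"^(\S+)\s+(.*?)(?:\s*\[([^\]]+)\])?$")


def parse_acronyms(text):
    """Map each acronym to a list of (expansion, domain-or-None) pairs."""
    table = defaultdict(list)
    for raw in text.split(";"):
        # collapse line breaks and tabs, drop the sentence-final period
        entry = " ".join(raw.split()).rstrip(".")
        match = ENTRY.match(entry)
        if match:
            acronym, expansion, domain = match.groups()
            table[acronym].append((expansion, domain))
    return table
```

Because one acronym can have several expansions (NAS above has two),
each key maps to a list rather than to a single string.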

---------------------------------------------------

AV Parser

Ftp Directory:	 members-only/tools/ling-analysis/syntax/AVparser/

The Attribute Value Parser provides a general tool for investigating
unification-based theories of grammar, runs on Apple Macintosh
computers, and was developed by Mark Johnson. It works with a
user-defined grammar, specified in a file or constructed using the
editor included, and constructs parse trees and feature structures
from input sentences. Clicking on the nodes in the parse tree causes
their associated feature structures to be displayed. There are two
versions of the parser, corresponding to the two versions of Apple's
CommonLisp environment that were used to create them. The 1.32 version
was created with MACL version 1.32, and the 2.0p2 version was created
with MCL 2.0p2.
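
For readers new to unification-based grammar, the core operation on
feature structures can be sketched briefly. The Python below is a
simplified illustration of our own and is not taken from the AV
Parser, which is a Lisp program; it treats feature structures as
nested dictionaries with string values and leaves out reentrancy
(shared substructures).

```python
def unify(a, b):
    """Unify two feature structures; return the result, or None on a clash.

    Feature structures are nested dicts; atomic values are strings.
    Reentrancy (shared substructures) is not modelled in this sketch.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:
                    return None      # conflicting values: unification fails
                result[feature] = merged
            else:
                result[feature] = value
        return result
    return a if a == b else None     # atoms unify only if identical
```

Unifying a singular noun phrase with a verb phrase that demands a
plural subject fails on the number feature, which is how agreement
violations are detected in grammars of this kind.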

---------------------------------------------------

FUF and SURGE 

Ftp Directory: /members-only/tools/ling-analysis/syntax/ 

FUF 5.2 and SURGE 1.2 were developed by Michael Elhadad, currently at
Ben Gurion University of the Negev. FUF is an extended implementation
of the formalism of functional unification grammars (FUGs) introduced
by Martin Kay, specialized to the task of natural language generation.
SURGE is a large syntactic realization grammar of English, written in
FUF. SURGE was developed to serve, within a larger generation system,
as a "black box" syntactic generation component that encapsulates a
rich knowledge of English syntax. SURGE can also be used as a platform
for exploring grammar writing from a generation perspective.

---------------------------------------------------

LHIP PARSER

Ftp Directory: members-only/tools/ling-analysis/syntax/

The LHIP parser (Left-Head corner Island Parser) was developed by
Afzal Ballim, at ISSCO, the University of Geneva. LHIP is a system for
incremental grammar development using an extended DCG formalism. The
system uses a robust island-based parsing method controlled by
user-defined performance thresholds which allows it to analyze what it
can from the input, thus presenting the grammar developer with results
at an early stage. The rules themselves are an extended version of the
DCG rules, allowing optional constituents, negation, disjunction, the
specification of adjacency, and the ability to mark multiple heads in
a rule body. The latest version is 1.1. The LHIP system requires an
Edinburgh-style Prolog.


***************************************************
CLR MEMBERSHIP
***************************************************

The members-only area of the CLR archives is growing rapidly, with
valuable materials and software available to lexical researchers who
are members of the Consortium. If your interests lie in lexicology,
lexicography, and lexical research, we encourage your organization to
become a member, promoting the use of these valuable resources for
lexical research and ensuring that they can be maintained.

Welcome to new CLR members: 

Edwin R. Addison, President, along with the staff of Conquest
Software, Inc. in Columbia, Maryland.  

Dr. Jose M. Castano at the Departamento de Computacion, Universidad de
Buenos Aires, Buenos Aires, Argentina.

Dr. Jane Edwards of the Institute for Cognitive Studies, and Dr.
Daniel Jurafsky of the International Computer Science Institute, at
the University of California at Berkeley, Berkeley, California.  

Dr.  Kemal Oflazer of the Computer Engineering Department, Bilkent
University, Ankara, Turkey.