NAME: NetTalk Corpus

SUMMARY: This is an updated and corrected version of the data set used by
Sejnowski and Rosenberg in their influential study of speech generation
using a neural network [1].  The file "nettalk.data" contains a list of
20,008 English words, along with a phonetic transcription for each word.
The task is to train a network to produce the proper phonemes, given a
string of letters as input.  This is an example of an input/output mapping
task that exhibits strong global regularities, but also a large number of
more specialized rules and exceptional cases.

SOURCE: The data set was contributed to the benchmark collection by Terry
Sejnowski, now at the Salk Institute and the University of California at
San Deigo.  The data set was developed in collaboration with Charles
Rosenberg of Princeton.  Approximately 250 person-hours went into creating
and testing this database.

COPYRIGHT STATUS: The data file carries the following copyright notice:

Copyright (C) 1988 by Terrence J. Sejnowski.  Permission is hereby given to
use the included data for non-commercial research purposes.  Contact The
Johns Hopkins University, Cognitive Science Center, Baltimore MD, USA for
information on commercial use.

MAINTAINER: neural-bench@cs.cmu.edu

PROBLEM DESCRIPTION:

The data set is in the benchmark directory in file "nettalk.data".  This data
set is in the standard CMU Benchmark format.

METHODOLOGY: 

This data set can be used in a number of different ways to test learning
speed, quality of ultimate learning, ability to generalize, or combinations
of these factors.

Sejnowski and Rosenberg used the following experimental setup:

The input to the network is a series of seven consecutive letters from one
of the training words.  The central letter in this sequence is the
"current" one for which the phonemic output is to be produced.  Three
letters on either side of this central letter provide context that helps to
determine the pronunciation.  Of course, there are a few words in English
for which this local seven-letter window is not sufficient to determine the
proper output.  For the study using this "dictionary" corpus, individual
words are moved through the window so that each letter in the word is seen
in the central position.  Blanks are added before and after the word as
needed.  Some words appear more than once in the dictionary, with different
pronunciations in each case; only the first pronunciation given for each
word was used in this experiment.

A unary encoding is used.  For each of the seven letter positions in the
input, the network has a set of 29 input units: one for each of the 26
letters in English, and three for punctuation characters.  Thus, there are
29 x 7 = 203 input units in all.

The output side of the network uses a distributed representation for the
phonemes.  There are 21 output units representing various articulatory
features such as voicing and vowel height.  Each phoneme is represented by
a distinct binary vector over this set of 21 units.  In addition, there are
5 output units that encode stress and syllable boundaries.

Standard back-propagation was used, with update of the weights after the
error gradient has been computed by back-propagation of errors for all the
letter positions a single word.  The number of hidden units in the network
was varied from one experiment to another, from 0 to 120.  Each layer was
totally connected to the next; in the case of 0 hidden units, the input
units were directly connected to the outputs.

The weight-update formulas used in the Sejnowski and Rosenberg study were
slightly different from the standard form; see the paper for a description
of the momentum and learning rate parameters used.  The network weights
were initialized with random values in the range -0.3 to +0.3.  A
sum-of-squares error measure was used, with errors of magnitude less than
0.1 being treated as 0.0.

RESULTS: 

In [1], Sejnowski and Rosenberg present full error curves for a number of
experiments using this corpus.  Here we will present only a brief summary
of these results.  The results reported here all use the "best guess"
criterion: an output is treated as correct if it is closer (smallest angle)
to the correct output vector than to any other phoneme output vector.

A subset, consisting of the 1000 most common English words, was tested with
0, 15, 30, 60, and 120 hidden units.  With 0 hidden units, the best
performance achieved was about 82% correct by the "best guess" criterion.
The rate of learning and final performance improved steadily with
increasing numbers of hidden units, up to 98% correct with 120 hidden
units.

(The original "list of 1000 most common English words" is not available,
but a reconstruction of this list is in file "nettalk.list" in the
benchmark data base.  This should be very close to the original.)

This 120-hidden-unit network scored about 80% after training on 5000
word-presentations (i.e. five times through the 1000-word corpus), and the
98% level was reached after 30,000 presentations.

This pre-trained network with 120 hidden units was then tried on the full
corpus of 20,000 words.  Without further training, the correct output was
generated in 77% of the cases.  After 5 passes through the larger corpus,
performance improved to 90% correct.

Sejnowski reports that in later (unpublished) experiments, a better rate of
generalization was achieved.  A window of 11 consecutive letters was used
instead of the 7 used in other experiments.  The network was trained on
18,000 words from the corpus until about 94% of the output phonemes were
correct by the best-guess criterion.  The remaining 2,000 words were then
presented without further training, and 92% of the output phonemes were
correct.

REFERENCES: 

1. Sejnowski, T.J., and Rosenberg, C.R. (1987).  "Parallel networks that
learn to pronounce English text" in Complex Systems, 1, 145-168.
