NAME: NetTalk Corpus

SUMMARY: This is an updated and corrected version of the data set used by
Sejnowski and Rosenberg in their influential study of speech generation
using a neural network [1].  The file "nettalk.data" contains a list of
20,008 English words, along with a phonetic transcription for each word.
The task is to train a network to produce the proper phonemes, given a
string of letters as input.  This is an example of an input/output mapping
task that exhibits strong global regularities, but also a large number of
more specialized rules and exceptional cases.

SOURCE: The data set was contributed to the benchmark collection by Terry
Sejnowski, now at the Salk Institute and the University of California at
San Deigo.  The data set was developed in collaboration with Charles
Rosenberg of Princeton.  Approximately 250 person-hours went into creating
and testing this database.

COPYRIGHT STATUS: The data file carries the following copyright notice:

Copyright (C) 1988 by Terrence J. Sejnowski.  Permission is hereby given to
use the included data for non-commercial research purposes.  Contact The
Johns Hopkins University, Cognitive Science Center, Baltimore MD, USA for
information on commercial use.

MAINTAINER: Scott E. Fahlman

PROBLEM DESCRIPTION:

The data set is in the benchmark directory in file "nettalk.data".  A
description of the format used for the phonetic transcription is included
at the end of this file.

METHODOLOGY: 

This data set can be used in a number of different ways to test learning
speed, quality of ultimate learning, ability to generalize, or combinations
of these factors.

Sejnowski and Rosenberg used the following experimental setup:

The input to the network is a series of seven consecutive letters from one
of the training words.  The central letter in this sequence is the
"current" one for which the phonemic output is to be produced.  Three
letters on either side of this central letter provide context that helps to
determine the pronunciation.  Of course, there are a few words in English
for which this local seven-letter window is not sufficient to determine the
proper output.  For the study using this "dictionary" corpus, individual
words are moved through the window so that each letter in the word is seen
in the central position.  Blanks are added before and after the word as
needed.  Some words appear more than once in the dictionary, with different
pronunciations in each case; only the first pronunciation given for each
word was used in this experiment.

A unary encoding is used.  For each of the seven letter positions in the
input, the network has a set of 29 input units: one for each of the 26
letters in English, and three for punctuation characters.  Thus, there are
29 x 7 = 203 input units in all.

The output side of the network uses a distributed representation for the
phonemes.  There are 21 output units representing various articulatory
features such as voicing and vowel height.  Each phoneme is represented by
a distinct binary vector over this set of 21 units.  In addition, there are
5 output units that encode stress and syllable boundaries.

Standard back-propagation was used, with update of the weights after the
error gradient has been computed by back-propagation of errors for all the
letter positions a single word.  The number of hidden units in the network
was varied from one experiment to another, from 0 to 120.  Each layer was
totally connected to the next; in the case of 0 hidden units, the input
units were directly connected to the outputs.

The weight-update formulas used in the Sejnowski and Rosenberg study were
slightly different from the standard form; see the paper for a description
of the momentum and learning rate parameters used.  The network weights
were initialized with random values in the range -0.3 to +0.3.  A
sum-of-squares error measure was used, with errors of magnitude less than
0.1 being treated as 0.0.

RESULTS: 

In [1], Sejnowski and Rosenberg present full error curves for a number of
experiments using this corpus.  Here we will present only a brief summary
of these results.  The results reported here all use the "best guess"
criterion: an output is treated as correct if it is closer (smallest angle)
to the correct output vector than to any other phoneme output vector.

A subset, consisting of the 1000 most common English words, was tested with
0, 15, 30, 60, and 120 hidden units.  With 0 hidden units, the best
performance achieved was about 82% correct by the "best guess" criterion.
The rate of learning and final performance improved steadily with
increasing numbers of hidden units, up to 98% correct with 120 hidden
units.

(The original "list of 1000 most common English words" is not available,
but a reconstruction of this list is in file "nettalk.list" in the
benchmark data base.  This should be very close to the original.)

This 120-hidden-unit network scored about 80% after training on 5000
word-presentations (i.e. five times through the 1000-word corpus), and the
98% level was reached after 30,000 presentations.

This pre-trained network with 120 hidden units was then tried on the full
corpus of 20,000 words.  Without further training, the correct output was
generated in 77% of the cases.  After 5 passes through the larger corpus,
performance improved to 90% correct.

Sejnowski reports that in later (unpublished) experiments, a better rate of
generalization was achieved.  A window of 11 consecutive letters was used
instead of the 7 used in other experiments.  The network was trained on
18,000 words from the corpus until about 94% of the output phonemes were
correct by the best-guess criterion.  The remaining 2,000 words were then
presented without further training, and 92% of the output phonemes were
correct.

REFERENCES: 

1. Sejnowski, T.J., and Rosenberg, C.R. (1987).  "Parallel networks that
learn to pronounce English text" in Complex Systems, 1, 145-168.

***************************************************************************

DESCRIPTION OF DATA FORMAT:

The pronouncing dictionary was created to study the translation process
between written English, using graphemes or letters as units, and spoken
English, using phonemes as units. The dictionary includes 20008 aligned
letter and phonetic representations with stresses.

The dictionary contains four tab separated fields of information for each
word.  The fields are:

	1) a letter representation
	2) a phonemic representation
	3) stress and syllabic structure
	4) an integer indicating foreign and irregular words

**** Phonemic Codes ****

The phonemic coding scheme uses 50 phonemes. They are compatible with the
DECtalk speech synthesizer which uses a modified ARPAbet system.

NETalk		Example words 
phoneme		that use the phoneme
symbol

  a		wad,dot,odd
  b		bad
  c		o in "or", au in "caught", aw in "awe"
  d		add
  e		angel,blade,way
  f		farm
  g	  	gap
  h		hot,who
  i		long e as in "eve", theme,bee
  k	 	cab,keep
  l		lad
  m		man,imp
  n		gnat,and
  o		only,own
  p		pad,apt
  r		rap
  s		cent,ask
  t		tab
  u		boot,ooze,you
  v		vat
  w		we,liquid
  x		a in "pirate", o in "welcome"
  y		yes,senior
  z		zoo,goes
  A		long i as in "ice", height,eye
  C		chart,cello
  D		the,mother
  E		many,end,head
  G		length,long,bank
  I		i in "give", u in "busy", ai in "captain"
  J		jam,gem
  K		anxious,sexual
  L		evil,able
  M		chasm
  N		shorten,basin
  O		oil,boy
  Q		quilt
  R		honer,after,satyr
  S		ocean,wish
  T		thaw,bath
  U		wood,could,put
  W		out,towel,house
  X		mixture,annex
  Y		use,feud,new
  Z		s in "usual", s in "vision "
  @		cab,plaid
  !		z in "nazi", zz in "pizza"
  #		x in "auxiliary", x in "exist"
  *		wh in "what"
  ^		u in "up", o in "son", oo in "blood"
  +             oi in "abattoir", oi in "mademoiselle"

**** Stress and Syllables *****

Stress and syllable information is coded in the following manner.  Each
phoneme position is coded with one of the following symbols.

	'>'	indicates a consonant prior to a syllable nucleus.
	'<'	indicates a consonant or vowel following the first
		vowel of the syllable nucleus.
	'0'	indicates the first vowel in the nucleus of an unstressed
		syllable.
	'2'	indicates the first vowel in the nucleus of a syllable
		receiving secondary stress.
	'1'	indicates the first vowel in the nucleus of a syllable
		receiving primary stress.

**** Foreign words and Irregular Phoneme-Grapheme Correspondences ****

The last field in each line indicates whether the word uses foreign or
irregular phoneme to grapheme correspondences.

	'1'	 indicates an irregular word.
	'2'	 indicates a foreign word.
	'0'	 is used for all other words.

Examples:

Foreign words -
							phoneme->grapheme
							option not found
word		phonemes	stress and	code	in other words
				syllable info	for
						foreign &
						irregular words

abattoir	@bxt-+-r	1<0<>2<<	2	/+/ -> 'oi'
beaujolais	b-o-Zxle--	>2<<>0>1<<	2	/Z/ -> 'j' and
							/e/ -> 'ais'
tortilla	tcrti--a	>0<>1>>0	2	/i/ -> 'ill'
pizza		pi!-x	 	>1<>0		2	/!/ -> 'zz'
fjord		ficrd		>01<<		2	/i/ -> 'j'

These words are all marked with '2's in their record to indicate a foreign
word.  

Irregular words -

asthma		@z--mx		1<>>>0		1
caesura		sI-ZUrx		>0<>1<0		1
hallelujah	h@l-xluyx-	>2<>0>1>0<	1
isthmus		Is--mxs		1<>>>0<		1
yeoman		y-omxn	>	1<>0<		1

These words contain irregular phoneme-grapheme correspondences. They are
marked with '1's at the end of their fields.  Seventy-one of the dictionary
words use unique phoneme-letter pairs. These are either silent letters,
where some letters are not pronounced, or unique spellings, where some
letters are pronounced in irregular ways.

**** Dictionary Sources *****

Initial phoneme codes were generated programmatically from the word spelling
and then were extensively checked and corrected by matching phonemes used by
Hanna et al to those used in the Webster's Pocket Dictionary and Webster's
New Collegiate Dictionary.  (Hanna,R.R., Hanna, J.S., Hodges, R.E. & Rudorf,
E.H.  (1966). Phoneme-grapheme correspondences as cues to spelling
improvement, U.S. Department of Health, Education and Welfare, Office of
Education, Washington, DC: U.S. Government Printing Office.) 
 
The phonetic codings were "spoken" thru a DECtalk speech synthesizer to
insure accurate pronunciation. The original phoneme lists were expanded and
changed to include all the necessary correspondences for pronunciation and
spelling of the words in the corpus. 

**** One-to-One Correspondence Constraint ****

A one-to-one phoneme to letter correspondence was needed for NETtalk, a
backpropagation network which learns to map strings of letters into strings
of phonemes. This constraint forced the compromises listed below.  Two major
types of decisions had to be made to insure that phoneme-grapheme
correspondences were one-to-one.  The decisions concerned:

	1) How to handle phonemes that were written with more than one letter
And 
	2) How to handle letters that were spoken using more than one phoneme

**** Placement of Phonemes that Correspond to Multiple Letters ****

Phonemes may correspond to as many as 5 letters. When this happens, one
letter is assigned to the phoneme and the remaining letters are represented
in the phonetic code by hyphens.
 
Usually, when a phoneme corresponds to a letter cluster, the phoneme is most
closely associated with the first letter of the cluster, and is,
accordingly, placed in the same position as the first letter.  Hyphens are
placed in the remaining positions.  In some cases there is a letter common
to many or all of the letter clusters associated with a particular phoneme,
and the letter does not come first in some of the clusters.  In these cases
the phoneme is placed in a position corresponding to that letter, and the
other letter positions are filled with hyphens.  For example, the sound /n/
is associated with the letters 'nd.' This is coded as /n-/, with the /n/ in
the same position as the letter 'n,' and the hyphen in the position of the
'd.' The /n/ sound is also written as 'gn,' 'sn' and 'kn'. Here, the /n/
phoneme is associated with the common 'n,' not the differing initial
letters, so all of these graphemes are coded as /-n/, where the /n/ is
placed opposite the second letter 'n' instead of the letter that varies.

**** Phonemes with no Letters to Match **** 

Some words used phonemes that were not matched by any letters in the word.
The word 'eighth' is an example.  'Eighth' is phonemically coded by Webster's
as /etT/, but there is no clear letter for the /t/.  The dictionary uses
/e---T-/, dropping the /t/.

Another example concerns those instances when /y/ plus another vowel sound
are written with a single vowel in foreign words.  The double phoneme was
replaced with the second vowel sound based on the DECtalk pronunciation of
the double phoneme, the /y/ alone, and the second vowel alone.

If a double phoneme corresponding to a single grapheme appeared in several
words, a new phoneme symbolizing the double sound was created.  Five such
combination phonemes were used.

They are /X/ = /ks/
         /K/ = /kS/
         /#/ = /gz/
	 /Z/ = /zh/
     and /!/ = /tz/

/X/ /K/ and /#/ are all used to represent 'x'. 
 
/Z/ is used by single letter graphemes 'z' 'g' and 's' for 
the double /zh/ sound.  
 
The last combination phoneme, /!/ = /tz/, is used for the foreign
'z'. 

**** Foreign Words ****

Some words in the dictionary use non-English phoneme-letter pairs. New
phoneme to letter correspondences and entirely new phonemes were added to
accommodate coding these words.  The new phonemes are /!/ = /tz/, discussed
above, and /+/, which is approximated by /wa/ and represents a French vowel
sound.  

