===============================================================================
  This is the UCI Repository Of Machine Learning Databases and Domain Theories
                           16 August 1991
         ics.uci.edu: /usr2/spool/ftp/pub/machine-learning-databases
        Site Librarian: Patrick M. Murphy (ml-repository@ics.uci.edu)
           Off-Site Assistant: David W. Aha (aha@cs.jhu.edu)
       76 databases and domain theories (12012K and 1 offline database)
===============================================================================

This directory contains data sets and domain theories (the latter have been
annotated as such in the following brief listing) that have been or can be
used to evaluate learning algorithms. Each data file (*.data) contains
individual records described in terms of attribute-value pairs.  The
corresponding *.names file contains voluminous documentation.  (Some files
_generate_ databases; they do not have *.data files.)
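
As an illustration of reading a *.data file in the standard format
described in Note 1 below (1 instance per line, commas between attribute
values, "?" for a missing value), here is a minimal Python sketch.  The
function name parse_records and the sample records are ours, made up for
illustration; the sketch does not handle the databases listed as exceptions
in Note 1.

```python
MISSING = "?"  # missing-value marker named in Note 1


def parse_records(text):
    """Split comma-separated records into lists, mapping "?" to None.

    Attribute values are left as strings; converting them to numbers is
    database-specific and should follow the corresponding *.names file.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        values = line.split(",")
        records.append([None if v == MISSING else v for v in values])
    return records


# Made-up sample records, not taken from any database in the repository.
sample = "a,1,?\nb,?,3\n"
for record in parse_records(sample):
    print(record)
```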

The contents of this repository can be remotely copied to other network
sites via ftp to ics.uci.edu.  Both the userid and password are "anonymous".  
These databases can be found by executing "cd pub/machine-learning-databases".

Notes:
 1. We're always looking for additional databases, which can be written to
    the sub-directory named "donations".  Please send yours, with
    documentation.  Thanks.  See DOC-REQUIREMENTS for suggested documentation
    procedures.  Presently, all databases except those with unusual formats
    have the following format: 1 instance per line, no spaces, commas
    separate attribute values, and missing values are denoted by "?".
    Exceptions: audiology, labor-negotiations, mechanical-analysis,
    spectrometer, university, and the databases in the "undocumented"
    sub-directory, which have not yet been documented carefully by us.
    Feel free to use them.

 2. Ivan Bratko requested that the databases he donated from the Ljubljana
    Oncology Institute (e.g., breast-cancer, lymphography, and primary-tumor)
    have restricted access. We are allowed to share them with academic
    institutions upon request. These databases (like several others) require
    that proper citations be made in published articles that use them.
    Citation requirements are in each database's corresponding *.doc file.

 3. CORRESPONDENTS lists our correspondents, perhaps including someone at your
    site who can provide you with these databases and related information.
    TRANSACTIONS is a correspondence log.  DATE-RECEIVED lists when each
    entry was added to this repository.

 4. An archive server may now be used to receive files in this repository
    via e-mail.  Installed on ics, it provides e-mail access to files in
    our anonymous ftp/uucp area (~ftp).  If people have no other access to
    our archives, you can tell them to send mail to:

	archive-server@ics.uci.edu

    Commands to the server may be given in the body.  Some commands are:

	help
	send <archive> <file>
	find <archive> <string>

    The help command replies with a useful help message.
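
As an illustration of Note 4, here is a minimal Python sketch that builds
(but does not send) a request message for the archive server, with one
command per line in the body.  The function name build_request and the
sender address are ours and purely hypothetical; only the server address
and the "help" command come from the note above.

```python
from email.message import EmailMessage

SERVER_ADDRESS = "archive-server@ics.uci.edu"  # address given in Note 4


def build_request(sender, commands):
    """Return a mail message whose body holds one server command per line."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = SERVER_ADDRESS
    msg.set_content("\n".join(commands) + "\n")
    return msg


# Hypothetical sender address; "help" asks the server for its help message.
request = build_request("user@example.edu", ["help"])
print(request)
```

Sending the message is left to whatever mail transport is available at
your site.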

If you publish material based on databases obtained from this repository,
then, in your acknowledgements, please note the assistance you received by
using this repository.  Thanks -- this will help others to obtain the same
data sets and replicate your experiments.  We suggest the following pseudo-APA
reference format for referring to this repository (LaTeX'd):

  Murphy,~P.~M., \& Aha,~D.~W. (1991). {\it UCI Repository of machine
  learning databases} [Machine-readable data repository]. Irvine, CA:
  University of California, Department of Information and Computer Science.

Patrick M. Murphy (Repository Librarian)
David W. Aha (Off-Site Assistant)
     
----------------------------------------------------------------------
Brief Overview of Databases and Domain Theories:

Quick Listing:
 1. annealing (David Sterling and Wray Buntine)
 2. audiology (Ray Bareiss and Bruce Porter, used in Protos)
 3. autos (Jeff Schlimmer)
 4. breast-cancer (Ljubljana Institute of Oncology, restricted access)
 5. bridges (Yoram Reich)
 6-13. chess
   1. Partial generator of Quinlan's chess-end-game data (kr-vs-kn) (Schlimmer)
   2. Shapiro's endgame database (kr-vs-kp) (Rob Holte)
   3-8. Six domain theories (Nick Flann)
 14. Ein-Dor and Feldmesser's cpu-performance database (David Aha)
 15. dgp-2 data generation program (Powell Benedict)
 16. Nine small EBL domain theories and examples in sub-directory ebl
 17. Evlin Kinney's echocardiogram database (Steven Salzberg)
 18. flags (Richard Forsyth)
 19. function-finding (Cullen Schaffer's 352 case studies)
 20. glass (Vina Spiehler)
 21. hayes-roth (from Hayes-Roth^2's paper)
 22-25. heart-disease (Robert Detrano)
 26. hepatitis (G. Gong)
 27. Image segmentation database (Carla Brodley)
 28. iris (R.A. Fisher, 1936)
 29. kinship (J. Ross Quinlan)
 30. labor-negotiations (Stan Matwin)
 31-32. led-display-creator (from the CART book)
 33. lenses (Cendrowska's database donated by Benoit Julien)
 34. letter-recognition database (created and donated by David Slate)
 35. liver-disorders (BUPA Medical's database donated by Richard Forsyth)
 36. logic-theorist (Paul O'Rorke)
 37. lymphography (Ljubljana Institute of Oncology, restricted access)
 38-39. mechanical-analysis (Francesco Bergadano)
     1. Original Mechanical Analysis Data Set
     2. PUMPS DATA SET
 40-41. molecular-biology 
     1. promoter sequences (Towell, Shavlik, & Noordewier, domain theory also)
     2. splice-junction sequences (Towell, Noordewier, & Shavlik, 
        domain theory also)
 42. mushroom (Jeff Schlimmer)
 43. othello domain theory (Tom Fawcett)
 44. Pima Indians diabetes diagnoses (Vince Sigillito) 
 45. primary-tumor (Ljubljana Institute of Oncology, restricted access)
 46. shuttle-landing-control (Bojan Cestnik)
 47-48. soybean (from Ryszard Michalski's groups)
 49. spectrometer (Infra-Red Astronomy Satellite Project Database, John Stutz)
 50. tic-tac-toe endgame database (Turing Institute, David W. Aha)
 51-58. thyroid-disease (Garavan Institute, J. Ross Quinlan)
 59-71. Undocumented databases: sub-directory undocumented
   1. flare database (Gary Bradshaw)
   2. Information retrieval (IR) data collection (offline, David Lewis)
   3. Economic sanctions database (domain theory included, Mike Pazzani)
   4. Latest version of the thyroid database (domain theory, J. Ross Quinlan)
   5. Cloud cover images (Philippe Collard)
   6. Horse colic (Mary McLeish & Matt Cecile)
   7. Generator for creating structured objects ("animals", John Gennari)
   8. DNA secondary structure (Qian and Sejnowski, donated by Vince Sigillito) 
   9. Ionosphere information (Vince Sigillito) 
  10. Nettalk data (Sejnowski and Rosenberg, taken from connectionist-bench)
  11. Sonar data (Gorman and Sejnowski, taken from connectionist-bench)
  12. Protein folding data (see connectionist-bench)
  13. Vowel data (Qian and Sejnowski, taken from connectionist-bench (see 9))
 72. university (Michael Lebowitz, donated by Steve Souders)
 73. voting-records (Jeff Schlimmer)
 74-75. waveform domain (taken from CART book)
 76. Zoological database (Richard Forsyth)

Quick Summaries of Each Database:
1. Annealing data (unknown source)
   -- Documentation: On everything except database statistics
   -- Background information on this database: unknown
   -- Many missing attribute values

2. Audiology data (Baylor College)
   -- Documentation: On everything except database statistics
   -- Non-standardized attributes (differ between instances)
   -- All attributes are nominally-valued

3. Automobile data (1985 Ward's Automotive Yearbook)
   -- Documentation: On everything except statistics and class distribution
   -- Good mix of numeric and nominal-valued attributes
   -- More than 1 attribute can be used as a class attribute in this database

4. Breast cancer database (Ljubljana Oncology Institute)
   -- Documentation: On everything except database statistics
   -- Well-used database
   -- 286 instances, 2 classes, 9 attributes + the class attribute

5. Pittsburgh Bridges Database (donated by Yoram Reich)
   -- Topic: design knowledge
   -- 108 instances, 13 attributes (7 specifications, 5 design description, 
      and 1 identifier)
   -- 2 versions of the data: original and numeric-discretized

6-13. Chess
     1. king-rook-vs-king-knight
        -- Documentation: limited (nothing on class distribution, statistics)
        -- This concerns king-knight versus king-rook end games
        -- The database generator is coded in Common Lisp
     2. king-rook-vs-king-pawn
        -- Documentation: sufficient
        -- This concerns king-rook versus king-pawn end games
        -- Originally described by Alen Shapiro 
     3-8. Six domain theories donated by Nick Flann 
        -- In the "domain-theories" sub-directory
        -- Coded in a dialect of Prolog
        -- They all generate legal moves of chess
        -- I haven't yet touched Nick's documentation on them (See README)

14. Computer hardware described in terms of cycle time, memory size, etc.
   and classified in terms of relative performance capabilities (CACM
   4/87)
   -- Documentation: complete
   -- Contains integer-valued concept labels
   -- All attributes are integer-valued

15. The Second Data Generation Program - DGP/2 
   -- Generates instances around peaks and allows for specification of the 
      mean and standard deviations in the normally distributed data.
   -- Generates application domains based on specific parameters: number of 
      features, and proportion of positive to negative examples.
   -- Allows for variations in the number of instances, the range of feature 
      values, the number of peaks, the percent of positive instances desired 
      and a radius around the peaks that these instances fall within.

16. Nine simple small EBL domain theories and examples in sub-directory ebl
   1. cup
   2. deductive.assumable (contains three domain theories)
   3. emotion
   4. ice
   5. pople
   6. safe-to-stack
   7. suicide

17. Echocardiogram database (Reed Institute, Miami)
   -- Documentation: sufficient
   -- 13 numeric-valued attributes
   -- Binary classification: patient either alive or dead after survival period

18. Flags database (Collins Gem Guide to Flags, 1986)
    -- 194 instances, mixed numeric- and nominal-valued attributes
    -- Information on countries, colors of flag components, etc.
    -- donated by Richard S. Forsyth, creator of PC/BEAGLE

19. 352 Studies in Function-Finding (donated by Cullen Schaffer)
    -- 352 small "databases" (cases) of bivariate numeric data sets
    -- Collected mostly from investigations in physical science
    -- Intention: Evaluation of function-finding algorithms

20. Glass Identification database (USA Forensic Science Service)
    -- Documentation: completed
    -- 6 types of glass 
    -- Defined in terms of their oxide content (e.g., Na, Fe, K)
    -- All attributes are numeric-valued 

21. Hayes-Roth and Hayes-Roth's database
    -- Described in their 1977 paper
    -- Topic: human subjects study

22-25. Heart Disease databases (Sources listed below)
      -- Documentation: extensive, but statistics and missing attribute
         information not yet furnished (perhaps later)
      -- 4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach
      -- 13 of the 75 attributes were used for prediction in 2 separate 
         tests, each of which achieved approximately 75%-80% classification
         accuracy
      -- The chosen 13 attributes are all continuously valued

26. Hepatitis database (G.Gong: CMU)
    -- Documentation: incomplete
    -- 155 instances with 20 attributes each; 2 classes
    -- Mostly Boolean or numeric-valued attribute types

27. Image segmentation database (Carla Brodley: UMass)
    -- Documentation status: Skimpy
    -- Not previously used in the ML literature as of 8/1991
    -- Image data described by high-level numeric-valued attributes, 7 classes
   
28. Iris Plant database (Fisher, 1936)
   -- Documentation: complete
   -- 3 classes, 4 numeric attributes, 150 instances 
   -- 1 class is linearly separable from the other 2, but the other 2 are
      not linearly separable from each other (simple database)

29. Kinship database (relational, Hinton 1986 & Quinlan 1989)
    -- 24 individuals, 12 relations 
    -- 104 instances derivable 
    -- Case studies have been reported by both authors

30. Labor relations database (Collective Bargaining Review)
    -- Documentation: no statistics
    -- Please see the labor directory for more information

31-32. LED display domains (Classification and Regression Trees book)
    -- Documentation: sufficient, but missing statistical information
    -- All attributes are Boolean-valued
    -- Two versions: 7 and 24 attributes
    -- Optimal Bayes rate known for the 10% probability of noise problem
    -- Several ML researchers have used this domain for testing noise tolerance
    -- We provide here 2 C programs for generating sample databases
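
As an illustration of the 7-attribute LED domain, here is a minimal Python
sketch.  It is not one of the C programs provided here.  It assumes the
usual seven-segment digit patterns, a segment ordering of our own choosing
(top, upper-left, upper-right, middle, lower-left, lower-right, bottom),
and that "10% probability of noise" means each attribute is inverted
independently with probability 0.1.  The names SEGMENTS and led_instance
are ours.

```python
import random

# Seven-segment patterns per digit.  Assumed segment order:
# top, upper-left, upper-right, middle, lower-left, lower-right, bottom.
SEGMENTS = {
    0: (1, 1, 1, 0, 1, 1, 1),
    1: (0, 0, 1, 0, 0, 1, 0),
    2: (1, 0, 1, 1, 1, 0, 1),
    3: (1, 0, 1, 1, 0, 1, 1),
    4: (0, 1, 1, 1, 0, 1, 0),
    5: (1, 1, 0, 1, 0, 1, 1),
    6: (1, 1, 0, 1, 1, 1, 1),
    7: (1, 0, 1, 0, 0, 1, 0),
    8: (1, 1, 1, 1, 1, 1, 1),
    9: (1, 1, 1, 1, 0, 1, 1),
}


def led_instance(rng, noise=0.10):
    """Return 7 Boolean attributes followed by the digit (class label).

    Each attribute is flipped independently with probability `noise`
    (our reading of the noise model; an assumption).
    """
    digit = rng.randrange(10)
    attrs = [bit ^ (rng.random() < noise) for bit in SEGMENTS[digit]]
    return attrs + [digit]


rng = random.Random(1991)  # fixed seed so runs are repeatable
for _ in range(5):
    # Print in the repository's usual comma-separated format.
    print(",".join(str(v) for v in led_instance(rng)))
```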

33. Lenses: Fitting contact lenses (donated by Benoit Julien)
    -- Small database with few attributes 
    -- attributes are either binary- or ternary-valued
    -- 3 classes: hard contact lenses, soft contact lenses, or neither

34. David Slate's letter recognition database (real)
    -- 20,000 instances (712565 bytes) (.Z available)
    -- 17 attributes: 1 class (letter category) and 16 numeric (integer)
    -- No missing attribute values

35. Liver-disorders
    -- BUPA Medical Research Ltd. database donated by Richard S. Forsyth
    -- 7 numeric-valued attributes
    -- 345 instances (male patients)

36. Logic-theorist
    -- Paul O'Rorke's work, as described in Machine Learning

37. Lymphography database (Ljubljana Oncology Institute)
    -- Documentation: incomplete
    -- CITATION REQUIREMENT: Please use (see the documentation file)
    -- 148 instances; 19 attributes; 4 classes; no missing data values

38-39. Mechanical analysis (Donated by members of the Universita di Torino)
   1.  -- Fault diagnosis problem of electromechanical devices
       -- ENIGMA system application described in proceedings of MLC-1990
       -- Each of the 209 instances is described by a different set of 
          components
   2.  -- PUMPS DATA SET
       -- Newer version of above dataset with domain theory and results

40-41. Molecular Biology directory
    1. Promoter gene sequences
       -- Donated by Jude Shavlik; See AAAI-90 Towell, Shavlik, & Noordewier
       -- E. Coli promoter gene sequences (DNA) with partial domain theory
       -- 106 instances, each predictor attribute takes on one of four values
       -- 50% positive instances
    2. Splice-junction gene sequences
       -- Donated by Geoffrey Towell, Noordewier, & Shavlik.
       -- categories "ei" and "ie" include every "split-gene"
          for primates in Genbank 64.1
       -- non-splice examples taken from sequences known not to include
          a splicing site
       -- 3190 instances with classes "ei" (25%), "ie" (25%) and 
          Neither (50%). 
       -- Domain theory included.

42. Mushrooms in terms of their physical characteristics and classified
    as poisonous or edible (Audubon Society Field Guide)
    -- Documentation: complete, but missing statistical information
    -- All attributes are nominal-valued
    -- Large database: 8124 instances (2480 missing values for attribute #12)

43. Othello Domain Theory: used in research to generate features for an
    inductive learning system
    -- Written and donated by Tom Fawcett
    -- Coded in Prolog

44. Pima Indians Diabetes Database (National Institute of Diabetes and
    Digestive and Kidney Diseases)
    -- Binary classes (tested positive or negative for diabetes)
    -- All 8 attributes are numeric-valued 
    -- 768 instances

45. Primary Tumor database (Ljubljana Oncology Institute)
    -- Documentation: incomplete
    -- CITATION REQUIREMENT: Please use (see the documentation file)
    -- 339 instances; 18 attributes; 22 classes; lots of missing data values

46. Shuttle Landing Control database
    -- tiny, 15-instance database with 7 attributes per instance; 2 classes
    -- appears to be well-known in the decision-tree community

47-48. Soybean data (Michalski)
   -- Documentation: Only the statistics are missing
   -- (2 sizes)
   -- Michalski's famous soybean disease databases

49. Low resolution spectrometer data (IRAS data -- NASA Ames Research Center)
    -- Documentation: no statistics nor class distribution given
    -- LARGE database...and this is only 531 of the instances
    -- 98 attributes per instance (all numeric)
    -- Contact NASA-Ames Research Center for more information

50. Tic-Tac-Toe Endgame database (David W. Aha, Turing Institute)
    -- Documentation complete as of Summer 1991
    -- 958 instances, all attributes can take on 1 of 3 possible values
    -- Binary classification task (i.e., "win for x")
    -- A paradigmatic domain for constructive induction studies
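
As an illustration, the instance count above can be reproduced by
enumerating the distinct board configurations at the end of tic-tac-toe
games.  This minimal Python sketch assumes that "x" moves first and that a
game stops as soon as either player completes a line; the symbol "b" for a
blank square and the function names are ours.

```python
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board):
    """Return "x" or "o" if that player has three in a line, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != "b" and board[a] == board[b] == board[c]:
            return board[a]
    return None


def terminal_boards():
    """Collect every distinct end-of-game board reachable when x moves first."""
    found = set()

    def play(board, player):
        if winner(board) or "b" not in board:
            found.add(board)  # game over: a win or a full board
            return
        for i, cell in enumerate(board):
            if cell == "b":
                nxt = board[:i] + (player,) + board[i + 1:]
                play(nxt, "o" if player == "x" else "x")

    play(("b",) * 9, "x")
    return found


boards = terminal_boards()
x_wins = sum(1 for b in boards if winner(b) == "x")
print(len(boards), "end-of-game boards;", x_wins, "are wins for x")
```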

51-58. Thyroid patient records classified into disjoint disease classes 
       (Garavan Institute)
       -- Documentation: as given by Ross Quinlan
       -- 6 databases from the Garavan Institute in Sydney, Australia
       -- Approximately the following for each database:
          -- 2800 training (data) instances and 972 test instances
          -- plenty of missing data
          -- 29 or so attributes, either Boolean or continuously-valued
       -- 2 additional databases, also from Ross Quinlan, are also here
          -- hypothyroid.data and sick-euthyroid.data
          -- Quinlan believes that these databases have been corrupted
          -- Their format is highly similar to the other databases

59-71. Undocumented databases: see the sub-directory named undocumented
   1. Bradshaw's flare data
   2. David Lewis's information retrieval (IR) data collection (offline)
   3. Mike Pazzani's economic sanctions database
   4. Ross Quinlan's latest version of the thyroid database
   5. Philippe Collard's database on cloud cover images
   6. Mary McLeish & Matt Cecile's database on horse colic
   7. John Gennari's program for creating structured objects ("animals")
   8. Vince Sigillito's database on dna secondary structure
   9. Vince Sigillito's database on ionosphere information
  10. Nettalk data (see connectionist-bench)
  11. Sonar data (see connectionist-bench)
  12. Protein folding data (see connectionist-bench)
  13. Vowel data (see connectionist-bench)

72. University data (Lebowitz)
    -- Documentation: scant; we've left it in its original (LISP-readable) form
    -- 285 instances, including some duplicates
    -- At least one attribute, academic-emphasis, can have multiple values
       per instance
    -- The user is encouraged to pursue the Lebowitz reference for more 
       information on the database

73. Congressional voting records classified into Republican or Democrat (1984
    United States Congressional Voting Records)
    -- Documentation: completed
    -- All attributes are Boolean valued; plenty of missing values; 2 classes
    -- Also, there is a 2nd, undocumented database containing 1986 voting 
       records here (documentation to follow).

74-75. Waveform data generator (Classification and Regression Trees book)
       -- Documentation: no statistics
       -- CART book's waveform domains
       -- 21 and 40 continuous attributes respectively
       -- difficult concepts to learn, but known Bayes optimal classification
          rate of 86% accuracy

76. Richard Forsyth's zoological database (artificial)
    -- 7 classes of animals 
    -- 17 attributes (besides name), 15 Boolean and 2 numeric-valued
    -- No missing attribute values



