Message-Id: <9310142218.AA29810@CSLI.Stanford.EDU>
From: asmeaton@compapp.dcu.ie (by way of yarowsky@unagi.cis.upenn.edu)
Subject: TREC-2 report and TREC-3 call for participation 
To: empiricists@CSLI.Stanford.EDU
Date: Thu, 14 Oct 1993 15:18:10 -0700
Sender: roscheis@CSLI.Stanford.EDU

Report on TREC-2 (Text REtrieval Conference)
30 August - 2 September, Gaithersburg, USA

by

The TREC-2 Program Committee


INTRODUCTION

As part of an effort to encourage research into text retrieval from
large and diverse document collections, the first Text REtrieval Conference
(TREC-1) was held in Gaithersburg, Md., in 1992.  This forum provided
researchers with a large collection of textual materials, queries and
associated relevance judgements, and a uniform scoring procedure.  The
conference, co-sponsored by the U.S. Advanced Research Projects Agency
(ARPA) and the U.S. National Institute of Standards and Technology
(NIST), was a benchmarking exercise which involved gauging the relative
effectiveness of many different approaches to the indexing and retrieval
of large volumes of text.  A second conference/workshop (TREC-2) was
held in early September 1993 and was the culmination of the experimental
runs carried out at over 31 information retrieval research sites across
the world.

A call for participation in TREC-2 was drafted and circulated (mostly
electronically) during November and December 1992, with a closing date of
December 5th, 1992 for statements of intent to participate.  A total of 39
groups submitted an initial request to participate.  The program
committee selected 20 of these as full participants and offered the remainder
participation in the benchmarking but with a poster rather than a paper
presentation at the workshop.  Of the 20 full participants selected,
19 made presentations, plus 4 presentations from TIPSTER groups
(University of Massachusetts, Syracuse University, HNC and BBN), some
of whom also partook in the TREC-2 benchmarking.  There were 5 posters
at the workshop representing groups who had also run the TREC-2 benchmark.

The 19 full TREC-2 participants and TIPSTER groups with their respective
approaches to information retrieval were (U.S. unless stated)

Bellcore - SMART document preprocessing and Latent Semantic Indexing
Carnegie Mellon University - NLP-based indexing
CITRI, Royal Melbourne Institute of Technology (Australia) - document structure
       and efficiency issues
City University, London (UK) - variant of probabilistic model and
       probabilistic weighting functions
Cornell University - Vector Space Model/SMART system
Environment Research Institute of Michigan - n-gram indexing/retrieval
GE Research and Development Center - building complex boolean queries
HNC - TIPSTER group learning a reduced dimensionality index space
Institute for Decision System Research - Bayesian networks
New York University - NLP-based indexing by word pairs
Queens College (CUNY) - variant of probabilistic model using PIRCS
       system and spreading activation
Rutgers University - combination of results of different retrieval strategies
Siemens Corporate Research Inc. - SMART retrieval & query expansion using WordNet
Swiss Federal Institute of Technology (ETH) (Switzerland) - efficient
       implementation of RSV metric
Thinking Machines Corporation - various vector space model experiments, 
       concerned with efficiency of execution
TRW Systems Development Division - hardware filtering
Universitaet Dortmund (Germany) & Cornell University - variant of
       probabilistic model with learning of parameter values
University of California, Berkeley - variant of probabilistic model with
       logistic regression for probability estimates
University of Massachusetts - TIPSTER group using a Bayesian Inference
       Network approach
Verity, Inc. - machine learning for TOPIC IR system
VPI&SU - combining results of multiple searches

Poster groups included UCLA, ConQuest Software, Mead Data Central,
PRC, University of Central Florida, University of Illinois at Chicago,
Systems Environment Corporation, Advanced Decision Systems and Dalhousie
University.

This list of participants represents a good mix of academic and industrial
interest in the project and, more importantly, a good mix of approaches to
indexing and retrieval.  The TREC experiment reaches a community of
information retrieval researchers and developers and provides an exploratory
benchmark for information retrieval techniques.  It is important to realise
from the start, however, that TREC-2, like the first TREC in 1992, was not a
competition.  Indeed there are so many variables in running the experiment
and so many caveats about the evaluation methodologies used, that it is
very difficult to compare even two systems directly and impossible to come
up with a "ranking" of approaches.  Most participants continued to
develop their approaches and refine their systems after the official deadline
for submission of results and most achieved further improvements in
retrieval effectiveness to present at the workshop.  Although comparisons
across systems are very difficult to interpret, experiments within a
given system are quite informative and many groups reported results of
indexing and retrieval experiments conducted against their own baseline
performance.  Only when the official proceedings are published by NIST
in Spring 1994 will we have a stable picture of the performances of
different systems.


LOGISTICS

As with the first TREC, participants in TREC-2 worked with approximately
one million documents (2 gigabytes of text data), retrieving lists of
documents that could be considered relevant to each of 50 topics in what
was called "ad hoc" querying.  In a second information retrieval paradigm,
50 retrieval topics were known in advance and new documents were to be
matched against the 50 standard queries, simulating a
"routing" operation.  In both cases the queries were not really queries
at all but carefully honed user need statements and were thus extensive
descriptions of the topic of interest. Participating groups were allowed to
do completely automatic query construction, manual query formulation or to
simulate relevance feedback.  The test data used consisted of newspaper
stories (Wall Street Journal and San Jose Mercury News), Associated Press
Newswire articles, U.S. patent applications and articles from the Federal
Register, the Ziff database and the U.S. Department of Energy,
all in all, a deliberately heterogeneous mix of document types and
document lengths.

The test data was distributed by NIST and was installed by the participants
at their research sites, in addition to some test topics and relevance
assessments.  Participating groups fine-tuned their retrieval strategies
and were then sent the new topics for ad-hoc querying and 1 gigabyte of
new test data for the pre-defined routing queries.  The ranking results
from each site were then sent back to NIST who pooled together the
rankings and had teams of assessors manually evaluate the relevance of
each document appearing in the top 100 documents from at least one site,
for each of the 50 ad hoc and 50 routing queries.
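
The pooling step described above can be sketched roughly as follows.  This
is an illustrative Python sketch only, not NIST's actual procedure or
software: the function name build_pool, the data layout (run name mapped to
topic number mapped to a ranked list of document identifiers) and the toy
identifiers are all assumptions made for the example.

```python
from collections import defaultdict

def build_pool(runs, depth=100):
    """Union of the top `depth` documents per topic over all submitted runs.

    `runs` maps a run name to {topic_id: [doc_id, ...]} in rank order.
    This layout is hypothetical and chosen only for illustration.
    """
    pool = defaultdict(set)
    for ranking_by_topic in runs.values():
        for topic_id, ranked_docs in ranking_by_topic.items():
            # A document enters the pool if any one run ranks it high enough.
            pool[topic_id].update(ranked_docs[:depth])
    return dict(pool)

# Toy example with a pool depth of 2 instead of 100.
runs = {
    "site_a": {51: ["d1", "d2", "d3"]},
    "site_b": {51: ["d2", "d4", "d1"]},
}
pool = build_pool(runs, depth=2)
```

Only documents in the pool are judged by the assessors, so the size of the
pool depends on how much the top rankings of the different runs overlap.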

A total of 41 different ad hoc runs (from 25 groups) and 40 different routing
runs (from 23 groups) were pooled to generate the set for manual relevance
assessment.  As was expected, different systems retrieved different sets
of documents in their top rankings but there was a much higher overlap
in retrieved document sets as compared with the first TREC, possibly
because the systems in TREC-2 were better.

Each participant in TREC-2 set their own baseline effectiveness levels using
the trial queries and relevance assessments provided at the start of TREC-2
and then either improved upon or fell below those levels on the
official runs.  No relevance assessments were available at the time the
official runs were being completed and there was little time given in which
to complete these runs, so there wasn't much tinkering that could be done
in the time allowed.  This ensured that no system had an unfair advantage over
any other.


EVALUATION

The issue of evaluation has always been one of debate in information
retrieval and within TREC there is scope for even more discussion than
normal.  For the "official" results submitted by each group, NIST calculated
a range of statistical performance figures including averaged recall-precision,
recall-fallout, and precision figures at 5, 10, 15, 20, 30, 100, 200, 500 and
1000 documents.  A major improvement in the evaluation of TREC-2 over the
first TREC is that the top 1000 ranked documents per topic, and not just the
top 200, were submitted by the groups.  The way in which evaluation figures
were calculated and averaged was also improved.
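
As a rough illustration of two of the figures mentioned above, the Python
sketch below computes precision at a document cutoff and a non-interpolated
average precision for a single topic.  The exact formulas used by NIST are
not given in this report, so the definitions, the function names and the
toy ranking here are assumptions rather than the official scoring code.

```python
CUTOFFS = (5, 10, 15, 20, 30, 100, 200, 500, 1000)

def precision_at(ranked, relevant, k):
    """Fraction of the top k ranked documents that are relevant."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant
    document appears; relevant documents never retrieved count as zero."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Toy ranking for one topic (hypothetical document identifiers).
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}
p5 = precision_at(ranked, relevant, 5)      # 2 relevant in the top 5
ap = average_precision(ranked, relevant)    # (1/1 + 2/3) / 3
```

Averaging such per-topic figures over all 50 topics gives the single
summary numbers whose limitations are discussed next.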

There is a real problem with using the standard measures for information
retrieval evaluation on something like TREC;  looking at averages of
averages is very superficial and hides most of what is actually going on
with respect to performance.  In TREC-2 it was possible to do some failure
analysis on the data before the workshop and this showed some interesting
features like the fact that "long" documents were being retrieved by most
approaches but were not proving relevant and that systems which yielded
poor levels of precision averaged over 50 topics actually did well, often
best, for some of those 50 topics!  In fact there were 21 groups for which
there was at least one topic on which their system had the best average
precision. The message to be found here is that there is much work to be
done on the data generated by different retrieval approaches to try and
explain some of the results.
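
The per-topic observation above can be made concrete with a small Python
sketch.  The scores below are invented purely for illustration, and the
function names are hypothetical; the point is only that a system with a
lower mean can still be the best system on an individual topic.

```python
def systems_best_on_some_topic(scores):
    """Return the systems with the highest score on at least one topic.

    `scores` maps a system name to {topic_id: average precision}.
    Ties go to whichever system comes first in the dictionary.
    """
    topics = set()
    for per_topic in scores.values():
        topics.update(per_topic)
    winners = set()
    for topic in topics:
        best = max(scores, key=lambda name: scores[name].get(topic, 0.0))
        winners.add(best)
    return winners

def mean_score(per_topic):
    """Average of a system's per-topic scores."""
    return sum(per_topic.values()) / len(per_topic)

# Invented scores: sys_b is worse on average but wins topic 3.
scores = {
    "sys_a": {1: 0.50, 2: 0.40, 3: 0.45},
    "sys_b": {1: 0.10, 2: 0.05, 3: 0.60},
}
winners = systems_best_on_some_topic(scores)
```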


THE WORKSHOP

The TREC-2 workshop in September 1993 was open only to participating
systems and government sponsors and was even more open and sharing and
workshop-like than TREC-1.  Each participant presented an overview of
their system, and the performance figures, measured using the evaluation
methods outlined earlier, were available for all official runs for all
systems.  This
meant that there were many results being presented and there was an
element of information overload in trying to digest and assimilate so much
raw data.  With most participants performing experimental runs and
presenting results obtained literally a couple of days before the workshop
this was really leading edge stuff.

There are many different approaches to information retrieval
represented among the TREC-2 participants, which can be grouped roughly into
probabilistic models and variants thereof, vector space approaches,
NLP-based, Bayesian networks, query expansion and dimensionality reduction,
boolean query construction, combination of results of different retrieval
strategies, explorations into document structuring, ... as well as some
outliers like retrieval using n-grams, word pairs, hardware approaches, and
some work on efficiency issues.

Generalising results across systems and across approaches is difficult but
some trends have already emerged.  Simple systems which do simple things
are still doing really well and the more complex ones are catching up and in
some cases surpassing the simple approaches.  This result was expected
after the first TREC where simple systems did well and more complex ones
generally did not.  Term weighting has also emerged as especially
important.  There is also a large spread of levels of
effectiveness among systems.  An irritating aspect of information
retrieval evaluation using the standard measures is that it does not show the
full power of the systems; to say that a system retrieves 13 relevant
documents in its top 20 ranked set from a collection of 1,000,000 means
that that system really is a good computational tool to have, but recall and
precision values hide this fact.  The inclusion of recall-fallout tables
addresses this somewhat but these still obscure the fact that the information
retrieval techniques available are really quite good.
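
The arithmetic behind the "13 relevant in the top 20" example can be worked
through as below.  The report does not say how many relevant documents the
topic has in total, so the figure of 200 is purely an assumption made for
the sake of the calculation; the other numbers come from the text above.

```python
collection_size = 1_000_000   # from the text
retrieved = 20                # from the text
relevant_retrieved = 13       # from the text
total_relevant = 200          # ASSUMPTION: not stated in the report

precision = relevant_retrieved / retrieved
recall = relevant_retrieved / total_relevant

# Fallout: share of the non-relevant documents that were retrieved.
nonrelevant_retrieved = retrieved - relevant_retrieved
fallout = nonrelevant_retrieved / (collection_size - total_relevant)

# Relevant documents expected in 20 documents drawn at random.
expected_by_chance = retrieved * total_relevant / collection_size
```

Under this assumption recall looks low, yet the system has placed 13
relevant documents where random selection would be expected to place far
fewer than one, and it has retrieved only a tiny fraction of the
non-relevant material.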

The overall measurements in TREC-2 show an improvement in effectiveness
over the first TREC and whereas some of this is due to the fact that
ranking was done to top 1000 and not to top 200, it may also be due to
the systems being better.  This could be because systems in TREC-2 are
more finely tuned than before, as most TREC-2 participants were also in
TREC-1 and would thus have been able to anticipate, if not always
fully overcome, the engineering problems entailed when wrestling with 2 or
3 Gbytes of text and the associated indexes, etc.  TREC-2 seemed to have
fewer problems with engineering the volume of the data than before, probably
because most groups had been through it before.

From the outset, efficiency issues were never foremost in TREC, which is
benchmarking retrieval effectiveness and is not directly concerned with the
engineering aspects of large IR systems.  Some of the figures for indexing
and retrieval operations show a very large range of equipment and
performance times, varying from retrieval from 2 Gbytes in less than 5
seconds where the entire inverted file is held in main memory, to retrieval
for a single query measured in hours of CPU and elapsed time implemented
on a PC which decompresses and scans the text as it is being read from the
CD-ROM.  The message here is that you don't need massive computing
resources to take part in TREC ... it helps and it makes things easier, but it
is not mandatory.  In fact, as described in the accompanying call for TREC-3,
there is a category of participation within TREC for computationally intensive
approaches which allows a group to use only a subset of the entire collection.

Finally, TREC-2 did not have as much work on sub-document retrieval as was
expected.  This may be due to the fact that relevance judgements are
dichotomous and do not indicate which PART of a document is relevant, as
in most IR test collections.  This is something which is being looked at in
the next TREC.


WIND UP

At the start of this report we cautioned about making comparisons between
systems and approaches which took part in TREC because of the number of
variables involved.  This raises the question: why bother doing TREC
if the results of different systems cannot be compared?  The answer lies in
the objectives of the TREC initiative, which were defined by Donna Harman as

1.     to increase research in information retrieval carried out on large-
       scale test collections
2.     to provide a forum for communication among academic, industrial
       and other interested parties
3.     to foster the transfer of technology between research laboratories and
       commercial products
4.     to present a state of the art showcase of retrieval methods

Certainly the first and last of these goals have been achieved; the second
goal looks like having been accomplished and as for the third, only time
will tell.  Direct comparisons between systems and approaches taken in
TREC are extremely dodgy and only broad-stroke statements about
effectiveness, such as those made in this report, can be made.

So what happens next in TREC?  A call for participation is already
available for TREC-3 (attached), with proposals for participation due
December 1st, 1993.  Data will be distributed in January 1994 with results
of retrieval runs due August 1st, and the workshop scheduled for early
November.  In addition to the English texts there will also be texts in
Spanish with queries and relevance assessments.  A subset
of the topics will be much narrower than in previous TRECs and there will
be more emphasis next time on user interfaces and issues of query
formulation, user models etc.  At this stage TREC has got its own
momentum and is having an effect on how information retrieval research is
carried out.  We can only expect its impact on information retrieval to grow
even more in the future.


The full TREC program committee is:

Donna Harman, NIST, (chair); Chris Buckley, Cornell University; Susan
Dumais, Bellcore; Darryl Howard, U.S. Department of Defense; David
Lewis, AT & T Bell Labs; Matt Mettler, TRW; John Prange, U.S.
Department of Defense; Alan Smeaton, Dublin City University, Ireland;
Richard Tong, Advanced Decision Systems; Steve Walker, City University,
UK; Karen Sparck Jones (for TREC-3), Cambridge University, UK.


-------------------------------------------------------------------------------


                         CALL FOR PARTICIPATION

                       TEXT RETRIEVAL CONFERENCE

                      January 1994 - November 1994

               
                            Conducted by:
             National Institute of Standards and Technology
                               (NIST)

                            Sponsored by:
                Advanced Research Projects Agency
           Software and Intelligent Systems Technology Office
                            (ARPA/SISTO)


  A new conference for examination of text retrieval methodologies (TREC) was
held in November 1992 at Gaithersburg, Md.  The goal of this conference was 
to encourage research in text retrieval from large document collections by 
providing a large test collection, uniform scoring procedures and a forum for 
organizations interested in comparing their results. Both ad-hoc queries 
against archival data collections and routing (filtering or dissemination) 
queries against incoming data streams were tested.  The conference was a 
workshop open only to the 24 participating systems and government sponsors;  
however, the proceedings were published by NIST in the spring of 1993.  A 
second workshop (TREC-2) was held in September 1993, with 31 participating
systems, and proceedings to be published in the spring of 1994.

  This announcement serves as a call for participation from groups interested
in working in the third year of this workshop (TREC-3).  Participants will be
expected to work with approximately one million documents (2 gigabytes of data),
retrieving lists of documents that could be considered relevant to each of 
100 topics (50 routing and 50 ad-hoc topics).  NIST will distribute the data 
and will collect and analyze the results.   As before, the workshop will be
open only to participating systems and government sponsors.

  Because of government cutbacks, there will be no financial support this
year for participants.

Schedule:
  Dec. 1, 1993 -- deadline for participation applications
  Jan. 1, 1994 -- acceptances announced, and training data distributed to
                  new participants (including 3 CD-ROMs containing about
                  3 gigabytes of data, and 150 training topics and relevance
                  judgments)
  June 1, 1994 -- test gigabyte of data distributed via CD-ROM, after 
                  routing queries received at NIST
  July 1, 1994 -- 50 new test topics distributed
  Aug. 1, 1994 -- results from 50 routing queries and 50 test topics due 
                  at NIST
  Oct. 1, 1994 -- relevance judgments and individual evaluation scores due
                  back to participants
  Nov. 2-4     -- TREC-3 conference at NIST in Gaithersburg, Md.


Task Description:

  Participants will receive 3 gigabytes of data to use for training of their 
systems, including development of appropriate algorithms or knowledge bases.  
The 150 topics used in the first two TREC workshops, and the relevance 
judgments for these topics will also be sent.  The topics are in the form of 
a highly-formatted user need statement (see attachment 1).  Queries can 
either be constructed automatically from this topic description, or can be 
manually constructed. 

  Two types of retrieval operations will be tested: a routing or filtering 
operation against new data, and an ad-hoc query operation against archival 
data.  Fifty of the topics (numbers 101-150) initially distributed as 
training topics will be used by each participating group to create formalized
routing or filtering queries to be used for retrieval against a new test 
gigabyte of data (disk 4).  Fifty new test topics (151-200) will be used 
against 2 gigabytes of the training data (disks 2 and 3) as ad-hoc queries.

  Results from both types of queries (routing and ad-hoc) will be submitted 
to NIST as the top 1000 documents retrieved for each query.  Participants 
creating queries both automatically and manually may submit both sets for 
evaluation.  Scoring techniques including traditional recall/precision 
measures will be run for all systems and individual results will be returned 
to each participant.


Conference Format:

  The conference itself will be used as a forum both for presentation of 
results (including failure analyses and system comparisons), and for more 
lengthy system presentations describing retrieval techniques used, 
experiments run using the data, and other issues of interest to researchers 
in information retrieval.  As there is a limited amount of time for these 
presentations, the program committee will determine which groups are asked to
speak and which groups will present in a poster session.  Additionally some 
organizations may not wish to describe their proprietary algorithms, and 
these groups may choose to participate in a different manner (see Category C).
To allow a maximum number of participants, the following three categories 
have been established.

Category A: Full participation
  Participants will be expected to work with the full data set, and to present
full details of system algorithms and various experiments run using the data, 
either in a talk or in a poster session. In addition to algorithms and 
experiments, some information on time and effort statistics should be 
provided.  This includes time for data preparation (such as indexing, 
building a manual thesaurus, building a knowledge base), time for construction
of manual queries, query execution time, etc.  More details on the desired 
content of the presentation will be provided later.

Category B: Exploratory groups
  Because small groups with novel retrieval techniques might like to 
participate but may have limited research resources, a category has been set 
up to work with only a subset of the data.  This subset will consist of about
1/2 gigabyte of training data (and all training topics), and 1/4 gigabyte of 
test data (and all test topics).  Participants in this category will be 
expected to follow the same schedule as category A, except with less data, 
and will be expected to present full details of system algorithms, 
experiments, and time and effort statistics either in a poster session or 
in a talk.

Category C: Evaluation only
  Participants in this category will be expected to work on the full data set,
submit results for common scoring and tabulation, and present their results in
a poster session, including the time and effort statistics described in 
Category A.  They will not be expected to describe their systems in detail.  


Data (Test Collection):

  The test collection (documents, topics, and relevance judgments) will be an
extension of the collection (English only) used for the ARPA TIPSTER project. 
The collection is being assembled from Linguistic Data Consortium text, and an
LDC User Agreement will be required from all participants. The documents 
are an assorted collection of newspapers (including the Wall Street Journal),
newswires, journals, technical abstracts and email newsgroups.  The test set 
will be of approximately the same composition as the training set, and all 
documents will be typical of those seen in a real-world situation (i.e. there
will not be arcane vocabulary, but there may be missing pieces of text or 
typographical errors).  The format of the documents is relatively clean and 
easy-to-use as is (see attachment 2).  Most of the documents will consist of 
a text section only, with no titles or other categories.  The relevance 
judgments against which each system's output will be scored will be made by 
experienced relevance assessors based on the output of all TREC participants 
using a pooled relevance methodology.


Response format and submission details 

  By Dec. 1, 1993 organizations wishing to participate should respond to 
the call for participation by submitting a summary of their text retrieval 
approach and a system architecture description, not to exceed five pages in 
total.  The summary should include the strengths and significance of their 
approach to text retrieval, and highlight differences between their approach 
and other retrieval approaches.  Each organization should indicate in which 
category they wish to participate.
  Please indicate clearly the persons responsible for the summary statement
and to whom correspondence should be directed.  A full regular address, 
telephone number, and an email address should be given.  EMAIL IS THE 
PREFERRED METHOD OF COMMUNICATION, although it is realized that diagrams and 
figures will need to be sent by regular mail or FAX.   It is expected that 
ALL participants have some access to email, as conference communications will 
be done via email.

  It is highly likely that some Spanish text and topics (approximately a 
1/4 gigabyte of text and 25 topics) will also be available for retrieval 
tests.  If your organization is interested in trying Spanish (in addition to 
English), please state this and indicate the availability of at least one 
person who can read Spanish.

  All responses should be submitted by Dec. 1, 1993 to the Program Chair, 
Donna Harman:

                 harman@magi.ncsl.nist.gov
or
             Donna Harman, NIST, Building 225/A216,
                  Gaithersburg, Md. 20899
                 
                    FAX: 301-975-2128

AS NOTED ABOVE, EMAIL IS THE DESIRED FORM OF COMMUNICATION.
*****************************************************************************

   Any questions about conference participation, response format, etc. should 
also be sent to the same address.


Selection of participants:

  As the goal of TREC is to further research in large-scale text retrieval,
the program committee will be looking for as wide a range of text retrieval
approaches as possible, and will select the best representatives of these 
approaches as participants for categories A and B.  Category C participants 
must be able to demonstrate their ability to work with the full data 
collection.  The program committee has been chosen from a broad range of 
information retrieval researchers and government users, and will both select 
the participants and provide guidance in the planning of the conference.

     Program Committee
         Donna Harman, NIST, chair
         Chris Buckley, Cornell University
         Susan Dumais, Bellcore
         Darryl Howard, U.S. Department of Defense
         David Lewis, AT & T Bell Labs
         Matt Mettler, TRW
         John Prange, U.S. Department of Defense
         Alan Smeaton, Dublin City University, Ireland
         Karen Sparck Jones, Cambridge University
         Richard Tong, Advanced Decision Systems 
         Steve Walker, City University, London

----------------------------------------------------------------------------
Attachment 1 -- Sample Topic

<top>
<head> Tipster Topic Description
<num> Number: 028
<dom> Domain:Science and Technology
<title> Topic: AT&T's Technical Efforts
<desc> Description: Document must describe AT&T's technical efforts in
computers and communications.

<narr> Narrative: To be relevant, a document must contain information
on American Telephone and Telegraph's (AT&T) technical efforts in
computers and communications.  Examples of relevant subject matter
would include: product announcements, releases or cancellations, and
discussion of AT&T Bell Labs research. Documents focusing either
AT&T's efforts to buy other computer companies or AT&T's legal battles
with other organizations, or AT&T's Unix operating system are NOT
relevant. For the purposes of this topic the Regional Bell Operating
Companies, (RBOC's) or the "Baby Bells" are not considered AT&T.

<con> Concept(s):
1. AT&T, American Telephone and Telegraph
2. 3B-2 minicomputer, AT&T 386 PC
3. AT&T Starlan
4. PBX,
5. Product announcements, product releases
</top>
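
For groups constructing queries automatically, a topic in the format above
could be split into its fields along the following lines.  This is a
minimal Python sketch, assuming only what the sample shows: field tags such
as <num> and <desc> are not closed, and each field runs until the next tag.
The function name parse_topic and the abridged sample are illustrative and
not part of any TREC software.

```python
import re

# Abridged copy of the sample topic above.
SAMPLE_TOPIC = """<top>
<head> Tipster Topic Description
<num> Number: 028
<dom> Domain:Science and Technology
<title> Topic: AT&T's Technical Efforts
<desc> Description: Document must describe AT&T's technical efforts in
computers and communications.
</top>"""

def parse_topic(text):
    """Return {tag: text} for the fields inside one <top> ... </top> block."""
    body = re.search(r"<top>(.*?)</top>", text, re.DOTALL).group(1)
    # With one capture group, re.split yields
    # [prefix, tag1, text1, tag2, text2, ...].
    parts = re.split(r"<(\w+)>", body)
    fields = {}
    for tag, value in zip(parts[1::2], parts[2::2]):
        # Collapse line breaks and repeated spaces inside each field.
        fields[tag] = " ".join(value.split())
    return fields

topic = parse_topic(SAMPLE_TOPIC)
```

Each value keeps its label (for example "Number: 028"), so a real query
builder would probably strip the label before indexing the field text.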

-------------------------------------------------------------------------------

Attachment 2 -- Sample Document (abridged)

<DOC>
<DOCNO> WSJ880406-0090 </DOCNO>
<HL> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </HL>
<AUTHOR> Janet Guyon (WSJ Staff) </AUTHOR>
<SO> </SO>
<CO> T </CO>
<IN> TEL </IN>
<DATELINE> NEW YORK  </DATELINE>
<TEXT>
   American Telephone & Telegraph Co. introduced the first of a new generation
of phone services with broad implications for computer and communications 
equipment markets. 
   AT&T said it is the first national long-distance carrier to announce prices 
for specific services under a world-wide standardization plan to upgrade phone 
networks.  By announcing commercial services under the plan, which the industry
calls the Integrated Services Digital Network, AT&T will influence evolving 
communications standards to its advantage, consultants said, just as 
International Business Machines Corp. has created de facto computer standards 
favoring its products. 
   .
   .
</TEXT>
</DOC>
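
Unlike the topic fields, the document fields shown above are enclosed in
paired tags, so they can be extracted with a single pattern.  The Python
sketch below is again only an illustration under that assumption; the
function name parse_doc is hypothetical and the sample is abridged to the
first sentence of the text.

```python
import re

# Abridged copy of the sample document above.
SAMPLE_DOC = """<DOC>
<DOCNO> WSJ880406-0090 </DOCNO>
<HL> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </HL>
<SO> </SO>
<TEXT>
   American Telephone & Telegraph Co. introduced the first of a new generation
of phone services with broad implications for computer and communications
equipment markets.
</TEXT>
</DOC>"""

def parse_doc(text):
    """Return {tag: text} for the paired fields inside <DOC> ... </DOC>."""
    inner = re.search(r"<DOC>(.*?)</DOC>", text, re.DOTALL).group(1)
    fields = {}
    # \1 forces each closing tag to match the opening tag just seen.
    for tag, value in re.findall(r"<(\w+)>(.*?)</\1>", inner, re.DOTALL):
        fields[tag] = " ".join(value.split())
    return fields

doc = parse_doc(SAMPLE_DOC)
```

Empty fields such as <SO> come back as empty strings, which matches the
remark above that most documents consist of a text section only.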


Date: Thu, 7 Oct 93 17:18 EDT
From: lewis@research.att.com (David Lewis)
To: nl-kr@cs.rpi.edu
Subject: CFP: Text Retrieval Conference/Dataset/Evaluation (TREC-3)


                         CALL FOR PARTICIPATION

                       TEXT RETRIEVAL CONFERENCE

                      January 1994 - November 1994

               
                            Conducted by:
             National Institute of Standards and Technology
                               (NIST)

                            Sponsored by:
                Advanced Research Projects Agency
           Software and Intelligent Systems Technology Office
                            (ARPA/SISTO)


  A new conference for examination of text retrieval methodologies (TREC) was
held in November 1992 at Gaithersburg, Md.  The goal of this conference was 
to encourage research in text retrieval from large document collections by 
providing a large test collection, uniform scoring procedures and a forum for 
organizations interested in comparing their results. Both ad-hoc queries 
against archival data collections and routing (filtering or dissemination) 
queries against incoming data streams were tested.  The conference was a 
workshop open only to the 24 participating systems and government sponsors;  
however, the proceedings were published by NIST in the spring of 1993.  A 
second workshop (TREC-2) was held in September 1993, with 31 participating
systems, and proceedings to be published in the spring of 1994.

  This announcement serves as a call for participation from groups interested
in working in the third year of this workshop (TREC-3).  Participants will be
expected to work with approximately one million documents (2 gigabytes of data),
retrieving lists of documents that could be considered relevant to each of 
100 topics (50 routing and 50 ad-hoc topics).  NIST will distribute the data 
and will collect and analyze the results.   As before, the workshop will be
open only to participating systems and government sponsors.

  Because of government cutbacks, there will be no financial support this
year for participants.

Schedule:
  Dec. 1, 1993 -- deadline for participation applications
  Jan. 1, 1994 -- acceptances announced, and training data distributed to
                  new participants (including 3 CD-ROMs containing about
                  3 gigabytes of data, and 150 training topics and relevance
                  judgments)
  June 1, 1994 -- test gigabyte of data distributed via CD-ROM, after 
                  routing queries received at NIST
  July 1, 1994 -- 50 new test topics distributed
  Aug. 1, 1994 -- results from 50 routing queries and 50 test topics due 
                  at NIST
  Oct. 1, 1994 -- relevance judgments and individual evaluation scores due
                  back to participants
  Nov. 2-4     -- TREC-3 conference at NIST in Gaithersburg, Md.


Task Description:

  Participants will receive 3 gigabytes of data to use for training of their 
systems, including development of appropriate algorithms or knowledge bases.  
The 150 topics used in the first two TREC workshops, and the relevance 
judgments for these topics, will also be sent.  The topics are in the form of 
a highly-formatted user need statement (see attachment 1).  Queries can 
either be constructed automatically from this topic description, or can be 
manually constructed. 

  Two types of retrieval operations will be tested: a routing or filtering 
operation against new data, and an ad-hoc query operation against archival 
data.  Fifty of the topics (numbers 101-150) initially distributed as 
training topics will be used by each participating group to create formalized
routing or filtering queries to be used for retrieval against a new test 
gigabyte of data (disk 4).  Fifty new test topics (151-200) will be used 
against 2 gigabytes of the training data (disks 2 and 3) as ad-hoc queries.

  Results from both types of queries (routing and ad-hoc) will be submitted 
to NIST as the top 1000 documents retrieved for each query.  Participants 
creating queries both automatically and manually may submit both sets for 
evaluation.  Scoring techniques including traditional recall/precision 
measures will be run for all systems and individual results will be returned 
to each participant.
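
  As an illustration of the traditional recall/precision measures mentioned
above, the following is a minimal sketch in Python of scoring one query's
ranked list at a fixed cutoff.  The function name, the document identifiers
and the relevant set are hypothetical; this is not the scoring program that
NIST will run, only an example of the basic calculation.

```python
def precision_recall(ranked_docs, relevant_docs, cutoff=1000):
    """Return (precision, recall) over the first `cutoff` ranked documents."""
    retrieved = ranked_docs[:cutoff]
    if not retrieved or not relevant_docs:
        return 0.0, 0.0
    # A hit is a retrieved document that was judged relevant.
    hits = sum(1 for doc in retrieved if doc in relevant_docs)
    return hits / len(retrieved), hits / len(relevant_docs)


# Hypothetical ranked output for one query, and its judged-relevant set.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d8"}
p, r = precision_recall(ranked, relevant, cutoff=5)
print(f"precision={p:.2f} recall={r:.2f}")
```

  Here two of the five retrieved documents are relevant, and two of the three
relevant documents were retrieved.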


Conference Format:

  The conference itself will be used as a forum both for presentation of 
results (including failure analyses and system comparisons), and for more 
lengthy system presentations describing retrieval techniques used, 
experiments run using the data, and other issues of interest to researchers 
in information retrieval.  As there is a limited amount of time for these 
presentations, the program committee will determine which groups are asked to
speak and which groups will present in a poster session.  Additionally, some 
organizations may not wish to describe their proprietary algorithms, and 
these groups may choose to participate in a different manner (see Category C).
To allow a maximum number of participants, the following three categories 
have been established.

Category A: Full participation
  Participants will be expected to work with the full data set, and to present
full details of system algorithms and various experiments run using the data, 
either in a talk or in a poster session. In addition to algorithms and 
experiments, some information on time and effort statistics should be 
provided.  This includes time for data preparation (such as indexing, 
building a manual thesaurus, building a knowledge base), time for construction
of manual queries, query execution time, etc.  More details on the desired 
content of the presentation will be provided later.

Category B: Exploratory groups
  Because small groups with novel retrieval techniques might like to 
participate but may have limited research resources, a category has been set 
up to work with only a subset of the data.  This subset will consist of about
1/2 gigabyte of training data (and all training topics), and 1/4 gigabyte of 
test data (and all test topics).  Participants in this category will be 
expected to follow the same schedule as category A, except with less data, 
and will be expected to present full details of system algorithms, 
experiments, and time and effort statistics either in a poster session or 
in a talk.

Category C: Evaluation only
  Participants in this category will be expected to work on the full data set,
submit results for common scoring and tabulation, and present their results in
a poster session, including the time and effort statistics described in 
Category A.  They will not be expected to describe their systems in detail.  


Data (Test Collection):

  The test collection (documents, topics, and relevance judgments) will be an
extension of the collection (English only) used for the ARPA TIPSTER project. 
The collection is being assembled from Linguistic Data Consortium text, and an
LDC User Agreement will be required from all participants. The documents 
are an assorted collection of newspapers (including the Wall Street Journal),
newswires, journals, technical abstracts and email newsgroups.  The test set 
will be of approximately the same composition as the training set, and all 
documents will be typical of those seen in a real-world situation (i.e. there
will not be arcane vocabulary, but there may be missing pieces of text or 
typographical errors).  The format of the documents is relatively clean and 
easy-to-use as is (see attachment 2).  Most of the documents will consist of 
a text section only, with no titles or other categories.  The relevance 
judgments against which each system's output will be scored will be made by 
experienced relevance assessors based on the output of all TREC participants 
using a pooled relevance methodology.
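
  The general idea of pooling can be sketched as follows: for each topic, the
top-ranked documents from every submitted run are merged into a single set,
and only that set is passed to the assessors.  This is a minimal sketch in
Python; the run names, the document identifiers and the pool depth of 2 are
all assumptions made for illustration, since the depth actually used is not
stated here.

```python
def build_pool(runs, depth):
    """Union of the top `depth` documents from each run's ranking for one topic."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


# Hypothetical rankings from three systems for the same topic.
runs = {
    "system_a": ["d1", "d2", "d3", "d4"],
    "system_b": ["d2", "d5", "d1", "d6"],
    "system_c": ["d7", "d1", "d2", "d8"],
}
pool = build_pool(runs, depth=2)
print(sorted(pool))
```

  Documents outside the pool are never judged, so the pool is much smaller
than the full collection.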


Response Format and Submission Details:

  By Dec. 1, 1993 organizations wishing to participate should respond to 
the call for participation by submitting a summary of their text retrieval 
approach and a system architecture description, not to exceed five pages in 
total.  The summary should include the strengths and significance of their 
approach to text retrieval, and highlight differences between their approach 
and other retrieval approaches.  Each organization should indicate in which 
category they wish to participate.
  Please indicate clearly the persons responsible for the summary statement
and to whom correspondence should be directed.  A full regular address, 
telephone number, and an email address should be given.  EMAIL IS THE 
PREFERRED METHOD OF COMMUNICATION, although it is realized that diagrams and 
figures will need to be sent by regular mail or FAX.   It is expected that 
ALL participants have some access to email, as conference communications will 
be done via email.

  It is highly likely that some Spanish text and topics (approximately a 
1/4 gigabyte of text and 25 topics) will also be available for retrieval 
tests.  If your organization is interested in trying Spanish (in addition to 
English), please state this and indicate the availability of at least one 
person who can read Spanish.

  All responses should be submitted by Dec. 1, 1993 to the Program Chair, 
Donna Harman:

                 harman@magi.ncsl.nist.gov
or
             Donna Harman, NIST, Building 225/A216,
                  Gaithersburg, Md. 20899
                 
                    FAX: 301-975-2128

AS NOTED ABOVE, EMAIL IS THE DESIRED FORM OF COMMUNICATION.
*****************************************************************************

  Any questions about conference participation, response format, etc. should 
also be sent to the same address.


Selection of participants:

  As the goal of TREC is to further research in large-scale text retrieval,
the program committee will be looking for as wide a range of text retrieval
approaches as possible, and will select the best representatives of these 
approaches as participants for categories A and B.  Category C participants 
must be able to demonstrate their ability to work with the full data 
collection.  The program committee has been chosen from a broad range of 
information retrieval researchers and government users, and will both select 
the participants and provide guidance in the planning of the conference.

     Program Committee
         Donna Harman, NIST, chair
         Chris Buckley, Cornell University
         Susan Dumais, Bellcore
         Darryl Howard, U.S. Department of Defense
         David Lewis, AT & T Bell Labs
         Matt Mettler, TRW
         John Prange, U.S. Department of Defense
         Alan Smeaton, Dublin City University, Ireland
         Karen Sparck Jones, Cambridge University
         Richard Tong, Advanced Decision Systems 
         Steve Walker, City University, London

----------------------------------------------------------------------------
Attachment 1 -- Sample Topic

<top>
<head> Tipster Topic Description
<num> Number: 028
<dom> Domain:Science and Technology
<title> Topic: AT&T's Technical Efforts
<desc> Description: Document must describe AT&T's technical efforts in
computers and communications.

<narr> Narrative: To be relevant, a document must contain information
on American Telephone and Telegraph's (AT&T) technical efforts in
computers and communications.  Examples of relevant subject matter
would include: product announcements, releases or cancellations, and
discussion of AT&T Bell Labs research. Documents focusing either
AT&T's efforts to buy other computer companies or AT&T's legal battles
with other organizations, or AT&T's Unix operating system are NOT
relevant. For the purposes of this topic the Regional Bell Operating
Companies, (RBOC's) or the "Baby Bells" are not considered AT&T.

<con> Concept(s):
1. AT&T, American Telephone and Telegraph
2. 3B-2 minicomputer, AT&T 386 PC
3. AT&T Starlan
4. PBX,
5. Product announcements, product releases
</top>
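
For groups constructing queries automatically, a topic in this format could
be split into its fields along the following lines.  This is only a sketch in
Python, assuming that each field starts with one of the tags shown above
followed by a label ending in a colon, and that it runs until the next tag.
The function name parse_topic is hypothetical, and taking the title words as
query terms is merely one simple possibility.

```python
import re

# One field: a known tag, its label up to the first colon, then the body,
# which runs until the next line that begins with a tag (or end of input).
FIELD = re.compile(
    r"<(num|dom|title|desc|narr|con)>[^:\n]*:\s*(.*?)(?=\n\s*<|\Z)", re.S
)


def parse_topic(text):
    """Map each tag name to its body, with whitespace runs collapsed."""
    return {tag: " ".join(body.split()) for tag, body in FIELD.findall(text)}


# Abridged copy of the sample topic above.
sample = """<top>
<head> Tipster Topic Description
<num> Number: 028
<dom> Domain:Science and Technology
<title> Topic: AT&T's Technical Efforts
<desc> Description: Document must describe AT&T's technical efforts in
computers and communications.

<con> Concept(s):
1. AT&T, American Telephone and Telegraph
2. 3B-2 minicomputer, AT&T 386 PC
</top>"""

topic = parse_topic(sample)
query_terms = topic["title"].lower().split()
print(topic["num"], query_terms)
```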

-------------------------------------------------------------------------------

Attachment 2 -- Sample Document (abridged)

<DOC>
<DOCNO> WSJ880406-0090 </DOCNO>
<HL> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </HL>
<AUTHOR> Janet Guyon (WSJ Staff) </AUTHOR>
<SO> </SO>
<CO> T </CO>
<IN> TEL </IN>
<DATELINE> NEW YORK  </DATELINE>
<TEXT>
   American Telephone & Telegraph Co. introduced the first of a new generation
of phone services with broad implications for computer and communications 
equipment markets. 
   AT&T said it is the first national long-distance carrier to announce prices 
for specific services under a world-wide standardization plan to upgrade phone 
networks.  By announcing commercial services under the plan, which the industry
calls the Integrated Services Digital Network, AT&T will influence evolving 
communications standards to its advantage, consultants said, just as 
International Business Machines Corp. has created de facto computer standards 
favoring its products. 
   .
   .
</TEXT>
</DOC>
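
A document in this format could be read along the following lines.  This is
a minimal sketch in Python, assuming that every document carries a DOCNO
element and at most one TEXT element as in the sample above; the function
name parse_document is hypothetical, and real data may need more careful
handling (for example, of missing pieces of text).

```python
import re


def parse_document(text):
    """Extract the document number and the whitespace-normalized text body."""
    docno = re.search(r"<DOCNO>\s*(.*?)\s*</DOCNO>", text, re.S)
    body = re.search(r"<TEXT>\s*(.*?)\s*</TEXT>", text, re.S)
    return {
        "docno": docno.group(1) if docno else None,
        "text": " ".join(body.group(1).split()) if body else "",
    }


# Abridged copy of the sample document above.
sample = """<DOC>
<DOCNO> WSJ880406-0090 </DOCNO>
<HL> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </HL>
<TEXT>
   American Telephone & Telegraph Co. introduced the first of a new generation
of phone services with broad implications for computer and communications
equipment markets.
</TEXT>
</DOC>"""

doc = parse_document(sample)
print(doc["docno"], len(doc["text"].split()), "words")
```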


-----------------------------------------------------------------------