LTI Student Tops LiveQA Evaluation for Second Consecutive Year

Bryan Burtner | Tuesday, November 22, 2016

A system designed by LTI Ph.D. student Di Wang to rapidly answer questions posed to the Yahoo! Answers website received the highest score in the LiveQA evaluation track at TREC 2016.

For the second consecutive year, Carnegie Mellon came out on top in the LiveQA evaluation — an exercise that requires question-answering (QA) software to respond to real-time questions received by the Yahoo! Answers website — at the Text Retrieval Conference (TREC 2016).

A system designed by Di Wang, a Language Technologies Institute Ph.D. student and member of Professor Eric Nyberg's Open Advancement of Question Answering (OAQA) research group, outpaced competitors in answering user-generated questions in real time.

The questions, ranging from the mundane to the perplexing (e.g., "How do I convince my mom to pay for my gym?", "Can I keep ferrets and rats in the same room?"), were selected from a collection of Yahoo! Answers submissions that had not yet been answered by human users. The QA systems were allowed one minute to answer each question. Wang's system, which uses a deep learning approach to question answering, received the highest average score and success rate among 25 automatic QA systems from 14 teams.

"One of the core challenges is to effectively teach a machine to judge the relationship between question and answer texts, which are often ambiguous, ungrammatical and greatly varied in length, topic and language style," Wang said. "Our internal evaluations show that our QA system offered answers that were comparable in quality to the first responses received from the Yahoo! Answers community, indicating that QA systems could be helping real-world users in the near future!"

TREC, conducted annually by the National Institute of Standards and Technology (NIST) in Gaithersburg, Md., and now in its 25th year, "encourages research in information retrieval and related applications by providing a large test collection, uniform scoring procedures and a forum for organizations interested in comparing their results," according to the conference's website.

System responses were assessed by human judges, who gave Wang's system an average score of 1.115 on a 3.0 scale. That compared with an overall average of 0.577 across all participants and 1.054 for the second-place finisher. The score also marked an improvement over last year, when Wang's system received the top score of 1.081.

"The competition was much more intense than last year," Wang said. "All of the teams improved a lot." He credited the OAQA team's new method of answer ranking, using an attentional neural encoder-decoder approach, for his system's improvement.

Wang said he appreciates the collegial nature of the competition.

"During the last TREC conference, most of the leading teams shared what they learned with the rest of teams," he said.

Nyberg echoed Wang's enthusiasm for the practical benefits that could come from their research, while also taking care to note the significance of his student's success at TREC.

"Winning the TREC QA track is a noteworthy achievement for any team," he said, "But doing it two years in a row is simply amazing."

Wang's work is sponsored in part by the InMind project, which is funded by Yahoo!

For More Information

Byron Spice | 412-268-9068 | bspice@cs.cmu.edu