EQE

Event-Centric Query Expansion in Web Search

What is Title2EventPhrase?

Title2EventPhrase is an open event reformulation dataset of human-annotated Chinese titles. It comprises over 41,000 <headline, event> pairs, with headlines sourced from Chinese web pages and spanning 25 topics. It was collected by researchers at QQ Browser Search.

For more details, please refer to our ACL 2023 Industry Track paper:

(Zhang2023EventCentricQE)

Citation

If you use Title2EventPhrase in your research, please cite our paper.

@inproceedings{zhang-etal-2023-event,
  title = "Event-Centric Query Expansion in Web Search",
  author = "Zhang, Yanan  and
    Cui, Weijie  and
    Zhang, Yangfan  and
    Bai, Xiaoling  and
    Zhang, Zhe  and
    Ma, Jin  and
    Chen, Xiang  and
    Zhou, Tianhua",
  booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)",
  month = jul,
  year = "2023",
  address = "Toronto, Canada",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2023.acl-industry.45",
  pages = "464--475",
  abstract = "In search engines, query expansion (QE) is a crucial technique to improve search experience. Previous studies often rely on long-term search log mining, which leads to slow updates and is sub-optimal for time-sensitive news searches. In this work, we present Event-Centric Query Expansion (EQE), the QE system used in a famous Chinese search engine. EQE utilizes a novel event retrieval framework that consists of four stages, i.e., event collection, event reformulation, semantic retrieval and online ranking, which can select the best expansion from a significant amount of potential events rapidly and accurately. Specifically, we first collect and filter news headlines from websites. Then we propose a generation model that incorporates contrastive learning and prompt-tuning techniques to reformulate these headlines to concise candidates. Additionally, we fine-tune a dual-tower semantic model to serve as an encoder for event retrieval and explore a two-stage contrastive training approach to enhance the accuracy of event retrieval. Finally, we rank the retrieved events and select the optimal one as QE, which is then used to improve the retrieval of event-related documents. Through offline analysis and online A/B testing, we observed that the EQE system has significantly improved many indicators compared to the baseline. The system has been deployed in a real production environment and serves hundreds of millions of users.",
}
            

Quick Start

Title2EventPhrase is distributed under a CC BY-SA 4.0 License. The dataset can be obtained below:

Baidu Netdisk
Google Drive

For the baseline code, please refer to our GitHub repository:

baseline repo

If you want your results to appear on the official leaderboard, please read the guideline below.

Leaderboard Guideline
Leaderboard
Methods            Rouge-L   BLEU     BERTScore
BART (vanilla)     0.8391    0.7692   0.9266
BART + CL          0.8406    0.7724   0.9278
BART + PG          0.8458    0.7777   0.9294
BART + CL + PG     0.848     0.7822   0.9312

mT5 (vanilla)      0.8453    0.7781   0.9297
mT5 + CL           0.8489    0.7833   0.9315
mT5 + PG           0.8511    0.7857   0.9322
mT5 + CL + PG      0.8533    0.7897   0.9336
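
To sanity-check a submission before sending it, a Rouge-L style score can be approximated locally. The snippet below is a minimal sketch, not the official evaluation script: it assumes character-level tokenization (a common choice for Chinese text) and an F1 combination of LCS precision and recall, so its numbers may differ from those on the leaderboard. The function names and the example strings are hypothetical and are not taken from the dataset.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """Character-level ROUGE-L F1 (assumption: beta = 1, one reference)."""
    pred_tokens = list(prediction)
    ref_tokens = list(reference)
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up strings for illustration only; not real dataset entries.
    predicted_event = "北京举办冬奥会"
    reference_event = "北京冬奥会开幕"
    print(f"ROUGE-L F1: {rouge_l_f1(predicted_event, reference_event):.4f}")
```

A corpus-level score would typically average `rouge_l_f1` over all <prediction, reference> pairs. BLEU and BERTScore need their own tooling and are not covered by this sketch.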