Publications

MIA 2022 Shared Task Submission: Leveraging Entity Representations, Dense-Sparse Hybrids, and Fusion-in-Decoder for Cross-Lingual Question Answering
Zhucheng Tu, Janani Padmanabhan
Proceedings of the Workshop on Multilingual Information Access (MIA), July 2022, Seattle, USA

We describe our two-stage system for the Multilingual Information Access (MIA) 2022 Shared Task on Cross-Lingual Open-Retrieval Question Answering. The first stage consists of multilingual passage retrieval with a hybrid dense and sparse retrieval strategy. The second stage consists of a reader which outputs the answer from the top passages returned by the first stage. We show the efficacy of using entity representations, sparse retrieval signals to help dense retrieval, and Fusion-in-Decoder. On the development set, we obtain 43.46 F1 on XOR-TyDi QA and 21.99 F1 on MKQA, for an average F1 score of 32.73. On the test set, we obtain 40.93 F1 on XOR-TyDi QA and 22.29 F1 on MKQA, for an average F1 score of 31.61. We improve over the official baseline by over 4 F1 points on both the development and test sets.
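
One way the hybrid retrieval stage can be realized is by interpolating normalized dense and sparse scores for each passage. The sketch below is a minimal Python illustration of that idea; the interpolation weight, min-max normalization, and function names are assumptions for exposition, not the exact recipe used in the submission.

    # Minimal sketch of a dense-sparse hybrid ranker. The interpolation weight,
    # the min-max normalization, and the names are illustrative assumptions,
    # not the exact recipe from the shared-task submission.
    def hybrid_rank(dense_scores, sparse_scores, alpha=0.7, top_k=100):
        """dense_scores, sparse_scores: dicts mapping passage_id -> score."""
        def normalize(scores):
            if not scores:
                return {}
            lo, hi = min(scores.values()), max(scores.values())
            span = (hi - lo) or 1.0
            return {pid: (s - lo) / span for pid, s in scores.items()}

        dense, sparse = normalize(dense_scores), normalize(sparse_scores)
        combined = {
            pid: alpha * dense.get(pid, 0.0) + (1.0 - alpha) * sparse.get(pid, 0.0)
            for pid in set(dense) | set(sparse)
        }
        # Return the passage ids of the top_k highest combined scores.
        return sorted(combined, key=combined.get, reverse=True)[:top_k]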

Open-Domain Question Answering Goes Conversational via Question Rewriting
Raviteja Anantha*, Svitlana Vakulenko*, Zhucheng Tu, Shayne Longpre, Stephen Pulman, Srinivas Chappidi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, June 2021, Online

We introduce a new dataset for Question Rewriting in Conversational Context (QReCC), which contains 14K conversations with 80K question-answer pairs. The task in QReCC is to find answers to conversational questions within a collection of 10M web pages (split into 54M passages). Answers to questions in the same conversation may be distributed across several web pages. QReCC provides annotations that allow us to train and evaluate the individual subtasks of question rewriting, passage retrieval, and reading comprehension required for the end-to-end conversational question answering (QA) task. We report the effectiveness of a strong baseline approach that combines a state-of-the-art model for question rewriting with competitive models for open-domain QA. Our results set the first baseline for the QReCC dataset with an F1 of 19.10, compared to the human upper bound of 75.45, indicating the difficulty of the setup and considerable room for improvement.

A Comparison of Question Rewriting Methods for Conversational Passage Retrieval
Svitlana Vakulenko, Nikos Voskarides, Zhucheng Tu, Shayne Longpre
43rd European Conference on IR Research, ECIR 2021, March 2021, Online

Conversational passage retrieval relies on question rewriting to modify the original question so that it no longer depends on the conversation history. Several methods for question rewriting have recently been proposed, but they were compared under different retrieval pipelines. We bridge this gap by thoroughly evaluating those question rewriting methods on the TREC CAsT 2019 and 2020 datasets under the same retrieval pipeline. We analyze the effect of different types of question rewriting methods on retrieval performance and show that by combining question rewriting methods of different types we can achieve state-of-the-art performance on both datasets. (Resources can be found at https://github.com/svakulenk0/cast_evaluation.)

Question Rewriting for Conversational Question Answering
Svitlana Vakulenko, Shayne Longpre, Zhucheng Tu, Raviteja Anantha
WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, March 2021, Online

Conversational question answering (QA) requires the ability to correctly interpret a question in the context of previous conversation turns. We address the conversational QA task by decomposing it into question rewriting and question answering subtasks. The question rewriting (QR) subtask is specifically designed to reformulate ambiguous questions, which depend on the conversational context, into unambiguous questions that can be correctly interpreted outside of the conversational context. We introduce a conversational QA architecture that sets the new state of the art on the TREC CAsT 2019 passage retrieval dataset. Moreover, we show that the same QR model improves QA performance on the QuAC dataset with respect to answer span extraction, which is the next step in QA after passage retrieval. Our evaluation results indicate that the QR model we proposed achieves near human-level performance on both datasets and the gap in performance on the end-to-end conversational QA task is attributed mostly to the errors in QA.
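
At a high level, the decomposition amounts to a two-stage pipeline: rewrite first, then run standard QA on the rewritten question. The sketch below is a hypothetical interface; the rewriter, retriever, and reader arguments stand in for the actual models described in the paper.

    # Hypothetical interface for the QR + QA decomposition; the callables
    # passed in stand in for the paper's actual rewriting, retrieval, and
    # reading-comprehension models.
    def conversational_qa(history, question, rewriter, retriever, reader, top_k=10):
        # Stage 1: rewrite the context-dependent question into a self-contained one.
        standalone_question = rewriter(history, question)
        # Stage 2: standard (non-conversational) QA over the rewritten question.
        passages = retriever(standalone_question, top_k)
        return reader(standalone_question, passages)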

A Wrong Answer or a Wrong Question? An Intricate Relationship between Question Reformulation and Answer Selection in Conversational Question Answering
Svitlana Vakulenko, Shayne Longpre, Zhucheng Tu, Raviteja Anantha
Workshop on Search-Oriented Conversational AI (SCAI) 2020, October 2020, Online

Conversational question answering (QA) requires answers conditioned on the previous turns of the conversation. We address the conversational QA task by decomposing it into question rewriting and question answering subtasks, and conduct a systematic evaluation of this approach on two publicly available datasets. Question rewriting is designed to reformulate ambiguous questions, dependent on the conversation context, into unambiguous questions that are fully interpretable outside of the conversation context. Thereby, standard QA components can consume such explicit questions directly. The main benefit of this approach is that the same questions can be used to query different information sources, e.g., multiple third-party QA services simultaneously, while also providing a human-readable interpretation of the question in context. To the best of our knowledge, we are the first to evaluate question rewriting on the conversational question answering task and show its improvement over the end-to-end baselines. Moreover, our conversational QA architecture based on question rewriting sets the new state of the art on the TREC CAsT 2019 dataset with a 28% improvement in MAP and 21% in NDCG@3. Our detailed analysis of the evaluation results provides insights into the sensitivity of QA models to question reformulation and demonstrates the strengths and weaknesses of the retrieval and extractive QA architectures, which should be reflected in their integration.

Least Squares Binary Quantization of Neural Networks
Hadi Pouransari, Zhucheng Tu, and Oncel Tuzel
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2020, Seattle, USA

Quantizing weights and activations of deep neural networks results in significant improvement in inference efficiency at the cost of lower accuracy. A source of the accuracy gap between full-precision and quantized models is the quantization error. In this work, we focus on binary quantization, in which values are mapped to -1 and 1. We provide a unified framework to analyze different scaling strategies. Inspired by the Pareto-optimality of 2-bit versus 1-bit quantization, we introduce a novel 2-bit quantization with provably least squares error. Our quantization algorithms can be implemented efficiently in hardware using bitwise operations. We present proofs to show that our proposed methods are optimal, and also provide empirical error analysis. We conduct experiments on the ImageNet dataset and show a reduced accuracy gap when using the proposed least squares quantization algorithms.
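
For the 1-bit case, the least-squares scale has a well-known closed form, stated below as a reference point; the paper's contribution builds on this analysis, with the 2-bit scheme (roughly, a sum of two scaled sign terms per value) derived to have provably least squares error.

    % Least-squares 1-bit quantization of a weight vector w in R^n:
    % approximate w by a single scaled sign vector \alpha b, with b in {-1,+1}^n.
    \min_{\alpha \in \mathbb{R},\; b \in \{-1,+1\}^n} \lVert w - \alpha b \rVert_2^2
    \quad\Longrightarrow\quad
    b^{*} = \operatorname{sign}(w), \qquad
    \alpha^{*} = \frac{1}{n}\sum_{i=1}^{n} \lvert w_i \rvert .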

An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering
Shayne Longpre*, Yi Lu*, Zhucheng Tu*, and Chris DuBois
Proceedings of the 2nd Workshop on Machine Reading for Question Answering, November 2019, Hong Kong, China

To produce a domain-agnostic question answering model for the Machine Reading Question Answering (MRQA) 2019 Shared Task, we investigate the relative benefits of large pre-trained language models, various data sampling strategies, and query and context paraphrases generated by back-translation. We find a simple negative sampling technique to be particularly effective, even though it is typically used for datasets that include unanswerable questions, such as SQuAD 2.0. When applied in conjunction with per-domain sampling, our XLNet-based (Yang et al., 2019) submission achieved the second-best Exact Match and F1 scores in the MRQA leaderboard competition.
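
The negative sampling technique can be pictured as pairing a question with a passage that does not contain its answer and labeling the pair unanswerable. The Python sketch below illustrates that idea; the field names and sampling ratio are assumptions, not the settings used in the submission.

    # Illustrative negative sampling: pair questions with passages that do not
    # contain their answer and mark them unanswerable (SQuAD 2.0-style).
    # Field names and the sampling ratio are assumptions for exposition.
    import random

    def add_negative_samples(examples, negative_ratio=0.1, seed=0):
        """examples: list of dicts with 'question', 'context', and 'answer' keys."""
        rng = random.Random(seed)
        negatives = []
        for ex in examples:
            if rng.random() >= negative_ratio:
                continue
            other = rng.choice(examples)
            # Keep the pair only if the original answer is absent from the new
            # context, so the added training example is genuinely unanswerable.
            if other is not ex and ex["answer"] not in other["context"]:
                negatives.append({"question": ex["question"],
                                  "context": other["context"],
                                  "answer": None})
        return examples + negatives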

Pay-Per-Request Deployment of Neural Network Models Using Serverless Architectures
Zhucheng Tu, Mengping Li, and Jimmy Lin
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, June 2018, New Orleans, USA

We demonstrate the serverless deployment of neural networks for model inferencing in NLP applications using Amazon's Lambda service for feedforward evaluation and DynamoDB for storing word embeddings. Our architecture realizes a pay-per-request pricing model, requiring zero ongoing costs for maintaining server instances. All virtual machine management is handled behind the scenes by the cloud provider without any direct developer intervention. We describe a number of techniques that allow efficient use of serverless resources, and evaluations confirm that our design is both scalable and inexpensive.
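
A minimal sketch of this design, assuming a DynamoDB table named word_embeddings with one item per word and a placeholder linear scorer in place of the trained network, looks roughly like the following; it is not the paper's exact handler.

    # Hypothetical AWS Lambda handler: look up word embeddings in DynamoDB and
    # run a tiny feedforward step. The table name, item schema, and the
    # placeholder scorer are assumptions, not the paper's artifacts.
    import json
    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("word_embeddings")  # assumed table name

    def lookup_embedding(word, dim=300):
        item = table.get_item(Key={"word": word}).get("Item")
        if item is None:
            return [0.0] * dim  # out-of-vocabulary words map to a zero vector
        return [float(x) for x in item["vector"]]

    def score(embeddings, dim=300):
        # Placeholder feedforward step: mean-pool the token embeddings and apply
        # one linear unit; a real deployment would load trained weights instead.
        pooled = [sum(col) / len(embeddings) for col in zip(*embeddings)] or [0.0] * dim
        return sum(0.01 * x for x in pooled)

    def handler(event, context):
        tokens = event["text"].lower().split()
        embeddings = [lookup_embedding(t) for t in tokens]
        return {"statusCode": 200, "body": json.dumps({"score": score(embeddings)})}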

CNNs for NLP in the Browser: Client-Side Deployment and Visualization Opportunities
Yiyun Liang, Zhucheng Tu, Laetitia Huang, and Jimmy Lin
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, June 2018, New Orleans, USA

We demonstrate a JavaScript implementation of a convolutional neural network that performs feedforward inference completely in the browser. Such a deployment means that models can run completely on the client, on a wide range of devices, without making backend server requests. This design is useful for applications with stringent latency requirements or low connectivity. Our evaluations show the feasibility of JavaScript as a deployment target. Furthermore, an in-browser implementation enables seamless integration with the JavaScript ecosystem for information visualization, providing opportunities to visually inspect neural networks and better understand their inner workings.

An Experimental Analysis of Multi-Perspective Convolutional Neural Networks
Zhucheng Tu
Master's thesis, University of Waterloo, May 2018, Waterloo, Canada

Modelling the similarity of sentence pairs is an important problem in natural language processing and information retrieval, with applications in tasks such as paraphrase identification and answer selection in question answering. The Multi-Perspective Convolutional Neural Network (MP-CNN) is a model that improved on the previous state of the art in 2015 and has remained popular for sentence similarity tasks. However, until now, there has not been a rigorous study of how the model actually achieves competitive accuracy. In this thesis, we report on a series of detailed experiments that break down the contribution of each component of MP-CNN to its accuracy and examine how each affects model robustness. We find that two key components of MP-CNN are non-essential for achieving competitive accuracy and in fact make the model less robust to changes in hyperparameters.

An Experimental Analysis of the Power Consumption of Convolutional Neural Networks for Keyword Spotting
Raphael Tang, Weijie Wang, Zhucheng Tu, and Jimmy Lin
Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), April 2018, Calgary, Canada

Nearly all previous work on small-footprint keyword spotting with neural networks quantifies model footprint in terms of the number of parameters and multiply operations required for a feedforward inference pass. These values are, however, proxy measures, since empirical performance in actual deployments is determined by many factors. In this paper, we study the power consumption of a family of convolutional neural networks for keyword spotting on a Raspberry Pi. We find that both proxies are good predictors of energy usage, although the number of multiplies is more predictive than the number of model parameters. We also confirm that models with the highest accuracies are, unsurprisingly, the most power hungry.

An Exploration of Approaches to Integrating Neural Reranking Models in Multi-Stage Ranking Architectures
Zhucheng Tu, Matt Crane, Royal Sequiera, Junchen Zhang, and Jimmy Lin
Proceedings of the SIGIR 2017 Workshop on Neural Information Retrieval (Neu-IR'17), August 2017, Tokyo, Japan

We explore different approaches to integrating a simple convolutional neural network (CNN) with the Lucene search engine in a multi-stage ranking architecture. Our models are trained using the PyTorch deep learning toolkit, which is implemented in C/C++ with a Python frontend. One obvious integration strategy is to expose the neural network directly as a service. For this, we use Apache Thrift, a software framework for building scalable cross-language services. In exploring alternative architectures, we observe that once trained, the feedforward evaluation of neural networks is quite straightforward. Therefore, we can extract the parameters of a trained CNN from PyTorch and import the model into Java, taking advantage of the Java Deeplearning4J library for feedforward evaluation. This has the advantage that the entire end-to-end system can be implemented in Java. As a third approach, we can extract the neural network from PyTorch and "compile" it into a C++ program that exposes a Thrift service. We evaluate these alternatives in terms of performance (latency and throughput) as well as ease of integration. Experiments show that feedforward evaluation of the convolutional neural network is significantly slower in Java, while the performance of the compiled C++ network does not consistently beat the PyTorch implementation.
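
The second approach hinges on exporting a trained model's parameters from PyTorch in a form the JVM can read. A minimal sketch of that export step, using plain NumPy files as an assumed interchange format (the paper does not prescribe this exact tooling), is shown below.

    # Illustrative export step: dump each parameter tensor of a trained PyTorch
    # model to a .npy file so a JVM-side library such as Deeplearning4j can
    # rebuild the same feedforward computation. The file layout is an assumption.
    import numpy as np
    import torch

    def export_parameters(model: torch.nn.Module, out_dir: str) -> None:
        for name, tensor in model.state_dict().items():
            # One file per parameter, named after the layer (e.g. conv1.weight.npy).
            np.save(f"{out_dir}/{name}.npy", tensor.detach().cpu().numpy())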

Exploring the Effectiveness of Convolutional Neural Networks for Answer Selection in End-to-End Question Answering
Royal Sequiera, Gaurav Baruah, Zhucheng Tu, Salman Mohammed, Jinfeng Rao, Haotian Zhang, and Jimmy Lin
Proceedings of the SIGIR 2017 Workshop on Neural Information Retrieval (Neu-IR'17), August 2017, Tokyo, Japan

Most work on natural language question answering today focuses on answer selection: given a candidate list of sentences, determine which contains the answer. Although important, answer selection is only one stage in a standard end-to-end question answering pipeline. This paper explores the effectiveness of convolutional neural networks (CNNs) for answer selection in an end-to-end context using the standard TrecQA dataset. We observe that a simple idf-weighted word overlap algorithm forms a very strong baseline, and that despite substantial efforts by the community in applying deep learning to tackle answer selection, the gains are modest at best on this dataset. Furthermore, it is unclear if a CNN is more effective than the baseline in an end-to-end context based on standard retrieval metrics. To further explore this finding, we conducted a manual user evaluation, which confirms that answers from the CNN are detectably better than those from idf-weighted word overlap. This result suggests that users are sensitive to relatively small differences in answer selection quality.
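
The idf-weighted word overlap baseline is simple enough to state in a few lines; the sketch below is a minimal Python version, with the tokenization and idf estimate chosen for illustration rather than matching the paper's implementation.

    # Minimal idf-weighted word-overlap scorer; the tokenization and the idf
    # estimate (computed over the candidate sentences) are illustrative choices.
    import math
    from collections import Counter

    def idf_weights(sentences):
        n = len(sentences)
        df = Counter()
        for s in sentences:
            df.update(set(s.lower().split()))
        return {w: math.log(n / df[w]) for w in df}

    def overlap_score(question, sentence, idf):
        q_terms = set(question.lower().split())
        s_terms = set(sentence.lower().split())
        # Sum the idf weights of the terms shared by question and candidate.
        return sum(idf.get(w, 0.0) for w in q_terms & s_terms)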

Prizm: A Wireless Access Point for Proxy-Based Web Lifelogging
Jimmy Lin, Zhucheng Tu, Michael Rose, and Patrick White
Proceedings of the First Workshop on Lifelogging Tools and Applications (LTA 2016), October 2016, Amsterdam, The Netherlands

We present Prizm, a prototype lifelogging device that comprehensively records a user’s web activity. Prizm is a wireless access point deployed on a Raspberry Pi that is designed to be a substitute for the user’s normal wireless access point. Prizm proxies all HTTP(S) requests from devices connected to it and records all activity it observes. Although this particular design is not entirely novel, there are a few features that are unique to our approach, most notably the physical deployment as a wireless access point. Such a package allows capture of activity from multiple devices, integration with web archiving for preservation, and support for offline operation. This paper describes the design of Prizm, the current status of our project, and future plans.
