I am currently a master’s student studying CS at The University of Texas at Austin. I did my undergraduate degree at Carnegie Mellon University, where I majored in Artificial Intelligence, TA’d two AI courses, and conducted research at the Language Technologies Institute in the area of neural lexical retrieval. Most recently, I interned at Jina AI, where I led the training of their Jina-ColBERT-v2 model, a multilingual multi-vector retriever.
I’m most interested in research on information retrieval and representation learning. My current and previous work has focused on efficient and expressive multi-vector retrieval models and reasoning-intensive retrieval tasks.
MS in Computer Science, 2023 - present
The University of Texas at Austin
BS in Artificial Intelligence, 2019 - 2023
Carnegie Mellon University
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT’s late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this work we propose a number of incremental improvements to the ColBERT model architecture and training pipeline, using methods shown to work in the more mature single-vector embedding model training paradigm, particularly those that apply to heterogeneous multilingual data or boost efficiency with little tradeoff. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks.
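To make the late-interaction idea concrete, here is a minimal sketch of ColBERT-style MaxSim scoring for a single query-document pair. The tensor shapes, embedding dimensionality, and the assumption of unit-normalized token embeddings are illustrative only, not the exact configuration used in Jina-ColBERT-v2.

```python
import torch


def late_interaction_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim late-interaction scoring.

    q_emb: [num_query_tokens, dim] per-token query embeddings (assumed L2-normalized)
    d_emb: [num_doc_tokens, dim]   per-token document embeddings (assumed L2-normalized)
    """
    # Cosine similarity between every query token and every document token.
    sim = q_emb @ d_emb.T                    # [num_query_tokens, num_doc_tokens]
    # Each query token keeps only its best-matching document token (MaxSim) ...
    max_sim = sim.max(dim=1).values          # [num_query_tokens]
    # ... and the query-document relevance score is the sum over query tokens.
    return max_sim.sum()


# Usage: score a toy query/document pair with random unit-norm embeddings.
q = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(100, 128), dim=-1)
print(late_interaction_score(q, d).item())
```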
Tip-of-the-Tongue (ToT) retrieval is challenging for search engines because the queries are usually natural-language, verbose, and contain uncertain and inaccurate information. This paper studies the generalization capabilities of existing retrieval methods with ToT queries in multiple domains. We curate a multi-domain dataset and evaluate the effectiveness of recall-oriented first-stage retrieval methods across the different domains, considering in-domain, out-of-domain, and multi-domain training settings. We further explore the use of a Large Language Model (LLM), i.e., GPT-4, for zero-shot re-ranking in various ToT domains, relying solely on the item titles. Results show that multi-domain training enhances recall, and that LLMs are strong zero-shot re-rankers, especially for popular items, outperforming direct GPT-4 prompting without first-stage retrieval.
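As a rough illustration of title-only zero-shot re-ranking, the sketch below builds a listwise prompt from the first-stage candidates' titles and hands it to an LLM. The prompt wording and the `call_llm` hook are hypothetical placeholders, not the exact prompt or client used in the paper.

```python
def build_title_reranking_prompt(query: str, titles: list[str]) -> str:
    """Build a listwise re-ranking prompt from candidate item titles only."""
    numbered = "\n".join(f"[{i + 1}] {t}" for i, t in enumerate(titles))
    return (
        "A user is trying to recall an item they can only half-remember.\n"
        f"Their description:\n{query}\n\n"
        "Candidate item titles:\n"
        f"{numbered}\n\n"
        "Rank the candidates from most to least likely to be the item the user "
        "means. Answer with candidate numbers in order, e.g. [2] > [5] > [1]."
    )


def rerank_titles(query: str, titles: list[str], call_llm) -> str:
    # `call_llm` is a placeholder for whatever chat-completion client is available;
    # it takes a prompt string and returns the model's text response.
    return call_llm(build_title_reranking_prompt(query, titles))
```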
COILcr (COntextualized Inverted Lists with Canonical Representation) extends the original COIL [Gao et al. 2021] neural-lexical retrieval system by explicitly factorizing COIL into intra-context term importance weights and cross-context semantic representations. At indexing time, COILcr further maps term semantic representations to a smaller set of clustered canonical representations, which efficiently preserve term semantics and retrieval performance while reducing storage and computational costs.
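The sketch below illustrates the kind of clustering step that canonicalization implies: k-means over one vocabulary term's contextualized vectors, with each occurrence mapped to its nearest centroid. The cluster count, helper names, and the scoring comment are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans


def canonicalize_term_vectors(vectors: np.ndarray, n_canonical: int = 8):
    """Cluster the contextualized vectors of a single vocabulary term into a
    small set of canonical representations (an illustrative stand-in for the
    COILcr indexing step).

    vectors: [num_occurrences, dim] contextual embeddings of one term across the corpus.
    Returns the canonical vectors and, per occurrence, the id of its canonical vector.
    """
    k = min(n_canonical, len(vectors))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    return km.cluster_centers_, km.labels_


# At search time, an occurrence would be scored via its importance weight and the
# similarity between the query term vector and the stored canonical vector; the
# exact scoring follows the paper, and this sketch covers only the clustering.
term_vectors = np.random.randn(500, 32).astype(np.float32)
canon, assign = canonicalize_term_vectors(term_vectors)
print(canon.shape, assign[:10])
```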