I am currently a master’s student studying CS at The University of Texas at Austin. I did my undergraduate degree at Carnegie Mellon University, where I majored in Artificial Intelligence, TA’d two AI courses, and conducted research at the Language Technologies Institute in the area of neural lexical retrieval. Most recently, I interned at Jina AI, where I led the training of their Jina-ColBERT-v2 model, a multilingual multi-vector retriever.
I’m most interested in research on information retrieval and representation learning. My current and previous work has focused on efficient and expressive multi-vector retrieval models and reasoning-intensive retrieval tasks.
MS in Computer Science, 2023 - present
The University of Texas at Austin
BS in Artificial Intelligence, 2019 - 2023
Carnegie Mellon University
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT’s late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this work we propose a number of incremental improvements to the ColBERT model architecture and training pipeline, using methods shown to work in the more mature single-vector embedding model training paradigm, particularly those that apply to heterogeneous multilingual data or boost efficiency with little tradeoff. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks.
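To make the late-interaction idea concrete, here is a minimal sketch of ColBERT-style MaxSim scoring for a single query-document pair. The tensor shapes, embedding dimensionality, and the assumption of unit-normalized token embeddings are illustrative only, not the exact configuration used in Jina-ColBERT-v2.

```python
import torch


def late_interaction_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim late-interaction scoring.

    q_emb: [num_query_tokens, dim] per-token query embeddings (assumed L2-normalized)
    d_emb: [num_doc_tokens, dim]   per-token document embeddings (assumed L2-normalized)
    """
    # Cosine similarity between every query token and every document token.
    sim = q_emb @ d_emb.T                    # [num_query_tokens, num_doc_tokens]
    # Each query token keeps only its best-matching document token (MaxSim) ...
    max_sim = sim.max(dim=1).values          # [num_query_tokens]
    # ... and the query-document relevance score is the sum over query tokens.
    return max_sim.sum()


# Usage: score a toy query/document pair with random unit-norm embeddings.
q = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(100, 128), dim=-1)
print(late_interaction_score(q, d).item())
```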
Tip-of-the-Tongue (ToT) retrieval is challenging for search engines because the queries are usually natural-language, verbose, and contain uncertain and inaccurate information. This paper studies the generalization capabilities of existing retrieval methods with ToT queries in multiple domains. We curate a multi-domain dataset and evaluate the effectiveness of recall-oriented first-stage retrieval methods across the different domains, considering in-domain, out-of-domain, and multi-domain training settings. We further explore the use of a Large Language Model (LLM), i.e., GPT-4, for zero-shot re-ranking in various ToT domains, relying solely on the item titles. Results show that multi-domain training enhances recall, and that LLMs are strong zero-shot re-rankers, especially for popular items, outperforming direct GPT-4 prompting without first-stage retrieval.
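As a rough illustration of title-only zero-shot re-ranking, the sketch below builds a listwise prompt from the first-stage candidates' titles and hands it to an LLM. The prompt wording and the `call_llm` hook are hypothetical placeholders, not the exact prompt or client used in the paper.

```python
def build_title_reranking_prompt(query: str, titles: list[str]) -> str:
    """Build a listwise re-ranking prompt from candidate item titles only."""
    numbered = "\n".join(f"[{i + 1}] {t}" for i, t in enumerate(titles))
    return (
        "A user is trying to recall an item they can only half-remember.\n"
        f"Their description:\n{query}\n\n"
        "Candidate item titles:\n"
        f"{numbered}\n\n"
        "Rank the candidates from most to least likely to be the item the user "
        "means. Answer with candidate numbers in order, e.g. [2] > [5] > [1]."
    )


def rerank_titles(query: str, titles: list[str], call_llm) -> str:
    # `call_llm` is a placeholder for whatever chat-completion client is available;
    # it takes a prompt string and returns the model's text response.
    return call_llm(build_title_reranking_prompt(query, titles))
```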
COILcr (COntextualized Inverted Lists with Canonical Representation) extends the original COIL [Gao et al. 2021] neural-lexical retrieval system by explicitly factorizing COIL into intra-context term importance weights and cross-context semantic representations. At indexing time, COILcr further maps term semantic representations to a smaller set of clustered canonical representations, which efficiently preserve term semantics and retrieval performance while reducing storage and computational costs.
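The sketch below illustrates the kind of clustering step that canonicalization implies: k-means over one vocabulary term's contextualized vectors, with each occurrence mapped to its nearest centroid. The cluster count, helper names, and the scoring comment are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans


def canonicalize_term_vectors(vectors: np.ndarray, n_canonical: int = 8):
    """Cluster the contextualized vectors of a single vocabulary term into a
    small set of canonical representations (an illustrative stand-in for the
    COILcr indexing step).

    vectors: [num_occurrences, dim] contextual embeddings of one term across the corpus.
    Returns the canonical vectors and, per occurrence, the id of its canonical vector.
    """
    k = min(n_canonical, len(vectors))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    return km.cluster_centers_, km.labels_


# At search time, an occurrence would be scored via its importance weight and the
# similarity between the query term vector and the stored canonical vector; the
# exact scoring follows the paper, and this sketch covers only the clustering.
term_vectors = np.random.randn(500, 32).astype(np.float32)
canon, assign = canonicalize_term_vectors(term_vectors)
print(canon.shape, assign[:10])
```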