Rohan Mackin Jha

Master’s Student & Researcher

The University of Texas at Austin

About

I am currently a master’s student studying CS at The University of Texas at Austin. I did my undergraduate degree at Carnegie Mellon University, where I majored in Artificial Intelligence, TA’d two AI courses, and conducted research at the Language Technologies Institute in the area of neural lexical retrieval. Most recently, I interned at Jina AI, where I led the training of their Jina-ColBERT-v2 model, a multilingual multi-vector retriever.

I’m currently most interested in research on information retrieval and representation learning. My current and previous work has focused on efficient and expressive multi-vector retrieval models and on reasoning-intensive retrieval tasks.

Interests

  • Information Retrieval
  • Representation Learning
  • Efficient IR Inference
  • Reasoning-Intensive IR
  • LMs for IR
  • LM Decoding Strategies

Education

  • MS in Computer Science, 2023 – present

    The University of Texas at Austin

  • BS in Artificial Intelligence, 2019 – 2023

    Carnegie Mellon University

Publications

Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT’s late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this work we propose a number of incremental improvements to the ColBERT model architecture and training pipeline, using methods shown to work in the more mature single-vector embedding model training paradigm, particularly those that apply to heterogeneous multilingual data or boost efficiency with little tradeoff. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks.
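
To make the late interaction scoring concrete, here is a minimal sketch of ColBERT-style MaxSim scoring as used by models like Jina-ColBERT-v2: every query token embedding is matched against its most similar document token embedding, and the per-token maxima are summed. The function name, shapes, and random inputs are illustrative assumptions, not the model’s actual code.

    import numpy as np

    def late_interaction_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
        """ColBERT-style MaxSim: for each query token, take the best cosine
        similarity against any document token, then sum over query tokens."""
        # Normalize rows so dot products become cosine similarities.
        q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
        d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
        sim = q @ d.T                        # (num_query_tokens, num_doc_tokens)
        return float(sim.max(axis=1).sum())  # best document token per query token

    # Illustrative usage with random vectors standing in for model output.
    rng = np.random.default_rng(0)
    score = late_interaction_score(rng.normal(size=(8, 128)), rng.normal(size=(200, 128)))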

Generalizable Tip-of-the-Tongue Retrieval with LLM Re-ranking

Tip-of-the-Tongue (ToT) retrieval is challenging for search engines because the queries are usually natural-language, verbose, and contain uncertain and inaccurate information. This paper studies the generalization capabilities of existing retrieval methods with ToT queries in multiple domains. We curate a multi-domain dataset and evaluate the effectiveness of recall-oriented first-stage retrieval methods across the different domains, considering in-domain, out-of-domain, and multi-domain training settings. We further explore the use of a Large Language Model (LLM), i.e. GPT-4, for zero-shot re-ranking in various ToT domains, relying solely on the item titles. Results show that multi-domain training enhances recall, and that LLMs are strong zero-shot re-rankers, especially for popular items, outperforming direct GPT-4 prompting without first-stage retrieval.
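
As a rough illustration of the LLM re-ranking stage described above, the sketch below prompts a language model to order first-stage candidates for a tip-of-the-tongue query using only their titles. The prompt wording and the ask_llm helper are hypothetical placeholders rather than the paper’s actual prompt or pipeline.

    def rerank_by_title(tot_query: str, candidate_titles: list[str], ask_llm) -> list[str]:
        """Zero-shot re-ranking sketch: ask an LLM (e.g., GPT-4) to order
        first-stage candidates for a ToT query using only item titles.
        ask_llm is a hypothetical callable mapping a prompt string to a reply string."""
        numbered = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(candidate_titles))
        prompt = (
            "A user is trying to recall an item they can only half-remember.\n"
            f"Their description: {tot_query}\n\n"
            "Candidate titles:\n"
            f"{numbered}\n\n"
            "List the candidate numbers from most to least likely, comma-separated."
        )
        reply = ask_llm(prompt)
        ranked = []
        for tok in reply.replace(",", " ").split():
            if tok.isdigit() and 0 < int(tok) <= len(candidate_titles):
                idx = int(tok) - 1
                if idx not in ranked:
                    ranked.append(idx)
        # Append any candidates the model omitted so nothing is dropped.
        ranked += [i for i in range(len(candidate_titles)) if i not in ranked]
        return [candidate_titles[i] for i in ranked]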

COILcr: Efficient Semantic Matching in Contextualized Exact Match Retrieval

COILcr (COntextualized Inverted Lists with Canonical Representation) extends the original COIL [Gao et al. 2021] neural lexical retrieval system by explicitly factorizing COIL into intra-context term importance weights and cross-context semantic representations. At indexing time, COILcr further maps term semantic representations to a smaller set of clustered canonical representations, which efficiently preserve term semantics and retrieval performance while reducing storage and computational costs.
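
One way to picture the canonical-representation step is clustering each term’s contextualized vectors at indexing time and storing only the centroids plus a small cluster ID per occurrence. The sketch below uses k-means from scikit-learn purely as an illustration of that idea; it is not the paper’s implementation, and the function name and parameters are assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    def canonicalize_term_vectors(term_vectors: np.ndarray, n_canonical: int = 8):
        """Illustrative COILcr-style indexing step for a single vocabulary term:
        cluster its contextual embeddings and keep the centroids ("canonical
        representations") plus one small integer ID per occurrence."""
        k = min(n_canonical, len(term_vectors))
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(term_vectors)
        return km.cluster_centers_, km.labels_

    # Toy usage: 1,000 occurrences of a term compress to 8 canonical vectors,
    # so the index stores 8 vectors plus 1,000 small IDs instead of 1,000 vectors.
    rng = np.random.default_rng(0)
    centroids, ids = canonicalize_term_vectors(rng.normal(size=(1000, 32)))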

Experience

Model Training Intern

Jina AI

May 2024 – Aug 2024 · Berlin, Germany
  • Led the research and development of a new multilingual, multi-vector retrieval model
  • Expanded training infrastructure and administered controlled experiments on data, architecture, and training recipes

Software Engineer Intern

Meta

May 2022 – Aug 2022 · Menlo Park, CA
  • Implemented sparse Mixture-of-Experts and recent differentiable gating mechanisms in multiple sub-architectures of production advertisement recommendation models using the Caffe2/PyTorch frameworks
  • Conducted validation and ablation experiments to determine the efficacy and infrastructure costs of the newly introduced modules, achieving model performance improvements that supported multiple organizations across the company and translated into increased advertisement revenue

Undergraduate Research Assistant

Carnegie Mellon University Language Technologies Institute

Dec 2021 – Dec 2022 · Pittsburgh, PA
  • Designed, implemented, and presented experiments and results to the principal investigator
  • Produced an independent conference-style study analyzing the performance/cost tradeoff and distribution of sparsely factorized dense embeddings as a performance-preserving, cost-reducing modification to an existing retriever
  • Supported published research in neural information retrieval focused on combining dense language models’ context with the sparse efficiency of the inverted list architecture

Teaching Assistant (07-180: Concepts of AI, 15-281: AI: Representation and Problem Solving)

Carnegie Mellon University

Aug 2021 – May 2022 · Pittsburgh, PA
  • Designed, tested, proctored, and graded written and programming-based homework and exams
  • Led 120+ students with weekly office hours, recitations, and exam review sessions
  • Served as the primary responder for student questions on the class forum, maintaining high coverage and low response latency

Software Development Engineer Intern

Amazon

Jun 2021 – Aug 2021 · Seattle, WA
  • Designed, implemented, and tested software processing billions of customer impressions hourly
  • Led extensive design and review process after researching and documenting various solution alternatives
  • Reduced latency, cost, and on-call pain points to accommodate an imminent 3-7x increase in impression volume