Projects at Stanford AI Lab (OVAL - Open Virtual Assistants Lab)
De(ep)Composition:
a system for long-horizon reasoning that decomposes unstructured information into hierarchical components and stores them in persistent, queryable memory. Useful in settings that require continuously updated views grounded in accumulated evidence, such as hedge-fund portfolio management, corporate teams monitoring markets, and investigative journalists maintaining long-running storylines. (The code can't be released publicly yet because it's part of an ongoing OVAL project, but here's
a fun advertisement we made!)
ScalableTokenizer:
a tokenizer that learns a vocabulary with column generation and decodes text with dynamic programming while consulting morphology-aware features, named-entity gazetteers, and regex guards. Inspired by foundational tokenizers such as SentencePiece and Byte-Pair Encoding, I see work like this as a path toward more efficient representations of text in LLMs.
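As a rough illustration of the dynamic-programming decoding idea only (not the project's code), the sketch below finds the minimum-cost split of a string given a vocabulary with per-token costs. The function name `segment`, the cost values, and the `unknown_cost` fallback are all assumptions made for this example; the morphology features, gazetteers, and regex guards are not modeled.

```python
import math


def segment(text, costs, unknown_cost=10.0):
    """Return the minimum-cost segmentation of `text` as a list of tokens.

    `costs` maps token -> cost (hypothetical values). A single character
    missing from `costs` is allowed at `unknown_cost`, so a segmentation
    always exists.
    """
    n = len(text)
    max_len = max((len(tok) for tok in costs), default=1)

    best = [math.inf] * (n + 1)  # best[i]: cheapest cost to cover text[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last token
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in costs:
                cost = costs[piece]
            elif end - start == 1:
                cost = unknown_cost  # fallback for unseen characters
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start

    # Walk the back-pointers from the end of the string to recover tokens.
    tokens = []
    pos = n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1]


# Hypothetical vocabulary: lower cost means a more preferred token.
vocab = {"un": 2.0, "happi": 3.0, "ness": 2.0, "happiness": 4.0}
print(segment("unhappiness", vocab))
```

With these made-up costs, the split `un` + `happiness` (total 6.0) beats `un` + `happi` + `ness` (total 7.0), which is the kind of trade-off a learned vocabulary would encode.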
POMDP for Investing Portfolios:
Optimal Portfolio Rebalancing Under Changing Market Conditions, for
CS 238 (decision-making under uncertainty). A partially observable Markov decision process (POMDP) for portfolio allocation under uncertain volatility regimes. It uses Fama-French 5-factor data and value iteration to maximize mean-variance utility, and it outperformed 60/40 and volatility-threshold strategies on return, Sharpe ratio, and maximum drawdown.
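A POMDP over hidden volatility regimes has to track a belief (a probability distribution over regimes) as new observations arrive. The sketch below shows one generic Bayes-filter belief update; it is not taken from the project, and the two-regime setup, transition matrix, and likelihood values are invented for illustration.

```python
def update_belief(belief, transition, likelihood):
    """One Bayes-filter step over hidden regimes.

    belief[i]        : prior P(regime = i)
    transition[i][j] : P(next regime = j | regime = i)
    likelihood[j]    : P(observation | next regime = j)
    """
    n = len(belief)
    # Predict: push the prior through the regime transition model.
    predicted = [
        sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)
    ]
    # Correct: weight each regime by how well it explains the observation.
    unnormalized = [predicted[j] * likelihood[j] for j in range(n)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("observation has zero probability under every regime")
    return [u / total for u in unnormalized]


# Hypothetical two-regime example: index 0 = low volatility, 1 = high.
prior = [0.5, 0.5]
transition = [[0.9, 0.1],
              [0.2, 0.8]]
# Assumed likelihood of a large daily move under each regime.
likelihood = [0.2, 0.8]

posterior = update_belief(prior, transition, likelihood)
print(posterior)
```

After observing a move that is more likely in the high-volatility regime, the belief shifts toward regime 1. A planner such as value iteration would then choose allocations as a function of this belief.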
PreSearch (research opportunity marketplace):
a two-sided platform connecting students seeking research positions with PIs and labs posting open project roles, with a compatibility matching algorithm over scraped academic interests. Deployed on AWS EC2 with a React frontend and FastAPI backend; I engineered a scraping pipeline for LinkedIn and university directories using Selenium and BeautifulSoup.
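One simple way to score compatibility between interest lists is Jaccard similarity over keyword sets. The sketch below is an assumption about what such a matcher could look like, not PreSearch's actual algorithm; the function names, lab names, and keywords are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_projects(student_interests, projects):
    """Return (project name, score) pairs, best match first.

    `projects` maps a project name to its list of interest keywords.
    Ties are broken alphabetically by name so the order is deterministic.
    """
    scored = [
        (jaccard(student_interests, interests), name)
        for name, interests in projects.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored]


# Hypothetical data for illustration only.
projects = {
    "nlp-lab": ["nlp", "llm", "tokenization"],
    "vision-lab": ["vision", "detection"],
}
print(rank_projects(["nlp", "llm", "finance"], projects))
```

Set overlap ignores how important each keyword is; a weighted or embedding-based score would be a natural next step if the raw keywords turn out to be too noisy.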
Atriium (archival document transcription):
a web-based system for transcribing scanned historical and archival documents into structured text, combining OCR with LLM-based post-processing for noisy and handwritten sources.
Stanford Archives Database:
complete corpus of Regina Twala's letters and archival materials, built via OCR- and LLM-based transcription, translation, and metadata structuring from scanned documents.
Sequence [backend] [frontend] (computer vision safety platform for trades):
a real-time video analytics system for hazardous trades, using deep learning-based object detection and activity recognition to identify unsafe behaviors and job-site risks.
State of the Students:
an award-winning civic media platform recognized by former presidential candidates Kamala Harris and Pete Buttigieg; had the most views and engagement for high-profile races and local elections from 2019 to 2023.