ML-NYC Speaker Series and Happy Hour: Danqi Chen
Schedule
Tue Apr 08 2025 at 04:00 pm to 06:00 pm (UTC-04:00)
Location
Flatiron Institute | New York, NY

About this Event
The ML in NYC Speaker Series + Happy Hour is excited to host Professor Danqi Chen as our April speaker! Her talk will take place this Tuesday, April 8th, at 4pm at the Flatiron Institute. As always, there will be a reception afterward for all attendees.
Title: Optimizing Data Use for Pre-training Language Models
Abstract: Modern language models are trained on massive, unstructured data consisting of trillions of tokens, typically obtained by crawling the web. In this talk, I argue that we are still in the early stages of understanding pre-training data and unlocking its full potential, and that more effective use of data can lead to both more compute-efficient and more capable language models. I will present several perspectives on improving data curation, focusing on three general techniques. First, quality filtering aims to train classifiers that can distinguish high-quality from low-quality documents at scale (QuRating). Second, domain curation focuses on developing taxonomies of web data and leveraging domain mixing strategies to enhance pre-training (WebOrganizer). Third, I will introduce a simple pre-training approach that conditions on metadata, which both accelerates training and improves model steerability (MeCo). Together, these efforts highlight the importance of optimizing the use of pre-training data and point toward a more data-centric paradigm for training future language models.
Bio: Danqi Chen is an Assistant Professor of Computer Science at Princeton University and co-leads the Princeton NLP Group. She also serves as an Associate Director of Princeton Language and Intelligence (PLI), an initiative focused on fundamental research on large AI models. Her research centers on training, adapting, and understanding language models (LMs), with an emphasis on making them more accessible to academia. She also works at the intersection of LMs and retrieval, exploring how retrieval can serve as a foundational component of LMs. Before joining Princeton, Danqi was a visiting scientist at Facebook AI Research in Seattle. She earned her Ph.D. from Stanford University (2018) and her B.E. from Tsinghua University (2012), both in Computer Science. Her work has been recognized with a Sloan Fellowship, an NSF CAREER Award, a Samsung AI Researcher of the Year Award, and multiple outstanding paper awards from ACL and EMNLP.
Where is it happening?
Flatiron Institute, 162 5th Avenue, New York, United States
USD 0.00
