From Raw Data to ML-Ready: Dataset Curation with Pandas
Schedule
Tue Feb 24 2026 at 02:00 pm to 04:00 pm UTC-06:00
Location
John Crerar Library - Kathleen A. Zar Room | Chicago, IL
About this Event
Data Science and Machine Learning (ML) play a crucial role in a wide range of domains, from the biological and physical sciences and engineering to finance and social science. In practice, however, most ML pipelines begin with datasets that need substantial curation before they can support any meaningful analysis.
This workshop will focus on practical techniques for curating and analyzing datasets. Topics include handling missing values, working with mixed numerical and non-numerical data, and preparing data for downstream Machine Learning tasks. Pandas will be used for exploratory data analysis and data preparation, with scikit-learn introduced to demonstrate how curated data feeds into ML models.
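As a rough illustration of the kind of workflow described above, the sketch below fills missing values with Pandas and then passes mixed numerical and non-numerical columns to a scikit-learn model. The toy data, the column names (`age`, `income`, `city`, `label`), the median and `"unknown"` fill strategies, and the choice of logistic regression are all assumptions made for this example; they are not taken from the workshop materials.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy dataset with missing values and a non-numerical column.
raw = pd.DataFrame({
    "age": [34, np.nan, 51, 29, 45, np.nan, 38, 60],
    "income": [52000, 61000, np.nan, 43000, 88000, 39000, 57000, 92000],
    "city": ["Chicago", "Boston", None, "Chicago",
             "Boston", "Chicago", None, "Boston"],
    "label": [0, 1, 1, 0, 1, 0, 0, 1],
})

# Curation step in Pandas: fill numerical gaps with the column median
# and mark missing categories explicitly (one possible strategy).
df = raw.copy()
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna("unknown")

X = df[["age", "income", "city"]]
y = df["label"]

# Scale the numerical columns and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Chain preprocessing and a simple classifier into a single estimator.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])
model.fit(X, y)
predictions = model.predict(X)
```

Keeping the encoding and scaling inside a `Pipeline` means the same transformations are applied at training and prediction time, which is one common way to avoid mismatches between the two.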
Participants will work through hands-on exercises to explore dataset properties, identify common data quality issues, and develop strategies for transforming raw data into ML-ready inputs. The workshop will be conducted on the Midway HPC system, demonstrating workflows suitable for both local and high-performance computing environments.
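A minimal sketch of what exploring dataset properties and spotting quality issues can look like in Pandas follows. The small table, its column names, and the specific problems planted in it (a duplicated row, a numeric column stored as text, inconsistent category spelling) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical raw table with several common data quality issues.
raw = pd.DataFrame({
    "sample_id": [1, 2, 2, 3, 4],
    "temperature": ["21.5", "22.1", "22.1", "n/a", "19.8"],
    "site": ["North", "north ", "north ", "South", None],
})

# Explore: column types, missing values per column, duplicated rows.
column_types = raw.dtypes
missing_per_column = raw.isna().sum()
n_duplicates = int(raw.duplicated().sum())

# Transform: drop exact duplicates, convert text to numbers
# (unparseable entries become NaN), and normalize category labels.
clean = raw.drop_duplicates().copy()
clean["temperature"] = pd.to_numeric(clean["temperature"], errors="coerce")
clean["site"] = clean["site"].str.strip().str.lower()

print(column_types)
print(missing_per_column)
print(clean)
```

Note that `errors="coerce"` turns the `"n/a"` entry into a missing value rather than raising an error, so it can then be handled with the same strategy as any other gap.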
- Do you have a large dataset but aren’t sure how to prepare it for use in a Machine Learning tool?
- Want to understand your data’s structure and properties before feeding it into a Machine Learning pipeline?
- Are missing, inconsistent, or messy values breaking your ML pipeline?
- Have you heard of or used tools like “Pandas” and “Scikit-learn” but want a clearer, hands-on understanding of how they fit into data preparation?
If the answer to any of these questions is “yes” – this workshop is for you.
Objectives:
By the end of this workshop, participants will be able to:
- Understand the core functionality of Pandas and the basic workflow of Scikit-learn.
- Build an end-to-end pipeline that transforms raw data into a trained ML model.
- Apply the demonstrated techniques to curate datasets and train ML models on their own data.
Level: Intermediate
Duration: 2 hours
Prerequisites: Working knowledge of Python. All participants are encouraged to bring a laptop with a Mac, Linux, or Windows operating system. Having an RCC account will be helpful to perform the exercises on Midway3.
Where is it happening?
John Crerar Library - Kathleen A. Zar Room, 5730 South Ellis Avenue, Chicago, United States
USD 0.00