Spray charts are a data visualization used in baseball to show where on the field a hitter has historically hit the ball. Hitters have distinct tendencies for various reasons (e.g., handedness, contact point, bat speed). Spray charts are useful both for understanding a player's mechanics and for game-planning against opposing hitters. They are especially valuable in collegiate baseball, where the "shift" (i.e., fielders repositioning in response to a hitter's tendencies) is a legal strategy. Baseball, especially at the professional level, has adopted technologically advanced data-recording tools (e.g., Trackman, Hawkeye, Rapsodo) that use radar and high-speed cameras to track ball and player movement. With these technologies, creating spray charts is simple. Unfortunately, they are often prohibitively expensive, so many amateur teams are unable to reap the benefits of high-quality data. This project proposes an alternative method for making spray charts that does not rely on high-speed cameras or radar. Instead, it uses large language models from OpenAI (e.g., GPT-4) to process existing textual play-by-play data, which is already available at all levels of baseball. Using a semi-supervised machine learning approach called pseudo-labeling, we are able to train OpenAI’s cheapest large language models to create spray charts with greater than 90% accuracy.
Senne Michielssen, '25:
https://www.linkedin.com/in/senne-m
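
The pseudo-labeling idea named in the abstract can be illustrated with a minimal sketch. This is not the project's implementation: the project trains OpenAI language models, whereas the sketch below substitutes a toy word-vote classifier so that it runs offline. The function names (`train`, `predict`, `pseudo_label`), the three field zones, the 0.8 confidence threshold, and the sample play-by-play strings are all hypothetical, chosen only to show the loop: train on a small labeled seed, label the unlabeled plays the model is confident about, and retrain on the enlarged set.

```python
from collections import defaultdict


def train(examples):
    """Toy stand-in for a model: map each word to the zones it was seen with."""
    zones_by_word = defaultdict(set)
    for text, zone in examples:
        for word in text.lower().split():
            zones_by_word[word].add(zone)
    return zones_by_word


def predict(model, text):
    """Return (zone, confidence). Only words tied to exactly one zone vote."""
    votes = defaultdict(int)
    for word in set(text.lower().split()):
        zones = model.get(word, set())
        if len(zones) == 1:
            votes[next(iter(zones))] += 1
    if not votes:
        return None, 0.0
    best = max(votes, key=votes.get)
    return best, votes[best] / sum(votes.values())


def pseudo_label(labeled, unlabeled, threshold=0.8, rounds=3):
    """Repeatedly retrain, promoting confident predictions to training labels."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        confident, uncertain = [], []
        for text in remaining:
            zone, confidence = predict(model, text)
            if confidence >= threshold:
                confident.append((text, zone))  # accept as a pseudo-label
            else:
                uncertain.append(text)  # leave for a later round
        if not confident:
            break  # nothing new was learned, so stop early
        labeled.extend(confident)
        remaining = uncertain
    return train(labeled), labeled, remaining


# Hypothetical hand-labeled seed and unlabeled play-by-play snippets.
seed = [
    ("singled to left field", "left"),
    ("singled to center field", "center"),
    ("singled to right field", "right"),
]
plays = [
    "doubled to left field down the lf line",
    "flied out to right field",
    "doubled down the lf line",
    "grounded out up the middle",
]

model, labeled, unresolved = pseudo_label(seed, plays)
print(labeled[len(seed):])  # plays that received pseudo-labels
print(unresolved)           # plays the model never became confident about
```

In this toy run, "doubled down the lf line" cannot be labeled in the first round, because none of its words appear in the seed. It becomes labelable in the second round, once the first pseudo-labeled play has associated "lf" and "line" with left field. The same run also shows why the confidence threshold matters: a pseudo-label teaches the model every word in the play, including incidental ones, so low-confidence predictions are held back rather than added to the training set.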