Recently, Snow Fox Data had the opportunity to present the wonders of data science and analytics to a local middle school math club. Eager to support our community and give these mathematicians a taste of what makes data analysis exciting, our team chose a sweet topic to explore: candy.
During our presentation, we guided students through analyzing a candy dataset, optimizing “ingredients” to design their own candy, and using Generative AI to name their optimally designed creation. In just a short time, these students turned numbers and data into an impressive candy-coated adventure.
Meet the Math Club
The 6th-8th grade math club meets bi-weekly throughout the school year, offering an engaging blend of learning and fun for young math enthusiasts. Each session features a mix of competitive math challenges and intriguing logic puzzles that stimulate problem-solving skills and critical thinking. The club also encourages presentations from the community on real-world applications of mathematics, showcasing how math is utilized in various professions. These presentations provide students with practical insights and inspire them to see the relevance of their math skills beyond the classroom. With its creative activities and community engagement, the math club fosters a love for math and encourages students to explore its many possibilities.
Sweet Statistics
In our analysis, we used data from The Ultimate Halloween Candy Power Ranking dataset on Kaggle, which provides the survey results from WaltHickey.com. In this online survey, participants were presented with two pieces of candy and asked to vote on which they preferred. Participants could vote multiple times on different matchups (e.g., M&M'S® versus Mounds bars), and the results were collected. The summarized dataset contains the winning percentage of the 85 types of candy surveyed. What makes this dataset even more interesting is that attributes were then assigned to each type of candy - “chocolate”, “fruity”, “caramel”, etc. - which provides an opportunity to analyze candy-eaters’ preferences and potentially help us design the perfect piece of candy.
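For readers curious how head-to-head votes can turn into a winning percentage, here is a minimal sketch. It assumes the percentage is simply the share of a candy's matchups that it won, which may not match the survey's exact method. The candy names and votes are made up for illustration and are not from the real dataset.

```python
from collections import Counter

def win_percentages(matchups):
    """Given (winner, loser) pairs, return {candy: percent of its matchups won}."""
    wins = Counter()
    appearances = Counter()
    for winner, loser in matchups:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return {candy: 100.0 * wins[candy] / appearances[candy] for candy in appearances}

# Toy votes (hypothetical): each tuple is (candy that won, candy that lost).
toy_votes = [
    ("Candy A", "Candy B"),
    ("Candy A", "Candy C"),
    ("Candy B", "Candy C"),
    ("Candy C", "Candy A"),
]

result = win_percentages(toy_votes)
for candy, pct in sorted(result.items()):
    print(f"{candy}: {pct:.1f}%")
```

Each candy is scored only against the matchups it actually appeared in, so a candy shown less often is not penalized for it.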
These 6th graders were a wealth of candy knowledge, having had years of experience consuming fun-size snacks. They enjoyed creating their own internal tally of favorites.
Data Is Like a Box of Chocolates
Before digging into any data, we did two things. First, the students got to pick (and eat) from a basket of mixed Halloween candy and we listed the “attributes” of the candy on a whiteboard. After doing some delicious sampling, we asked the students to vote on which attribute was their favorite - likely influenced by the piece they were snacking on at the time. Like a box of chocolates, we weren’t quite sure what we were going to get. The results looked similar to the following, with chocolate and caramel being the overwhelming favorites.
Note: In the case of candy, “Pluribus” means “contains several individual pieces” - e.g., Skittles.
Analysis, Sweet Analysis
After performing our own club survey, we began describing and analyzing the dataset we had at our disposal - introduced as the data science task of “descriptive analysis”. We explained how important it is to truly understand the data before trying to use it to answer any questions; otherwise, any results we find may lead to an inaccurate answer. For this analysis, we used Dataiku since the platform provides a very intuitive user interface that could be easily understood by the math club. We explored the data in a simple table view, analyzing the distributions of the attribute columns as well as discussing the win_percent column.
After a brief (and sometimes passionate) discussion concerning the unique attributes of the various candies, we jumped into the idea of visualizations, creating the following bar chart which lists the top 12 candies ordered by their winning percentage in the online poll. The realization that REESE'S claimed 3 of the top 6 spots was highly controversial amongst this team of candy experts!
Numerous other columns were analyzed in a similar manner until we felt we had a firm grasp on the candy dataset.
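The ranking behind a chart like the top-12 bar chart can be sketched in a few lines of Python. This is only an illustration, not the Dataiku chart itself: the function names are hypothetical, the rows are made-up toy values, and we assume each row has a "name" and a "win_percent" field.

```python
def top_candies(rows, n=12, key="win_percent"):
    """Return the n rows with the highest value in `key`, best first."""
    return sorted(rows, key=lambda row: row[key], reverse=True)[:n]

def text_bar_chart(rows, key="win_percent", width=40):
    """Render one line per row; a value of 100 would fill `width` characters."""
    lines = []
    for row in rows:
        bar = "#" * round(row[key] / 100 * width)
        lines.append(f"{row['name']:<10} {bar} {row[key]:.1f}")
    return "\n".join(lines)

# Toy rows (hypothetical values, not from the real dataset).
toy_rows = [
    {"name": "Candy A", "win_percent": 45.0},
    {"name": "Candy B", "win_percent": 80.0},
    {"name": "Candy C", "win_percent": 62.5},
    {"name": "Candy D", "win_percent": 30.0},
]

top_three = top_candies(toy_rows, n=3)
print(text_bar_chart(top_three))
```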
The Candy Designer Challenge
Once the candy data was well understood, we introduced the real goal of working with this data - to use statistical analysis to design the optimal candy based on user survey preferences. Although the students thought it might be easy to use a dataset to validate which attributes are most likely to be present in popular candy, we introduced an even more complex question: What combination of attributes is most likely to be popular? This question was determined to be a bit more difficult.
Crunching the Numbers
As the students had seen in the descriptive analysis, it was possible to determine popularity on a single-attribute basis, but it wasn’t clear how we could perform an analysis to compare all of the attributes to see which was most popular.
To help solve this problem, the idea of correlation and a correlation matrix was presented. By creating a correlation matrix, it would be possible to see the strength of each attribute in relation to the winning percent. As shown in the following correlation matrix, the students found that “fruity” and “hard” candies have a negative correlation with winning, whereas “chocolate” and “bar” have a very strong positive correlation. In many cases, this helped the students validate their assumptions - and in other cases, provided interesting insight.
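To make the idea of correlation concrete, here is a minimal sketch that computes the Pearson correlation between each yes/no attribute and the winning percentage. We assume attributes are stored as 0 or 1 and that the score column is named win_percent. The four rows are invented toy values, so the numbers say nothing about the real survey.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length number lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / (spread_x * spread_y)

def correlations_with_target(rows, attributes, target="win_percent"):
    """Correlate every attribute column with the target column."""
    target_values = [row[target] for row in rows]
    return {
        attr: pearson([row[attr] for row in rows], target_values)
        for attr in attributes
    }

# Toy rows (hypothetical values, not from the real dataset).
toy_rows = [
    {"chocolate": 1, "fruity": 0, "win_percent": 80.0},
    {"chocolate": 1, "fruity": 0, "win_percent": 70.0},
    {"chocolate": 0, "fruity": 1, "win_percent": 40.0},
    {"chocolate": 0, "fruity": 1, "win_percent": 30.0},
]

corr = correlations_with_target(toy_rows, ["chocolate", "fruity"])
for attr, value in corr.items():
    print(f"{attr}: {value:+.2f}")
```

A value near +1 means the attribute tends to appear on winning candies, a value near -1 means the opposite, and a value near 0 means no clear linear relationship. A full correlation matrix repeats this calculation for every pair of columns.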
Advanced Analysis with Python
Using the above descriptive and statistical analytics techniques, we walked the students through the process of identifying the leading candies and correlations of individual attributes to the winning percentage in the online poll. To find the optimal combinations of attributes, we presented the idea of using a script written in Python to perform the more complex analysis. Although there are more streamlined methods to perform combinatorial optimization, for this example, we walked through an approach of enumerating every possible combination of 3 attributes (a search space we worked out to be 84 combinations in total).
Many of the students were surprisingly comfortable with Python scripting and were able to follow along with our pseudocode description of the approach. Using this script, we were able to create a result dataset that listed the top combinations of 3 attributes, ordered by the average winning percentage of the candies with these attributes, as well as the real candies that have those attributes.
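A brute-force approach along these lines can be sketched as follows. This is an illustrative sketch, not the script used in the session. It assumes the nine attributes named in this post are the full list, which gives C(9, 3) = 84 combinations, and the candy rows are invented toy values.

```python
from itertools import combinations
from statistics import mean

# Assumption: the nine attributes named in this post are the complete list.
ATTRIBUTES = [
    "chocolate", "fruity", "caramel", "peanuty_almondy", "nougat",
    "crispedrice_wafer", "hard", "bar", "pluribus",
]

def candy(name, win_percent, *attrs):
    """Build a row with a 0/1 flag for every attribute (hypothetical helper)."""
    row = {"name": name, "win_percent": win_percent}
    row.update({a: int(a in attrs) for a in ATTRIBUTES})
    return row

def rank_combinations(rows, attributes, size=3):
    """Average the win percent of candies having ALL attributes in each combo."""
    results = []
    for combo in combinations(attributes, size):
        matches = [r for r in rows if all(r[a] == 1 for a in combo)]
        if matches:  # skip combos that no candy in the data has
            results.append({
                "attributes": combo,
                "avg_win_percent": mean(r["win_percent"] for r in matches),
                "candies": [r["name"] for r in matches],
            })
    return sorted(results, key=lambda r: r["avg_win_percent"], reverse=True)

# Toy rows (hypothetical values, not from the real dataset).
toy_rows = [
    candy("Candy A", 80.0, "chocolate", "peanuty_almondy", "pluribus"),
    candy("Candy B", 60.0, "chocolate", "caramel", "bar"),
    candy("Candy C", 40.0, "fruity", "hard", "pluribus"),
    candy("Candy D", 70.0, "chocolate", "caramel", "bar", "nougat"),
]

total_combos = len(list(combinations(ATTRIBUTES, 3)))
ranked = rank_combinations(toy_rows, ATTRIBUTES)
print(total_combos, "combinations checked")
for entry in ranked[:3]:
    print(entry["attributes"], round(entry["avg_win_percent"], 1), entry["candies"])
```

Checking all 84 combinations is fast at this size. The search space grows quickly with more attributes or larger combinations, which is where more streamlined optimization methods become worthwhile.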
The winner of the top attribute combination turned out to be “chocolate”, “peanuty_almondy”, and “pluribus” - which was a somewhat surprising result compared to some of the students’ predictions. Since we had proposed the challenge of designing a candy that had attributes unlike any other on the market, we also found the top potential attribute combination that wasn’t found in an existing candy in the dataset, which turned out to be “peanuty_almondy”, “nougat”, and “crispedrice_wafer”. This combination was projected to have a 64% win rate based on the average win rates of candies with those attributes.
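One possible way to project a score for a combination that no existing candy has is sketched below. It rests on our own reading of "average win rates of candies with those attributes": take the average win percent for each attribute separately, then average those three numbers. The actual calculation may have differed. The function names and toy rows are hypothetical.

```python
from itertools import combinations
from statistics import mean

def attribute_average(rows, attr):
    """Mean win percent of candies that have `attr`, or None if none do."""
    wins = [r["win_percent"] for r in rows if r[attr] == 1]
    return mean(wins) if wins else None

def project_unseen(rows, attributes, size=3):
    """Score combos that no candy fully matches, using per-attribute averages."""
    per_attr = {a: attribute_average(rows, a) for a in attributes}
    projections = []
    for combo in combinations(attributes, size):
        if any(all(r[a] == 1 for a in combo) for r in rows):
            continue  # an existing candy already has this combination
        if any(per_attr[a] is None for a in combo):
            continue  # no data at all for one of the attributes
        projections.append((combo, mean(per_attr[a] for a in combo)))
    return sorted(projections, key=lambda item: item[1], reverse=True)

# Toy rows (hypothetical values, not from the real dataset).
toy_attributes = ["chocolate", "nougat", "hard", "fruity"]
toy_rows = [
    {"name": "Candy A", "win_percent": 80.0, "chocolate": 1, "nougat": 1, "hard": 0, "fruity": 0},
    {"name": "Candy B", "win_percent": 50.0, "chocolate": 1, "nougat": 0, "hard": 1, "fruity": 0},
    {"name": "Candy C", "win_percent": 30.0, "chocolate": 0, "nougat": 0, "hard": 1, "fruity": 1},
]

unseen = project_unseen(toy_rows, toy_attributes)
best_combo, best_score = unseen[0]
print(best_combo, round(best_score, 1))
```

A projection like this treats attributes as independent, so it is best read as a rough conversation starter and not as a prediction.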
The Final Ingredient: GenAI
Once we had developed our new idea for a candy, what modern data science project could be complete without utilizing Generative AI? We had a lively discussion about the idea of large language models and ChatGPT - which the club members were incredibly well-versed in. To utilize LLMs in our project, we used the OpenAI API along with Dataiku’s prompt studio to generate candy bar names based on the attribute combinations from our generated dataset.
The students experimented with prompts to generate example names and finally settled on an LLM-generated name for their newly designed bar, “Crunchy Nutty Chomp-A-Lot”.
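For a sense of what a naming prompt could look like, here is a minimal sketch of a prompt template built from an attribute combination. The wording and the build_naming_prompt helper are our own illustrative assumptions, not the prompt used in the session. The step of sending the text to a model is deliberately left out.

```python
def build_naming_prompt(attributes, n_names=5):
    """Turn attribute names into a plain-text prompt asking for candy names."""
    readable = ", ".join(attr.replace("_", " ") for attr in attributes)
    return (
        f"You are naming a brand-new candy bar. Its key attributes are: {readable}. "
        f"Suggest {n_names} fun, kid-friendly names, one per line."
    )

# The unused attribute combination described earlier in this post.
prompt = build_naming_prompt(("peanuty_almondy", "nougat", "crispedrice_wafer"))
print(prompt)
```

Small changes to a template like this, such as the tone or the number of names requested, are the kind of thing the students could experiment with.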
A Taste of What Data Science Can Do
While the deliciously chocolatey “Crunchy Nutty Chomp-A-Lot” will not be headed to stores anytime soon, these students will be choosing a career in a few years, and we hope our exploration has given them a “taste” of what data science can do and the impact it can make in our world.