Ryan Moore · September 23, 2024 · 9 min read

Mint Condition Metrics: Organizing Chaotic Text With Dataiku and OpenAI

Extracting valuable insights from messy, unstructured data has traditionally been a daunting task. Whether it’s unstructured auction listings like the ones we’re exploring in this article, social media posts, or product reviews, trying to make sense of the chaos can feel like searching for a needle in a haystack. Fortunately, with the rise of Large Language Models (LLMs), what once seemed like an impossible task has become not only possible, but refreshingly simple. Instead of relying on rigid rules and complicated techniques, LLMs can help understand and organize data in ways that are flexible, intuitive, and surprisingly accurate. LLM solutions developed in Dataiku can bring order to chaos and unlock the hidden potential within seemingly unruly data.

Try it out! Download the completed project from Snow Fox Data’s GitHub repository.

The Challenge: Making Sense of Unstructured Data

In this demonstration, we’re working with a dataset containing a relatively small number of completed eBay auction listings for baseball cards from 1987. This dataset contains the title of the eBay listing, the link to the original listing, and the price the card sold for.

If you aren’t familiar, baseball cards were a hot commodity in 1987 (and are again), but were grossly overproduced, causing a supply-and-demand imbalance that rendered them less valuable than similar cards produced earlier in the 1980s - a period now referred to as the “Junk Wax Era”.

For this analysis, we will use the card titles and sale prices to understand:

  • Which card was sold in this auction? (Including year, brand, set, and card number.)
  • What player(s) are featured on this card? (Some cards have multiple players.)
  • Was the card professionally graded?
    • If so, what was the grade?
    • If not, was there an “ungraded” condition of the card listed?

Note: Collectible grading is the process of evaluating the condition and authenticity of a sports trading card to assign it a numerical grade between 1 and 10. This grade indicates the card’s quality, which directly affects its value to collectors and investors.

Once we have the above information for each card, we should be able to easily perform interesting analysis such as:

  • Which specific card had the highest average sales price?
  • How much higher does a professionally graded card typically sell for?

The real challenge to this exercise is the parsing of the titles, which follow no consistent patterns and may contain superfluous words and even player names that are hard to identify and extract. The exact same card in multiple listings could be listed with very different titles, such as:

  • LOOK NOW👀 1987 Mark McGwire ROOKIE! Donruss #46 PSA 8
  • 1987 Donruss Mark McGwire 46 Athletics HOF MINT

Before the advent of LLMs, approaches to solving this problem were complex at best: a combination of extraction techniques such as regex string matching, fuzzy string matching, ML named entity recognition, and ML text classification could make the task possible, but the result would be error-prone and difficult to train and build. Today’s most advanced Large Language Models, however, make this task much easier to execute, providing more flexibility and delivering a higher degree of accuracy.

The Dataset

The dataset is a .csv containing 2,041 rows that has been uploaded into a Dataiku project (for more information on creating datasets, see this documentation). As you’ll see in the following screenshot, this simple dataset contains the titles of 1987 baseball cards sold on eBay, as well as the price the card sold for.

Dataset in Dataiku 1

Here, we also see examples of unstructured text that make it nearly impossible to perform descriptive analysis of the dataset in its raw format. Using a regex, we may be able to extract the year, brand, and card numbers, but anything beyond that is sure to be a challenging and time-consuming process.
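To make that limitation concrete, here is a minimal sketch of what a regex-only approach might look like. The patterns and the sample title are illustrative assumptions: the year and card number are easy to match, but the brand only works because we enumerate known brands up front, and attributes like player names or grading terms quickly fall outside what patterns can reliably capture.

```python
import re

title = "LOOK NOW👀 1987 Mark McGwire ROOKIE! Donruss #46 PSA 8"

# Year: a four-digit number in a plausible range
year = re.search(r"\b(19[5-9]\d|20[0-2]\d)\b", title)

# Card number: digits following a '#' (not every listing includes one)
card_number = re.search(r"#(\d+)", title)

# Brand: only works if we enumerate the known brands ourselves
brand = re.search(r"\b(Topps|Donruss|Fleer)\b", title, re.IGNORECASE)

print(year.group(0), brand.group(0), card_number.group(1))  # 1987 Donruss 46
```

Note that the second sample title above (“1987 Donruss Mark McGwire 46 Athletics HOF MINT”) has no “#” at all, so even the card-number pattern breaks down across listings.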

LLMs for the Win

To get started, we’re going to filter our uploaded ebay_sales dataset with a “Filter recipe” in the Dataiku flow, reducing the set to some of the more popular cards, and then add a “Prompt” recipe from the LLM Recipes section.

Baseball Card Ebay Sales in Dataiku 1

With this recipe created, we’ll select Edit in Prompt Studio to open a Dataiku Prompt Studio, where we’ll author the prompt that instructs an LLM how to parse the eBay titles.

Baseball Card Ebay Sales in Dataiku 2

After giving the prompt studio a name, we need to select the LLM to use for this design session. In our DSS instance, we have configured an LLM Connection that uses the OpenAI API, allowing us to select the GPT-4 model for our development.

Baseball Card Ebay Sales in Dataiku 3

Once we’ve selected our LLM, we’re going to use Dataiku’s intuitive Prompt Studio to design our prompt template. For this prompt, we want to pull examples of the eBay card titles from the ebay_sales_filtered dataset in our flow, so we’ll select that dataset in the “Inputs from” dropdown, as shown below. After selecting the dataset, we’ll map the relevant column (title) to a parameter of the same name in our prompt template so that the column can be used as an input to the prompt.

Baseball Card Ebay Sales in Dataiku 4

As many of us now know, the real magic of LLM development is in the prompt template itself, which we provide to the language model along with each of our card titles. For this effort, we want to extract attributes from the titles, such as the player’s name, year, brand, card number, and card grade. Since we’re looking to extract a consistent, structured result from these records, we’re going to ask the LLM to output the results in JSON format. By using a predefined JSON structure, it will be much easier to create a downstream process that works with the data consistently. Prompt template development can be an art in itself, but a good start for this task may look like:

Given the following eBay listing title, extract the details of the baseball card in JSON format. 

Only include the attributes "year", "brand", "card_number", "player_name", "professional_grade", "ungraded_condition". 

"professional_grade" will have values similar to "PSA9", "HGA 7", "SGC 8.5", or "CSG3". 

"ungraded_condition" will have values similar to "NR-MT", "Excellent", "mint"

{title}

You’ll notice that we’ve added the mapped {title} column to this template using the curly-brace syntax as described in this tutorial.
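The substitution itself is simple string templating. Below is a rough Python sketch of how the template and a response of the requested shape fit together; the example_response here is hand-written to illustrate the structure we’re asking for, not actual model output.

```python
import json

# The prompt template from above, with a {title} placeholder
PROMPT_TEMPLATE = """Given the following eBay listing title, extract the details of the baseball card in JSON format.
Only include the attributes "year", "brand", "card_number", "player_name", "professional_grade", "ungraded_condition".
"professional_grade" will have values similar to "PSA9", "HGA 7", "SGC 8.5", or "CSG3".
"ungraded_condition" will have values similar to "NR-MT", "Excellent", "mint"

{title}"""

prompt = PROMPT_TEMPLATE.format(
    title="LOOK NOW👀 1987 Mark McGwire ROOKIE! Donruss #46 PSA 8"
)

# A hand-written response in the requested structure (illustrative only)
example_response = """{"year": "1987", "brand": "Donruss", "card_number": "46",
 "player_name": "Mark McGwire", "professional_grade": "PSA 8",
 "ungraded_condition": null}"""

card = json.loads(example_response)
print(card["player_name"])  # Mark McGwire
```

Because the keys are fixed by the prompt, downstream code can rely on the same structure for every row.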

With this first pass at a prompt created, we can click the RUN PROMPT button to test out our prompt on the first eight records in our dataset. As you can see in the screenshot below, the LLM does an impressive job of parsing the desired attributes from these titles. Some of the titles contain all of our targeted attributes, while others only contain a few, but in these examples, the LLM is able to identify them with 100% accuracy. 

Baseball Card Ebay Sales in Dataiku 5

As with many of the tasks associated with the Data Science Lifecycle, this prompt development may undergo a series of iterations and testing. Once we’re satisfied with the results, we’ll click the Export as Recipe button to move the prompt template and configuration we’ve established back into the prompt recipe in our flow.

Baseball Card Ebay Sales in Dataiku 6

With the recipe configuration populated, we can run the recipe to generate the parsed JSON results per row. In the output dataset, four new columns will be present; we’re mainly interested in llm_output, which contains the raw JSON created by the LLM.

With this dataset generated, we’ll create a Prepare recipe to extract and make use of the JSON results. To accomplish this, we’ll add an Unnest step which is capable of expanding the JSON result column to create a corresponding column per JSON object key. 
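For intuition, the Unnest step is roughly equivalent to parsing each JSON string and expanding its keys into columns. Here is a minimal pandas sketch of that idea, using a hypothetical row in place of the real llm_output data:

```python
import json
import pandas as pd

# A hypothetical row shaped like the prompt recipe's output
df = pd.DataFrame({
    "title": ["1987 Donruss Mark McGwire 46 Athletics HOF MINT"],
    "llm_output": ['{"year": "1987", "brand": "Donruss", "card_number": "46", '
                   '"player_name": "Mark McGwire", "professional_grade": null, '
                   '"ungraded_condition": "MINT"}'],
})

# Parse each JSON string and expand the keys into their own columns
parsed = pd.json_normalize(df["llm_output"].apply(json.loads).tolist())
result = df.drop(columns="llm_output").join(parsed)

print(result.columns.tolist())
```

Each JSON key (year, brand, card_number, and so on) becomes its own column, just as the Unnest step produces in the flow.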

Baseball Card Ebay Sales in Dataiku 7

Along with this step, we’re going to add an Extract with regular expression step to make the graded card data easier to analyze. The regex we’ll use is as follows:

Expression: (\d+(\.\d+)?)
This regular expression pulls the numeric values (like “10” or “7.5”) out of the professional_grade column.
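A quick sketch of the same pattern in Python shows how it behaves on the grade formats mentioned earlier (the sample strings are illustrative):

```python
import re

# The same expression used in the Prepare recipe's extraction step
GRADE_PATTERN = re.compile(r"(\d+(\.\d+)?)")

for grade in ["PSA 8", "SGC 8.5", "CSG3"]:
    match = GRADE_PATTERN.search(grade)
    print(grade, "->", match.group(1))
# PSA 8 -> 8
# SGC 8.5 -> 8.5
# CSG3 -> 3
```

The optional (\.\d+)? group is what lets the pattern capture half-point grades like 8.5 as well as whole numbers.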

After running this relatively simple Prepare recipe, we already have a very usable dataset for analysis - a far easier process than the alternative methods. The ability to perform this analysis so quickly is possible thanks to the parsing “intelligence” of the language model we employed - and the intuitive nature of development in Dataiku. At this point, we can use the time we saved in development to find some interesting insights in the output set.

Let’s take a look at a couple of questions along the lines of those we initially proposed and create some corresponding visualizations using Dataiku Charts. Our first question asks: What was the average sale price by card grade?

Baseball Card Ebay Sales in Dataiku 8

Here we can see that, generally, the price of cards in our dataset increases exponentially with the grade, which shouldn’t be much of a surprise - a collectible in perfect condition typically garners a premium price.
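The aggregation behind this kind of chart is a simple group-by average. A minimal pandas sketch, using hypothetical sale records and illustrative prices:

```python
import pandas as pd

# Hypothetical parsed sale records (grades and prices are illustrative)
sales = pd.DataFrame({
    "grade": [8, 8, 9, 9, 10, 10],
    "price": [12.0, 14.0, 30.0, 34.0, 180.0, 220.0],
})

# Average sale price by card grade, as in the chart
avg_price_by_grade = sales.groupby("grade")["price"].mean()
print(avg_price_by_grade)
```

Even on this toy data, the jump from grade 9 to grade 10 dwarfs the step from 8 to 9, mirroring the exponential shape of the chart.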

Let’s dig a bit deeper and answer the question: What was the average sale price by grade per player in the dataset?

Baseball Card Ebay Sales in Dataiku 8

In this visualization, we can observe that certain players fetch a premium price relative to others. For example, across all brands, we see that 1987 PSA graded 10 Mark McGwire cards average over a $200 sale price, whereas his former teammate Jose Canseco has an average grade 10 sale of $29 (in this limited dataset).

And finally, let’s try to describe the relationship between price and card grade in terms of an exponential regression curve. For this analysis, we narrow our focus to cards of one of the most popular players of the era, Barry Bonds, and further limit the scope to the most common Topps, Donruss, and Fleer card brands. After applying this filter to a scatter plot in Dataiku and choosing the Exponential regression type, we can quickly obtain the formula y=0.00012e^(1.43x), which describes the exponential rise in price for this card as the grade increases from 7 to 10. This formula gives us a powerful estimate of how the grade of the card affects the sale price.
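An exponential fit like this can be reproduced with a log-linear least-squares regression: fitting ln(y) = ln(a) + b·x recovers the parameters of y = a·e^(b·x). The sketch below uses synthetic points generated from the article’s fitted curve, purely for illustration:

```python
import numpy as np

# Synthetic (grade, price) points generated from the fitted curve
# y = 0.00012 * e^(1.43 * x), for illustration only
a, b = 0.00012, 1.43
grades = np.array([7.0, 8.0, 9.0, 10.0])
prices = a * np.exp(b * grades)

# Fitting a line to ln(y) recovers the exponential parameters
slope, intercept = np.polyfit(grades, np.log(prices), 1)
print(round(slope, 2), round(np.exp(intercept), 5))  # recovers ~1.43 and ~0.00012
```

Evaluating the curve at x = 10 gives roughly $195, which is how a formula like this turns a grade into a price estimate.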

Baseball Card Ebay Sales in Dataiku 9

It may also be possible to generalize this formula, allowing us to estimate the values of other players’ graded cards within these 1987 sets, even if we’ve never seen examples of previous sales.

Another Home Run for LLMs

We’ve explored one of the numerous use cases for large language models, and there are countless other challenges where they can also be applied. In this case, the LLM’s parsing of eBay auction titles made it easy to turn unstructured, unpredictable phrases into structured JSON objects that can be easily processed and interpreted. Dataiku’s Prompt Studio, along with its preparation and analysis features, makes this development easy and intuitive, providing a terrific opportunity to solve many similar challenges!

Want to learn more? Dig deeper into LLMs and Natural Language Querying in Dataiku in this blog


Ryan Moore

With over 20 years of experience as a Lead Software Architect, Principal Data Scientist, and technical author, Ryan Moore is the Head of Delivery and Solutions at Snow Fox Data and our resident Dataiku Neuron. He provides Data Science architecture and implementation consultation to organizations around the globe.
