A Data Scientist’s Guide to Predicting Customer Behavior: From Raw Data to Real Insights: Part 1
How a simple dataset about insurance customers revealed flawed data, surprising trends, and the hidden story of who is most likely to buy.
In the world of data science, we’re often portrayed as digital wizards, conjuring predictions from complex algorithms. But the reality is far more like being a detective. We’re given a case file — a raw dataset — and our job is to dust for prints, interrogate the witnesses (the variables), and piece together a story that solves the mystery.
Recently, I was handed just such a case: a dataset from an insurance company. The mystery was to figure out which of their existing customers were most likely to purchase a new insurance product.
The company’s goal was simple: they wanted to target their marketing efforts more effectively. Instead of casting a wide, expensive net, they wanted to focus on the customers who were already raising their hands, even if they didn’t know it yet.
Join me on this journey as we walk through the real-life process of a data science project — from the messy first contact with raw data to the “aha!” moments that shape a successful predictive model. This isn’t about complex math; it’s about logic, curiosity, and the art of asking the right questions.
Chapter 1: First Contact and Cleaning Up the Crime Scene
Every detective knows you don’t just rush into a crime scene. You first have to secure it and clean it up. In data science, this is the crucial first step of Data Loading and Cleaning. Our initial dataset had over 14,000 records and 15 features, including customer age, loyalty level, city, and spending on other products.

Our first discovery in this case? The case file was a mess.
We immediately found 3,008 duplicate rows — more than 21% of the entire dataset! Imagine a detective interviewing the same person 3,000 times and treating each interview as new testimony. It would completely bias the investigation. Removing these was non-negotiable.
Next, we standardized our key variable — the TARGET column, which told us if a customer bought the new product (‘Y’). Computers prefer numbers, so we converted this to a simple 1 for “Yes” and 0 for “No.”
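In pandas, both fixes take only a few lines. Here is a minimal sketch, assuming the data lives in a CSV (the file name is a stand-in) and the raw TARGET values are ‘Y’/‘N’:

```python
import pandas as pd

# Load the raw case file (the file name here is an assumption).
df = pd.read_csv("insurance_customers.csv")

# Drop exact duplicate rows -- the 3,008 repeat "testimonies".
before = len(df)
df = df.drop_duplicates()
print(f"Removed {before - len(df)} duplicate rows")

# Standardize the target: 'Y' becomes 1, anything else 0.
df["TARGET"] = (df["TARGET"] == "Y").astype(int)
```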
With a clean, non-redundant dataset, our real investigation could begin.
Chapter 2: The Interrogation (Exploratory Data Analysis)
This is where the magic happens. Exploratory Data Analysis (EDA) is the art of getting to know your data. We put every single variable under the microscope, asking one simple question: “What’s your story?”
Data Integrity: Checking the Alibis
Before diving deep, we did a full background check. We found a few characters who weren’t who they appeared to be (the quick sweep that flagged all three is sketched after this list):
- The Useless Variable: The `contract` column had the same value for every single customer. A clue that’s the same for everyone is no clue at all. Useless.
- The Impostor: The `partner_age` column was a perfect mirror of the `customer_age` column. It was a redundant variable in disguise, offering no new information.
- The Secret Code: This was our first big break. The `city` column had a bizarre minimum value of `-999999`. This wasn’t a real city code; it was a “sentinel value” — a secret code left by a database to say, “I have no idea what the real value is.” This told us our data wasn’t as clean as we thought and would need careful handling.
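Here is the kind of background check that exposes all three suspects — a sketch, continuing with the cleaned DataFrame `df` from Chapter 1:

```python
# The Useless Variable: a column with one unique value carries no signal.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
print("Constant columns:", constant_cols)  # e.g. ['contract']

# The Impostor: does `partner_age` perfectly mirror `customer_age`?
print((df["partner_age"] == df["customer_age"]).all())

# The Secret Code: describe() makes sentinels like -999999 pop out.
print(df["city"].describe())
```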
Putting Variables Under the Microscope
We then looked at each variable one by one.
Customer Age: The distribution was heavily skewed. We had a huge number of customers in their 20s and early 30s, with a long, dwindling tail of older customers. This company’s core demographic was young.

Customer Spending (Turnover): The spending on existing products (`turnover_A` and `turnover_B`) was also extremely skewed. Most people spent a modest, consistent amount, but a handful of “whale” customers spent astronomical sums.
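Both skews jump out of plain histograms. A quick sketch, again using the assumed `df`:

```python
import matplotlib.pyplot as plt

# Side-by-side histograms make the skew obvious at a glance.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["customer_age", "turnover_A", "turnover_B"]):
    df[col].hist(ax=ax, bins=50)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```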


Why does this matter? Because many predictive models are sensitive to scale. These few high-rollers could completely distort the model’s view, making it think spending is the only thing that matters. We knew right away we’d have to transform this data (e.g., using a log scale) to keep our model from getting star-struck by the big numbers.
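A log transform is one standard remedy, and `np.log1p` is a convenient variant because zero spending maps cleanly to zero. The `_log` column names here are my own:

```python
import numpy as np

# log1p(x) = log(1 + x), so customers with zero turnover stay at zero.
for col in ["turnover_A", "turnover_B"]:
    df[f"{col}_log"] = np.log1p(df[col])
```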
Loyalty Level: This was perhaps the most surprising witness. The `loyalty_level` was graded from 0 to 3. But nearly half the customers were labeled `99`. Was this a super-loyalty tier? No. Our analysis revealed it was another code, likely for “unclassified.” This wasn’t a measure of loyalty but a bucket for customers whose status was unknown. This group would turn out to be incredibly important.
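A single `value_counts` call exposes the 99 bucket, and casting the column to a categorical type is one reasonable way (not the only one) to stop a model from reading 99 as a number:

```python
# Roughly half the customers sit in the 99 bucket.
print(df["loyalty_level"].value_counts(normalize=True))

# Treat 99 as its own "unclassified" category, not a super-loyalty score --
# otherwise a model would read it as 33x more loyal than level 3.
df["loyalty_level"] = df["loyalty_level"].astype("category")
```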

Chapter 3: Connecting the Dots — Finding Relationships
Once we understood each variable, it was time to see how they related to our central mystery: Who bought the new product?
This is where the story truly came to life, and a few of our assumptions were turned on their heads.
The Loyalty Paradox

We compared each `loyalty_level` to its purchase rate (the one-line group-by behind these numbers follows the findings).
- The customers in the “unclassified” (99) group were the most likely to buy the new product, with a 38% conversion rate!
- For the other customers, loyalty had an inverse relationship with purchasing. The most “loyal” customers (level 3) were the least likely to buy.
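The group-by behind those numbers is a one-liner. Because TARGET is coded 0/1, the mean of TARGET within each group is the conversion rate:

```python
# Purchase rate per loyalty level, highest first.
print(
    df.groupby("loyalty_level", observed=True)["TARGET"]
    .mean()
    .sort_values(ascending=False)
)
```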
Business Insight: The company’s most loyal customers were already content. The real opportunity lay within the large, unclassified group — a segment the company didn’t even have a label for.
Product A
Next, we looked at whether owning another product, “Product A,” affected the decision. The result was just as shocking.
Customers who had NOT bought Product A were far more likely to buy the new product (a 43% conversion rate) than customers who already owned it (a mere 16% conversion rate).
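The same group-by trick surfaces this split, along with the size of each group:

```python
# Conversion rate and group size, split by Product A ownership.
print(df.groupby("product_a_bought")["TARGET"].agg(["mean", "count"]))
# Roughly: 0 (no Product A) -> ~0.43, 1 (owns it) -> ~0.16
```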
Business Insight: The new product wasn’t a companion to Product A; it was a substitute. The marketing team was likely targeting the wrong people by focusing on existing Product A owners. They should have been targeting the customers who didn’t have it.
The Final Flaw: A Smoking Gun in the Data
Our last check was a simple logic test. We compared the spending data (turnover) with the product ownership flag. And we found our most critical data flaw: customers who had a 0 for `product_a_bought` still had non-zero spending recorded for that product.
This is logically impossible. It was the equivalent of a detective finding monthly car payments from a suspect who swears he never bought a car. This systemic error meant our data was fundamentally corrupted and required a major cleanup before any modeling could be trusted.
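The test itself is one boolean filter (assuming `turnover_A` records spending on Product A):

```python
# Rows that claim "never bought Product A" yet record Product A spending.
mask = (df["product_a_bought"] == 0) & (df["turnover_A"] > 0)
print(f"{mask.sum()} logically impossible rows")
```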
The Story So Far: What the Data Told Us
After hours of interrogation, our data had told us a rich and complex story. It wasn’t just a spreadsheet of numbers; it was a portrait of a business.
- The customers are young, and their spending habits are highly skewed.
- The data had serious flaws that needed correcting before any conclusions could be drawn.
- The most valuable customers for this new product were not who the company thought they were. They were the “unclassifieds” and those who hadn’t bought into their other flagship products.
This entire process of EDA is the unglamorous, indispensable heart of data science. Before a single fancy algorithm is run, the real work is done in understanding the story hidden within the data.
Our next step, of course, would be to take these insights, clean the data based on our findings, and use these powerful, newly understood features to build our KNN and SVM prediction models. But without this detective work, we would have been building a house on a foundation of sand. And in the world of data, that’s a crime you can’t afford to commit.
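As a preview of Part 2, here is a minimal scikit-learn sketch of that modeling step. The feature subset is purely illustrative (and assumes the log-transformed turnover columns from Chapter 2), not the final feature set:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# KNN and SVM both rely on distances, so scaling is essential --
# exactly why the skewed turnovers had to be tamed first.
X = df[["customer_age", "turnover_A_log", "turnover_B_log"]]  # illustrative subset
y = df["TARGET"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

for model in (KNeighborsClassifier(), SVC()):
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    print(type(model).__name__, pipe.score(X_test, y_test))
```

Both models sit behind a `StandardScaler` inside a pipeline precisely because, as the EDA showed, the raw turnovers would otherwise dominate every distance calculation.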