KDD stands for Knowledge Discovery in Databases. It is all about how we derive useful information, aka knowledge, from raw data. The major steps involved in KDD are as follows, in order.
| Selection | Target Data | Preprocess | Transform | Mining | Evaluation | Present |
|---|---|---|---|---|---|---|
Let us take a problem statement and understand what exactly we mean by each of these steps and how we can arrive at knowledge from the given information.
Problem Statement: Identify Fraud in Credit Card Usage
Given Data: The given data is broadly classified into three categories: Demographical, Customer Behavior, and Social Media Data, as follows.
Demographical:
| Name | Age | Location | Address | Phone | Driving License |
|---|---|---|---|---|---|
Customer Behavior:
| CC Location | Amount Spent | Selection | Time of Usage | Date of Usage | Items Purchased | Quantity |
|---|---|---|---|---|---|---|
Social Media Data:
| CC Email | Integration | Web Posts | Sentiment | Phone | Location of Post |
|---|---|---|---|---|---|
KDD Process with the above Data
Goal: The first step is to understand the goal of the given problem statement. For this use case, the goal is to identify fraudulent transactions within the given set of credit card transactions.
Target Data: In this phase, we identify the data that can help us achieve the goal. Any given problem may come with several data sets, and choosing the wrong ones can lead to biased or incorrect predictions. The questions to ask here are: Will the Social Media data set help? Will the Customer Behavior data help? Or should we combine the data sets?
Identifying the target data set doesn’t mean that everything is in place. Our focus should now be on the data for the features we want to use, which is where the next step comes in.
Cleaning and Preprocessing: In this phase, we clean the data as per the requirements and identify missing fields. We can also remove outliers or other noise in the data, and fill in data that is missing. For example, in this case we can derive the pin code from the location, or populate it from a given latitude and longitude. Each feature should satisfy the following conditions.
| Features | Conditions |
|---|---|
| Name | Alphanumeric, Not Empty |
| Age | Numeric |
| Location | Numeric |
| Address | Alphanumeric, Not Empty |
| CC Location | Alphanumeric, Not Empty |
| Amount Spent | Numeric |
| Time Of Usage | In HH:MM:SS Format, Not Empty |
| Date Of Usage | In MM:DD:YYYY Format, Not Empty |
| Items Purchased | Numeric |
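The conditions in the table above can be checked mechanically. The following is a minimal sketch, assuming each transaction arrives as a dictionary of strings; the field names (`cc_location`, `amount_spent`, and so on), the allowed punctuation in "alphanumeric" fields, and the sample values are assumptions made for illustration, not part of the original data set.

```python
import math
import re
from datetime import datetime

# Assumption: "alphanumeric" fields may also contain spaces and basic punctuation.
ALNUM = re.compile(r"^[A-Za-z0-9 ,.\-]+$")

def is_alphanumeric_nonempty(value):
    """True for a non-blank string made of letters, digits, spaces, and , . -"""
    return isinstance(value, str) and bool(value.strip()) and bool(ALNUM.match(value))

def is_numeric(value):
    """True if the value can be read as a finite number."""
    try:
        return math.isfinite(float(value))
    except (TypeError, ValueError):
        return False

def matches_format(value, fmt):
    """True for a non-empty string that parses with the given strptime format."""
    if not isinstance(value, str) or not value:
        return False
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

# One rule per feature, mirroring the conditions table (field names are hypothetical).
RULES = {
    "name": is_alphanumeric_nonempty,
    "age": is_numeric,
    "location": is_numeric,
    "address": is_alphanumeric_nonempty,
    "cc_location": is_alphanumeric_nonempty,
    "amount_spent": is_numeric,
    "time_of_usage": lambda v: matches_format(v, "%H:%M:%S"),
    "date_of_usage": lambda v: matches_format(v, "%m:%d:%Y"),
    "items_purchased": is_numeric,
}

def invalid_fields(record):
    """Return the sorted names of fields that are missing or break their rule."""
    return sorted(name for name, check in RULES.items() if not check(record.get(name)))

# Made-up sample record, for illustration only.
sample = {
    "name": "Asha Rao", "age": "34", "location": "560001",
    "address": "12 MG Road, Bengaluru", "cc_location": "Bengaluru",
    "amount_spent": "1250.50", "time_of_usage": "14:05:00",
    "date_of_usage": "03:15:2021", "items_purchased": "3",
}
print(invalid_fields(sample))  # prints []
```

Records with a non-empty result can be dropped, corrected, or sent for manual review, depending on how strict the cleaning step needs to be.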
Transformation: In this phase, we narrow the data down to the columns that can help us achieve the goal. We can look at the data samples provided to arrive at this decision. I would therefore consider the following columns from the above data set, which make sense for this problem.
| Name | Age | Location | Address | CC Location | Amount Spent | Time of Usage | Date of Usage | Items Purchased |
|---|---|---|---|---|---|---|---|---|
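Selecting the columns listed above amounts to a simple projection over each record. The sketch below assumes records are dictionaries and uses hypothetical snake_case keys for the nine columns; the extra `sentiment` and `phone` fields in the example are made up to show what gets dropped.

```python
# Hypothetical keys for the nine columns chosen in the transformation step.
SELECTED_COLUMNS = [
    "name", "age", "location",
    "address", "cc_location", "amount_spent",
    "time_of_usage", "date_of_usage", "items_purchased",
]

def project(record, columns=SELECTED_COLUMNS):
    """Keep only the chosen columns; columns absent from the record become None."""
    return {col: record.get(col) for col in columns}

# Made-up raw record that also carries fields we decided not to use.
raw = {
    "name": "Asha Rao", "age": 34, "location": 560001,
    "address": "12 MG Road", "cc_location": "Bengaluru",
    "amount_spent": 1250.5, "time_of_usage": "14:05:00",
    "date_of_usage": "03:15:2021", "items_purchased": 3,
    "sentiment": "neutral", "phone": "0000000000",
}
transformed = project(raw)
print(list(transformed))
```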
Data Mining: In this phase, we decide what kind of task the KDD process should perform: classification, regression, or clustering. For the given problem statement, the goal is classification: deciding whether a transaction is fraudulent or not.
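To make the classification framing concrete, here is a minimal sketch of a nearest-centroid classifier in plain Python. It is one possible technique chosen for brevity, not a method prescribed by this walkthrough. The two features (amount in thousands, and a 0/1 flag for a location not seen before) and the labelled training rows are made up for illustration.

```python
from math import dist  # available in Python 3.8+

def fit_centroids(rows, labels):
    """Average the feature vectors of each class to get one centroid per label."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, x in enumerate(row):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, row):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], row))

# Made-up training data: [amount in thousands, new-location flag].
train_rows = [[0.5, 0], [1.2, 0], [0.8, 0], [9.0, 1], [7.5, 1]]
train_labels = ["legit", "legit", "legit", "fraud", "fraud"]

centroids = fit_centroids(train_rows, train_labels)
print(predict(centroids, [8.0, 1]))  # prints fraud
print(predict(centroids, [1.0, 0]))  # prints legit
```

In practice the features would need scaling so that no single column dominates the distance, and a real fraud data set is usually heavily imbalanced, which this toy example ignores.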
Evaluation: In this phase, we can try different algorithms to discover patterns, decide which models and features make sense for the whole KDD process, and then interpret the knowledge from the mined patterns. For the given problem statement, we can look for patterns such as transactions with unusually large amounts or transactions in locations the customer has never visited.
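The two patterns mentioned above can be expressed as simple checks against a customer's past transactions. This is a rough sketch under stated assumptions: the z-score threshold of 3, the field names `amount_spent` and `cc_location`, and the sample history are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev

def flag_patterns(history, txn, z_threshold=3.0):
    """Return the suspicious patterns that txn shows relative to the customer's history."""
    flags = []

    # Pattern 1: amount far above what this customer usually spends.
    amounts = [t["amount_spent"] for t in history]
    if len(amounts) >= 2:
        mu, sigma = mean(amounts), pstdev(amounts)
        if sigma > 0 and (txn["amount_spent"] - mu) / sigma > z_threshold:
            flags.append("unusually_large_amount")

    # Pattern 2: card used in a location that never appears in the history.
    known_locations = {t["cc_location"] for t in history}
    if known_locations and txn["cc_location"] not in known_locations:
        flags.append("new_location")

    return flags

# Made-up history for one customer.
history = [
    {"amount_spent": 40, "cc_location": "Pune"},
    {"amount_spent": 55, "cc_location": "Pune"},
    {"amount_spent": 60, "cc_location": "Mumbai"},
    {"amount_spent": 45, "cc_location": "Pune"},
]
print(flag_patterns(history, {"amount_spent": 900, "cc_location": "Lagos"}))
print(flag_patterns(history, {"amount_spent": 52, "cc_location": "Pune"}))
```

Flags like these can be interpreted directly by an analyst or fed back in as features for the classification model.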
Knowledge: Here we consolidate the knowledge discovered in all the steps above. Knowing whether a given transaction is fraudulent can help management make better decisions, and the same knowledge can also be used by other systems.