KDD – Knowledge Discovery in Databases

KDD stands for Knowledge Discovery in Databases. It is all about how we derive useful information, also known as knowledge, from raw data. The major steps involved in KDD, in order, are as follows.

Selection
Target Data
Preprocess
Transform
Mining
Evaluation
Present

Let us take a problem statement and understand what exactly each of these steps means and how we can arrive at knowledge from the given information.

Problem Statement: (Identify Fraud in Credit Card Usage)

Given Data: The given data is broadly classified into three categories: Demographical, Customer Behavior, and Social Media Data, as follows.

Demographical:

Name
Age
Location
Address
Phone
Email
Driving License

Customer Behavior:

CC Location
Amount Spent
Selection
Time of Usage
Date of Usage
Items Purchased
Quantity

Social Media Data:

CC Email
Integration
Web Posts
Sentiment
Phone
Location of Post

KDD Process with the above Data

Goal: This is the first step. We need to understand the problem statement: the goal of the given use case is to identify fraudulent transactions within the given set of credit card transactions.

Target Data: In this phase, we need to identify the right data, that is, the data that can help us achieve the goal. For any given problem we may have several data sets available, and if we pick the wrong one we may end up with biased or incorrect predictions. The kinds of questions to ask in this phase are: Will the Social Media data set help me? Will Customer Behavior help me? Or should we combine the data sets?

Identifying the target data set doesn’t mean that we have everything in place. Our focus should now be on the data for the features we want to use, which is where the next step comes in.

Cleaning and Preprocessing: In this phase we need to clean the data as per the requirements and identify missing fields. We can also consider removing outliers or other noise in the data, and we can fill in data that is missing. For example, in this case we can derive the pin code from the location. Given latitude and longitude, we can also populate the pin code.

Feature | Condition
Name | Alphanumeric, Not Empty
Age | Numeric
Location | Numeric
Address | Alphanumeric, Not Empty
CC Location | Alphanumeric, Not Empty
Amount Spent | Numeric
Time of Usage | In HH:MM:SS Format, Not Empty
Date of Usage | In MM:DD:YYYY Format, Not Empty
Items Purchased | Numeric
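The conditions above can be checked with a few lines of code. The following is a minimal sketch, not a definitive implementation. It assumes each record arrives as a dictionary of raw strings keyed by feature name, and that "alphanumeric" means letters, digits, and spaces. The helper names (is_alnum_nonempty, is_numeric, matches_format, invalid_fields) are made up for this illustration.

```python
import re
from datetime import datetime

# Assumption: "alphanumeric" means letters, digits, and spaces only.
ALNUM = re.compile(r"[A-Za-z0-9 ]+")

def is_alnum_nonempty(value):
    """True for a non-blank string of letters, digits, and spaces."""
    return (isinstance(value, str) and value.strip() != ""
            and ALNUM.fullmatch(value) is not None)

def is_numeric(value):
    """True if the value can be parsed as a number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def matches_format(fmt):
    """Build a check that accepts strings parseable with the given format."""
    def check(value):
        try:
            datetime.strptime(value, fmt)
            return True
        except (TypeError, ValueError):
            return False
    return check

# One rule per feature, mirroring the conditions listed above.
RULES = {
    "Name": is_alnum_nonempty,
    "Age": is_numeric,
    "Location": is_numeric,
    "Address": is_alnum_nonempty,
    "CC Location": is_alnum_nonempty,
    "Amount Spent": is_numeric,
    "Time of Usage": matches_format("%H:%M:%S"),
    "Date of Usage": matches_format("%m:%d:%Y"),
    "Items Purchased": is_numeric,
}

def invalid_fields(record):
    """Return the names of features that are missing or break their rule."""
    return [name for name, check in RULES.items() if not check(record.get(name))]
```

Records for which invalid_fields returns a non-empty list can then be corrected, filled in, or dropped before moving on to the next phase.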

Transformation: In this phase, we narrow the data down to the features that can help us achieve the goal. We can look at the data samples provided to arrive at that decision. I would therefore consider the following columns from the above data set, since they make sense for the problem at hand.

Name
Age
Location
Address
CC Location
Amount Spent
Time of Usage
Date of Usage
Items Purchased
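Keeping only the chosen columns can be sketched as a simple projection. This assumes each record is a dictionary keyed by feature name. SELECTED_COLUMNS and select_columns are names invented for this example.

```python
# The columns chosen above for the fraud problem.
SELECTED_COLUMNS = [
    "Name", "Age", "Location",
    "Address", "CC Location", "Amount Spent",
    "Time of Usage", "Date of Usage", "Items Purchased",
]

def select_columns(record, columns=SELECTED_COLUMNS):
    """Return a copy of the record holding only the selected columns."""
    return {name: record[name] for name in columns if name in record}
```

Fields such as Phone or Email are simply dropped, so later phases only see the features we decided to use.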

Data Mining: In this phase we need to decide on the type of mining task: do we want classification, regression, or clustering? For the given problem statement, my goal is classification, that is, deciding whether a transaction is fraud or not.
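To make the idea of classification concrete, here is a deliberately tiny sketch: a one-feature threshold classifier learned from labeled examples. It is an illustration only, not the method a real fraud system would use. It assumes we already have past transactions labeled as fraud (1) or not fraud (0), and the function names are hypothetical.

```python
def fit_threshold(amounts, labels):
    """Pick the amount threshold that misclassifies the fewest training rows.

    The prediction rule is: fraud if amount > threshold.
    """
    best_threshold, best_errors = None, len(labels) + 1
    for candidate in sorted(set(amounts)):
        # Count rows where the rule disagrees with the known label.
        errors = sum((amount > candidate) != bool(label)
                     for amount, label in zip(amounts, labels))
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold

def predict(amount, threshold):
    """Classify a single transaction amount as fraud (True) or not (False)."""
    return amount > threshold
```

A real model would use many features (location, time, items purchased) rather than the amount alone, but the shape is the same: learn a rule from labeled transactions, then apply it to new ones.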

Evaluation: In this phase we can try different algorithms to discover patterns. Here we decide which models and features make sense for the whole KDD process and then interpret the knowledge from the mined patterns. For the given problem statement, we may see patterns such as transactions for unusually large amounts, or transactions in locations the customer has never visited.
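The two example patterns can be expressed as simple checks against a customer's past transactions. This is a rough sketch under stated assumptions: transactions are dictionaries with a numeric "Amount Spent" and a "CC Location", "unusually large" means more than amount_factor times the customer's average, and both the function name and the default factor of 3.0 are arbitrary choices for illustration.

```python
def suspicious_patterns(transaction, history, amount_factor=3.0):
    """Return the reasons a transaction looks suspicious given past activity."""
    reasons = []
    if not history:
        # Without past transactions there is nothing to compare against.
        return reasons

    past_amounts = [past["Amount Spent"] for past in history]
    average = sum(past_amounts) / len(past_amounts)
    if transaction["Amount Spent"] > amount_factor * average:
        reasons.append("unusually large amount")

    known_locations = {past["CC Location"] for past in history}
    if transaction["CC Location"] not in known_locations:
        reasons.append("location never visited")

    return reasons
```

Checks like these are useful for interpreting what a mined model has picked up, since each flagged reason maps back to a pattern a person can understand.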

Knowledge: Here we consolidate the knowledge discovered in all the above steps. Knowing whether a given transaction is fraud or not can help management make better decisions. The same knowledge can also be reused by other systems.
