KDD – Knowledge Discovery in Databases – My Experiences with Microservices / Cloud Services / Machine Learning / CI CD

KDD stands for Knowledge Discovery in Databases. It is all about how we derive useful information, aka knowledge, from raw data. The major steps involved in KDD are as follows, in this order.

Selection	Target Data	Preprocess	Transform	Mining	Evaluation	Present

Let us take a problem statement and understand what exactly we mean by each of these steps and how we can arrive at knowledge from the given information.

Problem Statement: (Identify Fraud in Credit Card Usage)

Given Data: The given data is broadly classified into three categories: Demographical, Customer Behavior, and Social Media Data, as follows.

Demographical:

Name	Age	Location	Address	Phone	Email	Driving License

Customer Behavior:

CC Location	Amount Spent	Selection	Time of Usage	Date of Usage	Items Purchased	Quantity

Social Media Data:

CC Email	Integration	Web Posts	Sentiment	Phone	Location of Post

KDD Process with the above Data

Goal: This is the first step. We need to understand the given problem statement. The goal of this use case is to identify fraudulent transactions within the given set of credit card transactions.

Target Data: In this phase, we need to identify the data that can help us achieve the goal. For any given problem we may have several data sets, but we need to pick the right one; otherwise we may end up with biased or incorrect predictions. The kinds of questions to ask here are: Will the Social Media data set help me? Will Customer Behavior help me? Or should we combine the data sets? These are the questions we need to think about in this phase.

Identifying the target data set doesn’t mean that we have everything in place. Our focus should now be on the data for the features we want to use, which is where the next step comes in.

Cleaning and Preprocessing: In this phase we need to clean the data as per the requirements and identify any missing fields. We can also consider removing outliers or noise in the data, and we can fill in data that is missing. For example, in this case we can derive the pin code from the location; given latitude and longitude, we can also populate the pin code.

Features	Conditions
Name	Alphanumeric, Not Empty
Age	Numeric
Location	Numeric
Address	Alphanumeric, Not Empty
CC Location	Alphanumeric, Not Empty
Amount Spent	Numeric
Time Of Usage	In HH:MM:SS Format, Not Empty
Date Of Usage	In MM:DD:YYYY Format, Not Empty
Items Purchased	Numeric
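
The conditions in the table above can be checked with a few small functions. This is only a minimal sketch: the snake_case field names, the helper names, and the set of punctuation allowed under "Alphanumeric" are my own assumptions, not part of the original data set.

```python
import re
from datetime import datetime

def is_numeric(value):
    """True if the value parses as a number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def is_alphanumeric_not_empty(value):
    """True for non-empty text made of letters, digits, spaces and
    a little address punctuation (the punctuation set is an assumption)."""
    return isinstance(value, str) and bool(
        re.fullmatch(r"[A-Za-z0-9 ,.\-/#]+", value.strip())
    )

def matches_format(value, fmt):
    """True if the value parses with the given strptime format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        return False

# One check per feature, mirroring the Features / Conditions table.
# The keys are hypothetical column names.
RULES = {
    "name": is_alphanumeric_not_empty,
    "age": is_numeric,
    "location": is_numeric,
    "address": is_alphanumeric_not_empty,
    "cc_location": is_alphanumeric_not_empty,
    "amount_spent": is_numeric,
    "time_of_usage": lambda v: matches_format(v, "%H:%M:%S"),
    "date_of_usage": lambda v: matches_format(v, "%m:%d:%Y"),
    "items_purchased": is_numeric,
}

def validate(record):
    """Return the fields of a record that violate their condition."""
    return [field for field, check in RULES.items()
            if not check(record.get(field))]
```

Calling `validate(record)` on a dictionary of raw string values returns an empty list for a clean record, and otherwise the names of the fields that need cleaning or filling in.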

Transformation: In this phase, we narrow the data down to the columns we are interested in, the ones that can help us achieve the goal. We can look at the data samples provided to arrive at this conclusion. Therefore, I would like to consider the following columns from the above data set, which make sense for the problem at hand.

Name	Age	Location
Address	CC Location	Amount Spent
Time of Usage	Date of Usage	Items Purchased
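
The column selection above can be sketched as a small transformation function. The field names are again hypothetical, and merging the date and time into a single `used_at` timestamp is my own assumption about a convenient representation, not something the data set prescribes.

```python
from datetime import datetime

# The nine columns chosen above, under assumed snake_case names.
SELECTED = [
    "name", "age", "location", "address", "cc_location",
    "amount_spent", "time_of_usage", "date_of_usage", "items_purchased",
]

def transform(record):
    """Keep only the selected columns and convert strings to typed values."""
    row = {col: record[col] for col in SELECTED}
    row["age"] = int(row["age"])
    row["amount_spent"] = float(row["amount_spent"])
    row["items_purchased"] = int(row["items_purchased"])
    # Assumption: one combined timestamp is easier to work with later
    # than separate date and time strings.
    row["used_at"] = datetime.strptime(
        row["date_of_usage"] + " " + row["time_of_usage"],
        "%m:%d:%Y %H:%M:%S",
    )
    return row
```

Columns that were not selected, such as phone or email, are simply dropped by the dictionary comprehension.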

Data Mining: In this phase we need to decide what the goal of KDD is: do we want to do classification, regression, or clustering? As per the given problem statement, my goal is classification, that is, deciding whether a transaction is fraud or not.
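
To make the idea of classification concrete, here is a minimal sketch of a nearest-centroid classifier in plain Python. The choice of algorithm, the two features, and the six training rows are all made up for illustration; a real fraud model would need real labelled transactions and most likely a proper machine learning library.

```python
def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit(features, labels):
    """Compute one centroid per class label."""
    by_label = {}
    for x, y in zip(features, labels):
        by_label.setdefault(y, []).append(x)
    return {y: centroid(rows) for y, rows in by_label.items()}

def predict(model, x):
    """Return the label whose centroid is closest to x."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(model, key=lambda y: dist2(model[y], x))

# Toy, hand-written data (not real transactions). Each row is
# [amount spent in thousands, 1 if CC location differs from home else 0].
X = [[0.05, 0], [0.12, 0], [0.08, 0], [4.0, 1], [5.5, 1], [3.2, 1]]
y = ["legit", "legit", "legit", "fraud", "fraud", "fraud"]

model = fit(X, y)
```

With this toy model, `predict(model, [4.5, 1])` lands nearest the "fraud" centroid and `predict(model, [0.1, 0])` nearest the "legit" one.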

Evaluation: This is the evaluation phase, where we can try different algorithms to discover patterns. Here we can decide which models and features make sense for the whole KDD process and then interpret the knowledge from the mined patterns. For the given problem statement, we can see patterns such as transactions for unusually large amounts, or transactions in locations the customer has never visited.
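
The two patterns mentioned above can be written down as simple rules, and any set of predictions can then be scored against known labels. This is a hedged sketch: the function names, the field names, and the idea of a fixed amount threshold are assumptions for illustration, and precision and recall are just one common way to evaluate a fraud classifier.

```python
def flag_transaction(txn, visited_locations, amount_threshold):
    """Return the reasons, if any, why a transaction looks suspicious."""
    reasons = []
    # Pattern 1: unusually large amount (threshold is an assumed parameter).
    if txn["amount_spent"] > amount_threshold:
        reasons.append("large_amount")
    # Pattern 2: location the customer has never visited.
    if txn["cc_location"] not in visited_locations:
        reasons.append("unseen_location")
    return reasons

def precision_recall(predicted, actual):
    """Precision and recall for boolean fraud predictions vs. true labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A transaction with an empty reason list passes both rules. Comparing precision and recall across different thresholds or models is one way to decide which combination makes sense for the whole process.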

Knowledge: We consolidate the knowledge discovered in all the above steps. Knowing whether a given transaction is fraud or not can help management make better decisions, and the same knowledge can also be reused by other systems.
