
1. What is data mining?

Per Sharda et al. (2014), the field of data mining draws on statistics, artificial intelligence, and machine learning to create tools that facilitate the discovery of meaningful patterns in large BI datasets (see notes on BI systems). Adapted from Sharda et al. (2014, p. 157), the three general patterns sought are:

Data Mining Task | Learning Method | Popular Algorithms
Prediction (forecasting Y based on Xs) | Supervised | Classification and regression trees, ANN, SVM, genetic algorithms
  Classification | Supervised | Decision trees, ANN/MLP, SVM, rough sets, genetic algorithms
  Regression | Supervised | Linear/nonlinear regression, regression trees, ANN/MLP, SVM
Association (relationship between X and Y) | Unsupervised | Apriori, OneR, ZeroR, Eclat
  Link analysis | Unsupervised | Expectation maximization, Apriori algorithm, graph-based matching
  Sequence analysis | Unsupervised | Apriori algorithm, FP-growth technique
Clustering (breaking X into logical groups) | Unsupervised | K-means, ANN/SOM
  Outlier analysis | Unsupervised | K-means, expectation maximization (EM)

See notes on statistics and notes on machine learning.

1.1. Business applications of data mining

1.1.1. Applications by technique

Per Sharda et al. (2014, pp. 172, 181, 183):

Clustering
  • "identify a classification scheme (e.g., types of customers)"
  • "suggest statistical models to describe populations"
  • "indicate rules for assigning new cases to classes for identification, tagging, and diagnostic purposes"
  • "provide measures of definition, size, and change in what were previously broad concepts"
  • "find typical cases to label and represent classes"
  • "decrease the size and complexity of the problem space for other data mining methods"
  • "identify outliers in a specific domain (i.e., rare-event detection)"
Association
  • "improve product placement on the sales floor [or magazine, or website] ... and [coordinate] promotional pricing of products"
  • identify credit fraud based on purchase combinations
  • "sequential patterns of services used by customers (checking account followed by savings account) can be used to identify other services they may be interested in (investment account)"
  • "structure product bundles to maximize revenue"
  • detect elevated medical risk as a combination of factors
  • market segmentation
  • "discover relationships between symptoms and illnesses, diagnosis and patient characteristics and treatments ... and genes and their functions"
Prediction
  • credit approval
  • store location
  • target marketing
  • fraud detection
  • attrition

1.1.2. Applications by industry

Per Sharda et al. (2014, pp. 160-161):

1.2. Doing data mining

1.2.1. Data mining processes

Per Sharda et al. (2014):

CRISP-DM* | SEMMA | KDD**
Business understanding | Sample: "generate a representative sample of the data" | Data selection
Data understanding | Explore: "visualization and basic description of the data" | Data preprocessing
Data preparation | Modify: "select variables, transform variable representations" | Data transformation
Model building | Model: "use a variety of statistical and machine learning models" | Data mining
Testing & evaluation | Assess: "evaluate the accuracy and usefulness of the models" | Interpretation/evaluation
Deployment | |

* Cross-Industry Standard Process for Data Mining, c. 1990s

** Knowledge Discovery in Databases, c. 1996

1.2.2. Common data mining pitfalls

Per Sharda et al. (2014, p. 195):

1.2.3. Data mining software

Per Sharda et al. (2014, p. 187), some commercial tools:

Also, they cite a poll from KDnuggets.com ranking tools from most to least popular:

2. Data mining techniques

2.1. Clustering

"Cluster analysis is an exploratory data analysis tool for solving classification problems. The objective is to sort cases (e.g., people, things, events) into groups, or clusters, so that the degree of association is strong among members of the same cluster and weak among members of different clusters" (Sharda et al., 2014, pp. 180).

2.1.1. Clustering versus classification

Per Sharda et al. (2014, p. 172), "Even though clustering ... can also be used to determine groups (or class memberships) of things, there is a significant difference between the two. Classification learns the function between the characteristics of things (i.e., [their] independent variables) and their membership (i.e., output variable) through a supervised learning process where both types (input and output) of variables are presented to the algorithm; in clustering, the membership of the objects is learned through an unsupervised learning process where only the input variables are presented to the algorithm. Unlike classification, clustering does not habe a supervising (or controlling) mechanism that enforces the learning process; instead, clustering algorithms use one or more heuristics (e.g., multidimensional distance measure) to discover natural groupings of objects."

2.1.2. Generic clustering methodology

Clustering algorithms usually require the desired number of clusters as a parameter. There are several approaches to estimating this parameter (Sharda et al., 2014, p. 181):

"Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items. Popular distance measures include Euclidean distance (the ordinary distance between two points that one would measure with a ruler) and Manhattan distance (also called rectilinear distance, or taxicab distance, between two points). ... Weighted averages may be used to establish these distances" (Sharda et al., 2014, 182).

2.1.3. Clustering techniques

Per Sharda et al. (2014, p. 181), algorithms may be classified by approach:

And also according to whether they are:

2.1.3.1. k-means

Per Sharda et al. (2014, p. 182):
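A minimal sketch of the standard k-means loop (a generic rendering, not Sharda et al.'s wording; pure Python, with illustrative data and k):

    # Standard k-means loop: assign each point to its nearest centroid,
    # recompute centroids as cluster means, repeat until assignments settle.
    import random

    def kmeans(points, k, iters=100):
        centroids = random.sample(points, k)
        for _ in range(iters):
            # Assignment step: each point joins the closest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k),
                        key=lambda c: sum((a - b) ** 2
                                          for a, b in zip(p, centroids[c])))
                clusters[i].append(p)
            # Update step: move each centroid to the mean of its cluster.
            new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl
                   else centroids[i] for i, cl in enumerate(clusters)]
            if new == centroids:  # converged: assignments are stable
                break
            centroids = new
        return centroids, clusters

    centroids, clusters = kmeans([(1, 1), (1.5, 2), (8, 8), (9, 9)], k=2)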

2.2. Association

"In essence, association rule mining [AKA affinity analysis, AKA market-basket analysis] aims to find interesting relationships (affinities) between variables (items) in large databases" (Sharda et al., 2014, p. 183).

Association rules have the following terms and notation:

2.2.1. Generic association methodology

"Several algorithms are available for discovering association rules ... [they] do only half the job, which is to identify the frequent itemsets in the database. Once the frequent itemsets are identified, they need to be converted into rules with antecedent and consequent parts ... [which is] a straightforward matching process, but the process may be time-consuming with large transaction databases.

Even though there may be many items in each section of the rule, in practice the consequent part usually contains a single item" (Sharda et al., 2014, p. 184).
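A sketch of that matching process for a single frequent itemset, using the standard support/confidence definitions and a single-item consequent (the toy transactions are illustrative, not from the source):

    # Convert one frequent itemset into rules with a single-item consequent.
    # Standard definitions:
    #   support(X => Y)    = fraction of transactions containing X and Y
    #   confidence(X => Y) = support(X and Y) / support(X)
    transactions = [{"milk", "bread"}, {"milk", "bread", "eggs"},
                    {"bread"}, {"milk", "eggs"}]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    def rules_from(itemset, min_conf=0.6):
        for item in itemset:
            antecedent = itemset - {item}
            if antecedent:
                confidence = support(itemset) / support(antecedent)
                if confidence >= min_conf:
                    yield antecedent, {item}, confidence

    for X, Y, conf in rules_from(frozenset({"milk", "bread"})):
        print(set(X), "=>", Y, round(conf, 2))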

2.2.2. Evaluating associations

2.2.3. Association techniques

2.2.3.1. Apriori algorithm

"Given a set of itemsets ... the algorithm attempts to find subsets that are common to at least a minimum number of items sets (i.e., complies with a minimum support). Apriori uses a bottom-up approach, where frequent subsets are extended one item at a time (a method known as _candidate generation ...), and groups of candidates at each level are tested agains the data for minimum support. The algorithm terminates when no further successful extensions are found" (Sharda et al., 2014, p. 185).

2.2.3.2. Eclat

2.2.3.3. FP-Growth

2.3. Prediction

Per Sharda et al. (2014, p. 172), "classification learns patterns from past data (a set of information---traits, variables, features---on characteristics of the previously labeled items, objects, or events) in order to place new instances (with unknown labels) into their respective groups or classes. If what is being predicted is a class label (e.g., 'sunny, 'rainy', or 'cloudy') the prediction problem is called a classification, whereas if it is a numeric value (e.g., temperature such as 68°F), the prediction problem is called a regression".

2.3.1. Regression

2.3.2. Classification

Per Sharda et al. (2014, p. 172), "classification learns patterns from past data (a set of information---traits, variables, features---on characteristics of the previously labeled items, objects, or events) in order to place new instances (with unknown labels) into their respective groups or classes."

2.3.2.1. Classification versus clustering

Per Sharda et al. (2014, p. 172), "Even though clustering ... can also be used to determine groups (or class memberships) of things, there is a significant difference between the two. Classification learns the function between the characteristics of things (i.e., [their] independent variables) and their membership (i.e., output variable) through a supervised learning process where both types (input and output) of variables are presented to the algorithm; in clustering, the membership of the objects is learned through an unsupervised learning process where only the input variables are presented to the algorithm. Unlike classification, clustering does not habe a supervising (or controlling) mechanism that enforces the learning process; instead, clustering algorithms use one or more heuristics (e.g., multidimensional distance measure) to discover natural groupings of objects."

2.3.2.2. Generic classification methodology

Per Sharda et al. (2014):

  1. Model development AKA training: "a collection of input data, including the actual class labels, is used ... [and] the model is trained"
  2. Model testing: "the model is tested against the holdout sample for accuracy assessment ..."
  3. Model deployment: "... and eventually deployed for actual use"
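A minimal sketch of these three steps, with scikit-learn as assumed tooling (the dataset, split ratio, and classifier choice are illustrative):

    # Holdout methodology: train on labeled data, assess on the holdout,
    # then use the fitted model to score new instances.
    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # 1. Model development: train on input data with actual class labels.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=1)
    model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

    # 2. Model testing: accuracy assessment against the holdout sample.
    print(accuracy_score(y_test, model.predict(X_test)))

    # 3. Model deployment: score new, unlabeled instances.
    print(model.predict(X_test[:3]))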
2.3.2.2.1. Training approaches

Per Sharda et al. (2014, pp. 174-175):

2.3.2.3. Evaluating classification models

Per Sharda et al. (2014, pp. 172-173):

2.3.2.3.1. Measuring accuracy

From Sharda et al. (2014, pp. 173-174), the universally known confusion matrix (AKA classification matrix, AKA contingency table) for a two-class classification problem:

Predicted Class | True Class: Positive | True Class: Negative
Positive | True positive count (TP) | False positive count (FP)
Negative | False negative count (FN) | True negative count (TN)

Common accuracy metrics:

Formula | Description
True positive rate (recall) = TP / (TP + FN) | "The ratio of correctly classified positives divided by the sum of correctly classified positives and incorrectly classified negatives"
True negative rate = TN / (TN + FP) | "The ratio of correctly classified negatives divided by the total negative count," i.e., the sum of correctly classified negatives and incorrectly classified positives
Accuracy = (TP + TN) / (TP + TN + FP + FN) | "The ratio of correctly classified instances (positives and negatives) divided by the total number of instances"
Precision = TP / (TP + FP) | "The ratio of correctly classified positives divided by the sum of correctly classified positives and incorrectly classified positives"

"The area under the ROC curve is a graphical assessment technique where the true positive rate is plotted on the y-axis and the false positive rate is plotted on the x-axis. The area under the ROC curve determines the accuracy measure of a classifier: A value of 1 indicates a perfect classifier whereas 0.5 indicates no better than random chance" (Sharda et al., 2014, p. 175).

Accuracy metrics for classification problems with more than two classes:

2.3.2.4. Classification techniques

2.3.2.4.1. Decision tree analysis

Per Sharda et al. (2014, pp. 176-178):

"A tree consists of brances and nodes. A branch represents the outcome of a test to classify ... A leaf node at the end represents the final class choice ... The basic idea behind a decision tree is that it recursively divides a training set until each division consists entirely or primarily of examples from one class. Each non-leaf node of the tree contains a split point, which is a test on one or more attributes and determines how the data are to be divided further. Decision tree algorithms, in general, build an initial tree from the training data such that each leaf node is pure ["i.e., contains members of the same class ... The basic idea is to ask questions whose answers provide the most information, similar to what we may do when playing the game 'Twenty Questions'"], and they then prune the tree to increase its generalization, and hence, the prediction accuracy on test data ...

  1. Create a root node and assign all of the training data to it.
  2. Select the best splitting attribute.
  3. Add a branch to the root node for each value of the split. Split the data into mutually exclusive (nonoverlapping) subsets along the lines of the specific splits and move to the branches.
  4. Repeat steps 2 and 3 for each and every leaf node until the stopping criterion is reached (e.g., the node is dominated by a single class label).

Many algorithms have been proposed for creating decision trees. These algorithms differ primarily in terms of the way in which they determine the splitting attribute (and its split values), the order of splitting the attributes (splitting the same attribute only once or many times), the number of splits at each node (binary versus ternary), the stopping criteria, and the pruning of the tree (pre- versus postpruning)."

Common algorithms:

To evaluate splits, use the Gini index or information gain:
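A sketch of both split criteria (standard formulas in pure Python; the candidate split is illustrative):

    # Gini index and information gain (entropy) for evaluating splits.
    from collections import Counter
    from math import log2

    def gini(labels):
        # Gini = 1 - sum of squared class proportions; 0 for a pure node.
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(parent, children):
        # Parent entropy minus the size-weighted entropy of the children.
        n = len(parent)
        return entropy(parent) - sum(len(ch) / n * entropy(ch)
                                     for ch in children)

    parent = ["yes"] * 5 + ["no"] * 5
    left = ["yes"] * 4 + ["no"]   # one side of a candidate split
    right = ["yes"] + ["no"] * 4  # the other side
    print(gini(parent), information_gain(parent, [left, right]))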

2.3.2.4.2. Statistical analysis

Per Sharda et al. (2014), logistic regression and discriminant analysis are two statistical techniques that were dominant until machine learning techniques were developed. Both have fairly restrictive assumptions.

2.3.2.4.3. Neural networks

Sharda et al. (2014, p. 176) call this "among the most popular" of classification techniques.

2.3.2.4.4. Case-based reasoning

"This approach uses historical cases to recognize commonalities in order to assign a new case into the more probable category" (Sharda et al., 2014, p. 176).

2.3.2.4.5. Bayesian classifiers

"This approach uses probability theory to build classification models based on the past occurrences that are capable of placing a new instance into a most probable class (or category)" (Sharda et al., 2014, p. 176).

2.3.2.4.6. Genetic algorithms

"[Genetic algorithms are inspired by] natural evolution to build directed-search-based mechanisms to classify sample data" (Sharda et al., 2014, p. 176).

2.3.2.4.7. Rough sets

"This method takes into account the partial membership of class labels into predefined categories in building models (collections of rules) for classification problems" (Sharda et al., 2014, p. 176).

3. Sources

3.1. Cited

Sharda, R., Delen, D., & Turban, E. (2014). Business intelligence: A managerial perspective on analytics (3rd ed.). New York, NY: Pearson.

3.2. References

3.3. Read

3.4. Unread