data mining | updated 12/28/2017
Per Sharda et al. (2014), the field of data mining draws on statistics, artificial intelligence, and machine learning to build tools that facilitate the discovery of meaningful patterns in large BI datasets (see notes on BI systems). Adapted from Sharda et al. (2014, p. 157), there are three general types of patterns sought:
Data Mining Task | Learning Method | Popular Algorithms |
---|---|---|
Prediction (forecasting Y based on Xs) | Supervised | Classification and regression trees, ANN, SVM, genetic algorithms |
Classification | Supervised | Decision trees, ANN/MLP, SVM, rough sets, genetic algorithms |
Regression | Supervised | Linear/nonlinear regression, regression trees, ANN/MLP, SVM |
Association (relationship between X and Y) | Unsupervised | Apriori, OneR, ZeroR, Eclat |
Link analysis | Unsupervised | Expectation maximization, Apriori algorithm, graph-based matching |
Sequence analysis | Unsupervised | Apriori algorithm, FP-growth technique |
Clustering (breaking X into logical groups) | Unsupervised | K-means, ANN/SOM |
Outlier analysis | Unsupervised | K-means, expectation maximization (EM) |
See notes on statistics and notes on machine learning.
Per Sharda et al. (2014, pp. 172, 181, 183):
[Figure: examples of clustering, association, and prediction]
Per Sharda et al. (2014, pp. 160-161):
Per Sharda et al. (2014):
CRISP-DM* | SEMMA | KDD** |
---|---|---|
Business understanding | Sample: "generate a representative sample of the data" | Data selection |
Data understanding | Explore: "visualization and basic description of the data" | Data preprocessing |
Data preparation | Modify: "select variables, transform variable representations" | Data transformation |
Model building | Model: "use a variety of statistical and machine learning models" | Data mining |
Testing & evaluation | Assess: "evaluate the accuracy and usefulness of the models" | Interpretation/evaluation |
Deployment | | |
* Cross-Industry Standard Process for Data Mining, c. 1990s
** Knowledge Discovery in Databases, c. 1996
Per Sharda et al. (2014, p. 195):
Per Sharda et al. (2014, p. 187), some commercial tools:
Also, they cite a poll from KDNuggets.com (most to least popular):
"Cluster analysis is an exploratory data analysis tool for solving classification problems. The objective is to sort cases (e.g., people, things, events) into groups, or clusters, so that the degree of association is strong among members of the same cluster and weak among members of different clusters" (Sharda et al., 2014, pp. 180).
Per Sharda et al. (2014, p. 172), "Even though clustering ... can also be used to determine groups (or class memberships) of things, there is a significant difference between the two. Classification learns the function between the characteristics of things (i.e., [their] independent variables) and their membership (i.e., output variable) through a supervised learning process where both types (input and output) of variables are presented to the algorithm; in clustering, the membership of the objects is learned through an unsupervised learning process where only the input variables are presented to the algorithm. Unlike classification, clustering does not habe a supervising (or controlling) mechanism that enforces the learning process; instead, clustering algorithms use one or more heuristics (e.g., multidimensional distance measure) to discover natural groupings of objects."
Clustering algorithms usually require the desired number of clusters to be set as a parameter, and there are several different approaches to estimating it (Sharda et al., 2014, p. 181); one common heuristic, the elbow method, is sketched below.
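The elbow method (my example; not necessarily one of the approaches Sharda et al. list) runs k-means over a range of candidate k values and looks for the point where the within-cluster sum of squares stops dropping sharply. A minimal sketch, assuming scikit-learn and synthetic data:

```python
# Elbow-method sketch for choosing k: track the within-cluster sum of squares
# (inertia) as k grows and look for the "elbow" where improvement levels off.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # inertia_ = total within-cluster sum of squares
```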
"Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items. Popular distance measures include Euclidean distance (the ordinary distance between two points that one would measure with a ruler) and Manhattan distance (also called rectilinear distance, or taxicab distance, between two points). ... Weighted averages may be used to establish these distances" (Sharda et al., 2014, 182).
Per Sharda et al. (2014, p. 181), algorithms may be classified by approach:
And also according to whether they are:
Per Sharda et al. (2014, p. 182):
"In essence, association rule mining [AKA affinity analysis, AKA market-basket analysis] aims to find interesting relationships (affinities) between variables (items) in large databases" (Sharda et al., 2014, p. 183).
Association rules are conventionally written X ⇒ Y, where X is the antecedent (left-hand side) and Y is the consequent (right-hand side); each rule is qualified by its support (how often the itemset appears in the data) and its confidence (how often the rule holds when the antecedent is present).
"Several algorithms are available for discovering association rules ... [they] do only half the job, which is to identify the frequent itemsets in the database. Once the frequent itemsets are identified, they need to be converted into rules with antecedent and consequent parts ... [which is] a straightforward matching process, but the process may be time-consuming with large transaction databases.
Even though there may be many items in each section of the rule, in practice the consequent part usually contains a single item" (Sharda et al., 2014, p. 184).
"Given a set of itemsets ... the algorithm attempts to find subsets that are common to at least a minimum number of items sets (i.e., complies with a minimum support). Apriori uses a bottom-up approach, where frequent subsets are extended one item at a time (a method known as _candidate generation ...), and groups of candidates at each level are tested agains the data for minimum support. The algorithm terminates when no further successful extensions are found" (Sharda et al., 2014, p. 185).
Per Sharda et al. (2014, p. 172), "classification learns patterns from past data (a set of information---traits, variables, features---on characteristics of the previously labeled items, objects, or events) in order to place new instances (with unknown labels) into their respective groups or classes. If what is being predicted is a class label (e.g., 'sunny, 'rainy', or 'cloudy') the prediction problem is called a classification, whereas if it is a numeric value (e.g., temperature such as 68°F), the prediction problem is called a regression".
Per Sharda et al. (2014, p. 172), "classification learns patterns from past data (a set of information---traits, variables, features---on characteristics of the previously labeled items, objects, or events) in order to place new instances (with unknown labels) into their respective groups or classes."
Per Sharda et al. (2014, p. 172), "Even though clustering ... can also be used to determine groups (or class memberships) of things, there is a significant difference between the two. Classification learns the function between the characteristics of things (i.e., [their] independent variables) and their membership (i.e., output variable) through a supervised learning process where both types (input and output) of variables are presented to the algorithm; in clustering, the membership of the objects is learned through an unsupervised learning process where only the input variables are presented to the algorithm. Unlike classification, clustering does not habe a supervising (or controlling) mechanism that enforces the learning process; instead, clustering algorithms use one or more heuristics (e.g., multidimensional distance measure) to discover natural groupings of objects."
Per Sharda et al. (2014):
Per Sharda et al. (2014, pp. 174-175):
Per Sharda et al. (2014, pp. 172-173):
From Sharda et al. (2014, pp. 173-174), the universally known confusion matrix (AKA classification matrix, AKA contingency table) for a two-class classification problem:
 | True Class: Positive | True Class: Negative |
---|---|---|
Predicted Class: Positive | True positive count (TP) | False positive count (FP) |
Predicted Class: Negative | False negative count (FN) | True negative count (TN) |
Common accuracy metrics:
Formula | Description |
---|---|
TRUE POSITIVE RATE = RECALL = TP/(TP+FN) | "The ratio of correctly classified positives divided by the sum of correctly classified positives and incorrectly classified negatives" |
TRUE NEGATIVE RATE = SPECIFICITY = TN/(TN+FP) | "The ratio of correctly classified negatives divided by the total negative count," i.e., the sum of correctly classified negatives and incorrectly classified positives |
ACCURACY = (TP+TN)/(TP+TN+FP+FN) | "The ratio of correctly classified instances (positives and negatives) divided by the total number of instances" |
PRECISION = TP/(TP+FP) | "The ratio of correctly classified positives divided by the sum of correctly classified positives and incorrectly classified positives" |
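A quick sketch of my own (made-up counts) for checking these formulas against a confusion matrix:

```python
# Accuracy metrics computed from two-class confusion-matrix counts (illustrative values).
TP, FP, FN, TN = 80, 10, 20, 90

true_positive_rate = TP / (TP + FN)            # recall / sensitivity
true_negative_rate = TN / (TN + FP)            # specificity
accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)

print(f"TPR={true_positive_rate:.3f}  TNR={true_negative_rate:.3f}  "
      f"accuracy={accuracy:.3f}  precision={precision:.3f}")
```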
"The area under the ROC curve is a graphical assessment technique where the true positive rate is plotted on the y-axis and the false positive rate is plotted on the x-axis. The area under the ROC curve determines the accuracy measure of a classifier: A value of 1 indicates a perfect classifier whereas 0.5 indicates no better than random chance" (Sharda et al., 2014, p. 175).
Accuracy metrics for classification problems with more than two classes:
Per Sharda et al. (2014, pp. 176-178):
"A tree consists of brances and nodes. A branch represents the outcome of a test to classify ... A leaf node at the end represents the final class choice ... The basic idea behind a decision tree is that it recursively divides a training set until each division consists entirely or primarily of examples from one class. Each non-leaf node of the tree contains a split point, which is a test on one or more attributes and determines how the data are to be divided further. Decision tree algorithms, in general, build an initial tree from the training data such that each leaf node is pure ["i.e., contains members of the same class ... The basic idea is to ask questions whose answers provide the most information, similar to what we may do when playing the game 'Twenty Questions'"], and they then prune the tree to increase its generalization, and hence, the prediction accuracy on test data ...
Many algorithms have been proposed for creating decision trees. These algorithms differ primarily in terms of the way in which they determine the splitting attribute (and its split values), the order of splitting the attributes (splitting the same attribute only once or many times), the number of splits at each node (binary versus ternary), the stopping criteria, and the pruning of the tree (pre- versus postpruning)."
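A short sketch of my own (scikit-learn on the Iris data) that fits a small tree and prints its split points and leaf classes, which makes the recursive-division idea visible:

```python
# Sketch: fit a shallow decision tree and print its learned split points and leaves.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each non-leaf line is a split point (a test on one attribute); leaves give the class.
print(export_text(tree, feature_names=list(iris.feature_names)))
```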
Common algorithms: ID3, C4.5/C5, CART (classification and regression trees), and CHAID.
To evaluate splits, use the Gini index or information gain (entropy reduction); both measures are sketched below.
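The standard definitions (my own sketch in plain Python, not the book's formulas): Gini = 1 − Σ pᵢ² over the class proportions at a node, and information gain = parent entropy minus the size-weighted entropy of the child nodes.

```python
# Gini index and information gain for a candidate split (standard definitions).
from collections import Counter
from math import log2

def gini(labels):
    # Gini = 1 - sum(p_i^2) over class proportions; 0 means a pure node.
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    # Entropy of the parent minus the size-weighted entropy of the child nodes.
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

parent = ["yes"] * 6 + ["no"] * 4                               # labels reaching the node
split = (["yes"] * 5 + ["no"] * 1, ["yes"] * 1 + ["no"] * 3)    # a candidate binary split

print("Gini(parent):", round(gini(parent), 3))
print("Information gain:", round(information_gain(parent, split), 3))
```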
Per Sharda et al. (2014), logistic regression and discriminant analysis are two statistical techniques that were dominant until machine learning techniques were developed; both have fairly restrictive assumptions.
Sharda et al. (2014, p. 176) call this "among the most popular" of classification techniques.
"This approach uses historical cases to recognize commonalities in order to assign a new case into the more probable category" (Sharda et al., 2014, p. 176).
"This approach uses probability theory to build classification models based on the past occurrences that are capable of placing a new instance into a most probable class (or category)" (Sharda et al., 2014, p. 176).
"[Genetic algorithms are inspired by] natural evolution to build directed-search-based mechanisms to classify sample data" (Sharda et al., 2014, p. 176).
"This method takes into account the partial membership of class labels into predefined categories in building models (collections of rules) for classification problems" (Sharda et al., 2014, p. 176).
Sharda, R., Delen, D., & Turban, E. (2014). Business intelligence: A managerial perspective on analytics (3rd ed.). New York, NY: Pearson.