Text analytics | updated 12/29/2017

1. What is text analytics?

Per Sharda et al. (2014, pp. 205-206), text analytics aims "to turn unstructured textual data into actionable information through the application of [techniques from] natural language processing (NLP) and analytics [i.e., [data mining](https://jtkovacs.github.io/refs/data-mining.html#what-is-data-mining)]", the latter taking a 'bag of words' approach and the former taking a much more sophisticated approach rooted in linguistics.

Text analytics includes the core activities of:

Text analytics is enabled by the foundational disciplines of:

1.1. Business applications of text analytics

1.1.1. Applications by technique

Per Sharda et al., some applications of text analytics (2014, pp. 206-207):

... and some applications specifically enabled by NLP (pp. 213, 225; for a great example, see Textio, The Augmented Writing Platform):

... and some applications specifically enabled by sentiment analysis, part of NLP (p. 233):

... and some applications specifically enabled by Web analytics (p. 250):

1.1.2. Applications by industry

Per Sharda et al. (2014, pp. 213-220):

1.1.2.1. Deception detection

Per Sharda et al. (2014, p. 216):

"Applying text mining to a large set of real-world criminal (person-of-interest) statements, Fuller et al. (2008) developed prediction models to differentiate deceptive statements from truthful ones. Using a rich set of cues extracted from the textual statements, the model predicted the holdout samples with 70 percent accuracy, which is believed to be a significant success considering that the cues are extracted only from textual statements (no verbal or visual cues are present). Furthermore, compared to other deception-detection techniques, such as polygraph, this method is nonintrusive and widely applicable to not only textual data, but also (potentially) to transcriptions of voice recordings."

| Construct | Example Cues |
|---|---|
| Quantity | Verb count, noun-phrase count |
| Complexity | Average number of clauses, average sentence length |
| Uncertainty | Modifiers, modal verbs |
| Nonimmediacy | Passive voice, objectification |
| Expressivity | Emotiveness |
| Diversity | Lexical diversity, redundancy |
| Informality | Typographical error ratio |
| Specificity | Spatiotemporal information, perceptual information |
| Affect | Positive affect, negative affect |
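A few of the cues in the table can be approximated with plain string processing. The sketch below is not Fuller et al.'s feature extractor: the tokenizer, the modal-verb list, and the function name `extract_cues` are assumptions made for illustration, and lexical diversity is taken here to mean the type-token ratio.

```python
import re

# Assumed, non-exhaustive list of modal verbs (an "Uncertainty" cue).
MODAL_VERBS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}

def extract_cues(statement):
    """Return a few illustrative cue values for one textual statement."""
    # Naive sentence split on terminal punctuation; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", statement) if s.strip()]
    # Naive word tokenizer: runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", statement.lower())
    return {
        "word_count": len(words),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "modal_verb_count": sum(1 for w in words if w in MODAL_VERBS),
        # Type-token ratio: distinct words divided by total words.
        "lexical_diversity": len(set(words)) / len(words) if words else 0.0,
    }

cues = extract_cues("I might have been home. I could not say for sure.")
print(cues)
```

Values like these would be the inputs to a classifier; the prediction model itself is outside the scope of this sketch.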

2. Text analytics techniques

2.1. Text mining

Per Sharda et al., text mining can be undertaken through the three-step process elaborated below (2014; I follow and mostly quote from pp. 220-226, but some term definitions are taken from pp. 206-207).

Delen and Crossland (2008, cited in Sharda et al., 2014) place the 'black box' of this data mining process into the following context, which they represent graphically:

2.1.1. Establish the corpus

("large and structured set of texts ... prepared for the purpose of conducting knowledge discovery")

2.1.2. Process the corpus

Count raw frequencies in each document:

Perform stemming to "[reduce] inflected words to their stem (or base or root) form"

Normalize frequencies (e.g., to account for different document lengths or to assign different weights to different documents; can use log frequencies, binary frequencies, inverse document frequencies, etc.; "text mining research and practice have clearly indicated that the best weighting may come from the use of term-frequency divided by inverse-document-frequency ... "; p. 245)

Construct the term-by-document-matrix AKA occurrence matrix (example below) --- a "common representation schema of the frequency-based relationship between the terms and documents in a tabular format where terms are listed in rows, documents are listed in columns, and the frequency between the terms and documents is listed in cells as integer values"

Term1 Term2 Term3 Term4 Term4 Term5
Doc1 1 1
Doc2 1
Doc3 3 1
Doc4 2 1 1
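The counting, weighting, and matrix-construction steps can be sketched in a few lines. The corpus below is invented, whitespace tokenization stands in for real preprocessing (no stemming or stop-word removal), and the weighting shown is the common TF-IDF formulation, which multiplies term frequency by the inverse document frequency log(N / df).

```python
import math
from collections import Counter

# Toy corpus; the documents and their names are invented for illustration.
docs = {
    "Doc1": "investment risk in project management",
    "Doc2": "software project management",
    "Doc3": "risk risk risk development",
}

# Raw term frequencies per document.
counts = {name: Counter(text.split()) for name, text in docs.items()}
terms = sorted({t for c in counts.values() for t in c})

# Occurrence matrix: one row per document, one column per term.
matrix = {name: [counts[name][t] for t in terms] for name in docs}

# Inverse document frequency: log(N / number of documents containing the term).
n_docs = len(docs)
idf = {t: math.log(n_docs / sum(1 for c in counts.values() if t in c)) for t in terms}

# TF-IDF weights: raw frequency multiplied by IDF.
tfidf = {name: [counts[name][t] * idf[t] for t in terms] for name in docs}

print(terms)
for name in docs:
    print(name, matrix[name])
```

A term that appears in every document gets an IDF of zero, so it contributes nothing to the weighted matrix, which is the point of the normalization.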

2.1.3. Analyze the data

See notes on data mining.

2.1.3.1. Clustering

Per Sharda et al. (2014, pp. 224-225), clustering is often used to improve search recall ("when a query matches a document its whole cluster is returned") and precision ("grouping the documents into a number of much smaller groups of related documents, ordering them by relevance, and returning only the documents from the most relevant group or groups"). The most common clustering methods:

2.1.3.2. Association

Sharda et al. (2014, p. 225): "In text mining, associations specifically refer to the direct relationships between concepts (terms) or set of concepts ... [For A ==> C], confidence is the percentage of documents that include all the concepts in C within the same subset of those documents that include all the concepts in A. Support is the percentage (or number) of documents that include all the concepts in A and C."
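The two definitions translate directly into set operations. This is a minimal sketch on invented data; the function names `support` and `confidence` and the representation of each document as a set of concepts are assumptions for illustration.

```python
# Each document is represented by the set of concepts it mentions (toy data).
documents = [
    {"merger", "acquisition", "stock"},
    {"merger", "stock"},
    {"merger", "acquisition"},
    {"stock", "dividend"},
]

def support(docs, a, c):
    """Share of all documents that contain every concept in both A and C."""
    both = a | c
    return sum(1 for d in docs if both <= d) / len(docs)

def confidence(docs, a, c):
    """Among documents containing all of A, the share that also contain all of C."""
    with_a = [d for d in docs if a <= d]
    if not with_a:
        return 0.0
    return sum(1 for d in with_a if c <= d) / len(with_a)

a, c = {"merger"}, {"acquisition"}
print(support(documents, a, c))
print(confidence(documents, a, c))
```

Here two of the four documents mention both concepts (support 0.5), and two of the three documents mentioning "merger" also mention "acquisition" (confidence 2/3).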

2.1.3.3. Classification

(AKA automatic text categorization, a form of prediction)

Per Sharda et al. (2014, p. 224), some applications:

2.1.3.4. Trend analysis

Comparing the distribution of concepts across different subcollections, e.g. from the same source but at different points in time.
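One simple way to make this comparison concrete is to compute each term's share of all tokens in two subcollections and look at the difference. The sketch below treats single whitespace-separated words as "concepts", which is a simplifying assumption; the data and the helper name `concept_shares` are invented.

```python
from collections import Counter

def concept_shares(texts):
    """Relative frequency of each term across one subcollection."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

# Two invented subcollections from the same source at different times.
earlier = ["cloud pricing update", "pricing changes"]
later = ["cloud security incident", "cloud outage", "security review"]

before, after = concept_shares(earlier), concept_shares(later)

# Change in share per term, largest increases first.
shift = {t: after.get(t, 0.0) - before.get(t, 0.0) for t in set(before) | set(after)}
for term, delta in sorted(shift.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:10s} {delta:+.2f}")
```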

2.2. Natural language processing

With its two parent disciplines---artificial intelligence and computational linguistics---NLP extracts more meaning from textual data because it goes beyond the 'bag of words' approach to account for syntax, and, beyond that, "grammatical and semantic constraints as well as the context" (Sharda et al., 2014, p. 210).

2.2.1. Challenges with NLP

Per Sharda et al. (2014, p. 210), NLP faces major challenges:

2.2.2. Sentiment analysis

"Often we want to categorize text by topic, which may involve dealing with whole taxonomies of topics. Sentiment classification, on the other hand, usually deals with two classes (positive versus negative), a range of polarity (e.g., star ratings for movies), or even a range in strength of opinion" (Sharda et al., 2014, p. 229).

2.2.2.1. Generic sentiment analysis process

Per Sharda et al. (2014, pp. 234-237):

2.2.2.2. Challenges with sentiment identification

Per Sharda et al. (2014):

2.2.2.3. Methods for sentiment identification

Per Sharda et al. (2014, pp. 236-237):

2.3. Web mining

Per Sharda et al. (2014, pp. 240-241), web mining, AKA web data mining, "is essentially the same as data mining that uses data generated over the web". They contrast two common terms, noting that Web analytics has a narrower meaning but is replacing its parent term in popular discussion:

| Web mining | Web analytics |
|---|---|
| "all [Web] data ... including transaction, social, and usage data" | "Web site usage data" |
| "discover previously unknown patterns and relationships" | "describe what happened on a website" |
| "predictive or prescriptive analytics methodology" | "predefined, metrics-driven descriptive analysis" |

2.3.1. Challenges with web mining

Per Sharda et al. (2014, p. 239) --- the Web is:

2.3.2. Web crawlers (structure & content mining)

Web content and metadata can be scraped and mined by web crawlers, to:

See notes on search engines for a discussion of how web crawlers are used there.

2.3.3. Web analytics (usage mining)

(AKA clickstream analysis)

Per Sharda et al. (2014, p. 251):

2.3.3.1. Metrics for on-site web analytics

Citing TWG (2013), Sharda et al. present their metrics in four categories (2014, pp. 253-256):

2.3.3.1.1. Web site usability

How were they using my website?

2.3.3.1.2. Traffic sources

Where did they come from?

2.3.3.1.3. Visitor profiles

"What do my visitors look like?" --- segmentation (and potentially, differentiation of landing pages) based on:

2.3.3.1.4. Conversion statistics

"What does it all mean for the business?"

2.3.3.2. Technologies for web analytics

2.3.4. Social analytics

As defined by Gartner, social analytics is "monitoring, analyzing, measuring and interpreting digital interactions and relationships of people, topics, ideas and content" (qtd. in Sharda et al., 2014, p. 257).

2.3.4.1. Social network analysis

Per Sharda et al. (2014):

2.3.4.1.1. Types of networks

Per Sharda et al. (2014):

2.3.4.1.2. Network metrics

Per Sharda et al. (2014):

CONNECTIONS

DISTRIBUTIONS

SEGMENTATION

2.3.4.2. Social media analytics

2.3.4.2.1. What is social media?

Per Sharda et al. (2014, p. 261), social media includes "online magazine, Internet forums, Web logs, social blogs, microblogging, wikis, social networks, podcasts, pictures, video, and product/service evaluations/ratings"; they cite Kaplan and Haenlein's (2010) typology of social media based on theories from "media research (social presence, media richness) and social processes (self-presentation, self-disclosure)":

Sharda et al. (2014, p. 262) summarize Morgan et al. (2010) regarding differences between social and traditional media. For social media,

Sharda et al. (2014, pp. 262-263) summarize Brogan and Bastone's (2011) stratification of social media users on the basis of time and intensity of use:

2.3.4.2.2. Tools for social media analytics

3. Text analytics tools

3.1. IBM Watson

IBM Watson's DeepQA is a "massively parallel, text mining-focused, probabilistic evidence-based computational architecture ... [using] more than 100 different techniques for analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses" (Sharda et al., 2014, pp. 203-204):

3.2. Attensity

https://en.wikipedia.org/wiki/Attensity

3.3. Python

# reverse order of elements (list in place; string via slicing):
my_list.reverse()
my_string[::-1]
# selectively replace:
my_string.replace('this', 'with this')
# find index of known element (raises ValueError if absent):
my_list.index('str name')
# number of times an element occurs:
my_list.count('elem_name')
# iterate over (index, value) tuples:
enumerate(my_list)

3.3.1. String manipulation

# remove punctuation (Python 3; the Python 2 form was line.translate(None, string.punctuation))
import string
line.translate(str.maketrans('', '', string.punctuation))

# modify case
my_string.lower()
my_string.upper()
my_string.capitalize()
my_string.title()

# remove whitespace by default, or remove specified characters
my_string.strip('chars')
my_string.lstrip()
my_string.rstrip()
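The snippets above can be combined into a small normalization helper of the kind often applied before counting terms. This is a sketch for Python 3; the function name `normalize` is hypothetical.

```python
import string

def normalize(line):
    """Remove punctuation, lowercase, and trim surrounding whitespace."""
    # str.maketrans with a third argument maps those characters to None (deletion).
    table = str.maketrans("", "", string.punctuation)
    return line.translate(table).lower().strip()

print(normalize("  Hello, World!  "))
```

Note that stripping all punctuation also removes apostrophes and the periods inside abbreviations, which may or may not be what a given analysis wants.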

3.3.2. Regex

# plain string methods: search for substrings within string or subset of string (i inclusive to j exclusive)
str_index = my_string.find(x, i, j)   # returns -1 if not found
str_index = my_string.index(x, i, j)  # raises ValueError if not found
my_string.endswith(x, i, j)
my_string.startswith(x, i, j)
my_string.count(x, i, j)

# regular expressions live in the re module:
import re
# match the beginning of a string:
re.match(pattern, text, flags)
re.match(r'Jac', data)  # the r denotes a raw string

# search anywhere in a string:
# first match only:
re.search(pattern, text, flags)
# all nonoverlapping:
re.findall(pattern, text, flags)

# phone number, note escaped parentheses:
re.search(r'\(\d\d\d\) \d\d\d-\d\d\d\d', data)
# make parentheses, space, and first hyphen optional in phone number:
r'\(?\d{3}\)?\s?-?\d{3}-\d{4}'
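To see what "optional" buys in the phone-number pattern, it helps to run it against a few formats. The sample strings below are invented; `fullmatch` is used so that a partial match does not count.

```python
import re

# Parentheses, the space, and the first hyphen are optional; the second hyphen is not.
phone_re = re.compile(r'\(?\d{3}\)?\s?-?\d{3}-\d{4}')

samples = ["(555) 555-1234", "555-555-1234", "555 555-1234", "5555551234"]
results = {s: bool(phone_re.fullmatch(s)) for s in samples}
for s, ok in results.items():
    print(s, ok)
```

The last sample fails because the hyphen before the final four digits is still required by this pattern.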

flags:

# store regex for reuse:
my_regex = re.compile(pattern, flags)
re.search(my_regex, data)
# OR
my_regex.search(data)

# loop to obtain iterable of match objects:
# assumes the pattern defines named groups first, last, and email
for match in my_regex.finditer(data):
    print('{first} {last} <{email}>'.format(**match.groupdict()))

counts, for when something occurs multiple times:

sets let us combine explicit characters and escape patterns into pieces that can be repeated multiple times; they also let us specify characters that should be left out of any matches: [aple]+ finds apple and pale (a bare [aple] matches a single character from the set), [a-z] finds any lowercase letter, [A-Z] finds any uppercase letter, [a-zA-Z] finds a letter of either case, [^2] finds any character other than 2, [0-9] finds any digit, [.,]+ finds a run of one or more periods or commas.

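# groups capture several sub-patterns at once; ^ marks the beginning of the string and $ the end
# (of each line, with re.MULTILINE); re.findall returns a list of tuples of the groups,
# while match objects expose named groups as a dict via groupdict().
# re.VERBOSE is required for a multi-line pattern with inline comments:
my_var = re.findall(r'''
    ^(?P<name>[-\w ]+,\s[-\w ]+)\t       # search for lastname, firstname
    (\(?\d{3}\)?\s?-?\d{3}-\d{4})?       # search for phone number, optional
    (?P<email>[-\w\d.+]+@[-\w\d.]+)\t$   # search for emails
    ''', data, re.VERBOSE | re.MULTILINE)

# group addressing, on a match object from re.search, re.match, or finditer
# (re.findall returns plain tuples, which have no such methods):
my_match.groups()
my_match.groupdict()
my_match.group('group_name')
my_match.group(1)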
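A self-contained run shows how named groups and group addressing fit together. The record below is invented, and the pattern is a variation on the one above: it assumes tab-separated fields (name, phone, email) and names the phone group as well.

```python
import re

# Invented sample record: "Last, First<TAB>phone<TAB>email".
data = "Lovelace, Ada\t(555) 555-1234\tada@example.com"

contact_re = re.compile(r'''
    ^(?P<name>[-\w ]+,\s[-\w ]+)\t              # lastname, firstname
    (?P<phone>\(?\d{3}\)?\s?-?\d{3}-\d{4})?\t   # phone number, optional
    (?P<email>[-\w.+]+@[-\w.]+)$                # email address
    ''', re.VERBOSE | re.MULTILINE)

match = contact_re.search(data)
print(match.groupdict())    # all named groups as a dict
print(match.group('name'))  # one group by name
print(match.group(2))       # one group by position (the phone number)
print(match.groups())       # all groups as a tuple
```

In verbose mode, whitespace in the pattern is ignored except inside a character class, which is why the space in `[-\w ]` still matches a literal space.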

4. Sources

4.1. Cited

Sharda, R., Delen, D., & Turban, E. (2014). Business intelligence: A managerial perspective on analytics (3rd ed.). New York City, NY: Pearson.

4.2. References

4.3. Read

4.4. Unread