> text analytics | just under 4044 words | updated 12/29/2017
Per Sharda et al. (2014, pp. 205-206), text analytics aims "to turn unstructured textual data into actionable information through the application of [techniques from] natural language processing (NLP) and analytics [i.e., [data mining](https://jtkovacs.github.io/refs/data-mining.html#what-is-data-mining)]", the latter taking a 'bag of words' approach and the former taking a much more sophisticated approach rooted in linguistics.
Text analytics includes the core activities of:
-
Information retrieval:
"searching and identifying relevant documents for a given set of key terms"; see
notes on search engines
and
IA for information retrieval
-
Text mining
AKA text data mining, AKA knowledge discovery in textual databases: "primarily focused on discovering new and useful relationships from the textual data sources"
-
Information extraction:
"identification of key phrases and relationships within text by looking for predefined objects and sequences by way of pattern matching"
-
Web mining
-
Search engines (overlaps with "information retrieval")
-
Web analytics
-
Social media analytics
-
Data mining
Text analytics is enabled by the foundational disciplines of:
-
Statistics
-
Computer Science
-
Artificial Intelligence
-
Machine Learning
-
Linguistics
-
Natural Language Processing
-
Management Science
Per Sharda et al., some applications of text analytics (2014, pp. 206-207):
-
"Topic tracking.
Based on a user profile and documents that a user views, text mining can predict other documents of interest to the user.
-
Categorization.
Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes.
-
Clustering.
Grouping similar documents without having a predefined set of categories.
-
Concept-linking.
Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they perhaps would not have found using traditional search methods."
... and some applications specifically enabled by NLP (pp. 213, 225; for a great example, see
Textio, The Augmented Writing Platform):
-
"Question-answering.
... producing a human language answer when given a human language question. ...
-
Automatic summarization.
The creation of a shortened version of a textual document by a computer program that contains the most important parts of the original document.
-
Natural language generation.
Systems convert information from computer databases into readable human language.
-
Natural language understanding.
Systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.
-
Machine translation.
Automatic translation of one human language to another.
-
Foreign language reading.
A computer program that assists a nonnative language speaker to read [or write, or speak] a foreign language ...
-
Speech recognition.
... Given a sound clip of a person speaking, the system produces a text dictation.
-
Text-to-speech.
Also called
speech synthesis,
a computer program automatically converts normal language text into human speech.
-
Text proofing.
A computer program reads a proof copy of a text in order to detect and correct errors.
-
Optical character recognition.
The automatic translation of images of handwritten, typewritten, or printed text (usually captured by a scanner) into machine-editable textual documents"
... and some applications specifically enabled by sentiment analysis, part of NLP (p. 233):
-
Incorporating 'buzz' into models of financial markets
-
Understanding the 'voices' of employees (VOE), customers (VOC) and the market (VOM)
-
Politics & surveillance
... and some applications specifically enabled by Web analytics (p. 250):
-
"improve the effectiveness of e-commerce Web sites"
-
"measure the results of traditional print or broadcast marketing campaigns" (impact on site traffic)
Per Sharda et al. (2014, pp. 213-220):
-
Information management
-
quarterly reports
-
manage search engines
-
manage websites
-
email
-
classify
-
filter junk
-
prioritize
-
generate automatic responses
-
Marketing
-
Data sources
-
call centers (notes and transcriptions)
-
blogs
-
user reviews
-
discussion boards & comment sections
-
Information sought
-
customer perceptions in the market at-large
-
CRM system-based insights about churn, perceptions, purchasing behavior
-
improve customer service performance by providing granular feedback on writing (e.g. email to customers)
-
Legal
-
court orders
-
patent files
-
Security
-
ECHELON, "assumed to be capable of identifying the content of telephone calls, faxes, emails, and other types of data, intercepting information sent via satellites, public-switched telephone networks, and microwave links"
-
FBI & CIA joint database development
-
deception detection
-
Academic & biomedical
-
citation analysis
-
research articles
-
medical records
-
molecular interactions
Per Sharda et al. (2014, p. 216):
"Applying text mining to a large set of real-world criminal (person-of-interest) statements, Fuller et al. (2008) developed prediction models to differentiate deceptive statements from truthful ones. Using a rich set of cues extracted from the textual statements, the model predicted the holdout samples with 70 percent accuracy, which is believed to be a significant success considering that the cues are extracted only from textual statements (no verbal or visual cues are present). Furthermore, compared to other deception-detection techniques, such as polygraph, this method is nonintrusive and widely applicable to not only textual data, but also (potentially) to transcriptions of voice recordings."
| Category | Example cues |
| --- | --- |
| Quantity | Verb count, noun-phrase count |
| Complexity | Average number of clauses, average sentence length |
| Uncertainty | Modifiers, modal verbs |
| Nonimmediacy | Passive voice, objectification |
| Expressivity | Emotiveness |
| Diversity | Lexical diversity, redundancy |
| Informality | Typographical error ratio |
| Specificity | Spatiotemporal information, perceptual information |
| Affect | Positive affect, negative affect |
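A few of these cue categories can be approximated with very little code. The sketch below is only an illustration, not the feature set used by Fuller et al.: the modal-verb list is a hypothetical stand-in, and a plain word count substitutes for the verb and noun-phrase counts, which would need a part-of-speech tagger.

```python
import re

# Hypothetical, deliberately tiny modal-verb list for the "uncertainty" cue.
MODAL_VERBS = {"may", "might", "could", "would", "should", "can", "must"}

def linguistic_cues(statement):
    """Compute a few simple surface cues for one statement."""
    sentences = [s for s in re.split(r"[.!?]+", statement) if s.strip()]
    words = re.findall(r"[a-z']+", statement.lower())
    n_words = max(len(words), 1)          # avoid division by zero on empty input
    n_sentences = max(len(sentences), 1)
    return {
        "word_count": len(words),                               # quantity (stand-in)
        "avg_sentence_length": len(words) / n_sentences,        # complexity
        "modal_verbs": sum(w in MODAL_VERBS for w in words),    # uncertainty
        "lexical_diversity": len(set(words)) / n_words,         # diversity
    }

cues = linguistic_cues("I might have been there. I could not say. I was home.")
print(cues)
```

Cue vectors like this one would then be fed to a classifier trained on statements already labeled deceptive or truthful.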
Per Sharda et al., text mining can be undertaken through the three-step process elaborated below (2014; I follow and mostly quote from pp. 220-226, but some term definitions are taken from pp. 206-207).
Delen and Crossland (2008, cited in Sharda et al., 2014) place the 'black box' of this data mining process into the following context, which they represent graphically:
-
Input
-
structured data
-
unstructured data
-
Constraints
-
software/hardware limitations
-
privacy issues
-
linguistic limitations
-
Mechanisms
-
domain expertise
-
tools & techniques
-
Output
-
context-specific knowledge
Establish the corpus ("large and structured set of texts ... prepared for the purpose of conducting knowledge discovery"):
-
"Collect
all documents related to the context (domain of interest) being studied", which may include:
-
XML files
-
emails
-
web pages
-
notes
-
memos
-
transcriptions of audio
-
Organize
(often into a flat text file with consistent character encoding)
Count raw frequencies
in each document:
-
Tokenize
raw input (a token is "a categorized block of text in a sentence ... assignment of meaning to blocks of text is known as tokenizing")
-
Filter out
stop words
OR filter in
include terms
-
stop words
"(or noise words) ... are filtered out prior to or after processing of natural language data ... there is no universally accepted list of stop words, [but] most natural language processing tools use a list that includes articles
(a, an, the, of, etc.),
auxiliary verbs
(is, are, was, were, etc.),
and context-specific words that are deemed not to have any differentiating value"
-
include terms
AKA term dictionary
-
Reckon with linguistic ambiguities
e.g. typos, synonyms, etc.
Perform stemming
to "[reduce] inflected words to their stem (or base or root) form"
Normalize frequencies
(e.g., to account for different document lengths or to assign different weights to different documents; can use log frequencies, binary frequencies, inverse document frequencies, etc.; "text mining research and practice have clearly indicated that the best weighting may come from the use of
term-frequency
divided by
inverse-document-frequency
... "; p. 245)
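The counting and weighting steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the stop-word list is a hypothetical toy, stemming is skipped, and the weighting uses the common formulation in which term frequency is multiplied by the logarithm of the inverse document frequency (the quotation above words the combination differently).

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; real tools ship much longer ones.
STOP_WORDS = {"a", "an", "the", "of", "is", "are", "was", "were", "and", "to"}

def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in c.items()
        })
    return weights

docs = [
    "The project is a software project",
    "The risk of the investment",
    "Software risk and project risk",
]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

In the second document, "investment" outweighs "risk" because it appears in no other document, which is the behavior the normalization step is after.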
Construct the term-by-document-matrix
AKA occurrence matrix (example below) --- a "common representation schema of the frequency-based relationship between the terms and documents in a tabular format where terms are listed in rows, documents are listed in columns, and the frequency between the terms and documents is listed in cells as integer values"
-
Latent semantic indexing
by singular value decomposition (SVD) "dimensionality reduction method to transform the term-by-document matrix to a manageable size by generating an intermediate representation of the frequencies using a matrix manipulation method similar to principal component analysis"; through SVD, "the analyst might identify the two or three most salient dimensions that account for most of the variability (differences) between the words and documents, thus identifying the latent semantic space that organizes the words and documents in the analysis. Once such dimensions are identified, the underlying 'meaning' of what is contained (discussed or described) in the documents has been extracted."
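| | Term 1 | Term 2 | Term 3 | Term 4 | Term 5 | Term 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Doc1 | 1 | | | 1 | | |
| Doc2 | | 1 | | | | |
| Doc3 | | | 3 | | 1 | |
| Doc4 | | 2 | 1 | | | 1 |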
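A term-by-document matrix like the one above can be built directly from token lists, and its most salient dimension can be estimated without a full SVD. The sketch below is an assumption-laden illustration: the terms are hypothetical, documents are placed in rows as in the example table, and power iteration recovers only the largest singular value and its term-space direction. A numerical library routine such as `numpy.linalg.svd` would normally be used to obtain all dimensions.

```python
import math
from collections import Counter

def term_document_matrix(token_lists):
    """Rows are documents, columns are terms (sorted alphabetically)."""
    terms = sorted({t for tokens in token_lists for t in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append([counts[t] for t in terms])
    return terms, rows

def leading_singular_value(matrix, iterations=100):
    """Power iteration on A^T A: largest singular value and its term-space vector."""
    v = [1.0] * len(matrix[0])
    for _ in range(iterations):
        w = [sum(a * x for a, x in zip(row, v)) for row in matrix]   # w = A v
        v = [sum(row[j] * w[i] for i, row in enumerate(matrix))      # v = A^T w
             for j in range(len(v))]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    w = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    return math.sqrt(sum(x * x for x in w)), v

# Hypothetical tokenized documents (already filtered and stemmed).
token_lists = [
    ["risk", "development"],
    ["project"],
    ["software", "software", "software", "sap"],
    ["project", "project", "software", "sap"],
]
terms, rows = term_document_matrix(token_lists)
sigma, direction = leading_singular_value(rows)
print(terms)
print(rows)
print(round(sigma, 3), [round(x, 3) for x in direction])
```

The terms with the largest entries in `direction` are the ones that dominate the first latent dimension.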
See
notes on data mining.
Per Sharda et al. (2014, pp. 224-225), clustering is often used to
improve search recall
("when a query matches a document its whole cluster is returned") and
precision
("grouping the documents into a number of much smaller groups of related documents, ordering them by relevance, and returning only the documents from the most relevant group or groups"). The most common clustering methods:
-
Scatter/gather
"dynamically generates a table of contents for the collection and adapts and modifies it in response to the user selection"
-
Query-specific clustering
"a hierarchical clustering approach where the most relevant documents to the posed query appear in small tight clusters that are nested in larger clusters"
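The recall idea quoted above ("when a query matches a document its whole cluster is returned") reduces to a small lookup once documents have cluster labels. The sketch assumes hypothetical document IDs and cluster assignments from some earlier clustering step; it does not implement scatter/gather or query-specific clustering.

```python
def expand_to_clusters(matches, cluster_of):
    """Return every document sharing a cluster with at least one matching document."""
    hit_clusters = {cluster_of[doc] for doc in matches}
    return sorted(doc for doc, c in cluster_of.items() if c in hit_clusters)

# Hypothetical cluster assignments produced by an earlier clustering step.
cluster_of = {"doc1": 0, "doc2": 0, "doc3": 1, "doc4": 2, "doc5": 1}

expanded = expand_to_clusters({"doc3"}, cluster_of)
print(expanded)  # ['doc3', 'doc5']
```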
Sharda et al. (2014, p. 225): "In text mining, associations specifically refer to the direct relationships between concepts (terms) or set of concepts ... [For *A* ==> *C*], confidence is the percentage of documents that include all the concepts in *C* within the same subset of those documents that include all the concepts in *A*. Support is the percentage (or number) of documents that include all the concepts in *A* and *C*."
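Both measures are simple ratios over the document collection. The sketch below assumes each document has already been reduced to a set of concepts; the concept names are hypothetical.

```python
def support_and_confidence(docs, antecedent, consequent):
    """docs: list of concept sets. Returns (support, confidence) for A ==> C."""
    a, c = set(antecedent), set(consequent)
    with_a = [d for d in docs if a <= d]            # documents containing all of A
    with_both = [d for d in with_a if c <= d]       # ... that also contain all of C
    support = len(with_both) / len(docs)
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence

docs = [
    {"merger", "acquisition", "stock"},
    {"merger", "stock"},
    {"merger", "lawsuit"},
    {"earnings", "stock"},
]
support, confidence = support_and_confidence(docs, {"merger"}, {"stock"})
print(support, confidence)
```

Here two of the four documents contain both concepts (support 0.5), and two of the three documents mentioning "merger" also mention "stock" (confidence about 0.67).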
(AKA automatic text categorization, a form of prediction)
Per Sharda et al. (2014, p. 224), some applications:
-
indexing text (semi/automatic)
-
filtering spam
-
cataloging web pages
-
generating metadata
-
genre detection
Comparing the distribution of concepts across different subcollections, e.g. from the same source but at different points in time.
With its two parent disciplines---artificial intelligence and computational linguistics---NLP extracts more meaning from textual data because it goes beyond the 'bag of words' approach to account for syntax, and, beyond that, "grammatical and semantic constraints as well as the context" (Sharda et al., 2014, p. 210).
Per Sharda et al. (2014, p. 210), NLP faces major challenges:
-
part-of-speech tagging
-
text segmentation
(identifying word boundaries in spoken language as well as written Chinese, Japanese, Thai, etc.)
-
word sense disambiguation
(see
notes on controlled vocabularies)
-
syntactic ambiguity
("multiple possible sentence structures often need to be considered")
-
irregular input
(e.g. typos, accents)
-
identifying speech acts,
speech that is meant to provoke an action
"Often we want to categorize text by topic, which may involve dealing with whole taxonomies of topics. Sentiment classification, on the other hand, usually deals with two classes (positive versus negative), a range of polarity (e.g., star ratings for movies), or even a range in strength of opinion" (Sharda et al., 2014, p. 229).
Per Sharda et al. (2014, pp. 234-237):
-
Sentiment detection:
determine whether a given passage is 'sentimentful', perhaps by calculating its Objectivity-Subjectivity (O-S) polarity
-
N-P polarity classification:
"classify the opinion as falling under one of two opposing sentiment polarities, or locate its position on the continuum between these two polarities"
-
Target identification:
identify what---explicit or implicit in the sentence (or other unit of analysis)---the expressed sentiment is directed towards (its target)
-
the challenge posed by this step varies greatly by domain
-
can be multiple valid or invalid targets in a sentence
-
Collection and aggregation:
polarity is calculated at the word level, which can then be aggregated to the sentence/phrase and document levels through simple summing; weighted averaging; or "as complex as using one or more machine-learning techniques to create a predictive relationship between the words (and their polarity values) and phrases or sentences"
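The simplest aggregation options in the last step, summing and weighted averaging, might look like the sketch below. The lexicon and its polarity values are hypothetical, and weighting sentences by word count is just one possible choice; the machine-learning variant mentioned in the quotation is not shown.

```python
import re

# Hypothetical word-level polarity lexicon on a -1..+1 scale.
LEXICON = {"great": 0.8, "love": 0.9, "slow": -0.4, "terrible": -0.9, "fine": 0.2}

def sentence_polarity(sentence):
    """Sum the polarity of every lexicon word found in the sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(LEXICON.get(w, 0.0) for w in words)

def document_polarity(text):
    """Average the sentence polarities, weighting each sentence by its word count."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    weights = [len(s.split()) for s in sentences]
    scores = [sentence_polarity(s) for s in sentences]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

review = "I love the screen. The battery is terrible and charging is slow."
print(sentence_polarity("I love the screen"))
print(document_polarity(review))
```

The review mixes a positive and a negative sentence, so its document-level score is a blend of the two rather than an absence of polarity.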
Per Sharda et al. (2014):
-
Sentiments can be explicit or implicit, "where the text implies an opinion"; the latter is much more difficult to detect
-
"A document containing several opinionated statements would have a mixed polarity overall, which is different from not having a polarity at all"
-
"an article may contain negative news without explicitly using any subjective words or terms"
Per Sharda et al. (2014, pp. 236-237):
-
Lexicon
-
Training documents
-
data
"Product-review Web sites like Amazon, C-NET, eBay, RottenTomatoes, and the Internet Movie Database (IMDB) have all been extensively used as sources of annotated data. The star (or tomato, as it were) system provides an explicit label of the overall polarity of the review, and it is often taken as the gold standard in algorithm evaluation"
-
algorithms
artificial neural networks, support vector machines, k-nearest neighbor, naive Bayes, decision trees, expectation maximization-based clustering
Per Sharda et al. (2014, pp. 240-241), web mining, AKA web data mining, "is essentially the same as data mining that uses data generated over the web". They contrast two common terms, noting that Web analytics has a narrower meaning but is replacing its parent term in popular discussion:

| Web mining | Web analytics |
| --- | --- |
| "all [Web] data ... including transaction, social, and usage data" | "Web site usage data" |
| "discover previously unknown patterns and relationships" | "describe what happened on a website" |
| "predictive or prescriptive analytics methodology" | "predefined, metrics-driven descriptive analysis" |
Per Sharda et al. (2014, p. 239) --- the Web is:
-
Big, growing, and constantly updated
-
Complex, e.g. authoring style, content variation, lack of unified structure, not specific to a domain
Web content and metadata can be scraped and mined by web crawlers, to:
-
reveal the
structure
of the Web, for example identifying
authoritative pages
and
hubs
on the basis of hyperlinks
-
build a corpus of
content
for knowledge discovery through text mining.
See
notes on search engines
for a discussion of how web crawlers are used there.
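The notes above do not name an algorithm for finding authoritative pages and hubs. One well-known link-analysis method for this is HITS (hyperlink-induced topic search), sketched below on a hypothetical link graph: a page's authority score grows with the hub scores of the pages linking to it, and its hub score grows with the authority scores of the pages it links to.

```python
import math

def hits(links, iterations=50):
    """links: {page: [pages it links to]}. Returns (authority, hub) score dicts."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages that link to this page.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values()))
        auth = {p: v / norm for p, v in auth.items()}
        # Hub: sum of the authority scores of the pages this page links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values()))
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub

# Hypothetical link structure gathered by a crawler.
links = {
    "portal": ["news", "docs", "blog"],
    "directory": ["news", "docs"],
    "blog": ["news"],
}
auth, hub = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph "news" collects the most inbound links from good hubs, and "portal" links out to the most authoritative pages.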
(AKA clickstream analysis)
Per Sharda et al. (2014, p. 251):
-
Web analytics may be done on
-
data from one's own web properties
(on-site)
or
-
data from other sites
(off-site,
including: email, sales and lead history, social media data)
-
On-site data can be in the form of
-
server logs,
"where the Web server records file requests made by browsers", or
-
page tagging,
"which uses JavaScript embedded in the site page code to make image requests to a third-party analytics-dedicated server whenever a page is rendered by a Web browser (or when a mouse click occurs)"
Citing TWG (2013), Sharda et al. present their metrics in four categories (2014, pp. 253-256):
How were they using my website?
-
page views, average page views per visitor
-
time on site
-
downloads
-
click map (clicks within webpages)
-
click paths (do you need to 'educate' new visitors or 'motivate' returning ones?)
Where did they come from?
-
referral web sites (where does your best traffic originate?)
-
search engines (keywords, landing pages)
-
direct hits (bookmarked and clicked, or typed the URL directly into the browser)
-
online and offline marketing campaigns (create dedicated page to catch traffic originating from these sources, e.g.
www.mycompany.com/offer50)
"What do my visitors look like?"
--- segmentation (and potentially, differentiation of landing pages) based on:
-
keywords (do they echo yours, or find the site via their own?)
-
content groupings ("analyze specific sections of your Web site that correspond with specific products, services, campaigns")
-
geography
-
time of day
-
when do people browse vs. buy?
"What does it all mean for the business?"
-
views
-
new visitors
-
returning visitors
-
actions
-
leads
-
sales/purchases/submissions
-
abandonment/exit and completion rates (#page_actions/#page_views)
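Several of these metrics are simple aggregates over clickstream records. The sketch below assumes a hypothetical record format of (visitor, page, action) tuples, with `None` marking a plain page view, and computes average page views per visitor and the completion rate as #page_actions/#page_views for one page.

```python
from collections import Counter

# Hypothetical clickstream: (visitor_id, page, action); action is None for a plain view.
clicks = [
    ("v1", "/home", None),
    ("v1", "/offer50", None),
    ("v1", "/offer50", "purchase"),
    ("v2", "/home", None),
    ("v2", "/offer50", None),
    ("v3", "/offer50", None),
]

def page_metrics(clicks, page):
    """Return (views, actions, completion rate) for one page."""
    views = sum(1 for _, p, action in clicks if p == page and action is None)
    actions = sum(1 for _, p, action in clicks if p == page and action is not None)
    return views, actions, (actions / views if views else 0.0)

# Page views per visitor, then the average across visitors.
views_per_visitor = Counter(v for v, _, action in clicks if action is None)
avg_views = sum(views_per_visitor.values()) / len(views_per_visitor)

print(avg_views)
print(page_metrics(clicks, "/offer50"))
```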
As defined by Gartner, social analytics is "monitoring, analyzing, measuring and interpreting digital interactions and relationships of people, topics, ideas and content" (qtd. in Sharda et al., 2014, p. 257).
Per Sharda et al. (2014):
-
Mathematical graph theory, c. 1950s
-
Network analysis, c. 1980s
Per Sharda et al. (2014):
-
Communication
(flow of information)
-
(social relationships)
-
Innovation
(flow of ideas)
Per Sharda et al. (2014):
CONNECTIONS
-
Homophily
(to what extent friends are similar)
-
Multiplexity
(nodes connected in multiple ways, e.g. people connected through multiple social roles)
-
Mutuality/reciprocity
(of interactions)
-
Network closure
(to what extent friends are also friends, AKA
transitivity)
-
Propinquity
(to what extent friendship reflects geographical proximity)
DISTRIBUTIONS
-
Bridge
(node that single-handedly connects separate clusters)
-
Structural holes
("absence of ties between two parts of a network")
-
Centrality
(influence/importance of a node, calculated different ways ---
betweenness, closeness, eigenvector, alpha, and degree centrality)
-
Density
("proportion of direct ties in a network relative to the total number possible")
-
Distance
("minimum number of ties required to connect two particular actors")
-
Tie strength
("linear combination of time, emotional intensity, intimacy, and reciprocity ...
strong ties
are associated with homophily, intimacy, propinquity, and transitivity, while
weak ties
are associated with bridges)
SEGMENTATION
-
Cliques
versus
social circles
(lots of direct ties, versus looser circles;
clustering coefficient
higher for cliques)
-
Cohesion
("minimum number of members who, if removed from the group, would disconnect the group")
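Three of the distribution metrics above (density, distance, and degree centrality) are easy to compute on a small graph. The sketch uses a hypothetical undirected friendship network stored as an adjacency mapping; the other centrality variants and the segmentation measures are not covered.

```python
from collections import deque

# Hypothetical undirected friendship network as an adjacency mapping.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"cat", "eve"},
    "eve": {"dan"},
}

def density(g):
    """Proportion of direct ties relative to the total number possible."""
    n = len(g)
    ties = sum(len(nbrs) for nbrs in g.values()) / 2   # each tie is stored twice
    return ties / (n * (n - 1) / 2)

def distance(g, start, goal):
    """Minimum number of ties connecting two actors (breadth-first search)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if node == goal:
            return d
        for nbr in g[node] - seen:
            seen.add(nbr)
            queue.append((nbr, d + 1))
    return None  # the two actors are not connected

def degree_centrality(g):
    """Share of the other nodes that each node is directly tied to."""
    return {node: len(nbrs) / (len(g) - 1) for node, nbrs in g.items()}

print(density(graph))
print(distance(graph, "ann", "eve"))
print(degree_centrality(graph))
```

In this network "dan" acts as a bridge: removing it disconnects "eve" from the ann-bob-cat clique.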
Per Sharda et al. (2014, p. 261),
social media includes
"online magazine, Internet forums, Web logs, social blogs, microblogging, wikis, social networks, podcasts, pictures, video, and product/service evaluations/ratings"; they cite Kaplan and Haenlein's (2010) typology of social media based on theories from "media research (social presence, media richness) and social processes (self-presentation, self-disclosure)":
-
collaborative projects,
e.g. Wikipedia
-
blogs and microblogs,
e.g. Tumblr
-
content communities,
e.g. YouTube
-
social networking sites,
e.g. Facebook
-
virtual game worlds,
e.g. World of Warcraft
-
virtual social worlds,
e.g. Second Life
Sharda et al. (2014, p. 262) summarize Morgan et al. (2010) regarding
differences between social and traditional media.
For social media,
-
Quality
is not always safeguarded with traditional editorial processes; it varies widely
-
Reach
can be similar, but traditional media scales via hierarchy and social media via network (i.e., virally)
-
Frequency
and
immediacy
can be higher for social media because it's "easier, faster, and cheaper", "resulting in fresher content"
-
Accessibility
(as readers) and
usability
(as authors) is higher for social media
-
Mutability
is clearly higher for digital content
Sharda et al. (2014, pp. 262-263) summarize Brogan and Bastone's (2011) stratification of
social media users
on the basis of time and intensity of use:
-
Inactives
-
Spectators
-
Collectors
-
Joiners
-
Critics
-
Creators
IBM Watson's DeepQA is a "massively parallel, text mining-focused, probabilistic evidence-based computational architecture ... [using] more than 100 different techniques for analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses" (Sharda et al., 2014, pp. 203-204):
https://en.wikipedia.org/wiki/Attensity
# reverse order of elements (in place for a list; a slice returns a reversed copy):
my_list.reverse()
my_string[::-1]
# selectively replace:
my_string.replace('this', 'with this')
# find index of known element:
my_list.index('str name')
# times element occurs:
my_list.count('em_name')
# make (index, value) tuples:
enumerate(my_list)
# remove punctuation (Python 3; the two-argument translate(None, ...) form is Python 2 only)
import string
line.translate(str.maketrans('', '', string.punctuation))
# modify case
my_string.lower()
my_string.upper()
my_string.capitalize()
my_string.title()
# remove whitespace by default, or remove specified characters
my_string.strip('chars')
my_string.lstrip()
my_string.rstrip()
# search for substrings within string or subset of string (i inclusive to j exclusive)
str_index = my_string.find(x, i, j) # returns -1 if not found
str_index = my_string.index(x, i, j) # raises ValueError if not found
my_string.endswith(x, i, j)
my_string.startswith(x, i, j)
my_string.count(x, i, j)
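The fragments above can be combined into a short runnable cleanup pass. The sample string is hypothetical, and the punctuation removal uses the Python 3 `str.maketrans` form.

```python
import string

line = "  Hello, World! Hello again.  "

stripped = line.strip()                       # drop leading/trailing whitespace
no_punct = stripped.translate(str.maketrans("", "", string.punctuation))
words = no_punct.lower().split()              # normalize case, split on whitespace

print(no_punct)                 # Hello World Hello again
print(words.count("hello"))     # 2
print(words.index("world"))     # 1
print(stripped.find("World"))   # 7
print(stripped[::-1])           # the stripped line, reversed
```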
-
https://docs.python.org/3/library/re.html
-
https://docs.python.org/3/howto/regex.html
-
http://nbviewer.jupyter.org/github/ptwobrussell/Mining-the-Social-Web-2nd-Edition/tree/master/ipynb/
import re
# match the beginning of a string:
re.match(pattern, text, flags)
re.match(r'Jac', data) # the r denotes a raw string
# search anywhere in a string:
# first match only:
re.search(pattern, text, flags)
# all nonoverlapping:
re.findall(pattern, text, flags)
# phone number, note escaped parentheses:
re.search(r'\(\d\d\d\) \d\d\d-\d\d\d\d', data)
# make parentheses, space, hyphen optional in phone number
r'\(?\d{3}\)?\s?-?\d{3}-\d{4}'
flags:
-
re.IGNORECASE or re.I will ignore word case
-
re.VERBOSE or re.X let regexp span lines & contain (ignored) whitespace or comments
-
re.MULTILINE or re.M to make a pattern regard lines in your text as the beginning or end of a string
-
multiple flags: re.findall(pattern, data, flag|flag|flag)
# store regex for reuse:
my_regex = re.compile(pattern, flags)
re.search(my_regex, data)
# OR
my_regex.search(data)
# loop to obtain iterable of match objects:
for match in my_regex.finditer(data):
    print('{first} {last} <{email}>'.format(**match.groupdict()))
-
\w = any Unicode word character, \W = anything not a Unicode word character
-
\s = any whitespace, \S = anything not whitespace, \t = tab
-
\d = any number 0-9, \D = any non-number
-
\b = word boundaries, \B = not word boundaries
counts, for when something occurs multiple times:
-
{3} = occurs 3 times, {,3} = 0-3 times, {3,} = 3 or more times, {3,5} = 3-5 times
-
\w? = 0-1 word characters, \w* = 0-infinite word characters, \w+ = 1-infinite word characters
sets let us combine explicit characters and escape patterns into pieces that can be repeated multiple times; they also let us specify characters that should be left out of any match: [aple]+ matches apple and pale, [a-z] matches any lowercase letter, [A-Z] any uppercase letter, [a-zA-Z] any letter of either case, [^2] any character except 2, [0-9] any digit, [.]+ one or more literal periods
# groups search for multiple conditions simultaneously; note that ^ marks the beginning of the string, and $ marks the end; re.findall returns all groups (named or not) as tuples, while a match object's groupdict() returns named groups as a dict:
my_var = re.findall(r'''
    ^(?P<name>[-\w ]+,\s[-\w ]+)\t       # search for lastname, firstname
    (\(?\d{3}\)?\s?-?\d{3}-\d{4})?       # search for phone number, optional
    (?P<email>[-\w\d.+]+@[-\w\d.]+)\t$   # search for emails
''', data, re.X | re.M) # re.X is required for a multi-line, commented pattern
# group addressing, on a match object (from re.search, re.match, or re.finditer):
match.groups()
match.groupdict()
match.group('group_name')
match.group(1)
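A runnable version of the named-group idea, as a sketch: the record layout (name, email, phone separated by tabs) and the sample data are hypothetical, chosen only so the pattern has something to match.

```python
import re

# Hypothetical tab-separated records: "Last, First<TAB>email<TAB>phone".
data = (
    "Doe, Jane\tjane.doe@example.com\t(555) 123-4567\n"
    "Roe, Sam\tsam@example.org\t555-987-6543\n"
)

record = re.compile(r"""
    ^(?P<name>[-\w ]+,\s[-\w ]+)\t             # last name, first name
    (?P<email>[-\w.+]+@[-\w.]+)\t              # email address
    (?P<phone>\(?\d{3}\)?\s?-?\d{3}-\d{4})$    # phone; parentheses, space, hyphen optional
""", re.VERBOSE | re.MULTILINE)

# finditer yields match objects, so groupdict() and group() are available.
rows = [m.groupdict() for m in record.finditer(data)]
for row in rows:
    print("{name} <{email}> {phone}".format(**row))
```

`re.VERBOSE` lets the pattern span lines and carry comments, and `re.MULTILINE` makes `^` and `$` anchor at each line rather than only at the ends of the whole string.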
Sharda, R., Delen, D., & Turban, E. (2014). *Business intelligence: A managerial perspective on analytics* (3rd ed.). New York City, NY: Pearson.