pkb contents > big data | just under 2456 words | updated 12/30/2017

1. What is Big Data?

Per Sharda et al. (2014, pp. 280-282):

Volume

Year  Estimated world data
2009  0.8 ZB
2010  >1 ZB
2011  1.8 ZB
2020  35 ZB

Variety (in format; about 80-85% unstructured)

Velocity

Veracity

Variability ("Daily, seasonal, and event-driven peak data loads")

Value (one hopes)

1.1. Sources of Big Data

"Web logs, RFID, GPS systems, sensor networks, social networks, Internet-based text documents, Internet search indexes, detail call records, astronomy, atmospheric science, biology, genomics, nuclear physics, biochemical experiments, medical records, scientific research, military surveillance, photography archives, video archives, and large-scale e-commerce practices" (Sharda et al., 2014, pp. 278-280).

1.2. Business applications of Big Data

Sharda et al. (2014, p. 287):

Per Zhu et al. (2014, pp. 16-17), there are four categories of business goals that companies may fruitfully pursue with Big Data:

REVENUE

CUSTOMER SERVICES

BUSINESS DEVELOPMENT

BUSINESS AGILITY & GOVERNANCE

1.2.1. Business applications of stream analytics

Per Sharda et al. (2014, pp. 317-321):

e-COMMERCE ("analysis of [clickstream] data can turn browsers into buyers and buyers into shopaholics")

TELECOMMUNICATIONS

LAW ENFORCEMENT & CYBER SECURITY

POWER INDUSTRY (smart meters)

FINANCIAL SERVICES

HEALTH SCIENCES

GOVERNMENT

1.3. Implementing Big Data initiatives

1.3.1. Big Data maturity model

Per Zhu et al. (2014, p. 26):

1.3.2. When to use Big Data versus data warehousing?

Use data warehouses for:

Use Hadoop as:

Requirement                                              DW  Hadoop
Low latency, interactive reports, and OLAP               X
ANSI 2003 SQL compliance is required                     X   X
Preprocessing or exploration of raw unstructured data        X
Online archives alternative to tape                          X
High-quality cleansed and consistent data                X   ?
100s to 1,000s of concurrent users                       ?   X
Discover unknown relationships in the data                   X
[Complex parallel] process logic                         ?   X
CPU intense analysis                                         X
System, users, and data governance                       X
Many flexible programming languages running in parallel      X
Unrestricted, ungoverned sandbox explorations                X
Analysis of provisional data                                 X
Extensive security and regulatory compliance             X   ?

1.3.3. Success factors for Big Data initiatives

Sharda et al. (2014, pp. 285-286) cite Watson's (2012) "critical success factors" as follows:

They also synthesize Lampitt (2012) and a Tableau white paper (pp. 312-313):

2. Big Data technologies

2.1. High-performance computing

2.2. Generic Big Data architectures

Per Zhu et al. (2014, p. 6):

Per Teradata, their product landscape, AKA the Unified Data Architecture:

Per AsterData, cited in Sharda et al. (2014, p. 283):

2.3. Big Data storage

2.3.1. Hadoop

Per Sharda et al. (2014, pp. 291, 294), "Hadoop is an open source framework for processing, storing, and analyzing massive amounts of distributed, unstructured data"; it is distributed storage plus distributed processing via the MapReduce framework.
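One way to see the processing side in practice is Hadoop Streaming, which lets any executable act as a mapper or reducer by reading input lines on stdin and writing tab-separated key/value pairs to stdout. A minimal word-count mapper in that style (the sample input line is hypothetical, and a real job would pipe stdin rather than a literal string):

```python
# Minimal mapper in the Hadoop Streaming style: Streaming feeds each
# input line to the mapper process and expects "key<TAB>value" lines
# on stdout, which the framework then shuffles to reducers.
def map_line(line):
    return [f"{word}\t1" for word in line.strip().split()]

# Shown on one hypothetical input line rather than a live stdin feed.
for pair in map_line("to be or not to be"):
    print(pair)
```

A companion reducer would read the shuffled "word<TAB>1" lines and sum the counts per word.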

2.3.1.1. Hadoop components

2.3.1.2. Hadoop subprojects

Per Sharda et al. (2014, pp. 292-293):

2.3.2. What are NoSQL databases?

Per Connolly and Begg (2015):

NoSQL databases use non-relational data models ...

... plus some of these other features ...

... to store Big Data, achieving better performance by:

2.3.2.1. NoSQL databases versus other data store options

Per Sharda et al. (2014, p. 295): "[W]hereas Hadoop is adept at supporting large-scale, batch-style historical analysis, NoSQL databases are aimed, for the most part (although there are some important exceptions), at serving up discrete data stored among large volumes of multi-structured data ... [a capability] sorely lacking in relational database technology ... the downside of most NoSQL databases today is that they trade ACID (atomicity, consistency, isolation, durability) compliance for performance and scalability. Many also lack mature management and monitoring tools".
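To make that contrast concrete, the toy sketch below (plain Python, not a real database) opposes the keyed, discrete lookups NoSQL stores are built to serve against the full-scan, batch-style pass that suits Hadoop; the `records` data is made up:

```python
# Hypothetical dataset: 100,000 records keyed by id.
records = [{"id": i, "value": i * i} for i in range(100_000)]

# NoSQL-style discrete access: index by key once, then each query is a
# single hash lookup, regardless of how large the store grows.
by_id = {r["id"]: r for r in records}
one_record = by_id[123]  # direct retrieval of one record

# Hadoop-style batch analysis: a full pass over every record to compute
# an aggregate, which is efficient in bulk but wasteful for one lookup.
total = sum(r["value"] for r in records)
```

The trade-off in the quote follows from this shape: optimizing for cheap keyed reads and writes at scale is what pushes many NoSQL systems to relax ACID guarantees.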

2.3.2.2. NoSQL database software

2.4. Big Data analytics

See notes on data science.

2.4.1. MapReduce

Per Dean and Ghemawat's seminal paper (2004): "MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system."

Another way of putting this, from Russom (2010) by way of Sharda et al. (2014, p. 295): "MapReduce provides control for analytics, not analytics per se. MapReduce is a general-purpose execution engine that handles the complexities of network communication, parallel programming, and fault-tolerance for any kind of application that you can hand code---not just analytics."
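As a toy illustration of the model Dean and Ghemawat describe, the word-count sketch below simulates the map, shuffle, and reduce phases in plain Python on one machine; in a real deployment the framework runs the map and reduce tasks in parallel across the cluster, and the `documents` list stands in for distributed file splits:

```python
from collections import defaultdict

# Hypothetical input: each string stands in for a file split.
documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: each mapper emits (key, value) pairs independently.
def mapper(line):
    for word in line.split():
        yield (word, 1)

# Shuffle phase: the framework groups all emitted values by key.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: each reducer aggregates the values for one key.
def reducer(key, values):
    return (key, sum(values))

pairs = (pair for line in documents for pair in mapper(line))
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # "the" appears 3 times across the documents
```

Note how this matches Russom's point: map and shuffle and reduce are generic execution machinery; the analytics, such as they are, live entirely in the hand-coded mapper and reducer.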

Per Sharda et al. (2014, p. 290):

2.4.2. Data stream mining

(AKA data-in-motion analytics, AKA in-motion analytics, AKA real-time data analytics)

Per Sharda et al. (2014, p. 215), stream analytics began in the energy industry and has become important because:

Some related concepts:
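A defining constraint of data-in-motion analytics is that the stream is unbounded, so computation must use constant memory per reading rather than storing the stream. A minimal sliding-window aggregate illustrates this; the smart-meter readings are hypothetical:

```python
from collections import deque

# Minimal sliding-window aggregate over a data stream: deque(maxlen=n)
# keeps only the last n readings, so memory stays constant no matter
# how long the stream runs.
class SlidingWindowAverage:
    def __init__(self, size):
        self.window = deque(maxlen=size)

    def update(self, reading):
        self.window.append(reading)
        return sum(self.window) / len(self.window)

# Hypothetical smart-meter readings arriving one at a time.
meter = SlidingWindowAverage(size=3)
for kwh in [2.0, 4.0, 6.0, 8.0]:
    avg = meter.update(kwh)
print(avg)  # average of the last three readings: (4 + 6 + 8) / 3
```

Real stream engines layer windowing, joins, and pattern detection on top of this same keep-a-bounded-state idea.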

3. Sources

3.1. Cited

Connolly, T. & Begg, C. (2015). Database systems: A practical approach to design, implementation, and management (6th ed.). New York City, NY: Pearson Education.

Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. Retrieved from https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf

Sharda, R., Delen, D., & Turban, E. (2014). Business intelligence: A managerial perspective on analytics (3rd ed.). New York City, NY: Pearson.

Zhu, W-D., Gupta, M., Kumar, V., Perepa, S., Sathi, A., & Statchuk, C. (2014). Building Big Data and analytics solutions in the cloud. IBM Redpaper. Retrieved from https://www.redbooks.ibm.com/redpapers/pdfs/redp5085.pdf

3.2. References

3.3. Read

3.4. Unread