> data science | just under 1184 words | updated 12/30/2017
Via Mason and Wiggins (2010):

| OSEMN Model | Obtain | Scrub | Explore | Model | iNterpret |

| Alt Terms | Acquire | Clean | Analyze | Apply | Wrangle |

| Skills & Tools | Plain text; CSV; JSON; XML/HTML; Query DB; Query API; REST; Encoding | Filter data; Extract data; Extract values; Replace values; Handle NULL, missing data; Convert formats | Summary stats; Visualization; Clustering; Classification; Regression; Dimension reduction | Conclusion; Implications; Communication |
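Several of the Scrub and Explore skills listed above (handling missing data, filtering, converting formats, summary stats) can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the sample data, the column names, and the `scrub_rows` helper are hypothetical and do not come from Mason and Wiggins.

```python
import csv
import io
import json
import statistics

# Hypothetical raw CSV with an empty field and an "N/A" placeholder.
RAW_CSV = """city,temp_f
Austin,95
Boston,
Chicago,N/A
Denver,77
"""

def scrub_rows(text):
    """Parse CSV text, map '' and 'N/A' to None, and convert temperatures to float."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        value = row["temp_f"].strip()
        # Handle NULL / missing data: treat placeholders as None.
        temp = None if value in ("", "N/A") else float(value)
        rows.append({"city": row["city"], "temp_f": temp})
    return rows

rows = scrub_rows(RAW_CSV)

# Filter data: keep only rows with a usable measurement.
complete = [r for r in rows if r["temp_f"] is not None]

# Summary stats on the cleaned values.
temps = [r["temp_f"] for r in complete]
summary = {"n": len(temps), "mean": statistics.mean(temps), "max": max(temps)}

# Convert formats: CSV in, JSON out.
as_json = json.dumps(complete)
```

Dropping incomplete rows is only one way to handle missing data; imputing a value or flagging the row may suit other analyses better.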
- What are major risks in web scraping?
- How do you parse scraped web data (HTML, JSON, XML)?
- How is authorization implemented in Google APIs?
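For the parsing question above, here is a minimal sketch using only Python's standard library: `json.loads` for JSON, `xml.etree.ElementTree` for XML, and a subclass of `html.parser.HTMLParser` for HTML. The sample payloads and the `LinkCollector` class are made up for illustration. Real scraped HTML is often malformed, so dedicated parsers such as lxml or Beautiful Soup are commonly used instead.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical payloads standing in for scraped responses.
JSON_TEXT = '{"items": [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]}'
XML_TEXT = "<items><item id='1'>alpha</item><item id='2'>beta</item></items>"
HTML_TEXT = '<html><body><a href="/a">A</a><p>text</p><a href="/b">B</a></body></html>'

# JSON: parse into dicts and lists, then pull out a field.
json_names = [item["name"] for item in json.loads(JSON_TEXT)["items"]]

# XML: parse into an element tree and read the <item> children.
root = ET.fromstring(XML_TEXT)
xml_names = [el.text for el in root.findall("item")]

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs arrives as a list of (name, value) pairs.
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

collector = LinkCollector()
collector.feed(HTML_TEXT)
html_links = collector.links
```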
See notes on data visualization.
[https://medium.com/@eytanadar/banning-exploration-in-my-infovis-class-9578676a4705](https://medium.com/@eytanadar/banning-exploration-in-my-infovis-class-9578676a4705)
See notes on models, statistics, machine learning, and text analytics.
Per Sharda et al. (2014, p. 300):
- "Work closely with a product engineering team to identify and answer important product questions
- Answer product questions by using appropriate statistical techniques on available data
- Communicate findings to product managers and engineers
- Drive the collection of new data and the refinement of existing data sources
- Analyze and interpret the results of product experiments
- Develop best practices for instrumentation and experimentation and communicate those to product engineering teams"
Per Sharda et al. (2014, p. 299):

SOFT
- Domain expertise, problem definition, and decision making
- Curiosity and creativity
- Communication and interpersonal

HARD
- Data access and management (both traditional and new data systems)
- Programming, scripting, and hacking
- Internet and social media/social networking technologies
R, Python, Bash, SQL on MySQL, Spark, Excel, and Tableau are the most common; see the 2016 Data Science Salary Survey and the 2016 Stack Overflow Developer Survey.
Per Janssens (2015):
- Agile: supports faster iteration through a read-eval-print loop (REPL) versus an edit-compile-run-debug loop
- Augmenting: amplifies rather than replaces existing tools
- Scalable: ability to automate commands means they're repeatable, supporting scalable analytic workflows
- Extensible: because command line tools are language-agnostic
- Ubiquitous (via Linux/Unix)
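As a rough illustration of the "augmenting" and "extensible" points above: a program in any language can act as one stage in a pipeline, as long as it reads text from an input stream and writes text to an output stream. The sketch below is hypothetical (neither `count_words` nor `run_filter` comes from Janssens) and uses in-memory streams so it runs on its own.

```python
import io
from collections import Counter

def count_words(lines):
    """Return (word, count) pairs, most frequent first, for an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts.most_common()

def run_filter(infile, outfile):
    """Read text lines from infile and write 'word<TAB>count' lines to outfile."""
    for word, n in count_words(infile):
        outfile.write(f"{word}\t{n}\n")

# Demonstration with in-memory streams. In a real script the call would be
# run_filter(sys.stdin, sys.stdout), so the tool can sit inside a shell pipe.
demo_out = io.StringIO()
run_filter(io.StringIO("to be or not\nto be\n"), demo_out)
result = demo_out.getvalue()
```

Wired to standard input and output and saved under a hypothetical name such as `wordcount.py`, the same function could be combined with existing tools, for example `cat notes.txt | python wordcount.py | head -n 5`.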
Janssens, J. (2015). Data science at the command line: Facing the future with time-tested tools. Sebastopol, CA: O'Reilly.

Mason, H., & Wiggins, C. (2010). A taxonomy of data science [blog post]. dataists. Retrieved from http://www.dataists.com/2010/09/a-taxonomy-of-data-science/

Sharda, R., Delen, D., & Turban, E. (2014). Business intelligence: A managerial perspective on analytics (3rd ed.). New York City, NY: Pearson.
- Formulas
- What is a modern, SaaS-based BI stack?
- What's a modern BI stack?
- What we know about spreadsheet errors
- Eight (no, nine!) problems with Big Data
- What is the difference between Data Analytics, Data Analysis,
- Deep learning vs machine learning vs pattern recognition
- Big data: the four layers that everyone must know
- The curse of big data
- The parable of Google flu: traps in Big Data analysis
- The Cardinal Sin of Data Mining
- Approaching (Almost) Any Machine Learning Problem
- Techniques of Big Data
- Top 3 algorithms in plain English
- Decision tree algorithm
- association rule mining
- random forest classifier
- How to detect a pattern: problem and solution
- Big data acronyms and abbreviations
- Where predictive analytics is having the biggest impact
- 21 data science systems used by Amazon to operate its business
- How is big data used in practice?
- 24 Uses of Statistical Modeling
- Machine Learning Becomes Mainstream: How to Increase Your Competitive Advantage
- How Companies are Using Machine Learning
- Improving operations using data analytics
- Text analysis of Trump's tweets confirms he writes only the (angrier) Android half
- Facebook V: Predicting Check Ins
- EdX: The Analytics Edge
- Of prediction and policy: Applying algorithms to public policy
- Beginner’s Guide to the History of Data Science
- A very short history of data science
- Data science: the end of statistics?
- What statisticians think about data scientists
- Data Science Compared to 16 Analytic Disciplines
- High versus low-level data science
- Data science cheat sheet
- Skills checklist for first data science job
- 38 seminal papers all data scientists should read
- What statistics concepts are needed for excelling at data science?
- 4 easy steps to becoming a data scientist
- Top Algorithms and Methods Used by Data Scientists
DS curriculum: Pluralsight, General Assembly, DataQuest, Udacity, Data School, Open Source Data Science masters
- Getting started with data science
- Data science projects for data science apprentices
- New coder tutorials
- Confronting jargon
- My data science journey
- The data science industry: Who does what?
- Data science career paths: Different roles in the industry
- 9 types of data scientists
- 
Data science interview prep: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
- 5 secrets for writing the perfect data science resume
- 80,000 Hours Career Guide
- What factors can increase your data science salary?
- Data Science for IoT vs Classic Data Science: 10 Differences
- Data wrangling with Python.
- Coursera - Using Python to Access Web Data
- Scraping the web
- Prevent web scraping
- How to prevent getting blacklisted while scraping
- Some traps to know and avoid in web scraping
- HTML scraping with requests and lxml
- Don’t parse HTML with regex
- Automate the Boring Stuff with Python: Web Scraping
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
- The Data Scrub
- Why data preparation should not be overlooked