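A Free Tutorial Path for Becoming and Advancing as a Data Analyst

I have collected some reliable free big data courses from abroad and put them together as an online self-study guide.
This is currently an Alpha Version; I will translate, organize, and expand it over time.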

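Notes:

This is a very technical path covering big data processing, programming, and statistics. It is not about Excel sheets, PowerPoint, or business-consulting-style market analysis. If your goal is to be a regular Business Analyst or BI consultant, you do not need this guide.
It targets the processing and analysis of big data (1 TB+). If your data is just a few Excel sheets, skip this.
All course material is in English, and you may need to get around the firewall to reach it (at your own risk).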

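Highlights:

Everything is free!
Helps complete beginners get started quickly (teaches basic statistics and programming; no background needed, just common sense).
Teaches the key ideas, methods, and tools of the whole big data analysis process, from data collection and analysis to the final visualization.
Time required: 310+ hours.
- Newbie: That long? Isn't that too slow?
- Answer: What? You have no background at all, so how fast do you expect it to be? You studied English for 9 years and still needed 3 months at New Oriental to prepare for the GRE.
- Newbie: I have already studied some of this.
- Answer: Then skip those parts, newbie.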

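Disclaimer: I studied and built my professional skills in an English-speaking environment, so I do not know the Chinese names of many terms. Corrections are welcome.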

The path covers the following areas:

Core courses:

exploratory and predictive statistics
basic Python
advanced computer program design
an introduction to algorithms
R for statistical analysis
practical machine learning techniques
Unix
data visualization best practices

Advanced tracks:
Track A - Presentation: Visualizing Data
Track B - Algorithms: Analyzing Social Networks
Track C - Technology: Big Data: Hadoop and MapReduce

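This guide takes time to put together, and I am not sure whether people are interested in the content above. If this answer gets more than 50 upvotes, I will write the guide up.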

-------------------------------divider-------------------------------------------

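Statistics

Exploratory and Predictive Statistics - Introductory Statistics

Statistics crash course
1. Statistics - Udemy (12 hours)
This course covers the fundamentals of first-year statistics. It is quick and blunt, and it gives you a basic picture of what statistics is. It will not make you proficient, but it will at least show you what the subject looks like.

Optional: complete introductory course (strongly recommended if you have the time)
2.1 Introduction to Statistics: Descriptive Statistics (50 hours)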
The focus of Stat2.1x is on descriptive statistics. The goal of descriptive statistics is to summarize and present numerical information in a manner that is illuminating and useful. The course will cover graphical as well as numerical summaries of data, starting with a single variable and progressing to the relation between two variables. Methods will be illustrated with data from a variety of areas in the sciences and humanities.
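
As a minimal sketch of what such numerical summaries look like in practice, the Python snippet below summarizes a single variable and then measures the relation between two variables. The data values and the helper name `pearson_r` are made up for illustration and do not come from the course.

```python
import statistics

# Hypothetical data: hours studied and exam scores for six students.
hours = [2, 4, 6, 8, 10, 12]
scores = [50, 55, 65, 70, 80, 85]

# Summaries of a single variable.
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
stdev_score = statistics.stdev(scores)  # sample standard deviation


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Relation between two variables.
r = pearson_r(hours, scores)
print(mean_score, median_score, round(stdev_score, 2), round(r, 3))
```

A correlation close to 1 says the two variables rise together almost linearly; it says nothing about which one causes the other.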

2.2 Introduction to Statistics: Probability (50 hours)
The focus of Stat2.2x is on probability theory: exactly what is a random sample, and how does randomness work? If you buy 10 lottery tickets instead of 1, does your chance of winning go up by a factor of 10? What is the law of averages? How can polls make accurate predictions based on data from small fractions of the population? What should you expect to happen "just by chance"? These are some of the questions we will address in the course.
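
The lottery question can be explored with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: the odds of 1 in 1,000,000 per ticket are invented, and the two scenarios (ten distinct tickets in one draw with a single winning number, versus one ticket in each of ten independent draws) are my own framing rather than the course's.

```python
from fractions import Fraction

# Hypothetical odds: each ticket wins with probability 1 in 1,000,000.
p = Fraction(1, 1_000_000)

# Case 1: 10 different tickets in the same draw with one winning number.
# At most one ticket can win, so the events are mutually exclusive
# and their probabilities add.
same_draw = 10 * p

# Case 2: 1 ticket in each of 10 independent draws.
# P(at least one win) = 1 - P(losing all 10 draws).
independent_draws = 1 - (1 - p) ** 10

print(float(same_draw / p))          # improvement factor in case 1
print(float(independent_draws / p))  # improvement factor in case 2
```

Under these assumptions the chance goes up by exactly a factor of 10 in the first case and by slightly less than 10 in the second, because in independent draws the wins can overlap.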

2.3 Introduction to Statistics: Inference (50 hours)
The focus of Stat2.3x is on statistical inference: how to make valid conclusions based on data from random samples. At the heart of the main problem addressed by the course will be a population (which you can imagine for now as a set of people) connected with which there is a numerical quantity of interest (which you can imagine for now as the average number of MOOCs the people have taken).
We will discuss good ways to select a subset of the population (yes, at random); how to estimate the numerical quantity of interest, based on what you see in your sample; and ways to test hypotheses about numerical or probabilistic aspects of the problem.
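
A minimal sketch of this idea in Python: draw a random sample from a population, estimate the population average from it, and attach a rough 95% interval. The population here is simulated, and the sample size of 400 and the normal approximation are assumptions chosen for illustration, not something the course prescribes.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: number of MOOCs taken by 100,000 people.
population = [random.randint(0, 10) for _ in range(100_000)]
true_mean = statistics.mean(population)

# Select the subset at random, without replacement.
sample = random.sample(population, 400)
estimate = statistics.mean(sample)

# Standard error of the sample mean and a rough 95% interval
# (normal approximation: 1.96 standard errors on each side).
std_error = statistics.stdev(sample) / len(sample) ** 0.5
interval = (estimate - 1.96 * std_error, estimate + 1.96 * std_error)

print(round(true_mean, 2), round(estimate, 2), interval)
```

In real work the true mean is unknown; it is computed here only so the estimate can be compared against it.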

Programming

Basic Python
1. Intro to Python (3 - 5 hours), crash course

This is a great place to start if you have no programming background at all or want to brush up. If you have programming experience but have never seen Python, you may still want to skim through these lessons. You’ll learn basic programming techniques, such as loops, lists and dictionaries, functions, classes, and file input/output.
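
To give a feel for those building blocks, here is a small sketch that touches each of them once: a class, a dictionary, a loop over a list, a function, and file input/output. The class name `WordCounter`, the file name, and the sample text are all hypothetical and are not taken from the course.

```python
import os
import tempfile


class WordCounter:
    """Counts how often each word appears (a small class example)."""

    def __init__(self):
        self.counts = {}  # dictionary: word -> number of occurrences

    def add_line(self, line):
        for word in line.lower().split():  # loop over a list of words
            self.counts[word] = self.counts.get(word, 0) + 1


def count_words_in_file(path):
    """Function with file input: read a text file line by line."""
    counter = WordCounter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            counter.add_line(line)
    return counter


# File output: write a small sample file, then read it back.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write("big data needs big tools\n")
    handle.write("data beats opinions\n")

result = count_words_in_file(path)
print(result.counts["big"], result.counts["data"])
```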

1.1 Bonus: Complete the Python Statistics Problem Set (0.5 hours)

2. Videos and Problem Sets of Design of Computer Programs (20 - 30 hours)
This class will teach you to write elegant and efficient code. This will be essential in order to manipulate data effectively and write code that is reusable and easy for others to understand. You will also learn about some of the more sophisticated Python techniques, such as generator functions and list comprehensions.
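
For readers who have not met these two techniques yet, the sketch below shows a generator function and a list comprehension side by side. The function name `read_numbers` and the input lines are invented for illustration.

```python
def read_numbers(lines):
    """Generator function: lazily yield one number per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield float(line)


# Hypothetical raw input, as it might come from a text file.
raw_lines = ["3.5\n", "\n", "4.0\n", "10\n"]

# List comprehension: build a new list in a single expression.
squares = [x * x for x in read_numbers(raw_lines)]

# Generator expression: sum the values without building a list first.
total = sum(x for x in read_numbers(raw_lines))

print(squares, total)
```

The generator processes one line at a time, which matters when the input is far larger than memory; the list comprehension is the better fit when the whole result is needed at once.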

Optional: Computer programming and Python, complete introductory course

2. Introduction to Computer Science and Programming Using Python (135 hours)
This course focuses on breadth rather than depth. The goal is to provide students with a brief introduction to many topics so they will have an idea of what is possible when they need to think about how to use computation to accomplish some goal later in their career.

A Notion of computation
The Python programming language
Some simple algorithms
Testing and debugging
An informal introduction to algorithmic complexity
Data structures

SQL and JSON

1. Introduction to Database (10 hours - you only need the introductory sections at the beginning)
Watch the videos on Relational Databases, JSON Data, Relational Algebra, and SQL, and complete the exercises for those sections.
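
As a rough illustration of how JSON data and SQL fit together, the sketch below parses a few JSON records and queries them with SQL through Python's built-in `sqlite3` module. The table name `customers`, its columns, and the records are hypothetical, and SQLite is simply a convenient stand-in for whatever database the course uses.

```python
import json
import sqlite3

# Hypothetical JSON records, e.g. as returned by a web API.
raw = (
    '[{"name": "Ann", "city": "Boston", "spend": 120.5},'
    ' {"name": "Bo", "city": "Boston", "spend": 80.0},'
    ' {"name": "Cy", "city": "Austin", "spend": 42.0}]'
)
records = json.loads(raw)  # list of dictionaries

# Load the records into an in-memory relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (:name, :city, :spend)", records
)

# SQL aggregation: number of customers and total spend per city.
rows = conn.execute(
    "SELECT city, COUNT(*), SUM(spend) FROM customers "
    "GROUP BY city ORDER BY city"
).fetchall()
conn.close()

print(rows)
```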

Algorithms: Getting Started

1. Introduction to Algorithms (SMA 5503) (15 hours - you only need the introductory sections at the beginning)
This course teaches techniques for the design and analysis of efficient algorithms, emphasizing methods useful in practice. Topics covered include: sorting; search trees, heaps, and hashing; divide-and-conquer; dynamic programming; amortized analysis; graph algorithms; shortest paths; network flow; computational geometry; number-theoretic algorithms; polynomial and matrix calculations; caching; and parallel computing.
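
Merge sort is a standard example that combines two of the listed topics, sorting and divide-and-conquer. The version below is a plain teaching sketch, not code from the course: split the list in half, sort each half recursively, then merge the two sorted halves.

```python
def merge_sort(items):
    """Return a new sorted list using divide-and-conquer, O(n log n)."""
    if len(items) <= 1:
        return list(items)

    # Divide: split the input into two halves and sort each one.
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Conquer: merge the two sorted halves into one sorted list.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


print(merge_sort([5, 2, 9, 1, 5, 6]))
```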

Tools

1. Unix Basics [4:20] (1 hour)
Most big data development and analysis happens on Unix systems. If you use a Mac or Unix, you need to learn how to talk to your computer using the command line.
Watch:

[Lecture 3: Linux and Server-Side Javascript]
[Lecture 4a: The Linux Command Line]

2. Try Git (1 hour)
Git is a version control system. It enables programmers to work together on large projects without overwriting each other’s work. Furthermore, it saves old versions of code in case you make a mistake and need to revert. It can also be a useful portfolio of your programming and analysis projects to show potential employers.

Analysis

Data Visualization Best Practices

1. Introduction to Infographics and Data Visualization (5 hours)
These videos are enjoyable and they make a nice break from the more technically challenging courses in this path. However, while the material in the course may be easy to understand, data visualization is a deeper topic than it seems. These examples should help illuminate what makes a good visualization and give ideas for some more creative ways to display information. You will also learn general principles of graphic design and visual perception.

Optional: Information Dashboard Design: The Effective Visual Communication of Data by Stephen Few - the classic book on dashboard design

Data Analysis with Python

Python has many libraries for statistics and data analysis. Commonly used ones include Pandas, SciPy, NumPy, and Scikit.
1. Introduction to Pandas (1 hour)
2. Explore the SciPy and NumPy libraries (5 hours)

Practical Machine Learning

Machine learning aims to extract knowledge from data, relying on fundamental concepts in computer science, statistics, probability and optimization. Learning algorithms enable a wide range of applications, from everyday tasks such as product recommendations and spam filtering to bleeding edge applications like self-driving cars and personalized medicine. In the age of ‘Big Data,’ with datasets rapidly growing in size and complexity and cloud computing becoming more pervasive, machine learning techniques are fast becoming a core component of large-scale data processing pipelines.

1. Introduction to Big Data with Apache Spark (30 hours, with Python)
This course teaches students how to use PySpark (part of Apache Spark) to put their data to work for decision support and to build data-intensive products and services, such as recommendation, prediction, and diagnostic systems.

Learn how to use Apache Spark to perform data analysis
How to use parallel programming to explore data sets
Apply Log Mining, Textual Entity Recognition and Collaborative Filtering to real world data questions
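
The parallel style used for tasks like log mining can be imitated in plain Python. The sketch below does not use Spark or any PySpark API; it only mimics the map-then-reduce pattern on a single machine, and the log format, the function names, and the two-way partitioning are all assumptions made for illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical web server log lines in the form "<ip> <status code>".
log_lines = [
    "10.0.0.1 200",
    "10.0.0.2 404",
    "10.0.0.1 200",
    "10.0.0.3 500",
    "10.0.0.2 200",
]


def map_partition(lines):
    """Map step: count status codes within one partition of the data."""
    return Counter(line.split()[1] for line in lines)


def merge_counts(left, right):
    """Reduce step: combine the partial counts of two partitions."""
    return left + right


# Split the data into partitions, the way a cluster would spread it
# across workers, process each partition, then combine the results.
partitions = [log_lines[:2], log_lines[2:]]
partial_counts = [map_partition(part) for part in partitions]
status_counts = reduce(merge_counts, partial_counts, Counter())

print(dict(status_counts))
```

Because each partition is processed independently and the merge step is associative, the same logic can be spread across many machines, which is the part a framework such as Spark takes care of.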

2. Scalable Machine Learning (35 hours - with Python and Spark)
This course introduces the underlying statistical and algorithmic principles required to develop scalable real-world machine learning pipelines. We present an integrated view of data processing by highlighting the various components of these pipelines, including exploratory data analysis, feature extraction, supervised learning, and model evaluation. You will gain hands-on experience applying these principles using Apache Spark, a cluster computing system well-suited for large-scale machine learning tasks. You will implement scalable algorithms for fundamental statistical models (linear regression, logistic regression, matrix factorization, principal component analysis) while tackling key problems from various domains: online advertising, personalized recommendation, and cognitive neuroscience.

The underlying statistical and algorithmic principles required to develop scalable real-world machine learning pipelines
Exploratory data analysis, feature extraction, supervised learning, and model evaluation
Application of these principles using Apache Spark
How to implement scalable algorithms for fundamental statistical models
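
To make one of those fundamental models concrete, here is a single-machine sketch of linear regression fitted by gradient descent. It is not the course's Spark implementation; the data, the learning rate, and the number of steps are assumptions picked so that this toy example converges.

```python
# Hypothetical data generated from y = 2x + 1 with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def fit_linear_regression(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = fit_linear_regression(xs, ys)
print(round(w, 4), round(b, 4))
```

The two gradient sums are the only parts that look at the data, so in a distributed setting each worker could compute them for its own partition and the partial sums could then be added together.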

Optional: Statistical Learning (30 hours - with R)
This is an introductory-level course in supervised learning, with a focus on regression and classification methods. The syllabus includes: linear and polynomial regression, logistic regression and linear discriminant analysis; cross-validation and the bootstrap, model selection and regularization methods (ridge and lasso); nonlinear models, splines and generalized additive models; tree-based methods, random forests and boosting; support-vector machines. Some unsupervised learning methods are discussed: principal components and clustering (k-means and hierarchical).
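
Cross-validation, one of the syllabus topics, boils down to splitting the data into folds and holding out each fold once for testing. The course itself works in R; the sketch below uses Python to stay consistent with the rest of this guide, and the helper name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold CV."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by <= 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(k_fold_splits(10, 3))
for train, test in splits:
    print(len(train), test)
```

In practice the indices are usually shuffled before splitting; that step is left out here to keep the output deterministic.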

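Hands-on Big Data Analysis with R

Note: R is not well suited to truly large-scale data applications. These courses are a supplement and can be skipped.
1. Try R (5 hours)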
R is a tool for statistics and data modeling. The R programming language is elegant, versatile, and has a highly expressive syntax designed around working with data. R is more than that, though — it also includes extremely powerful graphics capabilities. If you want to easily manipulate your data and present it in compelling ways, R is the tool for you.
This course will teach you the basics of R: data types, summary statistics, functions, and control structures.

2. The Analytics Edge (100 hours)

An applied understanding of many different analytics methods, including linear regression, logistic regression, CART, clustering, and data visualization
How to implement all of these methods in R
An applied understanding of mathematical optimization and how to solve optimization models in spreadsheet software

Author: 合歡樹
Link: http://www.zhihu.com/question/29265587/answer/46676970

