# The Data Scientist as I See It
## 1. Computer Science Foundation
### Data Structure
### Algorithm
- LeetCode questions (t)
### OO Programming
- Design Pattern (book) (l)
- Refactoring (book) (l)
### Functional Programming
- Scala (l)
### Debugging
### Linux / Shell script
- Common Linux commands (l)
> ls, cp, etc. (l)
- Network knowledge (scp, curl, ftp, http...) (l)
- Awk, Vim (l)
- Shell script (l)
### Makefile / CMake (l)
### Parallel Programming (l)
### C++
- C++ Primer (l)
- Accelerated C++ (l)
- 50 Principles of C++ (l)
### Database Management
- SQL (make a cheatsheet for it!) (m)
- Database optimization (listing this on a resume is more concrete than just saying 'database management') (l)
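
A possible seed for the SQL cheatsheet, sketched with Python's built-in `sqlite3` module so it runs without a database server. The `employees` table, its rows, and the index name are made up for illustration. The last query uses SQLite's `EXPLAIN QUERY PLAN`, which is one simple way to check whether an index is being used when looking at database optimization.

```python
import sqlite3

# In-memory database; the table and rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees (name, dept, salary) VALUES (?, ?, ?)",
    [("Ann", "eng", 120.0), ("Bob", "eng", 100.0), ("Cid", "ops", 80.0)],
)

# SELECT ... WHERE ... ORDER BY: filter rows, then sort them.
high_paid = conn.execute(
    "SELECT name FROM employees WHERE salary >= ? ORDER BY salary DESC", (100.0,)
).fetchall()

# GROUP BY with aggregates: one summary row per department.
by_dept = conn.execute(
    "SELECT dept, COUNT(*), AVG(salary) FROM employees GROUP BY dept ORDER BY dept"
).fetchall()

# Add an index, then ask SQLite how it plans to run a query that filters on it.
conn.execute("CREATE INDEX idx_employees_dept ON employees (dept)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM employees WHERE dept = ?", ("eng",)
).fetchall()

print(high_paid)
print(by_dept)
print(plan)
```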
-----
## 2. Python Skill
### Core Python knowledge
### Python Standard Library
### NumPy/SciPy/Matplotlib/pandas
- pandas DataFrame cheatsheet
- Introduction to NumPy
> make a NumPy cheatsheet! (t)
### Python Machine Learning Packages
- Theano
- Scikit-learn
> Scikit-learn's interface is not well designed for statistical analysis. It is mainly for ML (making predictions, classification).
> How does Scikit-learn implement regression? Does it use SGD or a direct matrix solve? How does it differ from statsmodels' regression implementation (which I am fairly sure uses a matrix-based solution)?
>- worth spending some time reading its source code
- statsmodels
> understand how they implement the basic OLS and GLM models
> learn how a statistical package is designed by reading the source code
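
The regression question above (an iterative gradient method versus a direct matrix solve) can be made concrete with a small pure-Python sketch. It is not the code of scikit-learn or statsmodels; it only contrasts the two approaches on made-up data with a single feature. `fit_closed_form` solves the normal equations, which for one feature reduce to covariance over variance, and `fit_gradient_descent` uses full-batch gradient descent (not stochastic) with a learning rate and step count chosen by hand.

```python
# Made-up data lying exactly on y = 1 + 2x, so both fits should approach (1, 2).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def fit_closed_form(xs, ys):
    """Solve the normal equations directly (one feature plus an intercept)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def fit_gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Minimize mean squared error with full-batch gradient descent."""
    n = len(xs)
    intercept, slope = 0.0, 0.0
    for _ in range(steps):
        residuals = [intercept + slope * x - y for x, y in zip(xs, ys)]
        grad_intercept = 2.0 / n * sum(residuals)
        grad_slope = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        intercept -= lr * grad_intercept
        slope -= lr * grad_slope
    return intercept, slope

a_cf, b_cf = fit_closed_form(xs, ys)
a_gd, b_gd = fit_gradient_descent(xs, ys)
print(a_cf, b_cf)
print(a_gd, b_gd)
```

The closed form gives the answer in one pass, while the iterative version needs a tuned learning rate and many passes, which is the trade-off worth looking for when reading the library source.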
### Python interfaces to Hadoop (impyla, HappyBase, etc.)
- impyla (spend some time to study the source code of impyla)
- happybase
----
## 3. Hadoop / AWS
### Hadoop Software
- HBase / Impala
- Buy "Hadoop: The Definitive Guide" (book)
### MapReduce
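
A minimal single-process sketch of the MapReduce idea, using word count as the example. It does not use Hadoop or any cluster API; the function names `map_phase`, `shuffle`, and `reduce_phase` and the input lines are assumptions chosen for illustration. In a real framework the shuffle step is done by the system between the map and reduce stages.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)

lines = ["big data big ideas", "data beats ideas"]  # made-up input
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```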
### AWS
- Set up instance / account / system
- Use AWS as a MapReduce tool for data analysis (EMR)
- AWS programming
----
## 4. Machine Learning (except deep learning)
### R language
- Cheatsheet of R's core syntax
- Use R to do data mining (data cleaning, preprocessing, machine learning)
- Use R for big data? (interface to Hadoop?)
### Common models
- SVM
- Logistic Regression
- Random Forest
- Boosted trees / boosting methods
### Optimization / Numerical Method
- Gradient Descent / SGD
- Newton's method
- How they are applied to solve ML problems, and how to program them
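
As one way to see how the methods listed above are programmed, here is a small sketch that fits a one-parameter logistic regression with both gradient descent and Newton's method. The data, the learning rate of 0.1, and the stopping tolerance are all made up for illustration. The labels overlap on purpose so the optimum is finite.

```python
import math

# Made-up 1-D data with overlapping labels, so the best weight is finite.
xs = [-2.0, -1.0, 1.0, 2.0, 1.0, -1.0]
ys = [0, 0, 1, 1, 0, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad(w):
    """First derivative of the logistic negative log-likelihood in w."""
    return sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys))

def hess(w):
    """Second derivative of the same loss; positive for this data."""
    return sum(sigmoid(w * x) * (1.0 - sigmoid(w * x)) * x * x for x in xs)

def minimize(step, w=0.0, tol=1e-10, max_iter=10_000):
    """Repeat w <- w - step(w) until the gradient is close to zero."""
    for i in range(max_iter):
        if abs(grad(w)) < tol:
            return w, i
        w -= step(w)
    return w, max_iter

# Gradient descent: fixed learning rate times the gradient.
w_gd, n_gd = minimize(lambda w: 0.1 * grad(w))
# Newton's method: gradient scaled by the inverse second derivative.
w_newton, n_newton = minimize(lambda w: grad(w) / hess(w))
print(w_gd, n_gd)
print(w_newton, n_newton)
```

Both runs should reach the same weight. Newton's method uses curvature information and typically needs far fewer iterations, at the cost of computing the second derivative (a full Hessian in more than one dimension).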
### Learn through examples and practice
- Study Kaggle examples and the Kaggle blog
+ study an example blog post on email spam and learn how they do such a task
+ study an example blog post on text classification
----
## 5. Deep Learning
### Area:
- Image Processing
- NLP
- Finance
### Model:
- CNN
- RNN
+ RNN's theory and implementation (Torch or Caffe)
+ RNN's application (and preprocessing)
+ Watch Oxford's online course (about using Torch + RNN)
+ etc...
- Unsupervised Learning
- Reinforcement Learning
### Feature Engineering
### Tools:
- Caffe
> TO DO:
> Besides images, what else can Caffe do?
> Considering my future career, I am probably not very interested in image data.
> I am more interested in NLP, finance, and other applications. Is Caffe still a good choice for them?
> Does Caffe have a good RNN implementation?
- Torch
- Python/Theano
### Theory of Neural Network
> Forward/backward computation
> Loss function selection
> Optimization method (Alex has a good paper that covers this theory)
> How loss functions and optimization apply to different domains (NLP, etc.)
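
To make the forward/backward computation item concrete, here is a minimal sketch for a network with one tanh hidden unit and a squared-error loss. The architecture, weights, and input values are arbitrary choices for illustration, not taken from any framework. The analytic gradients from the backward pass are compared with finite-difference estimates, a common sanity check.

```python
import math

def forward(w1, w2, x):
    """Forward pass: x -> h = tanh(w1 * x) -> y_hat = w2 * h."""
    h = math.tanh(w1 * x)
    return h, w2 * h

def loss(w1, w2, x, y):
    """Squared-error loss for a single example."""
    _, y_hat = forward(w1, w2, x)
    return 0.5 * (y_hat - y) ** 2

def backward(w1, w2, x, y):
    """Backward pass: chain rule from the loss back to each weight."""
    h, y_hat = forward(w1, w2, x)
    d_yhat = y_hat - y               # dL/dy_hat
    d_w2 = d_yhat * h                # dL/dw2
    d_h = d_yhat * w2                # dL/dh
    d_w1 = d_h * (1.0 - h * h) * x   # tanh'(z) = 1 - tanh(z)^2
    return d_w1, d_w2

# Arbitrary illustrative values.
w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
g1, g2 = backward(w1, w2, x, y)

# Central finite differences as a check on the analytic gradients.
eps = 1e-6
n1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
n2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)
print(g1, n1)
print(g2, n2)
```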
----
## 6. Statistics / Math
### A/B test
- when to use which test
- Read the book "Introduction to Biostatistics"
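
As one example for the "when to use which test" item: when an A/B test compares two conversion rates with large samples, a two-proportion z-test is a common choice. The sketch below is a plain-Python version under that assumption; the function name `two_proportion_z_test` and the experiment counts are hypothetical.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    p_value = 2.0 * (1.0 - cdf)
    return z, p_value

# Hypothetical experiment: 200 of 2000 convert in A, 250 of 2000 in B.
z, p = two_proportion_z_test(200, 2000, 250, 2000)
print(z, p)
```

For small samples or non-binary metrics a different test would be the better fit, which is the decision this item is about.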
### Mathematical Statistics (2nd year)
### ANOVA
- How to interpret the results and the underlying concepts
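
To help with interpreting ANOVA output, it can be useful to compute the F statistic by hand once. The sketch below does a one-way ANOVA in plain Python: F is the ratio of between-group variance to within-group variance. The function name `one_way_anova_f` and the three small groups are made up for illustration, and the p-value lookup from the F distribution is left out.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up measurements for three groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f_stat, df_b, df_w)
```

A large F means the group means differ by more than the spread inside each group would explain.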
### Probability Theory
### Stochastic Process (important for finance)
----
## 7. Domain Knowledge
### Finance
### Signal Processing
-----
## 8. Data Scientist Job Seeking
- Keep reading data scientist/engineer job requirements from different industries
- Keep reading interview questions / feedback on Glassdoor
> Do they focus more on statistics or on engineering/CS?
> Do they require a coding test?
> What kind of statistics questions do they ask?
> What kind of CS background do they require?
> Which projects are useful for getting an interview?
> How to connect data science with finance?
> How to connect data science with a future career and business?
- Do coding challenges (LeetCode and others)
## 9. Career Development
-------------
##### ==========================================
## Priority list:
- t: top priority
- m: middle priority
- l: low priority