# The Data Scientist as I See It

2016-02-22, abrocod

## 1. Computer Science Foundation

### Data Structure

### Algorithm

- LeetCode questions (t)

### OO Programming

- Design Patterns (book) (l)

- Refactoring (book) (l)

### Functional Programming

- Scala (l)

### Debugging

### Linux / Shell script

- Common Linux commands (l)

> ls, cp, (l)

- Network knowledge (scp, curl, ftp, http...) (l)

- Awk, Vim (l)

- Shell script (l)

### Makefile / CMake (l)

### Parallel Programming (l)

### C++

- C++ Primer (l)

- Accelerated C++ (l)

- 50 Principles of C++ (l)

### Database Management

- SQL (make a cheatsheet for it!) (m)

- Database optimization (listing this on a resume is more concrete than just saying 'database management') (l)
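
As a starting point for the SQL cheatsheet, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` table, its rows, and the index name are made up for illustration. It shows a grouped aggregate query and one basic optimization step (adding an index and inspecting the query plan).

```python
import sqlite3

# In-memory database; the `orders` table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5), ("carol", 20.0)],
)

# Aggregate, filter groups with HAVING, and sort: three cheatsheet staples.
totals = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) >= 20
    ORDER BY total DESC
    """
).fetchall()
print(totals)

# A basic optimization step: index the column used in WHERE, then inspect the plan.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = ?", ("alice",)
).fetchall()
print(plan)
```

The exact wording of the query plan varies between SQLite versions, so treat the printed plan as something to read rather than to parse.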

-----

## 2. Python Skill

### Core Python knowledge

### Python Standard Library

### NumPy/SciPy/Matplotlib/pandas

- pandas DataFrame cheatsheet

- Introduction to NumPy

> Make a NumPy cheatsheet! (t)

### Python's Machine Learning Package

- Theano

- Scikit-learn

> Scikit-learn's interface is not well designed for statistical analysis. It is mainly for ML (making predictions, classification).

> Does Scikit-learn's regression implementation use SGD or plain matrix operations? How does it differ from statsmodels' regression implementation (which I am fairly sure just uses matrix operations)?

>- Worth spending some time reading its source code

- statsmodels

> Understand how they implement basic OLS and GLM models

> Learn the design of a statistical package by reading its source code
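
The question above contrasts two ways of fitting a linear regression: direct matrix operations and gradient-based iteration. The sketch below compares them on synthetic data using only NumPy. It does not claim to reproduce what scikit-learn or statsmodels do internally; the data, learning rate, and iteration count are assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up for illustration): y = 2 + 3*x1 - 1*x2 + noise.
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # first column is the intercept
true_beta = np.array([2.0, 3.0, -1.0])
y = X @ true_beta + 0.1 * rng.normal(size=n)

# Approach 1: direct linear algebra. lstsq minimizes ||X b - y||^2
# without forming an explicit inverse of X^T X.
beta_direct, *_ = np.linalg.lstsq(X, y, rcond=None)

# Approach 2: batch gradient descent on the mean squared error.
beta_gd = np.zeros(X.shape[1])
learning_rate = 0.1
for _ in range(2000):
    gradient = 2.0 / n * X.T @ (X @ beta_gd - y)
    beta_gd -= learning_rate * gradient

print(beta_direct)
print(beta_gd)
```

Both approaches should agree closely on a small, well-conditioned problem like this one. The iterative route mainly pays off when the data is too large to factorize in memory.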

### Python's interface to Hadoop (Impyla, Happybase, etc...)

- impyla (spend some time to study the source code of impyla)

- happybase

----

## 3. Hadoop / AWS

### Hadoop Software

- HBase / Impala

- Buy Hadoop: The Definitive Guide (book)

### MapReduce

### AWS

- Set up instance / account / system

- Use AWS as a MapReduce tool for data analysis (EMR)

- AWS programming

----

## 4. Machine Learning (except deep learning)

### R language

- Cheatsheet of R's core syntax

- Use R to do data mining (data cleaning, preprocessing, machine learning)

- Use R for big data? (interface to Hadoop?)

### Common models

- SVM

- Logistic Regression

- Random Forest

- Boosted trees / boosting methods

### Optimization / Numerical Method

- Gradient Descent / SGD

- Newton's method

- How they are applied to solve ML problems, and how to program them
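
To make the "how to program them" item concrete, here is a minimal sketch that fits a one-parameter logistic regression with both gradient descent and Newton's method, using only the standard library. The toy data, learning rate, and tolerances are assumptions chosen for illustration. With a single parameter the Hessian is a scalar, which keeps the Newton update easy to read.

```python
import math

# Toy data, made up for illustration: one feature, binary labels, no intercept.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(w):
    """First derivative of the negative log-likelihood with respect to w."""
    return sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys))

def hessian(w):
    """Second derivative; always positive here, so the loss is convex."""
    return sum(sigmoid(w * x) * (1.0 - sigmoid(w * x)) * x * x for x in xs)

def gradient_descent(w=0.0, learning_rate=0.1, tol=1e-10, max_steps=10_000):
    for step in range(1, max_steps + 1):
        g = gradient(w)
        if abs(g) < tol:
            return w, step
        w -= learning_rate * g  # fixed step size along the negative gradient
    return w, max_steps

def newton(w=0.0, tol=1e-10, max_steps=100):
    for step in range(1, max_steps + 1):
        g = gradient(w)
        if abs(g) < tol:
            return w, step
        w -= g / hessian(w)  # rescale the step by the local curvature
    return w, max_steps

w_gd, steps_gd = gradient_descent()
w_newton, steps_newton = newton()
print(f"gradient descent: w={w_gd:.6f} after {steps_gd} steps")
print(f"Newton's method:  w={w_newton:.6f} after {steps_newton} steps")
```

Both methods reach the same minimizer. Newton's method needs far fewer iterations because it uses curvature, at the cost of computing (and, in higher dimensions, solving with) the Hessian.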

### Learn through examples and practice

- Study Kaggle examples and the Kaggle blog

+ Study an example blog post on email spam filtering and learn how they do such a task

+ Study an example blog post on text classification

----

## 5. Deep Learning

### Area:

- Image Processing

- NLP

- Finance

### Model:

- CNN

- RNN

+ RNN's theory and implementation (Torch or Caffe)

+ RNN's application (and preprocessing)

+ Watch Oxford's online course (about using Torch + RNN)

+ etc...

- Unsupervised Learning

- Reinforcement Learning

### Feature Engineering

### Tools:

- Caffe

> TO DO:

> Besides images, what else can Caffe do?

> Considering my future career, I am probably not very interested in image data.

> I am more interested in NLP, finance, and other applications. Is Caffe still a good choice for them?

> Does Caffe have a good RNN implementation?

- Torch

- Python/Theano

### Theory of Neural Network

> Forward/backward computation

> Loss function selection

> Optimization methods (Alex has a good paper that covers this theory)

> How loss functions/optimization apply to different domains (NLP, etc.)
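
As a small illustration of forward/backward computation, the sketch below runs a forward pass through a single sigmoid neuron with a cross-entropy loss, computes the gradient with the chain rule, and checks it against a finite-difference estimate. The parameter and input values are hypothetical; real networks stack many such units, but the gradient check works the same way.

```python
import math

def forward(w, b, x):
    """Forward pass: linear score, then sigmoid activation."""
    z = w * x + b
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy for one example."""
    p = forward(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def backward(w, b, x, y):
    """Backward pass via the chain rule: dL/dz = p - y for sigmoid + cross-entropy."""
    p = forward(w, b, x)
    dz = p - y
    return dz * x, dz  # (dL/dw, dL/db)

# Hypothetical parameter and input values, chosen only for the demonstration.
w, b, x, y = 0.5, -0.2, 1.5, 1

grad_w, grad_b = backward(w, b, x, y)

# Numerical check with central differences.
eps = 1e-6
num_w = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
num_b = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
print(grad_w, num_w)
print(grad_b, num_b)
```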

----

## 6. Statistics / Math

### A/B test

- when to use which test

- Read the book "Introduction to Biostatistics"
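
One common A/B testing case is comparing two conversion rates with large samples, where a two-proportion z-test is a standard choice (a continuous metric such as revenue per user would more typically call for a t-test). The sketch below implements that z-test with the standard library only. The function name and the conversion counts are made up for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value under the standard normal:
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts: 200/2000 conversions for variant A, 260/2000 for variant B.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```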

### Mathematical Statistics (2nd year)

### ANOVA

- How to interpret the results and concepts

### Probability Theory

### Stochastic Process (important for finance)

----

## 7. Domain Knowledge

### Finance

### Signal Processing

-----

## 8. Data Scientist Job Seeking

- Keep reading data scientist/engineer job requirements from different industries

- Keep reading interview questions / feedback on Glassdoor

> Do they focus more on statistics or engineering/CS?

> Do they require a coding test?

> What kind of statistics questions do they ask?

> What kind of CS background do they require?

> Which projects are useful for getting interviews?

> How to connect data science with finance?

> How to connect data science with future career and business?

- do coding challenges (LeetCode, others ... )

## 9. Career Development

----

## Priority list:

- t: top priority

- m: middle priority

- l: low priority
