Introduction to Data Science in Python
These are the author's study notes for the Coursera course Introduction to Data Science in Python, provided for reference only.
1. 50 Years of Data Science
(1) Data Exploration and Preparation
(2) Data Representation and Transformation
(3) Computing with Data
(4) Data Modeling
(5) Data Visualization and Presentation
(6) Science about Data Science
2. Functions
def add_numbers(x, y, z=None, flag=False):
    if flag:
        print('Flag is true!')
    if z is None:
        return x + y
    else:
        return x + y + z
print(add_numbers(1, 2, flag=True))
Assign the function add_numbers to a variable a, then call it through that name:
a = add_numbers
a(1, 2, flag=True)
3. Checking data types
type('This is a string')
-> str
type(None)
-> NoneType
4. Tuples
Tuples are an immutable data structure (they cannot be altered).
x = (1, 'a', 2, 'b')
type(x)
->tuple
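A minimal sketch of what "immutable" means in practice: item assignment on a tuple raises a TypeError, while building a new tuple from the old one works. The names point, mutated and extended are illustrative only.

```python
point = (1, 'a', 2, 'b')

# Item assignment is not supported on tuples.
try:
    point[0] = 99
    mutated = True
except TypeError:
    mutated = False

# "Changing" a tuple really means building a new one.
extended = point + (3.3,)

print(mutated)
print(extended)
```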
5. Lists
Lists are a mutable data structure.
x = [1, 'a', 2, 'b']
type(x)
->list
6. Append
Use append to add an object to the end of a list.
x.append(3.3)
print(x)
->[1, 'a', 2, 'b', 3.3]
7. Loop through each item in the list
for item in x:
    print(item)
->1
a
2
b
3.3
8. Using the indexing operator to loop through each item in the list
i = 0
while i != len(x):
    print(x[i])
    i = i + 1
->1
a
2
b
3.3
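When both the index and the item are needed, enumerate is a common alternative to a manual counter. The sketch below assumes the same list contents as above; the names items and pairs are illustrative.

```python
items = [1, 'a', 2, 'b', 3.3]

# enumerate yields (index, value) pairs, so no manual counter is needed.
pairs = []
for i, item in enumerate(items):
    pairs.append((i, item))
    print(i, item)
```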
9. Basic list operations
(1) Use + to concatenate lists
[1, 2] + [3, 4]
-> [1, 2, 3, 4]
(2) Use * to repeat lists
[1]*3
->[1, 1, 1]
(3) Use the in operator to check if something is inside a list
1 in [1, 2, 3]
->True
10. Basic string operations
(1) Use bracket notation to slice a string.
x = 'This is a string'
print(x[0])
->T
print(x[0:1])
->T
print(x[0:2])
->Th
print(x[-1]) # the last element
->g
print(x[-4:-2]) # start from the 4th element from the end and stop before the 2nd element from the end
->ri
x[:3] # This is a slice from the beginning of the string, stopping before index 3 (the 4th character).
->Thi
x[3:] # this is a slice starting from the 4th element of the string and going all the way to the end.
-> s is a string
(2) The +, * and in operators also work on strings
firstname = 'Christopher'
lastname = 'Brooks'
print(firstname + ' ' + lastname)
->Christopher Brooks
print(firstname*3)
->ChristopherChristopherChristopher
print('Chris' in firstname)
->True
(3) Split returns a list of all the words in a string, or a list split on a specific character.
firstname = 'Christopher Arthur Hansen Brooks'.split(' ')[0]
lastname = 'Christopher Arthur Hansen Brooks'.split(' ')[-1]
print(firstname)
->Christopher
print(lastname)
->Brooks
(4) Make sure you convert objects to strings before concatenating.
'Chris' + 2
->TypeError (a str and an int cannot be concatenated)
'Chris' + str(2)
->Chris2
11. Dictionaries
(1) Dictionaries associate keys with values
x = {'Christopher Brooks': 'brooksch@umich.edu', 'Bill Gates': 'billg@microsoft.com'}
x['Christopher Brooks']
->brooksch@umich.edu
x['Kevyn Collins-Thompson'] = None
x['Kevyn Collins-Thompson']
->(no output, because the value is None)
(2) Iterate over all of the keys:
for name in x:
    print(x[name])
->brooksch@umich.edu
billg@microsoft.com
None
(3) Iterate over all of the values:
for email in x.values():
    print(email)
->brooksch@umich.edu
billg@microsoft.com
None
(4) Iterate over all of the items in the dictionary:
for name, email in x.items():
    print(name)
    print(email)
->Christopher Brooks
brooksch@umich.edu
Bill Gates
billg@microsoft.com
Kevyn Collins-Thompson
None
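A key stored with the value None is different from a key that was never added. The sketch below shows one way to tell the two apart with the in operator and dict.get; the fallback string 'unknown' is an arbitrary choice for illustration.

```python
emails = {
    'Christopher Brooks': 'brooksch@umich.edu',
    'Kevyn Collins-Thompson': None,
}

# The key exists, but its value is None.
has_kevyn = 'Kevyn Collins-Thompson' in emails

# A key that was never added: indexing would raise KeyError,
# while get() returns the supplied fallback instead.
has_bill = 'Bill Gates' in emails
fallback = emails.get('Bill Gates', 'unknown')

print(has_kevyn, has_bill, fallback)
```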
(5) Unpack a sequence into different variables:
x = ('Christopher', 'Brooks', 'brooksch@umich.edu')
fname, lname, email = x
fname
->Christopher
lname
->Brooks
(6) Make sure the number of values you are unpacking matches the number of variables being assigned.
x = ('Christopher', 'Brooks', 'brooksch@umich.edu', 'Ann Arbor')
fname, lname, email = x
->ValueError: too many values to unpack (expected 3)
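As a hedged illustration of the mismatch above, the sketch below catches the ValueError and then unpacks only the first three values with a slice. The variable names record and message are illustrative, and slicing is just one possible fix.

```python
record = ('Christopher', 'Brooks', 'brooksch@umich.edu', 'Ann Arbor')

try:
    fname, lname, email = record  # four values, three names
    message = 'ok'
except ValueError as err:
    message = str(err)

# Slicing down to the first three values makes the counts match.
fname, lname, email = record[:3]

print(message)
print(fname, lname, email)
```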
12. More on Strings
(1) Simple Samples
print('Chris' + 2)
->TypeError
print('Chris' + str(2))
->Chris2
(2) Python has a built-in method for convenient string formatting.
sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris' }
sales_statement = '{} bought {} item(s) at a price of {} each for a total of {}'
print(sales_statement.format(sales_record['person'], sales_record['num_items'], sales_record['price'], sales_record['num_items']*sales_record['price']))
->Chris bought 4 item(s) at a price of 3.24 each for a total of 12.96
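The same statement can also be written with named placeholders, which avoids repeating the dictionary lookups. This is a sketch of an alternative, not the course's version: the names template and total, and the {total:.2f} format spec, are assumptions added here.

```python
sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris'}

# Named placeholders are filled from the dictionary's keys via ** unpacking.
template = ('{person} bought {num_items} item(s) at a price of {price} each '
            'for a total of {total:.2f}')
total = sales_record['num_items'] * sales_record['price']
statement = template.format(total=total, **sales_record)

print(statement)
```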
13. Reading and Writing CSV files
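(1) Import csv and read the file
import csv
%precision 2  # IPython magic: display floats with 2 decimal places
with open('mpg.csv') as csvfile:
    mpg = list(csv.DictReader(csvfile))  # convert csvfile into a list of dictionaries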
mpg[:3]
->
[OrderedDict([('', '1'),
('manufacturer', 'audi'),
('model', 'a4'),
('displ', '1.8'),
('year', '1999'),
('cyl', '4'),
('trans', 'auto(l5)'),
('drv', 'f'),
('cty', '18'),
('hwy', '29'),
('fl', 'p'),
('class', 'compact')]),
OrderedDict([('', '2'),
('manufacturer', 'audi'),
('model', 'a4'),
('displ', '1.8'),
('year', '1999'),
('cyl', '4'),
('trans', 'manual(m5)'),
('drv', 'f'),
('cty', '21'),
('hwy', '29'),
('fl', 'p'),
('class', 'compact')]),
OrderedDict([('', '3'),
('manufacturer', 'audi'),
('model', 'a4'),
('displ', '2'),
('year', '2008'),
('cyl', '4'),
('trans', 'manual(m6)'),
('drv', 'f'),
('cty', '20'),
('hwy', '31'),
('fl', 'p'),
('class', 'compact')])]
(2) Check the length of the list
len(mpg)
->234
(3) keys gives us the column names of our csv
mpg[0].keys()
->odict_keys(['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class'])
(4) Find the average hwy fuel economy across all cars. All values in the dictionaries are strings, so we need to convert them to float.
sum(float(d['hwy']) for d in mpg) / len(mpg)
->23.44
(5) Use set to return the unique values for the number of cylinders the cars in our dataset have.
cylinders = set(d['cyl'] for d in mpg)
cylinders
->{'4', '5', '6', '8'}
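(6) Group the cars by number of cylinders and find the average cty mpg for each group.
CtyMpgByCyl = []
for c in cylinders:  # iterate over all the cylinder levels
    summpg = 0
    cyltypecount = 0
    for d in mpg:  # iterate over all rows
        if d['cyl'] == c:  # the row belongs to this cylinder level
            summpg += float(d['cty'])  # add its cty mpg to the running sum
            cyltypecount += 1  # and count it
    CtyMpgByCyl.append((c, summpg / cyltypecount))  # (cylinders, average mpg)
CtyMpgByCyl.sort(key=lambda x: x[0])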
CtyMpgByCyl
->[('4', 21.01), ('5', 20.50), ('6', 16.22), ('8', 12.57)]
(7) Use set to return the unique values for the class types in our dataset
vehicleclass = set(d['class'] for d in mpg)
vehicleclass
->{'2seater', 'compact', 'midsize', 'minivan', 'pickup', 'subcompact', 'suv'}
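(8) Find the average hwy mpg for each class of vehicle in our dataset.
HwyMpgByClass = []
for t in vehicleclass:  # iterate over all the vehicle classes
    summpg = 0
    vclasscount = 0
    for d in mpg:  # iterate over all rows
        if d['class'] == t:  # the row belongs to this class
            summpg += float(d['hwy'])  # add its hwy mpg to the running sum
            vclasscount += 1  # and count it
    HwyMpgByClass.append((t, summpg / vclasscount))  # (class, average mpg)
HwyMpgByClass.sort(key=lambda x: x[1])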
HwyMpgByClass
->
[('pickup', 16.88),
('suv', 18.13),
('minivan', 22.36),
('2seater', 24.80),
('midsize', 27.29),
('subcompact', 28.14),
('compact', 28.30)]
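The nested loops above scan the whole dataset once per group. A single-pass alternative, sketched here with collections.defaultdict, accumulates a sum and a count per class. The rows below are made-up sample data standing in for mpg.csv (only the column names class and hwy are taken from the file), and the names totals and averages are illustrative.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows standing in for mpg.csv.
sample = io.StringIO(
    "class,hwy\n"
    "compact,29\n"
    "compact,31\n"
    "suv,18\n"
    "suv,20\n"
    "pickup,17\n"
)
rows = list(csv.DictReader(sample))

# Accumulate [sum, count] per class in one pass over the rows.
totals = defaultdict(lambda: [0.0, 0])
for row in rows:
    totals[row['class']][0] += float(row['hwy'])
    totals[row['class']][1] += 1

# Turn the totals into (class, average) pairs sorted by average.
averages = sorted(
    ((cls, total / count) for cls, (total, count) in totals.items()),
    key=lambda pair: pair[1],
)
print(averages)
```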
14. Dates and Times
(1) Import the datetime and time modules
import datetime as dt
import time as tm
(2) Time returns the current time in seconds since the Epoch
tm.time()
->1583932727.90
(3) Convert the timestamp to datetime
dtnow = dt.datetime.fromtimestamp(tm.time())
dtnow
->
datetime.datetime(2020, 3, 11, 13, 18, 56, 990293)
(4) Handy datetime attributes: get year, month, day, etc. from a datetime
dtnow.year, dtnow.month, dtnow.day, dtnow.hour, dtnow.minute, dtnow.second
->(2020, 3, 11, 13, 18, 56)
(5) Timedelta is a duration expressing the difference between two dates.
delta = dt.timedelta(days = 100)
delta
->datetime.timedelta(100)
(6) date.today returns the current local date
today = dt.date.today()
today
->datetime.date(2020, 3, 11)
(7) the date 100 days ago
today - delta
->datetime.date(2019, 12, 2)
(8) compare dates
today > today - delta
-> True
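Because today() changes from day to day, a fixed date makes the arithmetic easier to check. The sketch below uses the date from the outputs above; the names start, earlier and gap are illustrative.

```python
import datetime as dt

start = dt.date(2020, 3, 11)
delta = dt.timedelta(days=100)

earlier = start - delta   # date minus timedelta gives a date
gap = start - earlier     # date minus date gives a timedelta

print(earlier)
print(gap.days)
```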
15. Objects and map()
(1) an example of a class in python:
class Person:
    department = 'School of Information'  # a class variable
    def set_name(self, new_name):  # a method
        self.name = new_name
    def set_location(self, new_location):
        self.location = new_location
person = Person()
person.set_name('Christopher Brooks')
person.set_location('Ann Arbor, MI, USA')
print('{} live in {} and work in the department {}'.format(person.name, person.location, person.department))
(2) mapping the min function between two lists
store1 = [10.00, 11.00, 12.34, 2.34]
store2 = [9.00, 11.10, 12.34, 2.01]
cheapest = map(min, store1, store2)
cheapest
-><map at 0x7f74034a8860>
(3) iterate through the map object to see the values
for item in cheapest:
    print(item)
->
9.0
11.0
12.34
2.01
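A map object is a one-shot iterator: once it has been looped over, it yields nothing more. The sketch below shows this with the same two lists; the names first_pass and second_pass are illustrative.

```python
store1 = [10.00, 11.00, 12.34, 2.34]
store2 = [9.00, 11.10, 12.34, 2.01]

cheapest = map(min, store1, store2)

first_pass = list(cheapest)   # consumes the iterator
second_pass = list(cheapest)  # nothing left to yield

print(first_pass)
print(second_pass)
```

Wrapping the map in list() right away is a simple way to keep the values around for reuse.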
16. Lambda and List Comprehensions
(1) an example of lambda that takes in three parameters and adds the first two
my_function = lambda a, b, c: a+b
my_function(1, 2, 3)
->3
(2) iterate from 0 to 999 and return the even numbers.
my_list = []
for number in range(0, 1000):
    if number % 2 == 0:
        my_list.append(number)
my_list
->[0, 2, 4,...]
(3) Now the same thing but with list comprehension
my_list = [number for number in range(0, 1000) if number % 2 == 0]
my_list
->[0, 2, 4,...]
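A smaller sketch of the same two ideas, easier to verify by eye: a loop and its equivalent comprehension, plus a lambda used as a sort key. The data and the names squares_loop, squares_comp, pairs and by_value are made up for illustration.

```python
# Loop version.
squares_loop = []
for n in range(10):
    if n % 2 == 0:
        squares_loop.append(n * n)

# Equivalent comprehension: expression, then loop, then filter.
squares_comp = [n * n for n in range(10) if n % 2 == 0]

# A lambda used as a sort key, ordering pairs by their second field.
pairs = [('a', 3), ('b', 1), ('c', 2)]
by_value = sorted(pairs, key=lambda pair: pair[1])

print(squares_comp)
print(by_value)
```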
17. Numpy
(1) import package
import numpy as np
18. Creating arrays (from tuples or lists)
(1) create a list and convert it to a numpy array
mylist = [1, 2, 3]
x = np.array(mylist)
x
->array([1, 2, 3])
(2) just pass in a list directly
y = np.array([4, 5, 6])
y
->array([4, 5, 6])
(3) pass in a list of lists to create a multidimensional array
m = np.array([[7, 8, 9], [10, 11, 12]])
m
->
array([[ 7, 8, 9],
[10, 11, 12]])
(4) use the shape attribute to find the dimensions of the array
m.shape
->(2, 3)
(5) arange returns evenly spaced values within a given interval
n = np.arange(0, 30, 2)
n
->array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28])
(6) reshape returns an array with the same data with a new shape
n = n.reshape(3, 5)
n
->
array([[ 0, 2, 4, 6, 8],
[10, 12, 14, 16, 18],
[20, 22, 24, 26, 28]])
(7) linspace returns evenly spaced numbers over a specified interval
o = np.linspace(0, 4, 9)
o
->array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])
(8) resize changes the shape and size of the array in place
o.resize(3, 3)
o
->
array([[ 0. , 0.5, 1. ],
[ 1.5, 2. , 2.5],
[ 3. , 3.5, 4. ]])
(9) ones returns a new array of given shape and type, filled with ones
np.ones((3, 2))
->
array([[ 1., 1.],
[ 1., 1.],
[ 1., 1.]])
(10) zeros returns a new array of given shape and type, filled with zeros
np.zeros((2,3))
->
array([[ 0., 0., 0.],
[ 0., 0., 0.]])
(11) eye returns a 2D array with ones on the diagonal and zeros elsewhere
np.eye(3)
->
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
(12) diag extracts a diagonal or constructs a diagonal array
np.diag(y)
->
array([[4, 0, 0],
[0, 5, 0],
[0, 0, 6]])
(13) create an array using a repeating list
np.array([1, 2, 3]*3)
->array([1, 2, 3, 1, 2, 3, 1, 2, 3])
(14) repeat elements of an array using repeat
np.repeat([1, 2, 3], 3)
->array([1, 1, 1, 2, 2, 2, 3, 3, 3])
(15) combine arrays
p = np.ones([2, 3], int)
p
->
array([[1, 1, 1],
[1, 1, 1]])
(16) use vstack to stack arrays in sequence vertically (row wise).
np.vstack([p, 2*p])
->
array([[1, 1, 1],
[1, 1, 1],
[2, 2, 2],
[2, 2, 2]])
(17) use hstack to stack arrays in sequence horizontally (column wise).
np.hstack([p, 2*p])
->
array([[1, 1, 1, 2, 2, 2],
[1, 1, 1, 2, 2, 2]])
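The difference between reshape and resize is easy to miss: reshape returns a reshaped array and leaves the original's shape alone, while resize modifies the array in place and returns None. A minimal sketch, assuming NumPy is installed; the names reshaped, shape_after_reshape and result are illustrative.

```python
import numpy as np

n = np.arange(0, 30, 2)

reshaped = n.reshape(3, 5)     # returns a reshaped array
shape_after_reshape = n.shape  # n itself is still 1-D

o = np.linspace(0, 4, 9)
result = o.resize(3, 3)        # modifies o in place and returns None

print(shape_after_reshape, reshaped.shape)
print(result, o.shape)
```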
19. Operations
(1) element wise + - * /
print(x+y)
print(x-y)
->
[5 7 9]
[-3 -3 -3]
print(x*y)
print(x/y)
->
[ 4 10 18]
[ 0.25 0.4 0.5 ]
print(x**2)
->[1 4 9]
(2) Dot Product
x.dot(y) # x1y1+x2y2+x3y3
->32
(3) build a 2D array from y and y**2; len returns the number of rows
z = np.array([y, y**2])
print(z)
print(len(z)) #number of rows of array
->
[[ 4 5 6]
[16 25 36]]
2
(4) transpose array
z
->
[[ 4 5 6]
[16 25 36]]
z.T
->
array([[ 4, 16],
[ 5, 25],
[ 6, 36]])
(5) use .dtype to see the data type of the elements in the array
z.dtype
->dtype('int64')
(6) use .astype to cast to a specific type
z = z.astype('f')
z.dtype
->dtype('float32')
(7) math functions
a = np.array([-4, -2, 1, 3, 5])
a.sum()
->3
a.max()
->5
a.min()
->-4
a.mean()
->0.6
a.std()
->3.2619012860600183
a.argmax()
->4
a.argmin()
->0
(8) indexing / slicing
s = np.arange(13)**2
s
->array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144])
(9) use bracket notation to get the value at a specific index
s[0], s[4], s[-1]
->(0, 16, 144)
(10) use : to indicate a range: array[start:stop]
s[1:5]
->array([ 1, 4, 9, 16])
(11) use negatives to count from the back
s[-4:]
->array([ 81, 100, 121, 144])
(12) A second : can be used to indicate the step size: array[start:stop:stepsize]
Here we start at the 5th element from the end and count backwards by 2 until the beginning of the array is reached.
s[-5::-2]
->array([64, 36, 16, 4, 0])
(13) look at the multidimensional array
r = np.arange(36)
r.resize((6,6))
r
->
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35]])
(14) use bracket notation to slice
r[2, 2]
->14
(15) use : to select a range of rows or columns
r[3, 3:6]
->array([21, 22, 23])
(16) select all the rows up to (but not including) row 2, and all the columns up to (but not including) the last column
r[:2, :-1]
->
array([[ 0, 1, 2, 3, 4],
[ 6, 7, 8, 9, 10]])
(17) a slice of last row, only every other element
r[-1, ::2]
->array([30, 32, 34])
(18) perform conditional indexing.
r[r > 30]
->array([31, 32, 33, 34, 35])
(19) assigning all values in the array that are greater than 30 to the value of 30
r[r > 30] = 30
r
->
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 30, 30, 30, 30, 30]])
(20) copy and modify arrays: a slice is a view onto the original array, not a copy
r2 = r[:3, :3]
r2
->
array([[ 0, 1, 2],
[ 6, 7, 8],
[12, 13, 14]])
(21) set this slice's values to zero ([:] selects the entire array)
r2[:] = 0
r2
->
array([[0, 0, 0],
[0, 0, 0],
[0, 0, 0]])
(22) r has also been changed
r
->
array([[ 0, 0, 0, 3, 4, 5],
[ 0, 0, 0, 9, 10, 11],
[ 0, 0, 0, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 30, 30, 30, 30, 30]])
(23) to avoid this, use .copy()
r_copy = r.copy()
r_copy
->
array([[ 0, 0, 0, 3, 4, 5],
[ 0, 0, 0, 9, 10, 11],
[ 0, 0, 0, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 30, 30, 30, 30, 30]])
(24) now when r_copy is modified, r will not be changed
r_copy[:] = 10
print(r_copy, '\n')
print(r)
->
[[10 10 10 10 10 10]
[10 10 10 10 10 10]
[10 10 10 10 10 10]
[10 10 10 10 10 10]
[10 10 10 10 10 10]
[10 10 10 10 10 10]]
[[ 0 0 0 3 4 5]
[ 0 0 0 9 10 11]
[ 0 0 0 15 16 17]
[18 19 20 21 22 23]
[24 25 26 27 28 29]
[30 30 30 30 30 30]]
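One way to check whether two arrays share data is np.shares_memory. The sketch below, which assumes NumPy is installed, rebuilds a fresh 6x6 array and compares a slice (a view) with an explicit copy; the names view, copy, untouched and changed are illustrative.

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

view = r[:3, :3]          # basic slicing returns a view
copy = r[:3, :3].copy()   # explicit, independent copy

shares_view = np.shares_memory(r, view)
shares_copy = np.shares_memory(r, copy)

copy[:] = -1              # leaves r untouched
untouched = int(r[1, 1])
view[:] = 0               # writes through to r
changed = int(r[1, 1])

print(shares_view, shares_copy, untouched, changed)
```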
(25) create a new 4x3 array of random integers from 0 to 9
test = np.random.randint(0, 10, (4,3))
test
->
array([[1, 8, 2],
[6, 1, 5],
[7, 8, 0],
[7, 6, 2]])
(26) iterate by row
for row in test:
    print(row)
->
[1 8 2]
[6 1 5]
[7 8 0]
[7 6 2]
(27) iterate by index
for i in range(len(test)):
    print(test[i])
->
[1 8 2]
[6 1 5]
[7 8 0]
[7 6 2]
(28) iterate by row and index
for i, row in enumerate(test):
    print('row', i, 'is', row)
->
row 0 is [1 8 2]
row 1 is [6 1 5]
row 2 is [7 8 0]
row 3 is [7 6 2]
(29) use zip to iterate over multiple iterables
test2 = test**2
test2
->
array([[ 1, 64, 4],
[36, 1, 25],
[49, 64, 0],
[49, 36, 4]])
for i, j in zip(test, test2):
    print(i, '+', j, '=', i+j)
->
[1 8 2] + [ 1 64 4] = [ 2 72 6]
[6 1 5] + [36 1 25] = [42 2 30]
[7 8 0] + [49 64 0] = [56 72 0]
[7 6 2] + [49 36 4] = [56 42 6]