pandas

2023-01-27  本文已影响0人  山猪打不过家猪

1.读取csv文件

1.1读取csv

1.1.1全表读取
df = pd.read_csv('1.csv')
1.1.2根据列的index读取
cols_index= [1] #索引是从0开始,所以这里是第二列的值
df = pd.read_csv('1.csv',usecols=cols_index)
df.head()
image.png
1.1.3根据列名读取
cols_name= ['Id','dateTime','name']
df = pd.read_csv('1.csv',usecols=cols_name)
df.head()
1.1.4无表头读取
df = pd.read_csv('1.csv',header = None)
image.png
1.1.5有表头读取
df = pd.read_csv('1.csv',header=1)
df.head()
1.1.6跳行读取
df = pd.read_csv('1.csv',skiprows=1)
df.head()
1.1.7查看列名
df = pd.read_csv('directory.csv')
df.columns
image.png

1.2获取表中数据类型

1.2.1 获取所有列的信息
import pandas as pd

df = pd.read_csv('1.csv',header=None)
df.info()
image.png
1.2.2获取列数和行数
df = pd.read_csv('directory.csv')
df.shape
image.png
1.2.3只查看所有数据类型
df = pd.read_csv('directory.csv')
df.dtypes
image.png
1.4根据行数获取数据
image.png
1.5获取列的数据类型
print("text_data.dtypes")
image.png
1.6指定列的类型
data = pd.read_csv("vt_tax_data_2016.tsv",dtype={"zipcode":str})
1.7指定缺失数据
image.png
1.8处理损坏的数据
image.png

2.读取xlsx文件

excle_demo = pd.read_excel("fcc_survey.xlsx")
2.1处理特殊格式的excel
image.png
image.png
2.2获取excel的不同sheet
excle_demo = pd.read_excel("fcc_survey.xlsx",sheet_name=0)
excle_demo = pd.read_excel("fcc_survey.xlsx",sheet_name= '2007')
excle_demo = pd.read_excel("fcc_survey.xlsx",sheet_name= None)
2.3 合并结构相同但是名字不同的sheet
image.png
excle_demo = pd.read_excel("fcc_survey.xlsx",sheet_name= None)
all_responses = pd.DataFrame()
for sheet_name,frame in excel_demo.items():
    frame['Year'] = sheet_name #添加一列名为Year,将sheet_name变为列的值
    all_responses = all_responses.append(frame)
2.4处理布尔值(yes,no,false,true,Na)
image.png
2.5 获取表格所有含有Na列的总数
# Load the data
survey_data = pd.read_excel("fcc_survey_subset.xlsx")
# Count NA values in each column
print(survey_data.isna().sum())
>>>
ID.x                        0
HasDebt                     0
HasFinancialDependents      7
HasHomeMortgage           499
HasStudentDebt            502
dtype: int64
2.6 处理时间和日期
image.png image.png

3.读取数据库

3.1创建数据库链接
# Import sqlalchemy's create_engine() function
from sqlalchemy import create_engine

# Create the database engine
engine = create_engine("sqlite:///data.db")

# View the tables in the database
print(engine.table_names())
3.2 执行sql查询
# Create database engine for data.db
engine = create_engine("sqlite:///data.db")

# Write query to get date, tmax, and tmin from weather
query = """
SELECT date, 
       tmax, 
       tmin
  FROM weather;
"""

# Make a dataframe by passing query and engine to read_sql()
temperatures = pd.read_sql(query,engine)

# View the resulting dataframe
print(temperatures)

4. 读取json

4.1获取API中的json数据
## Get data from an API

api_url = "https://api.yelp.com/v3/businesses/search"

# Get data about NYC cafes from the Yelp API
response = requests.get(api_url, 
                headers=headers, 
                params=params)

# Extract JSON data from the response
data = response.json()

# Load data to a data frame
cafes = pd.DataFrame(data['businesses'])

# View the data's dtypes
print(cafes.dtypes)
4.2 爬取的json保存在本地
response = requests.get(
    'https://www.ixigua.com/api/searchv2/complex/json%20pandas/30',
    params=params,
    cookies=cookies,
    headers=headers,
)
rjson = response.json()
##dict-json使用dumps
json_data = json.dumps(rjson)

with open("data.json", "w") as file:
    file.write(json_data)

5.pandas提高效率

5.1 提高循环算数的效率iterrows()
image.png
5.2 提高循环算数的效率itertuples()
image.png
5.3循环的替换方法.apply()
image.png
上一篇 下一篇

猜你喜欢

热点阅读