Data Cleaning: Standardizing Inconsistent Input

2020-07-02 小杨每天要早睡早起哦

Kaggle: Data Cleaning Challenge: Inconsistent Data Entry

The unique method lists every distinct value in a column.

cities = suicide_attacks['City'].unique()
# sort them alphabetically and then take a closer look
cities.sort()
cities

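Because of differences in capitalization and stray whitespace, strings that mean the same thing are treated as several different values. Even a quick skim turns up plenty of them. The next step is therefore to normalize the case and strip out the extra whitespace.

Normalize case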

suicide_attacks['City'] = suicide_attacks['City'].str.lower()
# convert everything to lowercase

Strip whitespace

suicide_attacks['City'] = suicide_attacks['City'].str.strip()

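After these two steps the number of unique values drops noticeably: the duplicates caused by capitalization and by leading or trailing whitespace are gone.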
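To make the effect concrete, here is a minimal sketch on a made-up column. The DataFrame `toy` and its city names are invented for illustration and are not the real suicide_attacks data; only the two string operations are taken from the steps above.

```python
import pandas as pd

# Made-up city names standing in for suicide_attacks['City']
toy = pd.DataFrame(
    {"City": ["Lahore", "lahore", "Lahore ", " Karachi", "karachi", "Quetta"]}
)

before = toy["City"].nunique()

# Same two steps as above: lowercase, then strip leading/trailing whitespace
toy["City"] = toy["City"].str.lower().str.strip()

after = toy["City"].nunique()
cleaned = sorted(toy["City"].unique())

print(before, "->", after)  # 6 -> 3
print(cleaned)              # ['karachi', 'lahore', 'quetta']
```

Note that str.strip only removes whitespace at the two ends of a string, so a space in the middle of a value survives this step.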

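We are not done yet. Even after normalizing case and stripping whitespace, a few troublemakers remain, such as d. i khan and d.i khan (circled in the original screenshot): here the extra space sits in the middle of the string, and the two values are almost identical.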

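With a dataset this small we could spot and fix such cases by hand, but once the data grows to tens of millions of rows we need help from fuzzywuzzy, a third-party Python library for fuzzy string matching.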

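Fuzzy matching

Import the library and its modules

import fuzzywuzzy
from fuzzywuzzy import process
from fuzzywuzzy import fuzz

process.extract returns the closest matching strings together with their similarity scores (0 to 100)

cities = suicide_attacks['City'].unique()
process.extract('d.i khan', cities, limit=10, scorer=fuzz.token_sort_ratio)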

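All that remains is to replace the values whose match score is 90 or higher. (As for why 90 was chosen, 小杨's guess is that the notebook author noticed that d.g khan, which scores 88, has nothing to do with the original value, so the cutoff has to sit above 88, and rounding up gives 90?)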
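To get a feel for where scores like 100 and 88 come from, here is a rough sketch of a token-sort style score built only on the standard library. It is an approximation written for illustration, not fuzzywuzzy's own code: the helper name `token_sort_score` is hypothetical, and the real library may normalize strings and compute the ratio somewhat differently.

```python
import re
from difflib import SequenceMatcher


def token_sort_score(a, b):
    """Rough stand-in for a token-sort ratio, on a 0-100 scale."""
    def normalize(text):
        # lowercase, turn punctuation into spaces, then sort the words
        tokens = re.sub(r"[^0-9a-z]+", " ", text.lower()).split()
        return " ".join(sorted(tokens))

    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())


same_city = token_sort_score("d.i khan", "d. i khan")
other_city = token_sort_score("d.i khan", "d.g khan")

print(same_city)   # 100
print(other_city)  # 88
```

Under this approximation the two spellings of the same city normalize to the identical string, while d.g khan differs by one character out of eight, which is why a cutoff of 90 separates them.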

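A general-purpose "replace" function

The notebook author wrote a general-purpose function that replaces close matches within one column of a DataFrame.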

from fuzzywuzzy import fuzz, process

def replace_matches_in_column(df, column, string_to_match, min_ratio=90):
    # get a list of unique strings
    strings = df[column].unique()

    # get the top 10 closest matches to our input string
    matches = process.extract(string_to_match, strings, limit=10, scorer=fuzz.token_sort_ratio)

    # keep only the matched strings whose score is >= min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # boolean mask of the rows whose value is one of the close matches
    rows_with_matches = df[column].isin(close_matches)

    # overwrite those rows (in place) with the input string
    df.loc[rows_with_matches, column] = string_to_match

    # let us know the function's done
    print("All done!")
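The following sketch shows the same replace-by-threshold idea on a toy column. So that it runs without extra installs, fuzzywuzzy is swapped out for a plain character-level ratio from difflib, which means the scores are not guaranteed to match token_sort_ratio. The function name `replace_close_matches` and the sample rows are made up for illustration.

```python
from difflib import SequenceMatcher

import pandas as pd


def replace_close_matches(df, column, target, min_ratio=90):
    """Hypothetical stand-in for replace_matches_in_column, scored with difflib."""
    def score(value):
        # plain character-level similarity on a 0-100 scale
        return round(100 * SequenceMatcher(None, value, target).ratio())

    # unique values that are close enough to the target
    close = [value for value in df[column].unique() if score(value) >= min_ratio]

    # overwrite the matching rows in place
    df.loc[df[column].isin(close), column] = target
    return close


# made-up rows, not the real suicide_attacks data
toy = pd.DataFrame({"City": ["d. i khan", "d.i khan", "d.g khan", "lahore"]})
replaced = replace_close_matches(toy, "City", "d.i khan")
result = toy["City"].tolist()

print(result)  # ['d.i khan', 'd.i khan', 'd.g khan', 'lahore']
```

With the default cutoff of 90, the near-duplicate spelling is folded into d.i khan while d.g khan, a different value, is left alone.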