Data Cleaning (Part 2): Data Cleaning with Python

The pandas data-cleaning workflow:

  • 1. Reading and writing data: read_csv, read_excel, to_csv, to_excel
  • 2. Exploring and describing data: info, head, describe, shape, value_counts
  • 3. Selecting and organizing data: DataFrame indexing, column selection, label-based loc, position-based iloc, boolean masks, the mapping functions map and apply
  • 4. Grouping data: groupby
  • 5. Splitting and combining data: indexing, drop, pop, del, append, concat, merge
  • 6. Handling missing values, outliers, and duplicates: fillna, drop_duplicates
  • 7. Processing text strings: Series.str

1. Reading data

%pwd

'C:\\Users\\Administrator\\Desktop\\python\\机器学习案例汇总'

import numpy as np
import pandas as pd

df = pd.read_csv('data/1.data清洗/taobao_data.csv',encoding='utf-8')

df.head()

                                  宝贝     价格    成交量       卖家  位置
0  新款中老年女装春装雪纺打底衫妈妈装夏装中袖宽松上衣中年人t恤   99.0  16647  夏奈凤凰旗舰店  江苏
1   中老年女装清凉两件套妈妈装夏装大码短袖T恤上衣雪纺衫裙裤套装  286.0  14045   夏洛特的文艺  上海
2   母亲节衣服夏季妈妈装夏装套装短袖中年人40-50岁中老年女装T恤  298.0  13458    云新旗舰店  江苏
3  母亲节衣服中老年人春装女40岁50中年妈妈装套装夏装奶奶装两件套  279.0  13340    韶妃旗舰店  浙江
4  中老年女装春夏装裤大码 中年妇女40-50岁妈妈装夏装套装七分裤   59.0  12939   千百奈旗舰店  江苏


df.tail(3)

                                  宝贝     价格   成交量       卖家  位置
97   母亲节妈妈装夏装套装女40-50岁夏季衣服两件套中老年春装连衣裙  195.0  4000  若澜锦蒂旗舰店  浙江
98   母亲节衣服夏季中老年女装夏装短袖套装雪纺衫T恤妈妈装两件套  498.0  3968    蕴涵旗舰店  江苏
99  中老年女装春装t恤纱袖针织衫40-50岁妈妈装七分袖上衣夏装打底衫  688.0  3956  潮流前线9170  浙江
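The outline also mentions read_excel and to_csv/to_excel. A minimal sketch of the write/read counterparts; the file names here are made up, and an Excel engine such as openpyxl is assumed for .xlsx files:

# Hypothetical output path: write the frame back to disk without the RangeIndex column.
df.to_csv('data/1.data清洗/taobao_data_clean.csv', index=False, encoding='utf-8')

# Excel I/O works the same way (requires an engine such as openpyxl for .xlsx):
# df_xlsx = pd.read_excel('data/1.data清洗/taobao_data.xlsx', sheet_name=0)
# df.to_excel('data/1.data清洗/taobao_data_clean.xlsx', index=False)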

2. Exploring data


df.shape
(100, 5)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
 0   宝贝      100 non-null    object
 1   价格      100 non-null    float64
 2   成交量     100 non-null    int64
 3   卖家      100 non-null    object
 4   位置      100 non-null    object
dtypes: float64(1), int64(1), object(3)
memory usage: 4.0+ KB
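The outline also lists describe and value_counts; a quick sketch of both on the same frame (column names taken from the head() output above):

df.describe()                        # count, mean, std, min, quartiles, max for 价格 and 成交量
df['位置'].value_counts()            # number of listings per province
df['卖家'].value_counts().head(10)   # the ten most frequent sellers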
The grouping and merging examples below use df_free, a free-travel routes dataset with the columns 出发地, 目的地, 价格, 节省 plus route, hotel, and flight details (see the head(2) output further down); it is loaded separately from the same data directory.

df_free.groupby(['出发地','目的地'],as_index=False).mean().head(3)

  出发地   目的地       价格      节省
0  上海    三亚   1627.35  444.39
1  上海    丽江   1981.49  569.38
2  上海  乌鲁木齐   3223.76  711.80


df1 = df_free.groupby(['出发地','目的地'],as_index=False).mean()

df1.drop('节省',axis=1,inplace=True)
df1.head()

  出发地   目的地         价格
0  上海    三亚   1627.3500
1  上海    丽江   1981.4900
2  上海  乌鲁木齐   3223.7600
3  上海   九寨沟   1893.7125
4  上海    北京   1317.0900

df_free.groupby(['出发地','目的地'],as_index=False).agg({'价格':np.mean}).head(3)

  出发地   目的地       价格
0  上海    三亚   1627.35
1  上海    丽江   1981.49
2  上海  乌鲁木齐   3223.76

df_free.groupby(['出发地','目的地'],as_index=False)['价格'].mean().head(3)

  出发地   目的地       价格
0  上海    三亚   1627.35
1  上海    丽江   1981.49
2  上海  乌鲁木齐   3223.76

df_free.head(2)

[df_free.head(2): 2 rows × 13 columns — 出发地, 目的地, 价格, 节省, 路线名, 酒店, 房间, 去程航司, 去程方式, 去程时间, 回程航司, 回程方式, 回程时间; e.g. row 0: 北京→厦门, 价格 1866, 节省 492, 路线名 "北京-厦门3天2晚 | 入住厦门温特雅酒店 + 联合航空/首都航空往返机票"]

new_df=df_free.groupby(['出发地','目的地'],as_index=False).mean()
new_df.head()

  出发地   目的地         价格       节省
0  上海    三亚   1627.3500  444.390
1  上海    丽江   1981.4900  569.380
2  上海  乌鲁木齐   3223.7600  711.800
3  上海   九寨沟   1893.7125  492.425
4  上海    北京   1317.0900  344.650

del new_df['节省']
new_df.head(3)

  出发地   目的地       价格
0  上海    三亚   1627.35
1  上海    丽江   1981.49
2  上海  乌鲁木齐   3223.76

new_df.shape
(313, 3)
df2 = new_df.set_index(['出发地','目的地'])
df2.head(3)

                 价格
出发地  目的地
上海   三亚      1627.35
     丽江      1981.49
     乌鲁木齐    3223.76

df_free.groupby(['出发地','目的地']).agg({'价格':np.mean})

                  价格
出发地  目的地
上海   三亚      1627.3500
     丽江      1981.4900
     乌鲁木齐    3223.7600
     九寨沟     1893.7125
     北京      1317.0900
...                 ...
青岛   海口      1718.7900
     深圳      1738.4800
     西安      1185.9400
     重庆      1547.2900
     长沙      1804.0100

313 rows × 1 columns
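agg also accepts several statistics at once. A small sketch (same df_free frame) computing the mean, minimum, and maximum price per route:

df_free.groupby(['出发地','目的地'])['价格'].agg(['mean', 'min', 'max']).head(3)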

df_all.head()

  出发地  目的地  路线总数
0  北京   厦门    359
1  北京   青岛    471
2  北京   杭州   1228
3  北京   丽江   1160
4  北京  九寨沟    168

df3 = df_all.set_index(['出发地','目的地'])
df3

            路线总数
出发地  目的地
北京   厦门      359
     青岛      471
     杭州     1228
     丽江     1160
     九寨沟     168
...           ...
厦门   北京     1444
成都   丽江     1160
     西安      876
武汉   厦门      352
     杭州     1234

247 rows × 1 columns

df4 = pd.concat([df2,df3],axis=1,sort=True)
df4.head()

                 价格  路线总数
出发地  目的地
上海   三亚      1627.3500   397
     丽江      1981.4900  1159
     乌鲁木齐    3223.7600   136
     九寨沟     1893.7125   168
     北京      1317.0900  1444

df4.shape
(313, 2)

df4_ = df4.dropna()
df4_.shape
(247, 2)
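Instead of concatenating on the union of the two indexes and dropping the NaN rows afterwards, pd.concat can keep only the labels present in both frames. A sketch; the shape is expected to match the dropna() result above:

df4_inner = pd.concat([df2, df3], axis=1, join='inner')
df4_inner.shape   # expected (247, 2)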

Merging with the merge method

new_df.head(3)

  出发地   目的地       价格
0  上海    三亚   1627.35
1  上海    丽江   1981.49
2  上海  乌鲁木齐   3223.76

df_all.head()

  出发地  目的地  路线总数
0  北京   厦门    359
1  北京   青岛    471
2  北京   杭州   1228
3  北京   丽江   1160
4  北京  九寨沟    168

df5 = pd.merge(new_df,df_all)
df5.head()

  出发地   目的地         价格  路线总数
0  上海    三亚   1627.3500   397
1  上海    丽江   1981.4900  1159
2  上海  乌鲁木齐   3223.7600   136
3  上海   九寨沟   1893.7125   168
4  上海    北京   1317.0900  1444

df5.shape
(247, 4)
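Called without arguments, merge joins on all columns the two frames share (here 出发地 and 目的地) and performs an inner join, which is why the 66 routes missing from df_all disappear (313 - 247 = 66). The defaults can be spelled out, or relaxed to keep every row of new_df; a sketch:

# Explicit form of the call above: inner join on the shared key columns.
df5 = pd.merge(new_df, df_all, on=['出发地','目的地'], how='inner')

# A left join keeps all 313 rows of new_df and leaves 路线总数 as NaN where
# no matching route count exists.
df5_left = pd.merge(new_df, df_all, on=['出发地','目的地'], how='left')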
df_free.head(2)

[df_free.head(2): 2 rows × 13 columns, same output as shown earlier — 出发地, 目的地, 价格, 节省 plus route, hotel, and flight detail columns]

pd.pivot_table(df_free,index=['出发地'],columns=['目的地'],values=['价格']).head(2)

[Pivot table, 2 rows × 52 columns: one row per 出发地 (上海, 北京), one 价格 column per 目的地; e.g. 上海→三亚 1627.35, 上海→丽江 1981.49, 北京→三亚 2760.40, with NaN where no route exists]

2 rows × 52 columns
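pivot_table aggregates with the mean by default; a different statistic and a fill value for empty cells can be supplied. A sketch on the same frame:

pd.pivot_table(df_free, index=['出发地'], columns=['目的地'],
               values=['价格'], aggfunc='median', fill_value=0).head(2)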

dff = df_free[df_free['出发地']=='杭州']
dff.head(3)

[dff.head(3): 3 rows × 13 columns (index 8760-8762), all 杭州→厦门 routes, 价格 1172 / 1336 / 1445, 节省 326 / 326 / 367, flights on 山东航空 SC1174 / SC8837]

dff['出发地'].value_counts()
杭州    2053
Name: 出发地, dtype: int64
dff.shape
(2053, 13)
pd.pivot_table(dff,values=['价格'],
               index=['出发地','目的地'],
               columns=['去程方式'])

                价格
去程方式           直飞          经停
出发地 目的地
杭州  三亚     1839.35          NaN
    丽江         NaN  2918.320000
    九寨沟        NaN  1952.700000
    北海         NaN  2910.000000
    厦门     1208.38          NaN
    呼和浩特       NaN  1557.570000
    哈尔滨        NaN  1690.990000
    大连     1710.09          NaN
    天子山    1682.03          NaN
    天津     1506.03          NaN
    张家界    1682.03          NaN
    成都         NaN  1898.811111
    桂林     1438.98          NaN
    武汉     1190.16          NaN
    沈阳     1948.03          NaN
    西双版纳       NaN  1716.310000
    西宁         NaN  2851.820000
    西安     1242.89          NaN
    重庆     1644.17          NaN
    长沙     1160.92          NaN
    青岛     1049.23          NaN

8. Handling missing values, outliers, and duplicates

data = pd.read_csv('data/1.data清洗/hz_weather.csv')
data.head()

           日期  最高气温  最低气温     天气   风向  风力
0  2017-01-01    17     7      晴  西北风  2级
1  2017-01-02    16     8     多云  东北风  2级
2  2017-01-03    15     8     多云  东北风  1级
3  2017-01-04    15    11     小雨  西北风  2级
4  2017-01-05    13    11  小到中雨   北风  2级

data.shape
(118, 6)
data.isnull().sum()
日期      0
最高气温    0
最低气温    0
天气      0
风向      0
风力      0
dtype: int64

data.isnull().mean()
日期      0.0
最高气温    0.0
最低气温    0.0
天气      0.0
风向      0.0
风力      0.0
dtype: float64
data2=pd.pivot_table(data,
                     index=['天气'],
                     columns=['风向'],
                     values=['最高气温'])
data2

          最高气温
风向         东北风    东南风    东风        北风        南风   风向西北风      西南风
天气
中雨     11.000000    NaN  18.5  11.000000        NaN   NaN        NaN
多云     14.111111  13.75  15.0  13.333333  24.333333  21.0  18.666667
小到中雨        NaN    NaN   NaN  13.000000        NaN   NaN        NaN
小雨     13.500000  11.00  14.5   9.250000        NaN  14.0  13.000000
晴      10.250000  15.80  18.0  18.000000  19.000000  18.0        NaN
阴      13.500000  16.00  13.0  15.600000  14.000000  11.0        NaN
阵雨          NaN  21.00  26.0  15.333333  28.000000  12.0  27.500000
雨夹雪        NaN    NaN   NaN   7.000000        NaN   NaN        NaN


data2.isnull().mean()

      风向
最高气温  东北风    0.375
      东南风    0.375
      东风     0.250
      北风     0.000
      南风     0.500
      西北风    0.375
      西南风    0.500
dtype: float64
data2.isnull().sum()/data2.shape[0]

      风向
最高气温  东北风    0.375
      东南风    0.375
      东风     0.250
      北风     0.000
      南风     0.500
      西北风    0.375
      西南风    0.500
dtype: float64
data2.mean()
      风向
最高气温  东北风    12.472222
      东南风    15.510000
      东风     17.500000
      北风     12.814583
      南风     21.333333
      西北风    15.200000
      西南风    21.666667
dtype: float64

Dropping rows or columns with missing values
Use df.dropna; it returns a new object and does not modify the original data (unless inplace=True).

data2.dropna()

      最高气温
风向       东北风    东南风    东风        北风        南风   西北风       西南风
天气
多云  14.111111  13.75  15.0  13.333333  24.333333  21.0  18.666667
晴   10.250000  15.80  18.0  18.000000  19.000000  18.0  27.500000

data2.dropna(axis=1,how='all')

      最高气温
风向        东北风    东南风    东风        北风        南风   西北风       西南风
天气
中雨    11.000000    NaN  18.5  11.000000        NaN   NaN        NaN
多云    14.111111  13.75  15.0  13.333333  24.333333  21.0  18.666667
小到中雨       NaN    NaN   NaN  13.000000        NaN   NaN        NaN
小雨    13.500000  11.00  14.5   9.250000        NaN  14.0  13.000000
晴     10.250000  15.80  18.0  18.000000  19.000000  18.0  27.500000
阴     13.500000  16.00  13.0  15.600000  14.000000  11.0        NaN
阵雨         NaN  21.00  26.0  15.333333  28.000000  12.0  27.500000
雨夹雪       NaN    NaN   NaN   7.000000        NaN   NaN        NaN
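dropna can also be tuned rather than applied all-or-nothing; a short sketch of two useful parameters:

data2.dropna(how='all')   # drop a row only when every 风向 column is NaN
data2.dropna(thresh=4)    # keep rows that have at least 4 non-missing values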

Filling missing values


data2.fillna('missing')

        最高气温
风向       东北风      东南风     东风        北风       南风      西北风      西南风
天气
中雨        11  missing   18.5  11.000000  missing  missing  missing
多云   14.1111    13.75     15  13.333333  24.3333       21  18.6667
小到中雨 missing  missing missing 13.000000  missing  missing  missing
小雨      13.5       11   14.5   9.250000  missing       14       13
晴       10.25     15.8     18  18.000000       19       18     27.5
阴        13.5       16     13  15.600000       14       11  missing
阵雨    missing       21     26  15.333333       28       12     27.5
雨夹雪  missing  missing missing  7.000000  missing  missing  missing


data2.fillna(method='pad',limit=1)

      最高气温
风向         东北风    东南风    东风        北风        南风   西北风       西南风
天气
中雨     11.000000    NaN  18.5  11.000000        NaN   NaN        NaN
多云     14.111111  13.75  15.0  13.333333  24.333333  21.0  18.666667
小到中雨  14.111111  13.75  15.0  13.000000  24.333333  21.0  18.666667
小雨     13.500000  11.00  14.5   9.250000        NaN  14.0  13.000000
晴      10.250000  15.80  18.0  18.000000  19.000000  18.0  27.500000
阴      13.500000  16.00  13.0  15.600000  14.000000  11.0  27.500000
阵雨     13.500000  21.00  26.0  15.333333  28.000000  12.0  27.500000
雨夹雪         NaN  21.00  26.0   7.000000  28.000000  12.0  27.500000


data2.bfill()

      最高气温
风向         东北风    东南风    东风        北风        南风   西北风       西南风
天气
中雨     11.000000  13.75  18.5  11.000000  24.333333  21.0  18.666667
多云     14.111111  13.75  15.0  13.333333  24.333333  21.0  18.666667
小到中雨  13.500000  11.00  14.5  13.000000  19.000000  14.0  13.000000
小雨     13.500000  11.00  14.5   9.250000  19.000000  14.0  13.000000
晴      10.250000  15.80  18.0  18.000000  19.000000  18.0  27.500000
阴      13.500000  16.00  13.0  15.600000  14.000000  11.0  27.500000
阵雨           NaN  21.00  26.0  15.333333  28.000000  12.0  27.500000
雨夹雪         NaN    NaN   NaN   7.000000        NaN   NaN        NaN


data2.fillna(data2.mean())

      最高气温
风向         东北风    东南风    东风        北风        南风   西北风       西南风
天气
中雨     11.000000  15.51  18.5  11.000000  21.333333  15.2  21.666667
多云     14.111111  13.75  15.0  13.333333  24.333333  21.0  18.666667
小到中雨  12.472222  15.51  17.5  13.000000  21.333333  15.2  21.666667
小雨     13.500000  11.00  14.5   9.250000  21.333333  14.0  13.000000
晴      10.250000  15.80  18.0  18.000000  19.000000  18.0  27.500000
阴      13.500000  16.00  13.0  15.600000  14.000000  11.0  21.666667
阵雨     12.472222  21.00  26.0  15.333333  28.000000  12.0  27.500000
雨夹雪   12.472222  15.51  17.5   7.000000  21.333333  15.2  21.666667

Handling outliers

data = pd.read_csv('data/1.data清洗/hz_weather.csv')
data

             日期  最高气温  最低气温     天气   风向  风力
0    2017-01-01    17     7      晴  西北风  2级
1    2017-01-02    16     8     多云  东北风  2级
2    2017-01-03    15     8     多云  东北风  1级
3    2017-01-04    15    11     小雨  西北风  2级
4    2017-01-05    13    11  小到中雨   北风  2级
..          ...   ...   ...    ...  ...  ...
113  2017-04-26    19    13     阵雨   北风  5级
114  2017-04-27    21    11     多云  西北风  2级
115  2017-04-28    26    13      晴  西南风  2级
116  2017-04-29    29    15      晴  西南风  3级
117  2017-04-30    29    16     多云   东风  2级

118 rows × 6 columns

low = data['最高气温'].mean()-3*data['最高气温'].std()
high = data['最高气温'].mean()+3*data['最高气温'].std()

(data['最高气温'] > high) | (data['最高气温'] < low)
data[(data['最高气温'] > high) | (data['最高气温'] < low) ]

Empty DataFrame
Columns: [日期, 最高气温, 最低气温, 天气, 风向, 风力]
Index: []

No value of 最高气温 lies outside the mean ± 3×std bounds, so no outliers are flagged.

def func(data,column):
    high = data[column].mean() + 3 * data[column].std()
    low = data[column].mean() - 3* data[column].std()
    res_df = data[(data[column] > high) | (data[column]< low)]
    return res_df
func(data,'最低气温')

Empty DataFrame
Columns: [日期, 最高气温, 最低气温, 天气, 风向, 风力]
Index: []

最低气温 also has no values beyond 3 standard deviations from its mean.
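The 3σ rule assumes roughly normal data. An interquartile-range (IQR) rule is a common alternative; the function below is a sketch, not part of the original notebook:

def iqr_outliers(data, column, k=1.5):
    """Return rows whose value lies more than k*IQR outside the quartiles."""
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    mask = (data[column] > q3 + k * iqr) | (data[column] < q1 - k * iqr)
    return data[mask]

iqr_outliers(data, '最高气温')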

data.head()

           日期  最高气温  最低气温     天气   风向  风力
0  2017-01-01    17     7      晴  西北风  2级
1  2017-01-02    16     8     多云  东北风  2级
2  2017-01-03    15     8     多云  东北风  1级
3  2017-01-04    15    11     小雨  西北风  2级
4  2017-01-05    13    11  小到中雨   北风  2级

9. Processing text strings

The examples below use df_pop, a table of the most populous European cities (columns Rank, City, State, Population, Date of census/estimate) with 105 rows; the Population values are strings containing thousands separators.

df_pop.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype
 0   Rank                     105 non-null    int64
 1   City                     105 non-null    object
 2   State                    105 non-null    object
 3   Population               105 non-null    object
 4   Date of census/estimate  105 non-null    object
dtypes: int64(1), object(4)
memory usage: 4.2+ KB
df_pop['Population'][:3]
0    8,615,246
1    3,437,916
2    3,165,235
Name: Population, dtype: object
type(df_pop['Population'][0])
str
int(df_pop['Population'][0].replace(',',''))
8615246

Approach 1: loop over the rows

df_pop.shape[0]
105

re = []
for i in range(df_pop.shape[0]):
    re.append(int(df_pop['Population'][i].replace(',','')))
len(re)
105
df_pop['Pop'] = re
df_pop.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype
 0   Rank                     105 non-null    int64
 1   City                     105 non-null    object
 2   State                    105 non-null    object
 3   Population               105 non-null    object
 4   Date of census/estimate  105 non-null    object
 5   Pop                      105 non-null    int64
dtypes: int64(2), object(4)
memory usage: 5.0+ KB

Approach 2: use apply with a lambda

df_pop['Population'].apply(lambda x:int(x.replace(',','')))
0      8615246
1      3437916
2      3165235
3      2872086
4      2273305
        ...

100     309869
101     309105
102     308735
103     308269
104     306888
Name: Population, Length: 105, dtype: int64

df_pop['subtr']=df_pop['Population'].apply(lambda x:int(x.replace(',','')))

df_pop.head()

   Rank       City           State Population Date of census/estimate      Pop    subtr
0     1  London[2]  United Kingdom  8,615,246             1 June 2014  8615246  8615246
1     2     Berlin         Germany  3,437,916             31 May 2014  3437916  3437916
2     3     Madrid           Spain  3,165,235          1 January 2014  3165235  3165235
3     4       Rome           Italy  2,872,086       30 September 2014  2872086  2872086
4     5      Paris          France  2,273,305          1 January 2013  2273305  2273305
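Both approaches work; the pandas string accessor offers a third, fully vectorized option that avoids the Python loop and the lambda (the column name Pop2 is made up here):

df_pop['Pop2'] = df_pop['Population'].str.replace(',', '', regex=False).astype(int)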

df_pop.columns
Index(['Rank', 'City', 'State', 'Population', 'Date of census/estimate', 'Pop',
       'subtr'],
      dtype='object')
df_pop['State'].values
array([' United Kingdom', ' Germany', ' Spain', ' Italy', ' France',
       ' Romania', ' Austria', ' Germany', ' Hungary', ' Poland',
       ' Spain', ' Germany', ' Italy', ' Bulgaria', ' Czech Republic',
       ' Belgium', ' United Kingdom', ' Germany', ' Italy', ' Sweden',
       ' Italy', ' France', ' Netherlands', ' Croatia', ' Spain',
       ' Poland', ' United Kingdom', ' Poland', ' Germany', ' Latvia',
       ' Spain', ' Italy', ' Spain', ' Greece', ' Poland', ' Netherlands',
       ' Finland', ' Germany', ' United Kingdom', ' Italy', ' Germany',
       ' Germany', ' Germany', ' Spain', ' Denmark', ' United Kingdom',
       ' Portugal', ' Poland', ' Germany', ' Lithuania', ' Germany',
       ' Germany', ' Sweden', ' Ireland', ' United Kingdom', ' Germany',
       ' Netherlands', ' United Kingdom', ' Belgium', ' United Kingdom',
       ' Germany', ' Germany', ' France', ' United Kingdom', ' Poland',
       ' France', ' Spain', ' Estonia', ' United Kingdom',
       'Slovakia Slovak Republic', ' Poland', ' Spain', ' Italy',
       ' Spain', ' Italy', ' Czech Republic', ' Poland', ' Germany',
       ' Spain', ' United Kingdom', ' Poland', ' France', ' Germany',
       ' Bulgaria', ' Bulgaria', ' Spain', ' United Kingdom',
       ' Netherlands', ' Spain', ' Germany', ' United Kingdom',
       ' Denmark', ' Romania', ' United Kingdom', ' Italy', ' Greece',
       ' United Kingdom', ' Romania', ' Italy', ' Spain', ' Germany',
       ' Sweden', ' United Kingdom', ' Poland', ' Lithuania'],
      dtype=object)
df_pop['State'][0].strip()

'United Kingdom'
str_stri=df_pop['State'].apply(lambda x: x.strip())
str_stri[10]
'Spain'
df_pop['State']=str_stri

df_pop['State'].values
array(['United Kingdom', 'Germany', 'Spain', 'Italy', 'France', 'Romania',
       'Austria', 'Germany', 'Hungary', 'Poland', 'Spain', 'Germany',
       'Italy', 'Bulgaria', 'Czech Republic', 'Belgium', 'United Kingdom',
       'Germany', 'Italy', 'Sweden', 'Italy', 'France', 'Netherlands',
       'Croatia', 'Spain', 'Poland', 'United Kingdom', 'Poland',
       'Germany', 'Latvia', 'Spain', 'Italy', 'Spain', 'Greece', 'Poland',
       'Netherlands', 'Finland', 'Germany', 'United Kingdom', 'Italy',
       'Germany', 'Germany', 'Germany', 'Spain', 'Denmark',
       'United Kingdom', 'Portugal', 'Poland', 'Germany', 'Lithuania',
       'Germany', 'Germany', 'Sweden', 'Ireland', 'United Kingdom',
       'Germany', 'Netherlands', 'United Kingdom', 'Belgium',
       'United Kingdom', 'Germany', 'Germany', 'France', 'United Kingdom',
       'Poland', 'France', 'Spain', 'Estonia', 'United Kingdom',
       'Slovakia Slovak Republic', 'Poland', 'Spain', 'Italy', 'Spain',
       'Italy', 'Czech Republic', 'Poland', 'Germany', 'Spain',
       'United Kingdom', 'Poland', 'France', 'Germany', 'Bulgaria',
       'Bulgaria', 'Spain', 'United Kingdom', 'Netherlands', 'Spain',
       'Germany', 'United Kingdom', 'Denmark', 'Romania',
       'United Kingdom', 'Italy', 'Greece', 'United Kingdom', 'Romania',
       'Italy', 'Spain', 'Germany', 'Sweden', 'United Kingdom', 'Poland',
       'Lithuania'], dtype=object)
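The same cleanup can be written with the vectorized string accessor instead of apply; a one-line sketch:

df_pop['State'] = df_pop['State'].str.strip()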
import numpy as np
import pandas as pd
df1=pd.read_csv('data/1.data清洗/getlinks.csv')
df1

               title                         link
0   网民最喜欢的旅游目的地榜单出炉  http://cntour.cn/news/4221/
1     让生活更幸福是旅游业的使命  http://cntour.cn/news/4212/
2  "一带一路"国家中东欧游客增两倍  http://cntour.cn/news/4202/
3    旅游业改革开启旅游强国新篇章  http://cntour.cn/news/4191/

df1.shape
(4, 2)
df1['link']
0    http://cntour.cn/news/4221/
1    http://cntour.cn/news/4212/
2    http://cntour.cn/news/4202/
3    http://cntour.cn/news/4191/
Name: link, dtype: object

Extracting information with str.extract


df1['link'].str.extract('(\d)',expand=False )

0    4
1    4
2    4
3    4
Name: link, dtype: object

df1['link'].str.extract('(\d+)',expand=True )

      0
0  4221
1  4212
2  4202
3  4191


df1['link'].str.extract('(\d.*)',expand=True )

       0
0  4221/
1  4212/
2  4202/
3  4191/

df1['link']
0    http://cntour.cn/news/4221/
1    http://cntour.cn/news/4212/
2    http://cntour.cn/news/4202/
3    http://cntour.cn/news/4191/
Name: link, dtype: object

df1['link'].str.extract('/(\d.*)/',expand=True )

      0
0  4221
1  4212
2  4202
3  4191

df1['link'].str.extract('//(.+)/',expand=True  )

                     0
0  cntour.cn/news/4221
1  cntour.cn/news/4212
2  cntour.cn/news/4202
3  cntour.cn/news/4191

df1['link'].str.extract('//(.*)/',expand=True )

                     0
0  cntour.cn/news/4221
1  cntour.cn/news/4212
2  cntour.cn/news/4202
3  cntour.cn/news/4191

df1['link'].str.extract('//(.+?)/',expand=True )

           0
0  cntour.cn
1  cntour.cn
2  cntour.cn
3  cntour.cn

df1['link']
0    http://cntour.cn/news/4221/
1    http://cntour.cn/news/4212/
2    http://cntour.cn/news/4202/
3    http://cntour.cn/news/4191/
Name: link, dtype: object

Exercise 1: match the cntour part of every link
Exercise 2: match the news part of every link

df1['link'].str.extract('//(.*).cn')

        0
0  cntour
1  cntour
2  cntour
3  cntour

df1['link'].str.extract('//(.+)\.',expand=True )

        0
0  cntour
1  cntour
2  cntour
3  cntour

df1['link'].str.extract('n/(.+?)/')

      0
0  news
1  news
2  news
3  news

[Key point]
The * and + quantifiers are greedy: they match as much text as possible.
Adding ? after them makes the match non-greedy, also called minimal matching.
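str.extract can also pull several fields at once when the pattern contains multiple (named) groups. A sketch that captures the domain and the article id together; the column names site and news_id are chosen here, not taken from the original:

df1['link'].str.extract(r'//(?P<site>.+?)/news/(?P<news_id>\d+)/', expand=True)
# -> a DataFrame with columns 'site' (cntour.cn) and 'news_id' (4221, 4212, ...)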

10. Hands-on practice

df=pd.read_csv('data/1.data清洗/qunar_routes.csv')
df.head()

[df.head(): 5 rows × 4 columns — 出发地, 目的地, 路线信息, 酒店信息; e.g. row 0: 杭州→丽江, 路线信息 "机酒自由行\n丽江5日自由行,入住丽江添富太和休闲度假酒店+接送机,品古城文化,享至尊服务,…", 酒店信息 "4.4分\n高档型"]

df.shape
(120, 4)
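The info() output below lists eight columns, although the frame was read with only four: the derived columns 酒店评分, 酒店等级, 价格, and 天数 must have been created by extraction steps that are missing from this excerpt. A plausible reconstruction using the string methods from the previous section is sketched here; the exact regular expressions are assumptions, not the original code:

# 酒店信息 looks like '4.4分\n高档型': score before 分, grade after the newline (assumed patterns).
df['酒店评分'] = df['酒店信息'].str.extract(r'([\d.]+)分', expand=False)
df['酒店等级'] = df['酒店信息'].str.extract(r'\n(.+)', expand=False)

# 路线信息 contains fragments such as '5日自由行' or '6天自由行'; pull out the day count.
df['天数'] = df['路线信息'].str.extract(r'(\d+)[日天]', expand=False)

# A 价格 column also appears below, but its source is not shown in this excerpt.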
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype
 0   出发地     120 non-null    object
 1   目的地     120 non-null    object
 2   路线信息    120 non-null    object
 3   酒店信息    120 non-null    object
 4   酒店评分    120 non-null    object
 5   酒店等级    120 non-null    object
 6   价格      120 non-null    object
 7   天数      119 non-null    object
dtypes: object(8)
memory usage: 7.6+ KB

df['酒店评分']=df['酒店评分'].apply(lambda x:float(x))

df['价格']=df['价格'].apply(lambda x: float(x))
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype
 0   出发地     120 non-null    object
 1   目的地     120 non-null    object
 2   路线信息    120 non-null    object
 3   酒店信息    120 non-null    object
 4   酒店评分    120 non-null    float64
 5   酒店等级    120 non-null    object
 6   价格      120 non-null    float64
 7   天数      119 non-null    object
dtypes: float64(2), object(6)
memory usage: 7.6+ KB

df['天数'].isnull().sum()

df=df.loc[df['天数'].notnull(),:]
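The info() output that follows reports 天数 as int64, so a type conversion must have been applied once the missing row was removed; a likely equivalent (an assumption, not shown in the original):

df['天数'] = df['天数'].astype(int)   # assumed conversion; 天数 holds digit strings at this point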
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 119 entries, 0 to 119
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype
 0   出发地     119 non-null    object
 1   目的地     119 non-null    object
 2   路线信息    119 non-null    object
 3   酒店信息    119 non-null    object
 4   酒店评分    119 non-null    float64
 5   酒店等级    119 non-null    object
 6   价格      119 non-null    float64
 7   天数      119 non-null    int64
dtypes: float64(2), int64(1), object(5)
memory usage: 8.4+ KB

df.index = range(df.shape[0])
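Further down, 酒店等级 appears as int64 and takes the values 1 and 3 in the final head() output, which suggests the text grades were mapped to numbers around this point. A sketch of such a mapping; only 经济型→1 and 高档型→3 can be read off the outputs, the other two values are guesses:

grade_map = {'经济型': 1, '舒适型': 2, '高档型': 3, '豪华型': 4}   # 2 and 4 are assumptions
df['酒店等级'] = df['酒店等级'].map(grade_map)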
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype
 0   出发地     119 non-null    object
 1   目的地     119 non-null    object
 2   路线信息    119 non-null    object
 3   酒店信息    119 non-null    object
 4   酒店评分    119 non-null    float64
 5   酒店等级    119 non-null    int64
 6   价格      119 non-null    float64
 7   天数      119 non-null    int64
dtypes: float64(2), int64(2), object(4)
memory usage: 7.6+ KB
df.head()

[df.head(): 5 rows × 8 columns — the original four columns plus the derived 酒店评分 (4.4, 4.4, 4.4, 4.4, 4.7), 酒店等级 (3, 3, 1, 1, 3), 价格 (1888.0, 1872.0, 1517.0, 1612.0, 1740.0), and 天数 (5, 6, 6, 6, 5)]

Once the data has been cleaned, you can select features from it, feed them into a model, and make predictions.

import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus']=False

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

df["酒店等级"].plot(ax=axes[0], kind='hist', title="酒店等级")
df["酒店评分"].plot(ax=axes[1], kind='hist', title="酒店评分")
df["价格"].plot(ax=axes[2], kind='hist', title="价格");

[Figure: three histograms, one per subplot — 酒店等级, 酒店评分, and 价格]

11. Working with time-series data

import time
import datetime as dt
import pandas as pd
import numpy as np
time.time()

1610678087.3518448
time.localtime()
time.struct_time(tm_year=2021, tm_mon=1, tm_mday=15, tm_hour=10, tm_min=34, tm_sec=50, tm_wday=4, tm_yday=15, tm_isdst=0)

time.strftime

  • %Y Year with century as a decimal number.

  • %m Month as a decimal number [01,12].

  • %d Day of the month as a decimal number [01,31].

  • %H Hour (24-hour clock) as a decimal number [00,23].

  • %M Minute as a decimal number [00,59].

  • %S Second as a decimal number [00,61].

  • %z Time zone offset from UTC.

  • %a Locale’s abbreviated weekday name.

  • %A Locale’s full weekday name.

  • %b Locale’s abbreviated month name.

  • %B Locale’s full month name.

  • %c Locale’s appropriate date and time representation.

  • %I Hour (12-hour clock) as a decimal number [01,12].

  • %p Locale’s equivalent of either AM or PM.

time.strftime('%Y-%m-%d',time.localtime())
'2021-01-15'
time.strftime('%Y-%m-%d %H:%M:%S',time.localtime())
'2021-01-15 10:35:30'
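time.strptime goes the other way, parsing a formatted string back into a struct_time:

t = time.strptime('2021-01-15 10:35:30', '%Y-%m-%d %H:%M:%S')
t.tm_year, t.tm_mon, t.tm_mday   # (2021, 1, 15)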

time.mktime

  • year (including century, e.g. 1998)
  • month (1-12)
  • day (1-31)
  • hours (0-23)
  • minutes (0-59)
  • seconds (0-59)
  • weekday (0-6, Monday is 0)
  • Julian day (day in the year, 1-366)
  • DST (Daylight Savings Time) flag (-1, 0 or 1)
time.localtime()
time.struct_time(tm_year=2021, tm_mon=1, tm_mday=15, tm_hour=10, tm_min=37, tm_sec=19, tm_wday=4, tm_yday=15, tm_isdst=0)
time.mktime(time.localtime())
1610678241.0
time.localtime(1552922267.0)
time.struct_time(tm_year=2019, tm_mon=3, tm_mday=18, tm_hour=23, tm_min=17, tm_sec=47, tm_wday=0, tm_yday=77, tm_isdst=0)
time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(1552922267.0))
'2019-03-18 23:17:47'
dt.datetime

datetime.datetime
dt.datetime(2019,3,20)
datetime.datetime(2019, 3, 20, 0, 0)
pd.date_range

<function pandas.core.indexes.datetimes.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs) -> pandas.core.indexes.datetimes.DatetimeIndex>
pd.date_range(dt.datetime(2019,3,20),periods=4)
DatetimeIndex(['2019-03-20', '2019-03-21', '2019-03-22', '2019-03-23'], dtype='datetime64[ns]', freq='D')
pd.date_range('2019-3-20',periods=4)

DatetimeIndex(['2019-03-20', '2019-03-21', '2019-03-22', '2019-03-23'], dtype='datetime64[ns]', freq='D')
pd.date_range('2019-3-20','2019-4-20')

DatetimeIndex(['2019-03-20', '2019-03-21', '2019-03-22', '2019-03-23',
               '2019-03-24', '2019-03-25', '2019-03-26', '2019-03-27',
               '2019-03-28', '2019-03-29', '2019-03-30', '2019-03-31',
               '2019-04-01', '2019-04-02', '2019-04-03', '2019-04-04',
               '2019-04-05', '2019-04-06', '2019-04-07', '2019-04-08',
               '2019-04-09', '2019-04-10', '2019-04-11', '2019-04-12',
               '2019-04-13', '2019-04-14', '2019-04-15', '2019-04-16',
               '2019-04-17', '2019-04-18', '2019-04-19', '2019-04-20'],
              dtype='datetime64[ns]', freq='D')
pd.date_range('2019-3-20 11:00','2019-3-20 23:00',freq='H')

DatetimeIndex(['2019-03-20 11:00:00', '2019-03-20 12:00:00',
               '2019-03-20 13:00:00', '2019-03-20 14:00:00',
               '2019-03-20 15:00:00', '2019-03-20 16:00:00',
               '2019-03-20 17:00:00', '2019-03-20 18:00:00',
               '2019-03-20 19:00:00', '2019-03-20 20:00:00',
               '2019-03-20 21:00:00', '2019-03-20 22:00:00',
               '2019-03-20 23:00:00'],
              dtype='datetime64[ns]', freq='H')
ser=pd.Series(np.arange(10),index=pd.date_range('2019-3-20',periods=10))

ser
2019-03-20    0
2019-03-21    1
2019-03-22    2
2019-03-23    3
2019-03-24    4
2019-03-25    5
2019-03-26    6
2019-03-27    7
2019-03-28    8
2019-03-29    9
Freq: D, dtype: int32
ser['2019-03-25']
print(ser.index[2])

2019-03-22 00:00:00
ser.index[2].year
2019
ser.index[2].month
3
ser.index[2].day
22
ser.index[2].week
12
data=pd.read_csv('data/1.data清洗/hz_weather.csv')
data.head()

           日期  最高气温  最低气温     天气   风向  风力
0  2017-01-01    17     7      晴  西北风  2级
1  2017-01-02    16     8     多云  东北风  2级
2  2017-01-03    15     8     多云  东北风  1级
3  2017-01-04    15    11     小雨  西北风  2级
4  2017-01-05    13    11  小到中雨   北风  2级

df=data[['日期','最高气温','最低气温']]
df.head()

           日期  最高气温  最低气温
0  2017-01-01    17     7
1  2017-01-02    16     8
2  2017-01-03    15     8
3  2017-01-04    15    11
4  2017-01-05    13    11
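The info() output below reports 日期 as datetime64[ns], although read_csv loads it as a plain object column, so a conversion step is missing from this excerpt; the usual way to do it is:

df['日期'] = pd.to_datetime(df['日期'])   # assumed step; produces the datetime64 dtype seen below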

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118 entries, 0 to 117
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
 0   日期      118 non-null    datetime64[ns]
 1   最高气温    118 non-null    int64
 2   最低气温    118 non-null    int64
dtypes: datetime64[ns](1), int64(2)
memory usage: 2.9 KB

df.set_index('日期',inplace=True)
df.index
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
               '2017-01-09', '2017-01-10',
               ...

               '2017-04-21', '2017-04-22', '2017-04-23', '2017-04-24',
               '2017-04-25', '2017-04-26', '2017-04-27', '2017-04-28',
               '2017-04-29', '2017-04-30'],
              dtype='datetime64[ns]', name='日期', length=118, freq=None)
df.index<'2017-3-1'
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False])

df_jan=df[(df.index>='2017-1-1')&(df.index<'2017-2-1')]
df_jan.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 31 entries, 2017-01-01 to 2017-01-31
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
 0   最高气温    31 non-null     int64
 1   最低气温    31 non-null     int64
dtypes: int64(2)
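With a DatetimeIndex in place, the boolean mask above can also be written as partial string indexing, which pandas resolves to the same January slice; a sketch:

df_jan = df.loc['2017-01']             # every row whose index falls in January 2017
df_q1 = df.loc['2017-01':'2017-03']    # an inclusive slice over several months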

Original: https://blog.csdn.net/qq_41081716/article/details/120602698
Author: 古杜且偲
Title: 数据清洗(二):python数据清洗
