Part 2: Pandas DataFrame operations and data extraction in Python

July 15, 2023, 1:01 AM • Artificial Intelligence


import pandas as pd
import numpy as np

1. DataFrame row and column operations
1.1 Create a DataFrame

data = {"col1":['Python', 'C', 'Java', 'R', 'SQL', 'PHP', 'Python', 'Java', 'C', 'Python'],
       "col2":[6, 2, 6, 4, 2, 5, 8, 10, 3, 4],
       "col3":[4, 2, 6, 2, 1, 2, 2, 3, 3, 6]}
df = pd.DataFrame(data)
df

1.2 Set the index

df['new_index'] = range(1,11)
# set_index returns a new DataFrame; df itself keeps new_index as an ordinary column
df.set_index('new_index')
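
Because `set_index` is not applied in place by default, the result has to be assigned if it is needed later. A minimal sketch on a small hypothetical frame (`demo` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical small frame, used only for illustration
demo = pd.DataFrame({"lang": ["Python", "C", "Java"], "new_index": [1, 2, 3]})

# set_index returns a new DataFrame; demo itself is unchanged
indexed = demo.set_index("new_index")

# drop=False keeps the column in the frame as well as in the index
indexed_keep = demo.set_index("new_index", drop=False)

print(list(demo.columns))       # the original still has both columns
print(list(indexed.columns))    # new_index has moved into the index
print(indexed.loc[2, "lang"])   # label-based lookup through the new index
```

Passing `inplace=True` would modify `demo` directly instead of returning a new frame.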

1.3 Reset the index (row numbers)

df.reset_index(drop=True,inplace = True)
df

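1.4 Rename columns

# Method 1: assign a full list of new names, one per column, in order
df.columns = ['grammer', 'score', 'cycle', 'id']

# Method 2: rename selected columns with a mapping (has no effect if Method 1 already ran)
df.rename(columns={'col1':'grammer', 'col2':'score', 'col3':'cycle','new_index':'id'}, inplace=True)
df.head()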

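1.5 Reorder columns
(1) Reverse the order of all columns

# Method 1: slice with a negative step
df.iloc[:, ::-1]

# Method 2: list the column positions explicitly
df.iloc[:, [-1,-2,-3,-4]]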

(2) Swap the positions of two columns

temp = df['grammer']
df.drop(labels=['grammer'], axis=1, inplace=True)
df.insert(1, 'grammer', temp)
df

(3) Change the order of all columns

order = df.columns[[0, 3, 1, 2]]
df = df[order]
df

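1.6 Delete rows and columns
(1) Delete the id column

# Method 1: del
del df['id']

# Method 2: drop (the column is re-created first so there is something to drop)
df['id'] = range(1,11)
df.drop('id',axis=1, inplace=True)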

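(2) Add a row with grammer='css', then delete it

# Add the row first (the score and cycle values are arbitrary placeholders), then drop it by index label
df = pd.concat([df, pd.DataFrame([{'grammer': 'css', 'score': 7, 'cycle': 3}])], ignore_index=True)
df.drop(labels=[df[df['grammer']=='css'].index[0]],axis=0,inplace=True)
df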

1.7 Combine the grammer and score columns into a new column

df['new_col'] = df['grammer'] + df['score'].map(str)
df
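
String columns can only be concatenated with other strings, which is why the numeric column is converted first. A small sketch of two equivalent ways to do this, with an optional separator (the `demo` frame and the `_` separator are assumptions for illustration):

```python
import pandas as pd

# Hypothetical two-row frame mirroring the grammer/score columns
demo = pd.DataFrame({"grammer": ["Python", "C"], "score": [6, 2]})

# astype(str) converts the whole numeric column to strings before joining
demo["new_col"] = demo["grammer"] + "_" + demo["score"].astype(str)

# str.cat does the same join and takes the separator as a parameter
joined = demo["grammer"].str.cat(demo["score"].astype(str), sep="_")

print(demo["new_col"].tolist())
```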

1.8 Output the rows in reverse order

df.iloc[::-1, :]

2. Reading and saving data
2.1 Read an Excel file

excel = pd.read_excel('/home/mw/input/pandas1206855/pandas120.xlsx')
excel.head()

2.2 Read a CSV file

csv = pd.read_csv('/home/mw/input/pandas_exercise/pandas_exercise/exercise_data/drinks.csv')
csv.head()

2.3 Read a TSV file

tsv = pd.read_csv('/home/mw/input/pandas_exercise/pandas_exercise/exercise_data/chipotle.tsv', sep = '\t')
tsv.head()

2.4 Save a DataFrame as a CSV file

df.to_csv('course.csv')
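
By default `to_csv` also writes the index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. A minimal sketch of the difference, using an in-memory buffer instead of a real file (the `demo` frame is hypothetical):

```python
import io
import pandas as pd

# Hypothetical frame, for illustration only
demo = pd.DataFrame({"grammer": ["Python", "C"], "score": [6, 2]})

# With no path argument, to_csv returns the CSV text as a string
with_index = demo.to_csv()                 # index written as an unnamed first column
without_index = demo.to_csv(index=False)   # data columns only

# Reading the first form back produces an extra "Unnamed: 0" column
round_trip = pd.read_csv(io.StringIO(with_index))
clean_trip = pd.read_csv(io.StringIO(without_index))

print(list(round_trip.columns))
print(list(clean_trip.columns))
```

Whether to keep the index depends on whether it carries information; for a default 0..n-1 index, `index=False` usually gives a cleaner file.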

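2.5 Set row and column display options: pd.set_option()


# Show all columns (None means no limit), or at most 2 columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_columns', 2)

# Show all rows, or at most 10 rows
pd.set_option('display.max_rows', None)
pd.set_option('display.max_rows', 10)

# Format floats with two decimal places
pd.set_option('display.float_format',lambda x: '%.2f'%x)

# Display width in characters
pd.set_option('display.width', 100)

# Decimal places shown; the full key is used because the bare 'precision' can match more than one option in newer pandas versions
pd.set_option('display.precision', 1)

# Do not wrap wide DataFrames across multiple lines
pd.set_option('display.expand_frame_repr', False)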
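
Options set with `pd.set_option` stay in effect for the rest of the session. When a setting is only needed for one printout, `pd.option_context` and `pd.reset_option` may be more convenient. A short sketch (the `demo` frame is hypothetical):

```python
import pandas as pd

# Hypothetical frame, for illustration only
demo = pd.DataFrame({"x": [1.23456, 2.34567]})

before = pd.get_option("display.precision")

# Options passed to option_context apply only inside the with-block
with pd.option_context("display.precision", 1, "display.max_rows", 10):
    inside = pd.get_option("display.precision")
    text = repr(demo)

# On exit the previous value is restored automatically
after = pd.get_option("display.precision")

# reset_option restores a single option to its default
pd.set_option("display.max_columns", 2)
pd.reset_option("display.max_columns")

print(text)
```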

3. Extracting specific rows and columns


df = pd.read_excel('/home/mw/input/pandas1206855/pandas120.xlsx')
df.head()

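3.1 Extract row 32


# Method 1: by index label
df.loc[32]

# Method 2: by integer position (the same row as above while the index is the default 0..n-1)
df.iloc[32,:]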
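
`loc` and `iloc` only agree when the index is the default range. A minimal sketch with a deliberately reversed index shows the difference (the `demo` frame is hypothetical):

```python
import pandas as pd

# Hypothetical frame whose index labels run in reverse order
demo = pd.DataFrame({"v": [10, 20, 30, 40]}, index=[3, 2, 1, 0])

by_label = demo.loc[1, "v"]      # the row whose index label is 1
by_position = demo.iloc[1, 0]    # the second row, first column

print(by_label, by_position)
```

After filtering or sorting a DataFrame the index is often no longer 0..n-1, so it is worth deciding explicitly whether a label or a position is meant.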

3.2 Extract the education column

df['education']

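3.3 Extract the last two columns (education, salary)


# Method 1: by column name
df[['education', 'salary']]

# Method 2: by position (every column from the second one onward)
df.iloc[:, 1:]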

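3.4 Extract the values at positions 1, 10, and 15 of the first column


# Method 1: positional indexing
df.iloc[[1,10,15], 0]

# Method 2: select the column, then index by label
df['createTime'][[1,10,15]]

# Method 3: take() by position
df['createTime'].take([1,10,15])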

4. Extracting rows and columns with duplicate values

4.1 Check whether values in the createTime column are duplicated

df.createTime.duplicated()

4.2 Check whether any complete rows in the DataFrame are duplicated

df.duplicated()

4.3 Check whether the combination of education and salary is duplicated (multi-column check)

df.duplicated(subset = ['education','salary'])

4.4 Check for duplicated index labels

df.index.duplicated()
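
`duplicated` returns a boolean mask, and by default only the second and later copies are marked `True`. The sketch below, on hypothetical data, shows how the `keep` parameter changes that and how the mask can be used to pull out or remove the duplicated rows:

```python
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({
    "education": ["Bachelor", "Master", "Bachelor", "Bachelor"],
    "salary": ["20k-30k", "30k-40k", "20k-30k", "25k-35k"],
})
cols = ["education", "salary"]

# Default keep='first': the first occurrence is not flagged
later_copies = demo.duplicated(subset=cols)

# keep=False flags every member of a duplicated group
all_copies = demo.duplicated(subset=cols, keep=False)

# Use the mask to extract the duplicated rows, or drop the repeats
dup_rows = demo[all_copies]
deduped = demo.drop_duplicates(subset=cols)

print(dup_rows)
```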

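5. Extracting values by condition
To make the following code runnable, two columns of random integers are added with the random module.

import random
df['value'] = [random.randint(1,100) for i in range(len(df))]
df['value1'] = [random.randint(1,100) for i in range(len(df))]
df.head()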

5.1 Extract rows where value is greater than 90

df[df['value'] > 90]

5.3 Extract the rows containing the maximum value of a column

df[df['value'] == df['value'].max()]

5.4 Extract the last three rows where the sum of value and value1 is greater than 150

df[(df['value'] + df['value1']) > 150].tail(3)
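
The conditions in this section are all boolean masks, so they can be combined. A small sketch on hypothetical numbers, assuming the same `value` and `value1` column names:

```python
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({"value": [95, 40, 91, 10], "value1": [60, 99, 20, 5]})

# Combine masks with & (and) and | (or); each condition needs its own parentheses
both = demo[(demo["value"] > 90) & (demo["value1"] > 50)]
either = demo[(demo["value"] > 90) | (demo["value1"] > 90)]

# query() expresses the same kind of filter as a string
via_query = demo.query("value + value1 > 130")

# idxmax returns the index label of the first maximum
max_row = demo.loc[demo["value"].idxmax()]

print(both.index.tolist(), either.index.tolist(), via_query.index.tolist())
```

Python's `and` and `or` do not work element-wise on Series, which is why `&` and `|` are used here.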

6. Extracting rows and columns with missing values
For demonstration purposes, some values are overwritten first:

df.loc[[2,10,45,87], 'value'] = np.nan
df.loc[[19,30,55,97,114], 'value1'] = np.nan
df.loc[[24,52,67,120], 'education'] = 111
df.loc[[8,26,84], 'salary'] = '--'

6.1 Extract rows where the value column is missing

df[df['value'].isnull()]

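6.2 Find the rows with missing values in each column

for columname in df.columns:
    # count() ignores NaN, so a smaller count means the column has missing values
    if df[columname].count()  != len(df):
        # Index labels of the rows where this column is null
        loc = df[columname][df[columname].isnull()].index.tolist()
        print('Column "{}": missing values at rows {}'.format(columname, loc))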
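
The same information can be gathered without an explicit loop over `count()`. A minimal sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({
    "value": [1.0, np.nan, 3.0],
    "value1": [np.nan, np.nan, 6.0],
    "salary": ["a", "b", "c"],
})

# Number of missing values per column
missing_counts = demo.isna().sum()

# Rows that have a missing value in at least one column
rows_with_any = demo[demo.isna().any(axis=1)]

# Index labels of the missing entries, only for columns that have any
missing_positions = {
    col: demo.index[demo[col].isna()].tolist()
    for col in demo.columns
    if demo[col].isna().any()
}

print(missing_counts)
print(missing_positions)
```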

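7. Extracting rows where a column is not numeric or is (contains) a string
7.1 Extract rows where the education value is not a string

# DataFrame.append was removed in pandas 2.0, so filter with a boolean mask instead of appending row by row
temp = df[~df['education'].map(lambda v: isinstance(v, str))]
temp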
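
For the opposite question, which entries of a mixed column are not numeric, one possible approach is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into NaN. A sketch on a hypothetical mixed-type column:

```python
import pandas as pd

# Hypothetical column mixing strings and numbers
demo = pd.DataFrame({"education": ["Bachelor", 111, "Master", 3.5]})

# Rows whose value is not a Python str
is_str = demo["education"].map(lambda v: isinstance(v, str))
non_strings = demo[~is_str]

# Rows whose value cannot be interpreted as a number
numeric = pd.to_numeric(demo["education"], errors="coerce")
non_numeric = demo[numeric.isna()]

print(non_strings.index.tolist(), non_numeric.index.tolist())
```

Note that `to_numeric` would also parse a string such as "42" as a number, so the two masks are not exact complements in general.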

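7.3 Extract rows where education is '硕士' (master's degree)


# Method 1: exact match
df[df['education'] == '硕士']

# Method 2: substring match; na=False treats non-string and missing values as no match
results = df['education'].str.contains('硕士', na=False)
df[results]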

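8. Other extraction operations
8.1 Extract rows where education is '本科' (bachelor's) or '硕士' (master's), showing only the education and salary columns


# Method 1: boolean mask, then column selection
df[df['education'].isin(['本科', '硕士'])][['education', 'salary']]

# Method 2: select rows and columns in one step with loc
df.loc[df['education'].isin(['本科', '硕士']), ['education', 'salary']]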

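8.2 Extract rows where salary starts with '25k'


# Method 1: str.match anchors a regular expression at the start of each string
df[df['salary'].str.match('25k')]

# Method 2: str.startswith compares a literal prefix
df[df['salary'].str.startswith('25k')]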
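
The two methods give the same result for '25k', but they can differ when the prefix contains regular-expression metacharacters. A small sketch with hypothetical salary strings:

```python
import re
import pandas as pd

# Hypothetical salary strings, chosen to expose the difference
salary = pd.Series(["2.5k-3k", "215k-300k", "25k-35k"])

# str.match treats the pattern as a regex, so "." matches any character
regex_hits = salary.str.match("2.5")

# str.startswith compares the literal text
literal_hits = salary.str.startswith("2.5")

# Escaping the pattern makes str.match behave like a literal prefix test
escaped_hits = salary.str.match(re.escape("2.5"))

print(regex_hits.tolist(), literal_hits.tolist())
```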

8.3 Extract the numbers in the value column that do not appear in the value1 column

df['value'][~df['value'].isin(df['value1'])]

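8.4 Extract the most frequent numbers across the value and value1 columns


# Series.append was removed in pandas 2.0; pd.concat stacks the two columns instead
temp = pd.concat([df['value'], df['value1']])
temp.value_counts(ascending=False)
# The five most frequent values
temp.value_counts(ascending=False).index[:5]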

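8.5 Find the positions of the values in the value column that are divisible by 10


# Method 1: index labels of the matching rows
df[df['value'] % 10 == 0].index

# Method 2: integer positions via NumPy
np.argwhere(np.array(df['value'] % 10 == 0))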

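Homework exercise:


import re

df = pd.read_excel('/home/mw/input/pandas1206855/pandas120.xlsx')
df.head()

# Rows where salary is '25k-35k' and education is '本科' (bachelor's)
df1 = df[(df['salary']=='25k-35k') & (df['education']=='本科')]

# Rows where salary ends with '40k'
df2 = df[df['salary'].str.endswith('40k')]

# Rows whose salary midpoint (mean of the lower and upper bound, in k) is above 30
result = []
for x in df['salary']:
    # '25k-35k' splits into ['25', '', '35', '']: the bounds are at positions 0 and 2
    parts = re.split('[k-]', x)
    result.append((int(parts[0])+int(parts[2]))/2)
df['result'] = result
df3 = df[df['result']>30][['createTime','education','salary']]
len(df3)

# Stack the three results vertically
answer_2 = pd.concat([df1, df2, df3], axis=0)

# Flatten the three columns into a single 'answer' column and add a running id
data = pd.concat([answer_2.iloc[:,0],answer_2.iloc[:,1],answer_2.iloc[:,2]])
df = pd.DataFrame(data, columns=['answer'])
df['id'] = range(len(df))
df = df[['id', 'answer']]

# Save without the index; the utf-8-sig encoding writes a BOM so that Excel detects UTF-8
df.to_csv('answer_2.csv', index=False, encoding='utf-8-sig')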
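
The salary midpoint can also be computed without a Python loop. The sketch below is one possible vectorized variant, not the original solution; it assumes salaries follow the 'NNk-NNk' pattern and uses hypothetical sample strings:

```python
import pandas as pd

# Hypothetical salary strings; '--' stands for an unparseable entry
salary = pd.Series(["25k-35k", "10k-20k", "30k-40k", "--"])

# Each capture group becomes one column; non-matching rows become NaN
bounds = salary.str.extract(r"(\d+)k-(\d+)k").astype(float)
bounds.columns = ["low", "high"]

# Row-wise mean of the lower and upper bound
midpoint = bounds.mean(axis=1)

# Comparisons with NaN are False, so unparseable rows are simply excluded
above_30 = salary[midpoint > 30]

print(midpoint.tolist())
print(above_30.tolist())
```

Converting to float rather than int is a deliberate choice here, since NaN cannot be stored in a plain integer column.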

Learning resources:
https://www.heywhale.com/home/activity

Original: https://blog.csdn.net/Hexiaolian123/article/details/122583350
Author: 酸菜鱼摆摆
Title: Part 2: Pandas DataFrame operations and data extraction in Python
