Python Data Analysis (Part 2)

July 8, 2023, 3:47 AM • Artificial Intelligence

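Basic functionality for data in Series and DataFrame:

The reindex method creates a new object whose data is rearranged to match a new index. It works on both Series and DataFrame; the main difference is that a DataFrame can be reindexed along either its index (rows) or its columns.
Using reindex on a Series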

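import pandas as pd
import numpy as np
from pandas import Series, DataFrame
frame = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# Rearrange to the new index; 'f' has no existing value, so it becomes NaN
frame1 = frame.reindex(['a', 'b', 'c', 'd', 'f'])

frame2 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Forward-fill: the missing positions 1, 3, 5 take the previous value
frame2.reindex(range(6), method='ffill')
# Fill labels missing from the original ('e') with 0 instead of NaN
frame.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)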

Now the reindex method on a DataFrame

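import pandas as pd
import numpy as np
from pandas import Series, DataFrame
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Reindex the rows; the new row 'b' is filled with NaN
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
states = ['Texas', 'Utah', 'California']
# Reindex the columns; 'Utah' is new, so it is all NaN
frame.reindex(columns=states)
frame.reindex(columns=states).ffill()

# Reindex rows and columns together. frame.loc[[...], states] raises
# KeyError in pandas 1.0+ when any requested label is missing.
frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)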

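Dropping entries from an axis

Dropping one or more entries from an axis only requires an index array or list. The drop method returns a new object with the specified values removed from the row axis or the column axis.

A Series is one-dimensional, so drop can only remove entries by their index (row) label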

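frame = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Returns a new Series without 'c'; frame itself is unchanged
frame_drop = frame.drop('c')

# inplace=True modifies frame directly and returns None
frame.drop('c', inplace=True)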

On a DataFrame, drop can remove labels from either axis
Pass axis=1 or axis='columns' to drop columns

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

# Without axis, labels are looked up in the row index, so only row labels
# may be passed here ('one' is a column and would raise KeyError)
data.drop(['Utah', 'Ohio'])
data.drop('four', axis=1)
data.drop(['two', 'four'], axis='columns')

data.drop('Ohio', axis='index')
data.drop(['Utah', 'New York'])

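np.arange() and range(): usage and differences
1. a = np.arange()

1. With one argument, the start defaults to 0 and the step to 1
a = np.arange(5) # Output: [0 1 2 3 4]
2. With two arguments, the step defaults to 1
a = np.arange(3, 6) # Output: [3 4 5]
3. With three arguments: here the start is 0, the stop is 3, and the step is 0.5
a = np.arange(0, 3, 0.5) # Output: [0.  0.5 1.  1.5 2.  2.5]
2. range(start, stop[, step])

start is the first value, stop is the end (exclusive), and step is the increment
3. arange() is a NumPy function, while range() is built into Python. The main differences: arange supports a fractional step and range does not; range returns a range object, while arange returns an ndarray. Both can be iterated over.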
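
The differences listed above can be checked directly. The following is a minimal sketch; the variable names (r, a, float_step_ok) are chosen here for illustration only.

```python
import numpy as np

# Built-in range: a lazy sequence that accepts integers only.
r = range(0, 3)
print(type(r).__name__)   # range
print(list(r))            # [0, 1, 2]

# np.arange: returns an ndarray and accepts a fractional step.
a = np.arange(0, 3, 0.5)
print(type(a).__name__)   # ndarray
print(a.tolist())         # [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]

# range rejects a float step with a TypeError.
try:
    range(0, 3, 0.5)
    float_step_ok = True
except TypeError:
    float_step_ok = False
print(float_step_ok)      # False
```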

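Indexing, selection, and filtering
Single-element indexing and slicing on a Series

frame = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

frame['b']                # select by label
frame.iloc[3]             # select by position (frame[3] is deprecated on a label index)
frame[2:4]                # positional slice, end excluded
frame[frame < 3]          # boolean filter
frame['a':'c']            # label slice, end included
frame[['a', 'c']] = 5     # assign to several labels at once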

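Indexing a DataFrame with a column name or a list of names selects one or more columns

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                   index=['age', 'year', 'Utah', 'New York'],
                   columns=['one', 'two', 'three', 'four'])
data['one']                      # a single column, as a Series
data[['one', 'three', 'four']]   # several columns, as a DataFrame
data[:2]                         # a row slice: the first two rows
data[2:]                         # rows from position 2 onward
data[data['four'] > 5]           # rows where column 'four' exceeds 5
data < 5                         # element-wise boolean DataFrame
data[data < 5] = 22              # assign 22 wherever the value is below 5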

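Using loc and iloc
loc[row labels, column labels]


data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                   index=['age', 'year', 'Utah', 'New York'],
                   columns=['one', 'two', 'three', 'four'])
data.loc['age', ['two', 'three']]   # row 'age', columns 'two' and 'three'
data.loc[:'Utah', 'two']            # label slices include the end label

iloc[row positions, column positions]

data.iloc[2, [3, 0, 1]]             # third row; columns at positions 3, 0, 1
data.iloc[3]                        # fourth row, as a Series
data.iloc[:, :3][data.three > 5]    # first three columns, rows where 'three' > 5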

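Integer indexes

pandas handles integer indexes differently from Python's built-in data structures such as lists and tuples.
When the index holds integers, a plain lookup such as frame[x] is ambiguous: x could be a label or a position, and the lookup can raise an error.
With a non-integer index there is no such ambiguity.
To be explicit, prefer loc (labels) and iloc (positions) when indexing.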
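
A small sketch of the ambiguity described above, assuming a Series with the default integer index. The names ser and ser2 are illustrative, not from the original post.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.0))                           # integer index 0, 1, 2
ser2 = pd.Series(np.arange(3.0), index=['a', 'b', 'c'])   # string index

# With an integer index, -1 is treated as a label, and no such label exists.
try:
    ser[-1]
    lookup_failed = False
except KeyError:
    lookup_failed = True

last_by_position = ser.iloc[-1]   # iloc is always position-based
first_by_label = ser.loc[0]       # loc is always label-based
last_of_ser2 = ser2.iloc[-1]      # unambiguous regardless of index type

print(lookup_failed, last_by_position, first_by_label, last_of_ser2)
```

Using loc and iloc keeps the intent explicit, so the code behaves the same whatever type the index happens to be.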

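Arithmetic and data alignment

pandas can perform arithmetic between objects that have different indexes.

If frame1 + frame2 involves index labels that do not all match, the index of the result is the union of the two indexes.
During alignment, labels that do not overlap yield the missing value NaN.
If the two objects share no row or column labels at all, every result is NaN (for DataFrames, alignment applies to both rows and columns).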
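
The alignment rules above can be seen in a short sketch. The sample data and names (s1, s2, df1, df2) are made up for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# The result index is the union a, b, c, d; 'a' and 'd' exist on one
# side only, so they come out as NaN.
total = s1 + s2
print(total)

# DataFrames align on rows and columns at the same time.
df1 = pd.DataFrame(np.arange(4.0).reshape(2, 2),
                   index=['x', 'y'], columns=['A', 'B'])
df2 = pd.DataFrame(np.arange(4.0).reshape(2, 2),
                   index=['y', 'z'], columns=['B', 'C'])
df_total = df1 + df2   # only the cell (y, B) is present in both
print(df_total)
```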
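Filling values in arithmetic methods

When indexes do not overlap, arithmetic returns NaN. To avoid NaN in the output, use the method form with a fill_value argument: frame.method(frame1, fill_value=xx), where method is an arithmetic method such as add.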
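
As a rough illustration of fill_value, the sketch below compares the + operator with the add method on two sample Series (the data is invented for this example).

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

plain = s1 + s2                     # 'a' and 'd' become NaN
filled = s1.add(s2, fill_value=0)   # a missing side is treated as 0

print(plain)
print(filled)
```

fill_value only substitutes for a value missing on one side; a label that is missing from both objects still produces NaN.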

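Sorting (sort) and ranking (rank)

Sorting orders an object by its row or column index, or by its values, in ascending or descending order.
Ranking assigns each value its rank within the data (lowest to highest or highest to lowest) and returns a new object holding the ranks.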

1. Sorting

sort_index(): sort by index

sort_index(axis=1, ascending=False)
axis: which index to sort. The default, axis=0 or axis='index', sorts by the row index; axis=1 or axis='columns' sorts by the column labels.
ascending: sort order. The default is ascending (ascending=True); ascending=False sorts in descending order.

obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame.sort_index()
frame.sort_index(axis='columns')

frame.sort_index(axis=1, ascending=True)

sort_values(): sort by value

    sort_values(ascending=False, by='b')
    by: the sort key; by='b' sorts by the data in column b.
    ascending: sort order (True for ascending, False for descending).

obj = pd.Series([4, 7, -3, 2])
obj.sort_values()
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame.sort_values(by='b')
frame.sort_values(by=['a', 'b'])

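2. Ranking with rank(): assign ranks to the data

df.rank(ascending=False, method='max')
ascending: ranking order (True for ascending, False for descending).
method: how ties are ranked; one of "average", "min", "max", "first", "dense". (By default, rank breaks ties by assigning each tied group the average rank.)

Method     Description
average    Default: assign the average rank to each value in a tied group
min        Use the minimum rank for the whole group
max        Use the maximum rank for the whole group
first      Assign ranks in the order the values appear in the data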

obj = pd.Series([7,-5,7,4,2,0,4])
obj.rank()

obj.rank(method='first')
obj.rank(ascending=False, method='max')
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame.rank(axis='columns')

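Axis indexes with duplicate labels

Data selection behaves differently when an index contains duplicate labels. If a label maps to several entries, selection returns a Series (or a DataFrame); if it maps to a single entry, selection returns a scalar.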
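
A minimal sketch of this behavior with made-up data. The is_unique check is an addition here, not something the original post mentions.

```python
import numpy as np
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
print(obj.index.is_unique)   # False: labels 'a' and 'b' are repeated

multi = obj['a']    # two entries share this label, so the result is a Series
single = obj['c']   # exactly one entry, so the result is a scalar

# The same applies to DataFrame rows: a repeated label selects a DataFrame.
df = pd.DataFrame(np.arange(8).reshape(4, 2),
                  index=['a', 'a', 'b', 'b'], columns=['x', 'y'])
rows = df.loc['b']

print(type(multi).__name__, single, rows.shape)
```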

Summarizing and computing descriptive statistics


df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                 [np.nan, np.nan], [0.75, -1.3]],
                 index=['a', 'b', 'c', 'd'],
                 columns=['one', 'two'])

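Method                                   Description
df.sum()                                 Returns a Series of column sums; pass axis=1 or axis='columns' to sum across each row instead
df.mean(axis='columns', skipna=False)    NA values are excluded automatically unless the whole slice (row or column) is NA; skipna=False disables this
df.idxmax()                              Indirect statistic: the index label where the maximum is attained (idxmin for the minimum)
df.cumsum()                              Cumulative sum
df.describe()                            Produces several summary statistics in one call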
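
The methods in the table can be tried on the df defined earlier in this section. The sketch below rebuilds that frame so it runs on its own; the result variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

col_sums = df.sum()                                 # one sum per column, NaN skipped
row_sums = df.sum(axis='columns')                   # one sum per row
row_means = df.mean(axis='columns', skipna=False)   # any NaN in a row gives NaN
max_labels = df.idxmax()                            # label of each column's maximum
running = df.cumsum()                               # cumulative sum down each column
summary = df.describe()                             # count, mean, std, min, quartiles, max

print(col_sums, row_sums, row_means, max_labels, sep='\n\n')
```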

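Correlation and covariance

Method      Description
corr()      Computes the correlation of the overlapping, non-NA, index-aligned values of two Series (on a DataFrame, the correlation matrix)
cov()       Computes the covariance (on a DataFrame, the covariance matrix)
corrwith()  Computes the correlation of a DataFrame's columns or rows with another Series or DataFrame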
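
A short sketch of the three methods on a small invented frame, where y is exactly 2x and z decreases as x increases:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                   'y': [2.0, 4.0, 6.0, 8.0],
                   'z': [4.0, 3.0, 2.0, 1.0]})

r_xy = df['x'].corr(df['y'])    # Series to Series correlation
c_xy = df['x'].cov(df['y'])     # Series to Series sample covariance
corr_matrix = df.corr()         # pairwise correlation matrix
cov_matrix = df.cov()           # pairwise covariance matrix
with_x = df.corrwith(df['x'])   # each column correlated with x

print(r_xy, c_xy)
print(corr_matrix)
print(with_x)
```

Because y is a positive multiple of x, their correlation is 1 up to floating-point error, and z correlates with x at -1.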

Unique values, value counts, and membership

1. unique() returns an array of the unique values in a Series

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

2. value_counts computes how often each value occurs in a Series

obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

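3. value_counts can be applied to any array or sequence by wrapping it in a Series (the top-level pd.value_counts is deprecated in recent pandas)

pd.Series(obj.values).value_counts(sort=False)

a    3
b    2
c    3
d    1
dtype: int64
(with sort=False, the row order may differ between pandas versions)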

4. isin performs a vectorized set-membership check and can be used to filter a Series or a DataFrame column down to a subset of values

mask = obj.isin(['b', 'c'])

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
obj[mask]
0    c
5    b
6    b
7    c
8    c
dtype: object

5. Index.get_indexer returns an array of positions that maps an array of possibly repeated values onto another array of distinct values

to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])

unique_vals = pd.Series(['c', 'b', 'a'])

 pd.Index(unique_vals).get_indexer(to_match)

array([0, 2, 1, 1, 0, 2])

Original: https://blog.csdn.net/ex_6450/article/details/125818173
Author: 小白只对大佬的文章感兴趣
Title: python—数据分析(二)
