python数据分析—— pandas

2023年7月7日下午10:33 • 人工智能 • 阅读 75

python数据分析—— pandas

1.模块导入
2. Series对象的创建和索引
3 DataFrame的创建及相关属性
4 DataFrame修改索引、添加数据及删除数据
*
4.1 DataFrame修改index columns
4.2 添加数据
4.3 删除数据
5 数据处理
6 数据合并
7 多层索引
8 时间序列
9 分组聚合
10 读取、写入 Excel
11 获取数据库中的数据

1.模块导入

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

Series对象的创建和索引

se = Series([value for value in range(5,10)],index = [value for value in range(1,6)])
se.values
se.index
se.items
list(se.iteritems())

dict = {'a':1,'b':3,"c":555}
se = Series(dict)

se['b']
se[1]

se['a':'c']
se[0:2]

se.index = list('bnm')

se.drop(['b'])

"""
    两个Series之间的计算是根据索引进行的
    也支持np的函数
"""
series1 = Series(range(1,4),index = list('abc'))
series2 = Series(range(100,103),index = list('cab'))
series1 + series2
series1**2
np.sqrt(series1)

3 DataFrame的创建及相关属性


df1 = DataFrame(np.random.randint(1,16,(3,5)),index = [value for value in range(1,4)],
                columns = [value for value in range(1,6)])
df2 = DataFrame({'烷烃':[value for value in np.random.randint(1,100,5)],
                 '烯烃':[value for value in np.random.randint(1,100,5)],
                 '芳烃':[value for value in np.random.randint(1,100,5)],
                  })

df3 = DataFrame({'烷烃':Series([value for value in np.random.randint(1,100,4)],index = list("abce")),
                 '烯烃':Series([value for value in np.random.randint(1,100,5)],index = list("abcde")),
                 '芳烃':Series([value for value in np.random.randint(1,100,5)],index = list("abcde")),
                })

df3.to_dict()

df = DataFrame({'烷烃':Series([value for value in np.random.randint(1,100,4)],index = list("abce")),
                 '烯烃':Series([value for value in np.random.randint(1,100,5)],index = list("abcde")),
                 '芳烃':Series([value for value in np.random.randint(1,100,5)],index = list("abcde")),
                })
df.shape
df.index.tolist()
df.columns.tolist()
df.dtypes
df.ndim
df.values
df.info()
df.head(2)
df.tail(3)

df["烷烃"]
df[["烷烃","烯烃"]]

df["a":"a"]
df[0:1]

df["a":"c"]

df.loc[["a","c"],"烷烃":"芳烃"]
df.iloc[0:2,2]

df.loc["a","烯烃"]

df.sort_values(by = "烯烃", ascending = False)

4 DataFrame修改索引、添加数据及删除数据

4.1 DataFrame修改index columns

df = DataFrame({'烷烃':Series([value for value in np.random.randint(1,100,4)],index = list("abce")),
                 '烯烃':Series([value for value in np.random.randint(1,100,5)],index = list("abcde")),
                 '芳烃':Series([value for value in np.random.randint(1,100,5)],index = list("abcde")),
                })

df.index = list("jsais")

def map(x):
    return x+'wj'
df.rename(index = map,columns = map,inplace = False)
df.rename(index = {"j":"wj"},columns = {"烷烃":"不饱和烃"},inplace = True)

df.set_index("不饱和烃",drop = "True")
df.index.name = None

4.2 添加数据

df = DataFrame({'烷烃':Series([value for value in np.random.randint(1,100,4)],index = list("abce")),
                 '烯烃':Series([value for value in np.random.randint(1,100,5)],index = list("abcde")),
                 '芳烃':Series([value for value in np.random.randint(1,100,5)],index = list("abcde")),
                })

df["密度"] = [value for value in np.random.randint(700,800,5)]

df.insert(0,"溴值",[value for value in np.random.randint(20,80,5)])

df.iloc[1] = [value for value in np.random.randint(20,80,5)]

df.append(DataFrame({"溴值":5,"烷烃":10,"烯烃":66,"芳烃":9888,"密度":66},index = ["j"]))

df1 = DataFrame({"语文":Series([i for i in np.random.randint(1,100,20)],index = range(1,21)),
                 "数学":Series([i for i in np.random.randint(1,100,20)],index = range(1,21)),
})

df2 = DataFrame({"英语":Series([i for i in np.random.randint(1,100,20)],index = range(22,42)),
                 "化学":Series([i for i in np.random.randint(1,100,20)],index = range(22,42)),
})

pd.concat([df1,df2],axis = 0)
pd.concat([df1,df2],axis = 1)

4.3 删除数据

df = DataFrame({'烷烃':Series([value for value in np.random.randint(1,100,4)],index = list("abce")),
                 '烯烃':Series([value for value in np.random.randint(1,100,5)],index = list("abcde")),
                 '芳烃':Series([value for value in np.random.randint(1,100,5)],index = list("abcde")),
                })
df.drop(["烷烃"],axis = 1, inplace = False)

df.drop(["a"],axis = 0, inplace = False)

5 数据处理

import pandas as pd
from pandas import Series,DataFrame
import numpy as np
from numpy import nan as NaN

se1 = Series([value for value in np.random.randint(1,100,10)],index = np.arange(1,11))
se2 = Series([value for value in np.random.randint(555,10000,10)],index = np.arange(1,11))

df1 = DataFrame({
                "数学":Series([80+value*10 for value in np.random.randn(10)],index = np.arange(1,11)),
                "语文":Series([75+value*8 for value in np.random.randn(10)],index = [ i for i in np.arange(1,16) if (i not in [5,8,9,14,3] )]),
                "英语":Series([75+value*8 for value in np.random.randn(10)],index = [ i for i in np.arange(1,16) if (i not in [2,8,4,5,14] )]),
                "理综":Series([75+value*8 for value in np.random.randn(10)],index = [ i for i in np.arange(1,16) if (i not in [9,7,3,6,14] )])
})

df1.append(DataFrame({"数学":NaN,"语文":NaN,"理综":NaN,"英语":NaN,},index = [14]),sort=False)
http://localhost:8888/notebooks/python/data%20miniing/wj_py/ipynb/pandas.ipynb

df1.dropna()
df1.dropna(how = "all")
df1.dropna(how = "all",axis = 1)
df1.dropna(thresh = 2)

df1.isnull()
df1.notnull()
df1[df1.notnull()]

df1.fillna(0,inplace = False)
df1.fillna({"数学":60,"英语":70})
df1.loc[:,"数学"].fillna(50)

df1.fillna(method = "ffill",axis = 0)
df1.fillna(method = "bfill",axis = 1)
df1.fillna(method = "ffill",axis = 0,limit = 1)

df1 = df1.append(df1.loc[2,:])
df1.duplicated()
df1.drop_duplicates()
df1.drop_duplicates(["语文"])
df1.drop_duplicates(["语文","英语"],keep = "last")
df1.drop_duplicates(["语文"],inplace = False)

6 数据合并

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

df1 = DataFrame((60+10*np.random.randn(50)).reshape(10,5),index = np.arange(1,11),
                columns = ["math","English","chemisry","physics","biology"])
df1.index.name = "编号"
df1.columns.name = "成绩"

df2 = DataFrame((70+8*np.random.randn(18)).reshape(6,3),index = np.arange(8,14),
                columns = ["P.E.","history","politics"])

df1.join(df2,how = "left")
df1.join(df2,how = "right")
df1.join(df2,how = "outer")

df1 = DataFrame((60+10*np.random.randn(50)).reshape(10,5),index = np.arange(1,11),
                columns = ["math","English","chemisry","physics","biology"])
df1.insert(0,"name",["A","B","C","D","E","G","H","I","J","K"])
df1.index.name = "编号"
df1.columns.name = "成绩"

df2 = DataFrame((70+8*np.random.randn(18)).reshape(6,3),index = np.arange(1,7),
                columns = ["P.E.","history","politics"])
df2.insert(0,"name",["A","B","C","M","N","O"])

pd.merge(df1,df2,how = "inner")
pd.merge(df1,df2,on = "name", how = "inner")
pd.merge(df1,df2,on = "name", how = "left")
pd.merge(df1,df2,on = "name", how = "right")
pd.merge(df1,df2,on = "name", how = "outer")

7 多层索引

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

se1 = Series([value for value in np.arange(1,7)],
             index = [["math","math","English","English","History","History"],
                      ["midterm","end of a term","midterm","end of a term","midterm","end of a term"]])

df1 = DataFrame(np.arange(1,19).reshape(6,3),
               index = [["math","math","English","English","History","History"],
                      ["midterm","end of a term","midterm","end of a term","midterm","end of a term"]],
               columns = list("abc"))

class1 = ["math","History","English"]
class2 = ["midterm","end of a term"]
m_index = pd.MultiIndex.from_product([class1,class2])
df2 = DataFrame(np.arange(1,19).reshape(6,3),
               index = m_index,
               columns = list("abc"))

se1["math"]
se1["math","midterm"]

df1.loc["math"]
df1.loc[["math","midterm"]]

8 时间序列

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

"""
时间序列频率：D        日历日的每一天
              B        工作日的每一天
              H        每小时
              T/min    每分钟
              S        每秒
              L/ms     毫秒
              U       微秒
              M        日历日的月底日期
              BM       工作日的月底日期
              MS        日历日的月初日期
              BMS       工作日的月初日期
"""

date = pd.date_range(start = "20190101",end = "20200203")
date = pd.date_range(start = "2019-02-03",end = "2019-03-05",periods = 10)
date = pd.date_range(start = "2020-03-01 08:00:00",end = "2020-04-05 00:00:00",freq  = "10min")
data = pd.date_range(start = "2020-01-08 12:00:00",end = "2020-04-05 00:00:00",
                     periods = 20,closed = None)

date = pd.date_range(start = "2020-01-01 08:00:00",end = "2028-12-31 08:00:00",periods = 1000)
df = DataFrame(np.arange(5000).reshape((1000,5)),index = date)
df.truncate(before="2020-08")
df.truncate(after="2020-08")

df["2020"]
df["2020-05"]
df["2020-05-01"]
df["2020-08-08":"2021-01-01"]

df.between_time("09:30","18:00")

date = pd.date_range(start = "2020-01-01 08:00:00",end = "2020-01-10 08:00:00",periods = 10)
se = Series(np.arange(10),index = date)

se.shift(periods = 2,freq = "d")

se.tshift(10)

pd.to_datetime(1554970740000,unit = "ms")

pd.to_datetime(1554970740000,unit = "ms").tz_localize('UTC').tz_convert('Asia/Shanghai')

pd.to_datetime("2020年10月23日",format = "%Y年%m月%d日")

data = pd.read_excel("./output.xlsx")
data = data.set_index("time",drop = True)
data.index.name = None
data.index = pd.to_datetime(data.index)

result = data[(data.index.weekday_name == "Monday") & (data.index.time == pd.to_datetime("08:00:00").time())]
dir(data.index)

9 分组聚合

import numpy as np
import pandas as pd
from pandas import Series,DataFrame

df = DataFrame({
                "name":Series(np.random.choice(["Jim","Tom","Cyrus","Bob","Smith"],100),index = np.arange(1,101)),
                "gender":Series(np.random.choice(["male","female"],100),index  = np.arange(1,101)),
                "salary":Series(np.random.randint(5000,18000,100),index = np.arange(1,101)),
                "Year":Series(np.random.choice([2015,2016,2017,2018,2019,2020],100),index = np.arange(1,101))
})

group_name = df.groupby("name")

group_name.groups
group_name.count()

group_name.get_group("Bob")
df["salary"].groupby(df["name"])

group_mul = df.groupby(["name","Year"])

group_mul.get_group(("Bob",2016))

salary_group = pd.cut(df["salary"],bins = [5000,10000,15000,18000])
sg = df.groupby(salary_group)
sg.count()
pd.crosstab(salary_group,df["Year"])

"""
    聚合函数：
    mean
    count
    sum
    median
    std
    var
    min
    max
    prod #非nan的积
    first
    last
    mad
    mode
    abs

"""

df.groupby("name").sum()
df["salary"].groupby(df["name"]).mean()
df.groupby("name")["salary"].var()

df.groupby("name").agg(["min","max","sum","std"])

def vary(df):
    return abs(df.min()-df.max())
df.groupby("name").agg(vary)
df.groupby("Year").agg(["sum","mean",vary])
df.groupby("Year").agg(["sum","mean",("极值",vary)])
str = {
    "salary":["sum","mean"],
    "Year":vary
}
df.groupby("name").agg(str)

"""
apply 函数是pandas中自由度最高的函数
"""

def year(Year):
    if Year == 2016:
        return Year*10
    else:
        return Year
df["Year"].apply(year)

def wj(df,name,n):
    return df.sort_values(by = name)[-n:]
df.groupby("name").apply(wj,name = "salary",n = 2)

10 读取、写入 Excel

import pandas as pd

data = pd.read_excel("./new_data.xlsx",header = 1)
"""
  指定第一行为列索引  默认为第0行，header可设置为None，此时列索引变为原来的数字索引
"""
data = pd.read_excel("./new_data.xlsx",header = None,sheet_name = "Sheet2")
"""
多个表时指定表名
"""
data = pd.read_excel("./new_data.xlsx",header = None,sheet_name = "Sheet3",skiprows = 9,usecols = "F:H")
"""
拿指定数据
skiprows:跳过哪几行
usecols:使用哪几列 多写一列
"""

data = data.set_index("采样日期")
data.to_excel("./wj.xlsx")

11 获取数据库中的数据

import pymysql
import pandas

conn = pymysql.connect(host = "localhost",user = "root",passwd = "*******",db = "demo",port = 3309,charset = "utf8")
"""
host:本机或远程数据库
port:端口号 一般3309
"""
query = "SELECT id,name FROM num_table"

df = pd.read_sql_query(query,conn)

by CyrusMay 2022 04 05

Original: https://blog.csdn.net/Cyrus_May/article/details/123969305
Author: CyrusMay
Title: python数据分析—— pandas

原创文章受到原创版权保护。转载请注明出处：https://www.johngo689.com/677221/

转载文章受原作者版权保护。转载请注明原作者出处！

人工智能

【自取】最近整理的，有需要可以领取学习：

Linux核心资料大放送~

全栈面试题汇总（持续更新&可下载）

一个提高学习100%效率的工具！

【超详细】深度学习面试题目！

LeetCode Python刷题答案下载！

LeetCode Java版刷题答案下载！

LeetCode C++ 版本，抓紧保存！

LeetCode GO语言刷题答案下载！

matlab:KNN实现手写数字识别

KNN算法是一种基于概率的距离算法，简单的来说就是靠近算法。首先对给出的训练集和测试集进行特征提取，计算测试集与训练集的特征欧式距离，也就是所谓的二范数。计算测试集与训练集之间的…

人工智能 2023年7月1日
00119
【声源定位】基于matlab单声源双麦克风房间冲激响应【含Matlab源码 547期】

⛄一、获取代码方式获取代码方式1：完整代码已上传我的资源：【声源定位】基于matlab单声源双麦克风房间冲激响应【含Matlab源码 547期】点击上面蓝色字体，直接付费下载，即…

人工智能 2023年5月25日
0086
机器学习必会面试知识点

啊哦~你想找的内容离你而去了哦内容不存在，可能为如下原因导致： ① 内容还在审核中 ② 内容以前存在，但是由于不符合新的规定而被删除 ③ 内容地址错误 ④ 作者删除了内容。可…

人工智能 2023年7月31日
0048
empty怎么发音_empty怎么读

用中文说，但它不是翻译机。我需要有人来接吗？ [En] Say it in Chinese, but it’s not a translator. Do I need …

人工智能 2023年5月27日
00106
解析DeepLabv3+的网络结构及代码【Pytorch版】

论文地址：https://arxiv.org/pdf/1802.02611.pdf论文笔记：https://blog.csdn.net/oYeZhou/article/detail…

人工智能 2023年7月23日
0053
可视化软件之Power BI的安装与下载

一、Power BI简介 1、概念 Power BI是微软官方推出的可视化数据探索和交互式报告工具。 PowerBI是一款强大的可视化软件，它可以轻松的连接到数百个数据源，简化数…

人工智能 2023年7月14日
00141
Python 毕设精品实战案例——快速索引目录Part2

温馨提示：文末有 CSDN 平台官方提供的学长 QQ 名片 :) 基于性别网上学习特征分析及可视化系统本项目通过调查问卷获取大学生的网上在线学习的特征数据，利用python分析不…

人工智能 2023年6月24日
0088
XTDorne平台搭建无人机仿真环境时遇到的问题及解决

执行make px4_sitl_default gazebo遇到的问题 问题1：No package ‘gstre…

人工智能 2023年6月2日
0054
音乐音频 | 语音识别与音乐流派分类

文章目录语音识别步骤一、用SVM做音乐分类应用实例 * 1、数据集：EchoNest。 2、代码： – ①数据归一化。沿着最大方差轴旋转数据，确定数据的每个特征对…

人工智能 2023年5月25日
0098
L3/L4级自动驾驶拐点已至，谁在抢跑线控底盘千亿级风口？

高阶自动驾驶拐点已至。日前，工信部发布《关于开展智能网联汽车准入和上路通行试点工作的通知（征求意见稿），将对L3/L4级自动驾驶车辆进行准入管理并展开试点工作，这将再次助推L3/…

人工智能 2023年6月27日
0096
年末巨制：知识图谱嵌入方法研究总结

泽宇对自己的研究方向也就是知识图谱嵌入（KG Embedding）技术进行了总结，并制作了一套slides，力求让大家只看slides就可以理解并掌握知识图谱嵌入的相关知识和最新研…

人工智能 2023年6月1日
00121
通俗易懂方差（Variance）和偏差（Bias）

看了沐神的讲解，恍然大悟，b站可以不刷，但沐神一定要看。在统计模型中，通过方差和偏差来衡量一个模型。 1 方差和偏差的概念偏差（Bias）：预测值和真实值之间的误差方差（Var…

人工智能 2023年6月15日
0083
Python-混合高斯分布(GMM)的应用及K-S检验

内容简介：对几种经典的单分布模型与混合高斯分布模型进行简单介绍，随后用于交通数据分析，并进行K-S检验，讨论模型拟合情况。其中，单分布模型有：正态分布、Gamma分布、Weibu…

人工智能 2023年7月15日
0074
基于泰尔森回归的股票预测研究

基于泰尔森回归的股票预测研究绪论 * 背景目的流程主要内容 * 数据获取与数据存储数据调取以及案例数据分析模型比较分析 – 2.3.1 模型初始化 2.3….

人工智能 2023年6月18日
0090
数据分析—pandas的高级操作

一、替换操作(replace) 替换操作可以同步作用与Series和DataFrame中 df.replace(to_replace , value , replace , &#8…

人工智能 2023年6月11日
0074
【毕业设计/课程设计】基于机器学习的情感分类与分析算法设计与实现

啊哦~你想找的内容离你而去了哦内容不存在，可能为如下原因导致： ① 内容还在审核中 ② 内容以前存在，但是由于不符合新的规定而被删除 ③ 内容地址错误 ④ 作者删除了内容。可…

人工智能 2023年6月30日
00144

2024 年 5 月
一	二	三	四	五	六	日
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

python数据分析—— pandas

python数据分析—— pandas

4.1 DataFrame修改index columns

4.2 添加数据

4.3 删除数据

大家都在看