Getting the Data Type Distribution of a Dataset with pandas (Finer-Grained Splitting)

July 16, 2023, 9:40 AM • Artificial Intelligence • 88 reads

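Method 1: Use pandas' built-in interface

pandas offers several ways to find out which columns have which data type. Using the Titanic dataset as an example:

1. Get the numerical columns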

df.select_dtypes('number').columns

Index(['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')

2. Get the categorical columns (stored with the object dtype)

df.select_dtypes('object').columns

Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')

3. Get the columns that use the pandas category dtype (there are none in this dataset)

df.select_dtypes('category').columns

Index([], dtype='object')
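The three calls above can be tried without downloading the Titanic file. The following is a minimal sketch on a tiny made-up frame; the column values and the explicit dtypes are assumptions chosen for illustration, not the real data. It also shows the `exclude` argument, which gives "everything that is not numeric" in one call.

```python
import pandas as pd

# Tiny made-up frame standing in for the Titanic data (values are illustrative).
# dtypes are set explicitly so the example behaves the same across pandas versions.
df = pd.DataFrame({
    "Age": [22.0, 38.0, None],
    "Pclass": [3, 1, 3],
    "Sex": pd.Series(["male", "female", "female"], dtype="object"),
    "Embarked": pd.Series(["S", "C", "S"], dtype="category"),
})

numeric_cols = df.select_dtypes(include="number").columns.tolist()
object_cols = df.select_dtypes(include="object").columns.tolist()
category_cols = df.select_dtypes(include="category").columns.tolist()

# exclude= inverts the selection: every column that is not numeric.
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols, object_cols, category_cols, non_numeric_cols)
```

Calling `.tolist()` on the resulting Index gives plain Python lists, which are easier to pass on to later processing steps.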

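Method 2: Analyze the output of pandas_profiling

Again using the Titanic dataset: if you only want to look at the report and copy-paste from it, nothing more is needed. But if you want a fully automated pipeline, the ProfileReport results have to feed into the later data-processing steps, so it is worth writing them to a JSON file.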

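# Newer releases of this package are published as ydata-profiling
# (from ydata_profiling import ProfileReport).
from pandas_profiling import ProfileReport
import pandas as pd

df = pd.read_csv('train.csv', index_col=['PassengerId'])
report = ProfileReport(df, dark_mode=True, explorative=True)

# The output format is chosen from the file extension (.json here).
report.to_file('result.json')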

The JSON file has a nested structure; you can pull out whichever statistics you need, although doing so is a bit tedious.
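As a rough sketch of how the exported file could be consumed, the helper below groups column names by their reported type. It assumes the JSON has a top-level "variables" object keyed by column name, each entry carrying a "type" field; that layout and the type labels may differ between package versions, so check your own result.json first. The function name `group_columns_by_type` and the sample data are hypothetical.

```python
import json
from collections import defaultdict


def group_columns_by_type(report: dict) -> dict:
    """Group column names by the 'type' field of each entry in 'variables'.

    Assumes report["variables"] maps column name -> dict with a "type" key.
    """
    groups = defaultdict(list)
    for name, stats in report.get("variables", {}).items():
        groups[stats.get("type", "Unknown")].append(name)
    return dict(groups)


# Hand-written stand-in for result.json (assumed shape, not real output).
# With a real file: report = json.load(open('result.json', encoding='utf-8'))
sample = json.loads("""
{"variables": {
    "Age":  {"type": "Numeric"},
    "Fare": {"type": "Numeric"},
    "Sex":  {"type": "Categorical"}
}}
""")

grouped = group_columns_by_type(sample)
print(grouped)
```

Using `.get` with defaults keeps the helper from failing when a key is absent, which is useful while the exact layout is still being explored.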

Method 3: Write your own function (which amounts to analyzing the dataset all over again)

def cols_spliting(df: pd.DataFrame, cardinality=10, high_missing_per=0.7, drop_high_missing=True):
    # Note: drop_high_missing is accepted but not used in this version.
    assert len(df.index) != 0

    binary_categorical_cols = []
    thin_categorical_cols = []
    uniform_categorical_cols = []
    categorical_cols = []
    numerical_cols = []
    other_cols = []

    high_missing_cols = []
    small_missing_cols = []
    missing_cols = []

    count = df.shape[0]

    for col in df.columns:
        unique = df[col].nunique()  # number of distinct values, NaN excluded
        dtype = df[col].dtype

        missing_count = df[col].isnull().sum()
        per = missing_count / count  # fraction of missing rows

        # Determine the column type
        if unique <= 2 and dtype == 'object':
            binary_categorical_cols.append(col)
            categorical_cols.append(col)

        elif unique > 2 and unique <= cardinality and dtype == 'object':
            thin_categorical_cols.append(col)
            categorical_cols.append(col)

        elif unique > cardinality and dtype == 'object':
            uniform_categorical_cols.append(col)
            categorical_cols.append(col)

        elif dtype in ['int64', 'float64']:
            numerical_cols.append(col)

        else:
            other_cols.append(col)

        # Determine the missing-value level
        if per > 0 and per <= high_missing_per:
            small_missing_cols.append(col)
            missing_cols.append(col)

        elif per > high_missing_per:
            high_missing_cols.append(col)
            missing_cols.append(col)

    print('--------------col types---------------')
    print('categorical cols with 1-2 distinct values: ' + str(binary_categorical_cols))
    print('categorical cols with 3-{} distinct values: '.format(cardinality) + str(thin_categorical_cols))
    print('categorical cols with more than {} distinct values: '.format(cardinality) + str(uniform_categorical_cols))
    print('categorical cols: ' + str(categorical_cols))
    print('numerical cols : ' + str(numerical_cols))
    print('-------------missing cols-------------')
    # print('other cols: ' + str(other_cols))
    print('missing cols with more than {} : '.format(high_missing_per) + str(high_missing_cols))
    print('missing cols with less than {} : '.format(high_missing_per) + str(small_missing_cols))
    print('missing cols: ' + str(missing_cols))

Using the Titanic dataset as an example:

cols_spliting(df)

The missing-value part of the output is:

missing cols with more than 0.7 : ['Cabin']
missing cols with less than 0.7 : ['Age', 'Embarked']
missing cols: ['Age', 'Cabin', 'Embarked']
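Because the goal is full automation, it can be more convenient to get the groups back as data instead of printed text. The sketch below is a hypothetical variant, `split_cols`, which applies the same thresholds as above but returns a dict. The function name, the dict keys, and the demo frame are assumptions made for illustration; they are not part of the original article.

```python
import pandas as pd


def split_cols(df: pd.DataFrame, cardinality: int = 10,
               high_missing_per: float = 0.7) -> dict:
    """Hypothetical variant of cols_spliting that returns its groups."""
    groups = {name: [] for name in (
        "binary_categorical", "thin_categorical", "uniform_categorical",
        "numerical", "other", "small_missing", "high_missing")}

    for col in df.columns:
        unique = df[col].nunique()       # distinct values, NaN not counted
        dtype = str(df[col].dtype)

        if dtype == "object":
            if unique <= 2:
                groups["binary_categorical"].append(col)
            elif unique <= cardinality:
                groups["thin_categorical"].append(col)
            else:
                groups["uniform_categorical"].append(col)
        elif dtype in ("int64", "float64"):
            groups["numerical"].append(col)
        else:
            groups["other"].append(col)

        per = df[col].isnull().mean()    # fraction of missing rows
        if 0 < per <= high_missing_per:
            groups["small_missing"].append(col)
        elif per > high_missing_per:
            groups["high_missing"].append(col)

    return groups


# Made-up demo frame (not the Titanic data).
demo = pd.DataFrame({
    "sex": pd.Series(["m", "f", "f", "m"], dtype="object"),
    "port": pd.Series(["S", "C", "Q", "S"], dtype="object"),
    "fare": [7.25, 71.28, 8.05, 53.1],
    "age": [22.0, None, 26.0, 35.0],
    "cabin": pd.Series([None, None, None, "C85"], dtype="object"),
})
groups = split_cols(demo)
print(groups)
```

A later step could then, for example, drop the columns listed under `groups["high_missing"]` with `demo.drop(columns=groups["high_missing"])`.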

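If you have further requirements for the output and want it to be better suited to machine learning, take a look at this follow-up:

pandas Dataset Type Splitting II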

Original: https://blog.csdn.net/RuGe_Lee/article/details/123301924
Author: 21岁害怕编程
Title: Getting the Data Type Distribution of a Dataset with pandas (Finer-Grained Splitting)
