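Using the category data type in pandas (repost)

Creating a category

Creating from a Series

The examples below assume that pandas and NumPy have been imported as pd and np.

Passing dtype="category" when creating a Series produces a categorical Series. A categorical has two parts: the ordering (whether the categories are ordered) and the literal category values:

In [1]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [2]: s
Out[2]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']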
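As a rough illustration of those two parts, the sketch below inspects the same Series through the .cat accessor. The integer codes are not discussed in the text above; they are shown here only because they make the split between labels and per-row values visible.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# The distinct labels, stored once.
print(list(s.cat.categories))   # ['a', 'b', 'c']

# One small integer per row, pointing into the categories.
print(s.cat.codes.tolist())     # [0, 1, 2, 0]

# The ordering flag; False unless ordering was requested.
print(s.cat.ordered)            # False
```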

A Series that already lives in a DataFrame can be converted to a category:

In [3]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

In [4]: df["B"] = df["A"].astype("category")

In [5]: df["B"]
Out[5]:
0    a
1    b
2    c
3    a
Name: B, dtype: category
Categories (3, object): ['a', 'b', 'c']

You can also build a pandas.Categorical first and pass it to the Series constructor. Values that are not among the given categories become NaN:

In [10]: raw_cat = pd.Categorical(
   ....:     ["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=False
   ....: )
   ....:

In [11]: s = pd.Series(raw_cat)

In [12]: s
Out[12]:
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b', 'c', 'd']
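The NaN rows in that output can be checked directly. The following is a small sketch of the same construction; the use of isna() and of the integer codes is an addition for illustration.

```python
import pandas as pd

raw_cat = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"])
s = pd.Series(raw_cat)

# "a" is not among the categories, so both occurrences are missing.
print(s.isna().tolist())          # [True, False, False, True]

# "d" is a valid category even though no row uses it.
print("d" in s.cat.categories)    # True

# Missing values are stored with the code -1.
print(s.cat.codes.tolist())       # [-1, 0, 1, -1]
```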

Creating from a DataFrame

dtype="category" can also be passed when constructing a DataFrame:

In [17]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")

In [18]: df.dtypes
Out[18]:
A    category
B    category
dtype: object

Both A and B in the DataFrame are categorical, and each column infers its own categories:

In [19]: df["A"]
Out[19]:
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [20]: df["B"]
Out[20]:
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (3, object): ['b', 'c', 'd']

Alternatively, df.astype("category") converts every column of an existing DataFrame to a category:

In [21]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

In [22]: df_cat = df.astype("category")

In [23]: df_cat.dtypes
Out[23]:
A    category
B    category
dtype: object
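When only some columns should become categorical, astype also accepts a mapping from column name to dtype. This is a minimal sketch; the extra numeric column "n" is made up for the example.

```python
import pandas as pd

# "n" is a hypothetical numeric column added for illustration.
df = pd.DataFrame({"A": list("abca"), "B": list("bccd"), "n": [1, 2, 3, 4]})

# Convert only the label columns; "n" keeps its integer dtype.
df_cat = df.astype({"A": "category", "B": "category"})
print(df_cat.dtypes)
```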

Controlling creation

By default, a category created with dtype="category" behaves as follows:

1. The categories are inferred from the data.

2. The categories are unordered.

Both defaults can be changed by explicitly creating a CategoricalDtype:

In [26]: from pandas.api.types import CategoricalDtype

In [27]: s = pd.Series(["a", "b", "c", "a"])

In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)

In [29]: s_cat = s.astype(cat_type)

In [30]: s_cat
Out[30]:
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): ['b' < 'c' < 'd']

The same CategoricalDtype can be applied to a DataFrame, in which case every column gets the same categories:

In [31]: from pandas.api.types import CategoricalDtype

In [32]: df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

In [33]: cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)

In [34]: df_cat = df.astype(cat_type)

In [35]: df_cat["A"]
Out[35]:
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']

In [36]: df_cat["B"]
Out[36]:
0    b
1    c
2    c
3    d
Name: B, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']
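One way to see the difference between inferred and explicit dtypes is to compare the column dtypes. The sketch below assumes the same two-column frame as above and relies on CategoricalDtype equality.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

# Inferred per column: A gets [a, b, c], B gets [b, c, d].
inferred = df.astype("category")
print(inferred["A"].dtype == inferred["B"].dtype)   # False

# One explicit dtype: both columns share [a < b < c < d].
shared = df.astype(CategoricalDtype(categories=list("abcd"), ordered=True))
print(shared["A"].dtype == shared["B"].dtype)       # True
```

Columns that share a dtype can be compared with each other, which matters for the ordered comparisons described later.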

Converting back to the original type

Series.astype(original_dtype) or np.asarray(categorical) converts a categorical back to its original type:

In [39]: s = pd.Series(["a", "b", "c", "a"])

In [40]: s
Out[40]:
0    a
1    b
2    c
3    a
dtype: object

In [41]: s2 = s.astype("category")

In [42]: s2
Out[42]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [43]: s2.astype(str)
Out[43]:
0    a
1    b
2    c
3    a
dtype: object

In [44]: np.asarray(s2)
Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)
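The round trip can be verified with a short script. This is only a sketch; it uses astype(object) rather than astype(str) so that the result does not depend on how a given pandas version represents strings.

```python
import numpy as np
import pandas as pd

s2 = pd.Series(["a", "b", "c", "a"], dtype="category")

back = s2.astype(object)   # plain values again, no categories attached
arr = np.asarray(s2)       # NumPy array holding the same values

print(back.tolist())       # ['a', 'b', 'c', 'a']
print(arr.tolist())        # ['a', 'b', 'c', 'a']
print(isinstance(back.dtype, pd.CategoricalDtype))   # False
```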

Operations on categories

Getting the properties of a category

Categorical data has two properties, categories and ordered. They are available as s.cat.categories and s.cat.ordered:

In [57]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [58]: s.cat.categories
Out[58]: Index(['a', 'b', 'c'], dtype='object')

In [59]: s.cat.ordered
Out[59]: False

The categories can be listed in a different order when the Categorical is created:

In [60]: s = pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"]))

In [61]: s.cat.categories
Out[61]: Index(['c', 'b', 'a'], dtype='object')

In [62]: s.cat.ordered
Out[62]: False

Renaming categories

Assigning to s.cat.categories renames the categories. Note that this setter was deprecated and is no longer available in pandas 2.0 and later; on current versions use rename_categories instead:

In [67]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [68]: s
Out[68]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [69]: s.cat.categories = ["Group %s" % g for g in s.cat.categories]

In [70]: s
Out[70]:
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): ['Group a', 'Group b', 'Group c']

rename_categories achieves the same result and returns a new Series:

In [71]: s = s.cat.rename_categories([1, 2, 3])

In [72]: s
Out[72]:
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]

Or pass a dict-like object:

# You can also pass a dict-like object to map the renaming

In [73]: s = s.cat.rename_categories({1: "x", 2: "y", 3: "z"})

In [74]: s
Out[74]:
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']
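Two further forms of rename_categories may be useful: a partial mapping and a callable. The sketch below is independent of the session above and starts from a fresh Series.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# A dict may rename only some categories; the rest pass through unchanged.
partial = s.cat.rename_categories({"a": "x"})
print(list(partial.cat.categories))   # ['x', 'b', 'c']

# A callable is applied to every category.
upper = s.cat.rename_categories(str.upper)
print(upper.tolist())                 # ['A', 'B', 'C', 'A']
```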

Adding categories with add_categories

add_categories appends new categories; existing values are unchanged:

In [77]: s = s.cat.add_categories([4])

In [78]: s.cat.categories
Out[78]: Index(['x', 'y', 'z', 4], dtype='object')

In [79]: s
Out[79]:
0    x
1    y
2    z
3    x
dtype: category
Categories (4, object): ['x', 'y', 'z', 4]

Removing categories with remove_categories

remove_categories drops the listed categories. Any values that belonged to a removed category become NaN:

In [80]: s = s.cat.remove_categories([4])

In [81]: s
Out[81]:
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): ['x', 'y', 'z']
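The session above removes a category that no row uses, so nothing changes in the data. As a sketch of the other case, the example below adds a hypothetical category "w", assigns it to one row, and then removes it again.

```python
import pandas as pd

s = pd.Series(["x", "y", "z", "x"], dtype="category")

# "w" is a made-up label; it must be a category before it can be assigned.
s = s.cat.add_categories(["w"])
s.iloc[0] = "w"
print(s.tolist())            # ['w', 'y', 'z', 'x']

# Removing a category that is in use turns its rows into NaN.
s = s.cat.remove_categories(["w"])
print(s.isna().tolist())     # [True, False, False, False]
```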

Removing unused categories

remove_unused_categories drops every category that no value uses:

In [82]: s = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c", "d"]))

In [83]: s
Out[83]:
0    a
1    b
2    a
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [84]: s.cat.remove_unused_categories()
Out[84]:
0    a
1    b
2    a
dtype: category
Categories (2, object): ['a', 'b']

Resetting categories

set_categories() adds and removes categories in a single step. Values whose category is no longer listed become NaN:

In [85]: s = pd.Series(["one", "two", "four", "-"], dtype="category")

In [86]: s
Out[86]:
0     one
1     two
2    four
3       -
dtype: category
Categories (4, object): ['-', 'four', 'one', 'two']

In [87]: s = s.cat.set_categories(["one", "two", "three", "four"])

In [88]: s
Out[88]:
0     one
1     two
2    four
3     NaN
dtype: category
Categories (4, object): ['one', 'two', 'three', 'four']
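set_categories also takes an ordered argument, so the placeholder can be dropped and a meaningful order imposed in the same call. A minimal sketch using the same data:

```python
import pandas as pd

s = pd.Series(["one", "two", "four", "-"], dtype="category")

# Drop "-", add "three", and declare one < two < three < four.
s = s.cat.set_categories(["one", "two", "three", "four"], ordered=True)

print(s.isna().tolist())                     # [False, False, False, True]
print(s.sort_values().dropna().tolist())     # ['one', 'two', 'four']
```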

Sorting a category

If a categorical was created with ordered=True, its values have a meaningful order, so they can be sorted and support min() and max():

In [91]: s = pd.Series(["a", "b", "c", "a"]).astype(CategoricalDtype(ordered=True))

In [92]: s.sort_values(inplace=True)

In [93]: s
Out[93]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [94]: s.min(), s.max()
Out[94]: ('a', 'c')
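In the example above the category order happens to match alphabetical order. The sketch below uses made-up size labels to show that sorting follows the declared category order rather than the lexical one.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical labels; alphabetically they would sort large, medium, small.
size_type = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
s = pd.Series(["large", "small", "medium", "small"]).astype(size_type)

print(s.sort_values().tolist())   # ['small', 'small', 'medium', 'large']
print(s.min(), s.max())           # small large
```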

as_ordered() and as_unordered() return a copy that is marked as ordered or unordered:

In [95]: s.cat.as_ordered()
Out[95]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']

In [96]: s.cat.as_unordered()
Out[96]:
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

Reordering

Categorical.reorder_categories() reorders the existing categories. The new list must contain exactly the same categories as the old one:

In [103]: s = pd.Series([1, 2, 3, 1], dtype="category")

In [104]: s = s.cat.reorder_categories([2, 3, 1], ordered=True)

In [105]: s
Out[105]:
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]
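The effect of the new order shows up when sorting, and a list that does not match the existing categories is rejected. A short sketch of both behaviours:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1], dtype="category")

reordered = s.cat.reorder_categories([2, 3, 1], ordered=True)
print(reordered.sort_values().tolist())   # [2, 3, 1, 1]

# Leaving a category out (or adding a new one) raises ValueError.
try:
    s.cat.reorder_categories([2, 3])
    rejected = False
except ValueError as exc:
    rejected = True
    print("ValueError:", exc)
```

To add or drop categories while reordering, set_categories() is the more suitable method.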

Sorting by multiple columns

sort_values can sort by several columns at once; a categorical column is sorted by its category order:

In [109]: dfs = pd.DataFrame(
   .....:     {
   .....:         "A": pd.Categorical(
   .....:             list("bbeebbaa"),
   .....:             categories=["e", "a", "b"],
   .....:             ordered=True,
   .....:         ),
   .....:         "B": [1, 2, 1, 2, 2, 1, 2, 1],
   .....:     }
   .....: )
   .....:

In [110]: dfs.sort_values(by=["A", "B"])
Out[110]:
   A  B
2  e  1
3  e  2
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2

Comparison operations

If ordered=True was set at creation, categoricals with the same categories can be compared with each other. The operators ==, !=, >, >=, <, and <= are supported:

In [113]: cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [114]: cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))

In [115]: cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

In [119]: cat > cat_base
Out[119]:
0     True
1    False
2    False
dtype: bool

In [120]: cat > 2
Out[120]:
0     True
1    False
2    False
dtype: bool
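The session defines cat_base2 but never uses it. Its categories are inferred as [2] rather than [3, 2, 1], and the sketch below shows what appears to be its purpose: comparing categoricals with different categories raises TypeError.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

order = CategoricalDtype([3, 2, 1], ordered=True)
cat = pd.Series([1, 2, 3]).astype(order)
cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

# Same values, but the two Series do not share the same categories.
try:
    cat > cat_base2
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# Comparing against a scalar that is one of the categories works,
# and follows the declared order 3 < 2 < 1.
print((cat < 2).tolist())   # [False, False, True]
```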

Other operations

A categorical column is still essentially a Series, so most Series operations work on it, for example Series.min(), Series.max() and Series.mode().

value_counts (unused categories are reported with a count of 0):

In [131]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))

In [132]: s.value_counts()
Out[132]:
c    2
a    1
b    1
d    0
dtype: int64

DataFrame.sum() (the level argument used here was deprecated in pandas 1.3 and removed in 2.0, so this call only runs on older versions):

In [133]: columns = pd.Categorical(
   .....:     ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
   .....: )
   .....:

In [134]: df = pd.DataFrame(
   .....:     data=[[1, 2, 3], [4, 5, 6]],
   .....:     columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
   .....: )
   .....:

In [135]: df.sum(axis=1, level=1)
Out[135]:
   One  Two  Three
0    3    3      0
1    9    6      0
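On pandas versions where sum() no longer accepts level, one possible equivalent is to group on the column level after transposing. This is a sketch rather than the original author's method, and observed=False is passed explicitly so that the unused category "Three" is kept.

```python
import pandas as pd

columns = pd.Categorical(
    ["One", "One", "Two"], categories=["One", "Two", "Three"], ordered=True
)
df = pd.DataFrame(
    data=[[1, 2, 3], [4, 5, 6]],
    columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
)

# Transpose so the column levels become index levels, group on level 1,
# then transpose back to the original orientation.
result = df.T.groupby(level=1, observed=False).sum().T
print(result)
```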

Groupby (categories with no rows still appear in the result, with NaN):

In [136]: cats = pd.Categorical(
   .....:     ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
   .....: )
   .....:

In [137]: df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

In [138]: df.groupby("cats").mean()
Out[138]:
      values
cats
a        1.0
b        2.0
c        4.0
d        NaN

In [139]: cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

In [140]: df2 = pd.DataFrame(
   .....:     {
   .....:         "cats": cats2,
   .....:         "B": ["c", "d", "c", "d"],
   .....:         "values": [1, 2, 3, 4],
   .....:     }
   .....: )
   .....:

In [141]: df2.groupby(["cats", "B"]).mean()
Out[141]:
        values
cats B
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN
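Whether unused categories show up in a groupby result is controlled by the observed argument. The default has differed between pandas versions, so the sketch below passes it explicitly in both directions.

```python
import pandas as pd

cats = pd.Categorical(
    ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
)
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

# observed=False keeps every category, including the unused "d".
all_groups = df.groupby("cats", observed=False)["values"].mean()
# observed=True keeps only categories that occur in the data.
seen_only = df.groupby("cats", observed=True)["values"].mean()

print(list(all_groups.index))   # ['a', 'b', 'c', 'd']
print(list(seen_only.index))    # ['a', 'b', 'c']
```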

Pivot tables:

In [142]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

In [143]: df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})

In [144]: pd.pivot_table(df, values="values", index=["A", "B"])
Out[144]:
     values
A B
a c       1
  d       2
b c       3
  d       4


Original: https://www.cnblogs.com/end/p/16399900.html
Author: 风生水起
Title: Pandas数据类型之category的用法zz

