Pandas系列6-DataFrame的分组与聚合

在对数据进行处理的时候,分组与聚合是非常常用的操作。在Pandas中此类操作主要是通过groupby函数来完成的。

先看一个实际的例子:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# 生成一个原始的DataFrame
In [70]: raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawk
...: s', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scou
...: ts', 'Scouts'],
...: 'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1
...: st', '1st', '2nd', '2nd'],
...: 'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ry
...: aner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
...: 'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
...: 'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
...:

In [71]: df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTes
...: tScore', 'postTestScore'])

In [72]: df
Out[72]:
regiment company name preTestScore postTestScore
0 Nighthawks 1st Miller 4 25
1 Nighthawks 1st Jacobson 24 94
2 Nighthawks 2nd Ali 31 57
3 Nighthawks 2nd Milner 2 62
4 Dragoons 1st Cooze 3 70
5 Dragoons 1st Jacon 4 25
6 Dragoons 2nd Ryaner 24 94
7 Dragoons 2nd Sone 31 57
8 Scouts 1st Sloan 2 62
9 Scouts 1st Piger 3 70
10 Scouts 2nd Riani 2 62
11 Scouts 2nd Ali 3 70

通过groupby函数生成一个groupby对象,如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# 当针对特定列(此例是'preTestScore')进行分组时,需要通过df['colume_name'](此例是df['regiment'])来指定键名
In [73]: groupby_regiment = df['preTestScore'].groupby(df['regiment'])

# 生成的groupby对象没有做任何计算,只是将数据按键进行分组
In [74]: groupby_regiment
Out[74]: <pandas.core.groupby.SeriesGroupBy object at 0x11112cef0>

# 分组的聚合统计
In [75]: groupby_regiment.describe()
Out[75]:
count mean std min 25% 50% 75% max
regiment
Dragoons 4.0 15.50 14.153916 3.0 3.75 14.0 25.75 31.0
Nighthawks 4.0 15.25 14.453950 2.0 3.50 14.0 25.75 31.0
Scouts 4.0 2.50 0.577350 2.0 2.00 2.5 3.00 3.0

# 也可以针对特定统计单独计算
In [76]: groupby_regiment.mean()
Out[76]:
regiment
Dragoons 15.50
Nighthawks 15.25
Scouts 2.50
Name: preTestScore, dtype: float64

整个分组统计的过程,可以通过下图更清晰地展示: 1.1-group and aggregate process

聚合函数

聚合的时候,既可以使用Pandas内置的函数进行聚合计算,也可以使用自定义的函数进行聚合计算,我们先来看下内置的函数: 1.2-built-in aggregate functions 另外,我们也可以自定义聚合函数:

1
2
3
4
5
6
7
8
9
10
11
In [81]: def my_agg(pre_test_score_group):
...: return np.sum(np.power(pre_test_score_group, 2))
...:

In [82]: df['preTestScore'].groupby(df['regiment']).apply(my_agg)
Out[82]:
regiment
Dragoons 1562
Nighthawks 1557
Scouts 26
Name: preTestScore, dtype: int64

通过上面的例子我们可以看到,通过apply函数也可以完成类似for循环的迭代,在pandas中尽可能使用apply函数来代替for循环迭代,以提高性能。

根据多个键进行分组和聚合

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# 如果有多个键,将多个键放到一个list当中,作为groupby的参数
In [77]: df['preTestScore'].groupby([df['regiment'], df['company']]).mean()
Out[77]:
regiment company
Dragoons 1st 3.5
2nd 27.5
Nighthawks 1st 14.0
2nd 16.5
Scouts 1st 2.5
2nd 2.5
Name: preTestScore, dtype: float64

# unstack之后变成表格模式,更加清晰
In [78]: df['preTestScore'].groupby([df['regiment'], df['company']]).mean().unstack()
Out[78]:
company 1st 2nd
regiment
Dragoons 3.5 27.5
Nighthawks 14.0 16.5
Scouts 2.5 2.5

References