Pandas系列6-DataFrame的分组与聚合

在对数据进行处理的时候，分组与聚合是非常常用的操作。在Pandas中此类操作主要是通过groupby函数来完成的。

先看一个实际的例子：

# 生成一个原始的DataFrame
In [70]: raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawk
    ...: s', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scou
    ...: ts', 'Scouts'],
    ...:         'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1
    ...: st', '1st', '2nd', '2nd'],
    ...:         'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ry
    ...: aner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
    ...:         'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
    ...:         'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
    ...:

In [71]: df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTes
    ...: tScore', 'postTestScore'])

In [72]: df
Out[72]:
      regiment company      name  preTestScore  postTestScore
0   Nighthawks     1st    Miller             4             25
1   Nighthawks     1st  Jacobson            24             94
2   Nighthawks     2nd       Ali            31             57
3   Nighthawks     2nd    Milner             2             62
4     Dragoons     1st     Cooze             3             70
5     Dragoons     1st     Jacon             4             25
6     Dragoons     2nd    Ryaner            24             94
7     Dragoons     2nd      Sone            31             57
8       Scouts     1st     Sloan             2             62
9       Scouts     1st     Piger             3             70
10      Scouts     2nd     Riani             2             62
11      Scouts     2nd       Ali             3             70

通过groupby函数生成一个groupby对象，如下：

# 当针对特定列（此例是'preTestScore'）进行分组时，需要通过df['colume_name'](此例是df['regiment'])来指定键名
In [73]: groupby_regiment = df['preTestScore'].groupby(df['regiment'])

# 生成的groupby对象没有做任何计算，只是将数据按键进行分组
In [74]: groupby_regiment
Out[74]: <pandas.core.groupby.SeriesGroupBy object at 0x11112cef0>

# 分组的聚合统计
In [75]: groupby_regiment.describe()
Out[75]:
            count   mean        std  min   25%   50%    75%   max
regiment
Dragoons      4.0  15.50  14.153916  3.0  3.75  14.0  25.75  31.0
Nighthawks    4.0  15.25  14.453950  2.0  3.50  14.0  25.75  31.0
Scouts        4.0   2.50   0.577350  2.0  2.00   2.5   3.00   3.0

# 也可以针对特定统计单独计算
In [76]: groupby_regiment.mean()
Out[76]:
regiment
Dragoons      15.50
Nighthawks    15.25
Scouts         2.50
Name: preTestScore, dtype: float64

整个分组统计的过程，可以通过下图更清晰地展示：

聚合函数

聚合的时候，既可以使用Pandas内置的函数进行聚合计算，也可以使用自定义的函数进行聚合计算，我们先来看下内置的函数：另外，我们也可以自定义聚合函数：

In [81]: def my_agg(pre_test_score_group):
    ...:     return np.sum(np.power(pre_test_score_group, 2))
    ...:

In [82]: df['preTestScore'].groupby(df['regiment']).apply(my_agg)
Out[82]:
regiment
Dragoons      1562
Nighthawks    1557
Scouts          26
Name: preTestScore, dtype: int64

通过上面的例子我们可以看到，通过apply函数也可以完成类似for循环的迭代，在pandas中尽可能使用apply函数来代替for循环迭代，以提高性能。

根据多个键进行分组和聚合

# 如果有多个键，将多个键放到一个list当中，作为groupby的参数
In [77]: df['preTestScore'].groupby([df['regiment'], df['company']]).mean()
Out[77]:
regiment    company
Dragoons    1st         3.5
            2nd        27.5
Nighthawks  1st        14.0
            2nd        16.5
Scouts      1st         2.5
            2nd         2.5
Name: preTestScore, dtype: float64

# unstack之后变成表格模式，更加清晰
In [78]: df['preTestScore'].groupby([df['regiment'], df['company']]).mean().unstack()
Out[78]:
company      1st   2nd
regiment
Dragoons     3.5  27.5
Nighthawks  14.0  16.5
Scouts       2.5   2.5

聚合函数

根据多个键进行分组和聚合

References