pandas pivot, pivot_table, groupby, and crosstab: usage and differences explained

1. pivot usage

In the pandas source code, the docstring for pivot reads:

Return reshaped DataFrame organized by given index / column values.

        Reshape data (produce a "pivot" table) based on column values. Uses
        unique values from specified `index` / `columns` to form axes of the
        resulting DataFrame. This function does not support data
        aggregation, multiple values will result in a MultiIndex in the
        columns. See the :ref:`User Guide <reshaping>` for more on reshaping.

        Parameters
        ----------%s
        index : str or object, optional
            Column to use to make new frame's index. If None, uses
            existing index.
        columns : str or object
            Column to use to make new frame's columns.
        values : str, object or a list of the previous, optional
            Column(s) to use for populating new frame's values. If not
            specified, all remaining columns will be used and the result will
            have hierarchically indexed columns.

            .. versionchanged:: 0.23.0
               Also accept list of column names.

        Returns
        -------
        DataFrame
            Returns reshaped DataFrame.

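The docstring above captures the purpose of pivot precisely:
it returns a reshaped DataFrame organized by the given index / column values.
If that sentence is not entirely clear yet, don't worry; the examples that follow make it concrete.

Here is a pivot example.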

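import pandas as pd


def t2():
    # Every (a, b) pair is unique, so pivot can reshape without aggregating.
    data = {
        "a": [1, 2, 3, 1, 2],
        "b": [10, 20, 30, 40, 50],
        "c": ['x', 'y', 'z', 'm', 'n']
    }
    df = pd.DataFrame(data)
    result = df.pivot(index='a', columns='b', values='c')
    print(result, "\n")

    # Fill the (a, b) combinations that have no row with a placeholder.
    result = result.fillna('un')
    print(result)


t2()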

The output is:

b   10   20   30   40   50
a                         
1    x  NaN  NaN    m  NaN
2  NaN    y  NaN  NaN    n
3  NaN  NaN    z  NaN  NaN 

b  10  20  30  40  50
a                    
1   x  un  un   m  un
2  un   y  un  un   n
3  un  un   z  un  un
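
The docstring also says that when values is omitted, all remaining columns are used and the result has hierarchically indexed columns. The following is a minimal sketch of that behavior; the column names (day, city, temp, rain) and the numbers are made up purely for illustration.

```python
import pandas as pd

# Hypothetical data: two measurements per (day, city) pair.
df = pd.DataFrame({
    "day": ["mon", "mon", "tue", "tue"],
    "city": ["x", "y", "x", "y"],
    "temp": [20, 25, 22, 27],
    "rain": [0, 1, 3, 0],
})

# Without values, both remaining columns are pivoted and the
# resulting columns form a two-level MultiIndex: (measurement, city).
wide = df.pivot(index="day", columns="city")
print(wide)

# With values, only that column is pivoted and the columns stay flat.
single = df.pivot(index="day", columns="city", values="temp")
print(single)
```

A single measurement can be pulled out of the wide table with wide["temp"] or wide[("temp", "x")].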

2. pivot_table

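The difference between pivot_table and pivot is that pivot only reshapes data and cannot aggregate it. In addition, if the (index, columns) pairs passed to pivot contain duplicates, pivot raises a ValueError.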
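
A minimal sketch of that failure mode, using made-up data in which the pair (a=1, b=10) appears twice. pivot cannot decide which value to keep, while pivot_table resolves the conflict by aggregating.

```python
import pandas as pd

# Hypothetical data: the pair (a=1, b=10) occurs in two rows.
df = pd.DataFrame({
    "a": [1, 1, 2],
    "b": [10, 10, 20],
    "c": [5, 7, 9],
})

try:
    df.pivot(index="a", columns="b", values="c")
    pivot_error = None
except ValueError as exc:
    # pivot refuses to reshape when an (index, columns) pair repeats.
    pivot_error = exc

print("pivot raised:", pivot_error)

# pivot_table combines the duplicated rows with the aggregation function.
table = pd.pivot_table(df, index="a", columns="b", values="c", aggfunc="sum")
print(table)
```

Combinations that never occur in the data, such as (a=1, b=20) here, come out as NaN.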

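pivot_table also reshapes data. Reshaping makes the data more intuitive and easier to analyze; this is commonly called pivoting, and anyone who uses Excel regularly will already be familiar with pivot tables. Beyond reshaping, pivot_table can also aggregate the data. Here is an example.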

import pandas as pd


def t3():
    data = {
        "a": [1, 1, 2, 2, 3, 1, 2, 3],
        "b": [1, 1, 1, 1, 1, 2, 2, 2],
        "c": [1, 2, 3, 4, 5, 6, 7, 8]
    }
    df = pd.DataFrame(data)
    # Sum c for every (a, b) pair; the result is a DataFrame.
    result = pd.pivot_table(df, index=['a'], columns=['b'], values=['c'], aggfunc='sum')
    print(result, "\n")


t3()

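The output is a DataFrame indexed by a, with the values of b as columns nested under the label c. Each cell holds the sum of c for that (a, b) pair: 3 and 6 for a=1, 7 and 7 for a=2, 5 and 8 for a=3.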
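
One detail worth seeing in code: passing values as a list, as in the example above, produces two-level columns, whereas passing a plain string keeps the columns flat. This is a small sketch on the same data; the variable names nested and flat are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 2, 2, 3, 1, 2, 3],
    "b": [1, 1, 1, 1, 1, 2, 2, 2],
    "c": [1, 2, 3, 4, 5, 6, 7, 8],
})

# values given as a list: columns are a MultiIndex of (value name, b).
nested = pd.pivot_table(df, index="a", columns="b", values=["c"], aggfunc="sum")

# values given as a string: columns are just the values of b.
flat = pd.pivot_table(df, index="a", columns="b", values="c", aggfunc="sum")

print(nested.columns.nlevels, flat.columns.nlevels)
print(flat)
```

The flat form is usually more convenient when only one value column is being aggregated.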

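3. groupby

The pivot_table function described above can group and aggregate data, but in day-to-day work the most common tool for grouped aggregation is groupby. For example, groupby can reproduce the result from section 2.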

import pandas as pd


def t4():
    data = {
        "a": [1, 1, 2, 2, 3, 1, 2, 3],
        "b": [1, 1, 1, 1, 1, 2, 2, 2],
        "c": [1, 2, 3, 4, 5, 6, 7, 8]
    }

    # Selecting the single column c makes the result a Series
    # indexed by the (a, b) pairs.
    sums = pd.DataFrame(data).groupby(['a', 'b'])['c'].sum()

    for key, value in sums.items():
        print(key, value)


t4()

The output is:

(1, 1) 3
(1, 2) 6
(2, 1) 7
(2, 2) 7
(3, 1) 5
(3, 2) 8

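As you can see, the aggregated values are exactly the same as those from pivot_table; only the layout differs.

So what is the difference between pivot_table and groupby?

In one sentence: pivot_table and groupby both aggregate data, and they differ only in the shape of the result. pivot/pivot_table rearranges the data into a layout that is easier to read, the so-called pivot table, whereas groupby is mainly about the grouped aggregation itself, so when aggregation is all we need we usually reach for groupby directly.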
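
To make the "same values, different shape" point concrete, here is a small sketch that turns the long groupby result into the wide pivot_table layout with unstack. The names long_form, wide_form, and table are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 2, 2, 3, 1, 2, 3],
    "b": [1, 1, 1, 1, 1, 2, 2, 2],
    "c": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Long form: a Series whose index is the (a, b) MultiIndex.
long_form = df.groupby(["a", "b"])["c"].sum()

# Wide form: move the b level of the index into the columns.
wide_form = long_form.unstack("b")

# The pivot_table call that targets the same layout directly.
table = pd.pivot_table(df, index="a", columns="b", values="c", aggfunc="sum")

print(long_form, "\n")
print(wide_form, "\n")
print(table)
```

In this sketch the two wide tables hold the same numbers, so the choice between them comes down to which shape the next step of the analysis needs.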

4. crosstab

crosstab is a specialized pivot table for counting how often each group occurs; it is a special case of pivot_table. Its docstring reads:

    Compute a simple cross tabulation of two (or more) factors. By default
    computes a frequency table of the factors unless an array of values and an
    aggregation function are passed.

    Parameters
    ----------
    index : array-like, Series, or list of arrays/Series
        Values to group by in the rows.
    columns : array-like, Series, or list of arrays/Series
        Values to group by in the columns.
    values : array-like, optional
        Array of values to aggregate according to the factors.
        Requires `aggfunc` be specified.
    rownames : sequence, default None
        If passed, must match number of row arrays passed.
    colnames : sequence, default None
        If passed, must match number of column arrays passed.
    aggfunc : function, optional
        If specified, requires `values` be specified as well.
    margins : bool, default False
        Add row/column margins (subtotals).
    margins_name : str, default 'All'
        Name of the row/column that will contain the totals
        when margins is True.

        .. versionadded:: 0.21.0

    dropna : bool, default True
        Do not include columns whose entries are all NaN.
    normalize : bool, {'all', 'index', 'columns'}, or {0,1}, default False
        Normalize by dividing all values by the sum of values.

        - If passed 'all' or `True`, will normalize over all values.
        - If passed 'index' will normalize over each row.
        - If passed 'columns' will normalize over each column.
        - If margins is `True`, will also normalize margin values.

    Returns
    -------
    DataFrame
        Cross tabulation of the data.

    See Also
    --------
    DataFrame.pivot : Reshape data based on column values.
    pivot_table : Create a pivot table as a DataFrame.
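
Before the main example, here is a brief sketch of three of the parameters documented above: values with aggfunc, margins, and normalize. It reuses the a, b, c data from the earlier sections; the variable names sums, counts, and shares are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 2, 2, 3, 1, 2, 3],
    "b": [1, 1, 1, 1, 1, 2, 2, 2],
    "c": [1, 2, 3, 4, 5, 6, 7, 8],
})

# values + aggfunc: aggregate c per (a, b) pair instead of counting rows.
sums = pd.crosstab(index=df["a"], columns=df["b"], values=df["c"], aggfunc="sum")

# margins=True appends an "All" row and an "All" column holding totals.
counts = pd.crosstab(index=df["a"], columns=df["b"], margins=True)

# normalize="index" divides each row by its total, giving row proportions.
shares = pd.crosstab(index=df["a"], columns=df["b"], normalize="index")

print(sums, "\n")
print(counts, "\n")
print(shares)
```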

Again, let's look at an example.

import pandas as pd


def t5():
    data = {
        "a": [1, 1, 2, 2, 3, 1, 2, 3],
        "b": [1, 1, 1, 1, 1, 2, 2, 2],
        "c": [1, 2, 3, 4, 5, 6, 7, 8]
    }
    data = pd.DataFrame(data)
    # Frequency table: the number of rows for each (a, b) pair.
    df = pd.crosstab(index=data.a, columns=data.b)
    print(df, "\n")

    # The same counts in long form, indexed by (a, b).
    df2 = data.groupby(['a', 'b']).agg({'c': 'count'})
    print(df2, "\n")

    # Running total of the counts down the rows.
    df3 = pd.crosstab(index=data['a'], columns=data['b']).cumsum(axis=0)
    print(df3)


t5()

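The output consists of three tables. The first, from crosstab, counts the rows for each (a, b) pair: 2 and 1 for a=1, 2 and 1 for a=2, 1 and 1 for a=3. The second, from groupby with count, holds the same six counts in long form under an (a, b) MultiIndex. The third is the running total of the first table down the rows: 2 and 1, then 4 and 2, then 5 and 3.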
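
Since crosstab was introduced as a special case of pivot_table, the following sketch checks that claim on the same data by building the frequency table three ways. It assumes every (a, b) combination occurs at least once, as it does here; otherwise the pivot_table version would contain NaN where crosstab reports 0.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 2, 2, 3, 1, 2, 3],
    "b": [1, 1, 1, 1, 1, 2, 2, 2],
    "c": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Frequency table straight from crosstab.
counts = pd.crosstab(index=df["a"], columns=df["b"])

# The same table from pivot_table, counting the non-null values of c.
via_pivot = pd.pivot_table(df, index="a", columns="b", values="c", aggfunc="count")

# And from groupby: group sizes, with b moved into the columns.
via_groupby = df.groupby(["a", "b"]).size().unstack("b", fill_value=0)

print(counts, "\n")
print(via_pivot, "\n")
print(via_groupby)
```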