Claroja

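# Statistical functions

## Percent change: pct_change

The periods parameter sets the step: each value is compared with the value periods rows earlier (default 1).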

``````In [1]: ser = pd.Series(np.random.randn(8))

In [2]: ser.pct_change()
Out[2]:
0         NaN
1   -1.602976
2    4.334938
3   -0.247456
4   -2.067345
5   -1.142903
6   -1.688214
7   -9.759729
dtype: float64``````
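The effect of periods is easier to see on fixed numbers than on random ones. The following is a minimal sketch; the prices series and its values are made up for illustration, and the shift-based formula is simply the usual definition of a fractional change written out by hand.

```python
import pandas as pd

# Illustrative values (not from the original post) chosen so the
# result can be checked by hand.
prices = pd.Series([100.0, 110.0, 121.0, 108.9])

# periods=1 (the default): compare each value with the previous row.
step1 = prices.pct_change()

# periods=2: compare each value with the value two rows earlier.
step2 = prices.pct_change(periods=2)

# The same quantity written explicitly: x[t] / x[t - 2] - 1.
manual = prices / prices.shift(2) - 1

print(step1)
print(step2)
```

The first periods rows are always NaN because there is no earlier value to compare against.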

## Covariance

Series.cov(other, min_periods=None)

``````In [5]: s1 = pd.Series(np.random.randn(1000))

In [6]: s2 = pd.Series(np.random.randn(1000))

In [7]: s1.cov(s2)
Out[7]: 0.00068010881743108746``````

DataFrame.cov(min_periods=None)

``````In [8]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])

In [9]: frame.cov()
Out[9]:
          a         b         c         d         e
a  1.000882 -0.003177 -0.002698 -0.006889  0.031912
b -0.003177  1.024721  0.000191  0.009212  0.000857
c -0.002698  0.000191  0.950735 -0.031743 -0.005087
d -0.006889  0.009212 -0.031743  1.002983 -0.047952
e  0.031912  0.000857 -0.005087 -0.047952  1.042487``````
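The min_periods argument only matters when values are missing. The sketch below uses two short made-up series with NaN in different positions; it assumes nothing beyond the signatures shown above and checks the result against the unbiased (n - 1) estimator computed by hand.

```python
import numpy as np
import pandas as pd

# Illustrative data: each series is missing a different row.
s1 = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])
s2 = pd.Series([2.0, 4.0, 6.0, np.nan, 10.0])

# Only rows where both series have a value are used (rows 0, 1, 2).
cov_all = s1.cov(s2)

# Require at least 4 complete pairs; only 3 exist, so the result is NaN.
cov_strict = s1.cov(s2, min_periods=4)

# Manual check on the three complete pairs, dividing by n - 1.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# DataFrame.cov applies the same pairwise-complete rule to every column pair.
frame = pd.DataFrame({'a': s1, 'b': s2})

print(cov_all, cov_strict, manual)
print(frame.cov())
```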

## Correlation

| Method name | Description |
| --- | --- |
| pearson (default) | Standard correlation coefficient |
| kendall | Kendall Tau correlation coefficient |
| spearman | Spearman rank correlation coefficient |

Series.corr(other, method='pearson', min_periods=None)

DataFrame.corr(method='pearson', min_periods=1)

``````In [15]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])
In [19]: frame.corr()
Out[19]:
          a         b         c         d         e
a  1.000000  0.013479 -0.049269 -0.042239 -0.028525
b  0.013479  1.000000 -0.020433 -0.011139  0.005654
c -0.049269 -0.020433  1.000000  0.018587 -0.054269
d -0.042239 -0.011139  0.018587  1.000000 -0.017060
e -0.028525  0.005654 -0.054269 -0.017060  1.000000``````

DataFrame.corrwith(other, axis=0, drop=False)
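To illustrate how the method choice changes the result, and what corrwith returns, here is a small sketch on made-up data. The column names x, y and z are hypothetical. It uses only DataFrame.corr with pearson and spearman; Series.corr with kendall or spearman may additionally require SciPy, so those calls are left out here.

```python
import pandas as pd

# y is a monotonic but non-linear function of x; z is unrelated filler.
df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [1.0, 8.0, 27.0, 64.0, 125.0],
    'z': [5.0, 3.0, 4.0, 1.0, 2.0],
})

# Pearson measures linear association, so x vs. y is below 1.
pearson = df.corr()

# Spearman works on ranks, so any strictly increasing relation gives 1.
spearman = df.corr(method='spearman')

# corrwith correlates every column of df with a single Series
# and returns one value per column.
with_x = df.corrwith(df['x'])

print(pearson)
print(spearman)
print(with_x)
```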

## Ranking data

Series.rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)

``````In [31]: s = pd.Series(np.random.randn(5), index=list('abcde'))

In [32]: s['d'] = s['b'] # so there's a tie

In [33]: s.rank()
Out[33]:
a    5.0
b    2.5
c    1.0
d    2.5
e    4.0
dtype: float64``````
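The tie in the session above is resolved with the default method='average'. As a rough sketch of how the other tie-breaking options differ, the snippet below ranks a small made-up series with each method; the values are illustrative only.

```python
import pandas as pd

# 'b' and 'd' are tied for the smallest value.
s = pd.Series([30, 10, 20, 10], index=list('abcd'))

ranks = pd.DataFrame({
    'average': s.rank(),               # ties share the mean of their positions
    'min': s.rank(method='min'),       # ties all get the lowest position
    'max': s.rank(method='max'),       # ties all get the highest position
    'first': s.rank(method='first'),   # ties broken by order of appearance
    'dense': s.rank(method='dense'),   # like 'min', but ranks have no gaps
})

print(ranks)
```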

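DataFrame.rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)
axis=0 (the default) ranks the values within each column, moving down the rows; axis=1 ranks the values within each row, moving across the columns.
ascending=True gives rank 1 to the smallest value; ascending=False gives rank 1 to the largest.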

``````In [34]: df = pd.DataFrame(np.random.randn(10, 6))

In [35]: df[4] = df[2][:5] # some ties

In [36]: df
Out[36]:
          0         1         2         3         4         5
0 -0.904948 -1.163537 -1.457187  0.135463 -1.457187  0.294650
1 -0.976288 -0.244652 -0.748406 -0.999601 -0.748406 -0.800809
2  0.401965  1.460840  1.256057  1.308127  1.256057  0.876004
3  0.205954  0.369552 -0.669304  0.038378 -0.669304  1.140296
4 -0.477586 -0.730705 -1.129149 -0.601463 -1.129149 -0.211196
5 -1.092970 -0.689246  0.908114  0.204848       NaN  0.463347
6  0.376892  0.959292  0.095572 -0.593740       NaN -0.069180
7 -1.002601  1.957794 -0.120708  0.094214       NaN -1.467422
8 -0.547231  0.664402 -0.519424 -0.073254       NaN -1.263544
9 -0.250277 -0.237428 -1.056443  0.419477       NaN  1.375064

In [37]: df.rank(axis=1)
Out[37]:
     0    1    2    3    4    5
0  4.0  3.0  1.5  5.0  1.5  6.0
1  2.0  6.0  4.5  1.0  4.5  3.0
2  1.0  6.0  3.5  5.0  3.5  2.0
3  4.0  5.0  1.5  3.0  1.5  6.0
4  5.0  3.0  1.5  4.0  1.5  6.0
5  1.0  2.0  5.0  3.0  NaN  4.0
6  4.0  5.0  3.0  1.0  NaN  2.0
7  2.0  5.0  3.0  4.0  NaN  1.0
8  2.0  5.0  3.0  4.0  NaN  1.0
9  2.0  3.0  1.0  4.0  NaN  5.0``````
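The axis, ascending, pct and na_option arguments are easier to compare on a tiny frame. The following sketch uses made-up values and hypothetical column names x, y and z; it is meant only to show the direction in which each option works.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [3.0, 1.0, 2.0],
    'y': [10.0, np.nan, 5.0],
    'z': [7.0, 8.0, 9.0],
})

# axis=0 (default): each column is ranked on its own, down the rows.
by_col = df.rank()

# axis=1: each row is ranked on its own, across the columns.
by_row = df.rank(axis=1)

# ascending=False: the largest value gets rank 1.
desc = df.rank(ascending=False)

# pct=True: ranks are divided by the number of valid values.
pct = df['x'].rank(pct=True)

# na_option='bottom': NaN is ranked last instead of staying NaN.
na_bottom = df['y'].rank(na_option='bottom')

print(by_col)
print(by_row)
```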

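# Window functions

## Introducing rolling windows: rolling

Series.rolling(window, min_periods=None, freq=None, center=False, win_type=None, on=None, axis=0)
window: size of the moving window, as a number of observations
min_periods: minimum number of non-null observations a window must contain to produce a value; windows with fewer yield NaN (for a fixed-size window the default is the window size)
center: whether to place the label at the center of the window; default False, which labels each window by its right edge
win_type: name of a weighting scheme for the window; the default None weights all observations equally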

``````In [38]: s = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
s = s.cumsum()  # running total; the rolling means shown below reflect this step
r = s.rolling(window=60)
In [42]: r
Out[42]: Rolling [window=60,center=False,axis=0]
In [43]: r.mean()
Out[43]:
2000-01-01          NaN
2000-01-02          NaN
2000-01-03          NaN
2000-01-04          NaN
2000-01-05          NaN
2000-01-06          NaN
2000-01-07          NaN
...
2002-09-20   -62.694135
2002-09-21   -62.812190
2002-09-22   -62.914971
2002-09-23   -63.061867
2002-09-24   -63.213876
2002-09-25   -63.375074
2002-09-26   -63.539734
Freq: D, dtype: float64
In [44]: s.plot(style='k--')
Out[44]: <matplotlib.axes._subplots.AxesSubplot at 0x7ff282080dd0>

In [45]: r.mean().plot(style='k')
Out[45]: <matplotlib.axes._subplots.AxesSubplot at 0x7ff282080dd0>``````
``````In [46]: df = pd.DataFrame(np.random.randn(1000, 4),
....:                   index=pd.date_range('1/1/2000', periods=1000),
....:                   columns=['A', 'B', 'C', 'D'])
....:

In [47]: df = df.cumsum()

In [48]: df.rolling(window=60).sum().plot(subplots=True)``````
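With a 60-day window on random data it is hard to see what min_periods and center actually do. The sketch below, on a five-element made-up series, shows both; win_type is left out because weighted windows may require SciPy.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: a value appears only once the window holds 3 observations,
# and each result is labelled at the window's right edge.
default = s.rolling(window=3).mean()

# min_periods=1: partial windows at the start are allowed.
partial = s.rolling(window=3, min_periods=1).mean()

# center=True: each result is labelled at the middle of its window,
# so the NaN values move to both ends.
centered = s.rolling(window=3, center=True).mean()

print(pd.DataFrame({'default': default, 'partial': partial, 'centered': centered}))
```

Lowering min_periods trades early NaN values for early results computed from fewer observations, which is worth keeping in mind when plotting.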

## Method summary

| Method | Description |
| --- | --- |
| count() | Number of non-null observations |
| sum() | Sum of values |
| mean() | Mean of values |
| median() | Arithmetic median of values |
| min() | Minimum |
| max() | Maximum |
| std() | Bessel-corrected sample standard deviation |
| var() | Unbiased variance |
| skew() | Sample skewness (3rd moment) |
| kurt() | Sample kurtosis (4th moment) |
| quantile() | Sample quantile (value at %) |
| apply() | Generic apply |
| cov() | Unbiased covariance (binary) |
| corr() | Correlation (binary) |
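As a quick illustration of the methods listed above, the sketch below calls a few of them on one rolling object and shows the two binary methods, which take a second series as an argument. The data and the column names a and b are made up; b is exactly twice a, so the rolling correlation should come out at 1 wherever a full window exists.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, 4.0, 7.0, 11.0],
    'b': [2.0, 4.0, 8.0, 14.0, 22.0],   # b = 2 * a
})

r = df['a'].rolling(window=3)

# Several unary statistics computed from the same rolling object.
summary = pd.DataFrame({
    'sum': r.sum(),
    'mean': r.mean(),
    'median': r.median(),
    'max': r.max(),
    'std': r.std(),   # sample standard deviation (divides by n - 1)
})

# Binary methods pair each window of 'a' with the matching window of 'b'.
roll_cov = r.cov(df['b'])
roll_corr = r.corr(df['b'])

print(summary)
print(roll_corr)
```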

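The apply() method runs an arbitrary function over each rolling window. The function passed to apply() must reduce each window to a single value. Suppose we want to compute the mean absolute deviation: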

``````In [49]: mad = lambda x: np.fabs(x - x.mean()).mean()``````

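The snippet above only defines the function. A minimal sketch of passing it to a rolling window might look like the following; the input values are made up, and raw=True is an assumption about how one would want the window delivered (as a plain NumPy array rather than a Series).

```python
import numpy as np
import pandas as pd

def mad(window):
    """Mean absolute deviation: reduces one window to a single number."""
    return np.fabs(window - window.mean()).mean()

# Illustrative data with one outlier to make the effect visible.
s = pd.Series([1.0, 2.0, 3.0, 10.0, 2.0])

# raw=True hands each window to mad() as a NumPy array.
rolling_mad = s.rolling(window=3).apply(mad, raw=True)

print(rolling_mad)
```

A function that returned an array instead of a scalar would not work here, which is the restriction the text describes.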
# Exponentially Weighted Windows
