Jupyter Notebook-分享列表
Pandas快速入门,DataFrame创建、提取、数据筛选、矩阵转置。
- http://nbviewer.jupyter.org/github/supergis/git_notebook/blob/master/pystart/pandas_quickstart.ipynb
-
Origin from http://pandas.pydata.org/pandas-docs/stable/
Pandas_QuickStart
6.1 Object Creation
Creating a Series by passing a list of values, letting pandas create a default integer index:
import pandas as pd import numpy as np import matplotlib.pyplot as plt s = pd.Series([1,3,5,np.nan,6,8]) s
0 1.0 1 3.0 2 5.0 3 NaN 4 6.0 5 8.0 dtype: float64
Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns:
dates = pd.date_range('20130101', periods=6) dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD')) df
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | -1.334013 | -0.348297 | 0.388654 | -0.225967 |
2013-01-02 | -0.139974 | -1.347789 | 0.817077 | 0.192477 |
2013-01-03 | -1.082739 | -0.544105 | -1.423883 | -1.247367 |
2013-01-04 | 0.034788 | -0.677221 | 0.120449 | 0.794341 |
2013-01-05 | 0.428547 | -0.610156 | -0.950891 | -0.058047 |
2013-01-06 | 0.125631 | -0.116653 | -0.544575 | -1.578785 |
Creating a DataFrame by passing a dict of objects that can be converted to series-like.
df2 = pd.DataFrame({ 'A' : 1., 'B' : pd.Timestamp('20130102'), 'C' : pd.Series(1,index=list(range(4)),dtype='float32'), 'D' : np.array([3] * 4,dtype='int32'), 'E' : pd.Categorical(["test","train","test","train"]), 'F' : 'foo' }) df2
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
df2.dtypes
A float64 B datetime64[ns] C float32 D int32 E category F object dtype: object
If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:
In [13]: df2.<TAB>
df2.
0 1.0 1 1.0 2 1.0 3 1.0 Name: A, dtype: float64
As you can see, the columns A, B, C, and D are automatically tab completed. E is there as well; the rest of the attributes have been truncated for brevity.
6.2 Viewing Data
df.head()
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | -1.334013 | -0.348297 | 0.388654 | -0.225967 |
2013-01-02 | -0.139974 | -1.347789 | 0.817077 | 0.192477 |
2013-01-03 | -1.082739 | -0.544105 | -1.423883 | -1.247367 |
2013-01-04 | 0.034788 | -0.677221 | 0.120449 | 0.794341 |
2013-01-05 | 0.428547 | -0.610156 | -0.950891 | -0.058047 |
df.tail(3)
A | B | C | D | |
---|---|---|---|---|
2013-01-04 | 0.034788 | -0.677221 | 0.120449 | 0.794341 |
2013-01-05 | 0.428547 | -0.610156 | -0.950891 | -0.058047 |
2013-01-06 | 0.125631 | -0.116653 | -0.544575 | -1.578785 |
df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D')
df.values
array([[-1.33401275, -0.34829657, 0.38865407, -0.22596701], [-0.13997444, -1.34778853, 0.81707707, 0.19247685], [-1.0827386 , -0.5441047 , -1.42388302, -1.24736743], [ 0.03478847, -0.67722051, 0.12044917, 0.7943414 ], [ 0.42854678, -0.61015602, -0.95089113, -0.0580473 ], [ 0.12563068, -0.11665286, -0.54457518, -1.57878468]])
df.describe()
A | B | C | D | |
---|---|---|---|---|
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | -0.327960 | -0.607370 | -0.265528 | -0.353891 |
std | 0.710887 | 0.415940 | 0.853035 | 0.896606 |
min | -1.334013 | -1.347789 | -1.423883 | -1.578785 |
25% | -0.847048 | -0.660454 | -0.849312 | -0.992017 |
50% | -0.052593 | -0.577130 | -0.212063 | -0.142007 |
75% | 0.102920 | -0.397249 | 0.321603 | 0.129846 |
max | 0.428547 | -0.116653 | 0.817077 | 0.794341 |
df.T
2013-01-01 00:00:00 | 2013-01-02 00:00:00 | 2013-01-03 00:00:00 | 2013-01-04 00:00:00 | 2013-01-05 00:00:00 | 2013-01-06 00:00:00 | |
---|---|---|---|---|---|---|
A | -1.334013 | -0.139974 | -1.082739 | 0.034788 | 0.428547 | 0.125631 |
B | -0.348297 | -1.347789 | -0.544105 | -0.677221 | -0.610156 | -0.116653 |
C | 0.388654 | 0.817077 | -1.423883 | 0.120449 | -0.950891 | -0.544575 |
D | -0.225967 | 0.192477 | -1.247367 | 0.794341 | -0.058047 | -1.578785 |
df.sort_index(axis=1, ascending=False)
D | C | B | A | |
---|---|---|---|---|
2013-01-01 | -0.225967 | 0.388654 | -0.348297 | -1.334013 |
2013-01-02 | 0.192477 | 0.817077 | -1.347789 | -0.139974 |
2013-01-03 | -1.247367 | -1.423883 | -0.544105 | -1.082739 |
2013-01-04 | 0.794341 | 0.120449 | -0.677221 | 0.034788 |
2013-01-05 | -0.058047 | -0.950891 | -0.610156 | 0.428547 |
2013-01-06 | -1.578785 | -0.544575 | -0.116653 | 0.125631 |
df.sort_values(by='B')
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | -0.139974 | -1.347789 | 0.817077 | 0.192477 |
2013-01-04 | 0.034788 | -0.677221 | 0.120449 | 0.794341 |
2013-01-05 | 0.428547 | -0.610156 | -0.950891 | -0.058047 |
2013-01-03 | -1.082739 | -0.544105 | -1.423883 | -1.247367 |
2013-01-01 | -1.334013 | -0.348297 | 0.388654 | -0.225967 |
2013-01-06 | 0.125631 | -0.116653 | -0.544575 | -1.578785 |
6.3 Selection
Getting
df['A']
2013-01-01 -1.334013 2013-01-02 -0.139974 2013-01-03 -1.082739 2013-01-04 0.034788 2013-01-05 0.428547 2013-01-06 0.125631 Freq: D, Name: A, dtype: float64
df[0:3]
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | -1.334013 | -0.348297 | 0.388654 | -0.225967 |
2013-01-02 | -0.139974 | -1.347789 | 0.817077 | 0.192477 |
2013-01-03 | -1.082739 | -0.544105 | -1.423883 | -1.247367 |
df['20130102':'20130104']
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | -0.139974 | -1.347789 | 0.817077 | 0.192477 |
2013-01-03 | -1.082739 | -0.544105 | -1.423883 | -1.247367 |
2013-01-04 | 0.034788 | -0.677221 | 0.120449 | 0.794341 |
6.3.2 Selection by Label
For getting a cross section using a label
df.loc[dates[0]]
A -1.334013 B -0.348297 C 0.388654 D -0.225967 Name: 2013-01-01 00:00:00, dtype: float64
Selecting on a multi-axis by label
df.loc[:,['A','B']]
A | B | |
---|---|---|
2013-01-01 | -1.334013 | -0.348297 |
2013-01-02 | -0.139974 | -1.347789 |
2013-01-03 | -1.082739 | -0.544105 |
2013-01-04 | 0.034788 | -0.677221 |
2013-01-05 | 0.428547 | -0.610156 |
2013-01-06 | 0.125631 | -0.116653 |
Showing label slicing, both endpoints are included
df.loc['20130102':'20130104',['A','B']]
A | B | |
---|---|---|
2013-01-02 | -0.139974 | -1.347789 |
2013-01-03 | -1.082739 | -0.544105 |
2013-01-04 | 0.034788 | -0.677221 |
Reduction in the dimensions of the returned object
df.loc['20130102',['A','B']]
A -0.139974 B -1.347789 Name: 2013-01-02 00:00:00, dtype: float64
For getting a scalar value
df.loc[dates[0],'A']
-1.3340127475498547
For getting fast access to a scalar (equiv to the prior method)
df.at[dates[0],'A']
-1.3340127475498547
6.3.3 Selection by Position
See more in Selection by Position Select via the position of the passed integers
df.iloc[3]
A 0.034788 B -0.677221 C 0.120449 D 0.794341 Name: 2013-01-04 00:00:00, dtype: float64
By integer slices, acting similar to numpy/python
df.iloc[3:5,0:2]
A | B | |
---|---|---|
2013-01-04 | 0.034788 | -0.677221 |
2013-01-05 | 0.428547 | -0.610156 |
By lists of integer position locations, similar to the numpy/python style
df.iloc[[1,2,4],[0,2]]
A | C | |
---|---|---|
2013-01-02 | -0.139974 | 0.817077 |
2013-01-03 | -1.082739 | -1.423883 |
2013-01-05 | 0.428547 | -0.950891 |
For slicing rows explicitly
df.iloc[1:3,:]
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | -0.139974 | -1.347789 | 0.817077 | 0.192477 |
2013-01-03 | -1.082739 | -0.544105 | -1.423883 | -1.247367 |
For slicing columns explicitly
df.iloc[:,1:3]
B | C | |
---|---|---|
2013-01-01 | -0.348297 | 0.388654 |
2013-01-02 | -1.347789 | 0.817077 |
2013-01-03 | -0.544105 | -1.423883 |
2013-01-04 | -0.677221 | 0.120449 |
2013-01-05 | -0.610156 | -0.950891 |
2013-01-06 | -0.116653 | -0.544575 |
For getting a value explicitly
df.iloc[1,1]
-1.3477885295869219
For getting fast access to a scalar (equiv to the prior method)
df.iat[1,1]
-1.3477885295869219
6.3.4 Boolean Indexing
Using a single column’s values to select data.
df[df.A > 0]
A | B | C | D | |
---|---|---|---|---|
2013-01-04 | 0.034788 | -0.677221 | 0.120449 | 0.794341 |
2013-01-05 | 0.428547 | -0.610156 | -0.950891 | -0.058047 |
2013-01-06 | 0.125631 | -0.116653 | -0.544575 | -1.578785 |
A where operation for getting.
df[df > 0]
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | NaN | NaN | 0.388654 | NaN |
2013-01-02 | NaN | NaN | 0.817077 | 0.192477 |
2013-01-03 | NaN | NaN | NaN | NaN |
2013-01-04 | 0.034788 | NaN | 0.120449 | 0.794341 |
2013-01-05 | 0.428547 | NaN | NaN | NaN |
2013-01-06 | 0.125631 | NaN | NaN | NaN |
Using the isin() method for filtering:
df2 = df.copy()
添加一列。
df2['E'] = ['one', 'one','two','three','four','three']
df2
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-01 | -1.334013 | -0.348297 | 0.388654 | -0.225967 | one |
2013-01-02 | -0.139974 | -1.347789 | 0.817077 | 0.192477 | one |
2013-01-03 | -1.082739 | -0.544105 | -1.423883 | -1.247367 | two |
2013-01-04 | 0.034788 | -0.677221 | 0.120449 | 0.794341 | three |
2013-01-05 | 0.428547 | -0.610156 | -0.950891 | -0.058047 | four |
2013-01-06 | 0.125631 | -0.116653 | -0.544575 | -1.578785 | three |
df2[df2['E'].isin(['two','four'])]
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-03 | -1.082739 | -0.544105 | -1.423883 | -1.247367 | two |
2013-01-05 | 0.428547 | -0.610156 | -0.950891 | -0.058047 | four |
6.3.5 Setting
Setting a new column automatically aligns the data by the indexes
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6)) s1
2013-01-02 1 2013-01-03 2 2013-01-04 3 2013-01-05 4 2013-01-06 5 2013-01-07 6 Freq: D, dtype: int64
Setting values by position
df.iat[0,1] = 0
Setting by assigning with a numpy array
df.loc[:,'D'] = np.array([5] * len(df))
The result of the prior setting operations
df
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | -1.334013 | 0.000000 | 0.388654 | 5 |
2013-01-02 | -0.139974 | -1.347789 | 0.817077 | 5 |
2013-01-03 | -1.082739 | -0.544105 | -1.423883 | 5 |
2013-01-04 | 0.034788 | -0.677221 | 0.120449 | 5 |
2013-01-05 | 0.428547 | -0.610156 | -0.950891 | 5 |
2013-01-06 | 0.125631 | -0.116653 | -0.544575 | 5 |
A where operation with setting.
df2 = df.copy()
df2[df2 > 0] = -df2
df2
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | -1.334013 | 0.000000 | -0.388654 | -5 |
2013-01-02 | -0.139974 | -1.347789 | -0.817077 | -5 |
2013-01-03 | -1.082739 | -0.544105 | -1.423883 | -5 |
2013-01-04 | -0.034788 | -0.677221 | -0.120449 | -5 |
2013-01-05 | -0.428547 | -0.610156 | -0.950891 | -5 |
2013-01-06 | -0.125631 | -0.116653 | -0.544575 | -5 |
This website does not host notebooks, it only renders notebooks available on other websites.
Delivered by Fastly, Rendered by Rackspace
nbviewer GitHub repository.