Introduction to the Pandas Library

Overview

Pandas provides high-performance, easy-to-use data types and analysis tools.

http://pandas.pydata.org

Pandas is built on NumPy and is commonly used together with NumPy and Matplotlib.

In :import pandas as pd
    d = pd.Series(range(20))
    d

Out:
0	0
1	1
2	2
3	3
4	4
...
19	19
dtype: int32

In :d.cumsum()	# running sum of the first N items

Out:
0	0
1	1
2	3
3	6
4	10
...
18	171
19	190
dtype: int32

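Series and DataFrame

Treat a whole set of data as if it were a single piece of data.

Operations built on these two data types fall into four groups: basic operations, arithmetic operations, feature (summary) operations, and correlation operations.

NumPy vs. Pandas

Basic data type vs. extended data types

NumPy's basic data type, ndarray, represents n-dimensional arrays. Pandas provides two extended data types built on ndarray.

Structural representation vs. application-oriented representation

NumPy focuses on the structural representation of data, that is, the dimensions the data is made of. Given some data, NumPy is concerned with which dimensions to use to store and present it, and the data is stored in a single variable as an n-dimensional array.

Pandas focuses on the application-oriented representation of data: when the data is used, how to extract it more effectively and how to compute on it.

Once the dimensions are established, the structure of the data is clearly expressed. When the data is actually used, however, overly tight dimensional relationships get in the way. Pandas therefore does not focus heavily on structural representation and concentrates on the application side instead, which shows up in the relationship between data and index.

Dimensions (relationships among data) vs. relationships between data and index

Series and DataFrame both have explicit, effective indexes. Data can be analyzed and extracted through the index, and the relationship between data and index makes the data very convenient to use.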
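The contrast between positional access in NumPy and label-based access in Pandas can be sketched as follows. This is a minimal illustration, not part of the original course material; the variable names (arr, s, shuffled) are made up for the example.

```python
import numpy as np
import pandas as pd

# NumPy: values are reached by position only.
arr = np.array([9, 8, 7, 6])
second_by_position = arr[1]

# pandas: the same values, each tied to a label of our choosing.
s = pd.Series(arr, index=['a', 'b', 'c', 'd'])
second_by_label = s['b']

# Reordering the labels does not break the label-to-value relationship.
shuffled = s[['d', 'b', 'a', 'c']]
print(second_by_position, second_by_label, shuffled['b'])
```

After reordering, the value 8 sits at a different position, but it is still found under the label 'b'.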

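The Series type

A Series is a one-dimensional labeled array.
It is the one-dimensional data type of Pandas.
Basic Series operations resemble those of ndarray and dict; data is aligned by index, not by dimension.
A Series consists of a set of data and the index associated with that data.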

index_0	--> data_a
index_1 --> data_b
index_2 --> data_c
index_3 --> data_d
index		  data

In :import pandas as pd
	a = pd.Series([9,8,7,6])
	a
	
Out:
0	9	# automatic index
1	8
2	7
3	6
dtype: int64 # <-- NumPy data type

In :import pandas as pd
	b = pd.Series([9,8,7,6],index=['a','b','c','d'])	# index is the second parameter, so the index= keyword may be omitted
	b
	
Out:
a	9	# custom index
b	8
c	7
d	6
dtype: int64

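Creation

A Series can be created from the following:

a Python list, a scalar value, a Python dict, an ndarray, or other functions

Creating from a scalar value

The index determines the size of the Series.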

In :import pandas as pd
	s = pd.Series(25,index=['a','b','c'])	# the index sets how many times the scalar is repeated; index= is kept explicit here
	s
	
Out:
a	25
b	25
c	25
dtype: int64

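Creating from a dict

The keys of the key-value pairs become the index. If index is also given, it selects entries from the dict and sets their order.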

In :import pandas as pd
	d = pd.Series({'a':9,'b':8,'c':7})
	d
	
Out:
a	9
b	8
c	7
dtype: int64
    
In :e = pd.Series({'a':9,'b':8,'c':7},index=['c','a','b','d'])
    e
    
Out:
c	7.0
a	9.0
b	8.0
d	NaN
dtype: float64

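Creating from an ndarray

When the data is a Python list, the index must have as many elements as the list.

With ndarrays, both the index and the data can be created from ndarray objects.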

In :import pandas as pd
	import numpy as np
	n = pd.Series(np.arange(5))
	n
	
Out:
0	0
1	1
2	2
3	3
4	4
dtype: int32
    
    
In :m = pd.Series(np.arange(5),index=np.arange(9,4,-1))
    m
    
Out:
9	0
8	1
7	2
6	3
5	4
dtype: int32

Creating from other functions

For example, the range() function.
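As a small sketch of the point above, a range object (or any list built from one) can be passed as the data. The names r and squares are illustrative only.

```python
import pandas as pd

# A range object is accepted directly as data.
r = pd.Series(range(3), index=['x', 'y', 'z'])
print(r)

# A list built from a range works the same way and gets an automatic index.
squares = pd.Series([i * i for i in range(4)])
print(squares)
```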

Basic operations

A Series has two parts, index and values, and its operations resemble those of ndarray and of Python dict.

In [1]:import pandas as pd
In [2]:b = pd.Series([9,8,7,6],['a','b','c','d'])
In [3]:b
Out[3]:
a	9
b	8
c	7
d	6
dtype: int64

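In [4]:b.index	# get the index
Out[4]:
Index(['a','b','c','d'],dtype='object')

In [5]:b.values	# get the data
Out[5]:array([9,8,7,6],dtype=int64)
       
In [6]:b['b']	# custom index
Out[6]:8

In [7]:b[1]		# the automatic (positional) index coexists with the custom one; recent pandas versions deprecate this form in favor of b.iloc[1]
Out[7]:8

In [8]:b[['c','d',0]]
Out[8]:
c	7.0
d	6.0
0	NaN			# the two indexes coexist but cannot be mixed; pandas 1.0 and later raise KeyError here
dtype: float64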

In [9]:b[['c','d','a']]
Out[9]:
c	7
d	6
a	9
dtype: int64
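Because a plain [] lookup can mean either a label or a position, the explicit accessors .loc (labels) and .iloc (positions) remove the ambiguity. The following is a minimal sketch using the same Series b as above; the result variable names are illustrative.

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

by_label = b.loc['b']            # label-based lookup
by_position = b.iloc[1]          # position-based lookup
label_slice = b.loc['b':'c']     # label slices include both endpoints
position_slice = b.iloc[1:3]     # position slices exclude the end, like lists

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Both slices select the same two items here, but note the different endpoint rules.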

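ndarray-like methods

Indexing works the same way, using [].
NumPy operations and functions can be applied to a Series.
A list of custom index labels can be used to select several items.
Slicing by the automatic index is supported; if a custom index exists, it is sliced along with the data.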

In [1]:import pandas as pd
       import numpy as np	# needed for np.exp below
In [2]:b = pd.Series([9,8,7,6],['a','b','c','d'])
In [3]:b
Out[3]:
a	9
b	8
c	7
d	6
dtype: int64

In [4]:b[3]		# a single value
Out[4]:6

In [5]:b[:3]	# a Series
Out[5]:
a	9
b	8
c	7
dtype: int64

In [6]:b[b>b.median()]
Out[6]:
a	9
b	8
dtype: int64

In [7]:np.exp(b)
Out[7]:
a	8103.083928
b	2980.957987
c	1096.633158
d	 403.428793
dtype: float64

Python dict-like methods

Access by custom index
The in keyword
The .get() method

In [1]:import pandas as pd

In [2]:b = pd.Series([9,8,7,6],['a','b','c','d'])

In [3]:b['b']
Out[3]:8

In [4]:'c' in b
Out[4]:True

In [5]:0 in b	# in checks only the custom index, not the automatic one
Out[5]:False

In [6]:b.get('f',100)
Out[6]:100

Alignment

In arithmetic, Series objects with different indexes are aligned automatically on their labels.

In [1]:import pandas as pd

In [2]:a = pd.Series([1,2,3],['c','d','e'])

In [3]:b = pd.Series([9,8,7,6],['a','b','c','d'])

In [4]:a+b
Out[4]:
a	NaN
b	NaN
c	8.0
d	8.0
e	NaN
dtype: float64
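When the NaN results produced by alignment are not wanted, the method form of the operator accepts a fill_value that stands in for labels missing on one side. This is a minimal sketch with the same a and b as above; the name total is illustrative.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['c', 'd', 'e'])
b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

plain = a + b                      # labels present on one side only give NaN
total = a.add(b, fill_value=0)     # a missing operand is treated as 0 instead

print(plain)
print(total)
```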

The name attribute

Both a Series object and its index can have a name, stored in the .name attribute.

In :
import pandas as pd
b = pd.Series([9,8,7,6],['a','b','c','d'])
b.name
b.name = 'Series object'
b.index.name = 'index column'
b

Out:
index column
a	9
b	8
c	7
d	6
Name: Series object, dtype: int64

Modifying a Series

A Series object can be modified at any time, and the change takes effect immediately.

In :
import pandas as pd
b = pd.Series([9,8,7,6],['a','b','c','d'])
b['a'] = 15
b.name = "Series"
b

Out:
a	15
b	8
c	7
d	6
Name: Series,dtype: int64

In :
b.name = "New Series"
b[['b','c']] = 20	# a list of labels assigns to several items at once
b

Out:
a	15
b	20
c	20
d	6
Name: New Series,dtype: int64

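The DataFrame type

A DataFrame is a two-dimensional labeled array.
It is the two-dimensional data type of Pandas, made up of a set of columns that share the same index.
A DataFrame is a table-like data type; each column may hold a different value type.
A DataFrame has both a row index and a column index, and its basic operations resemble those of Series.
A DataFrame is usually used for two-dimensional data, but it can also represent higher-dimensional data.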

index_0 --> data_a	data_1			data_w	(column)(axis=1)
index_1 --> data_b	data_2			data_x
index_2 --> data_c	data_3	......   data_y
index_3 --> data_d	data_4			data_z
index					multiple columns of data
(axis=0)

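Creation

A DataFrame can be created from the following:

a two-dimensional ndarray
a dict of one-dimensional ndarrays, lists, dicts, tuples, or Series
a Series
another DataFrame

Creating from a two-dimensional ndarray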

In :
import pandas as pd
import numpy as np
d = pd.DataFrame(np.arange(10).reshape(2,5))
d

Out:
	0	1	2	3	4	# automatic column index
0	0	1	2	3	4
1	5	6	7	8	9
# automatic row index

Creating from a dict of one-dimensional objects (Series in this example)

In :
import pandas as pd
dt = {'one':pd.Series([1,2,3],index=['a','b','c']),
	  'two':pd.Series([9,8,7,6],index=['a','b','c','d'])}
d = pd.DataFrame(dt)
d

Out:
	one	two	# column index, taken from the dict keys
a	1.0	  9
b	2.0	  8
c	3.0	  7
d	NaN	  6
# row index, the union of the indexes of the two Series

In :
pd.DataFrame(dt,index=['b','c','d'],columns=['two','three'])

Out:
	two	three
b	8	 NaN
c	7	 NaN
d	6	 NaN

Creating from a dict of lists

In :
import pandas as pd
dl = {'one':[1,2,3,4],'two':[9,8,7,6]}
d = pd.DataFrame(dl,index=['a','b','c','d'])
d

Out:
	one	two
a	1	9
b	2	8
c	3	7
d	4	6

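Example (querying the index, columns, and values)

In :
import pandas as pd
# Column names: 城市 = city, 环比 = month-over-month, 同比 = year-over-year, 定基 = fixed-base
dl = {'城市':['北京','上海','广州','深圳','沈阳'],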
	  '环比':[101.5, 101.2, 101.3, 102.0, 100.1],
	  '同比':[120.7, 127.3, 119.4, 140.9, 101.4],
	  '定基':[121.4, 127.8, 120.0, 145.5, 101.6]}
d = pd.DataFrame(dl,index=['c1','c2','c3','c4','c5'])
d

Out:
	同比	城市	定基	环比
c1	120.7 北京  121.4 101.5
c2  127.3 上海  127.8 101.2
c3  119.4 广州  120.0 101.3
c4  140.9 深圳  145.5 102.0
c5  101.4 沈阳  101.6 100.1

In :d.index
Out:Index(['c1','c2','c3','c4','c5'],dtype='object')
    
In :d.columns
Out:Index(['同比','城市','定基','环比'],dtype='object')

In :d.values
Out:
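array([[120.7,'北京',121.4,101.5],
       [127.3,'上海',127.8,101.2],
       [119.4,'广州',120.0,101.3],
       [140.9,'深圳',145.5,102.0],
       [101.4,'沈阳',101.6,100.1]],dtype=object)

### The output shown here comes from an older environment in which dict entries were unordered, so the column order of d could differ from the order in the dict. With Python 3.7+ and pandas 0.23+, the columns follow the dict's insertion order.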

In :d['同比']
Out:
c1	120.7
c2  127.3
c3  119.4
c4  140.9
c5  101.4
Name:同比,dtype: float64
        
In :d.loc['c2']	# .ix has been removed from pandas; .loc selects a row by label
Out:
同比	127.3
城市	上海
定基	127.8
环比	101.2
Name:c2,dtype:object

In :d['同比']['c2']
Out:127.3
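The lookups above can also be written with the explicit .loc and .iloc accessors, which take the row and the column in one step. The following is a minimal sketch on a two-column subset of the same data; the result variable names are illustrative.

```python
import pandas as pd

dl = {'城市': ['北京', '上海', '广州', '深圳', '沈阳'],
      '同比': [120.7, 127.3, 119.4, 140.9, 101.4]}
d = pd.DataFrame(dl, index=['c1', 'c2', 'c3', 'c4', 'c5'])

row = d.loc['c2']              # one row, returned as a Series
cell = d.loc['c2', '同比']      # row label and column label in a single lookup
first_row = d.iloc[0]          # first row, selected by position

print(row)
print(cell)
print(first_row)
```

A single d.loc[row, column] lookup avoids the intermediate Series created by chained indexing such as d['同比']['c2'].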

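Operations on the data types

How to change Series and DataFrame objects

Here, "changing" means adding to or reordering the index of a Series or DataFrame, or deleting some of its values.

The index type and its common methods

The indexes of Series and DataFrame are of type Index.

An Index object is immutable.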

In :d.index
Out:Index(['c5','c4','c3','c2','c1'],dtype='object')

In :d.columns
Out:Index(['城市','同比','环比','定基'],dtype='object')

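Method	Description
.append(idx)	Concatenates another Index object, producing a new Index object
.difference(idx)	Computes the set difference, producing a new Index object
.intersection(idx)	Computes the intersection
.union(idx)	Computes the union
.delete(loc)	Deletes the element at position loc
.insert(loc,e)	Inserts element e at position loc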
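The Index methods in this table can be tried on a small standalone Index. This is a minimal sketch; the labels and variable names are made up for the example. Every call returns a new Index, and the original is left unchanged.

```python
import pandas as pd

idx = pd.Index(['c1', 'c2', 'c3'])
other = pd.Index(['c3', 'c4'])

joined = idx.append(other)           # concatenation, duplicates are kept
diff = idx.difference(other)         # labels in idx but not in other
common = idx.intersection(other)     # labels present in both
merged = idx.union(other)            # labels present in either
shorter = idx.delete(0)              # drop the element at position 0
longer = idx.insert(3, 'c9')         # add 'c9' at position 3 (the end)

# In-place modification is rejected because an Index is immutable.
try:
    idx[0] = 'x'
    assignment_failed = False
except TypeError as err:
    assignment_failed = True
    print('Index is immutable:', err)
```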

In :d
Out:
	同比	城市	定基	环比
c1	120.7 北京  121.4 101.5
c2  127.3 上海  127.8 101.2
c3  119.4 广州  120.0 101.3
c4  140.9 深圳  145.5 102.0
c5  101.4 沈阳  101.6 100.1

In :
nc = d.columns.delete(2)
ni = d.index.insert(5,'c6')
nd = d.reindex(index=ni,columns=nc,method='ffill')
nd

Out:
	同比	城市	环比
c1	120.7 北京 101.5
c2  127.3 上海 101.2
c3  119.4 广州 101.3
c4  140.9 深圳 102.0
c5  101.4 沈阳 100.1
c6  101.4 沈阳 100.1

Adding or reordering: reindexing

.reindex() changes or reorders the index of a Series or DataFrame.

In :
import pandas as pd
dl = {'城市':['北京','上海','广州','深圳','沈阳'],
	  '环比':[101.5, 101.2, 101.3, 102.0, 100.1],
	  '同比':[120.7, 127.3, 119.4, 140.9, 101.4],
	  '定基':[121.4, 127.8, 120.0, 145.5, 101.6]}
d = pd.DataFrame(dl,index=['c1','c2','c3','c4','c5'])
d

Out:
	同比	城市	定基	环比
c1	120.7 北京  121.4 101.5
c2  127.3 上海  127.8 101.2
c3  119.4 广州  120.0 101.3
c4  140.9 深圳  145.5 102.0
c5  101.4 沈阳  101.6 100.1

In :
d = d.reindex(index=['c5','c4','c3','c2','c1'])
d

Out:
	同比	城市	定基	环比
c5  101.4 沈阳  101.6 100.1
c4  140.9 深圳  145.5 102.0
c3  119.4 广州  120.0 101.3
c2  127.3 上海  127.8 101.2
c1	120.7 北京  121.4 101.5

In :
d = d.reindex(columns=['城市','同比','环比','定基'])
d

Out:
	城市	同比	环比	定基
c5	沈阳  101.4 100.1 101.6
...
c1	北京  120.7 101.5 121.4

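Parameters of .reindex(index=None,columns=None,…)

Parameter	Description
index,columns	New custom labels for the rows and columns
fill_value	Value used to fill missing positions created by reindexing
method	Fill method: ffill fills forward from the previous value, bfill fills backward
limit	Maximum number of consecutive fills
copy	Defaults to True, which produces a new object; with False, no copy is made when the new index equals the old one

In :
newc = d.columns.insert(4,'新增')	# '新增' means "new"
newd = d.reindex(columns=newc,fill_value=200)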
newd

Out:
	城市	同比	环比	定基	新增
c5	沈阳  101.4 100.1 101.6 200
...
c1	北京  120.7 101.5 121.4 200
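The fill_value, method, and limit parameters are easiest to compare on a small Series with gaps in its index. The following is a minimal sketch; the Series s and the result names are invented for the illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[0, 2, 4])

# New labels 1, 3, 5 have no data: fill them with a constant.
padded = s.reindex(list(range(6)), fill_value=0)

# Or carry the previous value forward.
forward = s.reindex(list(range(6)), method='ffill')

# limit=1 allows at most one consecutive forward fill, so labels 6 and 7 stay NaN.
limited = s.reindex(list(range(8)), method='ffill', limit=1)

print(padded.tolist())
print(forward.tolist())
print(limited.tolist())
```

Note that method='ffill' requires the existing index to be sorted (monotonic), which is the case here.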

Deleting: drop

.drop() removes the specified row or column labels from a Series or DataFrame.

In [1]:import pandas as pd
In [2]:a = pd.Series([9,8,7,6],['a','b','c','d'])
In [3]:a
Out[3]:
a	9
b	8
c	7
d	6
dtype: int64

In :a.drop(['b','c'])
Out:
a	9
d	6
dtype: int64

In :d
Out:
	同比	城市	定基	环比
c5  101.4 沈阳  101.6 100.1
c4  140.9 深圳  145.5 102.0
c3  119.4 广州  120.0 101.3
c2  127.3 上海  127.8 101.2
c1	120.7 北京  121.4 101.5

In :d.drop('c5')	# axis 0 (rows) by default
Out:
	同比	城市	定基	环比
c4  140.9 深圳  145.5 102.0
c3  119.4 广州  120.0 101.3
c2  127.3 上海  127.8 101.2
c1	120.7 北京  121.4 101.5

In :d.drop('同比',axis=1)	# d still contains c5, because drop returns a new object
Out:
	城市	定基	环比
c5  沈阳  101.6 100.1
c4  深圳  145.5 102.0
c3  广州  120.0 101.3
c2  上海  127.8 101.2
c1	北京  121.4 101.5
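Since .drop() returns a new object and leaves the original untouched, the result has to be assigned to be kept. The sketch below illustrates this, along with the index= and columns= keywords as an alternative to axis; the small DataFrame df and its labels are made up for the example.

```python
import pandas as pd

a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

trimmed = a.drop(['b', 'c'])   # new Series without 'b' and 'c'
print(len(a), len(trimmed))    # the original keeps all four items

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r1', 'r2'])
no_row = df.drop(index='r1')     # same as df.drop('r1', axis=0)
no_col = df.drop(columns='y')    # same as df.drop('y', axis=1)
```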

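Arithmetic on the data types

Rules for arithmetic operations

Arithmetic operations align both operands on their row and column indexes before computing; results are floating-point by default.
Positions missing from either operand are filled with NaN (missing value) during alignment.
Operations between two-dimensional and one-dimensional objects, and between one-dimensional objects and scalars, are broadcast.
Binary operations written with the + - * / operators produce a new object.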

In :
import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape(3,4))
a

Out:
	0	1	2	3
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11

In :
b = pd.DataFrame(np.arange(20).reshape(4,5))
b

Out:
	0	1	2	3	4
0	0	1	2	3	4
1	5	6	7	8	9
2	10	11	12	13	14
3	15	16	17	18	19

In :a+b
Out:
	0	 1	  2	   3	4
0	0.0	 2.0  4.0  6.0  NaN
1	9.0 11.0  13.0 15.0	NaN
2	18.0 20.0 22.0 24.0	NaN
3	NaN	 NaN  NaN  NaN	NaN

In :a*b
Out:	# automatic alignment; missing entries become NaN
	0	 1	  2	    3	 4
0	0.0	 1.0  4.0   9.0   NaN
1	20.0 30.0 42.0  56.0  NaN
2   80.0 99.0 120.0 143.0  NaN
3	NaN	 NaN  NaN   NaN	  NaN

# Operations between objects of different dimensionality are broadcast; a one-dimensional Series takes part along axis 1 by default
In :
import pandas as pd
import numpy as np
b = pd.DataFrame(np.arange(20).reshape(4,5))
print(b)
c = pd.Series(np.arange(4))
print(c)
print(c-10)
print(b-c)

Out:
   0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19

0    0
1    1
2    2
3    3
dtype: int32

0   -10
1    -9
2    -8
3    -7
dtype: int32

      0     1     2     3   4
0   0.0   0.0   0.0   0.0 NaN
1   5.0   5.0   5.0   5.0 NaN
2  10.0  10.0  10.0  10.0 NaN
3  15.0  15.0  15.0  15.0 NaN

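Method-form operations

Advantage: the methods accept optional parameters.

Method	Description
.add(d,**kwargs)	Addition between objects, with optional parameters
.sub(d,**kwargs)	Subtraction between objects, with optional parameters
.mul(d,**kwargs)	Multiplication between objects, with optional parameters
.div(d,**kwargs)	Division between objects, with optional parameters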

In :b.add(a,fill_value = 100)
Out:
	0	 1	  2	   3	4
0	0.0	 2.0  4.0  6.0  104.0
1	9.0 11.0  13.0 15.0	109.0
2	18.0 20.0 22.0 24.0	114.0
3	115.0 116.0 117.0 118.0 119.0

In :a.mul(b,fill_value = 0)
Out:
	0	 1	  2	    3	 4
0	0.0	 1.0  4.0   9.0   0.0
1	20.0 30.0 42.0  56.0  0.0
2   80.0 99.0 120.0 143.0  0.0
3	0.0	 0.0  0.0   0.0   0.0

# With the method form, a one-dimensional Series can take part along axis 0
In :b.sub(c,axis=0)
Out:
    0   1   2   3   4
0   0   1   2   3   4
1   4   5   6   7   8
2   8   9  10  11  12
3  12  13  14  15  16
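The effect of the axis parameter can be checked directly: with axis=1 the labels of the Series are matched against the column labels (the same as the - operator), and with axis=0 against the row labels. This is a minimal sketch reusing b and c from above; the result names are illustrative.

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(20).reshape(4, 5))
c = pd.Series(np.arange(4))

by_columns = b.sub(c, axis=1)   # c's labels 0..3 match columns 0..3; column 4 becomes NaN
by_rows = b.sub(c, axis=0)      # c's labels 0..3 match rows 0..3; no NaN appears

print(by_columns)
print(by_rows)
```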

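Rules for comparison operations

Comparison operations only compare elements with the same index; no padding is performed.
Operations between two-dimensional and one-dimensional objects, and between one-dimensional objects and scalars, are broadcast.
Binary operations written with the > < >= <= == != operators produce Boolean objects.

# Same dimensionality: the shapes must match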
In :
import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape(3,4))
print(a)
d = pd.DataFrame(np.arange(12,0,-1).reshape(3,4))
print(d)
print(a>d)
print(a==d)

Out:
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
    0   1   2  3
0  12  11  10  9
1   8   7   6  5
2   4   3   2  1
       0      1      2      3
0  False  False  False  False
1  False  False  False   True
2   True   True   True   True
       0      1      2      3
0  False  False  False  False
1  False  False   True  False
2  False  False  False  False

# Different dimensionality: broadcast, along axis 1 by default
In :
import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape(3,4))
print(a)
c = pd.Series(np.arange(4))
print(c)
print(a>c)
print(c>0)

Out:
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

0    0
1    1
2    2
3    3
dtype: int32
    
       0      1      2      3
0  False  False  False  False
1   True   True   True   True
2   True   True   True   True

0    False
1     True
2     True
3     True
dtype: bool
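The "no padding" rule can be seen by comparing two Series whose labels differ: the operator form refuses the comparison, while the method form (.gt() here) aligns first. The following is a minimal sketch; the Series x, y, z and the flag mismatch_raised are invented for the illustration.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
y = pd.Series([3, 2, 1], index=['a', 'b', 'c'])
z = pd.Series([0, 2], index=['a', 'b'])

same_labels = x > y              # element-wise, because the labels match

try:
    x > z                        # labels differ: no alignment is attempted
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# The method form aligns on labels; the comparison against the missing 'c' is False.
aligned = x.gt(z)

print(same_labels.tolist(), mismatch_raised, aligned.tolist())
```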

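Sorting data

A set of data expresses one or more meanings. By summarizing it, we extract features of the data, with some loss, to obtain:

Basic statistics (including sorting)
Distribution and cumulative statistics
Data features (correlation, periodicity, and so on)
Data mining (forming knowledge)

.sort_index()

Sorts by index along the specified axis, ascending by default.

.sort_index(axis=0,ascending=True)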

In :
import pandas as pd
import numpy as np
b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b'])
b

Out:
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19

In :
b.sort_index()

Out:
    0   1   2   3   4
a   5   6   7   8   9
b  15  16  17  18  19
c   0   1   2   3   4
d  10  11  12  13  14

In :
b.sort_index(ascending=False)

Out:
    0   1   2   3   4
d  10  11  12  13  14
c   0   1   2   3   4
b  15  16  17  18  19
a   5   6   7   8   9

In :
c = b.sort_index(axis=1,ascending = False)
c

Out:
    4   3   2   1   0
c   4   3   2   1   0
a   9   8   7   6   5
d  14  13  12  11  10
b  19  18  17  16  15

In :
c = c.sort_index()
c

Out:
    4   3   2   1   0
a   9   8   7   6   5
b  19  18  17  16  15
c   4   3   2   1   0
d  14  13  12  11  10

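.sort_values()

Sorts by value along the specified axis, ascending by default.

Series.sort_values(axis=0,ascending=True)
DataFrame.sort_values(by,axis=0,ascending=True) # by: a label or list of labels to sort by (column labels when axis=0, row labels when axis=1)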

In :
import pandas as pd
import numpy as np
b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b'])
b

Out:
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19

In :
c = b.sort_values(2,ascending=False)
c

Out:
    0   1   2   3   4
b  15  16  17  18  19
d  10  11  12  13  14
a   5   6   7   8   9
c   0   1   2   3   4

In :
c = c.sort_values('a',axis=1,ascending=False)
c

Out:
    4   3   2   1   0
b  19  18  17  16  15
d  14  13  12  11  10
a   9   8   7   6   5
c   4   3   2   1   0

NaN values are placed at the end of the sorted result, for both ascending and descending order.
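The placement of NaN can be verified on a short Series. This is a minimal sketch with made-up values; it also shows the na_position parameter of sort_values, which moves NaN to the front when that is preferred.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 2.0], index=['a', 'b', 'c', 'd'])

ascending = s.sort_values()                      # NaN last
descending = s.sort_values(ascending=False)      # NaN still last
nan_first = s.sort_values(na_position='first')   # NaN moved to the front

print(list(ascending.index))
print(list(descending.index))
print(list(nan_first.index))
```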

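Basic statistical analysis

The following methods apply to both Series and DataFrame.

Method	Description
.sum()	Sum of the data, computed along axis 0 (likewise for the methods below)
.count()	Number of non-NaN values
.mean() .median()	Arithmetic mean and median
.var() .std()	Variance and standard deviation
.min() .max()	Minimum and maximum
.describe()	Statistical summary along axis 0 (per column)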

In :
import pandas as pd
a = pd.Series([9,8,7,6],index=['a','b','c','d'])
a

Out:
a    9
b    8
c    7
d    6
dtype: int64

In :
a.describe()

Out:
count    4.000000
mean     7.500000
std      1.290994
min      6.000000
25%      6.750000
50%      7.500000
75%      8.250000
max      9.000000
dtype: float64

In :
type(a.describe())

Out:
pandas.core.series.Series

In :
a.describe()['count']

Out:
4.0

In :
a.describe()['max']

Out:
9.0

In :
import numpy as np
import pandas as pd
b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b'])
b.describe()

Out:
               0          1          2          3          4
count   4.000000   4.000000   4.000000   4.000000   4.000000
mean    7.500000   8.500000   9.500000  10.500000  11.500000
std     6.454972   6.454972   6.454972   6.454972   6.454972
min     0.000000   1.000000   2.000000   3.000000   4.000000
25%     3.750000   4.750000   5.750000   6.750000   7.750000
50%     7.500000   8.500000   9.500000  10.500000  11.500000
75%    11.250000  12.250000  13.250000  14.250000  15.250000
max    15.000000  16.000000  17.000000  18.000000  19.000000

In :
type(b.describe())

Out:
<class 'pandas.core.frame.DataFrame'>

In :
b.describe().loc['max'] 		# .ix has been deprecated since pandas 0.20.0 and was later removed; use .loc instead

Out:
0    15.0
1    16.0
2    17.0
3    18.0
4    19.0
Name: max, dtype: float64

In :
b.describe()[2]

Out:
count     4.000000     
mean      9.500000     
std       6.454972     
min       2.000000     
25%       5.750000     
50%       9.500000     
75%      13.250000     
max      17.000000     
Name: 2, dtype: float64

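The following methods apply to Series.

With the automatic index it is easy to take a range of elements (slicing); custom labels do not necessarily form a sequence, so slicing by them is harder.

Method	Description
.argmin() .argmax()	Positions (automatic index) of the minimum and maximum values
.idxmin() .idxmax()	Labels (custom index) of the minimum and maximum values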
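The difference between the two pairs of methods can be sketched on the Series a used earlier. The snippet assumes pandas 1.0 or later, where argmin and argmax return integer positions; the result names are illustrative.

```python
import pandas as pd

a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

pos_min, pos_max = a.argmin(), a.argmax()        # integer positions
label_min, label_max = a.idxmin(), a.idxmax()    # custom index labels

print(pos_min, pos_max)
print(label_min, label_max)
```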

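Cumulative statistical analysis

Cumulative methods reduce the need for for loops and make computations on data more flexible.
They apply to both Series and DataFrame.

Basic cumulative computations

Method	Description
.cumsum()	Running sum of the first 1, 2, …, n values
.cumprod()	Running product of the first 1, 2, …, n values
.cummax()	Running maximum of the first 1, 2, …, n values
.cummin()	Running minimum of the first 1, 2, …, n values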

In :
import pandas as pd
import numpy as np
b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b'])
b

Out:
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19

In :
b.cumsum()

Out:
    0   1   2   3   4
c   0   1   2   3   4
a   5   7   9  11  13
d  15  18  21  24  27
b  30  34  38  42  46

In :
b.cumprod()

Out:
   0     1     2     3     4
c  0     1     2     3     4
a  0     6    14    24    36
d  0    66   168   312   504
b  0  1056  2856  5616  9576

In :
b.cummin()

Out:
   0  1  2  3  4
c  0  1  2  3  4
a  0  1  2  3  4
d  0  1  2  3  4
b  0  1  2  3  4

In :
b.cummax()

Out:
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19
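To make the "fewer for loops" point concrete, the sketch below compares cumsum() on column 0 with an explicit loop, and also shows axis=1, which accumulates across each row instead of down each column. The names down_rows, across_cols, and running are illustrative.

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['c', 'a', 'd', 'b'])

down_rows = b.cumsum()            # axis=0: accumulate down each column
across_cols = b.cumsum(axis=1)    # axis=1: accumulate across each row

# The same running sum for column 0, written as an explicit loop.
running = []
total = 0
for value in b[0]:
    total += value
    running.append(total)

print(down_rows[0].tolist(), running)
print(across_cols.loc['c'].tolist())
```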

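Rolling (window) computations

Method	Description
.rolling(w).sum()	Sum of each run of w adjacent elements
.rolling(w).mean()	Arithmetic mean of each run of w adjacent elements
.rolling(w).var()	Variance of each run of w adjacent elements
.rolling(w).std()	Standard deviation of each run of w adjacent elements
.rolling(w).min() .max()	Minimum and maximum of each run of w adjacent elements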

In :
import pandas as pd
import numpy as np
b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b'])
b

Out:
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19

In :
b.rolling(2).sum()

Out:
      0     1     2     3     4
c   NaN   NaN   NaN   NaN   NaN
a   5.0   7.0   9.0  11.0  13.0
d  15.0  17.0  19.0  21.0  23.0
b  25.0  27.0  29.0  31.0  33.0

In :
b.rolling(3).sum()

Out:
      0     1     2     3     4
c   NaN   NaN   NaN   NaN   NaN
a   NaN   NaN   NaN   NaN   NaN
d  15.0  18.0  21.0  24.0  27.0
b  30.0  33.0  36.0  39.0  42.0
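The leading NaN rows appear because a window of size w needs w values before it can produce a result. The sketch below shows this on a short Series, together with the min_periods parameter of rolling(), which allows partial windows; the Series and result names are made up for the example.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

strict = s.rolling(3).sum()                   # NaN until 3 values are available
relaxed = s.rolling(3, min_periods=1).sum()   # partial windows are allowed
moving_avg = s.rolling(2).mean()              # mean of each adjacent pair

print(strict.tolist())
print(relaxed.tolist())
print(moving_avg.tolist())
```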

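Correlation analysis

Given two things, represented as X and Y, how can we tell whether they are correlated?

X increases and Y increases: the two variables are positively correlated
X increases and Y decreases: the two variables are negatively correlated
X increases and Y shows no consistent response: the two variables are uncorrelated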

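Covariance

$$cov(X,Y)={{\textstyle \sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}\over{n-1}}$$

Covariance > 0: X and Y are positively correlated
Covariance < 0: X and Y are negatively correlated
Covariance = 0: X and Y are linearly uncorrelated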

Pearson correlation coefficient

$$r = {{{\textstyle \sum_{i=1}^{n}}(x_i-\bar{x})(y_i-\bar{y})}\over{\sqrt{\textstyle\sum_{i=1}^{n}(x_i-\bar{x})^2}}{\sqrt{\textstyle\sum_{i=1}^{n}(y_i-\bar{y})^2}}}$$

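r ranges over [-1,1]; the strength of the correlation is judged by the absolute value of r:
0.8-1.0 very strong correlation
0.6-0.8 strong correlation
0.4-0.6 moderate correlation
0.2-0.4 weak correlation
0.0-0.2 very weak or no correlation

Functions

Method	Description
.cov()	Computes the covariance matrix
.corr()	Computes the correlation matrix, using the Pearson, Spearman, or Kendall coefficient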

In :
import pandas as pd
hprice = pd.Series([3.03,22.93,12.75,22.6,12.33],index=['2008','2009','2010','2011','2012'])
m2 = pd.Series([8.18,18.18,9.13,7.82,6.69],index=['2008','2009','2010','2011','2012'])
hprice.corr(m2)

Out:
0.5231190329758817
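As a cross-check, the covariance and Pearson formulas above can be evaluated by hand on the same two Series and compared with .cov() and .corr(). This is a minimal sketch; the helper names dx, dy, cov_manual, and r_manual are introduced only for the illustration.

```python
import numpy as np
import pandas as pd

years = ['2008', '2009', '2010', '2011', '2012']
hprice = pd.Series([3.03, 22.93, 12.75, 22.6, 12.33], index=years)
m2 = pd.Series([8.18, 18.18, 9.13, 7.82, 6.69], index=years)

# Deviations from the mean, as in the formulas.
dx = hprice - hprice.mean()
dy = m2 - m2.mean()

cov_manual = (dx * dy).sum() / (len(hprice) - 1)
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(cov_manual, hprice.cov(m2))
print(r_manual, hprice.corr(m2))
```

Both pairs of values should agree up to floating-point rounding.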


Author: XuanYa

Published: May 17, 2024
