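DataFrame official API reference
DataFrame
A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be viewed as a dictionary of Series that all share the same index.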
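To make the "dictionary of Series" view concrete, here is a minimal sketch. The names pop_series, year_series and frame_sketch, and the row labels, are illustrative assumptions and do not come from the original notes.

```python
import pandas as pd

# Two Series that share the same row labels (illustrative data).
pop_series = pd.Series([1.5, 1.7, 3.6], index=['a', 'b', 'c'])
year_series = pd.Series([2000, 2001, 2002], index=['a', 'b', 'c'])

# A dict of Series becomes a DataFrame: the dict keys turn into column labels.
frame_sketch = pd.DataFrame({'pop': pop_series, 'year': year_series})

# Selecting one column gives back a Series that carries the shared row index.
column = frame_sketch['year']
print(type(column).__name__)
print(column.index.equals(frame_sketch.index))
```

Each column behaves like a value in a dictionary keyed by column label, which is why frame_sketch['year'] works the same way as a dict lookup.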
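Creating a DataFrame
from pandas import DataFrame
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame
(The output below comes from an older pandas release that sorted dict keys alphabetically; pandas 0.23 and later on Python 3.6+ keep the dict's insertion order.)
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002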
Arranging the DataFrame's columns in a specified order
DataFrame(data, columns=['year', 'state', 'pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
Setting custom index labels
DataFrame(data, columns=['year', 'state', 'pop'],
          index=['one', 'two', 'three', 'four', 'five'])
       year   state  pop
one    2000    Ohio  1.5
two    2001    Ohio  1.7
three  2002    Ohio  3.6
four   2001  Nevada  2.4
five   2002  Nevada  2.9
Creating an empty column
If a column named in columns is not found in the data, it is filled with NaN values.
DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
          index=['one', 'two', 'three', 'four', 'five'])
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
Comparing column values to create a boolean column
frame = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                  index=['one', 'two', 'three', 'four', 'five'])
frame['eastern'] = frame.state == 'Ohio'
frame
       year   state  pop debt  eastern
one    2000    Ohio  1.5  NaN     True
two    2001    Ohio  1.7  NaN     True
three  2002    Ohio  3.6  NaN     True
four   2001  Nevada  2.4  NaN    False
five   2002  Nevada  2.9  NaN    False
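Creating from nested dicts (a dict of dicts)
The keys of the outer dict become the columns, and the keys of the inner dicts become the row index.
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = DataFrame(pop)
frame
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6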
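pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
# An explicit index keeps only those labels; labels missing from the data (2003) become NaN
frame = DataFrame(pop, index=[2001, 2002, 2003])
frame
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN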
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = DataFrame(pop)
# A dict of Series (here, slices of existing columns) is handled the same way
pdata = {'Ohio': frame['Ohio'][:-1],
         'Nevada': frame['Nevada'][:2]}
DataFrame(pdata)
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
Naming the index
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = DataFrame(pop)
frame.index.name = 'year'
frame
      Nevada  Ohio
year
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
Naming the columns
frame.columns.name = 'state'
frame
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
Transposing
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = DataFrame(pop)
frame.T
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
The .values attribute returns the data in the DataFrame as a two-dimensional ndarray
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = DataFrame(pop)
frame.values
array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])
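As a small sketch of how the resulting array's dtype depends on the columns: an all-numeric frame yields a float array, while a frame mixing strings and numbers falls back to a dtype that can hold every value. The frames numeric and mixed below are illustrative assumptions, and to_numpy() is shown as the method equivalent of .values available in pandas 0.24 and later.

```python
import numpy as np
import pandas as pd

# All-numeric columns: the 2-D array keeps a numeric dtype.
numeric = pd.DataFrame({'Nevada': [np.nan, 2.4, 2.9], 'Ohio': [1.5, 1.7, 3.6]})

# Mixed string and float columns: the array falls back to the object dtype.
mixed = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

print(numeric.values.dtype)
print(mixed.values.dtype)

# to_numpy() returns the same data as .values for this frame.
arr = numeric.to_numpy()
print(arr.shape)
```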
Deleting a column with del
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = DataFrame(pop)
del frame['Ohio']
frame
      Nevada
2000     NaN
2001     2.4
2002     2.9
Retrieving data
Getting a column's values
frame = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                  index=['one', 'two', 'three', 'four', 'five'])
frame['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
frame.year
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
Getting all column names with .columns
frame.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
Getting all index labels with .index
frame.index
Index(['one', 'two', 'three', 'four', 'five'], dtype='object')
Getting a row's values
frame = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                  index=['one', 'two', 'three', 'four', 'five'])
# .ix is deprecated (and was removed in pandas 1.0); use .loc for label-based indexing
frame.loc['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
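Since .ix is no longer available in current pandas releases, rows are selected with .loc (by label) or .iloc (by integer position). The following is a minimal sketch that rebuilds a small frame with the same data; the names by_label and by_position are illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                     index=['one', 'two', 'three', 'four', 'five'])

# Label-based lookup: the row whose index label is 'three'.
by_label = frame.loc['three']

# Position-based lookup: the third row (positions start at 0).
by_position = frame.iloc[2]

print(by_label)
print(by_label.equals(by_position))
```

Both calls return the same row as a Series whose index is the frame's column labels.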
Assignment
frame = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                  index=['one', 'two', 'three', 'four', 'five'])
# Assigning a scalar broadcasts it to every row
frame['debt'] = 16.5
frame
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
import numpy as np
# Assigning an array requires its length to match the number of rows
frame['debt'] = np.arange(5)
frame
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
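Assigning a Series to a DataFrame
When the assigned value is a Series, its labels are matched exactly against the DataFrame's index, and every position without a match is filled with a missing value.
from pandas import Series
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame['debt'] = val
frame
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7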
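The alignment works in both directions: labels missing from the Series become NaN, and labels in the Series that the DataFrame does not have are dropped. A minimal sketch under illustrative assumptions (the small frame and the extra label 'six' are not from the original notes):

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# 'two' matches a row label; 'six' does not exist in the frame.
val = pd.Series([-1.2, 9.9], index=['two', 'six'])
frame['debt'] = val

# Rows 'one' and 'three' get NaN; the value labelled 'six' is discarded.
print(frame)
```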
Index objects
Index objects are immutable, which is what allows them to be shared safely among multiple data structures.
from pandas import Series
obj = Series(range(3), index=['a', 'b', 'c'])
obj
a    0
b    1
c    2
dtype: int64
index = obj.index
index
Index(['a', 'b', 'c'], dtype='object')
index[1:]
Index(['b', 'c'], dtype='object')
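A short sketch of what immutability means in practice: item assignment on an Index raises a TypeError, while slicing returns a new Index and leaves the original untouched. The flag name mutated is illustrative.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# Trying to change a label in place is rejected.
try:
    index[1] = 'd'
    mutated = True
except TypeError as exc:
    mutated = False
    print('TypeError:', exc)

# Slicing builds a new Index; the original keeps all three labels.
tail = index[1:]
print(list(tail), list(index))
```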
pd.Index()
The most general Index object, which represents the axis labels as a NumPy array of Python objects.
import numpy as np
import pandas as pd
pd.Index(np.arange(3))
Int64Index([0, 1, 2], dtype='int64')
index = pd.Index(np.arange(3))
obj = Series([1.5, -2.5, 0], index=index)
obj
0    1.5
1   -2.5
2    0.0
dtype: float64
# The Series reuses the very same Index object rather than copying it
obj.index is index
True
Using membership tests to check whether an index contains a label
from pandas import DataFrame, Series
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = DataFrame(pop)
frame.index.name = 'year'
frame.columns.name = 'state'
frame
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
'Ohio' in frame.columns
True
2003 in frame.index
False
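Basic functionality
.reindex
reindex creates a new object whose data conforms to a new index. Its parameters are:
Parameter    Description
index        New sequence to use as the index. It can be an Index instance or any other sequence-like Python data structure. An Index is used exactly as is, without any copying.
method       Interpolation (fill) method.
fill_value   Substitute value to use where reindexing introduces missing data.
limit        Maximum number of entries to fill when forward- or backward-filling.
level        Match a simple Index against the specified level of a MultiIndex; otherwise select a subset of it.
copy         Defaults to True, which always copies; if False, no copy is made when the new index equals the old one.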
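A minimal sketch of two of these parameters, fill_value and limit, on a small Series. The data and the names filled and limited are illustrative assumptions.

```python
import pandas as pd

obj = pd.Series([10.0, 20.0], index=[0, 3])

# fill_value: labels that are new to the index get 0.0 instead of NaN.
filled = obj.reindex([0, 1, 3], fill_value=0.0)
print(filled)

# limit: forward-fill at most one consecutive new label after each known one,
# so labels 2 and 5 stay missing.
limited = obj.reindex(range(6), method='ffill', limit=1)
print(limited)
```

Note that fill_value only applies to labels introduced by the reindex; it does not replace NaN values that were already present in the data.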
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
obj = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
The fill_value parameter
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
The method option
ffill or pad: fill (carry) values forward
bfill or backfill: fill (carry) values backward
obj = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj
0      blue
2    purple
4    yellow
dtype: object
obj.reindex(range(6), method='ffill')
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
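For comparison with the forward fill, here is a minimal sketch of the backward direction on the same kind of Series; the name back is illustrative. With bfill each new label takes the value of the next known label, so the final label, which has nothing after it, stays missing.

```python
import pandas as pd

obj = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Backward fill: label 1 takes the value at 2, label 3 takes the value at 4.
back = obj.reindex(range(6), method='bfill')
print(back)

# Label 5 has no later value to borrow from, so it is missing.
print(pd.isna(back.iloc[5]))
```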
frame = DataFrame(np.arange(9).reshape((3, 3)),
                  columns=['Ohio', 'Texas', 'California'],
                  index=['a', 'c', 'd'])
frame
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
# Passing a single sequence reindexes the rows
frame.reindex(['a', 'b', 'c', 'd'])
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
# The columns keyword reindexes the columns
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
frame = DataFrame(np.arange(9).reshape((3, 3)),
                  columns=['Ohio', 'Texas', 'California'],
                  index=['a', 'c', 'd'])
states = ['Texas', 'Utah', 'California']
frame = frame.reindex(columns=states)
# Forward-fill the new row 'b' from row 'a'
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill', columns=states)
   Texas  Utah  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8
Reindexing rows and columns together (in place of the deprecated .ix)
frame = DataFrame(np.arange(9).reshape((3, 3)),
                  columns=['Ohio', 'Texas', 'California'],
                  index=['a', 'c', 'd'])
states = ['Texas', 'Utah', 'California']
frame = frame.reindex(columns=states)
frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0
Dropping entries from an axis
The drop method returns a new object with the specified values removed from the given axis.
obj = Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
obj
a    0
b    1
c    2
d    3
e    4
dtype: int64
obj.drop('c')
a    0
b    1
d    3
e    4
dtype: int64
obj.drop(['d', 'c'])
a    0
b    1
e    4
dtype: int64
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
Deleting rows
data.drop(['Colorado', 'Ohio'])
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
Deleting columns
data.drop('two', axis=1)
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
data.drop(['two', 'four'], axis=1)
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
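Because drop returns a new object, the original frame keeps all of its rows and columns unless the result is assigned back. The sketch below also uses the columns keyword, which recent pandas versions (0.21 and later) accept as an alternative to axis=1; the name trimmed is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Same effect as data.drop(['two', 'four'], axis=1).
trimmed = data.drop(columns=['two', 'four'])

print(list(trimmed.columns))
# The original frame is unchanged and still has all four columns.
print(list(data.columns))
```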