DataFrame

2022-05-05

Official DataFrame API reference

DataFrame

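A DataFrame is a tabular data structure. It contains an ordered collection of columns, and each column can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that all share the same index.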
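To make the "dict of Series sharing one index" picture concrete, here is a minimal sketch. It reuses the population figures that appear later in this post; the variable names `ohio` and `nevada` are only illustrative.

```python
import pandas as pd

# Two Series whose labels only partly overlap.
ohio = pd.Series({2000: 1.5, 2001: 1.7, 2002: 3.6})
nevada = pd.Series({2001: 2.4, 2002: 2.9})

# A dict of Series becomes a DataFrame: one column per key,
# and every column is aligned to a single shared row index.
frame = pd.DataFrame({"Ohio": ohio, "Nevada": nevada})
print(frame)

# Each column, pulled back out, carries the frame's index.
shares_index = frame["Nevada"].index.equals(frame.index)
print(shares_index)

# Labels missing from one Series show up as NaN in that column.
print(frame.loc[2000, "Nevada"])
```

Because the columns are aligned on the union of the labels, Nevada gets a missing value for the year 2000 instead of raising an error.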

Creating a DataFrame

    from pandas import DataFrame

    # A dict of equal-length lists: each key becomes a column
    data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
            'year': [2000, 2001, 2002, 2001, 2002],
            'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
    frame = DataFrame(data)
    frame

       pop   state  year
    0  1.5    Ohio  2000
    1  1.7    Ohio  2001
    2  3.6    Ohio  2002
    3  2.4  Nevada  2001
    4  2.9  Nevada  2002

Arranging the DataFrame columns in a specified order

    DataFrame(data, columns=['year', 'state', 'pop'])

       year   state  pop
    0  2000    Ohio  1.5
    1  2001    Ohio  1.7
    2  2002    Ohio  3.6
    3  2001  Nevada  2.4
    4  2002  Nevada  2.9

Setting custom index labels

    DataFrame(data, columns=['year', 'state', 'pop'],
              index=['one', 'two', 'three', 'four', 'five'])

           year   state  pop
    one    2000    Ohio  1.5
    two    2001    Ohio  1.7
    three  2002    Ohio  3.6
    four   2001  Nevada  2.4
    five   2002  Nevada  2.9

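Creating an empty column

If a column you pass in is not found in the data, it is filled with NaN values.

    DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
              index=['one', 'two', 'three', 'four', 'five'])

           year   state  pop debt
    one    2000    Ohio  1.5  NaN
    two    2001    Ohio  1.7  NaN
    three  2002    Ohio  3.6  NaN
    four   2001  Nevada  2.4  NaN
    five   2002  Nevada  2.9  NaN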
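As a quick check of that behavior, the sketch below rebuilds the same frame and verifies that the unknown `debt` column exists and contains only missing values. It uses only `isna()` and `all()`, and leaves out the custom index because it does not matter here.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}

# 'debt' is not a key of data, so pandas creates it as an all-missing column.
frame = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

debt_is_all_missing = frame["debt"].isna().all()
print(list(frame.columns))
print(debt_is_all_missing)
```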

Creating a boolean column from a comparison

    frame = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])
    # True where the state is Ohio, False otherwise
    frame['eastern'] = frame.state == 'Ohio'
    frame

           year   state  pop debt  eastern
    one    2000    Ohio  1.5  NaN     True
    two    2001    Ohio  1.7  NaN     True
    three  2002    Ohio  3.6  NaN     True
    four   2001  Nevada  2.4  NaN    False
    five   2002  Nevada  2.9  NaN    False

Creating a DataFrame from nested dicts (a dict of dicts)

The outer dict keys become the columns, and the inner keys become the row index.

    pop = {'Nevada': {2001: 2.4, 2002: 2.9},
           'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
    frame = DataFrame(pop)
    frame

          Nevada  Ohio
    2000     NaN   1.5
    2001     2.4   1.7
    2002     2.9   3.6

Passing an explicit index selects exactly those row labels:

    frame = DataFrame(pop, index=[2001, 2002, 2003])
    frame

          Nevada  Ohio
    2001     2.4   1.7
    2002     2.9   3.6
    2003     NaN   NaN

A dict of Series is handled the same way:

    frame = DataFrame(pop)
    pdata = {'Ohio': frame['Ohio'][:-1],
             'Nevada': frame['Nevada'][:2]}
    DataFrame(pdata)

          Nevada  Ohio
    2000     NaN   1.5
    2001     2.4   1.7

Naming the index

    pop = {'Nevada': {2001: 2.4, 2002: 2.9},
           'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
    frame = DataFrame(pop)
    frame.index.name = 'year'
    frame

          Nevada  Ohio
    year
    2000     NaN   1.5
    2001     2.4   1.7
    2002     2.9   3.6

Naming the columns

    frame.columns.name = 'state'
    frame

    state  Nevada  Ohio
    year
    2000      NaN   1.5
    2001      2.4   1.7
    2002      2.9   3.6

Transposing

    pop = {'Nevada': {2001: 2.4, 2002: 2.9},
           'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
    frame = DataFrame(pop)
    frame.T

            2000  2001  2002
    Nevada   NaN   2.4   2.9
    Ohio     1.5   1.7   3.6

The .values attribute returns the data in the DataFrame as a two-dimensional ndarray

    pop = {'Nevada': {2001: 2.4, 2002: 2.9},
           'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
    frame = DataFrame(pop)
    frame.values

    array([[nan, 1.5],
           [2.4, 1.7],
           [2.9, 3.6]])
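The array's dtype depends on the columns: when they do not share one type, pandas falls back to a dtype that can hold every value. The sketch below shows this with `to_numpy()`, the method the pandas documentation recommends over `.values` and which returns the same array here. The `label` column and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9},
       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = pd.DataFrame(pop)

# Both columns are floats, so the result is a plain float64 array.
arr = frame.to_numpy()
print(type(arr).__name__, arr.shape, arr.dtype)

# Hypothetical string column: the common dtype now has to be object.
mixed = frame.assign(label=["x", "y", "z"])
mixed_dtype = mixed.to_numpy().dtype
print(mixed_dtype)
```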

Deleting a column with del

    pop = {'Nevada': {2001: 2.4, 2002: 2.9},
           'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
    frame = DataFrame(pop)
    # del removes the column from frame in place
    del frame['Ohio']
    frame

          Nevada
    2000     NaN
    2001     2.4
    2002     2.9

Retrieving data

Getting a column

    frame = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])
    # Dict-style access
    frame['state']

    one        Ohio
    two        Ohio
    three      Ohio
    four     Nevada
    five     Nevada
    Name: state, dtype: object

    # Attribute-style access
    frame.year

    one      2000
    two      2001
    three    2002
    four     2001
    five     2002
    Name: year, dtype: int64

Getting all column names: .columns

    frame.columns

    Index(['year', 'state', 'pop', 'debt'], dtype='object')

Getting all index labels: .index

    frame.index

    Index(['one', 'two', 'three', 'four', 'five'], dtype='object')

Getting a row

The original example used frame.ix['three'], which triggered a DeprecationWarning: .ix is deprecated in favor of .loc for label-based indexing and .iloc for positional indexing (see http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated). The indexer was removed entirely in pandas 1.0, so the example below uses .loc.

    frame = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])
    # Select the row labeled 'three'
    frame.loc['three']

    year     2002
    state    Ohio
    pop       3.6
    debt      NaN
    Name: three, dtype: object
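The two replacements for `.ix` can be compared side by side. This is a minimal sketch using the same data, with the empty `debt` column left out for brevity: `.loc` selects by label, `.iloc` by integer position, and both accept lists to pick several rows and columns at once.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data, columns=["year", "state", "pop"],
                     index=["one", "two", "three", "four", "five"])

# Label-based: the row whose index label is 'three'.
by_label = frame.loc["three"]

# Position-based: the third row (positions start at 0).
by_position = frame.iloc[2]

same_row = by_label.equals(by_position)
print(by_label)
print(same_row)

# Lists of labels select a sub-table of rows and columns.
subset = frame.loc[["one", "three"], ["year", "pop"]]
print(subset)
```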

Assignment

    frame = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])
    # A scalar is broadcast to every row
    frame['debt'] = 16.5
    frame

           year   state  pop  debt
    one    2000    Ohio  1.5  16.5
    two    2001    Ohio  1.7  16.5
    three  2002    Ohio  3.6  16.5
    four   2001  Nevada  2.4  16.5
    five   2002  Nevada  2.9  16.5

    import numpy as np

    # An array must match the length of the DataFrame
    frame['debt'] = np.arange(5)
    frame

           year   state  pop  debt
    one    2000    Ohio  1.5     0
    two    2001    Ohio  1.7     1
    three  2002    Ohio  3.6     2
    four   2001  Nevada  2.4     3
    five   2002  Nevada  2.9     4

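Assigning a Series to a DataFrame

When the assigned value is a Series, its labels are matched exactly against the DataFrame's index, and any positions without a match are filled with missing values.

    from pandas import Series

    val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
    frame['debt'] = val
    frame

           year   state  pop  debt
    one    2000    Ohio  1.5   NaN
    two    2001    Ohio  1.7  -1.2
    three  2002    Ohio  3.6   NaN
    four   2001  Nevada  2.4  -1.5
    five   2002  Nevada  2.9  -1.7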
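The alignment also works in the other direction: a label in the Series that does not exist in the DataFrame's index is dropped without an error. The sketch below illustrates this. The label `'six'` and its value 9.9 are made up for the example.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data, index=["one", "two", "three", "four", "five"])

# 'six' is a hypothetical label that is not in frame.index.
val = pd.Series([-1.2, -1.5, 9.9], index=["two", "four", "six"])
frame["debt"] = val

# Matching labels are filled in, the rest become NaN, and 'six' is ignored.
print(frame["debt"])
missing_count = int(frame["debt"].isna().sum())
print(missing_count)
```

If dropping values this way would hide a bug in your data, compare `val.index` with `frame.index` before assigning.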

Index objects

Index objects are immutable, which is what allows them to be shared safely among multiple data structures.

    from pandas import Series

    obj = Series(range(3), index=['a', 'b', 'c'])
    obj

    a    0
    b    1
    c    2
    dtype: int64

    index = obj.index
    index

    Index(['a', 'b', 'c'], dtype='object')

    # Slicing an Index returns a new Index
    index[1:]

    Index(['b', 'c'], dtype='object')
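To see the immutability in practice, the following sketch tries to overwrite one label in place, which pandas rejects with a `TypeError`, and then relabels the Series by assigning a whole new index instead. The replacement labels `x`, `y`, `z` are arbitrary.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

# In-place modification of an Index is not allowed.
raised = False
try:
    index[1] = "d"
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# Relabeling means replacing the whole index on the Series.
obj.index = ["x", "y", "z"]

# The old Index object we kept a reference to is untouched.
print(list(index))
print(list(obj.index))
```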

pd.Index()

The most general Index object. It represents the axis labels as a NumPy array of Python objects.

    import numpy as np
    import pandas as pd

    pd.Index(np.arange(3))

    Int64Index([0, 1, 2], dtype='int64')

    index = pd.Index(np.arange(3))
    obj = Series([1.5, -2.5, 0], index=index)
    obj

    0    1.5
    1   -2.5
    2    0.0
    dtype: float64

    # The Series uses the Index object it was given
    obj.index is index

    True

Testing whether an index contains a label (returns a boolean)

    from pandas import DataFrame, Series

    pop = {'Nevada': {2001: 2.4, 2002: 2.9},
           'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
    frame = DataFrame(pop)
    frame.index.name = 'year'
    frame.columns.name = 'state'
    frame

    state  Nevada  Ohio
    year
    2000      NaN   1.5
    2001      2.4   1.7
    2002      2.9   3.6

    'Ohio' in frame.columns

    True

    2003 in frame.index

    False

Basic functionality

.reindex

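reindex creates a new object that conforms to a new index.

Parameters

index: the new sequence to use as the index. It can be an Index instance or any other sequence-like Python data structure. An Index is used exactly as given, without any copying.
method: interpolation (fill) method.
fill_value: the substitute value to use when reindexing introduces missing data.
limit: the maximum number of positions to fill when forward- or backward-filling.
level: match a simple Index against the specified level of a MultiIndex; otherwise select a subset.
copy: defaults to True, which always copies the data; if False, no copy is made when the new index is equal to the old one.

    obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
    obj

    d    4.5
    b    7.2
    a   -5.3
    c    3.6
    dtype: float64

    # Rearranges the data to follow the new index; 'e' has no value, so it becomes NaN
    obj = obj.reindex(['a', 'b', 'c', 'd', 'e'])
    obj

    a   -5.3
    b    7.2
    c    3.6
    d    4.5
    e    NaN
    dtype: float64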

The fill_value parameter

    obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
    obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

    a   -5.3
    b    7.2
    c    3.6
    d    4.5
    e    0.0
    dtype: float64

The method parameter

ffill or pad: fill (carry) values forward

bfill or backfill: fill (carry) values backward

    obj = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
    obj

    0      blue
    2    purple
    4    yellow
    dtype: object

    obj.reindex(range(6), method='ffill')

    0      blue
    1      blue
    2    purple
    3    purple
    4    yellow
    5    yellow
    dtype: object

With a DataFrame, reindex can change the row index, the columns, or both. Passing only a sequence reindexes the rows:

    frame = DataFrame(np.arange(9).reshape((3, 3)),
                      columns=['Ohio', 'Texas', 'California'],
                      index=['a', 'c', 'd'])
    frame

       Ohio  Texas  California
    a     0      1           2
    c     3      4           5
    d     6      7           8

    frame.reindex(['a', 'b', 'c', 'd'])

       Ohio  Texas  California
    a   0.0    1.0         2.0
    b   NaN    NaN         NaN
    c   3.0    4.0         5.0
    d   6.0    7.0         8.0

The columns keyword reindexes the columns:

    states = ['Texas', 'Utah', 'California']
    frame.reindex(columns=states)

       Texas  Utah  California
    a      1   NaN           2
    c      4   NaN           5
    d      7   NaN           8

Both axes can be reindexed in one call, here with forward filling along the rows:

    frame = DataFrame(np.arange(9).reshape((3, 3)),
                      columns=['Ohio', 'Texas', 'California'],
                      index=['a', 'c', 'd'])
    states = ['Texas', 'Utah', 'California']
    frame = frame.reindex(columns=states)
    frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill', columns=states)

       Texas  Utah  California
    a      1   NaN           2
    b      1   NaN           2
    c      4   NaN           5
    d      7   NaN           8
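The parameter list mentions `limit` and backward filling, but neither has an example, so here is a small sketch using the same color Series. It extends the target range to 8 so the effect of `limit=1` is visible. The variable names are only illustrative.

```python
import pandas as pd

obj = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Forward fill, but carry each value at most one position forward.
# Labels 6 and 7 are more than one step past the last value, so they stay missing.
limited = obj.reindex(range(8), method="ffill", limit=1)
print(limited)

# Backward fill: each gap takes the next available value.
# Label 5 has nothing after it, so it stays missing.
backward = obj.reindex(range(6), method="bfill")
print(backward)
```

Note that the `method` option requires the existing index to be sorted (monotonic), which is the case for `[0, 2, 4]`.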

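Reindexing rows and columns together (in place of .ix)

Because the .ix indexer is deprecated, this example relabels rows and columns with a single reindex call instead.

    frame = DataFrame(np.arange(9).reshape((3, 3)),
                      columns=['Ohio', 'Texas', 'California'],
                      index=['a', 'c', 'd'])
    states = ['Texas', 'Utah', 'California']
    frame = frame.reindex(columns=states)
    frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)

       Texas  Utah  California
    a    1.0   NaN         2.0
    b    NaN   NaN         NaN
    c    4.0   NaN         5.0
    d    7.0   NaN         8.0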

Dropping entries from an axis

The drop method returns a new object with the specified values removed from the given axis.

    obj = Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
    obj

    a    0
    b    1
    c    2
    d    3
    e    4
    dtype: int64

    obj.drop('c')

    a    0
    b    1
    d    3
    e    4
    dtype: int64

    obj.drop(['d', 'c'])

    a    0
    b    1
    e    4
    dtype: int64

    data = DataFrame(np.arange(16).reshape((4, 4)),
                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
                     columns=['one', 'two', 'three', 'four'])
    data

              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15

Dropping rows

    data.drop(['Colorado', 'Ohio'])

              one  two  three  four
    Utah        8    9     10    11
    New York   12   13     14    15

Dropping columns

    data.drop('two', axis=1)

              one  three  four
    Ohio        0      2     3
    Colorado    4      6     7
    Utah        8     10    11
    New York   12     14    15

    data.drop(['two', 'four'], axis=1)

              one  three
    Ohio        0      2
    Colorado    4      6
    Utah        8     10
    New York   12     14
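Since `drop` returns a new object, the original DataFrame is left unchanged unless you reassign the result. The sketch below checks this, and also uses the `columns=` keyword, which is an equivalent and arguably more readable spelling of `axis=1`.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Drop rows by label (the default axis is the row index).
without_rows = data.drop(["Colorado", "Ohio"])

# Drop columns by name; same result as data.drop(['two', 'four'], axis=1).
without_cols = data.drop(columns=["two", "four"])

# The original is untouched by both calls.
print(data.shape)
print(without_rows.shape, without_cols.shape)
```

By contrast, `del frame['col']`, shown earlier, modifies the DataFrame in place.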
