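Note: This post is original work; please credit the source when reposting.
An aside: after the Spring Festival I came back to campus with nothing much to do and felt rusty and unmotivated, a condition I'll call "post-Spring-Festival syndrome". I got detailed NBA data from the Kesci website, and I also wanted to systematically review what I have learned in half a year of studying data mining, so I decided to write a data analysis report on Kevin Durant's playing style (I'm a Durant fan). Learning through play, you might say. Enough talk; let's get to work.
To describe Durant's characteristics we need comparisons; otherwise how would we know what sets him apart? I chose the comparison players by three criteria. First, position: small forwards and guards, since Durant is a small forward who sometimes plays guard. Second, era: roughly the same generation, give or take a few years (Kobe, for example). Third, status: players who can be called superstars. The final list is Kobe Bryant, LeBron James, Stephen Curry, Russell Westbrook, Paul George, Carmelo Anthony, James Harden, Chris Paul, and Kawhi Leonard. I left out rising stars and earlier generations: statistics mean different things in different eras, and rising stars have too little data for a meaningful comparison. The selection is admittedly imperfect and subjective (haha...).
Data source: https://www.kesci.com/apps/home/dataset/599a6e66c8d2787da4d1e21d/document
The playoffs are the biggest stage for superstars. They have given us so many classic moments, and the moments we still talk about are the ones that earned them their honors. So I will start the analysis with the playoffs (because I feel like it).
3.1 First, let's see what the playoff data contains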
>>> import pandas as pd
>>> data_player_playoff = pd.read_csv(r'E:\Python\Program\NBA_Data\data\player_playoff.csv')
>>> data_player_playoff.head()
                 球员     赛季   球队 结果             比分  时间     投篮  命中  出手   三分  ...  \
0  Kelenna Azubuike  11-12  DAL  L    OKC95-79DAL   5  0.333   1   3  1.0  ...
1  Kelenna Azubuike  06-07  GSW  L  UTA115-101GSW   1    NaN   0   0  NaN  ...
2  Kelenna Azubuike  06-07  GSW  W  UTA105-125GSW   3  0.000   0   1  NaN  ...
3  Kelenna Azubuike  06-07  GSW  W   DAL86-111GSW   2  1.000   1   1  NaN  ...
4  Kelenna Azubuike  06-07  GSW  L  DAL118-112GSW   0    NaN   0   0  NaN  ...

   罚球出手  篮板  前场  后场  助攻  抢断  盖帽  失误  犯规  得分
0     0   1   1   0   0   1   0   1   0   3
1     0   0   0   0   0   0   0   0   0   0
2     0   0   0   0   0   0   0   0   1   0
3     0   0   0   0   0   0   0   0   0   2
4     0   0   0   0   0   0   0   0   0   0

[5 rows x 24 columns]
DataFrame.head(n) prints the first n rows of the data (5 by default); DataFrame.tail(n) prints the last n rows.
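As a small illustration of head and tail, here is a sketch on a toy frame. The frame and its values are made up for the example and are not taken from the playoff table.

```python
import pandas as pd

# Toy frame standing in for the playoff table (values are made up).
toy = pd.DataFrame({"player": list("abcdefg"),
                    "points": [3, 0, 0, 2, 0, 11, 7]})

first_five = toy.head()    # no argument: first 5 rows
first_two = toy.head(2)    # first 2 rows
last_three = toy.tail(3)   # last 3 rows

print(first_two)
print(last_three)
```

Both methods return a new DataFrame, so the result can be stored or chained like any other frame.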
3.2 Basic information about the data
>>> data_player_playoff.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49743 entries, 0 to 49742
Data columns (total 24 columns):
球员      49615 non-null object
赛季      49743 non-null object
球队      49743 non-null object
结果      49743 non-null object
比分      49743 non-null object
时间      49743 non-null int64
投篮      45767 non-null float64
命中      49743 non-null int64
出手      49743 non-null int64
三分      24748 non-null float64
三分命中    49743 non-null int64
三分出手    49743 non-null int64
罚球      29751 non-null float64
罚球命中    49743 non-null int64
罚球出手    49743 non-null int64
篮板      49743 non-null int64
前场      49743 non-null int64
后场      49743 non-null int64
助攻      49743 non-null int64
抢断      49743 non-null int64
盖帽      49743 non-null int64
失误      49743 non-null int64
犯规      49743 non-null int64
得分      49743 non-null int64
dtypes: float64(3), int64(16), object(5)
memory usage: 9.1+ MB
3.3 The Chinese column names get in the way of later processing, so rename the columns
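The renaming in this step assigns all 24 names by position, so one misplaced name would silently shift every label after it. As a hedged alternative, the sketch below maps old names to new ones with DataFrame.rename. The toy frame and the partial name_map are illustrative assumptions, not the author's code.

```python
import pandas as pd

# Toy frame with a few of the original Chinese column names (values made up).
toy = pd.DataFrame({"球员": ["Kevin Durant"], "赛季": ["11-12"], "得分": [30]})

# Hypothetical partial mapping; the full table would need all 24 entries.
name_map = {"球员": "player", "赛季": "season", "得分": "player_score"}

# rename() matches by name, so column order does not matter.
renamed = toy.rename(columns=name_map)
print(list(renamed.columns))
```

Columns missing from the mapping keep their old names, which makes an incomplete mapping easy to spot.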
>>> data_player_playoff.columns = ['player', 'season', 'team', 'result', 'team_score', 'time',
...                                'shoot', 'hit', 'shot',
...                                'three_pts', 'three_pts_hit', 'three_pts_shot',
...                                'free_throw', 'free_throw_hit', 'free_throw_shot',
...                                'backboard', 'front_court', 'back_court',
...                                'assists', 'steals', 'block_shot', 'errors', 'foul', 'player_score']
3.4 Select the rows for Durant, Kobe, LeBron, Curry, Westbrook, George, Anthony, Harden, Paul, and Leonard
>>> kd_data_off = data_player_playoff[data_player_playoff.player == 'Kevin Durant']
>>> jh_data_off = data_player_playoff[data_player_playoff.player == 'James Harden']
>>> kb_data_off = data_player_playoff[data_player_playoff.player == 'Kobe Bryant']
>>> lj_data_off = data_player_playoff[data_player_playoff.player == 'LeBron James']
>>> kl_data_off = data_player_playoff[data_player_playoff.player == 'Kawhi Leonard']
>>> sc_data_off = data_player_playoff[data_player_playoff.player == 'Stephen Curry']
>>> rw_data_off = data_player_playoff[data_player_playoff.player == 'Russell Westbrook']
>>> pg_data_off = data_player_playoff[data_player_playoff.player == 'Paul George']
>>> ca_data_off = data_player_playoff[data_player_playoff.player == 'Carmelo Anthony']
>>> cp_data_off = data_player_playoff[data_player_playoff.player == 'Chris Paul']
>>> super_data_off = pd.concat([kd_data_off, kb_data_off, jh_data_off, lj_data_off, sc_data_off,
...                             kl_data_off, cp_data_off, rw_data_off, pg_data_off, ca_data_off])
>>> super_data_off.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1087 entries, 9721 to 904
Data columns (total 24 columns):
player             1087 non-null object
season             1087 non-null object
team               1087 non-null object
result             1087 non-null object
team_score         1087 non-null object
time               1087 non-null int64
shoot              1085 non-null float64
hit                1087 non-null int64
shot               1087 non-null int64
three_pts          1059 non-null float64
three_pts_hit      1087 non-null int64
three_pts_shot     1087 non-null int64
free_throw         1015 non-null float64
free_throw_hit     1087 non-null int64
free_throw_shot    1087 non-null int64
backboard          1087 non-null int64
front_court        1087 non-null int64
back_court         1087 non-null int64
assists            1087 non-null int64
steals             1087 non-null int64
block_shot         1087 non-null int64
errors             1087 non-null int64
foul               1087 non-null int64
player_score       1087 non-null int64
dtypes: float64(3), int64(16), object(5)
memory usage: 212.3+ KB
The game counts here show how dominant LeBron has been with his yearly trips to the Finals: he has played only three fewer playoff games than Kobe and will pass him this year, and he still has several Finals runs left in him. Durant trails LeBron by a wide margin in games played and will probably end up close to Kobe's total.
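The ten per-player filters in this step can be collapsed into one mask, and the games-played comparison discussed here is simply a row count per player. A minimal sketch on made-up rows follows; the toy frame and the shortened stars list are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for data_player_playoff (rows are made up).
toy = pd.DataFrame({
    "player": ["Kevin Durant", "Kobe Bryant", "Kelenna Azubuike",
               "Kevin Durant", "LeBron James", "Kobe Bryant", "Kevin Durant"],
    "player_score": [30, 25, 3, 28, 32, 40, 35],
})

stars = ["Kevin Durant", "Kobe Bryant", "LeBron James"]

# One boolean mask replaces the per-player filters and the concat.
super_toy = toy[toy["player"].isin(stars)]

# Each row is one game, so counting rows per player gives games played.
games_played = super_toy["player"].value_counts()
print(games_played)
```

Unlike the concat of separate frames, the isin version keeps the rows in their original order, which does not matter for the grouped statistics used later.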
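This shows that Durant is an elite scorer, and there are already hints of how rock-steady he is.
Here comes the scoring bar chart; hold on tight.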
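The summary this remark refers to is not reproduced in the post, but the script further down computes one with describe(). As a rough sketch of what such a summary contains, here it is on made-up scores (the numbers are not from the dataset):

```python
import pandas as pd

# Made-up scores for one player; the real values come from the dataset.
scores = pd.Series([20, 25, 30, 35, 40], name="player_score")

# count, mean, std, min, quartiles and max in one call
summary = scores.describe()
print(summary)
```

The count entry is the number of games, and std is the figure used later in the post to judge consistency.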
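# coding: utf-8
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd

# Font setup for Chinese text (only matters when labels contain Chinese characters)
mpl.rcParams['font.sans-serif'] = ['FangSong']  # set the default font
mpl.rcParams['axes.unicode_minus'] = False      # keep the minus sign '-' from rendering as a box in saved figures

super_data_off = pd.read_csv('super_star_playoff.csv')
kd_off_score = super_data_off[super_data_off.player == 'Kevin Durant'].player_score.describe()
# Select the column before aggregating so that text columns are never averaged
super_off_mean_score = super_data_off.groupby('player')['player_score'].mean()
labels = ['Games', 'Mean', 'Std', 'Min', '25%', '50%', '75%', 'Max']
print(super_off_mean_score.index)
# Same order as the alphabetically sorted index printed above
super_name = ['Anthony', 'Paul', 'Harden', 'Leonard', 'Durant', 'Bryant', 'James', 'George', 'Westbrook', 'Curry']

# Plot
plt.bar(range(len(super_off_mean_score)), super_off_mean_score, align='center')
plt.ylabel('Points')
plt.title('Playoff scoring averages of the superstars')
# plt.xticks(range(len(labels)), labels)
plt.xticks(range(len(super_off_mean_score)), super_name)
plt.ylim(15, 35)
for x, y in enumerate(super_off_mean_score):
    plt.text(x, y + 1, '%s' % round(y, 2), ha='center')
plt.show()
In terms of scoring, Durant and LeBron form the top tier; Anthony, Kobe, Westbrook, and Curry form the second; and Paul, Harden, Leonard, and George form the third. Harden's average should rise noticeably this year, since his early playoff games came as a sixth man. Durant's four scoring titles were no fluke: as a scorer he truly is one of the league's megastars.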
Next, let's look at how each star's playoff scoring average has moved from season to season.
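One detail in the code for this step: season labels such as '96-97' are strings, so grouping by season sorts '00-01' ahead of '96-97'. That is why Kobe's last four seasons are moved to the front by hand. A sketch of a sort key that would do this for any player is shown below; the helper name season_start_year and the pivot value of 50 are assumptions made for the example.

```python
def season_start_year(label, pivot=50):
    """Map a label such as '96-97' to a four-digit start year.

    Assumption: two-digit years at or above `pivot` belong to the 1900s,
    the rest to the 2000s.
    """
    start = int(label.split("-")[0])
    return 1900 + start if start >= pivot else 2000 + start


seasons = ["00-01", "09-10", "96-97", "99-00"]   # plain string order
ordered = sorted(seasons, key=season_start_year)
print(ordered)
```

With a pandas Series indexed by season, the same key could be applied as series.reindex(sorted(series.index, key=season_start_year)).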
season_kd_score = super_data_off[super_data_off.player == 'Kevin Durant'].groupby('season')['player_score'].mean()
plt.figure()
plt.subplot(321)
plt.title('Durant: playoff scoring average by season', color='red')
# plt.xlabel('Season')
plt.ylabel('Points')
plt.plot(season_kd_score, 'k', season_kd_score, 'bo')
for x, y in enumerate(season_kd_score):
    plt.text(x, y + 0.2, '%s' % round(y, 2), ha='center')

season_lj_score = super_data_off[super_data_off.player == 'LeBron James'].groupby('season')['player_score'].mean()
plt.subplot(322)
plt.title('James: playoff scoring average by season', color='red')
# plt.xlabel('Season')
plt.ylabel('Points')
plt.plot(season_lj_score, 'k', season_lj_score, 'bo')
for x, y in enumerate(season_lj_score):
    plt.text(x, y + 0.2, '%s' % round(y, 2), ha='center')

season_kb_score = super_data_off[super_data_off.player == 'Kobe Bryant'].groupby('season')['player_score'].mean()
# Season labels sort as strings, so the 1990s seasons end up last; move them to the front
a = season_kb_score.iloc[:-4]
b = season_kb_score.iloc[-4:]
season_kb_score = pd.concat([b, a])
plt.subplot(323)
plt.title('Bryant: playoff scoring average by season', color='red')
# plt.xlabel('Season')
plt.ylabel('Points')
plt.xticks(range(len(season_kb_score)), season_kb_score.index)
plt.plot(list(season_kb_score), 'k', list(season_kb_score), 'bo')
for x, y in enumerate(season_kb_score):
    plt.text(x, y + 0.2, '%s' % round(y, 2), ha='center')

season_rw_score = super_data_off[super_data_off.player == 'Russell Westbrook'].groupby('season')['player_score'].mean()
plt.subplot(324)
plt.title('Westbrook: playoff scoring average by season', color='red')
# plt.xlabel('Season')
plt.ylabel('Points')
plt.plot(season_rw_score, 'k', season_rw_score, 'bo')
for x, y in enumerate(season_rw_score):
    plt.text(x, y + 0.2, '%s' % round(y, 2), ha='center')

season_sc_score = super_data_off[super_data_off.player == 'Stephen Curry'].groupby('season')['player_score'].mean()
plt.subplot(325)
plt.title('Curry: playoff scoring average by season', color='red')
# plt.xlabel('Season')
plt.ylabel('Points')
plt.plot(season_sc_score, 'k', season_sc_score, 'bo')
for x, y in enumerate(season_sc_score):
    plt.text(x, y + 0.2, '%s' % round(y, 2), ha='center')

season_ca_score = super_data_off[super_data_off.player == 'Carmelo Anthony'].groupby('season')['player_score'].mean()
plt.subplot(326)
plt.title('Anthony: playoff scoring average by season', color='red')
# plt.xlabel('Season')
plt.ylabel('Points')
plt.plot(season_ca_score, 'k', season_ca_score, 'bo')
for x, y in enumerate(season_ca_score):
    plt.text(x, y + 0.2, '%s' % round(y, 2), ha='center')
plt.show()
Next, pie charts show how their game scores are distributed.
super_name_E = ['Kevin Durant', 'LeBron James', 'Kobe Bryant', 'Russell Westbrook', 'Stephen Curry', 'Carmelo Anthony']
super_name_C = ['Durant', 'James', 'Bryant', 'Westbrook', 'Curry', 'Anthony']
plt.figure(facecolor='bisque')
colors = ['red', 'yellow', 'peru', 'springgreen']
for i in range(len(super_name_E)):
    player_labels = ['Under 20', '20-29', '30-39', '40+']
    explode = [0, 0.1, 0, 0]  # pull out the 20-29 slice
    scores = super_data_off[super_data_off.player == super_name_E[i]]['player_score']
    n_games = float(len(scores))
    # Share of games in each scoring range (boolean masks instead of merging filtered frames)
    player_score_range = [
        (scores < 20).sum() / n_games,
        ((scores >= 20) & (scores < 30)).sum() / n_games,
        ((scores >= 30) & (scores < 40)).sum() / n_games,
        (scores >= 40).sum() / n_games,
    ]
    plt.subplot(231 + i)
    plt.title(super_name_C[i] + ': scoring distribution', color='blue')
    plt.pie(player_score_range, labels=player_labels, colors=colors, labeldistance=1.1,
            autopct='%.01f%%', shadow=False, startangle=90, pctdistance=0.8, explode=explode)
    plt.axis('equal')
plt.show()
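The four scoring ranges can also be produced in one step with pandas.cut, which assigns each game to a labelled bin. The sketch below uses made-up scores; the bin edges are chosen to match the ranges in the pie charts, and the variable names are only illustrative.

```python
import pandas as pd

# Made-up game scores for one player.
scores = pd.Series([12, 19, 20, 27, 29, 30, 38, 41])

# Bins are open on the left and closed on the right: (-1, 19], (19, 29], ...
bins = [-1, 19, 29, 39, float("inf")]
labels = ["<20", "20-29", "30-39", "40+"]

buckets = pd.cut(scores, bins=bins, labels=labels)

# Fraction of games per bucket; sort=False skips sorting by frequency.
shares = buckets.value_counts(normalize=True, sort=False)
print(shares)
```

Because scores are whole numbers, an upper edge of 19 is equivalent to the "under 20" condition used in the plotting code.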
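The pie charts show that Durant and LeBron are far ahead of the rest in scoring consistency: about 80% of their games fall between 20 and 40 points. They not only score a lot but also do so very reliably. LeBron has the highest share of 40+ games, followed by Curry and Durant. This also suggests, indirectly, that Durant is the steadiest scorer of the group. Truly rock-steady! To look at consistency in numbers, here is a bar chart of the standard deviation of each player's scoring: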
std = super_data_off.groupby('player')['player_score'].std()
# Durant is at index 4 of the alphabetically sorted players; highlight him in blue
color = ['red', 'red', 'red', 'red', 'blue', 'red', 'red', 'red', 'red', 'red']
print(std)
plt.barh(range(10), std, align='center', color=color, alpha=0.8)
plt.xlabel('Standard deviation', color='blue')
plt.ylabel('Player', color='blue')
plt.yticks(range(len(super_name)), super_name)
plt.xlim(6, 11)
for x, y in enumerate(std):
    plt.text(y + 0.1, x, '%s' % round(y, 2), va='center')
plt.show()
The standard-deviation bar chart clearly shows that Durant is extremely consistent (the smaller the standard deviation, the more stable the data).
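A side note on the numbers behind this chart: the pandas std() method computes the sample standard deviation (ddof=1) by default, whereas NumPy's np.std defaults to the population version (ddof=0). The sketch below shows the difference on made-up scores; it changes nothing about the ranking, only the exact values.

```python
import numpy as np
import pandas as pd

scores = pd.Series([20, 25, 30, 35, 40])     # made-up game scores

sample_std = scores.std()                    # pandas default: ddof=1
population_std = np.std(scores.to_numpy())   # NumPy default: ddof=0

print(round(sample_std, 3), round(population_std, 3))
```

With hundreds of games per player the two versions are nearly identical, so either would support the same comparison.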
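When evaluating a player, where he shoots from and how often he makes his shots are very important indicators. Players can be classed as sharpshooters, three-point shooters, mid-range specialists, and slashers who attack the paint; there are also masters of drawing fouls, such as Harden.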
import numpy as np

super_name_E = ['Carmelo Anthony', 'Chris Paul', 'James Harden', 'Kawhi Leonard', 'Kevin Durant',
                'Kobe Bryant', 'LeBron James', 'Paul George', 'Russell Westbrook', 'Stephen Curry']
bar_width = 0.25
# Per-player mean of the per-game percentages
shoot = super_data_off.groupby('player')['shoot'].mean()
three_pts = super_data_off.groupby('player')['three_pts'].mean()
free_throw = super_data_off.groupby('player')['free_throw'].mean()
plt.bar(np.arange(10), shoot, align='center', label='Field-goal %', color='red', width=bar_width)
plt.bar(np.arange(10) + bar_width, three_pts, align='center', color='blue', label='Three-point %', width=bar_width)
plt.bar(np.arange(10) + 2 * bar_width, free_throw, align='center', color='green', label='Free-throw %', width=bar_width)
for x, y in enumerate(shoot):
    plt.text(x, y + 0.01, '%s' % round(y, 2), ha='center')
for x, y in enumerate(three_pts):
    plt.text(x + bar_width, y + 0.01, '%s' % round(y, 2), ha='center')
for x, y in enumerate(free_throw):
    plt.text(x + 2 * bar_width, y + 0.01, '%s' % round(y, 2), ha='center')
plt.legend()
plt.ylim(0.3, 1.0)
plt.title('Shooting percentages by player')
plt.xlabel('Player')
plt.xticks(np.arange(10) + bar_width, super_name)
plt.ylabel('Percentage')
plt.show()
The leaders in field-goal, three-point, and free-throw percentage are Leonard, Curry, and Curry respectively, which shows how formidable Curry's three-point shooting is. Durant ranks third in all three, which reflects how well-rounded his scoring is.
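The percentages in this chart are averages of per-game percentages, so a 1-for-1 night counts as much as a 10-for-20 night, and games with no attempts (NaN in the source table) are skipped. An alternative, shown here only as a sketch on made-up rows, is to divide total makes by total attempts. The toy player "A" and its numbers are assumptions.

```python
import pandas as pd

# Made-up games for one player: makes (hit) and attempts (shot).
games = pd.DataFrame({
    "player": ["A", "A", "A"],
    "hit":    [1, 10, 0],
    "shot":   [1, 20, 0],
})
# Per-game percentage; 0 attempts gives NaN, as in the source table.
games["shoot"] = games["hit"] / games["shot"]

by_player = games.groupby("player")
mean_of_game_pct = by_player["shoot"].mean()                      # NaN games are skipped
aggregate_pct = by_player["hit"].sum() / by_player["shot"].sum()  # total makes / total attempts

print(mean_of_game_pct["A"], aggregate_pct["A"])
```

The two figures can differ noticeably for low-volume shooters, so it is worth stating which one a chart reports.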
import numpy as np

super_name_E = ['Carmelo Anthony', 'Chris Paul', 'James Harden', 'Kawhi Leonard', 'Kevin Durant',
                'Kobe Bryant', 'LeBron James', 'Paul George', 'Russell Westbrook', 'Stephen Curry']
bar_width = 0.25
three_pts = super_data_off.groupby('player')['three_pts_hit'].sum()
free_throw_pts = super_data_off.groupby('player')['free_throw_hit'].sum()
sum_pts = super_data_off.groupby('player')['player_score'].sum()
# Share of total points from threes (3 points each) and free throws (1 point each);
# whatever remains comes from two-pointers
three_pts_rate = np.array(list(three_pts)) * 3.0 / np.array(list(sum_pts))
free_throw_pts_rate = np.array(list(free_throw_pts)) * 1.0 / np.array(list(sum_pts))
two_pts_rate = 1.0 - three_pts_rate - free_throw_pts_rate
print(two_pts_rate)
plt.bar(np.arange(10), two_pts_rate, align='center', label='Share from two-pointers', color='red', width=bar_width)
plt.bar(np.arange(10) + bar_width, three_pts_rate, align='center', color='blue', label='Share from three-pointers', width=bar_width)
plt.bar(np.arange(10) + 2 * bar_width, free_throw_pts_rate, align='center', color='green', label='Share from free throws', width=bar_width)
for x, y in enumerate(two_pts_rate):
    plt.text(x, y + 0.01, '%s' % round(y, 2), ha='center')
for x, y in enumerate(three_pts_rate):
    plt.text(x + bar_width, y + 0.01, '%s' % round(y, 2), ha='center')
for x, y in enumerate(free_throw_pts_rate):
    plt.text(x + 2 * bar_width, y + 0.01, '%s' % round(y, 2), ha='center')
plt.legend()
plt.title('How each player scores')
plt.xlabel('Player')
plt.xticks(np.arange(10) + bar_width, super_name)
plt.ylabel('Share of points')
plt.show()
As the chart shows, the highest shares of points from two-pointers, three-pointers, and free throws belong to Anthony and Kobe, Curry, and Harden respectively. This matches intuition: Anthony's signature mid-range jumper, Kobe's fadeaway, Curry's outrageous threes, and Harden's mastery at the free-throw line (he is not called the king of drawing fouls for nothing). LeBron's two-point share is also high, which has everything to do with his physical gifts. Durant's numbers in all three categories are middle of the road, and he keeps his mid-range game, which again shows how varied and well-rounded his offense is.
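The arithmetic behind these shares is simple enough to check by hand: each made three is worth 3 points, each made free throw 1 point, and the remainder of the total must have come from two-pointers. A minimal sketch with made-up totals follows; the function name scoring_shares is hypothetical.

```python
def scoring_shares(points, threes_made, free_throws_made):
    """Split total points into two-point, three-point and free-throw shares."""
    three_share = 3.0 * threes_made / points
    ft_share = 1.0 * free_throws_made / points
    two_share = 1.0 - three_share - ft_share
    return two_share, three_share, ft_share


# Made-up totals: 40 points with 4 threes and 8 free throws.
two, three, ft = scoring_shares(40, 4, 8)
print(round(two, 2), round(three, 2), round(ft, 2))
```

In this example 12 points come from threes and 8 from free throws, leaving 20 of the 40 points for two-pointers.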
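A star's ability shows not only on offense; defense is an important indicator too. Greats such as Jordan, Kobe, and LeBron were regulars on the All-Defensive Team, so here are the players' numbers on both ends of the floor.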
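import numpy as np
import seaborn as sns

player_advance = pd.read_csv('super_star_advance_data.csv')
player_labels = ['Rebound %', 'Assist %', 'Steal %', 'Block %', 'Turnover %', 'Usage %', 'Win shares', 'PER']
player_data = player_advance[['player', 'total_rebound_rate', 'assist_rate', 'steals_rate', 'cap_rate',
                              'error_rate', 'usage_rate', 'ws', 'per']].groupby('player').mean()
# Multiply the six rate columns by 100; win shares and PER are left as they are
num = [100, 100, 100, 100, 100, 100, 1, 1]
np_num = np.array(player_data) * np.array(num)
plt.title('Heat map of player ability on offense and defense')
sns.heatmap(np_num, annot=True, xticklabels=player_labels, yticklabels=super_name, cmap='YlGnBu')
plt.show()
On rebounding the small forwards are all about the same, at around 11, while the best rebounder among the guards is Westbrook, who after all averaged a triple-double last season (only the second player in history to do so). The assist rate is highest for Paul, of course, followed by Westbrook and LeBron; Durant's assist rate is fairly ordinary, though decent for a small forward. In steal rate Paul and Leonard have a clear edge, which shows the effect of Leonard's smothering defense. Durant has the highest block rate, where his physical advantages show clearly; this season his shot-blocking has gone up another level and ranks in the league's top five (center Durant, haha).
Turnover rates are higher for guards than for forwards, with Westbrook the highest. Usage rate is also highest for Westbrook, followed by LeBron, so both clearly dominate the ball, while Leonard is at only 22 (Coach Popovich's team-basketball system really keeps things under control). Win shares are highest for LeBron, since his teams are built around him and he is now carrying his team single-handedly; Paul is second, as the brain of his team; Durant is third, which fits his reputation as a killer scorer. The top three in PER are the same as in win shares. LeBron really is that good; you have to hand it to him.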
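The scaling step before the heat map relies on NumPy broadcasting: a table of shape (players, stats) multiplied by a vector of length stats scales each column by its own factor. The sketch below uses made-up rows and assumes, as the factor of 100 suggests, that the rate columns are stored as fractions.

```python
import numpy as np

# Made-up rows: [rebound rate, assist rate, win shares]
stats = np.array([[0.11, 0.20, 1.5],
                  [0.08, 0.45, 1.2]])

# Scale the two rate columns to percentages; leave win shares unchanged.
scale = np.array([100, 100, 1])

scaled = stats * scale   # shapes (2, 3) and (3,) broadcast row by row
print(scaled)
```

Even after scaling, columns on different ranges share one color scale in a heat map, so the colors are best compared within a column rather than across columns.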
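The data support the following conclusions. On offense Durant is the strongest of the group: he scores a lot, he is very consistent, he scores in many ways, and he scores efficiently. On defense he is good at blocking shots, but as a playmaker who ties the team together he is still clearly behind LeBron. His defense has been getting better over the past two years, and I hope he makes the All-Defensive Team this season. These results differ little from my everyday impression of Durant, so the data can be said to confirm the subjective view. That wraps up the playoff analysis; it also served as a systematic review of pandas, numpy, seaborn, and matplotlib, and as a warm-up for the new semester. I won't analyze the regular-season data for now; maybe I will when the mood strikes.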
Reposted from: https://www.cnblogs.com/pursued-deer/p/8489357.html
