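Visualizing a footballer's shot data with Python + matplotlib (drawing a scatter plot)

A shot-data visualization is essentially a scatter plot, except that the size of each point varies with the expected-goals value (the predicted probability of scoring), which makes the chart easier to read at a glance.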

1. The league statistics site https://understat.com

The shot data comes from https://understat.com. Open the home page and search for "Mbappe" (see Figure 1).

Figure 1. Searching from the https://understat.com home page

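Open the Kylian Mbappé page. His player_id is 3423, so the page URL is https://understat.com/player/3423. The site provides league data from the 2014/2015 season to the present. The page to scrape is https://understat.com/player/{player_id}, where player_id is 2371 for Cristiano Ronaldo, 2097 for Messi, 2099 for Neymar, and 3423 for Mbappé. Each shot record includes the shot position (X, Y), expected goals (xG, the probability of scoring), the shot result (result), the shot type (shotType), and the season (season).

The shot result (result) is one of: blocked (by an outfield player), goal, off target, saved (by the goalkeeper), or hit the post.

The shot type (shotType) is one of: header, left foot, right foot, or another part of the body.

In the data, result takes one of five values: 1) Goal; 2) ShotOnPost (hit the post); 3) SavedShot (saved by the goalkeeper); 4) BlockedShot (blocked); 5) MissedShots (off target).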

Mbappé's data starts with the 2015/2016 season; the current season is 2022/2023 (see Figure 2).

Figure 2. The Kylian Mbappé page

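2. Analyzing the page

Right-click and view the page source. Several very long string variables sit inside <script>...</script> tags.

Counting in order, the fourth <script> holds the shot data (see Figure 3).

Figure 3. Page source (excerpt)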

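The part to extract is the content between the quotes in the following structure:

<script>
    var shotsData = JSON.parse('...')
</script>

That content is JSON data. Note that JSON is text: it looks much like a dict, but it is not a Python dict. To Python it is just a string, which the json module can convert.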

json.loads() ==> converts a JSON string into a dict or a list of dicts

json.dumps() ==> converts a dict or a list of dicts into a JSON string
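As a minimal sketch of that round trip, the snippet below converts a small JSON string to a dict and back. The record itself is made up for illustration (only the player_id 3423 comes from the text above).

```python
import json

# Illustrative JSON text; the field values are invented for this example.
json_text = '{"player": "Kylian Mbappe", "player_id": "3423", "goal": true, "assist": null}'

record = json.loads(json_text)            # JSON text -> Python dict
print(type(record).__name__)              # the parsed value is a dict
print(record["goal"], record["assist"])   # JSON true/null become Python True/None

back = json.dumps(record)                 # Python dict -> JSON text
print(type(back).__name__)                # the serialized value is a str
print(back)
```

Note that json.dumps() writes True and None back as true and null, so the round trip preserves the data even though the two notations differ.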

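JSON has two structural forms: objects and arrays.

An object starts with "{" and ends with "}". In between are key/value pairs separated by ",", written as follows:

{
     key1: value1,
     key2: value2,
         ...
}

Here each key must be a string in double quotes, while a value can be any JSON data: a string, number, boolean, object, array, or null.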

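An array starts with "[" and ends with "]". In between are elements (here, objects) separated by ",", written as follows:

[
  {
     key1: value1,
     key2: value2
  },
  {
     key3: value3,
     key4: value4
  }
]

In Python this can be represented as a list whose elements are dicts (two-dimensional data).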
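A short sketch of the array form: the JSON text below holds two objects (a few fields taken from the Cristiano Ronaldo sample used in this article) and parses into a list of dicts. The second half shows, as an aside, that Python-style single quotes are not accepted by the JSON parser.

```python
import json

# A JSON array of two objects; note the double quotes around keys and strings.
array_text = '[{"X": "0.845", "result": "SavedShot"}, {"X": "0.885", "result": "SavedShot"}]'

shots = json.loads(array_text)            # -> list of dicts
for shot in shots:
    print(shot["X"], shot["result"])

# A Python dict literal with single quotes looks similar but is not valid JSON.
rejected = False
try:
    json.loads("{'X': '0.845'}")
except json.JSONDecodeError as exc:
    rejected = True
    print("not valid JSON:", exc.msg)
```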

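3. Extracting and decoding the data

The page scraped here uses the JSON array form; after conversion to Python it is a list whose elements are dicts.

Below are two short excerpts from the head and tail of the variable (Cristiano Ronaldo's data) for a first look. The data is a string in which punctuation and spaces are written as Python-style single-byte hexadecimal escapes (\xNN, ASCII codes in the range 32 to 127), mixed with plain text. It first has to be turned into a Python bytes object, then decoded into a JSON string, and finally converted with json.loads() into a list of dicts.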

>>> a = r'\x5B\x7B\x22id\x22\x3A\x2232535\x22,\x22minute\x22\x3A\x2218\x22,\x22result\x22\x3A\x22SavedShot\x22,\x22X\x22\x3A\x220.845\x22,\x22Y\x22\x3A\x220.49900001525878906\x22,\x22xG\x22\x3A\x220.06659495085477829\x22,\x22player\x22\x3A\x22Cristiano\x20Ronaldo\x22,\x22h_a\x22\x3A\x22h\x22,\x22player_id\x22\x3A\x222371\x22,\x22situation\x22\x3A\x22SetPiece\x22,\x22season\x22\x3A\x222014\x22,\x22shotType\x22\x3A\x22RightFoot\x22,\x22match_id\x22\x3A\x225834\x22,\x22h_team\x22\x3A\x22Real\x20Madrid\x22,\x22a_team\x22\x3A\x22Cordoba\x22,\x22h_goals\x22\x3A\x222\x22,\x22a_goals\x22\x3A\x220\x22,\x22date\x22\x3A\x222014\x2D08\x2D25\x2019\x3A00\x3A00\x22,\x22player_assisted\x22\x3A\x22Luka\x20Modric\x22,\x22lastAction\x22\x3A\x22Pass\x22\x7D,\x7B\x22id\x22\x3A\x22422004\x22,\x22minute\x22\x3A\x2223\x22,\x22result\x22\x3A\x22SavedShot\x22,\x22X\x22\x3A\x220.885\x22,\x22Y\x22\x3A\x220.5\x22,\x22xG\x22\x3A\x220.7612988352775574\x22,\x22player\x22\x3A\x22Cristiano\x20Ronaldo\x22,\x22h_a\x22\x3A\x22h\x22,\x22player_id\x22\x3A\x222371\x22,\x22situation\x22\x3A\x22Penalty\x22,\x22season\x22\x3A\x222020\x22,\x22shotType\x22\x3A\x22RightFoot\x22,\x22match_id\x22\x3A\x2215790\x22,\x22h_team\x22\x3A\x22Juventus\x22,\x22a_team\x22\x3A\x22Inter\x22,\x22h_goals\x22\x3A\x223\x22,\x22a_goals\x22\x3A\x222\x22,\x22date\x22\x3A\x222021\x2D05\x2D15\x2016\x3A00\x3A00\x22,\x22player_assisted\x22\x3Anull,\x22lastAction\x22\x3A\x22Standard\x22\x7D\x5D'

>>> b = eval("b'" + a + "'")                      # wrap the string in b'...' and eval() it to get bytes

>>> b

b'[{"id":"32535","minute":"18","result":"SavedShot","X":"0.845","Y":"0.49900001525878906","xG":"0.06659495085477829","player":"Cristiano Ronaldo","h_a":"h","player_id":"2371","situation":"SetPiece","season":"2014","shotType":"RightFoot","match_id":"5834","h_team":"Real Madrid","a_team":"Cordoba","h_goals":"2","a_goals":"0","date":"2014-08-25 19:00:00","player_assisted":"Luka Modric","lastAction":"Pass"},{"id":"422004","minute":"23","result":"SavedShot","X":"0.885","Y":"0.5","xG":"0.7612988352775574","player":"Cristiano Ronaldo","h_a":"h","player_id":"2371","situation":"Penalty","season":"2020","shotType":"RightFoot","match_id":"15790","h_team":"Juventus","a_team":"Inter","h_goals":"3","a_goals":"2","date":"2021-05-15 16:00:00","player_assisted":null,"lastAction":"Standard"}]'

>>> type(b)                                     # the result is a bytes object

<class 'bytes'>

>>> b.decode()                               # decode() to str; the content is pure ASCII, so any ASCII-compatible encoding works

'[{"id":"32535","minute":"18","result":"SavedShot","X":"0.845","Y":"0.49900001525878906","xG":"0.06659495085477829","player":"Cristiano Ronaldo","h_a":"h","player_id":"2371","situation":"SetPiece","season":"2014","shotType":"RightFoot","match_id":"5834","h_team":"Real Madrid","a_team":"Cordoba","h_goals":"2","a_goals":"0","date":"2014-08-25 19:00:00","player_assisted":"Luka Modric","lastAction":"Pass"},{"id":"422004","minute":"23","result":"SavedShot","X":"0.885","Y":"0.5","xG":"0.7612988352775574","player":"Cristiano Ronaldo","h_a":"h","player_id":"2371","situation":"Penalty","season":"2020","shotType":"RightFoot","match_id":"15790","h_team":"Juventus","a_team":"Inter","h_goals":"3","a_goals":"2","date":"2021-05-15 16:00:00","player_assisted":null,"lastAction":"Standard"}]'

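The important fields are the shot position (X, Y), expected goals (xG), the shot result (result), and the season (season). Expected goals is the predicted probability of scoring, so xG = 1 means a certain goal. X and Y are relative values between 0 and 1, whereas the pitch drawn later (mplsoccer's 'opta' pitch type) uses coordinates from 0 to 100, so they must be multiplied by 100. result = Goal marks a goal, and season = 2014 means the 2014/2015 season.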

>>> import json                           # import the json module

>>> json.loads(b.decode())          # convert the JSON data to a list of dicts

[{'id': '32535', 'minute': '18', 'result': 'SavedShot', 'X': '0.845', 'Y': '0.49900001525878906', 'xG': '0.06659495085477829', 'player': 'Cristiano Ronaldo', 'h_a': 'h', 'player_id': '2371', 'situation': 'SetPiece', 'season': '2014', 'shotType': 'RightFoot', 'match_id': '5834', 'h_team': 'Real Madrid', 'a_team': 'Cordoba', 'h_goals': '2', 'a_goals': '0', 'date': '2014-08-25 19:00:00', 'player_assisted': 'Luka Modric', 'lastAction': 'Pass'}, {'id': '422004', 'minute': '23', 'result': 'SavedShot', 'X': '0.885', 'Y': '0.5', 'xG': '0.7612988352775574', 'player': 'Cristiano Ronaldo', 'h_a': 'h', 'player_id': '2371', 'situation': 'Penalty', 'season': '2020', 'shotType': 'RightFoot', 'match_id': '15790', 'h_team': 'Juventus', 'a_team': 'Inter', 'h_goals': '3', 'a_goals': '2', 'date': '2021-05-15 16:00:00', 'player_assisted': None, 'lastAction': 'Standard'}]

>>> json.loads(b)                         # json.loads() also accepts bytes directly, without decoding first

[{'id': '32535', 'minute': '18', 'result': 'SavedShot', 'X': '0.845', 'Y': '0.49900001525878906', 'xG': '0.06659495085477829', 'player': 'Cristiano Ronaldo', 'h_a': 'h', 'player_id': '2371', 'situation': 'SetPiece', 'season': '2014', 'shotType': 'RightFoot', 'match_id': '5834', 'h_team': 'Real Madrid', 'a_team': 'Cordoba', 'h_goals': '2', 'a_goals': '0', 'date': '2014-08-25 19:00:00', 'player_assisted': 'Luka Modric', 'lastAction': 'Pass'}, {'id': '422004', 'minute': '23', 'result': 'SavedShot', 'X': '0.885', 'Y': '0.5', 'xG': '0.7612988352775574', 'player': 'Cristiano Ronaldo', 'h_a': 'h', 'player_id': '2371', 'situation': 'Penalty', 'season': '2020', 'shotType': 'RightFoot', 'match_id': '15790', 'h_team': 'Juventus', 'a_team': 'Inter', 'h_goals': '3', 'a_goals': '2', 'date': '2021-05-15 16:00:00', 'player_assisted': None, 'lastAction': 'Standard'}]

>>> type(json.loads(b))                # the result is a list

<class 'list'>
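The session above builds the bytes object with eval(), which executes whatever the scraped string contains. As a hedged alternative (not the method used in this article), the sketch below replaces only the literal backslash-x escapes with a regular expression and never evaluates the text. The helper name decode_hex_escaped is hypothetical, and the sample string is a shortened excerpt of the data shown earlier.

```python
import json
import re

# Matches a literal backslash, an "x", and two hex digits.
_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")

def decode_hex_escaped(raw: str) -> bytes:
    # Replace each escape with the character of that code, then map the
    # characters one-to-one onto bytes (Latin-1 covers codes 0 to 255).
    text = _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), raw)
    return text.encode("latin-1")

# Shortened excerpt of the escaped payload shown earlier in this section.
raw = r'\x5B\x7B\x22player\x22\x3A\x22Cristiano\x20Ronaldo\x22,\x22xG\x22\x3A\x220.7612988352775574\x22\x7D\x5D'

shots = json.loads(decode_hex_escaped(raw))   # json.loads() accepts bytes
print(shots)
```

The design choice here is to touch only the escapes the page is known to use, so any unexpected content ends up as a JSON parse error rather than as executed code.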

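With that analysis and background in place, it is time to scrape the page. The page is fetched with the get() function of the requests module, and the contents of the <script>...</script> tags are extracted with the find_all() method of the BeautifulSoup class from the BeautifulSoup4 package.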
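Picking a script by its position breaks if the page ever adds or reorders its scripts. As a rough sketch of one alternative, the function below finds the payload by variable name with a regular expression instead. It works on plain HTML text, so it uses no network access here; the function name find_json_parse_payload and the sample HTML are hypothetical, and the pattern assumes that any apostrophe inside the payload is itself escaped, so the first closing quote ends it.

```python
import re

def find_json_parse_payload(html: str, var_name: str) -> str:
    """Return the raw text between the quotes of: var <var_name> = JSON.parse('...')."""
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*JSON\.parse\('([^']*)'\)"
    match = re.search(pattern, html)
    if match is None:
        raise ValueError(f"no JSON.parse assignment found for {var_name!r}")
    return match.group(1)

# Made-up page fragment with two script variables, mimicking the page layout.
sample_html = r"""
<script>var otherData = JSON.parse('\x7B\x7D')</script>
<script>
    var shotsData = JSON.parse('\x5B\x7B\x22X\x22\x3A\x220.885\x22\x7D\x5D')
</script>
"""

payload = find_json_parse_payload(sample_html, "shotsData")
print(payload)   # still hex-escaped; it must be decoded before json.loads()
```

With a real page, the html argument would be the text of the response returned by requests.get().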

4. Drawing scatter plots in matplotlib: the scatter() function

The scatter() function in the pyplot module draws a scatter plot. In recent matplotlib 3.x releases its signature is:

matplotlib.pyplot.scatter(x, y, s=None, c=None, marker=None, cmap=None,
       norm=None, vmin=None, vmax=None, alpha=None, linewidths=None,
       *, edgecolors=None, plotnonfinite=False, data=None, **kwargs)

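The commonly used parameters are:

x, y: the data for the x axis and the y axis.

s: the marker size. If a one-dimensional array is passed, it gives the size of each point.

c: the marker color. If a one-dimensional array is passed, it gives the color of each point.

marker: the marker style (the shape of the points), see Table 1.

alpha: the transparency of the points, a number between 0 and 1. For large data sets, a small alpha combined with an adjusted s lets overlapping points build up, so clusters in the data stand out clearly.

cmap: the colormap (gradient or list of colors) used to turn numeric c values into colors.

Table 1. marker settings with the corresponding symbols and descriptions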
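To make the parameters above concrete, here is a small sketch that draws 50 random points with per-point sizes and colors. All the data is randomly generated for illustration, and the Agg backend is chosen only so that the script also runs on a machine without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")              # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)     # fixed seed so the figure is reproducible
x = rng.uniform(0, 100, 50)
y = rng.uniform(0, 100, 50)
prob = rng.uniform(0.02, 0.8, 50)  # made-up probabilities standing in for xG
sizes = np.sqrt(prob) * 100        # one size per point

fig, ax = plt.subplots(figsize=(5, 4))
points = ax.scatter(x, y, s=sizes, c=prob, cmap="viridis", marker="o",
                    alpha=0.5, edgecolors="black")
fig.colorbar(points, ax=ax, label="probability")

buffer = io.BytesIO()              # render to memory instead of a file
fig.savefig(buffer, format="png")
plt.close(fig)
```

Because c is a numeric array here, cmap decides the color of each point; with a single color name in c, cmap would have no effect.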

5. Complete code

The complete code is as follows:

#############################################################
# Author: Zhang Ruilin      Created: 2021-01-10 18:35
#                           Revised: 2022-12-28 10:13
# Shot-location chart for a footballer, drawn with Matplotlib
#############################################################
import requests  # fetches web pages
from bs4 import BeautifulSoup  # parses pages and extracts information
import json  # JSON to dict and dict to JSON
import pandas as pd  # tabular data handling
import matplotlib.pyplot as plt  # MATLAB-style plotting interface
import numpy as np  # numerical computing library
import matplotlib as mpl
import mplsoccer  # draws football pitches
# Kylian Mbappé's player_id is 3423
url = 'https://understat.com/player/3423'  # page to request
html = requests.get(url)  # fetch the page
# Parse the page
soup_parse = BeautifulSoup(html.content, 'lxml')  # parse the HTML (needs the lxml parser installed)
scripts = soup_parse.find_all('script')  # find all <script> tags; returns a list
strings = scripts[3].string  # text of the script that holds the shotsData variable
_start = strings.index("('")+2  # start: first character after JSON.parse('
_end = strings.index("')")  # end: the ' that follows \x5D, not included
json_data = strings[_start:_end]  # the part between the quotes (the JSON data)
json_data = eval("b'"+json_data+"'")  # turn the \xYY hex escapes into bytes
data = json.loads(json_data)  # convert to a list of dicts
# Keep the shot position (X, Y), expected goals (xG), result and season
columns = {'X': 'X', 'Y': 'Y', 'xG': 'xG', 'result': 'Result', 'season': 'Season'}
df_data = pd.DataFrame(data)[list(columns)].rename(columns=columns)
numeric_cols = ['X', 'Y', 'xG', 'Season']
df_data[numeric_cols] = df_data[numeric_cols].apply(pd.to_numeric)  # numeric strings to numbers
df_data[['X', 'Y']] = df_data[['X', 'Y']] * 100  # source values are relative (0 to 1); scale by 100
# df_data.to_csv(r'd:/Mbappé_shooting.csv')  # optionally save to a file
background, text_color = 'lightgray', 'black'  # background (light gray) and text (black) colors
mpl.rcParams['text.color'] = text_color  # set the text color
mpl.rcParams['font.sans-serif'] = ['simsun']  # default font SimSun, so Chinese text renders
mpl.rcParams['legend.fontsize'] = 15  # legend font size 15 pt
fig, ax = plt.subplots(figsize=(7, 5.6))  # new 7 x 5.6 inch figure
ax.axis('off')  # hide the default axes
fig.set_facecolor(background)  # fill with the background color
pitch = mplsoccer.VerticalPitch(half=True, pitch_type='opta', line_zorder=3,
        pitch_color='grass')  # vertical half pitch
axes = fig.add_axes((0.05, 0.06, 0.9, 0.9))  # drawing area: lower-left corner at (0.05, 0.06),
axes.patch.set_facecolor(background)  # width and height each 90% of the figure
pitch.draw(ax=axes)
season = 2021  # season to plot; range: 2014 to (current year - 1)
df = df_data.loc[df_data['Season'] == season]  # rows for the chosen season
is_goal = df['Result'] == 'Goal'
missed, scored = df[~is_goal], df[is_goal]
# Shots of that season that did not score: cyan, alpha 0.5
pitch.scatter(missed['X'], missed['Y'],
         s=np.sqrt(missed['xG'])*100, marker='o', alpha=0.5,
         edgecolor='black', facecolor='cyan', ax=axes, label='No goal')
# Shots of that season that scored: crimson, alpha 0.7
pitch.scatter(scored['X'], scored['Y'],
         s=np.sqrt(scored['xG'])*100, marker='o', alpha=0.7,
         edgecolor='black', facecolor='crimson', ax=axes, label='Goal')
axes.legend(loc='lower right')  # add the legend
# Text annotations
axes.text(25, 64, f"Expected goals: {df['xG'].sum():.2f}", weight='bold',
              size=14)  # sum of df['xG']
axes.text(25, 61, f"Goals: {len(df[df['Result'] == 'Goal'])}",
              weight='bold', size=14)  # number of rows where df['Result'] == 'Goal'
axes.text(25, 58, f"Shots: {len(df)}", weight='bold', size=14)  # number of rows for this season
axes.text(95, 60, f'{season}-{season+1} season', weight='bold', size=18)

plt.show()

The result is shown in Figure 4.

Figure 4. Shot-location chart for Kylian Mbappé
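The listing sizes each marker with s = np.sqrt(xG)*100. In matplotlib, s is the marker area in points squared, so taking the square root compresses the range: low-probability shots stay visible while near-certain chances are still the largest. The sketch below works through a few values with the standard library only; the helper name marker_area and the chosen xG values are illustrative.

```python
import math

def marker_area(xg: float, scale: float = 100.0) -> float:
    """Marker area (points squared) under the same rule as the listing: sqrt(xG) * scale."""
    return math.sqrt(xg) * scale

full = marker_area(1.0)   # area of a certain goal (xG = 1)
for xg in (0.05, 0.25, 0.76, 1.0):
    area = marker_area(xg)
    # Diameter grows with the square root of the area.
    relative_diameter = math.sqrt(area / full)
    print(f"xG={xg:.2f}  s={area:6.1f}  diameter relative to xG=1: {relative_diameter:.2f}")
```

For example, a shot with xG = 0.25 gets half the area of a certain goal rather than a quarter, which is what s = xG*100 would give.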

Reposted from blog.csdn.net/hz_zhangrl/article/details/128490494