STAT 430 Final Project - EPL Team Attack Pattern Sequence¶

Junseok Yang (jyang247)

Introduction & Motivation¶

Football (soccer) is one of the most popular sports in the world, which 211 countries have joined the football association FIFA. With many different leagues from all over the world, European leagues are considered to be the best and strongest leagues referring to Opta and English Premier League (EPL) is one of the most popular league among the European leagues in terms of performance, market size, popularity, etc. There are many reasons of millions of people watching and enjoying the league and one reason is that there are relatively many dynamic upset games, meaning teams that are expected to be favored (win) get defeated by relatively weak teams. There can be many factors related to this, and I believe that an unique tactic of a team plays a huge role of affecting the game result. Creative strategies in modern football such as inverted full-back, constant triangle/square player connection, involvement of goalkeeper in build-up stage and extreme cases like bus parking can not only help teams to prepare and react challenging situations efficiently, but also help overcome individual players' weakness and earn meaningful points. With the data provided by Pappalardo, L. et al, I am going to study and understand about football strategies, specifically shot event sequence which is relevant to the main topic of the project, attack pattern.

Research Question¶

The main research question for this project would be "Do different teams in the EPL tend to have their own unique shot event sequence(s)?"

Why Shot sequence?¶

There are many ways of earning points from match, and scoing more goals than opponent team is one trivial way to win a game. The more a team can carry the ball successfully and create shots on target, the more likely they can score goals and win. With diverse players with diverse strengths/weaknesses, managers and coaches study, plan, and train several unique attack patterns to their players in order to maximize their performance during the game. In this process, there are specific players who are involved more in the attack sequences than other players, and it is crucial, especially like strategy analyst or scouter, to not only understand the attack pattern of each team, but also find and buy talented players from other teams who are versatile and could potentially contribute to their team success.

Data Preprocessing¶

In order to answer the research question, it is necessary to preprocess the data to have a correct structure. Considering that the question is asking about shot event sequence, each observation (row) needs to be a single shot event sequence. Below is the preprocessing stages to transform the data to have observations of shot event sequence.

Any codes from the paper by Pappalardo, L., Cintia, P., Rossi, A. et al will have a comment in the code chunks (# Code from Pappalardo, L., Cintia, P., Rossi, A. et al).

In [1]:
# Import packages for the project
import json
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
import seaborn as sns
import pandas as pd
import sys,os
import warnings
warnings.filterwarnings('ignore')
In [2]:
# Code from Pappalardo, L., Cintia, P., Rossi, A. et al
# loading the events data
events={}
nations = ['Italy','England','Germany','France','Spain']
for nation in nations:
    with open('events/events_%s.json' %nation) as json_data:
        events[nation] = json.load(json_data)
        
# loading the match data
matches={}
nations = ['Italy','England','Germany','France','Spain']
for nation in nations:
    with open('matches/matches_%s.json' %nation) as json_data:
        matches[nation] = json.load(json_data)

# loading the players data
players={}
with open('players.json') as json_data:
    players = json.load(json_data)

# loading the competitions data
competitions={}
with open('competitions.json') as json_data:
    competitions = json.load(json_data)
In [3]:
# Data files from Pappalardo, L., Cintia, P., Rossi, A. et al
# Read the csv files
df_matches = pd.read_csv("matches.csv")
df_events = pd.read_csv("events.csv")
df_players = pd.read_csv("players.csv")
df_competitions = pd.read_csv("competitions.csv")
In [4]:
# Code from Pappalardo, L., Cintia, P., Rossi, A. et al
# Pitch function
def pitch():
    """
    code to plot a soccer pitch 
    """
    #create figure
    fig,ax=plt.subplots(figsize=(7,5))
    
    #Pitch Outline & Centre Line
    plt.plot([0,0],[0,100], color="black")
    plt.plot([0,100],[100,100], color="black")
    plt.plot([100,100],[100,0], color="black")
    plt.plot([100,0],[0,0], color="black")
    plt.plot([50,50],[0,100], color="black")

    #Left Penalty Area
    plt.plot([16.5,16.5],[80,20],color="black")
    plt.plot([0,16.5],[80,80],color="black")
    plt.plot([16.5,0],[20,20],color="black")

    #Right Penalty Area
    plt.plot([83.5,100],[80,80],color="black")
    plt.plot([83.5,83.5],[80,20],color="black")
    plt.plot([83.5,100],[20,20],color="black")

    #Left 6-yard Box
    plt.plot([0,5.5],[65,65],color="black")
    plt.plot([5.5,5.5],[65,35],color="black")
    plt.plot([5.5,0.5],[35,35],color="black")

    #Right 6-yard Box
    plt.plot([100,94.5],[65,65],color="black")
    plt.plot([94.5,94.5],[65,35],color="black")
    plt.plot([94.5,100],[35,35],color="black")

    #Prepare Circles
    centreCircle = Ellipse((50, 50), width=30, height=39, edgecolor="black", facecolor="None", lw=1.8)
    centreSpot = Ellipse((50, 50), width=1, height=1.5, edgecolor="black", facecolor="black", lw=1.8)
    leftPenSpot = Ellipse((11, 50), width=1, height=1.5, edgecolor="black", facecolor="black", lw=1.8)
    rightPenSpot = Ellipse((89, 50), width=1, height=1.5, edgecolor="black", facecolor="black", lw=1.8)

    #Draw Circles
    ax.add_patch(centreCircle)
    ax.add_patch(centreSpot)
    ax.add_patch(leftPenSpot)
    ax.add_patch(rightPenSpot)
    
    #limit axis
    plt.xlim(0,100)
    plt.ylim(0,100)
    
    ax.annotate("", xy=(25, 5), xytext=(5, 5),
                arrowprops=dict(arrowstyle="->", linewidth=2))
    ax.text(7,7,'Attack',fontsize=20)
    return fig,ax
In [5]:
# Code from Pappalardo, L., Cintia, P., Rossi, A. et al
def draw_pitch(pitch, line, orientation, view):
    """
    Draw a soccer pitch given the pitch, the orientation, the view and the line
    
    Parameters
    ----------
    pitch
    
    """
    orientation = orientation
    view = view
    line = line
    pitch = pitch
    
    if orientation.lower().startswith("h"):
        
        if view.lower().startswith("h"):
            fig,ax = plt.subplots(figsize=(6.8,10.4))
            plt.xlim(49,105)
            plt.ylim(-1,69)
        else:
            fig,ax = plt.subplots(figsize=(10.4,6.8))
            plt.xlim(-1,105)
            plt.ylim(-1,69)
        ax.axis('off') # this hides the x and y ticks
    
        # side and goal lines #
        ly1 = [0,0,68,68,0]
        lx1 = [0,104,104,0,0]

        plt.plot(lx1,ly1,color=line,zorder=5)


        # boxes, 6 yard box and goals

            #outer boxes#
        ly2 = [13.84,13.84,54.16,54.16] 
        lx2 = [104,87.5,87.5,104]
        plt.plot(lx2,ly2,color=line,zorder=5)

        ly3 = [13.84,13.84,54.16,54.16] 
        lx3 = [0,16.5,16.5,0]
        plt.plot(lx3,ly3,color=line,zorder=5)

            #goals#
        ly4 = [30.34,30.34,37.66,37.66]
        lx4 = [104,104.2,104.2,104]
        plt.plot(lx4,ly4,color=line,zorder=5)

        ly5 = [30.34,30.34,37.66,37.66]
        lx5 = [0,-0.2,-0.2,0]
        plt.plot(lx5,ly5,color=line,zorder=5)


           #6 yard boxes#
        ly6 = [24.84,24.84,43.16,43.16]
        lx6 = [104,99.5,99.5,104]
        plt.plot(lx6,ly6,color=line,zorder=5)

        ly7 = [24.84,24.84,43.16,43.16]
        lx7 = [0,4.5,4.5,0]
        plt.plot(lx7,ly7,color=line,zorder=5)

        #Halfway line, penalty spots, and kickoff spot
        ly8 = [0,68] 
        lx8 = [52,52]
        plt.plot(lx8,ly8,color=line,zorder=5)


        plt.scatter(93,34,color=line,zorder=5)
        plt.scatter(11,34,color=line,zorder=5)
        plt.scatter(52,34,color=line,zorder=5)

        circle1 = plt.Circle((93.5,34), 9.15,ls='solid',lw=1.5,color=line, fill=False, zorder=1,alpha=1)
        circle2 = plt.Circle((10.5,34), 9.15,ls='solid',lw=1.5,color=line, fill=False, zorder=1,alpha=1)
        circle3 = plt.Circle((52, 34), 9.15,ls='solid',lw=1.5,color=line, fill=False, zorder=2,alpha=1)

        ## Rectangles in boxes
        rec1 = plt.Rectangle((87.5,20), 16,30,ls='-',color=pitch, zorder=1,alpha=1)
        rec2 = plt.Rectangle((0, 20), 16.5,30,ls='-',color=pitch, zorder=1,alpha=1)

        ## Pitch rectangle
        rec3 = plt.Rectangle((-1, -1), 106,70,ls='-',color=pitch, zorder=1,alpha=1)

        ax.add_artist(rec3)
        ax.add_artist(circle1)
        ax.add_artist(circle2)
        ax.add_artist(rec1)
        ax.add_artist(rec2)
        ax.add_artist(circle3)
        
    else:
        if view.lower().startswith("h"):
            fig,ax = plt.subplots(figsize=(10.4,6.8))
            plt.ylim(49,105)
            plt.xlim(-1,69)
        else:
            fig,ax = plt.subplots(figsize=(6.8,10.4))
            plt.ylim(-1,105)
            plt.xlim(-1,69)
        ax.axis('off') # this hides the x and y ticks

        # side and goal lines #
        lx1 = [0,0,68,68,0]
        ly1 = [0,104,104,0,0]

        plt.plot(lx1,ly1,color=line,zorder=5)


        # boxes, 6 yard box and goals

            #outer boxes#
        lx2 = [13.84,13.84,54.16,54.16] 
        ly2 = [104,87.5,87.5,104]
        plt.plot(lx2,ly2,color=line,zorder=5)

        lx3 = [13.84,13.84,54.16,54.16] 
        ly3 = [0,16.5,16.5,0]
        plt.plot(lx3,ly3,color=line,zorder=5)

            #goals#
        lx4 = [30.34,30.34,37.66,37.66]
        ly4 = [104,104.2,104.2,104]
        plt.plot(lx4,ly4,color=line,zorder=5)

        lx5 = [30.34,30.34,37.66,37.66]
        ly5 = [0,-0.2,-0.2,0]
        plt.plot(lx5,ly5,color=line,zorder=5)


           #6 yard boxes#
        lx6 = [24.84,24.84,43.16,43.16]
        ly6 = [104,99.5,99.5,104]
        plt.plot(lx6,ly6,color=line,zorder=5)

        lx7 = [24.84,24.84,43.16,43.16]
        ly7 = [0,4.5,4.5,0]
        plt.plot(lx7,ly7,color=line,zorder=5)

        #Halfway line, penalty spots, and kickoff spot
        lx8 = [0,68] 
        ly8 = [52,52]
        plt.plot(lx8,ly8,color=line,zorder=5)


        plt.scatter(34,93,color=line,zorder=5)
        plt.scatter(34,11,color=line,zorder=5)
        plt.scatter(34,52,color=line,zorder=5)

        circle1 = plt.Circle((34,93.5), 9.15,ls='solid',lw=1.5,color=line, fill=False, zorder=1,alpha=1)
        circle2 = plt.Circle((34,10.5), 9.15,ls='solid',lw=1.5,color=line, fill=False, zorder=1,alpha=1)
        circle3 = plt.Circle((34,52), 9.15,ls='solid',lw=1.5,color=line, fill=False, zorder=2,alpha=1)


        ## Rectangles in boxes
        rec1 = plt.Rectangle((20, 87.5), 30,16.5,ls='-',color=pitch, zorder=1,alpha=1)
        rec2 = plt.Rectangle((20, 0), 30,16.5,ls='-',color=pitch, zorder=1,alpha=1)

        ## Pitch rectangle
        rec3 = plt.Rectangle((-1, -1), 70,106,ls='-',color=pitch, zorder=1,alpha=1)

        ax.add_artist(rec3)
        ax.add_artist(circle1)
        ax.add_artist(circle2)
        ax.add_artist(rec1)
        ax.add_artist(rec2)
        ax.add_artist(circle3)
In [6]:
# Data files from Pappalardo, L., Cintia, P., Rossi, A. et al
# tag2name = Information about 'Tag' code
tags2name = pd.read_csv("tags2name.csv")
eventid2name = pd.read_csv("eventid2name.csv")

tags2name.head(3)
Out[6]:
Tag Label Description
0 101 Goal Goal
1 102 own_goal Own goal
2 301 assist Assist
In [7]:
# eventid2name = Information about 'event' code
eventid2name.head(3)
Out[7]:
event subevent event_label subevent_label
0 1 10 Duel Air duel
1 1 11 Duel Ground attacking duel
2 1 12 Duel Ground defending duel
In [8]:
# df_players = Information about players
df_players.head(3)
Out[8]:
Unnamed: 0 passportArea weight firstName middleName lastName currentTeamId birthDate height role birthArea wyId foot shortName currentNationalTeamId
0 0 {'name': 'Turkey', 'id': '792', 'alpha3code': ... 78 Harun NaN Tekin 4502.0 1989-06-17 187 {'code2': 'GK', 'code3': 'GKP', 'name': 'Goalk... {'name': 'Turkey', 'id': '792', 'alpha3code': ... 32777 right H. Tekin 4687.0
1 1 {'name': 'Senegal', 'id': '686', 'alpha3code':... 73 Malang NaN Sarr 3775.0 1999-01-23 182 {'code2': 'DF', 'code3': 'DEF', 'name': 'Defen... {'name': 'France', 'id': '250', 'alpha3code': ... 393228 left M. Sarr 4423.0
2 2 {'name': 'France', 'id': '250', 'alpha3code': ... 72 Over NaN Mandanda 3772.0 1998-10-26 176 {'code2': 'GK', 'code3': 'GKP', 'name': 'Goalk... {'name': 'France', 'id': '250', 'alpha3code': ... 393230 NaN O. Mandanda NaN
In [9]:
# Data preprocessing for EPL team-code conversion (No need to run again)
# Filter only the EPL matches
epl = df_matches[df_matches['nation'] == 'England']

# Step 1: Extract team names from 'Label'
epl[['Home', 'Away']] = epl['label'].str.extract(r'(.+?) - (.+?),')

# Step 2: Extract team scores from 'label'
epl[['Home_score', 'Away_score']] = epl['label'].str.extract(r', (\d+) - (\d+)').astype(int)

# Step 3: Create a new DataFrame with 'code' and 'team'
team_id = epl.loc[epl['winner'] != 0, ['winner', 'Home', 'Away', 'Home_score', 'Away_score']].copy()
team_id['team'] = team_id.apply(
    lambda row: row['Home'] if row['Home_score'] > row['Away_score'] else row['Away'], axis=1
)
team_id = team_id[['winner', 'team']].rename(columns={'winner': 'code'}).drop_duplicates(subset='code').reset_index(drop=True)

# Save as .csv file
team_id.to_csv('team_id.csv', index=False)
In [10]:
# Read the team code file
team_id = pd.read_csv("team_id.csv")
team_id
Out[10]:
code team
0 1659 AFC Bournemouth
1 1628 Crystal Palace
2 1609 Arsenal
3 1612 Liverpool
4 1611 Manchester United
5 1613 Newcastle United
6 1625 Manchester City
7 1639 Stoke City
8 1624 Tottenham Hotspur
9 1633 West Ham United
10 1631 Leicester City
11 1619 Southampton
12 1610 Chelsea
13 1644 Watford
14 1627 West Bromwich Albion
15 1651 Brighton & Hove Albion
16 1623 Everton
17 1646 Burnley
18 1673 Huddersfield Town
19 10531 Swansea City

Defining 'Attack Pattern'¶

What is attack pattern?¶

Defining attack pattern can be vary, and since there is no explicit attack pattern feature in the data, there are many features such as spatial x & y coordinates,events name, etc, in the data which I can utilize and combine to represent the "attack pattern" to detect this "latent" (unobserved) feature.

How many events?¶

In this project, I will include "$\bold{two}$ previous events" before shots that were chances (tags with 'opportunity'). Goals and assists are the two statistics that are widely used to evaluate players, but the previous events before the assist are somewhat overlooked and underestimated. I believe these events are also crucial in terms of contributing the sequences and could even be considered as a "starting point". For instance,

$\bold{Long\ pass\ (30-40m)\ to\ penalty\ box\ (Defender) - Short\ pass\ assist\ (Striker) - Goal\ (Midfielder)}$

For the sequences like above, I think the defender should get as much (or even higher) credit as the midfielder (I'm not saying that the assist event is not important). Although a football game could be viewed as one large continuous sequence, these two previous events before shots are the events that are somewhat directly involved and affecting the final consequence which I consider to be the key concept in this project.

I will first extract all the two previous events before opportunity shots (also the shots) as observations (rows) and compress the three events into one sequence (row) including diverse features.

In [11]:
# Data preprocessing
# Need to filter only the shot-events sequence rows
# 1. Find all the shots/goals row that were chances (opportunity)
# 2. Extract the two previous (contribution-assist) rows to combine those 3 rows as 1 shot-events sequence
# 3. Compress the 3 rows of sequence into one row

# List of all EPL match IDs
epl_match_id_list = df_matches[df_matches['nation'] == 'England']['wyId'].unique()
# Specific tag (opportunity)
target_tag_id = 201

# Initiate empty list to store the data
all_shot_sequences = []

# For loop to iterate through matches
for match_id in epl_match_id_list:
    # Get the events for the current nation
    events_list = events['England']
    
    # Iterate through the events, starting from the 3rd event to check previous 2 events
    for i in range(2, len(events_list)):
        ev = events_list[i]
        
        # Check if the current event is a 'Shot' event with 'Opportunity' condition
        if ev['matchId'] == match_id and ev['subEventName'] == 'Shot' and any(tag['id'] == target_tag_id for tag in ev['tags']):
            # Get the sequence of events (previous 2 + current)
            shot_sequence = events_list[i-2:i+1]
            
            # Check if all previous 2 events and the current shot event have valid position data
            valid_sequence = True
            for event in shot_sequence:
                # If any event has fewer than 2 position coordinates, set the flag to False
                if len(event['positions']) < 2:
                    valid_sequence = False
                    break  # If any event is invalid, no need to continue checking
            
            # Only append the sequence if it is valid
            if valid_sequence:
                all_shot_sequences.extend(shot_sequence)

# Convert into pandas dataframe
df_test = pd.DataFrame(all_shot_sequences)
df_test.head(3)
Out[11]:
eventId subEventName tags playerId positions matchId eventName teamId matchPeriod eventSec subEventId id
0 7 Touch [{'id': 1302}] 8433 [{'y': 6, 'x': 39}, {'y': 23, 'x': 11}] 2500089 Others on the ball 1646 1H 1233.759219 72 251700446
1 8 Simple pass [{'id': 302}, {'id': 1801}] 9739 [{'y': 77, 'x': 89}, {'y': 61, 'x': 88}] 2500089 Pass 1659 1H 1238.114351 85 251700603
2 10 Shot [{'id': 402}, {'id': 201}, {'id': 1201}, {'id'... 245813 [{'y': 61, 'x': 88}, {'y': 100, 'x': 100}] 2500089 Shot 1659 1H 1239.333857 100 251700604
In [12]:
# Flip the spatial x & y coordinates
def extract_and_flip_coordinates(df):
    # Initiate empty lists to store the x & y coordinates
    x_coords, y_coords = [], []

    # Determine the number of sequences (3 rows per sequence)
    num_sequences = len(df) // 3 

    # Initiate the variables to store and update the maximum pitch size
    #max_x = 0
    #max_y = 0

    for seq in range(num_sequences):
        # Get the 3 rows for the current sequence
        events = df.iloc[seq * 3:(seq + 1) * 3].reset_index()
        
        # Initialize the summed numbers
        summed_x_1, summed_y_1, summed_x_2, summed_y_2 = None, None, None, None
        
        # Check mismatches and calculate summed numbers where necessary
        if events.loc[0, "teamId"] != events.loc[2, "teamId"]:  # First event mismatch
            summed_x_1 = events.loc[0, "positions"][1]['x'] + events.loc[1, "positions"][0]['x']
            summed_y_1 = events.loc[0, "positions"][1]['y'] + events.loc[1, "positions"][0]['y']
        
        if events.loc[1, "teamId"] != events.loc[2, "teamId"]:  # Second event mismatch
            summed_x_2 = events.loc[1, "positions"][1]['x'] + events.loc[2, "positions"][0]['x']
            summed_y_2 = events.loc[1, "positions"][1]['y'] + events.loc[2, "positions"][0]['y']

        # Adjust coordinates for all three events
        for i in range(3):
            if i == 0 and events.loc[0, "teamId"] != events.loc[2, "teamId"]:  # Flip for first event mismatch
                x_coords.append(summed_x_1 - events.loc[0, "positions"][0]['x'])
                y_coords.append(summed_y_1 - events.loc[0, "positions"][0]['y'])
            elif i == 1 and events.loc[1, "teamId"] != events.loc[2, "teamId"]:  # Flip for second event mismatch
                x_coords.append(summed_x_2 - events.loc[1, "positions"][0]['x'])
                y_coords.append(summed_y_2 - events.loc[1, "positions"][0]['y'])
            else:  # No flip needed
                x_coords.append(events.loc[i, "positions"][0]['x'])
                y_coords.append(events.loc[i, "positions"][0]['y'])

        # Check if all matches have same size of pitch

    return x_coords, y_coords
In [13]:
# Output the updated DataFrame
df_test['x'], df_test['y'] = extract_and_flip_coordinates(df_test)
df_test.head(3)
Out[13]:
eventId subEventName tags playerId positions matchId eventName teamId matchPeriod eventSec subEventId id x y
0 7 Touch [{'id': 1302}] 8433 [{'y': 6, 'x': 39}, {'y': 23, 'x': 11}] 2500089 Others on the ball 1646 1H 1233.759219 72 251700446 61 94
1 8 Simple pass [{'id': 302}, {'id': 1801}] 9739 [{'y': 77, 'x': 89}, {'y': 61, 'x': 88}] 2500089 Pass 1659 1H 1238.114351 85 251700603 89 77
2 10 Shot [{'id': 402}, {'id': 201}, {'id': 1201}, {'id'... 245813 [{'y': 61, 'x': 88}, {'y': 100, 'x': 100}] 2500089 Shot 1659 1H 1239.333857 100 251700604 88 61
In [14]:
# Convert the Tag-Name dataframe into dictionary form
tags2name_dict = dict(zip(tags2name['Tag'], tags2name['Description']))

# Assuming 'tags' contains dictionaries with an 'id' key
df_test['tagLabels'] = df_test['tags'].apply(lambda tags: ', '.join([tags2name_dict[tag['id']] for tag in tags]))
df_test['tagsID'] = df_test['tags'].astype(str).str.findall(r'\d+').apply(lambda x: ', '.join(map(str, x)))

# Create a new variable to store which team is involved in the event
df_test['team'] = df_test.merge(team_id, left_on='teamId', right_on='code', how='left')['team']

# Create a new variable to store the actual player name from 'df_players'
df_test['player'] = df_test.merge(df_players, left_on='playerId', right_on='wyId', how='left')['shortName']

# Create a new variable to convert the event time into minutes
df_test['eventMin'] = (df_test['eventSec'] / 60).round(2)
df_test.head(3)
Out[14]:
eventId subEventName tags playerId positions matchId eventName teamId matchPeriod eventSec subEventId id x y tagLabels tagsID team player eventMin
0 7 Touch [{'id': 1302}] 8433 [{'y': 6, 'x': 39}, {'y': 23, 'x': 11}] 2500089 Others on the ball 1646 1H 1233.759219 72 251700446 61 94 Missed ball 1302 Burnley S. Ward 20.56
1 8 Simple pass [{'id': 302}, {'id': 1801}] 9739 [{'y': 77, 'x': 89}, {'y': 61, 'x': 88}] 2500089 Pass 1659 1H 1238.114351 85 251700603 89 77 Key pass, Accurate 302, 1801 AFC Bournemouth J. Ibe 20.64
2 10 Shot [{'id': 402}, {'id': 201}, {'id': 1201}, {'id'... 245813 [{'y': 61, 'x': 88}, {'y': 100, 'x': 100}] 2500089 Shot 1659 1H 1239.333857 100 251700604 88 61 Right foot, Opportunity, Position: Goal low ce... 402, 201, 1201, 1801 AFC Bournemouth L. Mousset 20.66
In [15]:
# Save as .csv file
df_test.to_csv("df_test.csv", index=False)

Check-point¶

In [16]:
# Read the saved csv file
df_test = pd.read_csv("df_test.csv")
df_test.head(3)
Out[16]:
eventId subEventName tags playerId positions matchId eventName teamId matchPeriod eventSec subEventId id x y tagLabels tagsID team player eventMin
0 7 Touch [{'id': 1302}] 8433 [{'y': 6, 'x': 39}, {'y': 23, 'x': 11}] 2500089 Others on the ball 1646 1H 1233.759219 72 251700446 61 94 Missed ball 1302 Burnley S. Ward 20.56
1 8 Simple pass [{'id': 302}, {'id': 1801}] 9739 [{'y': 77, 'x': 89}, {'y': 61, 'x': 88}] 2500089 Pass 1659 1H 1238.114351 85 251700603 89 77 Key pass, Accurate 302, 1801 AFC Bournemouth J. Ibe 20.64
2 10 Shot [{'id': 402}, {'id': 201}, {'id': 1201}, {'id'... 245813 [{'y': 61, 'x': 88}, {'y': 100, 'x': 100}] 2500089 Shot 1659 1H 1239.333857 100 251700604 88 61 Right foot, Opportunity, Position: Goal low ce... 402, 201, 1201, 1801 AFC Bournemouth L. Mousset 20.66
In [ ]:
# Next preprocessing - Pivoting
# Generate a sequence of numbers to represent the group of 3 events (shot sequence)
# Create a new DataFrame to store the transformed shot sequence rows
shot_sequences = []

# Iterate through the dataframe in steps of 3 rows at a time
for i in range(0, len(df_test), 3):
    # Get the current chunk of 3 rows (ensuring we don't go out of bounds)
    chunk = df_test.iloc[i:i+3]
    
    # If there are fewer than 3 rows left at the end of the DataFrame, skip it
    if len(chunk) < 3:
        continue

    # Create a new row for the shot sequence
    new_row = {}

    # Iterate over the columns (except matchId) and create new column names with '_1', '_2', '_3'
    for col in df_test.columns:
        if col == 'matchId':
            new_row[col] = chunk[col].iloc[0]  # All matchIds in the chunk should be the same
        else:
            for j in range(3):
                new_row[f"{col}_{j+1}"] = chunk[col].iloc[j]
    
    # Append the new row to the shot_sequences list
    shot_sequences.append(new_row)

# Convert the list of shot sequences into a new DataFrame
shot_sequence_df = pd.DataFrame(shot_sequences)
shot_sequence_df.head(3)
Out[ ]:
eventId_1 eventId_2 eventId_3 subEventName_1 subEventName_2 subEventName_3 tags_1 tags_2 tags_3 playerId_1 ... tagsID_3 team_1 team_2 team_3 player_1 player_2 player_3 eventMin_1 eventMin_2 eventMin_3
0 7 8 10 Touch Simple pass Shot [{'id': 1302}] [{'id': 302}, {'id': 1801}] [{'id': 402}, {'id': 201}, {'id': 1201}, {'id'... 8433 ... 402, 201, 1201, 1801 Burnley AFC Bournemouth AFC Bournemouth S. Ward J. Ibe L. Mousset 20.56 20.64 20.66
1 1 1 10 Ground attacking duel Ground defending duel Shot [{'id': 501}, {'id': 703}, {'id': 1801}] [{'id': 502}, {'id': 701}, {'id': 1802}] [{'id': 402}, {'id': 201}, {'id': 1203}, {'id'... 8980 ... 402, 201, 1203, 1801 Burnley AFC Bournemouth Burnley J. Hendrick S. Cook J. Hendrick 22.04 22.04 22.07
2 8 8 10 Simple pass Simple pass Shot [{'id': 1801}] [{'id': 302}, {'id': 1801}] [{'id': 401}, {'id': 201}, {'id': 1203}, {'id'... 245813 ... 401, 201, 1203, 1801 AFC Bournemouth AFC Bournemouth AFC Bournemouth L. Mousset J. King C. Daniels 24.84 24.88 24.91

3 rows × 55 columns

Adding New Features - Progression & Time Duration¶

  1. Progression Distance & Ratio
  • Progressing a ball to opponent's side is crucial in terms of increasing the possibility of threatening shots which obviously more likely to score goals. Here, we are creating two features, "Progression distance" and "Progression ratio". The former feature literally computes the progression distance, which would be simply the difference of "X" coordinates (Horizontal) between two events (i.e. Event2 - Event1 / Event3 - Event2). Since order does matter, it is possible to have a negative progression distance, indicating that the ball is going backward (i.e. Backpass). Using this distance, we are also calculating the proportion (ratio) of the progression by dividing the distance to its actual distance:

$$ \text{Progression Distance} = x_{i+1} - x_i $$

$$ \text{Progression Ratio} = \frac{x_{i+1} - x_i}{\sqrt{(x_{i+1} - x_i)^2 + (y_{i+1} - y_i)^2}} $$

  1. Time Duration
  • It may be also relevant to check the time duration between the events. In modern football, many strong teams tend to deliver the ball quickly to speed up the play tempo and make defenders difficult to react their play. By incorporating this feature in our analysis later might be helpful to reveal notable insights.
In [18]:
# Function for adding progression & event duration features
def add_progress_time_features(df, n_events=3):
    for i in range(1, n_events):
        # Event index suffix (e.g., 12 for event 1 to event 2)
        pair = f'{i}{i+1}'

        x_prev = f'x_{i}'
        x_curr = f'x_{i+1}'
        y_prev = f'y_{i}'
        y_curr = f'y_{i+1}'
        t_prev = f'eventSec_{i}'
        t_curr = f'eventSec_{i+1}'

        # 1. Horizontal progress distance
        shot_sequence_df[f'progress_dist_{pair}'] = shot_sequence_df[x_curr] - shot_sequence_df[x_prev]

        # 2. Euclidean distance
        euclid_dist = np.sqrt((shot_sequence_df[x_curr] - shot_sequence_df[x_prev])**2 + (shot_sequence_df[y_curr] - shot_sequence_df[y_prev])**2)

        # 3. Progress ratio (rounded to 2 decimals, replace NaNs with 0)
        ratio = shot_sequence_df[f'progress_dist_{pair}'] / euclid_dist
        shot_sequence_df[f'progress_ratio_{pair}'] = ratio.fillna(0).round(2)

        # 4. Time difference (rounded to 2 decimals)
        if t_prev in shot_sequence_df.columns and t_curr in shot_sequence_df.columns:
            shot_sequence_df[f'event_duration_{pair}'] = (shot_sequence_df[t_curr] - shot_sequence_df[t_prev]).round(2)
        else:
            shot_sequence_df[f'event_duration_{pair}'] = 0

    return shot_sequence_df
In [19]:
# Create another variable to see whether the team made the final shot is same as the team involved in the first and second event
shot_sequence_df['same_team_1'] = shot_sequence_df.apply(lambda row: 'same' if row['team_1'] == row['team_3'] else 'different', axis=1)
shot_sequence_df['same_team_2'] = shot_sequence_df.apply(lambda row: 'same' if row['team_2'] == row['team_3'] else 'different', axis=1)
shot_sequence_df['opponent_involve'] = shot_sequence_df.apply(lambda row: 'yes' if 'different' in [row['same_team_1'], row['same_team_2']] else 'no', axis=1)

# Variable to check whether the final shot made is accurate or not
shot_sequence_df['shot_accuracy'] = shot_sequence_df['tagLabels_3'].apply(lambda x: 'accurate' if 'Accurate' in x else ('not accurate' if 'Not accurate' in x else 'unknown'))

# Append progression & time features
shot_sequence_df = add_progress_time_features(shot_sequence_df)
shot_sequence_df.head(3)
Out[19]:
eventId_1 eventId_2 eventId_3 subEventName_1 subEventName_2 subEventName_3 tags_1 tags_2 tags_3 playerId_1 ... same_team_1 same_team_2 opponent_involve shot_accuracy progress_dist_12 progress_ratio_12 event_duration_12 progress_dist_23 progress_ratio_23 event_duration_23
0 7 8 10 Touch Simple pass Shot [{'id': 1302}] [{'id': 302}, {'id': 1801}] [{'id': 402}, {'id': 201}, {'id': 1201}, {'id'... 8433 ... different same yes accurate 28 0.85 4.36 -1 -0.06 1.22
1 1 1 10 Ground attacking duel Ground defending duel Shot [{'id': 501}, {'id': 703}, {'id': 1801}] [{'id': 502}, {'id': 701}, {'id': 1802}] [{'id': 402}, {'id': 201}, {'id': 1203}, {'id'... 8980 ... same different yes accurate 0 0.00 0.31 1 0.14 1.82
2 8 8 10 Simple pass Simple pass Shot [{'id': 1801}] [{'id': 302}, {'id': 1801}] [{'id': 401}, {'id': 201}, {'id': 1203}, {'id'... 245813 ... same same no accurate 15 0.37 2.36 6 0.39 1.54

3 rows × 65 columns

In [20]:
# Save as .csv file
shot_sequence_df.to_csv("shot_sequence_df.csv", index=False)

Analysis 1 - Latent Variable with Unsupervised Learning¶

"Spatial-Only" Clustering¶

The most simple way to define would be using the spatial features (x & y coordinates) and cluster the sequences. Since each observation represents a sequence of shot-events, it is a spatial trajectory which we can implement "Frechet distance" to meaure the distance between the sequences to reflect and preserve the spatial structure ("dog-walking-a-leash").

Frechet Distance¶

According to Wikipedia, Frechet distance is one of the popular metrics to compute the distance between spatial sequences or curves.

After converting the data into a distance matrix, we can apply diverse clustering algorithms. Considering that the data contains about 6000 sequences, using complex algorithms might be computationally heavy. Here, we will use relatively simple algorithms like K-Medoids which is similar to K-Means except that it uses distance-based matrix rather than the Euclidean distance.

Reference

  • Frechet Distance (https://en.wikipedia.org/wiki/Fréchet_distance)
In [11]:
# Read the saved csv file
shot_sequence_df = pd.read_csv("shot_sequence_df.csv")
shot_sequence_df.head(3)
Out[11]:
eventId_1 eventId_2 eventId_3 subEventName_1 subEventName_2 subEventName_3 tags_1 tags_2 tags_3 playerId_1 ... same_team_1 same_team_2 opponent_involve shot_accuracy progress_dist_12 progress_ratio_12 event_duration_12 progress_dist_23 progress_ratio_23 event_duration_23
0 7 8 10 Touch Simple pass Shot [{'id': 1302}] [{'id': 302}, {'id': 1801}] [{'id': 402}, {'id': 201}, {'id': 1201}, {'id'... 8433 ... different same yes accurate 28 0.85 4.36 -1 -0.06 1.22
1 1 1 10 Ground attacking duel Ground defending duel Shot [{'id': 501}, {'id': 703}, {'id': 1801}] [{'id': 502}, {'id': 701}, {'id': 1802}] [{'id': 402}, {'id': 201}, {'id': 1203}, {'id'... 8980 ... same different yes accurate 0 0.00 0.31 1 0.14 1.82
2 8 8 10 Simple pass Simple pass Shot [{'id': 1801}] [{'id': 302}, {'id': 1801}] [{'id': 401}, {'id': 201}, {'id': 1203}, {'id'... 245813 ... same same no accurate 15 0.37 2.36 6 0.39 1.54

3 rows × 65 columns

In [22]:
# Extract only the spatial coordinates columns
spatial_data = shot_sequence_df[['x_1', 'y_1', 'x_2', 'y_2', 'x_3', 'y_3']]

# Function to convert a row (pandas Series) to a sequence of 2D points
def row_to_sequence(row):
    # Create pairs of (x, y) coordinates for the 3 events
    return [(row[i], row[i+1]) for i in range(0, len(row), 2)]

# Convert the DataFrame rows into sequences of 2D points
sequences = spatial_data.apply(row_to_sequence, axis=1).tolist()
sequences[0:3]
Out[22]:
[[(np.int64(61), np.int64(94)),
  (np.int64(89), np.int64(77)),
  (np.int64(88), np.int64(61))],
 [(np.int64(77), np.int64(31)),
  (np.int64(77), np.int64(31)),
  (np.int64(78), np.int64(38))],
 [(np.int64(66), np.int64(81)),
  (np.int64(81), np.int64(43)),
  (np.int64(87), np.int64(29))]]
In [23]:
# Import package for Frechet distance
from frechetdist import frdist

# Compute the pairwise Frechet distance matrix
n = len(sequences)
frechet_dist = np.zeros((n, n))

# Compute pairwise Frechet distances (Takes about 20 minutes)
for i in range(n):
    for j in range(i + 1, n):
        distance = frdist(sequences[i], sequences[j])
        frechet_dist[i, j] = distance
        frechet_dist[j, i] = distance

# Display the Frechet-distance matrix
frechet_dist
Out[23]:
array([[ 0.        , 65.        , 32.01562119, ..., 63.95310782,
        56.0357029 , 56.8594759 ],
       [65.        ,  0.        , 51.19570294, ..., 14.31782106,
        23.34523506, 27.29468813],
       [32.01562119, 51.19570294,  0.        , ..., 50.11985634,
        42.63801121, 43.41658669],
       ...,
       [63.95310782, 14.31782106, 50.11985634, ...,  0.        ,
        11.40175425, 38.62641583],
       [56.0357029 , 23.34523506, 42.63801121, ..., 11.40175425,
         0.        , 37.65634077],
       [56.8594759 , 27.29468813, 43.41658669, ..., 38.62641583,
        37.65634077,  0.        ]], shape=(5916, 5916))
In [24]:
# Save the distance matrix as csv file
frechet_dist_df = pd.DataFrame(frechet_dist)
frechet_dist_df.to_csv("frechet_distance.csv", index=False)

K-Medoids Clustering¶

Clustering is an unsupervised learning method which handles data with no pre-assigned labels. This is useful descriptive analysis to segment observations sharing similar feature characteristics and further enable to understand how they are similar within or different to other clusters.

K-Medoids is a widely-used clustering algorithm that is similar to K-Means but works with distance-based matrix.

In [12]:
# Read the saved csv file
frechet_dist = pd.read_csv("frechet_distance.csv")
dist_matrix = frechet_dist.values
In [13]:
# Import package for K-Medoids
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score

# Define K values and random states
K_range = range(2, 21)
random_states = [207, 430, 437]

# Store average metrics
avg_cost = []
avg_sil = []

# Loop through each K
for k in K_range:
    temp_cost = []
    temp_sil = []

    # Run K-Medoids for each random state
    for rs in random_states:
        model = KMedoids(n_clusters=k, metric="precomputed", random_state=rs)
        labels = model.fit_predict(dist_matrix)

        # Save cost and silhouette
        temp_cost.append(model.inertia_)
        temp_sil.append(silhouette_score(dist_matrix, labels, metric="precomputed"))

    # Append average
    avg_cost.append(np.mean(temp_cost))
    avg_sil.append(np.mean(temp_sil))

# Plot both metrics
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Elbow Plot
axes[0].plot(K_range, avg_cost, marker='o')
axes[0].set_title('Average Elbow Plot (Frechet Distance Only)')
axes[0].set_xlabel('Number of Clusters (K)')
axes[0].set_ylabel('Cost (Inertia)')
axes[0].set_xticks(list(K_range))

# Silhouette Score Plot
axes[1].plot(K_range, avg_sil, marker='o', color='green')
axes[1].set_title('Average Silhouette Score (Frechet Distance Only)')
axes[1].set_xlabel('Number of Clusters (K)')
axes[1].set_ylabel('Score')
axes[1].set_xticks(list(K_range))

plt.tight_layout()
plt.show()
No description has been provided for this image

Combining Categorical Features¶

This time, other categorical features such as event types and shot accuracy that are likely to affect the attack pattern sequence will be included. With the Frechet-distance matrix for the spatial coordinates, the categorical features are going to be transformed into Gower's distance matrix.

  • Categorical: Event type / Opponent player involvement / Whether the shot was accurate
  • Numerical: Progress distance & ratio between events / Time duration between events

After the conversion process, the two matrices (Spatial Frechet matrix & Categorical Gower matrix) will be normalized, specifically MinMaxScale, to have an equal range from 0 to 1 and combine these two distance matrices into one with the equation like below:

$$ D_{final} = w_{Frechet} \cdot D_{Normalized\ Frechet} + w_{Gower} \cdot D_{Normalized\ Gower} $$

Weights may hold different values, but in this project we want to emphasize both data equally, which means that both weights $w_{Frechet}$ and $w_{Gower}$ will be 0.5.

$$ D_{final} = 0.5 \cdot D_{Normalized\ Frechet} + 0.5 \cdot D_{Normalized\ Gower} $$

Reference

  • Gower Distance (https://arxiv.org/ftp/arxiv/papers/2101/2101.02481.pdf)
In [14]:
# Import package for Gower distance
from gower import gower_matrix

# Define the mixed features (categorical + numerical)
X_mix = shot_sequence_df[['subEventName_1', 'subEventName_2', 'opponent_involve', 'shot_accuracy',
                          'progress_dist_12', 'progress_ratio_12', 'event_duration_12', 
                          'progress_dist_23', 'progress_ratio_23', 'event_duration_23']]

# Convert the categorical data into Gower's distance matrix
gower_dist = gower_matrix(X_mix)
gower_dist
Out[14]:
array([[0.        , 0.26367524, 0.2548754 , ..., 0.25610757, 0.2951537 ,
        0.47080293],
       [0.26367524, 0.        , 0.3389019 , ..., 0.0076088 , 0.25355783,
        0.41021228],
       [0.2548754 , 0.3389019 , 0.        , ..., 0.34637523, 0.39244175,
        0.516001  ],
       ...,
       [0.25610757, 0.0076088 , 0.34637523, ..., 0.        , 0.24606653,
        0.4177036 ],
       [0.2951537 , 0.25355783, 0.39244175, ..., 0.24606653, 0.        ,
        0.4182586 ],
       [0.47080293, 0.41021228, 0.516001  , ..., 0.4177036 , 0.4182586 ,
        0.        ]], shape=(5916, 5916), dtype=float32)
In [15]:
# Manual function to scale with MinMaxScaler with symmetric-preserving
# MinMax: Scale to a range of (0, 1)
def minmax_symmetrize(D):
    min_val = np.min(D)
    max_val = np.max(D)
    return (D - min_val) / (max_val - min_val)

# Scale both distance matrices
frechet_normalized = minmax_symmetrize(frechet_dist)
gower_normalized = minmax_symmetrize(gower_dist)

# Combine the matrices with equal weights
w_frechet = 0.5
w_gower = 0.5
D_final = np.array(w_frechet * frechet_normalized + w_gower * gower_normalized)

# Check the combined matrix
D_final
Out[15]:
array([[0.        , 0.26955945, 0.21823385, ..., 0.26294744, 0.27876703,
        0.39998796],
       [0.26955945, 0.        , 0.30202867, ..., 0.02486983, 0.20542291,
        0.31795093],
       [0.21823385, 0.30202867, 0.        , ..., 0.30566024, 0.32687789,
        0.41242341],
       ...,
       [0.26294744, 0.02486983, 0.30566024, ..., 0.        , 0.18389493,
        0.33863858],
       [0.27876703, 0.20542291, 0.32687789, ..., 0.18389493, 0.        ,
        0.33768547],
       [0.39998796, 0.31795093, 0.41242341, ..., 0.33863858, 0.33768547,
        0.        ]], shape=(5916, 5916))
In [16]:
# Define K values and random states
K_range = range(2, 21)
random_states = [207, 430, 437]

# Store average metrics
avg_cost = []
avg_sil = []

# Loop through each K
for k in K_range:
    temp_cost = []
    temp_sil = []

    # Run K-Medoids for each random state
    for rs in random_states:
        model = KMedoids(n_clusters=k, metric="precomputed", random_state=rs)
        labels = model.fit_predict(D_final)

        # Save cost and silhouette
        temp_cost.append(model.inertia_)
        temp_sil.append(silhouette_score(D_final, labels, metric="precomputed"))

    # Append average
    avg_cost.append(np.mean(temp_cost))
    avg_sil.append(np.mean(temp_sil))

# Plot both metrics
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Elbow Plot
axes[0].plot(K_range, avg_cost, marker='o')
axes[0].set_title('Average Elbow Plot (Frechet + Gower Distance)')
axes[0].set_xlabel('Number of Clusters (K)')
axes[0].set_ylabel('Cost (Inertia)')
axes[0].set_xticks(list(K_range))

# Silhouette Score Plot
axes[1].plot(K_range, avg_sil, marker='o', color='green')
axes[1].set_title('Average Silhouette Score (Frechet + Gower Distance)')
axes[1].set_xlabel('Number of Clusters (K)')
axes[1].set_ylabel('Score')
axes[1].set_xticks(list(K_range))

plt.tight_layout()
plt.show()
No description has been provided for this image
In [45]:
# Plot both metrics (2 rows, 1 column layout)
fig, axes = plt.subplots(2, 1, figsize=(7, 10))

# Elbow Plot
axes[0].plot(K_range, avg_cost, marker='o')
axes[0].set_title('Average Elbow Plot (Frechet + Gower Distance)')
axes[0].set_xlabel('Number of Clusters (K)')
axes[0].set_ylabel('Cost (Inertia)')
axes[0].set_xticks(list(K_range))

# Silhouette Score Plot
axes[1].plot(K_range, avg_sil, marker='o', color='green')
axes[1].set_title('Average Silhouette Score (Frechet + Gower Distance)')
axes[1].set_xlabel('Number of Clusters (K)')
axes[1].set_ylabel('Score')
axes[1].set_xticks(list(K_range))

plt.tight_layout()
plt.show()
No description has been provided for this image

Choosing Optimal Number of Clusters (K)¶

  1. $K = 7$ (Main)
  • We can find out that there is a drastic drop from $K = 6$ to $7$ (inertia being relatively flat) with showing an "elbow". Furthermore, the silhouette score skyrockets from $K = 6$ to $7$ but a dramatic drop from $K = 7$ to $8$. Therefore, $K = 7$ would be appropriate candidate to test.
  1. $K = 20$ (Optional)
  • Although the overall performance for $K = 20$ is poor with relatively low silhouette score and no clear elbow, we would still like to test $K = 20$ besides $K = 7$ since there are 20 EPL teams.
In [27]:
# Fit K-Medoids clustering to the distance matrix (Frechet distance only)
km7 = KMedoids(n_clusters=7, metric="precomputed", random_state=430)
km20 = KMedoids(n_clusters=20, metric="precomputed", random_state=430)

km7_pred = km7.fit_predict(D_final)
km20_pred = km20.fit_predict(D_final)

# Create 'Attack pattern'
shot_sequence_df['Attack_Frechet_Gower7'] = km7_pred
shot_sequence_df['Attack_Frechet_Gower20'] = km20_pred

# Check the created clustering labels
shot_sequence_df.head(3)
Out[27]:
eventId_1 eventId_2 eventId_3 subEventName_1 subEventName_2 subEventName_3 tags_1 tags_2 tags_3 playerId_1 ... opponent_involve shot_accuracy progress_dist_12 progress_ratio_12 event_duration_12 progress_dist_23 progress_ratio_23 event_duration_23 Attack_Frechet_Gower7 Attack_Frechet_Gower20
0 7 8 10 Touch Simple pass Shot [{'id': 1302}] [{'id': 302}, {'id': 1801}] [{'id': 402}, {'id': 201}, {'id': 1201}, {'id'... 8433 ... yes accurate 28 0.85 4.36 -1 -0.06 1.22 1 7
1 1 1 10 Ground attacking duel Ground defending duel Shot [{'id': 501}, {'id': 703}, {'id': 1801}] [{'id': 502}, {'id': 701}, {'id': 1802}] [{'id': 402}, {'id': 201}, {'id': 1203}, {'id'... 8980 ... yes accurate 0 0.00 0.31 1 0.14 1.82 2 2
2 8 8 10 Simple pass Simple pass Shot [{'id': 1801}] [{'id': 302}, {'id': 1801}] [{'id': 401}, {'id': 201}, {'id': 1203}, {'id'... 245813 ... no accurate 15 0.37 2.36 6 0.39 1.54 0 10

3 rows × 67 columns

Dimensionality Reduction - UMAP¶

In [46]:
# Package for UMAP
import umap.umap_ as umap

# Use 'precomputed' metric
umap_model = umap.UMAP(metric='precomputed', random_state=430)
embedding = umap_model.fit_transform(D_final)

# Create the DataFrame
df_umap = pd.DataFrame({
    'UMAP1': embedding[:, 0],
    'UMAP2': embedding[:, 1],
    'Cluster_7': shot_sequence_df['Attack_Frechet_Gower7'],
    'Cluster_20': shot_sequence_df['Attack_Frechet_Gower20']
})
In [47]:
# Set up 1x3 subplots for UMAP visualizations
fig, axes = plt.subplots(1, 3, figsize=(21, 6), constrained_layout=True)

# 1. Raw UMAP
sns.scatterplot(data=df_umap, x='UMAP1', y='UMAP2', ax=axes[0], color='gray', s=10)
axes[0].set_title('UMAP (No Cluster Labels)')

# 2. UMAP with 7 clusters
sns.scatterplot(data=df_umap, x='UMAP1', y='UMAP2', hue='Cluster_7', palette='tab10', ax=axes[1], s=10)
axes[1].set_title('UMAP (7 Clusters)')

# Move legend outside
axes[1].legend(title='Cluster', bbox_to_anchor=(1, 1), loc='upper left')

# 3. UMAP with 20 clusters
sns.scatterplot(data=df_umap, x='UMAP1', y='UMAP2', hue='Cluster_20', palette='tab20', ax=axes[2], s=10)
axes[2].set_title('UMAP (20 Clusters)')

# Move and format legend (10 columns × 2 rows)
axes[2].legend(title='Cluster', bbox_to_anchor=(1, 1), loc='upper left', ncol=2)

# Set shared labels
for ax in axes:
    ax.set_xlabel('UMAP1')
    ax.set_ylabel('UMAP2')

plt.suptitle('UMAP Projections with Different Cluster Labels', fontsize=18, y=1.1)
plt.show()
No description has been provided for this image
In [48]:
# Set figure size
plt.figure(figsize=(7, 6))

# UMAP with 7 clusters
sns.scatterplot(data=df_umap, x='UMAP1', y='UMAP2', hue='Cluster_7', palette='tab10', s=10)

# Title and labels
plt.title('UMAP (7 Clusters)')
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')

# Move legend outside
plt.legend(title='Cluster', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()
No description has been provided for this image
In [30]:
# Save as .csv file
shot_sequence_df.to_csv("shot_sequence_df_cluster.csv", index=False)

Analysis 2 - Translating Research Question into Mathematical Equation¶

After creating the latent feature of "attack pattern" by clustering the data, it is time to check the research question again. To answer the research question of "Do different teams in the EPL tend to have their own unique shot event sequence(s)?", it is necessary to translate into a mathematical question. Let $T$ = EPL team and $A$ = Attack pattern (Cluster label from above). The translated question would be:

$$ P(A = a | T = t) = P(A = a)? $$

Mathematical Question to Statistical Test¶

In other words, it is checking whether there is an association between the EPL team and attack pattern (independence). Considering that it is checking the relationship between the two categorical variables, creating a contingency table would be the first step, and statistical tests like chi-squared test or Fisher's exact test would be appropriate to implement afterwards.

In [12]:
# Load the saved csv file
shot_sequence_df = pd.read_csv("shot_sequence_df_cluster.csv")
In [13]:
# Create contingency table for k=7
contingency_frechet_gower7 = pd.crosstab(shot_sequence_df['team_3'], shot_sequence_df['Attack_Frechet_Gower7'])

# Compute total shots per team
total_sum = contingency_frechet_gower7.sum(axis=1)

# Compute proportions (row-wise)
proportions = contingency_frechet_gower7.div(total_sum, axis=0)

# Get max cluster and proportion
max_proportion = proportions.max(axis=1).round(3)
max_cluster = proportions.idxmax(axis=1).astype(int)

# Compute second-highest cluster (use argsort)
# argsort() gives sorted indices (ascending), so we pick second from last [-2]
second_max_cluster = proportions.apply(lambda row: row.sort_values(ascending=False).index[1], axis=1).astype(int)

# Combine results
contingency_frechet_gower7 = contingency_frechet_gower7.assign(
    Total_shots=total_sum,
    Max_cluster=max_cluster,
    Max2_cluster=second_max_cluster,
    Max_proportion=max_proportion
)

# Display the contingency table
contingency_frechet_gower7
Out[13]:
Attack_Frechet_Gower7 0 1 2 3 4 5 6 Total_shots Max_cluster Max2_cluster Max_proportion
team_3
AFC Bournemouth 51 42 31 54 30 45 32 285 3 0 0.189
Arsenal 101 36 60 79 34 33 53 396 0 3 0.255
Brighton & Hove Albion 36 30 24 43 46 37 35 251 4 3 0.183
Burnley 38 24 35 38 51 30 43 259 4 6 0.197
Chelsea 69 44 51 68 42 65 39 378 0 3 0.183
Crystal Palace 42 31 47 48 46 40 35 289 3 2 0.166
Everton 36 19 35 43 43 28 27 231 3 4 0.186
Huddersfield Town 33 29 29 44 38 39 22 234 3 5 0.188
Leicester City 48 27 36 35 45 51 27 269 5 0 0.190
Liverpool 89 35 65 78 43 73 35 418 0 3 0.213
Manchester City 126 42 82 88 41 53 35 467 0 3 0.270
Manchester United 65 28 45 57 24 49 38 306 0 3 0.212
Newcastle United 59 27 38 45 38 31 41 279 0 3 0.211
Southampton 50 30 35 55 37 43 34 284 3 0 0.194
Stoke City 46 26 37 36 46 31 30 252 0 4 0.183
Swansea City 32 9 37 29 27 30 29 193 2 0 0.192
Tottenham Hotspur 86 31 54 60 42 53 34 360 0 3 0.239
Watford 46 27 33 46 59 42 32 285 4 0 0.207
West Bromwich Albion 32 24 40 36 46 29 37 244 4 2 0.189
West Ham United 37 20 38 38 34 36 33 236 2 3 0.161
In [14]:
# Define the grid size
rows, cols = 5, 4
fig, axes = plt.subplots(rows, cols, figsize=(20, 16))
fig.tight_layout(pad=4)

# Flatten the axes for easier iteration
axes = axes.flatten()

# Loop through each team and plot its histogram
for i, team in enumerate(contingency_frechet_gower7.index):
    ax = axes[i]
    
    # Exclude 'Total_Sum', 'Max_proportion', and 'Max_cluster' from plotting
    cluster_counts = contingency_frechet_gower7.loc[team].drop(['Total_shots', 'Max_proportion', 'Max_cluster', 'Max2_cluster'])
    
    # Compute relative frequencies
    relative_freq = cluster_counts / cluster_counts.sum()

    # Convert index to numeric
    cluster_labels = relative_freq.index.astype(int)  

    # Plot bar chart
    ax.bar(cluster_labels, relative_freq.values, color='dodgerblue')
    ax.set_title(f'{team}')
    ax.set_xlabel('Cluster label')
    ax.set_ylabel('Percentage')
    ax.set_xticks(cluster_labels)
    ax.set_xticklabels(cluster_labels, rotation=45)

# Add the main title
fig.suptitle("Distribution of Attack Pattern Cluster by Team (7 Clusters)", fontsize=16, fontweight='bold', y=1)
plt.show()
No description has been provided for this image

Chi-squared & Fisher's Exact Test¶

In [15]:
# Package for Chi-squared tes & Fisher's Exact test
from scipy.stats import chi2_contingency
from scipy.stats import fisher_exact

# Chi-squared test
_, p_value_chisq, _, _ = chi2_contingency(contingency_frechet_gower7)
print(f"P-value (Chi-squared): {p_value_chisq}")

# Fisher's Exact test
_, p_value_fisher = fisher_exact(contingency_frechet_gower7)
print(f"P-value (Fisher's Exact): {p_value_fisher}")
P-value (Chi-squared): 3.104035983995996e-07
P-value (Fisher's Exact): 0.0001

Hypothesis Testing¶

  • Null ($H_0$): There is no association between EPL teams and attack pattern.
  • Alternative ($H_A$): There is an association between EPL teams and attack pattern.

Both Chi-squared and Fisher's exact test yielded significantly small p-values ($3.10 \times e^{-07}$, $0.0001$), indicating that there is an association between EPL teams and their attack patterns.

In [35]:
# Create contingency table for k=20
contingency_frechet_gower20 = pd.crosstab(shot_sequence_df['team_3'], shot_sequence_df['Attack_Frechet_Gower20'])

# Compute total shots per team
total_sum = contingency_frechet_gower20.sum(axis=1)

# Compute proportions (row-wise)
proportions = contingency_frechet_gower20.div(total_sum, axis=0)

# Get max cluster and proportion
max_proportion = proportions.max(axis=1).round(3)
max_cluster = proportions.idxmax(axis=1).astype(int)

# Compute second-highest cluster (use argsort)
# argsort() gives sorted indices (ascending), so we pick second from last [-2]
second_max_cluster = proportions.apply(lambda row: row.sort_values(ascending=False).index[1], axis=1).astype(int)

# Combine results
contingency_frechet_gower20 = contingency_frechet_gower20.assign(
    Total_shots=total_sum,
    Max_cluster=max_cluster,
    Max2_cluster=second_max_cluster,
    Max_proportion=max_proportion
)

# Expand the number of columns displayed in the table
pd.set_option('display.max_columns', None)
contingency_frechet_gower20
Out[35]:
Attack_Frechet_Gower20 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 Total_shots Max_cluster Max2_cluster Max_proportion
team_3
AFC Bournemouth 5 3 34 32 7 11 9 16 2 52 53 4 0 12 16 3 2 3 6 15 285 10 9 0.186
Arsenal 3 8 34 28 31 15 6 18 5 74 95 3 2 24 11 3 6 4 4 22 396 10 9 0.240
Brighton & Hove Albion 4 5 19 30 9 7 7 14 8 38 35 2 3 10 10 8 5 11 4 22 251 9 10 0.151
Burnley 3 14 18 24 11 7 8 14 11 32 38 5 5 18 7 9 3 6 6 20 259 10 9 0.147
Chelsea 2 3 33 57 24 16 16 28 8 58 68 2 3 17 11 4 5 6 5 12 378 10 9 0.180
Crystal Palace 4 5 33 34 23 9 9 16 11 42 38 6 2 11 8 5 6 5 5 17 289 9 10 0.145
Everton 4 9 30 22 13 3 6 12 8 37 33 2 1 14 7 8 7 4 3 8 231 9 10 0.160
Huddersfield Town 5 3 19 24 14 13 10 16 6 37 33 5 3 13 9 3 7 1 4 9 234 9 10 0.158
Leicester City 5 10 24 38 13 14 8 15 2 30 45 1 0 15 12 1 13 9 4 10 269 10 3 0.167
Liverpool 3 12 47 52 24 9 9 21 7 72 87 6 4 18 17 6 7 3 2 12 418 10 9 0.208
Manchester City 2 9 46 43 35 17 16 23 6 84 125 4 2 13 14 2 6 7 1 12 467 10 9 0.268
Manchester United 3 2 37 35 20 7 7 13 6 50 61 2 2 22 15 3 0 5 3 13 306 10 9 0.199
Newcastle United 2 9 23 17 18 6 8 17 8 42 55 5 2 18 13 4 8 0 6 18 279 10 9 0.197
Southampton 5 2 27 23 14 9 13 12 9 48 49 3 0 15 20 4 3 3 5 20 284 10 9 0.173
Stoke City 2 3 33 20 14 10 7 16 9 35 41 3 3 10 8 7 5 5 4 17 252 10 9 0.163
Swansea City 3 2 19 19 18 7 5 5 2 27 33 5 1 14 7 4 4 4 1 13 193 10 9 0.171
Tottenham Hotspur 6 8 39 33 21 10 5 16 6 57 82 6 2 24 15 2 10 4 6 8 360 10 9 0.228
Watford 7 11 18 35 16 8 8 10 11 38 43 2 5 16 14 5 12 5 7 14 285 10 9 0.151
West Bromwich Albion 1 6 23 19 15 14 11 10 14 28 29 2 4 14 13 6 7 4 6 18 244 10 9 0.119
West Ham United 1 6 25 28 17 6 7 12 6 33 36 3 3 18 3 1 5 5 6 15 236 10 9 0.153
In [36]:
# Define the grid size
rows, cols = 5, 4
fig, axes = plt.subplots(rows, cols, figsize=(20, 16))
fig.tight_layout(pad=4)

# Flatten the axes for easier iteration
axes = axes.flatten()

# Loop through each team and plot its histogram
for i, team in enumerate(contingency_frechet_gower20.index):
    ax = axes[i]
    
    # Exclude 'Total_Sum', 'Max_proportion', and 'Max_cluster' from plotting
    cluster_counts = contingency_frechet_gower20.loc[team].drop(['Total_shots', 'Max_proportion', 'Max_cluster', 'Max2_cluster'])
    
    # Compute relative frequencies
    relative_freq = cluster_counts / cluster_counts.sum()

    # Convert index to numeric
    cluster_labels = relative_freq.index.astype(int)  

    # Plot bar chart
    ax.bar(cluster_labels, relative_freq.values, color='dodgerblue')
    ax.set_title(f'{team}')
    ax.set_xlabel('Cluster label')
    ax.set_ylabel('Percentage')
    ax.set_xticks(cluster_labels)
    ax.set_xticklabels(cluster_labels, rotation=45)

# Add the main title
fig.suptitle("Distribution of Attack Pattern Cluster by Team (20 Clusters)", fontsize=16, fontweight='bold', y=1)
plt.show()
No description has been provided for this image
In [37]:
# Chi-squared test
_, p_value_chisq, _, _ = chi2_contingency(contingency_frechet_gower20)
print(f"P-value (Chi-squared): {p_value_chisq}")

# Fisher's exact test
_, p_value_fisher = fisher_exact(contingency_frechet_gower20)
print(f"P-value (Fisher's Exact): {p_value_fisher}")
P-value (Chi-squared): 0.00011950401648507138
P-value (Fisher's Exact): 1.0

Further Analysis¶

1. Big 6 Teams' Attack Sequence¶

The tradition of so called "Big 6 Clubs" in the EPL are Arsenal, Chelsea, Liverpool, Manchester City, Manchester United and Tottenham Hotspur. The 17-18 EPL league table of these teams was like below:

  1. Manchester City
  2. Manchester United
  3. Tottenham Hotspur
  4. Liverpool
  5. Chelsea
  6. Arsenal

One thing to note is that all big 6 clubs have the same cluster distributions, cluster 0 being the most followed by 3.

In [16]:
# Define the big 6 teams in EPL and filter only these teams' shot sequences
big6_teams = ['Manchester City', 'Manchester United', 'Liverpool', 'Arsenal', 'Chelsea', 'Tottenham Hotspur']
big6 = shot_sequence_df[shot_sequence_df['team_3'].isin(big6_teams)]

# Extract the attack pattern by cluster
manc_row0 = big6[(big6['team_3'] == 'Manchester City') & (big6['Attack_Frechet_Gower7'] == 0)][['x_1', 'x_2', 'x_3', 'y_1', 'y_2', 'y_3']].mean()
manu_row0 = big6[(big6['team_3'] == 'Manchester United') & (big6['Attack_Frechet_Gower7'] == 0)][['x_1', 'x_2', 'x_3', 'y_1', 'y_2', 'y_3']].mean()
tot_row0 = big6[(big6['team_3'] == 'Tottenham Hotspur') & (big6['Attack_Frechet_Gower7'] == 0)][['x_1', 'x_2', 'x_3', 'y_1', 'y_2', 'y_3']].mean()
liv_row0 = big6[(big6['team_3'] == 'Liverpool') & (big6['Attack_Frechet_Gower7'] == 0)][['x_1', 'x_2', 'x_3', 'y_1', 'y_2', 'y_3']].mean()
chel_row0 = big6[(big6['team_3'] == 'Chelsea') & (big6['Attack_Frechet_Gower7'] == 0)][['x_1', 'x_2', 'x_3', 'y_1', 'y_2', 'y_3']].mean()
ars_row0 = big6[(big6['team_3'] == 'Arsenal') & (big6['Attack_Frechet_Gower7'] == 0)][['x_1', 'x_2', 'x_3', 'y_1', 'y_2', 'y_3']].mean()

# Extract x and y coordinates for each of the 3 points
manc_x0 = [manc_row0['x_1'], manc_row0['x_2'], manc_row0['x_3']]
manc_y0 = [manc_row0['y_1'], manc_row0['y_2'], manc_row0['y_3']]

manu_x0 = [manu_row0['x_1'], manu_row0['x_2'], manu_row0['x_3']]
manu_y0 = [manu_row0['y_1'], manu_row0['y_2'], manu_row0['y_3']]

tot_x0 = [tot_row0['x_1'], tot_row0['x_2'], tot_row0['x_3']]
tot_y0 = [tot_row0['y_1'], tot_row0['y_2'], tot_row0['y_3']]

liv_x0 = [liv_row0['x_1'], liv_row0['x_2'], liv_row0['x_3']]
liv_y0 = [liv_row0['y_1'], liv_row0['y_2'], liv_row0['y_3']]

chel_x0 = [chel_row0['x_1'], chel_row0['x_2'], chel_row0['x_3']]
chel_y0 = [chel_row0['y_1'], chel_row0['y_2'], chel_row0['y_3']]

ars_x0 = [ars_row0['x_1'], ars_row0['x_2'], ars_row0['x_3']]
ars_y0 = [ars_row0['y_1'], ars_row0['y_2'], ars_row0['y_3']]
In [17]:
# Create a pitch
f = draw_pitch("#195905", "#f3efec", "h", "full")

# Plot each category
plt.plot(manc_x0, manc_y0, color='cyan', linestyle='-', linewidth=1, markersize=4, label="Man City")
plt.plot(manu_x0, manu_y0, color='red', linestyle='-', linewidth=1, markersize=4, label="Man United")
plt.plot(tot_x0, tot_y0, color='grey', linestyle='-', linewidth=1, markersize=4, label="Tottenham")
plt.plot(liv_x0, liv_y0, color='black', linestyle='-', linewidth=1, markersize=4, label="Liverpool")
plt.plot(chel_x0, chel_y0, color='dodgerblue', linestyle='-', linewidth=1, markersize=4, label="Chelsea")
plt.plot(ars_x0, ars_y0, color='yellow', linestyle='-', linewidth=1, markersize=4, label="Arsenal")

# Spatial Coordinates (Highlight the last event with star)
plt.text(manc_x0[0], manc_y0[0], '1', fontsize=8, color='cyan', fontweight='bold', ha='right', va='top')
plt.text(manc_x0[1], manc_y0[1], '2', fontsize=8, color='cyan', fontweight='bold', ha='right', va='top')
plt.scatter(manc_x0[2], manc_y0[2], marker='*', c='cyan', s=25, zorder=13)

plt.text(manu_x0[0], manu_y0[0], '1', fontsize=8, color='red', fontweight='bold', ha='right', va='top')
plt.text(manu_x0[1], manu_y0[1], '2', fontsize=8, color='red', fontweight='bold', ha='right', va='top')
plt.scatter(manu_x0[2], manu_y0[2], marker='*', c='red', s=25, zorder=13)

plt.text(tot_x0[0], tot_y0[0], '1', fontsize=8, color='grey', fontweight='bold', ha='right', va='top')
plt.text(tot_x0[1], tot_y0[1], '2', fontsize=8, color='grey', fontweight='bold', ha='right', va='top')
plt.scatter(tot_x0[2], tot_y0[2], marker='*', c='grey', s=25, zorder=13)

plt.text(liv_x0[0], liv_y0[0], '1', fontsize=8, color='black', fontweight='bold', ha='right', va='top')
plt.text(liv_x0[1], liv_y0[1], '2', fontsize=8, color='black', fontweight='bold', ha='right', va='top')
plt.scatter(liv_x0[2], liv_y0[2], marker='*', c='black', s=25, zorder=13)

plt.text(chel_x0[0], chel_y0[0], '1', fontsize=8, color='dodgerblue', fontweight='bold', ha='right', va='top')
plt.text(chel_x0[1], chel_y0[1], '2', fontsize=8, color='dodgerblue', fontweight='bold', ha='right', va='top')
plt.scatter(chel_x0[2], chel_y0[2], marker='*', c='dodgerblue', s=25, zorder=13)

plt.text(ars_x0[0], ars_y0[0], '1', fontsize=8, color='yellow', fontweight='bold', ha='right', va='top')
plt.text(ars_x0[1], ars_y0[1], '2', fontsize=8, color='yellow', fontweight='bold', ha='right', va='top')
plt.scatter(ars_x0[2], ars_y0[2], marker='*', c='yellow', s=25, zorder=13)

# Annotation arrow for attack direction
plt.annotate("", xy=(25, -3), xytext=(5, -5), arrowprops=dict(arrowstyle="->", linewidth=2))
plt.text(7, -4, 'Attack (---------->)', fontsize=15)
plt.title("Average Attack Sequence of Big 6 Teams (Cluster 0)")
plt.legend(title="Team", bbox_to_anchor=(1, 1), loc='upper left')
plt.show()
No description has been provided for this image
In [18]:
# Table group by event combinations - Big 6 & Cluster 0
(shot_sequence_df[
        (shot_sequence_df['team_3'].isin(big6_teams)) &
        (shot_sequence_df['Attack_Frechet_Gower7'] == 0)
    ]
    .groupby(['subEventName_1', 'subEventName_2'])
    .agg(count=('shot_accuracy', 'size'),
         goal_count=('tags_3', lambda x: x.str.contains(r'\b101\b', regex=True, na=False).sum()),
         accuracy_rate=('shot_accuracy', lambda x: (x == 'accurate').mean().round(3)))
    .sort_values(by='count', ascending=False)
    .query('count >= 5')
    .assign(goal_percentage=lambda df: (df['goal_count'] / df['count']).round(2))
    .loc[:, ['count', 'goal_count', 'goal_percentage', 'accuracy_rate']])
Out[18]:
count goal_count goal_percentage accuracy_rate
subEventName_1 subEventName_2
Simple pass Simple pass 156 31 0.20 1.0
Smart pass 59 20 0.34 1.0
Cross 51 29 0.57 1.0
High pass 27 9 0.33 1.0
Ground attacking duel Simple pass 22 8 0.36 1.0
Touch Simple pass 16 6 0.38 1.0
Acceleration Smart pass 15 8 0.53 1.0
Smart pass Cross 14 8 0.57 1.0
Acceleration Simple pass 13 4 0.31 1.0
High pass Simple pass 12 5 0.42 1.0
Smart pass Simple pass 12 9 0.75 1.0
Ground attacking duel Cross 12 7 0.58 1.0
Smart pass 11 5 0.45 1.0
Simple pass Touch 10 0 0.00 1.0
Acceleration 9 0 0.00 1.0
Cross Simple pass 7 4 0.57 1.0
High pass Cross 5 4 0.80 1.0
Acceleration Cross 5 2 0.40 1.0
In [19]:
# Separate by Team - Big 6 & Cluster 0
(shot_sequence_df[(shot_sequence_df['team_3'].isin(big6_teams)) &
                  (shot_sequence_df['Attack_Frechet_Gower7'] == 0)]
    .groupby(['team_3', 'subEventName_1', 'subEventName_2'])
    .agg(count=('shot_accuracy', 'size'),
         goal_count=('tags_3', lambda x: x.str.contains(r'\b101\b', regex=True, na=False).sum()),
         accuracy_rate=('shot_accuracy', lambda x: (x == 'accurate').mean().round(3)))
    .query('count >= 3')
    .assign(goal_percentage=lambda df: (df['goal_count'] / df['count']).round(2))
    .loc[:, ['count', 'goal_count', 'goal_percentage', 'accuracy_rate']]
    .sort_values(['team_3', 'count'], ascending=[True, False]))
Out[19]:
count goal_count goal_percentage accuracy_rate
team_3 subEventName_1 subEventName_2
Arsenal Simple pass Simple pass 38 8 0.21 1.0
Smart pass 12 5 0.42 1.0
Cross 6 2 0.33 1.0
Ground attacking duel Simple pass 5 2 0.40 1.0
Touch Simple pass 5 3 0.60 1.0
Smart pass Cross 4 2 0.50 1.0
Ground attacking duel Smart pass 3 1 0.33 1.0
Simple pass High pass 3 0 0.00 1.0
Smart pass Simple pass 3 1 0.33 1.0
Chelsea Simple pass Simple pass 20 2 0.10 1.0
Smart pass 8 2 0.25 1.0
Cross 6 1 0.17 1.0
High pass 5 2 0.40 1.0
Ground attacking duel Simple pass 3 1 0.33 1.0
Simple pass Acceleration 3 0 0.00 1.0
Touch 3 0 0.00 1.0
Liverpool Simple pass Simple pass 21 4 0.19 1.0
Cross 11 6 0.55 1.0
Smart pass 10 6 0.60 1.0
High pass 6 1 0.17 1.0
Ground attacking duel Cross 4 2 0.50 1.0
Acceleration Simple pass 3 2 0.67 1.0
Ground attacking duel Simple pass 3 2 0.67 1.0
High pass Simple pass 3 0 0.00 1.0
Manchester City Simple pass Simple pass 33 9 0.27 1.0
Cross 18 12 0.67 1.0
Smart pass 13 4 0.31 1.0
High pass 8 5 0.62 1.0
Smart pass Cross 5 2 0.40 1.0
Simple pass 5 4 0.80 1.0
Acceleration Simple pass 4 1 0.25 1.0
Ground attacking duel Simple pass 4 0 0.00 1.0
Acceleration Smart pass 3 2 0.67 1.0
Corner Simple pass 3 2 0.67 1.0
Ground attacking duel Smart pass 3 2 0.67 1.0
High pass Simple pass 3 1 0.33 1.0
Touch Simple pass 3 1 0.33 1.0
Manchester United Simple pass Simple pass 22 6 0.27 1.0
Smart pass 5 2 0.40 1.0
Acceleration Smart pass 3 2 0.67 1.0
Ground attacking duel Cross 3 2 0.67 1.0
Simple pass 3 2 0.67 1.0
Tottenham Hotspur Simple pass Simple pass 22 2 0.09 1.0
Smart pass 11 1 0.09 1.0
Cross 8 7 0.88 1.0
Acceleration Smart pass 6 4 0.67 1.0
Ground attacking duel Simple pass 4 1 0.25 1.0
Simple pass High pass 3 1 0.33 1.0
Smart pass Simple pass 3 3 1.00 1.0
Touch Simple pass 3 0 0.00 1.0

For the sequence pattern of cluster 0, all big 6 teams showed similar attack sequence trajectories. The average attack sequences were relatively linear occurred at the mid-left side (attack team perspective) with the final shot made nearly at the penalty box line. The most attempted pattern for opportunity shot was having both two previous events being simple pass (Simple pass - Simple pass) with all final shots being accurate. For more details by each team:

  1. Arsenal: Simple pass - Smart pass (42%) / Touch - Simple pass (60%)
  2. Chelsea: Simple pass - High pass (40%) / Relatively low or 0% success rate of goal for other patterns
  3. Liverpool: Simple pass - Cross (55%) / Simple pass - Smart pass (60%) / Relatively high percentage of goals
  4. Manchester City: Simple pass - Cross (67%) / Simple pass - High pass (62%) /Smart pass - Simple pass (80%) / Most various attack sequence patterns
  5. Manchester United: Simple pass - Smart pass (40%) / Least diverse attack sequence patterns
  6. Tottenham Hotspur: Simple pass - Cross (88%) / Acceleration - Smart pass (67%) / Attack sequence tend to rely on players' height & speed
In [20]:
# Extract the attack pattern by second max cluster
manc_row3 = big6[(big6['team_3'] == 'Manchester City') & (big6['Attack_Frechet_Gower7'] == 3)][['x_1', 'x_2', 'x_3', 'y_1', 'y_2', 'y_3']].mean()
manu_row3 = big6[(big6['team_3'] == 'Manchester United') & (big6['Attack_Frechet_Gower7'] == 3)][['x_1', 'x_2', 'x_3', 'y_1', 'y_2', 'y_3']].mean()
tot_row3 = big6[(big6['team_3'] == 'Tottenham Hotspur') & (big6['Attack_Frechet_Gower7'] == 3)][['x_1', 'x_2', 'x_3', 'y_1', 'y_2', 'y_3']].mean()
liv_row3 = big6[(big6['team_3'] == 'Liverpool') & (big6['Attack_Frechet_Gower7'] == 3)][['x_1', 'x_2', 'x_3', 'y_1', 'y_2', 'y_3']].mean()
chel_row3 = big6[(big6['team_3'] == 'Chelsea') & (big6['Attack_Frechet_Gower7'] == 3)][['x_1', 'x_2', 'x_3', 'y_1', 'y_2', 'y_3']].mean()
ars_row3 = big6[(big6['team_3'] == 'Arsenal') & (big6['Attack_Frechet_Gower7'] == 3)][['x_1', 'x_2', 'x_3', 'y_1', 'y_2', 'y_3']].mean()

# Extract x and y coordinates for each of the 3 points
manc_x3 = [manc_row3['x_1'], manc_row3['x_2'], manc_row3['x_3']]
manc_y3 = [manc_row3['y_1'], manc_row3['y_2'], manc_row3['y_3']]

manu_x3 = [manu_row3['x_1'], manu_row3['x_2'], manu_row3['x_3']]
manu_y3 = [manu_row3['y_1'], manu_row3['y_2'], manu_row3['y_3']]

tot_x3 = [tot_row3['x_1'], tot_row3['x_2'], tot_row3['x_3']]
tot_y3 = [tot_row3['y_1'], tot_row3['y_2'], tot_row3['y_3']]

liv_x3 = [liv_row3['x_1'], liv_row3['x_2'], liv_row3['x_3']]
liv_y3 = [liv_row3['y_1'], liv_row3['y_2'], liv_row3['y_3']]

chel_x3 = [chel_row3['x_1'], chel_row3['x_2'], chel_row3['x_3']]
chel_y3 = [chel_row3['y_1'], chel_row3['y_2'], chel_row3['y_3']]

ars_x3 = [ars_row3['x_1'], ars_row3['x_2'], ars_row3['x_3']]
ars_y3 = [ars_row3['y_1'], ars_row3['y_2'], ars_row3['y_3']]
In [21]:
# Create a pitch
f = draw_pitch("#195905", "#f3efec", "h", "full")

# Plot each category (Second Max Cluster)
plt.plot(manc_x3, manc_y3, color='cyan', linestyle='-', linewidth=1, markersize=4, label="Man City")
plt.plot(manu_x3, manu_y3, color='red', linestyle='-', linewidth=1, markersize=4, label="Man United")
plt.plot(tot_x3, tot_y3, color='grey', linestyle='-', linewidth=1, markersize=4, label="Tottenham")
plt.plot(liv_x3, liv_y3, color='black', linestyle='-', linewidth=1, markersize=4, label="Liverpool")
plt.plot(chel_x3, chel_y3, color='dodgerblue', linestyle='-', linewidth=1, markersize=4, label="Chelsea")
plt.plot(ars_x3, ars_y3, color='yellow', linestyle='-', linewidth=1, markersize=4, label="Arsenal")

# Spatial Coordinates (Highlight the last event with star)
plt.text(manc_x3[0], manc_y3[0], '1', fontsize=8, color='cyan', fontweight='bold', ha='right', va='top')
plt.text(manc_x3[1], manc_y3[1], '2', fontsize=8, color='cyan', fontweight='bold', ha='right', va='top')
plt.scatter(manc_x3[2], manc_y3[2], marker='*', c='cyan', s=25, zorder=13)

plt.text(manu_x3[0], manu_y3[0], '1', fontsize=8, color='red', fontweight='bold', ha='right', va='top')
plt.text(manu_x3[1], manu_y3[1], '2', fontsize=8, color='red', fontweight='bold', ha='right', va='top')
plt.scatter(manu_x3[2], manu_y3[2], marker='*', c='red', s=25, zorder=13)

plt.text(tot_x3[0], tot_y3[0], '1', fontsize=8, color='grey', fontweight='bold', ha='right', va='top')
plt.text(tot_x3[1], tot_y3[1], '2', fontsize=8, color='grey', fontweight='bold', ha='right', va='top')
plt.scatter(tot_x3[2], tot_y3[2], marker='*', c='grey', s=25, zorder=13)

plt.text(liv_x3[0], liv_y3[0], '1', fontsize=8, color='black', fontweight='bold', ha='right', va='top')
plt.text(liv_x3[1], liv_y3[1], '2', fontsize=8, color='black', fontweight='bold', ha='right', va='top')
plt.scatter(liv_x3[2], liv_y3[2], marker='*', c='black', s=25, zorder=13)

plt.text(chel_x3[0], chel_y3[0], '1', fontsize=8, color='dodgerblue', fontweight='bold', ha='right', va='top')
plt.text(chel_x3[1], chel_y3[1], '2', fontsize=8, color='dodgerblue', fontweight='bold', ha='right', va='top')
plt.scatter(chel_x3[2], chel_y3[2], marker='*', c='dodgerblue', s=25, zorder=13)

plt.text(ars_x3[0], ars_y3[0], '1', fontsize=8, color='yellow', fontweight='bold', ha='right', va='top')
plt.text(ars_x3[1], ars_y3[1], '2', fontsize=8, color='yellow', fontweight='bold', ha='right', va='top')
plt.scatter(ars_x3[2], ars_y3[2], marker='*', c='yellow', s=25, zorder=13)

# Annotation arrow for attack direction
plt.annotate("", xy=(25, -3), xytext=(5, -5), arrowprops=dict(arrowstyle="->", linewidth=2))
plt.text(7, -4, 'Attack (---------->)', fontsize=15)
plt.title("Average Attack Sequence of Big 6 Teams (Cluster 3)")
plt.legend(title="Big 6 Teams", bbox_to_anchor=(1, 1), loc='upper left')
plt.show()
No description has been provided for this image
In [22]:
# Table group by event combinations - Big 6 & Cluster 3
(shot_sequence_df[
        (shot_sequence_df['team_3'].isin(big6_teams)) &
        (shot_sequence_df['Attack_Frechet_Gower7'] == 3)
    ]
    .groupby(['subEventName_1', 'subEventName_2'])
    .agg(count=('shot_accuracy', 'size'),
         goal_count=('tags_3', lambda x: x.str.contains(r'\b101\b', regex=True, na=False).sum()),
         accuracy_rate=('shot_accuracy', lambda x: (x == 'accurate').mean().round(3)))
    .sort_values(by='count', ascending=False)
    .query('count >= 5')
    .assign(goal_percentage=lambda df: (df['goal_count'] / df['count']).round(2))
    .loc[:, ['count', 'goal_count', 'goal_percentage', 'accuracy_rate']])
Out[22]:
count goal_count goal_percentage accuracy_rate
subEventName_1 subEventName_2
Simple pass Simple pass 134 0 0.0 0.0
Cross 54 0 0.0 0.0
Smart pass 32 0 0.0 0.0
High pass 18 0 0.0 0.0
Smart pass Simple pass 15 0 0.0 0.0
High pass Simple pass 12 0 0.0 0.0
Touch Simple pass 12 0 0.0 0.0
Simple pass Touch 12 0 0.0 0.0
High pass Cross 11 0 0.0 0.0
Ground attacking duel Simple pass 11 0 0.0 0.0
Smart pass Cross 11 0 0.0 0.0
Ground attacking duel Cross 10 0 0.0 0.0
Ground loose ball duel Simple pass 7 0 0.0 0.0
Simple pass Acceleration 7 0 0.0 0.0
Cross Air duel 7 0 0.0 0.0
Corner Simple pass 6 0 0.0 0.0
Acceleration Simple pass 6 0 0.0 0.0
Touch High pass 5 0 0.0 0.0
Cross Simple pass 5 0 0.0 0.0
In [23]:
# Separate by Team - Big 6 & Cluster 3
(shot_sequence_df[(shot_sequence_df['team_3'].isin(big6_teams)) &
                  (shot_sequence_df['Attack_Frechet_Gower7'] == 3)]
    .groupby(['team_3', 'subEventName_1', 'subEventName_2'])
    .agg(count=('shot_accuracy', 'size'),
         goal_count=('tags_3', lambda x: x.str.contains(r'\b101\b', regex=True, na=False).sum()),
         accuracy_rate=('shot_accuracy', lambda x: (x == 'accurate').mean().round(3)))
    .query('count >= 3')
    .assign(goal_percentage=lambda df: (df['goal_count'] / df['count']).round(2))
    .loc[:, ['count', 'goal_count', 'goal_percentage', 'accuracy_rate']]
    .sort_values(['team_3', 'count'], ascending=[True, False]))
Out[23]:
count goal_count goal_percentage accuracy_rate
team_3 subEventName_1 subEventName_2
Arsenal Simple pass Simple pass 32 0 0.0 0.0
Cross 7 0 0.0 0.0
Smart pass 6 0 0.0 0.0
Smart pass Simple pass 5 0 0.0 0.0
Ground attacking duel Cross 3 0 0.0 0.0
High pass Cross 3 0 0.0 0.0
Simple pass High pass 3 0 0.0 0.0
Smart pass Cross 3 0 0.0 0.0
Chelsea Simple pass Simple pass 16 0 0.0 0.0
Cross 7 0 0.0 0.0
Smart pass 6 0 0.0 0.0
Ground loose ball duel Simple pass 4 0 0.0 0.0
High pass Cross 4 0 0.0 0.0
Corner Simple pass 3 0 0.0 0.0
Cross Air duel 3 0 0.0 0.0
High pass Simple pass 3 0 0.0 0.0
Liverpool Simple pass Simple pass 20 0 0.0 0.0
Cross 13 0 0.0 0.0
Smart pass 5 0 0.0 0.0
Touch 4 0 0.0 0.0
Ground attacking duel Cross 3 0 0.0 0.0
Simple pass 3 0 0.0 0.0
Simple pass High pass 3 0 0.0 0.0
Manchester City Simple pass Simple pass 31 0 0.0 0.0
Cross 8 0 0.0 0.0
Smart pass 8 0 0.0 0.0
Smart pass Cross 6 0 0.0 0.0
Touch Simple pass 5 0 0.0 0.0
High pass Simple pass 4 0 0.0 0.0
Simple pass High pass 4 0 0.0 0.0
Touch 4 0 0.0 0.0
Corner Simple pass 3 0 0.0 0.0
Manchester United Simple pass Simple pass 20 0 0.0 0.0
Cross 6 0 0.0 0.0
Smart pass Simple pass 3 0 0.0 0.0
Tottenham Hotspur Simple pass Simple pass 15 0 0.0 0.0
Cross 13 0 0.0 0.0
Smart pass 5 0 0.0 0.0
High pass 4 0 0.0 0.0
Ground attacking duel Simple pass 3 0 0.0 0.0
Simple pass Touch 3 0 0.0 0.0

On the other hand, the attack sequences segmented as cluster 3 are showing slightly different spatial trajectories, happened closer to the left side line with flatten hat (^) shape. Also, the final event (shot) were all inaccurate with 0% of goal conversion rate. Equivalently, all the big 6 teams had the most attempts for the attack sequence pattern of Simple pass - Simple pass.

Combined Cluster Stats¶

In [24]:
# Table group by event combinations - Big 6 & Cluster 0 / 3
(shot_sequence_df[
        (shot_sequence_df['team_3'].isin(big6_teams)) &
        ((shot_sequence_df['Attack_Frechet_Gower7'] == 0) | (shot_sequence_df['Attack_Frechet_Gower7'] == 3))
    ]
    .groupby(['subEventName_1', 'subEventName_2'])
    .agg(count=('shot_accuracy', 'size'),
         goal_count=('tags_3', lambda x: x.str.contains(r'\b101\b', regex=True, na=False).sum()),
         accuracy_rate=('shot_accuracy', lambda x: (x == 'accurate').mean().round(3)))
    .sort_values(by='count', ascending=False)
    .query('count >= 10')
    .assign(goal_percentage=lambda df: (df['goal_count'] / df['count']).round(2))
    .loc[:, ['count', 'goal_count', 'goal_percentage', 'accuracy_rate']])
Out[24]:
count goal_count goal_percentage accuracy_rate
subEventName_1 subEventName_2
Simple pass Simple pass 290 31 0.11 0.538
Cross 105 29 0.28 0.486
Smart pass 91 20 0.22 0.648
High pass 45 9 0.20 0.600
Ground attacking duel Simple pass 33 8 0.24 0.667
Touch Simple pass 28 6 0.21 0.571
Smart pass Simple pass 27 9 0.33 0.444
Cross 25 8 0.32 0.560
High pass Simple pass 24 5 0.21 0.500
Simple pass Touch 22 0 0.00 0.455
Ground attacking duel Cross 22 7 0.32 0.545
Acceleration Simple pass 19 4 0.21 0.684
Smart pass 19 8 0.42 0.789
High pass Cross 16 4 0.25 0.312
Simple pass Acceleration 16 0 0.00 0.562
Cross Simple pass 12 4 0.33 0.583
Ground attacking duel Smart pass 11 5 0.45 1.000
In [25]:
# Separate by Team - Big 6 & Cluster 0 / 3
(shot_sequence_df[(shot_sequence_df['team_3'].isin(big6_teams)) &
                  ((shot_sequence_df['Attack_Frechet_Gower7'] == 0) | (shot_sequence_df['Attack_Frechet_Gower7'] == 3))]
    .groupby(['team_3', 'subEventName_1', 'subEventName_2'])
    .agg(count=('shot_accuracy', 'size'),
         goal_count=('tags_3', lambda x: x.str.contains(r'\b101\b', regex=True, na=False).sum()),
         accuracy_rate=('shot_accuracy', lambda x: (x == 'accurate').mean().round(3)))
    .query('count >= 5')
    .assign(goal_percentage=lambda df: (df['goal_count'] / df['count']).round(2))
    .loc[:, ['count', 'goal_count', 'goal_percentage', 'accuracy_rate']]
    .sort_values(['team_3', 'count'], ascending=[True, False]))
Out[25]:
count goal_count goal_percentage accuracy_rate
team_3 subEventName_1 subEventName_2
Arsenal Simple pass Simple pass 70 8 0.11 0.543
Smart pass 18 5 0.28 0.667
Cross 13 2 0.15 0.462
Smart pass Simple pass 8 1 0.12 0.375
Cross 7 2 0.29 0.571
Touch Simple pass 7 3 0.43 0.714
Ground attacking duel Simple pass 6 2 0.33 0.833
Simple pass High pass 6 0 0.00 0.500
High pass Cross 5 2 0.40 0.400
Chelsea Simple pass Simple pass 36 2 0.06 0.556
Smart pass 14 2 0.14 0.571
Cross 13 1 0.08 0.462
High pass 7 2 0.29 0.714
Ground attacking duel Simple pass 5 1 0.20 0.600
Simple pass Acceleration 5 0 0.00 0.600
Liverpool Simple pass Simple pass 41 4 0.10 0.512
Cross 24 6 0.25 0.458
Smart pass 15 6 0.40 0.667
High pass 9 1 0.11 0.667
Ground attacking duel Cross 7 2 0.29 0.571
Simple pass 6 2 0.33 0.500
Simple pass Touch 6 0 0.00 0.333
Acceleration Simple pass 5 2 0.40 0.600
Manchester City Simple pass Simple pass 64 9 0.14 0.516
Cross 26 12 0.46 0.692
Smart pass 21 4 0.19 0.619
High pass 12 5 0.42 0.667
Smart pass Cross 11 2 0.18 0.455
Touch Simple pass 8 1 0.12 0.375
High pass Simple pass 7 1 0.14 0.429
Smart pass Simple pass 7 4 0.57 0.714
Acceleration Simple pass 6 1 0.17 0.667
Corner Simple pass 6 2 0.33 0.500
Ground attacking duel Simple pass 5 0 0.00 0.800
Simple pass Touch 5 0 0.00 0.200
Manchester United Simple pass Simple pass 42 6 0.14 0.524
Cross 8 1 0.12 0.250
Smart pass 7 2 0.29 0.714
Ground attacking duel Cross 5 2 0.40 0.600
Tottenham Hotspur Simple pass Simple pass 37 2 0.05 0.595
Cross 21 7 0.33 0.381
Smart pass 16 1 0.06 0.688
Acceleration Smart pass 8 4 0.50 0.750
Ground attacking duel Simple pass 7 1 0.14 0.571
Simple pass High pass 7 1 0.14 0.429
Touch 5 0 0.00 0.400
Smart pass Simple pass 5 3 0.60 0.600

The stats for combined clusters of attack sequences are showing similar trend from the cluster 0 analysis:

  1. Arsenal
  • Simple pass - Smart pass (Goal Conversion: 28%, Shot accuracy: 54%)
  • Touch - Simple pass (Goal Conversion: 43%, Shot accuracy: 71%)
  1. Chelsea
  • Simple pass - High pass (Goal Conversion: 29%, Shot accuracy: 71%)
  • Still relatively low or 0% goal conversion rate for other patterns
  1. Liverpool
  • Simple pass - Cross (Goal Conversion: 25%, Shot accuracy: 46%)
  • Simple pass - Smart pass (Goal Conversion: 40%, Shot accuracy: 67%)
  1. Manchester City
  • Simple pass - Cross (Goal Conversion: 46%, Shot accuracy: 69%)
  • Simple pass - High pass (Goal Conversion: 42%, Shot accuracy: 67%)
  • Smart pass - Simple pass (Goal Conversion: 57%, Shot accuracy: 71%)
  • Most various attack sequence patterns
  1. Manchester United
  • Simple pass - Smart pass (Goal Conversion: 29%, Shot accuracy: 71%)
  • Least diverse attack sequence patterns
  1. Tottenham Hotspur
  • Simple pass - Cross (Goal Conversion: 33%, Shot accuracy: 38%)
  • Acceleration - Smart pass (Goal Conversion: 50%, Shot accuracy: 75%)
  • Attack sequence tend to rely on players' height & speed

Summary Statistics (Average) of Numerical Features¶

  • Standard deviation in parentheses

1. Cluster 0¶

In [26]:
# Summary statistics table for Big 6 clubs (Cluster 0)
# Define numerical features
cols = ['progress_dist_12', 'progress_ratio_12', 
        'progress_dist_23', 'progress_ratio_23', 
        'event_duration_12', 'event_duration_23']

# Filter the data
filtered_df = shot_sequence_df[
    (shot_sequence_df['team_3'].isin(big6_teams)) &
    (shot_sequence_df['Attack_Frechet_Gower7'] == 0)
]

# Count observations per team
team_counts = filtered_df.groupby('team_3').size().rename("count")

# Group by and calculate mean and std
agg_df = filtered_df.groupby('team_3')[cols].agg(['mean', 'std'])

# Flatten the MultiIndex columns
agg_df.columns = ['_'.join(col).strip() for col in agg_df.columns.values]

# Create formatted DataFrame with "mean (std)"
formatted_df = pd.DataFrame(index=agg_df.index)

for col in cols:
    mean_col = f"{col}_mean"
    std_col = f"{col}_std"
    formatted_df[col] = agg_df[mean_col].round(2).astype(str) + " (" + agg_df[std_col].round(2).astype(str) + ")"

# Insert 'count' as the first column
formatted_df.insert(0, 'count', team_counts)

# Reset index and rename properly
formatted_df = formatted_df.reset_index().rename(columns={'team_3': 'Team'})
formatted_df.style.hide(axis='index')
Out[26]:
Team count progress_dist_12 progress_ratio_12 progress_dist_23 progress_ratio_23 event_duration_12 event_duration_23
Arsenal 101 8.93 (13.44) 0.33 (0.58) 7.1 (11.42) 0.36 (0.52) 2.16 (1.45) 1.69 (0.82)
Chelsea 69 9.36 (17.64) 0.32 (0.55) 12.14 (13.3) 0.4 (0.44) 2.21 (1.28) 2.08 (1.33)
Liverpool 89 10.11 (14.75) 0.46 (0.52) 11.07 (13.05) 0.41 (0.46) 2.23 (1.33) 1.98 (1.01)
Manchester City 126 9.94 (14.82) 0.42 (0.58) 6.75 (13.08) 0.24 (0.49) 2.4 (1.22) 1.84 (0.93)
Manchester United 65 6.98 (13.4) 0.29 (0.51) 8.52 (11.53) 0.4 (0.43) 2.17 (1.36) 1.83 (0.9)
Tottenham Hotspur 86 11.45 (15.64) 0.43 (0.53) 8.35 (11.89) 0.31 (0.46) 2.68 (1.39) 1.9 (1.03)

2. Cluster 3¶

In [27]:
# Summary statistics table for Big 6 clubs (Cluster 3)
# Filter the data
filtered_df = shot_sequence_df[
    (shot_sequence_df['team_3'].isin(big6_teams)) &
    (shot_sequence_df['Attack_Frechet_Gower7'] == 3)
]

# Count observations per team
team_counts = filtered_df.groupby('team_3').size().rename("count")

# Group by and calculate mean and std
agg_df = filtered_df.groupby('team_3')[cols].agg(['mean', 'std'])

# Flatten the MultiIndex columns
agg_df.columns = ['_'.join(col).strip() for col in agg_df.columns.values]

# Create formatted DataFrame with "mean (std)"
formatted_df = pd.DataFrame(index=agg_df.index)

for col in cols:
    mean_col = f"{col}_mean"
    std_col = f"{col}_std"
    formatted_df[col] = agg_df[mean_col].round(2).astype(str) + " (" + agg_df[std_col].round(2).astype(str) + ")"

# Insert 'count' as the first column
formatted_df.insert(0, 'count', team_counts)

# Reset index and rename properly
formatted_df = formatted_df.reset_index().rename(columns={'team_3': 'Team'})
formatted_df.style.hide(axis='index')
Out[27]:
Team count progress_dist_12 progress_ratio_12 progress_dist_23 progress_ratio_23 event_duration_12 event_duration_23
Arsenal 79 8.8 (12.19) 0.34 (0.55) 2.92 (9.91) 0.1 (0.46) 2.38 (1.73) 1.69 (0.85)
Chelsea 68 5.62 (18.77) 0.13 (0.58) 4.81 (14.79) 0.06 (0.52) 2.49 (2.35) 1.61 (1.02)
Liverpool 78 5.99 (14.95) 0.23 (0.58) 10.4 (14.14) 0.38 (0.51) 2.77 (3.01) 1.94 (1.02)
Manchester City 88 9.2 (15.41) 0.37 (0.58) 4.86 (11.72) 0.17 (0.48) 2.54 (1.38) 1.79 (0.79)
Manchester United 57 9.33 (16.74) 0.26 (0.63) 3.56 (11.14) 0.1 (0.48) 2.56 (2.01) 1.94 (0.92)
Tottenham Hotspur 60 7.43 (10.71) 0.34 (0.57) 10.55 (14.3) 0.33 (0.43) 2.65 (1.52) 1.93 (1.03)

2. What makes Leicester City different?¶

Potential clue of scoring more goals than any other teams beside Big 6 clubs¶

The league table in 2017-18 season has been given like below:

Rank Team GP W D L GS GA P Max Cluster 2nd Max Cluster
1 Manchester City 38 32 4 2 106 27 100 0 3
2 Manchester United 38 25 6 7 68 28 81 0 3
3 Tottenham Hotspur 38 23 8 7 74 36 77 0 3
4 Liverpool 38 21 12 5 84 38 75 0 3
5 Chelsea 38 21 7 10 62 38 70 0 3
6 Arsenal 38 19 6 13 74 51 63 0 3
7 Burnley 38 14 12 12 36 39 54 4 6
8 Everton 38 13 10 15 44 58 49 3 4
9 Leicester City 38 12 11 15 56 60 47 5 0
10 Newcastle United 38 12 8 18 39 47 44 0 3
11 Crystal Palace 38 11 11 16 45 55 44 3 2
12 AFC Bournemouth 38 11 11 16 45 61 44 3 0
13 West Ham United 38 10 12 16 48 68 42 2 3
14 Watford 38 11 8 19 44 64 41 4 0
15 Brighton 38 9 13 16 34 54 40 4 3
16 Huddersfield Town 38 9 10 19 28 58 37 3 5
17 Southampton 38 7 15 16 37 56 36 3 0
18 Swansea City 38 8 9 21 28 56 33 2 0
19 Stoke City 38 7 12 19 35 68 33 0 4
20 West Bromwich Albion 38 6 13 19 31 56 31 4 2

We can see from the table that Leicester City scored the most (56 goals) compared to all other teams beside big 6 clubs where the average goals for these teams seem to be around 35 to 45 goals. Moreover, if we also check the most cluster labels for attack sequence, Leicester City is the only team with their most amount of sequences labeled as cluster 5. In this part, we would like to delve into their attack sequence patterns and observe what makes them different in terms of scoring relatively more goals than any other teams like Crystal Palace or West Ham United.

Analysis of Cluster 5 & Comparsion Group¶

In [28]:
# Filter Leicester City & Comparison Group with relatively many goals (Everton / Crystal Palace / AFC Bournemouth)
leicester = shot_sequence_df[shot_sequence_df['team_3'] == 'Leicester City']
comparsion = shot_sequence_df[(shot_sequence_df['team_3'] == 'Everton') |
                              (shot_sequence_df['team_3'] == 'Crystal Palace') |
                              (shot_sequence_df['team_3'] == 'AFC Bournemouth')]

# Extract the attack pattern by cluster
leic_row5 = leicester[(leicester['Attack_Frechet_Gower7'] == 5)][['x_1', 'x_2', 'x_3', 'y_1', 'y_2', 'y_3']].mean()
comp_row3 = leicester[(leicester['Attack_Frechet_Gower7'] == 3)][['x_1', 'x_2', 'x_3', 'y_1', 'y_2', 'y_3']].mean()

# Extract x and y coordinates for each of the 3 points
leic_x5 = [leic_row5['x_1'], leic_row5['x_2'], leic_row5['x_3']]
leic_y5 = [leic_row5['y_1'], leic_row5['y_2'], leic_row5['y_3']]

comp_x3 = [comp_row3['x_1'], comp_row3['x_2'], comp_row3['x_3']]
comp_y3 = [comp_row3['y_1'], comp_row3['y_2'], comp_row3['y_3']]
In [29]:
# Create a pitch
f = draw_pitch("#195905", "#f3efec", "h", "full")

# Plot each category
plt.plot(leic_x5, leic_y5, color='white', linestyle='-', linewidth=1, markersize=4, label="Leicester City")
plt.plot(comp_x3, comp_y3, color='purple', linestyle='-', linewidth=1, markersize=4, label="Comparison")

# Spatial Coordinates (Highlight the last event with star)
plt.text(leic_x5[0], leic_y5[0], '1', fontsize=8, color='white', fontweight='bold', ha='right', va='top')
plt.text(leic_x5[1], leic_y5[1], '2', fontsize=8, color='white', fontweight='bold', ha='right', va='top')
plt.scatter(leic_x5[2], leic_y5[2], marker='*', c='white', s=25, zorder=13)

plt.text(comp_x3[0], comp_y3[0], '1', fontsize=8, color='purple', fontweight='bold', ha='right', va='top')
plt.text(comp_x3[1], comp_y3[1], '2', fontsize=8, color='purple', fontweight='bold', ha='right', va='top')
plt.scatter(comp_x3[2], comp_y3[2], marker='*', c='purple', s=25, zorder=13)

# Annotation arrow for attack direction
plt.annotate("", xy=(25, -3), xytext=(5, -5), arrowprops=dict(arrowstyle="->", linewidth=2))
plt.text(7, -4, 'Attack (---------->)', fontsize=15)
plt.title("Average Attack Sequence of Leicester City vs Comparison (Cluster 5 vs 3)")
plt.legend(title="Team", bbox_to_anchor=(1, 1), loc='upper left')
plt.show()
No description has been provided for this image
In [30]:
# Table group by event combinations - Leicester City
(shot_sequence_df[
        (shot_sequence_df['team_3'] == 'Leicester City') &
        (shot_sequence_df['Attack_Frechet_Gower7'] == 5)
    ]
    .groupby(['subEventName_1', 'subEventName_2'])
    .agg(count=('shot_accuracy', 'size'),
         goal_count=('tags_3', lambda x: x.str.contains(r'\b101\b', regex=True, na=False).sum()),
         accuracy_rate=('shot_accuracy', lambda x: (x == 'accurate').mean().round(3)))
    .sort_values(by='count', ascending=False)
    .query('count >= 3')
    .assign(goal_percentage=lambda df: (df['goal_count'] / df['count']).round(2))
    .loc[:, ['count', 'goal_count', 'goal_percentage', 'accuracy_rate']])
Out[30]:
count goal_count goal_percentage accuracy_rate
subEventName_1 subEventName_2
Ground defending duel Ground attacking duel 28 10 0.36 0.75
Cross 6 4 0.67 1.00

We can see from the table grouped by events that the attack sequence patterns for Leicester City with cluster 5 are simple; Winning the duel in the defending situation and progress with either another duel in the attacking situation or cross. It is also notable that both the goal conversion rate and shot accuracy were relativley high, showing somewhat simple but sophisticated at the same time with good quality of finishing the sequences.

In [31]:
# Table group by event combinations - Everton / Crystal Palace / AFC Bournemouth
(shot_sequence_df[
        ((shot_sequence_df['team_3'] == 'Everton') |
         (shot_sequence_df['team_3'] == 'Crystal Palace') |
         (shot_sequence_df['team_3'] == 'AFC Bournemouth')) &
        (shot_sequence_df['Attack_Frechet_Gower7'] == 3)
    ]
    .groupby(['subEventName_1', 'subEventName_2'])
    .agg(count=('shot_accuracy', 'size'),
         goal_count=('tags_3', lambda x: x.str.contains(r'\b101\b', regex=True, na=False).sum()),
         accuracy_rate=('shot_accuracy', lambda x: (x == 'accurate').mean().round(3)))
    .sort_values(by='count', ascending=False)
    .query('count >= 5')
    .assign(goal_percentage=lambda df: (df['goal_count'] / df['count']).round(2))
    .loc[:, ['count', 'goal_count', 'goal_percentage', 'accuracy_rate']])
Out[31]:
count goal_count goal_percentage accuracy_rate
subEventName_1 subEventName_2
Simple pass Simple pass 40 0 0.0 0.0
Cross 17 0 0.0 0.0
Ground attacking duel Cross 9 0 0.0 0.0
Simple pass 9 0 0.0 0.0
Simple pass Smart pass 7 0 0.0 0.0
Ball out of the field Corner 6 0 0.0 0.0
Simple pass High pass 6 0 0.0 0.0
Acceleration Simple pass 5 0 0.0 0.0

On the other hand, teams in the comparison group (Everton / Crystal Palace / AFC Bournemouth) had their attack sequence pattern segmented in cluster 3 the most. Similar to the cluster 3 analysis made with the big 6 clubs previously, the characteristics of the cluster is straightforward; 0% goal conversion and shot accuracy rate. Comparing Leicester City with the comparison group, they both show similar spatial trajectory with the 'V' shape. However, Considering that the dominant attack sequence pattern for these teams have these kinds of feature (0% goal conversion and shot accuracy rate), it is somewhat understandable of why their total goals scored are lower than Leicester City. Leicester City had better finishing quality than the comparison group.

Analysis of Cluster 0 & Comparison with Big 6 Clubs¶

In [32]:
# Define the big 6 teams in EPL and filter only these teams' shot sequences
leicester = shot_sequence_df[shot_sequence_df['team_3'] == 'Leicester City']

# Extract the attack pattern by cluster
leic_row0 = leicester[(leicester['Attack_Frechet_Gower7'] == 0)][['x_1', 'x_2', 'x_3', 'y_1', 'y_2', 'y_3']].mean()

# Extract x and y coordinates for each of the 3 points
leic_x0 = [leic_row0['x_1'], leic_row0['x_2'], leic_row0['x_3']]
leic_y0 = [leic_row0['y_1'], leic_row0['y_2'], leic_row0['y_3']]
In [33]:
# Create a pitch
f = draw_pitch("#195905", "#f3efec", "h", "full")

# Plot each category
plt.plot(leic_x0, leic_y0, color='white', linestyle='-', linewidth=1, markersize=4, label="Leicester City")

plt.plot(manc_x0, manc_y0, color='cyan', linestyle='-', linewidth=1, markersize=4, label="Man City")
plt.plot(manu_x0, manu_y0, color='red', linestyle='-', linewidth=1, markersize=4, label="Man United")
plt.plot(tot_x0, tot_y0, color='grey', linestyle='-', linewidth=1, markersize=4, label="Tottenham")
plt.plot(liv_x0, liv_y0, color='black', linestyle='-', linewidth=1, markersize=4, label="Liverpool")
plt.plot(chel_x0, chel_y0, color='dodgerblue', linestyle='-', linewidth=1, markersize=4, label="Chelsea")
plt.plot(ars_x0, ars_y0, color='yellow', linestyle='-', linewidth=1, markersize=4, label="Arsenal")

# Spatial Coordinates (Highlight the last event with star)
plt.text(leic_x0[0], leic_y0[0], '1', fontsize=8, color='white', fontweight='bold', ha='right', va='top')
plt.text(leic_x0[1], leic_y0[1], '2', fontsize=8, color='white', fontweight='bold', ha='right', va='top')
plt.scatter(leic_x0[2], leic_y0[2], marker='*', c='white', s=25, zorder=13)

plt.text(manc_x0[0], manc_y0[0], '1', fontsize=8, color='cyan', fontweight='bold', ha='right', va='top')
plt.text(manc_x0[1], manc_y0[1], '2', fontsize=8, color='cyan', fontweight='bold', ha='right', va='top')
plt.scatter(manc_x0[2], manc_y0[2], marker='*', c='cyan', s=25, zorder=13)

plt.text(manu_x0[0], manu_y0[0], '1', fontsize=8, color='red', fontweight='bold', ha='right', va='top')
plt.text(manu_x0[1], manu_y0[1], '2', fontsize=8, color='red', fontweight='bold', ha='right', va='top')
plt.scatter(manu_x0[2], manu_y0[2], marker='*', c='red', s=25, zorder=13)

plt.text(tot_x0[0], tot_y0[0], '1', fontsize=8, color='grey', fontweight='bold', ha='right', va='top')
plt.text(tot_x0[1], tot_y0[1], '2', fontsize=8, color='grey', fontweight='bold', ha='right', va='top')
plt.scatter(tot_x0[2], tot_y0[2], marker='*', c='grey', s=25, zorder=13)

plt.text(liv_x0[0], liv_y0[0], '1', fontsize=8, color='black', fontweight='bold', ha='right', va='top')
plt.text(liv_x0[1], liv_y0[1], '2', fontsize=8, color='black', fontweight='bold', ha='right', va='top')
plt.scatter(liv_x0[2], liv_y0[2], marker='*', c='black', s=25, zorder=13)

plt.text(chel_x0[0], chel_y0[0], '1', fontsize=8, color='dodgerblue', fontweight='bold', ha='right', va='top')
plt.text(chel_x0[1], chel_y0[1], '2', fontsize=8, color='dodgerblue', fontweight='bold', ha='right', va='top')
plt.scatter(chel_x0[2], chel_y0[2], marker='*', c='dodgerblue', s=25, zorder=13)

plt.text(ars_x0[0], ars_y0[0], '1', fontsize=8, color='yellow', fontweight='bold', ha='right', va='top')
plt.text(ars_x0[1], ars_y0[1], '2', fontsize=8, color='yellow', fontweight='bold', ha='right', va='top')
plt.scatter(ars_x0[2], ars_y0[2], marker='*', c='yellow', s=25, zorder=13)

# Annotation arrow for attack direction
plt.annotate("", xy=(25, -3), xytext=(5, -5), arrowprops=dict(arrowstyle="->", linewidth=2))
plt.text(7, -4, 'Attack (---------->)', fontsize=15)
plt.title("Average Attack Sequence of Leicester City vs Big 6 (Cluster 0)")
plt.legend(title="Team", bbox_to_anchor=(1, 1), loc='upper left')
plt.show()
No description has been provided for this image
In [34]:
# Table group by event combinations
(shot_sequence_df[
        (shot_sequence_df['team_3'] == 'Leicester City') &
        (shot_sequence_df['Attack_Frechet_Gower7'] == 0)
    ]
    .groupby(['subEventName_1', 'subEventName_2'])
    .agg(count=('shot_accuracy', 'size'),
         goal_count=('tags_3', lambda x: x.str.contains(r'\b101\b', regex=True, na=False).sum()),
         accuracy_rate=('shot_accuracy', lambda x: (x == 'accurate').mean().round(3)))
    .sort_values(by='count', ascending=False)
    .query('count >= 3')
    .assign(goal_percentage=lambda df: (df['goal_count'] / df['count']).round(2))
    .loc[:, ['count', 'goal_count', 'goal_percentage', 'accuracy_rate']])
Out[34]:
count goal_count goal_percentage accuracy_rate
subEventName_1 subEventName_2
Simple pass Simple pass 6 0 0.00 1.0
High pass 5 2 0.40 1.0
High pass Simple pass 5 3 0.60 1.0
Simple pass Cross 3 2 0.67 1.0
Smart pass 3 2 0.67 1.0
Ground attacking duel Cross 3 0 0.00 1.0
Acceleration Simple pass 3 0 0.00 1.0

When looking at the spatial visualization of the average attack sequences of cluster 0 between Leicester City and big 6 clubs, we can see that Leicester City's sequence shows somewhat similar trajectory but starts in the lower side (around the half-line circle) and their distances relatively being longer for both event 1 & 2 and 2 & 3 compared to the big 6 in general. This makes sense as the characteristics of the sub-events reflect the pattern which are mostly Cross or High pass. The event Simple pass was more involved this time like Big, but the variety of event combinations is still relatively simple with relatively less attempts. Still, their goal conversion and shot accuracy rates were fairly high, showing another signal of how they could score more than any other teams beside the big 6 clubs.

Summary Statistics Table - Leicester City¶

In [35]:
# Filter for Leicester City, selected clusters
filtered_df = shot_sequence_df[
    (shot_sequence_df['team_3'] == 'Leicester City') &
    (shot_sequence_df['Attack_Frechet_Gower7'].isin([5, 0]))
]

# Count observations per cluster
cluster_counts = filtered_df.groupby('Attack_Frechet_Gower7').size().rename("count")

# Group by cluster and calculate mean and std
agg_df = filtered_df.groupby('Attack_Frechet_Gower7')[cols].agg(['mean', 'std'])

# Flatten MultiIndex column names
agg_df.columns = ['_'.join(col).strip() for col in agg_df.columns.values]

# Format as "mean (std)"
formatted_df = pd.DataFrame(index=agg_df.index)

for col in cols:
    mean_col = f"{col}_mean"
    std_col = f"{col}_std"
    formatted_df[col] = agg_df[mean_col].round(2).astype(str) + " (" + agg_df[std_col].round(2).astype(str) + ")"

# Insert count column at the beginning
formatted_df.insert(0, 'count', cluster_counts)

# Optional: Rename index to make it clearer (e.g. "Cluster 0", "Cluster 5")
formatted_df.index.name = 'Cluster'
formatted_df.reset_index(inplace=True)
formatted_df.style.hide(axis='index')
Out[35]:
Cluster count progress_dist_12 progress_ratio_12 progress_dist_23 progress_ratio_23 event_duration_12 event_duration_23
0 48 15.31 (14.09) 0.56 (0.45) 9.5 (12.13) 0.37 (0.48) 2.87 (1.46) 1.59 (0.88)
5 51 8.18 (33.78) 0.28 (0.62) 5.86 (19.04) 0.15 (0.51) 3.29 (10.68) 1.33 (0.85)

Key Players for Leicester City's Attack Sequences¶

In [36]:
# Function to extract the most & second most involved players
def top_two_modes(series):
    modes = series.mode()
    if len(modes) >= 2:
        return f"{modes.iloc[0]} / {modes.iloc[1]}"
    elif len(modes) == 1:
        return f"{modes.iloc[0]}"
    else:
        return None

# Summary table for key players - cluster 5
player_summary = (
    shot_sequence_df[
        (shot_sequence_df['team_3'] == 'Leicester City') &
        (shot_sequence_df['Attack_Frechet_Gower7'] == 5)
    ]
    .groupby(['subEventName_1', 'subEventName_2'])
    .agg(
        count=('shot_accuracy', 'size'),
        goal_count=('tags_3', lambda x: x.str.contains(r'\b101\b', regex=True, na=False).sum()),
        accuracy_rate=('shot_accuracy', lambda x: (x == 'accurate').mean().round(3)),

        player_1=('player_1', top_two_modes),
        player_2=('player_2', top_two_modes),
        player_3=('player_3', top_two_modes)
    )
    .sort_values(by='count', ascending=False)
    .query('count >= 3')
    .assign(goal_percentage=lambda df: (df['goal_count'] / df['count']).round(2))
    .loc[:, ['count', 'goal_count', 'goal_percentage', 'accuracy_rate', 'player_1', 'player_2', 'player_3']]
)

# Display the table
player_summary
Out[36]:
count goal_count goal_percentage accuracy_rate player_1 player_2 player_3
subEventName_1 subEventName_2
Ground defending duel Ground attacking duel 28 10 0.36 0.75 C. Kabasele / J. Gomez R. Mahrez R. Mahrez
Cross 6 4 0.67 1.00 D. Janmaat / J. Gomez D. Gray / M. Albrighton R. Mahrez

For the attack sequence patterns labeled as cluster 5, R. Mahrez was the most involved player in the second and last shot sequence event. C. Kabasele was the player iniating the sequences by winning the duel with opponent players in the defending situation the most. D. Gray and M. Albrighton were the two most involved players connecting to R. Mahrez with fairly accurate crosses.

In [37]:
# Summary table for key players - cluster 0
player_summary = (
    shot_sequence_df[
        (shot_sequence_df['team_3'] == 'Leicester City') &
        (shot_sequence_df['Attack_Frechet_Gower7'] == 0)
    ]
    .groupby(['subEventName_1', 'subEventName_2'])
    .agg(
        count=('shot_accuracy', 'size'),
        goal_count=('tags_3', lambda x: x.str.contains(r'\b101\b', regex=True, na=False).sum()),
        accuracy_rate=('shot_accuracy', lambda x: (x == 'accurate').mean().round(3)),

        player_1=('player_1', top_two_modes),
        player_2=('player_2', top_two_modes),
        player_3=('player_3', top_two_modes)
    )
    .sort_values(by='count', ascending=False)
    .query('count >= 3')
    .assign(goal_percentage=lambda df: (df['goal_count'] / df['count']).round(2))
    .loc[:, ['count', 'goal_count', 'goal_percentage', 'accuracy_rate', 'player_1', 'player_2', 'player_3']]
)

# Display the table
player_summary
Out[37]:
count goal_count goal_percentage accuracy_rate player_1 player_2 player_3
subEventName_1 subEventName_2
Simple pass Simple pass 6 0 0.00 1.0 Adrien Silva / B. Chilwell R. Mahrez R. Mahrez
High pass 5 2 0.40 1.0 H. Maguire M. Albrighton J. Vardy
High pass Simple pass 5 3 0.60 1.0 M. Albrighton R. Mahrez S. Okazaki
Simple pass Cross 3 2 0.67 1.0 B. Chilwell / M. Albrighton J. Vardy S. Okazaki
Smart pass 3 2 0.67 1.0 H. Maguire Adrien Silva / K. Iheanacho I. Slimani / J. Vardy
Ground attacking duel Cross 3 0 0.00 1.0 D. Gray D. Gray A. King / J. Vardy
Acceleration Simple pass 3 0 0.00 1.0 R. Mahrez R. Mahrez K. Iheanacho

For the attack sequence patterns labeled as cluster 0, J. Vardy, S. Okazaki were the most involved players in the last shot sequence event with R. Mahrez. Specifically, J. Vardy was also involved a lot in the second sequence event with R. Mahrez, showing the importance of his role in the attack. B. Chilwell, H. Maguire were instantiating the attack sequences the most with M. Albrighton with Simple pass instead of fighting in the defending duel like in cluster 5. This could be also understood as these two defenders involved frequently in the team's attack sequence.

References & Resources¶

  1. Frechet Distance (https://en.wikipedia.org/wiki/Fréchet_distance)
  2. Gower Distance (https://arxiv.org/ftp/arxiv/papers/2101/2101.02481.pdf)
  3. Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2(3), pp. 283-304, 1998. (https://link.springer.com/content/pdf/10.1023/A:1009769707641.pdf)
  4. Opta Analyst (https://theanalyst.com/2024/10/strongest-leagues-world-football-opta-power-rankings)
  5. Pappalardo, L., Cintia, P., Rossi, A. et al. A public data set of spatio-temporal match events in soccer competitions. Sci Data 6, 236 (2019) doi:10.1038/s41597-019-0247-7 (https://www.nature.com/articles/s41597-019-0247-7)