Professional Experience & Projects

Data Science Exploration (STAT 207) Teaching Assistant

I am currently a Teaching Assistant for Data Science Exploration (STAT 207) at UIUC, and I previously worked as a course assistant for the same course for three semesters as an undergraduate. My main responsibilities are leading a lab session, answering students' questions about statistics and Python programming, and helping the professor communicate effectively with other course staff and students.

• Fall 2023 - List of Teachers Ranked as Excellent by Their Students (pdf)

• Spring 2024 - List of Teachers Ranked as Excellent by Their Students (pdf)

Political Polarization

Since March 2025, I have been working on a project analyzing user data provided by the non-profit organization ‘Braver Angels’, supervised by Professor Tori Ellison. I am currently testing different parameter combinations for topic modeling algorithms such as Latent Dirichlet Allocation (LDA) to identify keywords that distinguish topics, and applying clustering algorithms to group similar users and understand notable features that may be related to political polarization in the U.S.

Personal Sports Projects

Analysis of 17-18 EPL Team Attack Patterns (Fall 2024 / Revised Spring 2025)

STAT 430 - Practice of Applied Statistics (Report) (Code)

• Utilized publicly available spatio-temporal football data from Pappalardo et al. (originally from WyScout) to construct attacking sequences based on event types, spatial coordinates, and timing features

• Applied unsupervised learning using Fréchet and Gower distances with K-medoids clustering to define latent attack patterns, followed by statistical testing to verify team-level associations

• Discovered that Big 6 teams shared structurally similar attacking clusters with tactical variation, while Leicester City uniquely relied on fast, efficient duel-based sequences, achieving the highest goal output outside the Big 6
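
To illustrate the kind of sequence distance mentioned above, here is a minimal sketch of the discrete Fréchet distance between two attacking sequences, each represented as a list of (x, y) pitch coordinates. This is an assumption-laden illustration rather than the project's actual code: the report may use a different Fréchet variant, and the function name and example coordinates are made up.

```python
from math import dist


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two point sequences p and q.

    p and q are non-empty lists of (x, y) tuples, ordered in time.
    """
    n, m = len(p), len(q)
    # ca[i][j] holds the coupling distance between p[: i + 1] and q[: j + 1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Advance along p, along q, or along both, whichever is cheapest.
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[-1][-1]


# Hypothetical sequences: two parallel attacks one unit apart.
seq_a = [(0, 0), (1, 0), (2, 0)]
seq_b = [(0, 1), (1, 1), (2, 1)]
distance = discrete_frechet(seq_a, seq_b)
```

A pairwise matrix of such distances is the kind of input a K-medoids algorithm can cluster directly, since K-medoids only needs dissimilarities and not coordinates in a vector space.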

European Football League Player Stats (Spring 2022 / Re-developing from June 2025)

STAT 385 - Statistical Programming Methods (Streamlit) (Shiny App / Old Version)

• Developing an interactive Streamlit web app using Python to analyze and visualize major European football league (Big 5 & Eredivisie/Primeira Liga/Jupiler) players' performance data from Fbref, building on a prior R Shiny version

• Integrating data scraping (BeautifulSoup), radar chart visualization (Plotly), and player comparison features using cosine similarity to allow dynamic selection and comparison of similar players based on diverse in-game performance stats
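
The player comparison idea can be sketched in a few lines. The example below is a simplified illustration, not the app's code: the player names, stat vectors, and the `most_similar` helper are hypothetical, and a real pipeline would likely standardize each stat before comparing players.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length stat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for all-zero vectors.
    return dot / (norm_a * norm_b)


def most_similar(target, players, top_n=3):
    """Rank the other players by cosine similarity to the target player."""
    scores = [
        (name, cosine_similarity(players[target], stats))
        for name, stats in players.items()
        if name != target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_n]


# Made-up per-90 stats: (goals, key passes, tackles).
players = {
    "Player A": [0.5, 2.0, 1.0],
    "Player B": [1.0, 4.0, 2.0],
    "Player C": [3.0, 0.2, 0.1],
}
ranking = most_similar("Player A", players, top_n=2)
```

Because cosine similarity compares the direction of the stat vectors and ignores their magnitude, it matches players by style profile and not by raw volume.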

Pitching Similarity (Spring 2023)

STAT 430 - Baseball Analytics (Shiny App)

• Created an R Shiny app with MLB Statcast data

• Designed pitch location heatmaps with ggplot to visualize and compare selected pitchers' pitch locations and stats such as pitch type and speed

In-Class Group Projects

Below are some individual and group projects that I completed for my courses. Feel free to check them out!

Partial Dopaminergic & Tyrosine Hydroxylase of Dopamine Synthesis (Spring 2024)

STAT 530 - Bioinformatics (Report)

• Used a MERFISH dataset to investigate the existence of partial dopaminergic neurons in the preoptic area (POA) and spatial evidence of tyrosine hydroxylase (TH) and L-DOPA exocytosis for dopamine synthesis

• Applied dimensionality reduction algorithms such as UMAP and identified spatial structure and molecular features of partial dopaminergic cells within the POA that potentially interact with dopamine synthesis

• Conducted spatial autocorrelation analysis to test for differences in spatial location between the TH and Ddc (DOPA decarboxylase) groups and detected a notable pattern of L-DOPA exocytosis from concentrated to dispersed

Analysis & Forecast of Chinese Stock Market (Fall 2023)

STAT 430 - Time Series Machine Learning (Report)

• Analyzed price data for 500 Chinese stocks from several time-series perspectives (classic ARIMA / clustering / deep learning GRU)

• Worked on the cluster analysis part, applying algorithms like DBSCAN to group stocks with similar price trends, and found that real estate / food / medicine stocks showed notable growth and stable fluctuation

Beijing Housing Prices (Fall 2022)

STAT 432 - Basics of Statistical Learning (Report) (Presentation)

• Conducted both supervised and unsupervised analysis with a Beijing housing price dataset from Kaggle to understand price trends in Beijing, China

• Applied regression algorithms such as LASSO and XGBoost to predict 'price' and identified number of rooms / age & condition / distance from an epicenter as the most significant predictors

• Clustered the data using K-Means, labeled the observations as 'over' or 'under' priced within their clusters, and found that 'overpriced' properties tended to be relatively old and close to the epicenter

Lyrics Generator (Fall 2022)

STAT 430 - Fundamentals of Deep Learning (Report) (Presentation) (Github)

• Built a neural network structure (LSTM) with PyTorch to generate lyrics with a dataset containing 6 million songs from Kaggle

• Trained and tested the model by genre with different hyperparameters (sequence length / batch size) and found that a longer sequence length does not necessarily guarantee better model performance

• Generated sample lyrics and observed that Rap lyrics were relatively short and negative, whereas Country and R&B lyrics were long and romantic

Consulting Project: Air Pollutant in Gucheng, China (Spring 2022)

STAT 443 - Professional Statistics (Report) (Presentation)

• Consulted with a mock client on a project analyzing an air pollutant dataset from Gucheng, China, in R

• Created time-series plots with ggplot and observed relatively higher air pollutant (PM2.5) levels in winter, on weekends, and at night

• Tested various algorithms (Linear Regression / Multinomial / K-Nearest Neighbors) to find the best model for predicting the next day's air pollutant (PM2.5) level from the previous day's meteorological variables

Denoise Autoencoder (Spring 2022)

CS 307 - Modeling & Learning in Data Science (Report)

• Built autoencoders with 'FashionMNIST' data in PyTorch to examine the relationship between noisy data and the autoencoders' performance

• Trained the autoencoders using manipulated data with three different noise types (No noise / Gaussian / RandomErasing)

• Evaluated the autoencoders' performances by comparing their average train & test loss and visualizing t-SNE plots of latent spaces

Market Segmentation of Credit Card Customers (Fall 2021)

STAT 430 - Unsupervised Learning (Report & Code) (Presentation)

• Explored a bank dataset from Kaggle in Python to find insights into customers' leaving status based on their demographics and banking information

• Applied clustering algorithms (Hierarchical Agglomerative, K-Prototype Clustering) from Scikit-learn to visualize the clustering structure and, based on that structure and exploratory data analysis, hypothesized that customers with lower and unstable income are less economically active and more likely to leave

• Computed supervised evaluation scores (Completeness / Homogeneity / V-measure) comparing the cluster labels predicted under this hypothesis with the pre-assigned labels, and found a poor match between the predicted and original labels (weak or no association between customers' information and their leaving status)
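
For readers unfamiliar with these scores, the sketch below computes homogeneity, completeness, and V-measure from their entropy-based definitions in plain Python. It is only an illustration with made-up labels, not the project's code; scikit-learn provides the same scores through `sklearn.metrics.homogeneity_completeness_v_measure`, and the helper names here are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels), in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(a, b):
    """Conditional entropy H(a | b), in nats."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((c / n) * log(c / b_counts[bv]) for (_, bv), c in joint.items())


def homogeneity_completeness_v(true_labels, cluster_labels):
    """Return (homogeneity, completeness, V-measure) with equal weighting."""
    h_true = entropy(true_labels)
    h_clusters = entropy(cluster_labels)
    # Homogeneity: each cluster contains members of a single class.
    hom = 1.0 if h_true == 0 else 1.0 - conditional_entropy(true_labels, cluster_labels) / h_true
    # Completeness: all members of a class fall in the same cluster.
    com = 1.0 if h_clusters == 0 else 1.0 - conditional_entropy(cluster_labels, true_labels) / h_clusters
    # V-measure: harmonic mean of the two.
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v


# Made-up labels: clusters that are unrelated to the true classes.
stayed_or_left = [0, 0, 1, 1]
clusters = [0, 1, 0, 1]
scores = homogeneity_completeness_v(stayed_or_left, clusters)
```

Scores near 0, as in this toy case, indicate that the clusters carry almost no information about the original labels, which is the kind of poor match described in the last bullet.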