Predicting User Engagement Scores in a Learning Management Platform

Client:

ACTO

Duration:

2 Months

Machine Learning
NumPy
scikit-learn

As a Data Analyst at a leading educational technology company, I developed a machine learning solution for predicting user engagement in our Learning Management System (LMS). The goal was to enhance user experience and improve content delivery through data-driven insights.

The Challenge:

Our LMS serves thousands of users, generating vast amounts of interaction data. The primary challenge was to accurately predict user engagement levels in a multi-class setting, using a dataset of roughly 20,000 rows. Initial modeling attempts yielded suboptimal results, with an F1-score of only 35%.

My Approach:

Here are the steps I took to prepare the data, then train and validate a model to predict user engagement scores:

  1. Data Preprocessing and Feature Engineering:

    # Finding and removing null values
    import pandas as pd  # df is assumed to be a pandas DataFrame loaded earlier
    
    null_percentages = (df.isnull().sum() / len(df)) * 100
    print("Percentage of null values in each column:")
    print(null_percentages[null_percentages > 0])
    df = df[df['score'].notna()]
    print(f"Data shape after removing null scores: {df.shape}")
    
    # Categorizing Score Values into bins
    
    def categorize_score(score):
        if score <= 2:
            return 'Very Low'
        elif score <= 4:
            return 'Low'
        elif score <= 6:
            return 'Even'
        elif score <= 8:
            return 'High'
        else:
            return 'Very High'
    
    df['score_category'] = df['score'].apply(categorize_score)
    
    


    • Cleaned and prepared the dataset, handling missing values and encoding categorical variables.

    • Created new features to capture user behavior patterns, such as total time spent on different activities and engagement ratios.

    • Implemented advanced techniques to handle temporal aspects of user interactions.
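
The behavioral features described above can be sketched as follows. This is a minimal illustration with made-up activity columns; `video_time`, `quiz_time`, `forum_time`, and `logins` are hypothetical names, not the actual LMS schema:

```python
import pandas as pd

# Hypothetical interaction log: per-user activity totals (minutes) and login counts
df = pd.DataFrame({
    'video_time': [120, 45, 300, 0],
    'quiz_time':  [30, 10, 90, 5],
    'forum_time': [15, 0, 60, 2],
    'logins':     [10, 3, 25, 1],
})

# Total time spent across all activities
activity_cols = ['video_time', 'quiz_time', 'forum_time']
df['total_time'] = df[activity_cols].sum(axis=1)

# Engagement ratios: share of time in each activity (guard against division by zero)
for col in activity_cols:
    df[f'{col}_ratio'] = df[col] / df['total_time'].replace(0, 1)

# Session intensity: time spent per login
df['time_per_login'] = df['total_time'] / df['logins'].clip(lower=1)
```

Ratio and intensity features like these normalize away differences in raw usage volume, which often matters more for engagement prediction than absolute time spent.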


  2. Addressing Class Imbalance:

    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import train_test_split
    
    # Split the data into training and testing sets,
    # stratifying so both splits preserve the class proportions
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
    
    # Initialize SMOTE
    smote = SMOTE(random_state=42)
    
    X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
    
    # Check the distribution after SMOTE
    print("Class distribution before SMOTE:")
    print(y_train.value_counts())
    print("\nClass distribution after SMOTE:")
    print(y_train_res.value_counts())
    
    # Check the shape of the resulting datasets
    print("X_train_res shape:", X_train_res.shape)
    print("X_test shape:", X_test.shape)
    print("y_train_res shape:", y_train_res.shape)
    print("y_test shape:", y_test.shape)


    • Utilized the Synthetic Minority Over-sampling Technique (SMOTE) to balance our dataset, ensuring robust model performance across all engagement levels.

  3. Model Selection and Development:

    • After evaluating several algorithms, we chose XGBoost for its ability to handle complex, non-linear relationships in data.

    • Implemented a rigorous cross-validation strategy to ensure model generalizability.

    • Fine-tuned hyperparameters using grid search with cross-validation to optimize model performance.

  4. Feature Importance Analysis:


    • Leveraged XGBoost's built-in feature importance metrics to identify key predictors of user engagement.

    • This analysis provided valuable insights for product development and content strategy teams.




  5. Results and Impact:

    • Significantly improved model performance, increasing the F1-score from 35% to 60%.

    • Achieved an 86% ROC-AUC score on the 5-class prediction problem.

    • The enhanced predictive capabilities allowed for:

      1. Personalized content recommendations, improving user satisfaction.

      2. Early identification of at-risk users, enabling timely interventions.

      3. Optimization of the overall learning experience on our platform.
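
For reference, the evaluation metrics above (macro F1 and multi-class ROC-AUC) can be computed with scikit-learn as sketched below. The random predictions here are placeholders, so the scores will sit near chance rather than the reported 60% / 86%:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Placeholder labels and predicted probabilities for a 5-class problem
rng = np.random.default_rng(42)
y_true = rng.integers(0, 5, size=200)
proba = rng.random((200, 5))
proba /= proba.sum(axis=1, keepdims=True)  # rows must sum to 1
y_pred = proba.argmax(axis=1)

# Macro F1 weights every engagement class equally
macro_f1 = f1_score(y_true, y_pred, average='macro')

# One-vs-rest averaging extends ROC-AUC to the multi-class setting
auc = roc_auc_score(y_true, proba, average='macro', multi_class='ovr')
print(round(macro_f1, 3), round(auc, 3))
```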

  6. Challenges Overcome:

    • Dealing with the complexity of multi-class classification in user behavior prediction.

    • Balancing model complexity with interpretability to provide actionable insights.

    • Integrating the model predictions into the existing LMS infrastructure for real-time user engagement enhancement.

  7. Key Learnings:

    1. The importance of feature engineering in capturing complex user behaviors.

    2. The effectiveness of ensemble methods like XGBoost in handling diverse and noisy datasets.

    3. The value of combining data science expertise with domain knowledge in educational technology.


  8. Conclusion:

This project demonstrates my ability to develop end-to-end machine learning solutions that directly impact business outcomes and user experiences.

By leveraging advanced techniques in data preprocessing, feature engineering, and state-of-the-art algorithms like XGBoost, I significantly improved our ability to understand and predict user engagement in our Learning Management System.

This work not only enhanced our product offering but also contributed to our mission of providing personalized and effective learning experiences!
