I am trying to build a machine learning model for predicting customer churn in my company. Currently I have data for 5000 customers over 2 years. Is this enough data or do I need more? Also which algorithm should I use - logistic regression or random forest?
Reply by: DataScientist_10yrs
5000 records is a decent dataset for a binary classification problem like churn prediction. You can definitely start with this. For the algorithm choice, I suggest trying both and comparing the results. Start with logistic regression as a baseline because it's simple and interpretable. Then try random forest or XGBoost, which often give better accuracy but are more complex and harder to interpret. Use cross-validation so both models are evaluated on the same held-out folds.
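A rough sketch of that comparison with scikit-learn is below. I don't have your data, so it uses a synthetic dataset from `make_classification` as a stand-in (5000 rows, roughly 20% churners, 20 features; all of that is my assumption, not your real table). The ROC AUC metric and the 5 folds are also just reasonable defaults to start from.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real customer table (assumption):
# 5000 rows, 20 numeric features, about 20% positive (churned) labels.
X, y = make_classification(
    n_samples=5000,
    n_features=20,
    n_informative=8,
    weights=[0.8, 0.2],
    random_state=42,
)

models = {
    # Scaling helps logistic regression converge; trees do not need it.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Stratified folds keep the churn rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    results[name] = scores
    print(f"{name}: ROC AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```

Replace `X` and `y` with your own features and churn labels. If the two models score about the same, the simpler logistic regression is probably the better pick.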
Reply by: Analytics_Consultant
Also check whether your data is balanced, meaning you have a similar number of churned and non-churned customers. If 95% of customers didn't churn and only 5% churned, you have a class imbalance problem: a model that predicts "no churn" for everyone would score 95% accuracy while catching no churners at all. In that case you need to use techniques like SMOTE or adjust class weights. Data quality and feature engineering are usually more important than the quantity of data.
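To make the 95/5 case concrete, here is a small sketch using made-up labels (950 stayed, 50 churned; these numbers are only an illustration). It shows why accuracy is misleading for the "always predict no churn" baseline, and what scikit-learn's "balanced" class weights work out to, which is `n_samples / (n_classes * class_count)`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels (assumption): 950 customers stayed (0), 50 churned (1).
y = np.array([0] * 950 + [1] * 50)

# A "model" that always predicts "no churn".
always_stay = np.zeros_like(y)
baseline_accuracy = accuracy_score(y, always_stay)  # share of correct predictions
baseline_recall = recall_score(y, always_stay)      # share of churners caught

# "Balanced" weights: n_samples / (n_classes * count of each class).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
class_weight = dict(zip([0, 1], weights))

print(f"baseline accuracy: {baseline_accuracy:.2f}")
print(f"baseline recall:   {baseline_recall:.2f}")
print(f"class weights:     {class_weight}")
```

You normally don't need to compute the weights yourself: passing `class_weight="balanced"` to `LogisticRegression` or `RandomForestClassifier` applies the same formula. SMOTE is not part of scikit-learn; it is provided by the separate imbalanced-learn package.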