EBIS4043 Big Data Analysis and Applications

Posted by hellyou on 2024-10-26

The purpose of this assignment is to confirm that you are picking up the R-based analytics skills introduced in this class and to check your ability. Please do not use other tools to generate the answers. (Total 50 marks)
1. Use the dataset available at iSpace.
2. Make sure the submission covers the entire process, from data loading to analysis and interpretation.
3. All your answers, including your identity, code, and interpretation, should be in one file: the HTML generated from an R Markdown file (.Rmd). A submission made up of multiple files will receive zero marks.
4. You can discuss the coding for this assignment with your friends. However, any visible overlap in your interpretation will be considered plagiarism.
5. The use of any generative AI tool is strictly prohibited for this assignment. If such use is detected, it will be considered an attempt at plagiarism.
6. There can be more than one correct answer to each question. Use any technique that you learned in class.
7. If needed, use 20240614 as the random number seed.

Data Description
This dataset comes from Orange Telecom’s churn dataset, which consists of customer information known to the telecom company, along with a churn indicator (“TRUE” = canceled the subscription, “FALSE” = otherwise). The customer information covers each customer’s location, extra service plans (e.g., international roaming and voice mail services), usage (minutes, number of calls, charged fees, …), and so on. All customers in the dataset are from the United States.

Questions
1. Write and execute R code to build and test the regression equation below for predicting the value of the Churn variable using the dataset with 1) a Linear Probability Model (LPM) and 2) a logistic regression model. Transform and create variables as needed. Which model has a better fit? (Total 10 marks)

where CS.contacted = 0 if the customer has never contacted customer service and 1 otherwise; Total.all.charge is the sum of all fees charged to the customer for calls, excluding customer service calls; and Total.all.time is the total time the customer spent on calls, excluding customer service calls (in minutes).
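The variable definitions above can be sketched in R as follows. This is only an illustration on synthetic data: the raw column names (Customer.service.calls, Day.charge, Eve.charge, Day.minutes, Eve.minutes) and the model formula are assumptions made for the example, not the actual iSpace columns or the regression equation of the question. The comparison by mean squared error of the fitted probabilities is one possible fit measure among several.

```r
# Synthetic stand-in for the churn data; every raw column name is hypothetical.
set.seed(20240614)
n <- 500
raw <- data.frame(
  Customer.service.calls = rpois(n, 1.5),
  Day.charge  = runif(n, 10, 60),
  Eve.charge  = runif(n, 5, 30),
  Day.minutes = runif(n, 50, 350),
  Eve.minutes = runif(n, 50, 350)
)
raw$Churn <- rbinom(
  n, 1, plogis(-3 + 0.04 * raw$Day.charge + 0.3 * raw$Customer.service.calls)
) == 1

# Derived variables, following the definitions in the text.
dat <- transform(
  raw,
  Churn            = as.integer(Churn),                      # TRUE/FALSE -> 1/0
  CS.contacted     = as.integer(Customer.service.calls > 0), # 0 = never contacted
  Total.all.charge = Day.charge + Eve.charge,
  Total.all.time   = Day.minutes + Eve.minutes
)

# Illustrative formula only; the assignment's own equation should be used instead.
f <- Churn ~ CS.contacted + Total.all.charge + Total.all.time

lpm   <- lm(f, data = dat)                     # Linear Probability Model
logit <- glm(f, data = dat, family = binomial) # logistic regression

# One way to compare fit: mean squared error of the fitted probabilities.
mse <- function(p, y) mean((p - y)^2)
print(c(LPM = mse(fitted(lpm), dat$Churn), Logit = mse(fitted(logit), dat$Churn)))

# Share of LPM fitted values that fall outside the [0, 1] interval.
print(mean(fitted(lpm) < 0 | fitted(lpm) > 1))
```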
2. Using the LPM estimated in question 1, plot the effect of Total.all.charge on Churn for CS.contacted = 0 and CS.contacted = 1, holding the other predictors at their mean values. (Total 10 marks)
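A plot of this kind can be built by predicting over a grid, as in the minimal sketch below. It runs on synthetic data, and the interaction term in the formula is an assumption made for illustration so that the two lines can differ in slope; the actual model from question 1 would take its place.

```r
# Synthetic data with the three predictors named in the text.
set.seed(20240614)
n <- 500
dat <- data.frame(
  CS.contacted     = rbinom(n, 1, 0.7),
  Total.all.charge = runif(n, 20, 100),
  Total.all.time   = runif(n, 200, 900)
)
dat$Churn <- rbinom(
  n, 1, plogis(-4 + 0.03 * dat$Total.all.charge + 0.8 * dat$CS.contacted)
)

# Hypothetical LPM; the interaction is an assumption, not the assignment's equation.
lpm <- lm(Churn ~ CS.contacted * Total.all.charge + Total.all.time, data = dat)

# Grid: vary Total.all.charge for both groups, hold Total.all.time at its mean.
grid <- expand.grid(
  Total.all.charge = seq(min(dat$Total.all.charge),
                         max(dat$Total.all.charge), length.out = 50),
  CS.contacted     = c(0, 1)
)
grid$Total.all.time <- mean(dat$Total.all.time)
grid$pred <- predict(lpm, newdata = grid)

# One line per value of CS.contacted.
plot(grid$Total.all.charge, grid$pred, type = "n",
     xlab = "Total.all.charge", ylab = "Predicted probability of churn")
for (g in c(0, 1)) {
  sub <- grid[grid$CS.contacted == g, ]
  lines(sub$Total.all.charge, sub$pred, lty = g + 1)
}
legend("topleft", legend = c("CS.contacted = 0", "CS.contacted = 1"), lty = 1:2)
```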

3. Write and execute R code to build and test the regression equation below for predicting the value of the Churn variable using all predictors in the dataset with 1) a Linear Probability Model (LPM) and 2) a logistic regression model. Please use 5-fold cross-validation for both models. (Total 10 marks)
Hint 1: Use the caret package.
Hint 2: Use the as.factor() function to convert a variable into a factor variable.
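As a rough illustration of the two hints, the sketch below runs 5-fold cross-validation with caret on synthetic data. The predictors x1, x2, and plan are hypothetical placeholders for the real columns. The LPM is fitted with method = "lm" on a numeric 0/1 outcome (caret warns that the outcome has only two values, which is expected here), while the logistic model uses a factor outcome with method = "glm".

```r
library(caret)

# Synthetic data; x1, x2 and plan are hypothetical stand-ins for real predictors.
set.seed(20240614)
n <- 400
dat <- data.frame(
  x1   = rnorm(n),
  x2   = rnorm(n),
  plan = sample(c("yes", "no"), n, replace = TRUE)
)
dat$Churn   <- rbinom(n, 1, plogis(-1 + dat$x1 + 0.5 * (dat$plan == "yes")))
dat$plan    <- as.factor(dat$plan)                       # categorical predictor
dat$Churn.f <- factor(dat$Churn, levels = c(0, 1),
                      labels = c("No", "Yes"))           # factor outcome for glm

ctrl <- trainControl(method = "cv", number = 5)          # 5-fold cross-validation

# LPM: linear regression on the numeric 0/1 outcome.
set.seed(20240614)
lpm_cv <- train(Churn ~ x1 + x2 + plan, data = dat,
                method = "lm", trControl = ctrl)

# Logistic regression: binomial glm on the factor outcome.
set.seed(20240614)
logit_cv <- train(Churn.f ~ x1 + x2 + plan, data = dat,
                  method = "glm", family = binomial, trControl = ctrl)

print(lpm_cv$results)    # RMSE, Rsquared, MAE averaged over folds
print(logit_cv$results)  # Accuracy, Kappa averaged over folds
```

Resetting the seed before each call keeps each run reproducible. The two result tables report different metrics, so they are not directly comparable without a common measure such as accuracy at a fixed threshold.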

4. Based on the results from question 3, which model is preferred for prediction in terms of accuracy at a classification threshold of 0.3? (Total 10 marks)
Hint: Use the data.frame() function to convert the list output from predict() into a data frame.
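Accuracy at a given threshold can be computed from predicted probabilities in a few lines of base R. The helper name accuracy_at and the toy vectors below are hypothetical; in practice prob would hold each model's predicted churn probabilities and actual the observed 0/1 outcomes.

```r
# Hypothetical helper: share of correct classifications at a given threshold.
accuracy_at <- function(prob, actual, threshold = 0.3) {
  predicted <- as.integer(prob > threshold)  # 1 = predicted churn
  mean(predicted == actual)
}

# Toy example with five customers.
actual <- c(0, 0, 1, 1, 0)
prob   <- c(0.10, 0.35, 0.40, 0.32, 0.05)

print(accuracy_at(prob, actual))                   # 0.8 at threshold 0.3
print(accuracy_at(prob, actual, threshold = 0.5))  # 0.6 at threshold 0.5
```

Lowering the threshold from 0.5 to 0.3 flags more customers as churners, which changes accuracy in whichever direction the extra flags turn out to be right or wrong.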

5. Do you think the LPM developed in question 3 can be used to predict whether a Canadian customer will churn? Please provide at least two reasons for your answer, based on this document and the answers you have generated so far. (Total 10 marks)
