ETC2420 / ETC5242 Statistical Thinking

n58h2r發表於2024-10-07

原文網址 : https://www.cnblogs.com/comp9021/p/18450328

Thinking

Statistical Thinking (ETC2420 / ETC5242)

Assignment 1

Semester 2, 2024

Instructions

This assignment is a group assignment. Only one submission for each group is required.

All groups are to do all tasks in this assignment, regardless of your unit code.

Effective strategies for group work:

• Experience suggests it is a poor strategy for a group to allocate separate tasks to individuals and then just to present them together as if they were each undertaken collectively by the group.

• The best strategy is to have each group member attempt each task on their own, with the group meeting together after a few days to review and decide on the best set of ideas and approaches, including whether additional work is required. Be sure to leave enough time to combine your efforts and review the final report before submitting the assignment.

You may verbally discuss ideas about the assignment with any of your classmates. However, all groups must separately prepare and write their own submission, which should accurately reflect the work and efforts of all students in your group. We highly recommend that you do not share any files or written notes between groups, which will be treated as collusion. Any discussion on the online forum should be limited to seeking clarifications about the instructions and tasks.

Preparing your submission

Your responses should be written in an R Markdown file that compiles to produce the desired results.

This assignment consists of several Tasks. Each should be answered using a combination of written text and output from R, in the format of a mini-report. All answers need explanation and justification. Any recommendations and conclusions you draw should be supported with evidence. Your descriptions will need both a technical and a non-technical component (see the Tasks for more details).

Organise your submission carefully with headings and a clear section structure. We recommend that your response for each Task form. its own section, with further sub-sections as required.

Presentation is important, as is conveying your results/recommendations/findings in a succinct and informative way. You need to include anything that is relevant to your specific analyses, and avoid including superfluous information.

All plots must be properly labelled and explained. For example, the definition of each axis should be clear and also what each visual element (points, lines, etc.) represents.

Show and evaluate all code chunks using the echo = TRUE and eval = TRUE chunk options, and format the output so that it does not run off the page when printed. You can also suppress all other messages and warnings, as per the following command (to be included in the first code chunk of your R Markdown file):

knitr::opts_chunk$set(echo = TRUE, eval = TRUE, warning = FALSE,

message = FALSE, error = FALSE)

Anything that is not part of your answer should not be included in the R Markdown file, even if it is not evaluated or does not appear in the rendered PDF file.

Your Group Number (as allocated on Moodle), and all group members’ names and student ID numbers must be stated in the YAML section of the R Markdown file and on any other files submitted.

Submission

This assignment is due on Thursday 10 October 2024 at 23:55.

You need to upload two (2) separate files to the link on Moodle for your group: the Rmd file and the corresponding PDF file. You may create an HTML or Word document from your Rmd file, but you must convert this to PDF for submission.

Your files should be named according to the following templates:

GroupNumber_A1.Rmd

GroupNumber_A1.pdf

Replace GroupNumber with your Group Number (as shown on Moodle). For example, if your Group Number is “Group92”, your files should be named:

Group92_A1.Rmd

Group92_A1.pdf

Submission of incorrect files or file names will lead to loss of marks.

Your files must be submitted separately, rather than combined together as a ZIP file or similar.

Marking

There is a total of 100 marks available for this assignment.

There are five (5) broad Tasks, which together consist of 90 marks (specific mark allocations are given below).

The remaining 10 marks will be awarded for your submission as a whole, as follows:

• Overall presentation [5 marks].

• Reproducibility [5 marks]. Whether the submitted Rmd files compile as submitted without error.

Peer evaluation

As part of the marking, you will also need to complete a peer evaluation survey to rate the contribution of each team member. The survey responses may be used to adjust your individual marks. We will provide more information about this process prior to the due date.

Introduction

A bank manager would like to get a better understanding of her current customers, of which there are 780,000. She has commissioned you to survey them to find out their incomes and education levels, and then 代寫ETC2420 / ETC5242 Statistical Thinking analyse their responses. She has some specific questions the she wants you to answer, which her team will use to help inform. their product design and marketing strategy.

The manager has some understanding of descriptive statistics, but is not very familiar with statistical modelling and inference. However, she has heard about various approaches, such as the bootstrap and Bayesian inference, and is happy for you to use any techniques and to explain the results to her in a non-technical way.

Data

Using a database of all of the bank’s customers, you select 200 at random and send them a questionnaire to ask the following questions:

1. What was your annual gross income in the previous financial year?

2. What is the highest level of education you have received?

Possible options include:

• Doctorate

• Master degree

• Bachelor degree

• High school or lower

The file called banksurvey.csv contains the customers’ responses, with one row per customer.

The income column is in units of thousands of dollars.

Tasks

For each of Tasks 1–5, there is a set of required analyses and components that you should carry out and include in your response. For each Task, write a short report that includes:

• Explanations of what analyses you are carrying out.

• The results of your analyses (including plots, numerical results, and conclusions).

• Enough details about your analysis choices and assumptions so that someone with technical knowledge (e.g., another statistician) would understand them and could replicate your work.

• A summary of your methods and findings, and any key assumptions, in a form. that would be under-standable to the manager (e.g., using non-technical language).

You should be able to cover what you need in about 3 or 4 paragraphs of text for each Task. Note that some Tasks may require more explanation than others.

Task 1 [20 marks]

The manager wants to understand the distribution of her customers’ incomes. She would like some appropriate descriptive and visual summaries of the data. In addition, she would like a fitted model, which her team will use to help inform. their product design and marketing strategy.

Required analyses and components:

• One or more appropriate plots to visualise the data.

• Some appropriate descriptive statistics.

• Explore fitting various models. Include at least the following distributions: normal, exponential, gamma.

• Your exploration should include using appropriate QQ plots and calculating maximum likelihood estimates.

Task 2 [18 marks]

The manager is particularly interested in estimating the 80th percentile of annual income of her customers. Use a variety of approaches and provide interval estimates for each.

Required analyses and components:

• Use estimators that are based on fitting a model. Provide at least one estimate for each model fitted in Task 1 (therefore, you need at least 3 such estimates).

• It is also possible to use an estimator that doesn’t assume any specific model, by using a statistic known as a sample quantile. There are many ways to define such a statistic, most of which are implemented in the R function quantile(). The default choice for that function is what are known as ‘Type 7’ quantiles (as specified by the argument type = 7). Using this default choice, estimate the 80th percentile of annual income.

• Calculate 95% confidence intervals based on each of the estimators above. (Use any method that you think is appropriate, and justify your choice.)

• In your report, explain the key statistical ideas to the manager, including what the parameter of interest is, how it relates to the data, and what assumptions you are making for each estimate.

• Provide an overall conclusion, including recommendations about which of these estimates are likely to be more useful or reliable.

Task 3 [18 marks]

The manager wants to learn more about the different approaches to estimation that you have used in Task 2, including their strengths and weaknesses. Explain and illustrate the statistical properties of each estimator.

Required analyses and components:

• Simulate the sampling distribution of each estimator, assuming the (population) distribution of annual incomes follows an exponential distribution with mean $100,000.

• Compare these sampling distributions using appropriate plots and numerical summaries. For the latter, you should include at least the bias and standard deviation.

• Explain how to interpret these simulation results, and how you could use this type of simulation to select an appropriate estimator to use.

• Explain what you expect to see in your results if you simulated from a gamma distribution (with shape parameter not equal to 1) instead of an exponential distribution.

Task 4 [16 marks]

The manager’s team wants to understand how their customers’ income relates to their level of education. They are particularly interested in whether there is a large difference in average income between those who completed a university degree and those who haven’t. Also, the team wants to know whether any such difference (if there is one) can be attributed to the fact that customers who more highly educated are able to find higher paid jobs.

Required analyses and components:

• Show appropriate plots that compare how the distribution of income varies across each of the four education groups.

• Calculate a 95% confidence interval for the difference in the mean annual incomes of customers who have completed a university degree and those who haven’t.

• Comment on whether it would be appropriate to use the Central Limit Theorem when calculating this confidence interval.

• Explain whether these data would be able to provide evidence of a causal effect of higher education leading to higher incomes. Specifically, if the confidence interval were to show a large positive difference, does this survey result imply that providing greater education would lead to higher incomes? (Remember to give reasons that clearly justify your answer.)

Task 5 [18 marks]

The manager wants to know whether there is a big difference in the 80th percentile incomes of those with and without a university degree.

Required analyses and components:

• Set this up using a Bayesian inference approach, as follows:

– Model each of the two education groups (with and without a university degree) separately.

– For each group, assume an exponential distribution for the annual income. Each group will have its own rate parameter, λg, where g = 0 refers to the group with no university degree and g = 1 to the group that has a university degree.

– For each group, use a conjugate prior. Select hyper-parameter values such that for each λg it gives a prior mean of 1 and prior variance of 1.

• State the posterior distribution for each λg and calculate 95% credible intervals.

• Simulate 10,000 values from the posterior distribution of the difference between the 80th percentile incomes between the two groups. (Note: you can’t simulate these directly, so think carefully about what you can simulate from and what steps you need to take to achieve the desired simulated values.)

• Calculate an approximate 95% credible interval for the difference between the 80th percentile incomes between the two groups.

statistical thinking in Python EDA
2018-10-08
ThinkingPython
SQL Thinking
2024-07-15
SQLThinking
Statistical Computing and Empirical Methods
2024-11-14
statistical_c03
2024-06-10
Statistical Process Control in SAP
2021-07-15
Thinking of Consultant:Trustworthy
2020-04-05
ThinkingRust
翻譯|Thinking Statefully
2019-04-27
ThinkingStatefully
EMATM0061: Statistical Computing and Empirical
2024-09-29
EMATM0061 Statistical Computing and Empirical
2024-11-21
【Bias 05】Representation Learning with Statistical Independence to Mitigate Bias
2020-11-26
MIT
Thinking in UML（第一章）
2020-01-05
Thinking
Design Thinking的適用性思考
2022-05-08
Thinking
machine learning model(algorithm model) .vs. statistical model
2018-08-16
MacGo
【RxSwift 實踐系列 3/3】thinking in Rx- UITableView
2018-03-05
SwiftThinkingUIView
SAS JMP Statistical Discovery Pro for mac 資訊統計處理工具
2021-08-31
Mac
Thinking in Java---多執行緒學習筆記(2)
2020-04-04
ThinkingJava執行緒筆記
Thinking in Java--使用NIO實現非阻塞Socket通訊
2020-04-04
ThinkingJava
【RxSwift 實踐系列 2/3】thinking in Rx- Create和Drive
2018-03-05
SwiftThinking
Thinking in Java--Java多執行緒學習筆記(1)
2020-04-04
ThinkingJava執行緒筆記
Thinking in Java---如何正確的終止子執行緒
2020-04-04
ThinkingJava執行緒
執行緒間協作-《thinking in java》讀書筆記（一）
2019-02-01
執行緒ThinkingJava筆記
Thinking, Fast and Slow-Ch12-13：可得性效應、效用層疊、概率忽略
2018-04-17
ThinkingAST
推理最強也最快，谷歌釋出Gemini 2.0 Flash Thinking，全面超越o1-preview
2024-12-20
谷歌ThinkingView
Thinking in Java---執行緒通訊+三種方式實現生產者消費者問題
2020-04-04
ThinkingJava執行緒
SciTech-BigDataAIML-Statistical Model-Bayes Inference-資料/事實 ∩ 假設: 政治經濟、社會和科學分析
2024-11-16
AI
Apache Software Foundation Thinking in Java:好的OOP語言都具有作為開發包一部分的一組容器
2020-12-27
ApacheThinkingJavaOOP
Apache Software Foundation Thinking in Java:分發資訊這個簡單思想的複雜性實際上是有很多不同層次的
2020-12-27
ApacheThinkingJava

ETC2420 / ETC5242 Statistical Thinking

相關文章