Statistical Computing and Empirical Methods

Posted by 人生苦短6 on 2024-11-14

Summative Assessment

Statistical Computing and Empirical Methods, Teaching Block 1, 2024

Introduction

This document contains the specification for the summative assessment for the unit Statistical Computing and Empirical Methods, TB1 2024. Please read the following instructions carefully before you start answering the questions.

Deadline: Your report is due on 28 November 2024 at 13:00.

Rules: This is an independent task. For the summative assessment you should not share your answers with your colleagues. The experience of solving the problems in this project will prepare you for real problems in your career as a data scientist. If someone asks you for the answer, resist!

Support: Whilst this is an independent task, there is a lot of support available if you need it. If you are unclear about what is required for any part of the assessment, then discuss this issue with our teaching team in the computer lab or contact your unit director.

Plagiarism: Be very careful to avoid plagiarism. For more details, you should consult the “Academic Integrity” section under the Assessment tab within the central Blackboard page for the School of Engineering Mathematics & Technology.

The use of generative AI: The use of generative AI, such as ChatGPT, is prohibited. Any use of generative AI in this assessment will be considered as plagiarism.

Extenuating circumstances: For more details on the procedure for extenuating circumstances, consult the “Assessment support options” section under the Assessment tab within the central Blackboard page for the School of Engineering Mathematics & Technology.

Late submission penalty: Coursework that is submitted after a deadline should be subject to a late submission penalty, unless there is an extension or a justified exceptional circumstance. For more details, you should consult the central Blackboard page for the School of Engineering Mathematics & Technology or contact the School office.

Clarity: Clarity is highly important. Make sure you clearly explain each step in your answer. You should also include comments within your code when necessary. Your answer should clearly demarcate which part of the question you are answering. Whenever possible, include pieces of well-written code in your report to promote clarity.

Programming language: For Section A of this coursework you should use Tidyverse methods within the R programming language. For Section B and Section C, you can use either R or Python. Regardless of your choice of language, it is essential that your answers are clear and well-written.

Submission points: To submit your solutions, please visit the “Assessment, submission and feedback” tab on the course webpage at Blackboard. Make sure your submission follows the submission structure described below.

Multiple submissions: Submitting the coursework multiple times before the deadline is allowed. However, only the last submission will be considered for marking. You can submit a temporary copy before your final submission if you like.

Submission structure: Please submit a single zip file that contains a folder named “SCEM_???” where “???” should be replaced by your unique UoB username (e.g., lf22553). The folder should contain three subfolders named “A”, “B” and “C”.

  1. Subfolder "A" should include 1) a PDF file that contains your answers to Section A, and 2) a folder containing the code and data being used for Section A.
  2. Subfolder "B" should include 1) a PDF file that contains your answers to Section B, and 2) a folder containing the code and data being used for Section B.

Time allocation: Sections A & B together and Section C each carry 50 marks, but we recommend that you allocate more time to the tasks in Section C, for example 40% on Sections A & B and 60% on Section C.

UoB confidential

Section A (20 marks)

General instruction: In this part of your assessment, you will perform a data wrangling task using R programming. Note that clarity is highly important. Make sure you clearly explain each step in your answer. You should also include comments within your code when necessary. In addition, make the structure of your answer clear through the use of headings. You should also make sure your code is clean by making careful use of Tidyverse methods in R.

(Q1). First, download the files entitled "debt_data.csv", "country_data.csv" and "indicator_data.csv", which are available within the Assessment section within Blackboard.

The file "debt_data.csv" contains debt data for different countries under different indicators, starting from 1960. The file "indicator_data.csv" contains a list of the indicator names as well as their associated indicator codes. The file "country_data.csv" contains information about the country code, income levels, and regions for each country.

First, load the file "debt_data.csv" into an R data frame called "debt_df", load the file "country_data.csv" into an R data frame called "country_df", and load the file "indicator_data.csv" into a data frame called "indicator_df".

Second, use R to check the number of columns and the number of rows that the data frame "debt_df" has. Display your results.

(Q2). Update "debt_df" by reordering its rows such that the values of the indicator "DT.NFL.BLAT.CD" are in descending order. Display a subset of the updated "debt_df" consisting of the first 4 rows and the columns "Country.Code", "Year", "NY.GNP.MKTP.CD", and "DT.NFL.BLAT.CD".

(Q3). In the data frame "debt_df", the indicators are represented by their associated indicator codes rather than by their names. The data frame "indicator_df" contains a list of indicator names and their corresponding indicator codes. Create a new data frame called "debt_df2" by combining the data from the two data frames "debt_df" and "indicator_df". The new data frame "debt_df2" should be identical to "debt_df" except that "debt_df2" now contains indicator names rather than indicator codes. The indicator names in "debt_df2" should match the indicator codes in "debt_df" according to their correspondence described in "indicator_df".

Display a subset of "debt_df2" consisting of the first 5 rows and the three columns "Country.Code", "Year", and "Net financial flows, others (NFL, current US$)".

(Q4). The data frame "country_df" contains information about Region, Income groups, and country name for each country. Create a new data frame called "debt_df3" by combining data from the two data frames "debt_df2" and "country_df". The new data frame "debt_df3" should contain a) all columns from "debt_df2" and b) 3 columns from "country_df" called "Region", "IncomeGroup", and "Country.Name".

Make sure that in each row of "debt_df3", the "Region", "IncomeGroup", and "Country.Name" match "Country.Code" according to their correspondence described in "country_df".

Your data frames "debt_df3" and "debt_df2" should have the same number of rows, but "debt_df3" has three more columns.

Display a subset of "debt_df3" consisting of the first three rows and 4 columns called "Country.Name", "IncomeGroup", "Year", and "Total reserves in months of imports".

Section B (30 marks)

B.1

Suppose a product is being sold in a supermarket. We are interested in knowing how quickly the product returns to the shelf after it is sold out. Let X be a continuous random variable denoting the length of time between the time point at which it is sold out and the time point at which it is placed on the shelf again. So X should be a non-negative number, and X = 0 means that the product gets on the shelf immediately after it is sold out. Here, we assume that the probability density function of X is given by

[...] x < b,

where b > 0 is a known constant, λ > 0 is a parameter of the distribution, and a is to be determined by λ and b.

(1) First, determine the value of a: derive a mathematical expression of a in terms of λ and/or b.

(2) Derive a formula for the population mean and standard deviation of the random variable X with parameter λ.

(3) Derive a formula for the cumulative distribution function and the quantile function for the random variable X with parameter λ.

(4) Suppose that $X_1, \dots, X_n$ are independent copies of X with the unknown parameter λ > 0. What is the maximum likelihood estimate $\lambda_{\mathrm{MLE}}$ for λ?

Now download the .csv file entitled “supermarket_data_2024” from the Assessment section within Blackboard. The .csv file contains data on the length of time (in seconds) taken by a product to get on the shelf again after being sold out. So the sample is a sequence of time lengths. Let’s model the sequence of time lengths in our sample as independent copies of X (X is the random variable mentioned above) with parameter λ and known constant b = 300 (seconds). Answer the following questions (5) and (6).

(5) Given the sample, compute and display the maximum likelihood estimate $\lambda_{\mathrm{MLE}}$ of the parameter λ.

(6) Apply the method of Bootstrap confidence interval to obtain a confidence interval for λ with a confidence level of 95%. To compute the Bootstrap confidence interval, the number of resamples (i.e., subsamples that are generated to compute the bootstrap statistics) should be set to 10000.

Next, conduct a simulation study to explore the behaviour of the maximum likelihood estimator:

(7) Conduct a simulation study to explore the behaviour of the maximum likelihood estimator $\lambda_{\mathrm{MLE}}$ for λ on simulated data $X_1, \dots, X_n$ (as independent copies of X with parameter λ) according to the following instructions. Let b = 0.01 and the true parameter be λ = 2. Generate a plot of the mean squared error [...]

  1. For each sample size, consider 100 trials. In each trial, generate a random sample $X_1, \dots, X_n$ (as independent copies of X with parameter λ = 2), and then compute the maximum likelihood estimate $\lambda_{\mathrm{MLE}}$ for λ based upon the sample. Display a plot of the mean squared error of $\lambda_{\mathrm{MLE}}$ as an estimator for λ as a function of the sample size n.
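The resampling step behind the Bootstrap confidence interval in question (6) can be sketched in Python as follows. This is only a minimal illustration of one common variant, the percentile bootstrap; the specification does not say which bootstrap interval is intended, so that choice is an assumption. The function name `bootstrap_percentile_ci`, the toy exponential data and the use of the sample mean as the estimator are hypothetical placeholders: in the assessment the estimator would be the maximum likelihood estimate, which depends on the density of X.

```python
import random
import statistics


def bootstrap_percentile_ci(sample, estimator, n_resamples=10000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for estimator(sample).

    The percentile method is an assumed choice; other bootstrap intervals exist.
    """
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement (same size as the data) and re-estimate each time.
    stats = sorted(estimator(rng.choices(sample, k=n)) for _ in range(n_resamples))
    alpha = 1.0 - level
    # Floor-based indices into the sorted bootstrap statistics.
    lower = stats[int((alpha / 2) * (n_resamples - 1))]
    upper = stats[int((1 - alpha / 2) * (n_resamples - 1))]
    return lower, upper


# Hypothetical toy data standing in for a real sample (NOT the coursework data).
toy_rng = random.Random(1)
toy_sample = [toy_rng.expovariate(1 / 60) for _ in range(200)]

# Placeholder estimator: the sample mean, used only to make the sketch runnable.
ci_low, ci_high = bootstrap_percentile_ci(toy_sample, statistics.fmean, n_resamples=2000)
print(f"95% percentile bootstrap CI: ({ci_low:.2f}, {ci_high:.2f})")
```

To match the setting described above, `n_resamples` would be left at its default of 10000 and `estimator` would be replaced by a function that computes the maximum likelihood estimate from a resample.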

B.2

Consider a bag of a red balls and b blue balls (the bag has a + b balls in total), where a ≥ 1 and b ≥ 1. We randomly draw two balls from the bag without replacement. That means we draw the first ball from the bag and, WITHOUT returning the first ball to the bag, we draw the second one. Each ball has an equal chance of being drawn. Now we record the colour of the two balls drawn from the bag, and let X denote the number of red balls minus the number of blue balls. So X is a discrete random variable. For example, if we draw one red ball and one blue ball, then X = 0. Answer the following questions from (1) to (11).

(1) Give a formula for the probability mass function $p_X : \mathbb{R} \to [0, 1]$ of X.

(2) Use the probability mass function $p_X$ to obtain an expression of the expectation E(X) of X (i.e., the population mean) in terms of a and/or b.

(3) Give an expression of the variance Var(X) of X in terms of a and b.

(4) Write a function called compute_expectation_X that takes a and b as inputs and outputs the expectation E(X). Write a function called compute_variance_X that takes a and b as inputs and outputs the variance Var(X). Display your code.
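One way to sanity-check hand-derived formulas for questions (1) to (4) is to enumerate every ordered pair of draws and compute the distribution of X exactly. The sketch below does this with exact fractions. It is a brute-force check under the drawing scheme described above, not the closed-form functions the question asks for, and the names `exact_pmf_X` and `moments_from_pmf` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def exact_pmf_X(a, b):
    """Exact pmf of X = (#red - #blue) for two draws without replacement."""
    balls = [1] * a + [-1] * b  # +1 marks a red ball, -1 marks a blue ball
    counts = {}
    total = 0
    # Every ordered pair of distinct balls is equally likely.
    for first, second in permutations(range(len(balls)), 2):
        x = balls[first] + balls[second]
        counts[x] = counts.get(x, 0) + 1
        total += 1
    return {x: Fraction(c, total) for x, c in sorted(counts.items())}


def moments_from_pmf(pmf):
    """Mean and variance computed directly from a pmf given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, variance


pmf = exact_pmf_X(3, 5)
mean, variance = moments_from_pmf(pmf)
print(pmf)
print(mean, variance)
```

A closed-form expression for E(X) or Var(X) can then be compared against `mean` and `variance` for several small values of a and b.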

In the following questions, we additionally assume that $X_1, X_2, \dots, X_n$ are independent copies of X. So $X_1, X_2, \dots, X_n$ are i.i.d. random variables having the same distribution as that of X. Let $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ be the sample mean.

(5) Give an expression of the expectation of the random variable $\bar{X}$ in terms of a, b.

(6) Give an expression of the variance of the random variable $\bar{X}$ in terms of a, b and n.

(7) Create a function called sample_Xs which takes as inputs a, b and n and outputs a sample $X_1, X_2, \dots, X_n$ of independent copies of X.

(8) Let a = 3, b = 5 and n = 100000. First, compute the numerical value of E(X) using the function compute_expectation_X and compute the numerical value of Var(X) using the function compute_variance_X. Second, use the function sample_Xs to generate a sample $X_1, X_2, \dots, X_n$ of independent copies of X. With the generated sample, compute the sample mean $\bar{X}$ and sample variance. How close is the sample mean $\bar{X}$ to E(X)? How close is the sample variance to Var(X)? Explain your observation.

Moreover, let µ := E(X) and $\sigma := \sqrt{\mathrm{Var}(X)/n}$ (the random variable X is defined above), and let $f_{\mu,\sigma} : \mathbb{R} \to [0, \infty)$ be the probability density function of a Gaussian random variable with distribution $N(\mu, \sigma^2)$, i.e., the expectation is µ and the variance is σ². Next, conduct a simulation study to explore the behaviour of the sample mean $\bar{X}$ by answering questions (9)-(11).

(9) Let a = 3, b = 5 and n = 100. Conduct a simulation study with 50000 trials. In each trial, generate a sample $X_1, \dots, X_n$ of independent copies of X. For each of the 50000 trials, compute the corresponding sample mean $\bar{X}$ based on $X_1, \dots, X_n$.

(10) Create a scatter plot of the points $\{(x_i, f_{\mu,\sigma}(x_i))\}$ where $\{x_i\}$ is a sequence of numbers between µ − 3σ and µ + 3σ in increments of 0.1σ. Then append to the scatter plot a curve representing the kernel density of the sample mean $\bar{X}$ within your simulation study (with 50000 trials). Use different colours for the points $\{(x_i, f_{\mu,\sigma}(x_i))\}$ and the density curve of the sample mean $\bar{X}$.

(11) Describe the relationship between the density of $\bar{X}$ and the function $f_{\mu,\sigma}$ displayed in your plot. Try to explain the reason.

Section C (50 marks)

In this part of the assessment, you are asked to complete a Data Science report which demonstrates your understanding of a statistical method. The goal here is to choose a topic that you find interesting and explore that topic in depth. You are free to choose a topic and data set that interests you.

There will be an opportunity to discuss and get advice on your chosen direction in the computer labs.

Below are two flexible example structures you can consider for this section of your report. If you are unsure what to do, choose one of the following. Note that you should not submit more than one of the example tasks below.

Example task 1

Investigate a particular hypothesis test, e.g. a Binomial test, a paired Student’s t test, an unpaired Student’s t test, an F test for ANOVA, a Mann-Whitney U test, a Wilcoxon signed-rank test, a Kruskal-Wallis test, or some other test you find interesting.

Note that clarity of presentation is highly important. In addition, you should aim to demonstrate a depth of understanding. For this hypothesis test you are asked to do the following:

  1. Give a clear description of the hypothesis test being considered, including the details of the test statistic and p-value, the underlying assumptions, the null hypothesis and the alternative hypothesis. Give an intuitive explanation for why the test statistic is useful in distinguishing between the null and the alternative.

  1. Perform a simulation study to investigate the probability of a Type I error under the null hypothesis for your chosen test, and compare how often a Type I error is made with the significance level of the test. What happens when a different significance level is used?
  1. Choose a suitable real-world data set (for example, some places to find data sets are described below). Ensure that your chosen data set is appropriate for your chosen hypothesis test. For example, if your chosen hypothesis test is an unpaired t-test then your chosen data set must have at least one continuous variable and contain at least two groups. It is recommended that your data set for this task not be too large. You should explain the source and the structure of your data set within your report. You should also explain the related problem on which you want to perform the test.

  1. Carefully discuss the appropriateness of your statistical test in this setting and how your hypotheses correspond to different aspects of the data set. You may want to use plots to demonstrate the validity of your underlying assumptions. Draw a statistical conclusion and report the value of your test statistic, the p-value and a suitable measure of effect size.
  1. Discuss what scientific conclusions you can draw from your hypothesis test. Discuss how these would have differed if the result of your statistical test had differed. Discuss key experimental design considerations [...] cause and effect?
  1. Explore this hypothesis test further on one topic/direction of your choice. This could be, for example, discussing a property of the test such as how the power of the chosen test changes with sample size, significance level, or effect size. As another example, how robust is the test when assumptions are violated and is there a robust alternative? How does the test compare to its non-parametric alternatives? How does the frequentist test compare with its Bayesian alternative? These are just a few examples. Make a clear statement on the question of interest and your conclusions. The details of your approach to support your findings should be visible within your report, and experiments or simulation studies can be included if needed.
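The simulation study on Type I error described in the list above can be sketched as follows. The exact two-sided Binomial test is used purely as an example of a test from the list; the two-sided p-value convention (summing the probabilities of all outcomes no more likely than the observed one), the sample size, the number of trials and all function names are assumptions made for illustration.

```python
import math
import random


def binom_pmf(k, n, p):
    """Probability of k successes in n Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_values(n, p0):
    """Exact two-sided p-value for every possible count k = 0..n under H0: p = p0.

    Convention (an assumption): sum the probabilities of all outcomes that are
    no more likely than the observed one.
    """
    pmf = [binom_pmf(k, n, p0) for k in range(n + 1)]
    # Small relative tolerance so that outcomes with equal probability are kept.
    return [min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-7))) for k in range(n + 1)]


def type_one_error_rate(n, p0, alpha, trials, seed):
    """Fraction of simulated null data sets in which H0 is (wrongly) rejected."""
    rng = random.Random(seed)
    p_values = two_sided_p_values(n, p0)
    rejections = 0
    for _ in range(trials):
        # Data generated under the null hypothesis: n Bernoulli(p0) outcomes.
        k = sum(rng.random() < p0 for _ in range(n))
        if p_values[k] <= alpha:
            rejections += 1
    return rejections / trials


rate_05 = type_one_error_rate(n=30, p0=0.5, alpha=0.05, trials=5000, seed=1)
rate_01 = type_one_error_rate(n=30, p0=0.5, alpha=0.01, trials=5000, seed=1)
print(f"alpha = 0.05: estimated Type I error rate {rate_05:.4f}")
print(f"alpha = 0.01: estimated Type I error rate {rate_01:.4f}")
```

Because the Binomial distribution is discrete, the estimated rate for an exact test is expected to sit at or somewhat below the nominal significance level rather than match it exactly. For a test with a continuous test statistic the two would be expected to agree more closely.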


Example task 2

Investigate a particular method for supervised learning. This could be a method for either regression or classification, but it should be a method with at least one tunable hyperparameter. You could choose one from ridge regression, k-nearest neighbour regression, a regression tree, regularized logistic regression, k-nearest neighbour classification, a decision tree, a random forest or another supervised learning technique you find interesting.

Note that clarity of presentation is highly important. In addition, you should aim to demonstrate a depth of understanding.

Give a clear description of the supervised learning technique you will use, including the underlying principles and any assumptions. Explain how the training algorithm works and how new predictions are made on test data. Discuss what type of problems this method is appropriate for.

Choose a suitable data set where this method can be applied (for example, some places to find data sets are described below). Perform a train, validation, and test split. Be careful to ensure that your data set is appropriate for your chosen algorithm. For example, if you have chosen to investigate a classification algorithm then your chosen data set must contain at least one categorical variable. Your data set for this task does not need to be large to obtain good results. The size of your data set should not exceed 100MB and you should aim to use a data set well within this limit. Your report should carefully give the source for your data. In addition, describe your data set. How many features are there? How many examples? What type is each of the variables (e.g. categorical, ordinal, continuous, binary etc.)? You should also explain the associated problem that you will solve using your supervised learning method.

What is an appropriate metric for the performance of your model? Give a clear explanation of the metric. Explore how the performance of your model varies on both the training data and the validation data as you vary the amount of training data used. You should compare the performance of the models across different sizes of the training data.

Explore how the performance of your model varies on both the training data and the validation data as you vary a hyperparameter.
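As a rough illustration of the split and the hyperparameter exploration described above, the sketch below fits a k-nearest neighbour classifier (one of the methods listed) for several values of k and reports accuracy on the training and validation data. The synthetic two-class data, the 60/20/20 split, the choice of accuracy as the metric and all function names are assumptions made for this example; a real report would use a real data set and would usually rely on an established library.

```python
import math
import random


def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    neighbours = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if 2 * votes > k else 0


def accuracy(train, data, k):
    """Share of points in `data` whose label is predicted correctly from `train`."""
    correct = sum(knn_predict(train, x, k) == y for x, y in data)
    return correct / len(data)


def make_toy_data(n, rng):
    """Hypothetical two-class data: Gaussian clouds centred at (0, 0) and (1.5, 1.5)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 1.5 * label
        data.append(((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label))
    return data


rng = random.Random(42)
data = make_toy_data(300, rng)
# 60% training, 20% validation, 20% test (the data are already in random order).
train, validation, test = data[:180], data[180:240], data[240:]

results = {}
for k in (1, 5, 15, 45):  # odd values of k avoid tied votes
    results[k] = (accuracy(train, train, k), accuracy(train, validation, k))
    print(f"k = {k:2d}: train accuracy {results[k][0]:.3f}, validation accuracy {results[k][1]:.3f}")

# Pick k on the validation data, then touch the test data only once.
best_k = max(results, key=lambda k: results[k][1])
test_accuracy = accuracy(train, test, best_k)
print(f"chosen k = {best_k}, test accuracy {test_accuracy:.3f}")
```

With k = 1 every training point is its own nearest neighbour, so the training accuracy is perfect while the validation accuracy is not; comparing the two columns as k grows is one way to see overfitting give way to smoother decision rules.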

  1. Choose a hyperparameter and report your performance based on the test data. Can you get a better understanding by using cross-validation?
  1. Explore this supervised learning method further on one topic/direction of your choice. This could be, for example, discussing how the bias-variance trade-off impacts the performance of the chosen method. As another example, is your model robust? How does the performance of the method change when applied to imbalanced data sets? Does your method work on small data and, if not, is there a suitable alternative? You could also investigate how different regularisation techniques affect the model’s performance, or carefully compare the chosen method with other methods. These are just a few examples. Make a clear statement on the question of interest and your conclusions. The details of your approach to support your findings should be visible within your report, and experiments or simulation studies can be included if needed.

Further instruction for Section C.

Note:

  1. Do not complete and submit more than one of the above tasks. These are example tasks and you should only choose one. The goal here is to explore a topic in detail.
  1. You will be graded on the level of understanding of the key concepts demonstrated within your report. Additional marks will be given for more advanced methods, provided that a very strong level of understanding is displayed. However, you should avoid choosing complex methods without properly demonstrating your understanding. The main focus here is a clear understanding and you should not sacrifice understanding for the sake of complexity. A clear understanding of the basic concepts is paramount.
  1. You do not need to use large data sets. The data set you choose should not be larger than 100MB. This is an upper bound. You should aim to use a data set well within this limit.
  1. We expect that your approach should be visual and clear within the report itself. Therefore it is highly recommended to include pieces of clear and well-written code along with necessary comments and explanations within the report itself.

  1. We expect that you interpret and make sense of the experimental results obtained, instead of displaying a list of the results without explanation or analysis. A high-quality report should be able to use the experimental results to support its conclusions and findings in a consistent manner.
  1. We do not have a page limit for the report. A rough guideline is that your report should ideally be no more than 10 pages, if all figures and large pieces of code were removed. However, this is not a strict constraint. Again, clarity is highly important, and you should include sufficient details to demonstrate your approach and the level of understanding of the key concepts.

Data sets

There are a vast number of freely available data sets across the internet. Below are a few example sources. You are also welcome to use data sets from other sources. Any data you use should be freely available and accessible. The source of your data and the steps required to retrieve it should also be described within your main report.

You should also explain its structure, e.g. the number of rows and the number of columns, and what the data in each column of interest represent. You are encouraged to use tabular data throughout.
