MATH38161 Multivariate Statistics and Machine Learning
Coursework, November 2024
Overview
The coursework is a data analysis project with a written report. You will apply skills and techniques acquired from Week 1 to Week 8 to analyse a subset of the FMNIST dataset. In completing this coursework, you should primarily use the techniques and methods introduced during the course. The assessment will focus on your understanding and demonstration of these techniques in alignment with the learning outcomes, rather than the accuracy or exactness of the final results. The project report will be marked out of 30. The marking scheme is detailed below.
- Software: You should mainly use R to perform the data analysis. You may use built-in functions from R packages or implement the algorithms with your own code.
- Report: You may use any document preparation system of your choice but the final document must be a single PDF in A4 format. Ensure that the text in the PDF is machine-readable.
- Content: Your report must include the complete analysis in a reproducible format, integrating the computer code, figures, text, etc. in one document.
- Title Page: Show your full name and your University ID on the title page of your report.
- Length: Recommended length is 8 pages of content (single sided) plus title page.
The deadline for submission is 11:59pm, Friday 29 November 2024.
- Submission is online on Blackboard (through Gradescope).
Academic Integrity and Use of AI Tools
This is an individual coursework. Your analysis and report must be completed independently, including all computer code. Note that according to the University guidance, output generated by AI tools is considered work created by another person.
- Citations: Acknowledge all sources, including AI tools used to support text and code writing.
- Ethics: Use sources in an academically appropriate and ethical manner. Do not copy verbatim, and cite the original authors rather than second- or third-level sources.
- Accuracy: Be mindful that sources, including Wikipedia and AI tools, may contain non-obvious errors.
Copying and plagiarism (= passing off someone else’s work as your own) is a very serious offence and will be strictly prosecuted. For more details see the “Guidance to students on plagiarism and other forms of academic malpractice” available at https://documents.manchester.ac.uk/display.aspx?DocID=2870 .
Analysis of the FMNIST data using principal component analysis (PCA) and Gaussian mixture models (GMMs)
The Fashion MNIST dataset contains 70,000 grayscale images of fashion products. The subset used in this coursework contains 10,000 images, each with dimensions of 28 by 28 pixels, resulting in a total of 784 pixels per image. Each pixel is represented by an integer value ranging from 0 to 255.
You can download this data subset as “fmnist.rda” (7.4 MB) from Blackboard.
load("fmnist.rda")  # load sampled FMNIST data set
dim(fmnist$x)       # dimension of features data matrix (10000, 784)
## [1] 10000   784
range(fmnist$x)     # range of feature values (0 to 255)
## [1]   0 255
Here is a plot of the first 15 images:
par(mfrow=c(3,5), mar=c(1,1,1,1))
for (k in 1:15)  # first 15 images
{
  m = matrix( fmnist$x[k,] , nrow=28, byrow=TRUE)
  # the plotting call is missing from the source; the next line is an assumed
  # reconstruction (rows reversed so that the first pixel row is drawn at the top)
  image(t(m[28:1,]), col=gray(seq(0, 1, length.out=256)), axes=FALSE)
}
Each sample is assigned to one label represented by an integer from 0 to 9 (as R factor with 10 levels):
fmnist$label[1:15]  # first 15 labels
## [1] 7 1 4 8 1 4 7 1 2 0 7 0 8 1 6
## Levels: 0 1 2 3 4 5 6 7 8 9
Task 1: Dimension reduction for FMNIST data using principal components analysis (PCA)
The following steps are suggested guidelines to help structure your analysis but are not meant as assignment-style questions. Integrate your work as part of a cohesive report with a logical narrative.
- Do some research to learn more about the FMNIST data.
- Compute the 784 principal components from the 784 original pixel variables.
- Compute and plot the proportion of variation attributed to each principal component.
- Create a scatter plot of the first two principal components. Use the known labels to colour the scatter plot.
- Construct the correlation loadings plot.
- Interpret and discuss the result.
- Save the first 10 principal components of all 10,000 images to a data file for Task 2.
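As a purely illustrative sketch (not a model answer), the mechanics of the steps above can be tried on a small synthetic data set before moving to the 784 pixel variables. Everything named here is an assumption made for illustration: the simulated matrix X, the two-level factor labels, and the file name pca_scores.rda. The sketch uses only base R (prcomp, cor, save).

```r
set.seed(1)

# Synthetic stand-in for the pixel matrix: 200 samples, 5 correlated variables
n <- 200
z <- matrix(rnorm(n * 2), n, 2)                       # two latent factors
A <- matrix(c(3, 2, 1, 0, 0,
              0, 0, 1, 2, 3), nrow = 2, byrow = TRUE) # mixing matrix (2 x 5)
X <- z %*% A + matrix(rnorm(n * 5, sd = 0.5), n, 5)   # observed data plus noise
colnames(X) <- paste0("v", 1:5)
labels <- factor(ifelse(z[, 1] > 0, "a", "b"))        # hypothetical known labels

# PCA on the centred (unscaled) variables
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Proportion of total variation attributed to each principal component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
plot(prop_var, type = "b", xlab = "Principal component",
     ylab = "Proportion of variance")

# Scatter plot of the first two principal components, coloured by label
plot(pca$x[, 1:2], col = as.integer(labels), pch = 20)

# Correlation loadings: correlation of each original variable with each PC score
corr_load <- cor(X, pca$x[, 1:2])
plot(corr_load, xlim = c(-1, 1), ylim = c(-1, 1), pch = 20)
text(corr_load, labels = rownames(corr_load), pos = 3)

# Save the leading component scores for later use (2 here, hypothetical file name)
scores <- pca$x[, 1:2]
out_file <- file.path(tempdir(), "pca_scores.rda")
save(scores, labels, file = out_file)
```

Note that cor() returns NA for any variable that is constant across all samples, so variables with zero variance need separate handling when correlation loadings are computed.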
Task 2: Analysis of the FMNIST data set using Gaussian mixture models (GMMs)
Using all 784 pixel variables for cluster analysis is computationally impractical. In this task, use the 10 (or fewer) principal components instead of the original 784 pixel variables. Again, these steps serve as guidelines. Integrate this work into your report logically following from Task 1.
- Cluster the data using Gaussian mixture models (GMMs).
- Find out how many clusters can be identified.
- Interpret and discuss the results.
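As a hedged illustration of the clustering steps above (again on synthetic data, not on the coursework data), one possible approach is the mclust package, which fits GMMs by EM and compares models by BIC. The use of mclust is an assumption here; the brief does not prescribe a package. The simulated matrix Y, the vector truth and the range G = 1:6 are likewise illustrative choices.

```r
set.seed(2)

# Synthetic stand-in for the PC scores: three Gaussian groups in two dimensions
n_per   <- 100
centres <- rbind(c(0, 0), c(5, 5), c(-5, 5))
truth   <- rep(1:3, each = n_per)                      # hypothetical true groups
Y <- centres[truth, ] + matrix(rnorm(3 * n_per * 2), ncol = 2)

# mclust is an assumed dependency; the block is skipped if it is not installed
if (requireNamespace("mclust", quietly = TRUE)) {
  suppressPackageStartupMessages(library(mclust))

  # Fit GMMs with 1 to 6 components; the model is selected by BIC
  fit <- Mclust(Y, G = 1:6)

  print(fit$G)            # number of mixture components selected
  print(fit$modelName)    # covariance structure selected

  # Compare the estimated clusters with the known groups
  print(table(truth, fit$classification))
  print(adjustedRandIndex(truth, fit$classification))

  # BIC for each number of components and covariance model
  plot(fit, what = "BIC")
}
```

In mclust the BIC is defined so that larger values indicate a preferred model. The cross-tabulation shows how the estimated clusters line up with the known groups, which is one way to support the interpretation of the result.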
Structure of the report
Your report should be structured into the following sections:
- Dataset
- Methods
- Results and Discussion
- References
In Section 1 provide some background and describe the data set. In Section 2 briefly introduce the method(s) you are using to analyse the data. In Section 3 run the analyses and present and interpret the results. Show all your R code so that your results are fully reproducible. In Section 4 list all journal articles, books, Wikipedia entries, GitHub pages and other sources you refer to in your report.
Marking scheme
The project report will be assessed out of 30 points based on the following rubrics.
Description of data (6 marks)
- Excellent (5-6 marks): Provides a clear and thorough overview of the FMNIST dataset, detailing the image structure, pixel data, and its context within multivariate analysis.
- Good (3-4 marks): Provides a clear overview of the dataset with some context; minor details may be missing.
- Adequate (1-2 marks): Basic description of the dataset with limited context; lacks important details.
- Insufficient (0 marks): Little to no description provided.
Description of Methods (6 marks)
- Excellent (5-6 marks): Clearly and thoroughly explains PCA and GMMs, their purposes, and how they apply to this dataset.
- Good (3-4 marks): Provides a clear explanation of PCA and GMMs, with minor gaps in clarity or relevance.
- Adequate (1-2 marks): Basic explanation of methods with limited detail or relevance to the course techniques.
- Insufficient (0 marks): Lacks clear explanations of the methods.
Results and Discussion (12 marks)
- Excellent (10-12 marks): Correctly applies PCA and GMMs, presents clear and informative visualisations, and provides a coherent and insightful interpretation of the results.
- Good (7-9 marks): Accurately applies PCA and GMMs with mostly clear visuals and reasonable interpretation; minor improvements needed.
- Adequate (4-6 marks): Basic application of techniques, limited or unclear visuals, minimal interpretation.
- Insufficient (0-3 marks): Incorrect application of techniques, with little to no interpretation.
Overall Presentation of Report (6 marks)
- Excellent (5-6 marks): Report is well-organised, clear, and professionally formatted, with a logical narrative and adherence to page limits.
- Good (3-4 marks): Report is generally clear and organised, with minor structural or formatting issues.
- Adequate (1-2 marks): Report lacks coherence or has significant formatting issues; may not meet all format requirements.
- Insufficient (0 marks): Report lacks structure and clarity, does not meet formatting requirements.