CS1026A Fall 2024
Assignment 3: YouTube Emotions
Important Notes:
- Read the whole assignment document before you begin coding. This is a more complex specification than in past assignments, and the examples and templates near the end of this document will be important in solving this assignment.
- Assignments are to be completed individually. Use of tools to generate code, working with another person, or copying from online resources is not allowed and will result in a zero on this assignment regardless of how much was copied.
- A code template is given in Section 6 for your main.py and emotions.py files. We highly recommend using these as a starting point for your assignment. The code is also attached to the assignment on OWL.
Change Log:
. 4th: The comments.csv file attached to Brightspace had an unexpected unicode character in one of the comments that changed the outcome of some of the examples given in this document. comments.csv has now been corrected and the examples in this document updated to match.
Nov. 13th: A typo was found in the example for make_report() in Section 5. This has now been corrected. The output shown at the end of the document in Section 7 was still correct. This change has no impact on the autograder (it was marking correctly).
1. Learning Outcomes
By completing this assignment, you will gain skills relating to:
- Functions
- Dictionaries and lists
- Complex data structures
- Text processing
- Working with TSV and CSV files
- File input and output
- Exceptions in Python
- Simple module use
- Writing code that adheres to a given specification
- Working with a real-world problem
2. Background
With the emergence of social media sites such as YouTube, Facebook, Reddit, Twitter (also known as X), LinkedIn, and WhatsApp, more and more data is being produced and made accessible online in a textual format. This textual data, such as YouTube comments, Tweets, or Facebook posts, can be hard to process but is incredibly important for organizations as it offers a current snapshot of the public’s emotions (affinity) or sentiment about a topic at a given point in time. Having a live view of your customers’ current affinity towards your products or the public’s view of your political campaign can be critical for success.
Much work has been done towards the goal of creating large datasets of word affinity or sentiment. One such effort is the National Research Council (NRC) Emotion Lexicon, which is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).
Our goal in this assignment is to use a simplified version of the NRC Emotion Lexicon to classify YouTube comments based on one of the following emotions: anger, joy, fear, trust, sadness, or anticipation. Based on the emotion contained in each comment for a particular video, we then want to generate a report that details the most common emotions YouTube users have towards that video based on their comments.
3. Datasets
Your Python program will deal with two datasets: a keywords dataset that contains a simplified version of the NRC Emotion Lexicon (this dataset will remain the same for all tests of your program) and a Comma-Separated Values (CSV) file that contains the comments for a particular YouTube video (this dataset will change for each test of your program).
3.1 Keywords Dataset (TSV File)
The keywords.tsv file attached to this assignment contains a simplified version of the NRC Emotion Lexicon. This is a Tab-Separated Values (TSV) file in which each line of the file contains a single word and its emotional classification based on six emotions (anger, joy, fear, trust, sadness, and anticipation). Each word in the file may be classified as having one or more emotions.
The following is an example of the first 10 lines of this file where tab (\t) characters are represented by arrows (→):
Each line starts with a keyword and is followed by a score (0 or 1) for each emotion in this order: anger, joy, fear, trust, sadness, and anticipation. If a 1 is present, it means that keyword is related to that emotion. If a 0 is present, the keyword is unrelated to that emotion.
For example, according to the above, the word “abacus” is related to the emotion of trust and no other emotions. The word “abandon” is related to the emotions fear and sadness and no other emotions.
All words in the dataset will be related to at least one emotion. This file’s contents will remain the same for all tests but may be given a different filename based on the user’s input (e.g. it could be named keys.tsv or words.tsv rather than keywords.tsv).
3.2 Comments Dataset (CSV File)
The user will provide a Comma-Separated Values (CSV) file that contains a set of YouTube comments for a particular video. The name of this file will change based on the user’s input but will always end in .csv and have the same format.
The following is an example of a possible line from this file (the file may contain one or more lines). Note that this document wraps the line on to multiple lines, but in the file this is one line ended by a line break (\n):
2,PixelPioneer24,brazil,The excavation scenes in the movie were excellent but the unnecessary derision of the hero's motives seemed unfair. His eventuality of success was not adequately showcased.
Each line of this file will contain four values separated by a single comma character (,). The values will always be in the following order:
Comment ID, Username, Country, Comment Text
Comment ID is a unique positive integer identifier for the comment. Username is the username of the user who posted the comment. Country is the user’s home country, and Comment Text is the text of the comment posted by the user.
No value will contain a line break or a comma character.
The capitalization of country names could be different for each line even if it is for the same country, but the country will always be spelled the same. Space characters will only occur in the comment text or country name.
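The keyword line format from Section 3.1 can be sketched as follows. This is only an illustration under stated assumptions: parse_keyword_line is a hypothetical helper name that does not appear in this assignment, and the sample line is reconstructed from the description of “abandon” above (fear and sadness only), not copied from the actual file.

```python
# Emotion order as listed in Section 3.1.
EMOTIONS = ["anger", "joy", "fear", "trust", "sadness", "anticipation"]


def parse_keyword_line(line):
    """Hypothetical helper: split one TSV line into a word and its scores."""
    fields = line.rstrip("\n").split("\t")
    word = fields[0]
    # Pair each emotion name with the 0/1 score in the same column position.
    scores = {emotion: int(value) for emotion, value in zip(EMOTIONS, fields[1:])}
    return word, scores


# Sample line reconstructed from the prose description (an assumption).
word, scores = parse_keyword_line("abandon\t0\t0\t1\t0\t1\t0\n")
print(word, scores)
```

Keeping the emotion order in one list means the same list can later be reused wherever the specification asks for that fixed order.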
4. Tasks
In this assignment, you will write two Python files, emotions.py and main.py, that will attempt to determine the most common emotion expressed in a YouTube video’s comments. You will create a number of functions (as specified in the Functional Specification in Section 5) that will perform simple sentiment analysis on the YouTube comments.
To accomplish this, you will need to do the following:
- Accept input from the user. The user will specify the file names of the keywords and comments datasets as well as the name of the report file your program will create. The user will also input the name of the country they wish to filter the comments by.
- Read. Your program will read in the keyword and comments datasets and store them in the formats described in the functional specification (in Section 5).
- Clean. The text of the comments will be cleaned to remove any punctuation and convert them to all lowercase letters.
- Determine Emotion. You will use the keywords dataset to determine the overall emotion expressed in each comment.
- Report. Based on your analysis of each comment, you will create a report file that contains a summary of the most common emotion expressed as well as how common each emotion was (as specified in Section 5).
Additionally, you must follow the functional specification presented in Section 5 and the rules and requirements in Section 8.
5. Functional Specification
5.1 emotions.py
The functions described in this section should be present in your emotions.py file and must be used in some way in your program to read, clean, process, analyze, or report on the comments in the given dataset. Each function and its parameters must have the same name and spelling as specified below:
clean_text(comment)
This function should have one parameter, comment, which is a string that contains the text of a single comment from the comments dataset. The function should clean this text by replacing any characters that are not letters (A to Z) with space characters. It should also convert the comment’s text to all lower case.
This function should return the cleaned text as a string.
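The cleaning rule described above can be sketched as a character-by-character replacement. This is one possible approach under the assumption that "letters" means only the 26 unaccented English letters; it is not the template code from Section 6.

```python
def clean_text(comment):
    """Replace every non-letter with a space and lowercase the result."""
    cleaned = ""
    for character in comment.lower():
        if "a" <= character <= "z":
            cleaned += character
        else:
            # Digits, punctuation, and any other symbol become a single space.
            cleaned += " "
    return cleaned


print(clean_text("This4is-an example. It's a b*t silly.").split())
```

Because each removed character becomes its own space, a sequence such as a period followed by a space leaves two spaces in a row. Calling split() with no arguments, as in the print above, ignores such runs when looping over words.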
Example:
clean_text("This4is-an example. It's a b*t silly.")
will result in this output:
this is an example it s a b t silly
make_keyword_dict(keyword_file_name)
This function should read the Tab-Separated Values (TSV) keywords file as described in Section 3.1. keyword_file_name is a string containing the name of the keywords file. This function can safely assume that this file exists, is in the current working directory, and is properly formatted. Checks on the file’s existence will be done in the main.py file later in this document.
The function should return a dictionary with a key for each word in the file. The value for each key should be a new dictionary for that keyword that contains a value for each emotion (anger, joy, fear, trust, sadness, and anticipation).
Example: Assuming that keywords.tsv contains the following three lines (where → is a tab character):
make_comments_list(filter_country, comments_file_name)
This function should read the Comma-Separated Values (CSV) file as described in Section 3.2. comments_file_name is a string containing the name of the CSV file and filter_country is a string containing either a country name or the string “all”. This function should read the CSV file and return a list containing only comments for the given country listed in filter_country (or all countries if the string “all” is given).
The list should contain one element for each comment in the file that matches the country in the filter (or all comments if “all” is given). Each element in the list should be a dictionary that contains a key for the Comment ID, Username, Country, and Comment Text. The keys should be named 'comment_id', 'username', 'country', and 'text' respectively.
The comment text should be stripped of any leading and trailing whitespace.
Example 1: Assuming that comments.csv only contains the following two lines (note that the lines are wrapped in this document and in the .csv file this is only two lines):
1,RetroRealm77,united states,I was a bit disappointed with the film's portrayal of childhood heroism. It felt like the classic elements were just concealed under layers of unnecessary savagery and violence.
2,PixelPioneer24,brazil,The excavation scenes in the movie were excellent but the unnecessary derision of the hero's motives seemed unfair. His eventuality of success was not adequately showcased.
then calling
make_comments_list("all", "comments.csv")
should result in the following nested list and dictionary data structure:
[ {'comment_id': 1,
'username': 'RetroRealm77',
'country': 'united states',
'text': "I was a bit disappointed with the film's portrayal of childhood heroism. It felt like the classic elements were just concealed under layers of unnecessary savagery and violence."},
{'comment_id': 2,
'username': 'PixelPioneer24',
'country': 'brazil',
'text': "The excavation scenes in the movie were excellent but the unnecessary derision of the hero's motives seemed unfair. His eventuality of success was not adequately showcased."} ]
Example 2: Given the same contents of comments.csv as in Example 1, if the following function call with the country name brazil was made:
make_comments_list("brazil", "comments.csv")
then the only element in the returned list would be:
[ {'comment_id': 2,
'username': 'PixelPioneer24',
'country': 'brazil',
'text': "The excavation scenes in the movie were excellent but the unnecessary derision of the hero's motives seemed unfair. His eventuality of success was not adequately showcased."} ]
Example 3: Given the same contents of comments.csv as in Example 1, if the function was called with a country name that was not present in the file such as:
make_comments_list("not a real country", "comments.csv")
then the resulting list would be empty:
[]
Note that to pass the Gradescope tests this function must return a list and not another collection such as a set or dictionary, the values of each list element must be a dictionary, and the keys used in that dictionary must match the spelling and lowercase capitalization given in this section.
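The line format and country filter described for make_comments_list can be sketched on in-memory strings as follows. The helper names parse_comment_line and filter_comments are hypothetical and not part of this specification, and storing the country in lowercase is an assumption made here to handle the varying capitalization mentioned in Section 3.2. Reading the lines from the file is left out.

```python
def parse_comment_line(line):
    """Hypothetical helper: turn one CSV line into a comment dictionary."""
    # No value contains a comma, so the line splits into exactly four parts.
    comment_id, username, country, text = line.split(",", 3)
    return {
        "comment_id": int(comment_id),
        "username": username.strip(),
        "country": country.strip().lower(),  # assumption: normalize case
        "text": text.strip(),
    }


def filter_comments(lines, filter_country):
    """Hypothetical helper: keep comments for one country, or for "all"."""
    comments = []
    for line in lines:
        comment = parse_comment_line(line)
        if filter_country == "all" or comment["country"] == filter_country:
            comments.append(comment)
    return comments


sample_lines = [
    "1,RetroRealm77,united states,I was a bit disappointed.\n",
    "2,PixelPioneer24,Brazil,The excavation scenes were excellent.\n",
]
print(filter_comments(sample_lines, "brazil"))
```

The sample lines are shortened versions of the comments in Example 1, with one country capitalized differently to show the case handling.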
classify_comment_emotion(comment, keywords)
This function takes the text of a comment and the keywords dictionary created by the make_keyword_dict function as parameters and classifies the comment as one of the possible emotions (anger, joy, fear, trust, sadness, and anticipation), returning the emotion as a string.
A comment is classified by first cleaning the text (using the clean_text function) and then checking each word in the comment against the keywords dictionary. A total for each possible emotion should be kept, with each word in the comment matching a keyword adding to the totals (based on the values in the keywords dictionary).
Example: For the comment:
The excavation scenes in the movie were excellent but the unnecessary derision of the hero's motives seemed unfair. His eventuality of success was not adequately showcased.
the text should be first cleaned using clean_text to get:
the excavation scenes in the movie were excellent but the unnecessary derision of the hero s motives seemed unfair his eventuality of success was not adequately showcased
then each word should be checked against the keywords dictionary and the totals for each emotion kept. Words not matching any words in the dictionary do not add to the scores. For example, using the full keywords.tsv dataset, several words in this comment have matches in the keyword dataset, and anticipation receives the highest total. Therefore, this comment would be classified as having the emotion of anticipation and the string “anticipation” should be returned by the function as it has the highest score.
In the event of a tie, the emotions should be given priority in this order: 1) anger, 2) joy, 3) fear, 4) trust, 5) sadness, and 6) anticipation.
Hint: You may find the string split method useful for looping through words rather than characters.
make_report(comment_list, keywords, report_filename)
This function takes the comment_list (created by the make_comments_list function), the keywords dictionary (created by the make_keyword_dict function), and a string containing the file name of the report to generate (report_filename) as parameters. A new file should be created with the file name in report_filename and it should contain the name of the most common emotion classification in the comment_list dataset as well as a count of the number of comments classified as each emotion. In the event of a tie, the emotions should be given priority in this order: 1) anger, 2) joy, 3) fear, 4) trust, 5) sadness, and 6) anticipation.
The format of the report should match the following example, which is based on the attached comments.csv and keywords.tsv with a country filter of “all”:
Most common emotion: anger
Emotion Totals
anger: 5 (33.33%)
joy: 2 (13.33%)
fear: 1 (6.67%)
trust: 3 (20.0%)
sadness: 3 (20.0%)
anticipation: 1 (6.67%)
The emotion totals should occur in the same order (regardless of the counts) but the values would be different depending on the comment_list and keywords dictionary passed to the function.
All percentages should be rounded to two digits and all six emotions should always be listed even if their count is zero. Important: in your report file each percentage must be written with one or two decimal places. A value such as 20.000% or 6.6700% would be wrong even though it is technically rounded, as there are too many decimal places. Your output must be formatted exactly as shown in the example above including the spacing and line breaks.
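The percentage rule above (rounded to two digits, printed with one or two decimal places) happens to match what Python's built-in round produces when the result is placed in an f-string. The sketch below illustrates this for a single report line; format_total_line is a hypothetical helper name, not a function required by this specification.

```python
def format_total_line(emotion, count, total):
    """Hypothetical helper: build one "emotion: count (percent%)" line."""
    # round(..., 2) yields a float such as 33.33 or 20.0, which prints with
    # one or two decimal places rather than a fixed number of digits.
    percent = round(count * 100 / total, 2)
    return f"{emotion}: {count} ({percent}%)"


print(format_total_line("anger", 5, 15))  # anger: 5 (33.33%)
print(format_total_line("trust", 3, 15))  # trust: 3 (20.0%)
print(format_total_line("fear", 1, 15))   # fear: 1 (6.67%)
```

A fixed-width format such as :.2f would instead print 20.00%, which does not match the example report.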
Return
The function should return the name of the most common emotion; in this example it
would be “anger”.
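The tie-breaking priority can be sketched by searching the emotions in priority order. Python's built-in max returns the first item encountered when several are tied for the largest key, so iterating over a list in the required order gives the required priority. The name most_common_emotion is a hypothetical helper, and the counts used are taken from the report examples in this document.

```python
# Priority order for ties, as given in this section.
EMOTIONS = ["anger", "joy", "fear", "trust", "sadness", "anticipation"]


def most_common_emotion(counts):
    """Hypothetical helper: highest count wins, earlier emotion wins ties."""
    return max(EMOTIONS, key=lambda emotion: counts.get(emotion, 0))


# Counts from the "brazil" example report: fear and anticipation are tied.
brazil_counts = {"anger": 0, "joy": 0, "fear": 1, "trust": 0,
                 "sadness": 0, "anticipation": 1}
print(most_common_emotion(brazil_counts))  # fear
```

The same idea applies to the per-word totals inside classify_comment_emotion, since that function uses the same priority order.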
Exception
In the event that the comment_list contains no comments (i.e. it is an empty list), the function should raise a RuntimeError containing the text “No comments in dataset!”.
Reminder: The report should be saved to a file and not output to the screen or returned
file and comments file that the data will be read from, as well as the name of the report file that will be created. It must use the functions defined in the emotions.py file to perform the tasks described in Section 4 and write the final report.
Your main.py file must contain the following two functions (ask_user_for_input and main) as specified:
ask_user_for_input()
This function takes no parameters but asks the user to input the file names of the keywords TSV file, the comments CSV file, the country to filter by, and the file name of the report to be generated. These three filenames and the country name are returned in a tuple in this order: 1) keyword filename, 2) comment filename, 3) country name (converted to lower case), and 4) report filename.
Example (of valid input):
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): Canada
Input the name of the report file (ending in .txt): report.txt
In each line above, the text after the prompt is the user's input. Note that the filenames and country are based on the user’s input and cannot be hardcoded to one set value. This means that the filenames could be different depending on the values input by the user.
In this case the following tuple would be returned:
('keywords.tsv', 'comments.csv', 'canada', 'report.txt')
Note that the country name was converted to all lowercase.
Exceptions
Your ask_user_for_input() function must complete the following checks on the user input. If the input does not pass a check, an Exception should be raised causing the function to exit immediately.
Exceptions should be raised as soon as the invalid input is given. For example, if the keyword file does not exist, an exception should be raised before asking the user to input the comments file name.
Check 1: File Extension
For each of the three filenames, if the user inputs a filename ending in the wrong file extension (.csv, .tsv, or .txt) the function should raise a ValueError exception with a message stating that the file extension is incorrect, such as “Keyword file does not end in .tsv!”. The text of this message must be exactly the following for each file:
Keyword File: “Keyword file does not end in .tsv!”
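The extension check can be sketched with the string method endswith. The helper name check_extension is hypothetical and not required by this specification. The exact message is passed in by the caller, so the sketch invents no wording beyond the keyword message quoted above.

```python
def check_extension(filename, extension, message):
    """Hypothetical helper: raise ValueError if the extension is wrong."""
    if not filename.endswith(extension):
        raise ValueError(message)


try:
    check_extension("badfile.pizza", ".tsv", "Keyword file does not end in .tsv!")
except ValueError as error:
    # Printing the exception shows only the message it carries.
    print(error)
```

Raising inside a small helper still exits ask_user_for_input immediately, since the exception propagates up through the caller unless it is caught there.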
Check 2: Files Exist
For the keyword and comment files you must check if the file exists using the os.path.exists function. If it does not, your function must raise an IOError exception with text explaining that the file does not exist. The message should have the text “<file name> does not exist!” where <file name> is replaced with the filename, such as “keywords.tsv does not exist!”, where keywords.tsv is the missing file.
For the report file, if the file already exists, an IOError should be raised with text stating
that “<file name> already exists!” where <file name> is the name of the report file. For
example “report.txt already exists!” where the report file is named report.txt. This is to
help prevent accidentally overwriting any files.
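The two existence checks point in opposite directions: input files must exist, while the report file must not. A minimal sketch using os.path.exists follows. The helper names check_input_file_exists and check_report_file_is_new are hypothetical; the message texts are the ones given in this section.

```python
import os.path


def check_input_file_exists(filename):
    """Hypothetical helper: keyword and comment files must already exist."""
    if not os.path.exists(filename):
        raise IOError(f"{filename} does not exist!")


def check_report_file_is_new(filename):
    """Hypothetical helper: the report file must not exist yet."""
    if os.path.exists(filename):
        raise IOError(f"{filename} already exists!")


try:
    check_input_file_exists("this_file_does_not_exist.tsv")
except IOError as error:
    print(error)
```

In Python 3, IOError is an alias of OSError, so an except clause written for either name catches the exception raised here.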
Check 3: Valid Country
Lastly you must check that the country input is either “all” or one of the following countries: 'bangladesh', 'brazil', 'canada', 'china', 'egypt', 'france', 'germany', 'india', 'iran', 'japan', 'mexico', 'nigeria', 'pakistan', 'russia', 'south korea', 'turkey', 'united kingdom', or 'united states'. If any other country or word is input, a ValueError should be raised with the text “<country> is not a valid country to filter by!” where <country> is the country the user input.
This subset of countries was chosen as they tend to occur in the datasets we are using more than others. In a more realistic scenario you would likely want to include all valid country names in this list, but this assignment limits it to the above-mentioned countries. Keep in mind that this only limits the countries a user can filter by; it does not limit what country names can occur in the dataset.
main()
This function handles calling the other functions in main.py and emotions.py to perform the tasks listed in Section 4. It should check for any exceptions being raised by the ask_user_for_input function, output the error message contained in the exception (this can be done by simply printing the exception with print()), and ask the user to input the values again if any exception is raised.
Once valid input has been received, it should call the functions from emotions.py required to analyze the comments and generate the report.
Lastly, it should output to the screen the most common emotion in the comment dataset. This should be displayed as “Most common emotion is: <emotion name>” where <emotion name> is the name of the emotion, such as “Most common emotion is: anger” if the emotion is anger.
If the make_report function raises a RuntimeError exception (e.g. the comment list was empty), it should output the message contained in that error.
Example 1:
Example 2:
For the same values in keywords.tsv and comments.csv but a country of “Canada”:
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): Canada
Input the name of the report file (ending in .txt): report_cad.txt
Most common emotion is: sadness
And the contents of report_cad.txt would be:
Most common emotion: sadness
Emotion Totals
anger: 1 (16.67%)
joy: 0 (0.0%)
fear: 0 (0.0%)
trust: 2 (33.33%)
sadness: 3 (50.0%)
anticipation: 0 (0.0%)
Example 3:
In this example invalid inputs are given, and the user is asked to input them again.
Input keyword file (ending in .tsv): not_a_real_file.tsv
Imports
It is important to import the files in the correct order and from the correct files. main.py should import emotions.py as shown in the template above and not the other way around.
7. Extra Example
The files keywords.tsv and comments.csv should be attached to this assignment on OWL. The result of running them with the following countries is given below:
Example 1: Country of “All”
Input/Output:
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): all
Input the name of the report file (ending in .txt): my_report.txt
Most common emotion is: anger
Contents of my_report.txt:
Most common emotion: anger
Emotion Totals
anger: 5 (33.33%)
joy: 2 (13.33%)
fear: 1 (6.67%)
trust: 3 (20.0%)
sadness: 3 (20.0%)
anticipation: 1 (6.67%)
Example 2: Country of “brazil”
Input/Output:
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): brazil
Input the name of the report file (ending in .txt): report_brazil.txt
Most common emotion is: fear
Contents of report_brazil.txt:
Most common emotion: fear
Emotion Totals
anger: 0 (0.0%)
joy: 0 (0.0%)
fear: 1 (50.0%)
trust: 0 (0.0%)
sadness: 0 (0.0%)
anticipation: 1 (50.0%)
Example 3: Country of “germany” (there are no comments for this country in the dataset)
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): germany
Input the name of the report file (ending in .txt): report.txt
Error: No comments in dataset!
Example 4: Invalid Inputs (these files do not exist or have the wrong extension)
Input keyword file (ending in .tsv): badfile.pizza
Error: Keyword file does not end in .tsv!
Input keyword file (ending in .tsv): this_file_does_not_exist.tsv
Error: this_file_does_not_exist.tsv does not exist!
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): badcsvfile.duck
Error: Comment file does not end in .csv!
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): not_a_real_csv_file.csv
Error: not_a_real_csv_file.csv does not exist!
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): not_a_real_country
Error: not_a_real_country is not a valid country to filter by!
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): JaPaN
Input the name of the report file (ending in .txt): badreportfile.exe
Error: Report file does not end with .txt!
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): JAPAN
Input the name of the report file (ending in .txt): already_exists.txt
Error: already_exists.txt exists, the report file can not already exist
Input keyword file (ending in .tsv): keywords.tsv
Input comment file (ending in .csv): comments.csv
Input a country to analyze (or "all" for all countries): jApAn
Input the name of the report file (ending in .txt): new_report_file.txt
Most common emotion is: anger
8. Non-Functional Specification
- Your code must be written for Python 3.10 or newer.
- You may not use any modules or third-party libraries not described in this document. Standard built-in functions such as the string, file, and math functions are fine. You should not have to import anything other than your emotions.py and the os.path module. TAs may manually remove marks from your Gradescope test if you violate this rule.
- You must document your code with brief comments. Each file should contain a comment at the top of the file with your name, Western e-mail, student number, and a brief description of what is contained in that file. At least one comment should
- Your program must be efficient and terminate within a reasonable time limit. All Gradescope test cases (including any hidden cases) must terminate within the autograder’s time limit.
- Assignments are to be done individually and must be your own original work. You may not show or otherwise share your code for this assignment with others or use tools to generate your code for you. Software will be used to detect academic dishonesty (cheating). If you have any questions about what is or is not academic dishonesty, please consult the document on academic dishonesty and ask any questions to your course instructor before submitting this assignment.
- You must follow Python style and coding conventions and good programming techniques, for example:
- Meaningful variable and function names.
- Use a consistent convention for naming variables, constants, and functions (snake case is recommended).
- Readability: indentation, white space, consistency.
- All of your code should be contained in the files main.py and emotions.py. Only submit these files and no others and ensure the filenames match exactly. It is your responsibility to ensure you have submitted the correct files.
- All function names and outputs should follow the specifications given in this document exactly. Not following the specifications may lead to test cases failing. It is your responsibility to ensure you have followed them correctly and your tests are passing.
- Hardcoding or any other attempt to fool Gradescope’s autograder will result in a zero grade for that test being manually assigned and possibly additional penalties. If you have any doubts about what is or is not hardcoding, please ask your instructor by posting to the course forums.
- Frequently back up your work remotely (e.g. using OneDrive) in a way that is secure and private. No extension will be given for lost or corrupted files or submitting incorrect files.
9. Marking and Submission
9.1 Submission
You must submit the 2 files (main.py and emotions.py) to the Assignment 3 submission page on Gradescope. There are several tests that will automatically run when you upload your files. The result of each test will be displayed to you, but not necessarily the exact details of the test. It is your responsibility to ensure all tests pass before the assignment due date. It is recommended that you create your own test cases to check that the code is working properly for a multitude of different scenarios (some example datasets have been provided for you with this document on OWL).
Assignments will not be accepted by email or by any other form other than a Gradescope submission.
9.2 Marking
The assignment will be marked as a combination of your auto-graded tests (both visible and hidden tests) and manual grading of your code logic, comments, formatting, style, etc. Marks will be deducted for failing to follow any of the specifications in this document (both functional and non-functional), not documenting your code with comments, using poor formatting or style, hardcoding, or naming your files incorrectly.
Marking Scheme:
- Autograded Tests: 24.5 points
- Header comment including your name, student ID, course info, creation date, and description of file: 1.5 points
- Descriptive in-line comments throughout code: 1 point
- Meaningful variable names: 1 point
Total: 28 points
9.3 Late Submissions
Late assignments will only be accepted up to 3 days late and only if you have enough late coupons remaining (at least one for each day late). If you submit one day late, you will need to use 1 late coupon; 2 days late, 2 late coupons; and 3 days late, 3 late coupons. If you have insufficient late coupons remaining or submit more than 3 days late, you will receive a zero grade on this assignment.
It is your responsibility to track your late coupon use. Any values shown on OWL should be considered an estimate and may not be accurate or up to date.
Unlimited resubmissions are allowed, but the late penalty will be determined by the date/time of your most recent (last) resubmission. This means if you resubmit past the deadline, your assignment will be considered lat