Syllabus
- Instructor: Tori Ellison, Department of Statistics
- Teaching Assistant: Jordan Deklerk, Department of Statistics
- Course Website: http://courses.las.illinois.edu/spring2024/stat437/
- Office Hours: Held in person and online. Visit our Canvas page for Zoom links.
Lecture Date/Time (CST)
Instructor: Dr. Tori Ellison, vellison@illinois.edu | TuTh 3:30-4:50pm CST | Location: Zoom |
Office Hours Date/Time (CST)
Instructor: Dr. Tori Ellison, vellison@illinois.edu | Fridays 1pm-3pm CST | Location: Zoom |
TA: Jordan Deklerk, deklerk3@illinois.edu | Wednesdays 2-4pm CST | Location: Zoom |
This course focuses on applied unsupervised learning. It covers common clustering and dimensionality reduction algorithms used by data scientists, with practical applications using Python.
- Learn Tools for Hidden Insights: Understand algorithms to discover and describe insights within high-dimensional unlabeled data.
- Full Unsupervised Learning Analysis: Gain proficiency in Python for conducting a comprehensive unsupervised learning analysis on real-world datasets, including:
- Justifying the usefulness of clustering or dimensionality reduction for a given dataset and research question.
- Selecting the most appropriate algorithms for clustering/dimensionality reduction based on dataset characteristics.
- Choosing algorithm parameters effectively.
- Evaluating algorithm performance using suitable metrics.
- Interpreting and communicating algorithm results.
- Understanding data preprocessing's impact on results.
- Utilizing insights for predictions and business decisions.
- Algorithm Understanding: Grasp the functioning and outputs of each algorithm, as well as their evaluation metrics. Develop intuition through 2D dataset applications.
- Practical Proficiency: Be able to conduct and calculate algorithm iterations and metrics manually, and to code these processes.
- Effective Communication: Demonstrate data presentation best practices, including proper graph titles and labeled axes.
- MATH415 Applied Linear Algebra or MATH257 Linear Algebra with Computational Applications: Some algorithms will involve knowledge of:
- Eigenvectors and Eigenvalues
- Orthogonality
- Matrix operations
- Vector Norms
- STAT410 Statistics and Probability II: Some algorithms will involve knowledge of:
- Maximum Likelihood Estimation
- Ideally, students will have had some experience using Python. However, we will dedicate parts of the lecture and some tutorials towards using Python (for the purpose of conducting an unsupervised learning analysis) for those who haven't.
Lectures
Synchronous ONLINE Lecture
TuTh 3:30-4:50pm CST
Zoom links
You aren't required to attend the lecture, but you're encouraged to attend if you can make it. Attending lectures provides the opportunity to learn, ask questions, and engage with the instructor, TAs, and classmates.
- Laptop Computer: You need a laptop running Windows, OS X, or Linux. Tablets, Chromebooks, and iPads are not supported. You will need to be able to install Python to complete the labs (instructions provided).
- Some version of Anaconda:
- Anaconda: A free and open-source distribution of Python and R, equipped with Python packages and applications including Jupyter Notebook.
- Jupyter Notebooks: Used for data science reports integrating interactive Python code blocks. Assignments involve working on and submitting Jupyter Notebooks.
- If you need help installing Anaconda and using Jupyter notebooks, follow these steps.
- Lecture notes: Posted on the course website, available within an hour of class.
- Online Books: Chapters and resources for understanding concepts and algorithms:
- Chapter 7, Cluster Analysis: Basic Concepts and Algorithms [P. Tan, M. Steinbach, A. Karpatne, and V. Kumar (2018) Introduction to Data Mining (2nd ed)]
- Chapter 8, Cluster Analysis: Additional Issues and Algorithms (on Canvas Files page) [P. Tan, M. Steinbach, A. Karpatne, and V. Kumar (2018) Introduction to Data Mining (2nd ed)]
- Appendix B, Dimensionality Reduction [P. Tan, M. Steinbach, A. Karpatne, and V. Kumar (2018) Introduction to Data Mining (2nd ed)]
Important Note: You will be expected to know content covered in lectures or assignments. We will not necessarily test you on ALL content from these resources. These resources can be used to understand concepts.
- Quick Refresher on Concepts:
There will be twelve homework assignments, each worth 30 points, given during the semester as we cover corresponding material. The two lowest homework scores will be dropped. Homework must be submitted through Canvas by 11:59pm CST on the due dates (see schedule).
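The drop-the-two-lowest rule can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not the official gradebook calculation; the function name `homework_total` and the example scores are hypothetical.

```python
def homework_total(scores, n_drop=2, max_per_hw=30):
    """Sum homework scores after dropping the n_drop lowest ones.

    Assumes each score is between 0 and max_per_hw (30 points per homework).
    """
    if any(s < 0 or s > max_per_hw for s in scores):
        raise ValueError("each score must be between 0 and max_per_hw")
    # Sort ascending and discard the first n_drop (lowest) scores.
    kept = sorted(scores)[n_drop:]
    return sum(kept)


# Hypothetical scores for the twelve assignments (one was missed entirely).
example_scores = [30, 27, 25, 30, 0, 28, 30, 21, 29, 30, 26, 30]
print(homework_total(example_scores))  # 285 (out of a possible 300)
```

With twelve 30-point assignments and two drops, the maximum homework total is 10 x 30 = 300 points.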
How to submit homework:
- All homework assignments should be submitted on Canvas.
- Coding assignments should be submitted as ipynb files.
- Some assignments may require handwritten answers. Scan or photograph these documents and submit them on Canvas as well.
- Clearly mark and number the questions you are answering in your code and scanned documents.
Late Policies:
- Homework submitted between 5 minutes and 24 hours late will lose 9 points.
- Homework late by more than 24 hours will receive 0 points.
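As a rough illustration, the late policy can be written as a small function. Two details are assumptions rather than stated policy: that submissions less than 5 minutes late are not penalized, and that a penalized score cannot fall below 0. The name `late_adjusted_score` is hypothetical.

```python
def late_adjusted_score(raw_score, minutes_late):
    """Apply the late policy to a single homework score.

    Assumption: under 5 minutes late carries no penalty.
    Assumption: the 9-point deduction cannot push a score below 0.
    """
    if minutes_late < 5:
        return raw_score
    if minutes_late <= 24 * 60:
        # Late by 5 minutes up to 24 hours: 9-point deduction.
        return max(raw_score - 9, 0)
    # Late by more than 24 hours: no credit.
    return 0


print(late_adjusted_score(30, 0))        # 30
print(late_adjusted_score(30, 90))       # 21
print(late_adjusted_score(30, 25 * 60))  # 0
```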
Regrade Policies:
You have ONE week to request a grade correction after a homework score is posted. Provide the following information to the head TA (Jordan Deklerk, deklerk3@illinois.edu):
- Which homework is involved (e.g. Homework #6)
- A detailed explanation of the suspected error
- The number of points you believe you should have received for the question.
A project will be due near the end of the semester, involving conducting an unsupervised learning analysis using methods learned in this course. You can either choose your own dataset or select from preselected datasets. If you choose your own dataset, it must contain at least 5 variables and 50 rows.
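If you bring your own dataset, a quick size check along the following lines may help. This is a minimal sketch that assumes the data is a CSV file with a single header row and one column per variable; the function name `meets_project_minimum` and the toy data are hypothetical.

```python
import csv
import io


def meets_project_minimum(csv_text, min_vars=5, min_rows=50):
    """Return True if CSV text has at least min_vars columns and min_rows data rows.

    Assumes the first line is a header naming the variables.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return False
    # Count non-empty data rows after the header.
    n_rows = sum(1 for row in reader if row)
    return len(header) >= min_vars and n_rows >= min_rows


# Toy dataset: 5 variables, 50 rows of made-up integers.
toy_header = "a,b,c,d,e"
toy_rows = "\n".join(",".join(str(i + j) for j in range(5)) for i in range(50))
toy_csv = toy_header + "\n" + toy_rows
print(meets_project_minimum(toy_csv))  # True
```

If the data is already loaded into a pandas DataFrame, `df.shape` reports the same row and column counts.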
Course Component | Points | Percentage |
Homework Assignments | 300 | 65.2% |
Checking Tuesday | 10 | 2.2% |
Final Project | 100 | 21.7% |
Final Exam | 50 | 10.9% |
Total | 460 | 100.0% |
Course points will be translated into a course grade at the end of the semester. The grade thresholds are based on your percentage score out of 460 total points:
Final Grade | Total Points Needed | Total Percentage Needed |
A+ | 446.2 | 97% |
A | 427.8 | 93% |
A- | 414 | 90% |
B+ | 400.2 | 87% |
B | 381.8 | 83% |
B- | 368 | 80% |
C+ | 354.2 | 77% |
C | 335.8 | 73% |
C- | 322 | 70% |
D+ | 308.2 | 67% |
D | 289.8 | 63% |
D- | 276 | 60% |
F | <276 | <60% |
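To see where a point total lands, the cutoffs in the table above can be applied as a simple lookup. This sketch copies the "Total Points Needed" column directly and assumes a total at or above a cutoff earns that grade; the function name `letter_grade` and the example component scores are hypothetical.

```python
# (minimum points, letter) pairs copied from the grade table, highest first.
POINT_CUTOFFS = [
    (446.2, "A+"), (427.8, "A"), (414, "A-"),
    (400.2, "B+"), (381.8, "B"), (368, "B-"),
    (354.2, "C+"), (335.8, "C"), (322, "C-"),
    (308.2, "D+"), (289.8, "D"), (276, "D-"),
]


def letter_grade(points):
    """Map total course points (out of 460) to a letter grade."""
    for cutoff, letter in POINT_CUTOFFS:
        if points >= cutoff:
            return letter
    return "F"


# Hypothetical student: homework 285, Checking Tuesday 10, project 90, exam 40.
total_points = 285 + 10 + 90 + 40
print(total_points, letter_grade(total_points))  # 425 A-
```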
Participation
If you are able to, we encourage you to attend class. Being present will help you keep up with what is going on, gain hands-on experience in learning activities, and benefit from interacting with other students and instructional staff.
Please ask questions whenever anything is confusing. If you find errors in the notes, please report them to the instructor. The instructor will be very happy that you detected them so they can be corrected!
Learning Collaboratively
We encourage you to discuss all of your course activities (with the exception of exams) with your friends and classmates! You will learn more through talking through the problems, teaching others, and sharing ideas.
Continue to read on "Academic Integrity" to understand the difference between collaboration and giving an answer away.
Academic Integrity
Collaboration is about working together. Collaboration is not giving the direct answer to a friend or sharing the source code of an assignment. Collaboration requires you to make a serious attempt at every assignment and discuss your ideas and doubts with others so everyone gets more out of the discussion. Your answers must be in your own words and your code must be typed (not copied/pasted) by you.
Academic dishonesty is taken very seriously in STAT 437, and all cases will be reported to the University, your college, and your department. You should understand how academic integrity applies specifically to STAT 437: the sanctions for cheating on an assignment include a loss of all points for the assignment and a lowering of the final course grade by one whole letter grade (70 points). A second incident or cheating on an exam results in an automatic F in the course.
Academic integrity also includes protecting your work. If your work ends up submitted by someone else, we consider this a violation of academic integrity, just as though you submitted someone else’s work.
Checking Tuesday: Understanding the Code and Claims that you are Submitting
AI Tools are not Perfect and have Consequences when They're Wrong
AI tools are easy to use but imperfect. Because they can assist the data scientist with the coding part of a task, it has become easier than ever to write code and make claims that are incorrect and that you don't understand. Unfortunately, using AI to write code or make claims about concepts that you don't understand can have negative impacts on society, the organization you work for, your career, etc. (see Piazza discussion).
Do you understand the code/statements that you are making?
Therefore, the ultimate goal is to ensure that you understand the code that you wrote or the conceptual claims that you are making.
Checking Tuesday Idea
Therefore, every Tuesday about 12 people in the class will be randomly selected to discuss their thought process for solving a few questions that are the same as, or very similar to, questions on the assignment submitted the preceding Friday.
If you did not simply copy and paste your assignment answers/code without thinking about them, then you have nothing to worry about on Checking Tuesday! Even if you got the answer wrong on the corresponding question in your individual assignment, if you thoughtfully engaged with the question, then you should have some steps or a line of reasoning that you can discuss. This discussion of your thought process is ultimately what I'm looking for.
On the other hand, if you got the answer right (or mostly right) on your individual assignment but have no idea why or how you got it, then you have not demonstrated that you understand and can EXPLAIN the code/concepts that you wrote.
Important Data Scientist Job and Interview Skill
Being able to explain your decision making process to your boss, client, or job interviewer is a very important skill to demonstrate as a data scientist. Many data scientist interviews will require you to do this! This can be an especially useful skill to develop in instances when you don't automatically know the answer to an interview question. Data scientist job interviewers are first and foremost interested in how your decision making process works, when it comes to problems. Being able to articulate your thought process is a very important skill to have in situations like these.
How many times will I be checked?
Each student in the class will be randomly checked 2 times at some point during the semester. Each check is worth 5 points.
How to complete the check?
- Random selection: On Tuesday at 4:30pm CST (20 minutes before class ends), I will send out an email letting you know whether or not you were selected for the random check.
- Google Colab Link: If selected, the email that you receive will contain either a link to a Google Doc or a link to a Google Colab Jupyter notebook.
- Watch Final 20 Minutes of Class Later: If you were selected and are attending the lecture, you should log off the class Zoom link. You should watch the remaining 20 minutes of lecture after class.
- Make a video of less than 10 minutes in Zoom:
If selected, you have until the end of the lecture (i.e., Tuesday 4:50pm CST) to do the following.
- Answer the questions asked in the Google Doc/notebook. You may have to write a few lines of code and/or explain your answer to the question.
- You should record yourself in Zoom (Record to the Cloud) writing the code in the Google Colab Jupyter notebook or simply answering the question out loud.
- You should share your FULL screen when writing the code/thinking about your answer.
- Your camera should be on (let me know if you foresee an issue with this).
- You should explain the thought process for your answer.
- References you CAN use: Your assignments, class notes, the e-book.
- References you CANNOT use: Anything else, this includes assistance from the TA, another student, etc.
- Upload your video link to Canvas: In order to receive credit, you need to upload your Zoom video link to the Canvas Checking Tuesday Assignment BY THE END OF THE LECTURE (4:50pm CST). Once you stop the recording or log out of Zoom, you will be able to find the video link here: https://illinois.zoom.us/recording.
How is the check graded?
- Checks will NOT be penalized if:
- You got the wrong answer! Remember, Checking Tuesday is not about evaluating whether you got the RIGHT answer. We are evaluating whether you can EXPLAIN your thought process for YOUR answer.
- Checks will be heavily penalized if:
- The video is submitted after the end of the lecture 4:50pm CST.
- The video is more than 10 minutes long.
- You are not able to explain your thought process for how you arrived at your answer.
- Your camera was not turned on.
- You did not share your full screen while recording.
- You used a reference that was not allowed (like talking to one of your classmates).
What happens if I was randomly selected for a given Checking Tuesday, but I was sick and/or had another extenuating circumstance?
You'll be checked in the next Checking Tuesday with questions that look similar to the questions you would have been asked that week. For instance, if you missed the lecture in which we asked questions very similar to assignment 5, then we would still ask you questions very similar to assignment 5.
Key thing Dr. Ellison wants you to know about "Checking Tuesday"
Relax! If you have put forth a good faith effort to try to understand what you wrote on your assignment, then you have nothing to worry about, even if you got it wrong or don't quite understand the concept fully yet.
IMPORTANT: This is NOT an academic integrity violation check and I don't necessarily think you are cheating
An important point to remember is that, if you do not get the full 5 points on a check, I am NOT making an academic integrity violation claim and I do NOT necessarily think that you are cheating.
Being able to articulate your understanding of the data science code/concepts that you wrote is a graded learning outcome of this course and an important skill for a data scientist to have. Therefore, if you are not able to do this, you will not get full points for this graded learning outcome, because you were unable to demonstrate this skill.
General Topics | Course Content |
Getting Started |
Course Introduction |
What is unsupervised learning? How is it different from supervised learning? Where does unsupervised learning fit in the data science pipeline? |
|
Python Primer |
|
Types of Clustering Algorithms |
|
What to consider when selecting/using a clustering algorithm? When should you use a clustering algorithm? |
|
Most Common Clustering Algorithms |
A closer look at k-means clustering (k-medoids, k-means++, bisecting k-means, and k-modes) |
Clustering evaluation metrics (unsupervised vs. supervised evaluation metrics) |
|
Hierarchical clustering (using single linkage, complete linkage, average linkage, and Ward's linkage) |
|
DBSCAN |
|
Mean Shift Clustering |
|
Prototype-Based Clustering Algorithms |
Fuzzy c-means clustering |
General Mixture Models and the EM Algorithm |
|
Self Organizing Maps (SOM) |
|
Online Clustering |
BIRCH |
Mini-batch k-means |
|
Spectral/Graph-based/Optimization-Based Clustering |
Ratio Cut Clustering and the Fiedler vector |
Consensus Clustering |
Median Partition Problem and Co-association Based Consensus Clustering |
Collaborative Filtering Algorithms |
Nonnegative Matrix Factorization (NMF) |
Dimensionality Reduction Algorithms |
Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) |
Factor Analysis |
|
t-SNE Algorithm |
|
Latent Dirichlet Allocation |