spring2022-STAT430-Syllabus

Spring 2022 - STAT430
Unsupervised Learning

General Course Info

Instructor: Tori Ellison, Department of Statistics
Teaching Assistant: Abhishek Ojha, Department of Statistics
Course Website: http://courses.las.illinois.edu/spring2022/stat430/index.html
Office Hours: Held in person and online. Visit our Canvas page for Zoom links.

Date/Time CST

Instructor: Dr. Tori Ellison, vellison@illinois.edu

Friday 1:30pm-3:30pm CST

Location: Computer Applications Building 138 and Zoom

TA: Abhishek Ojha, ojha4@illinois.edu

Wednesday 3-5pm

Location: English Building 150

Course Goals

This will be an applied course in unsupervised learning. It surveys some of the clustering and dimensionality reduction algorithms most commonly used by data scientists today. Students will apply these algorithms to real and artificial datasets in Python.

In this course we will do the following.

Learn a series of tools (algorithms) that allow us to discover and describe hidden insights contained in high-dimensional unlabeled data.
Specifically, given real-world datasets, students should be able to code a full unsupervised learning analysis in Python. This includes the following.
- Be able to justify when/if it is useful to use a clustering algorithm or dimensionality reduction algorithm for a given dataset, research question, and research scenario.
- Be able to justify which clustering and/or dimensionality reduction algorithms are most appropriate to use for a given dataset, research question, and research scenario.
- If the clustering and/or dimensionality reduction algorithm has different settings/parameters that can be utilized, be able to justify which parameters to use.
- Be able to justify the evaluation metric(s)/methods that were used to: a.) select which algorithm/model/parameters to use, as well as b.) describe the nature of the results.
- Be able to interpret the results of the algorithms and effectively communicate as many hidden insights as possible about the dataset.
- Be able to understand how different aspects of data pre-processing might affect the results of the unsupervised learning algorithms.
- Be able to use these unsupervised learning insights to help make predictions as well as make good business decisions.
In general, students should also know the following.
- How each algorithm works and the output of each algorithm.
- How each algorithm evaluation metric works.
- Develop an intuition for what happens when we apply these algorithms and evaluation metrics to 2-d datasets.
- Students should know how to conduct at least one iteration of these algorithms and calculate these evaluation metrics by hand.
- Students should know how to code these algorithms and evaluation metrics by hand.
- Students should demonstrate best practices when communicating and presenting data science results (e.g., titles on graphs, labeled axes, etc.).

Prerequisites

MATH415 Applied Linear Algebra: Some algorithms will involve knowledge of:
- Eigenvectors and Eigenvalues
- Orthogonality
- Matrix operations
- Vector Norms
STAT410 Statistics and Probability II: Some algorithms will involve knowledge of:
- Maximum Likelihood Estimation

Useful Experience (but not required):

Ideally, students will have had some experience using Python. However, we will dedicate parts of the lecture and some tutorials towards using Python (for the purpose of conducting an unsupervised learning analysis) for those who haven't.

Course Section

Lectures

Synchronous In-Person Lecture
MWF 12:00pm-12:50pm CST
106B8 Engineering Hall

You aren't *required* to attend the lecture, but you're *encouraged* to attend if you can make it. Attending lecture at a fixed time gives you the opportunity to learn from, and ask questions of, the instructor, the TAs, and your classmates.

Course Materials

Laptop Computer: You need a laptop running Windows, OS X, or Linux. Tablets, Chromebooks, and iPads are not supported. You will need to be able to install Python to complete the labs (instructions provided).
Some version of Anaconda:
- Anaconda is a free and open-source distribution of the Python and R programming languages. It comes equipped with many common Python packages and applications that allow you to code in Python in a variety of different ways. One of the applications Anaconda includes is the Jupyter Notebook application.
- Jupyter Notebooks are often used for writing data science reports that also need to be integrated with interactive Python code blocks. Part of our assignments will involve working on and submitting (on Canvas) Jupyter Notebooks.
- If you do not yet have some version of Anaconda installed on your computer and are unfamiliar with how to use Jupyter Notebooks, these steps can help you get started.
Lecture notes: These will be posted on the course website. Each new day's lecture notes should be available within one hour of class at the latest.
Online Books: For an introductory-level overview of the concepts and algorithms that we will discuss in this course, the chapters below can be useful resources. However, given that data science is a modern and ever-changing field that is built upon the trial and error of current data scientists, we will be supplementing these base resources with more applied/current papers describing best practices and applied tips as well. Check back here as the semester progresses (or on the course schedule) for more useful resources.
Important Note: You will only be expected to know content that was covered in the lectures or assignments. (So we will not necessarily test you on ALL content that is described in these resources. But if you would like a different resource/way that explains the concepts that we discuss in this course, I would suggest taking a look at these resources first. If you're still confused on a topic, I can try to refer you to a different resource if you ask.)
- Chapter 7, Cluster Analysis: Basic Concepts and Algorithms, [P. Tan, M. Steinbach, A. Karpatne, and V. Kumar (2018) Introduction to Data Mining (2nd ed)]
- Chapter 8, Cluster Analysis: Additional Issues and Algorithms (see our Canvas Files page for this chapter), [P. Tan, M. Steinbach, A. Karpatne, and V. Kumar (2018) Introduction to Data Mining (2nd ed)]
- Appendix B, Dimensionality Reduction, [P. Tan, M. Steinbach, A. Karpatne, and V. Kumar (2018) Introduction to Data Mining (2nd ed)]
Need a Quick Refresher on Certain Concepts?:
- Appendix A, Linear Algebra, [P. Tan, M. Steinbach, A. Karpatne, and V. Kumar (2018) Introduction to Data Mining (2nd ed)]
- Appendix C and D, Probability, Statistics, and Regression, [P. Tan, M. Steinbach, A. Karpatne, and V. Kumar (2018) Introduction to Data Mining (2nd ed)]
- Appendix E, Optimization (from your Calculus classes), [P. Tan, M. Steinbach, A. Karpatne, and V. Kumar (2018) Introduction to Data Mining (2nd ed)]

Homework Assignments
There will be twelve homework assignments, each worth 30 points, given during the semester as we cover the corresponding material. The lowest two homework scores will be dropped. Homework must be turned in through Canvas on the due dates (see schedule) by 11:59pm CST.

How to submit homework:

All homework assignments should be submitted on Canvas.
All homework assignments will involve coding in Python. You should submit your code as a Jupyter Notebook (.ipynb) file.
Some homework assignments may require you to write out some of your answers by hand. Please scan (or take a picture of) these documents and submit them with your Canvas assignment as well. (There are many scanner apps that you can download on your smartphone. Please let me know if you don't have the appropriate technology to submit these and we can work something out.)
Please clearly mark and number the questions that you are answering in your code and scanned documents.

Late Policies:

Homework that is late by at least 5 minutes and up to 24 hours will lose 9 points.
Homework that is late by more than 24 hours will receive 0 points.
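
As a rough illustration of how the homework points add up, the sketch below applies the late policy to a single score and then drops the two lowest scores. The function names are hypothetical, and two details are assumptions not stated in this syllabus: submissions less than 5 minutes late are treated as on time, and a penalized score is never allowed to go below 0. The official calculation is whatever the course staff records.

```python
from datetime import timedelta

LATE_PENALTY = 9                   # points deducted for a late submission
GRACE = timedelta(minutes=5)       # assumption: under 5 minutes late counts as on time
CUTOFF = timedelta(hours=24)       # more than 24 hours late earns 0 points


def apply_late_policy(raw_score, lateness):
    """Return the homework score after applying the late policy."""
    if lateness > CUTOFF:
        return 0
    if lateness >= GRACE:
        # Assumption: a penalized score is floored at 0.
        return max(raw_score - LATE_PENALTY, 0)
    return raw_score


def homework_total(scores, n_dropped=2):
    """Sum the homework scores after dropping the lowest n_dropped."""
    return sum(sorted(scores)[n_dropped:])


# Ten perfect scores, one perfect score submitted an hour late, one missed homework.
scores = [30] * 10 + [apply_late_policy(30, timedelta(hours=1)), 0]
print(homework_total(scores))
```

In this example the late homework (21 points) and the missed homework (0 points) are the two dropped scores, so the student still earns the full 300 homework points.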

Regrade Policies: You have ONE week to request a grade correction after a homework score is posted. You should clearly present the following information to the head TA (Abhishek Ojha, ojha4@illinois.edu):

Which homework is involved (e.g. Homework #6)
A detailed explanation of the suspected error
The number of points you feel you should have received for the question.

Projects
A project will be due near the end of the semester that involves conducting an unsupervised learning analysis using some of the methods that we learn in this course. You can either choose your own dataset or choose from a set of preselected datasets. If you choose your own dataset, it must contain at least 8 variables and at least 300 rows.

Grades

The final course grade will be based on the total number of points earned by a student on the homework assignments (max 300) and the project (max 100), for a total of 400 possible points.

Homework Assignments - 300 pts
Final Project - 100 pts total

Course points will be translated into a course grade at the end of the semester. The grade thresholds will be based on your percentage score out of 400:

Grade	Min Pct	Grade	Min Pct	Grade	Min Pct
A+	97	A	93	A-	90
B+	87	B	83	B-	80
C+	77	C	73	C-	70
D+	67	D	63	D-	60
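
To see how a point total maps onto the table above, here is a minimal sketch. The function name is hypothetical, and two details are assumptions not stated in this syllabus: percentages are compared against the thresholds without rounding, and any percentage below 60 is treated as an F.

```python
# Minimum percentages from the grade table, highest first.
THRESHOLDS = [
    (97, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def letter_grade(points, total=400):
    """Translate course points into a letter grade using the table above."""
    pct = 100 * points / total  # assumption: no rounding before comparison
    for min_pct, letter in THRESHOLDS:
        if pct >= min_pct:
            return letter
    return "F"  # assumption: below 60% is an F


print(letter_grade(372))  # 372 / 400 = 93%
```

For example, 372 of 400 points is exactly 93%, which meets the minimum for an A, while 371 points (92.75%) would fall in the A- range under the no-rounding assumption.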

Participation

If you are able to, we encourage you to attend class. Being present will help you keep up with what is going on, gain hands on experience in learning activities, and benefit from interacting with other students and instructional staff.

Please ask questions whenever anything is confusing. If you find errors in the notes, please report them to the instructor. The instructor will be very happy that you detected them so they can be corrected!

Learning Collaboratively

Working Together

We encourage you to discuss all of your course activities (with the exception of exams) with your friends and classmates! You will learn more through talking through the problems, teaching others, and sharing ideas.

Continue to read on “Academic Integrity” to understand the difference between collaboration and giving an answer away.

Academic Integrity

Collaboration is about working together. Collaboration is not giving the direct answer to a friend or sharing the source code to an assignment. Collaboration requires you to make a serious attempt at every assignment and discuss your ideas and doubts with others so everyone gets more out of the discussion. Your answers must be in your own words and your code must be typed (not copied/pasted) by you.

You may discuss basic concepts with other students, but your document should be completed on your own without evidence of having viewed another student’s solutions.

Academic dishonesty is taken very seriously in STAT 430, and all cases will be brought to the University, your college, and your department. You should understand how academic integrity applies specifically to STAT 430: the sanctions for cheating on an assignment include a loss of all points for the assignment and a final course grade lowered by one whole letter grade (70 points). A second incident, or cheating on an exam, results in an automatic F in the course.

Academic integrity includes protecting your work. If your work ends up submitted by someone else, we consider this a violation of academic integrity just as though you submitted someone else’s work.

Email Note
Please check your email regularly for important class communications.

COVID-19 Note
Given the uncertain nature of how this semester might unfold due to COVID-19, the instructor reserves the right to make any changes she considers academically advisable. Such changes, if any, will be announced in class. Please note that it is your responsibility to attend the class and keep track of the proceedings. However, we will try to adhere to this syllabus and course schedule as much as possible, and will send an email informing you of any changes.

Course Topics

General Topics	Course Content
Getting Started	Course Introduction
	What is unsupervised learning? How is it different from supervised learning? Where does unsupervised learning fit in the data science pipeline?
	Python Primer
	Types of Clustering Algorithms
	What to consider when selecting/using a clustering algorithm? When should you use a clustering algorithm?
Most Common Clustering Algorithms	A closer look at k-means clustering (k-medoids, k-means++, bisecting k-means, and k-modes)
	Clustering evaluation metrics (unsupervised vs. supervised evaluation metrics)
	Hierarchical clustering (using single linkage, complete linkage, average linkage, and Ward's linkage)
	DBSCAN
	Mean Shift Clustering
Prototype-Based Clustering Algorithms	Fuzzy c-means clustering
	General Mixture Models and the EM Algorithm
	Self-Organizing Maps (SOM)
Online Clustering	BIRCH
Online Clustering	Mini-batch k-means
Spectral/Graph-based/Optimization-Based Clustering	Ratio Cut Clustering and the Fiedler vector
Consensus Clustering	Median Partition Problem and Co-association Based Consensus Clustering
Collaborative Filtering Algorithms	Nonnegative Matrix Factorization (NMF)
Dimensionality Reduction Algorithms	Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
	Factor Analysis
	t-SNE Algorithm
	Latent Dirichlet Allocation