STAT207 - Data Science Exploration
Fall 2023 - Ellison

Course Topics


  1. Introduction to the “data science pipeline”.
  2. Formulating research questions.
    1. What kind of research questions can you ask, given how the data was collected?
    2. Types of analyses: question first or dataset first?
  3. Types of dataset collection.
    1. Numerical vs. categorical
    2. Random sample
    3. Representative samples
    4. Data from randomized experiments
  4. Data management
    1. How to read CSV files into Python
    2. How to write dataframes to CSV files
  5. Data cleaning, manipulation, and representation
    1. Creating dataframes
    2. Quickly describing dataframes
    3. Summarizing/aggregating a dataframe
    4. Filtering dataframes
    5. Subsetting/slicing dataframes
    6. Combining dataframes
    7. Sorting dataframes
    8. Sampling dataframes
    9. Basic detection and handling of missing values
  6. Descriptive Analytics
    1. Summary Statistics and Visualizations
      1. Numerical variables
      2. Categorical variables
      3. Categorical variable and a numerical variable
      4. Two categorical variables
      5. Two numerical variables
      6. Three or more variables
    2. Dimensionality Reduction
      1. Principal Component Analysis (PCA)
  7. Basic probability
    1. Sampling
      1. With and without replacement
    2. Two definitions of probability
    3. Law of large numbers
    4. Calculating probabilities:
      1. Using the uniform probability rule
      2. Of two independent events
      3. Of two dependent events
      4. Using combinatorics rules
      5. Using probability mass functions
      6. Using probability density functions
    5. Random variables
      1. Discrete random variables
      2. Continuous random variables
    6. Types of probability distributions and their properties
      1. Bernoulli Distribution
      2. Binomial Distribution
      3. Uniform Distribution
      4. Normal Distribution
      5. Standard Normal Distribution
      6. t-Distribution
      7. Chi-Squared Distribution
      8. F Distribution
  8. Inference Basics
    1. Population distribution vs. sample distribution vs. sampling distribution
      1. Population distribution of numerical values vs. sample distribution of numerical values vs. sampling distribution of sample means
      2. Population distribution of categorical values vs. sample distribution of categorical values vs. sampling distribution of sample proportions
      3. Two population distributions of numerical values vs. two sample distributions of numerical values vs. sampling distribution of sample mean differences
      4. Two population distributions of categorical values vs. two sample distributions of categorical values vs. sampling distribution of sample proportion differences
    2. What is the mean, standard deviation, and shape of the a.) population distribution, b.) sample distribution, and c.) sampling distribution (respectively) and how do they relate to each other?
    3. What is the Central Limit Theorem and why does it help us conduct inference?
  9. Making an Inference
    1. Make an inference about a population parameter using one of the following techniques:
      1. A confidence interval
      2. A test statistic
      3. A p-value
    2. Make an inference about one of the following population parameters:
      1. Population mean
      2. Population proportion
      3. Difference between two population means
      4. Difference between two population proportions
      5. Three or more population means (ANOVA)
      6. One or more population slopes in a regression equation
  10. Predictive Analytics
    1. Fit a linear regression equation (for a numerical response variable).
      1. Check conditions for a linear regression equation.
      2. Assess the fit for a linear regression equation.
      3. Select which explanatory variables to use in a linear regression equation.
    2. Fit a logistic regression equation (for a categorical response variable).
      1. Check conditions for a logistic regression equation.
      2. Assess the fit for a logistic regression equation.
      3. Select which explanatory variables to use in a logistic regression equation.
    3. Build classifier models using:
      1. Logistic regression model
      2. Support vector machines
      3. Linear discriminant analysis
      4. Random forests
      5. Introductory Neural Networks
    4. Make predictions using a regression equation.
    5. Understanding and managing the bias-variance trade-off when fitting a regression model.
  11. Prescriptive Analytics
    1. How to use your data science analysis to make “good decisions” given the problem you are trying to solve with data.
  12. Coding
    1. GitHub version control
      1. Pulling a remote repository to your local computer
      2. Pushing the updates made on your local computer to the remote repository
    2. Python
      1. Types of objects
      2. Creating functions
      3. if/else statements
      4. Creating for loops