DATA 311 - Fundamentals of Data Science

Scott Wehrwein

Winter 2023

Course Overview

What is this course about?

Synopsis from the WWU Course Catalog

Introduction to the fundamentals of data science, focusing on techniques for collecting, processing, visualizing and organizing data. Applied machine learning concepts will also be covered, including fundamentals of machine learning experimentation and the use of libraries to perform clustering, classification and regression. Includes lab.

Official Course Outcomes

On completion of this course students will demonstrate:

Textbook

The following books are recommended, but not required:

Assessment

Data science is a practical pursuit, and this course takes a particularly practical-minded approach to it. We will focus less on the mathematical underpinnings of the tools of data science and more on strategies for successfully using those tools to extract insights from data. As such, the assessment in this course is entirely project-based. Grades will be calculated as a weighted average of scores on the following course components, each of which is described in more detail below:

The standard letter grade ranges apply (i.e., 90–100% is an A, 80–90% is a B, and so on). The calculated raw percentages may be curved at the instructor’s discretion, but any such curve used will not lower anyone’s grade. “+” or “-” cutoffs will be decided at the instructor’s discretion.

Students who demonstrate mastery of the material will get grades in the A range, and it is my goal to give as many A’s as possible.

Assessment Philosophy

In the labs and final project you will put the skills, concepts and processes discussed in lecture into practice. My approach to grading is inspired by what might be expected of you in a professional setting. A data scientist’s job is to extract and clearly and convincingly present insights from the messy realities of real data. This has several implications for how I think about assessment:

Lab Assignments

Each full week of the course consists of three 50-minute “lectures” and one 50-minute lab. Lab periods will be spent introducing and getting started on the lab assignment for the week, which will be released on Friday at the start of the lab period and due the following Thursday night at 10:00pm. These labs, along with the final project, comprise the bulk of the workload for this course, so you should plan to allow significant time to complete them outside of class. Some labs will be done individually, while others may be completed in pairs.

Final Project

A final project will be completed in groups of 3 students. The project will have multiple deliverables, including a proposal, milestone reports, a final report, and a presentation.

Quizzes

Weekly quizzes will be given to help you make sure you’re keeping up with the course content. These will be taken asynchronously on Canvas. You may take each quiz any time between the end of lab on Friday and the beginning of class on Monday, and you will have a 15-minute time limit once you start. If needed, I will switch to synchronous quizzes, in which case you would likely take the quiz in the first 10 or 15 minutes of class or lab.

In-Class Activities and Reading Responses

My goal is to make the lecture component of this course as interactive as possible. Activities may range from simple class discussions and group work to quick writing prompts that will be handed in. Anything handed in will be graded for participation only (i.e., if you make an honest effort, you will receive full credit).

I will assign a few (likely 2 or 3) required reading assignments that touch on the interactions between data science and society, often with a focus on ethical considerations. These will be evaluated by some combination of in-class discussions and short written responses.

Resources for Getting Help and Support

Help with Course Content

If you are stuck, struggling, or need help on any aspect of the course, you have several avenues for seeking help:

Other Resources

If you have concerns that go beyond the course material, you are welcome to talk to me. The following resources are also available to support you.

Community Ambassadors

The Computer Science department has both Faculty and Student community ambassadors who hold regular office hours:

These hours are a time for students, staff and faculty to bring concerns, feedback or questions as they relate to equity, inclusion and diversity within STEM. We hope that we, the Community Ambassadors and the STEM Inclusion and Outreach Specialist, can advise and also guide people to college, university or external resources.

You can find information on Community Office Hours and contact details for both at the following link: https://cs.wwu.edu/diversity-equity-inclusion

University Resources

As a reminder, the following University resources are always available:

Logistics

Course Webpage / Syllabus

The Schedule section of this page will be kept up-to-date as the quarter progresses with topics, links to all lecture materials (notes, resources, etc), as well as links to assignment and lab handouts. I suggest bookmarking this page; if you forget the URL and need to find your way back here, you can find the link on the Syllabus page in Canvas.

Canvas

I generally minimize the use of Canvas in favor of sharing materials via the course webpage. However, we will use Canvas for announcements, grades, quizzes, and submission of assignments. Lab and assignment writeups will be linked from both the course webpage and the corresponding assignment on Canvas. Lecture materials, readings, etc. will only be posted on the course webpage.

Discord

Discord is a popular communication platform that enables text, voice, and video chats to take place in a dedicated server. I’ve found Discord very helpful during remote instruction, and indeed have taught some of my courses entirely on Discord. Although we are in-person, I think having a central online platform for communicating about the course is a great way to build community, so I’ve created a Discord server for the class. The invitation link to join the server is on the Syllabus page of Canvas.

You are not required to join or participate, but I hope that you will join and chat with your classmates, ask questions in the Q&A channel, and post all the data science memes. I will, however, ask that you:

  1. Make sure that your nickname in our server is your real (or preferred) first name and last name.
  2. Keep in mind that our Discord server is an extension of our classroom environment, and everyone’s conduct therefore needs to be as professional and respectful as it would be in an in-person classroom or lab.

Computer Labs

The CS department maintains a set of Computer Science computer labs separate from the general university labs. These systems are all set up with the software that you need to complete the work for this class.

CS Accounts
To log into the machines in these labs, you will need a separate Computer Science account, which you’ll need to create unless you’ve taken another CS course already. Your username will be the same as your WWU username, but you will need to activate your account and set a new password by visiting http://password.cs.wwu.edu. Note that you’ll need to do this before your first lab, since you’ll be unable to log into the computers to access a web browser until you’ve done this.

If you didn’t already have a CS account, you may not be able to log in during the first lab since accounts may not be created until the first Monday of the quarter. If this is the case, let me know and I will try to pair you up with someone who is able to log in so you can work on the lab together.

Lab Locations and Access
The following rooms in Communications Facility are CS Department labs: 162, 164, 165, 167, 405, 418, 420. These labs are open to all CS students (that’s you!) any time except when scheduled for a class or other activity. The complete list of CS labs and their schedules can be found on the CS Support Wiki. CF 405 is never booked, so it’s always available. Labs are open 24/7, although the building locks at 11pm so you won’t be able to enter later than that.

Remote Access

In this course, we’ll be doing most of our work in Jupyter notebooks using Google Colab, which can be accessed from anywhere using a web browser. If you have occasion to run your own Jupyter notebook server, you can do that from the labs, and you can even access the server remotely. I’ve written up detailed instructions for this, which you can find here: remote access instructions.

Feedback

If there’s something I can improve about the course, I sincerely want to know about it. I take student feedback seriously, and I believe it’s especially important this quarter given that this is a new course and the pandemic continues to drag on. Any feedback you’re willing to give is greatly appreciated, and I will do my best to act on constructive feedback whenever possible. I will solicit feedback through surveys periodically throughout the course, but you are welcome and encouraged to provide feedback anytime in my office hours, by email, or, if you desire anonymity, by filling out this Google Form.

Schedule

This table contains a rough outline of a schedule for the quarter. As the quarter progresses, I will update it with more detail on past and upcoming topics. You will also find links to all course materials I post. Unless otherwise noted, References refer to chapters/sections in the Skiena book.

Date (Week) | Topics and Materials | Assignments | References
1/4 (0) | No Class | Start of Quarter Survey (Canvas) |
1/6 | No Lab | |
1/9 (1) | What is data science? What is data? (slides; typed notes; notebook) | | 1.1, 1.3
1/10 | Jupyter and Pandas - The Basics (Notes: ipynb; html) | | McKinney 5.1-2
1/11 | Probability and statistics basics (Notes: ipynb; html) | | 2.1-2.2; McKinney 5.3
1/13 | Lab 1: Playing with Pandas | Lab 1; Quiz 1 |
1/16 (2) | No Class: MLK Day | |
1/17 | Structured Data, Data Formats; Data Collection (Notes: ipynb; html) | | 1.2
1/18 | Conditional probability, independence, variance, histograms (Notes: ipynb; html) | | 2.1-2.2
1/20 | Lab 2: Answering weather questions | Lab 2; Quiz 2 |
1/23 (3) | When and why visualize; Visualization Aesthetics (Notes: ipynb; html) | | 6.2
1/24 | Plot types; Seaborn (Notes: ipynb; html) | Ethics 1 out | 6.3
1/25 | Multidimensional arrays and numpy (Notes: ipynb; html) | | McKinney 4.1
1/27 | Lab 3: Visualization | Lab 3; Quiz 3 |
1/30 (4) | Outliers and missing data (Notes: ipynb; html; Data cleaning scenarios) | | 3.3
1/31 | Numerical normalization; Text normalization, NLP basics (Notes: ipynb; html) | Ethics 1 due |
2/1 | Text normalization; Data Ethics 1 Discussion (Notes: ipynb; html) | |
2/3 | Lab 4: Data preprocessing and normalization | Lab 4; Quiz 4 |
2/6 (5) | Intro to Exploratory Data Analysis (Notes: ipynb; html; NHANES preprocessing: ipynb) | | 6.1
2/7 | EDA: cold open (Notebook: ipynb; html) | |
2/8 | HTML and web scraping; scraping ethics (Notes: ipynb; html) | Ethics 2 out | 3.2.2
2/10 | Lab 5: Movie Soup | Lab 5; Quiz 5 |
2/13 (6) | Machine Learning intro: what, why, example (Notes: ipynb; html) | | 11.0
2/14 | ML taxonomy (Notes: ipynb; html) | | 11.5
2/15 | Generalization: Bias, Variance Risk; Overfitting; Experimental setup (ipynb; html) | Ethics 2 due | 7.1, 7.4-7.5
2/17 | Lab 6: Measuring bias in ML systems | Lab 6; Quiz 6; FP proposal out |
2/20 (7) | No Class: President’s Day | |
2/21 | Linear Algebra basics: vectors and matrices (Notes: ipynb; html) | | 1.4; 8.1-8.2; 8.5; 10.1
2/22 | Distance Metrics; Dimensionality reduction (PCA); Clustering (k-means); Scikit-Learn basics (Notes: ipynb; html) | FP group formation due | 8.5.2, 10.5.1; McKinney 12.4
2/24 | Lab 7: Clustering and Dimensionality Reduction | Lab 7; Quiz 7; FP proposal due |
2/27 (8) | Linear models: regression plus tricks (Notes: ipynb; html; whiteboard) | | 9.1-9.2, 9.5
2/28 | Generalization, continued; Evaluating ML systems: baselines; regression metrics (Notes: ipynb; html; whiteboard) | | 7.3-7.5
3/1 | Regression in scikit-learn; Linear models: classification (Notes: ipynb; html) | | 10.2, 9.6, 11.2, 11.4
3/3 | Lab 8: Linear Regression - YMMV | Lab 8; Quiz 8 |
3/6 (9) | Evaluating ML systems: classification metrics (Notes: ipynb; html) | FP milestone (Sunday 3/5) | 7.3
3/7 | FP presentations: Fast Food - Joshua, Eito, Rory; Women in Headlines - Maddie, Takira, Kate; Baseball - Tyler, Matthew, Owen; Air Travel - Nathan, Carter, Madeline; Diamonds - Andrew, Keith, Dennis; Stock Prediction - Quinn, Brittany, Alex | |
3/8 | FP presentations: Billboard 100 - Jadyn, Lexi, Rose; Car Accidents - Colton, Anahita, Sierra; Spotify Charts - Ian, Emma, J.P.; Music Genre Trends - Andrew, Steve, Theo; Game Genres - Brady, Theron, Nicholas | |
3/10 | (FP presentations) | FP due |
Monday, 3/13 | Final Exam - 8:00 - 10:00 AM | |

Course Policies

Professionalism

I am committed to maintaining an inclusive, supportive, and professional environment in all academic settings including lectures, labs, and course-related online spaces. Students are expected to live up to the expectations of WWU’s Student Code of Conduct defined in WAC 516-21. Failing to follow the Student Code of Conduct will negatively affect course grades up to and including a failing grade for the course. Conduct is also considered when determining admission to the major. Refer to University and Departmental policies for more information.

Attendance

I will not explicitly track attendance. However, in-class activities (generally graded on completion) cannot be made up after the fact. These assessments will be sufficiently low-stakes that missing a handful of days will not affect your grade at all. If you will be missing more than an occasional class here and there, or if you have any concerns about the effect of absences on your grade, please have a conversation with me about it.

Communication

It is your responsibility to make sure that you promptly become aware of Canvas Announcements as they are posted; Canvas should be configured to send you an email notification by default, but if you are unsure, please come see me in office hours.

Late Work

You have three “slip days” that you may use at your discretion to submit labs late. Slip days apply only to labs and cannot be applied to any other deadline. You may use slip days one at a time or together - for example, you might submit each of three labs one day late, or submit one lab three days late. A slip day moves the deadline by exactly 24 hours from the original deadline; if you go beyond this, you will need to use a second slip day, if available.

After your slip days are exhausted, a penalty of 10% * floor(hours_late/24 + 1) will be applied - that is, 10% per day late, with any partial day counting as a full day. This is calculated as a percentage of the total points possible, not of the points earned.

The time of your submission will be recorded when you submit it on Canvas, so other than submitting your assignment and corresponding survey late, you do not need to take any action to use a slip day. Your grading feedback will include a note of how many slip days have been applied.
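
As a rough illustration of the arithmetic described above, here is a minimal Python sketch. The function names are hypothetical and this is not the actual grading script; it assumes hours_late is measured from whichever deadline applies to the submission, and it simply follows the stated formula literally.

```python
import math


def slip_days_needed(hours_late: float) -> int:
    """Slip days needed to cover a submission (each one extends the deadline by 24 hours)."""
    if hours_late <= 0:
        return 0
    return math.ceil(hours_late / 24)


def late_penalty_percent(hours_late: float) -> int:
    """Penalty as a percent of total points possible: 10% * floor(hours_late/24 + 1)."""
    if hours_late <= 0:
        return 0
    return 10 * math.floor(hours_late / 24 + 1)


def late_penalty_points(hours_late: float, points_possible: float) -> float:
    """Points deducted; based on points possible, not points earned."""
    return points_possible * late_penalty_percent(hours_late) / 100


if __name__ == "__main__":
    # A 50-point lab submitted 30 hours late with no slip days remaining.
    print(slip_days_needed(30))         # slip days that would have covered it
    print(late_penalty_percent(30))     # percent of total points deducted
    print(late_penalty_points(30, 50))  # points deducted
```

For example, under these assumptions a submission 30 hours late would need two slip days, or, with none left, would lose 20% of the points possible.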

Academic Honesty

The academic honesty guidelines for this course differ somewhat from those of a typical CS course. Much of the code you write will be written in chunks of a few lines at a time. The challenge will more often be knowing which library functions to use and how to correctly apply them, rather than solving complex algorithmic problems.

Some labs will be done individually, while others may be done in pairs. For all lab assignments, you are welcome and encouraged to discuss the lab with your classmates. You should feel free to exchange ideas for how to solve pieces of an assignment; this collaboration may be as detailed as suggesting which library function to use and an English description of what you might use it for. You may not copy anyone else’s code, nor should you allow anyone else to copy your code. Finally, most tasks of most labs will ask you to intersperse descriptive text with your code, to explain what the code is doing. This text must be your own and cannot be copied from, or even “inspired by” anyone else’s text. If you did get help on how to code up a task, you can prove that you understand the solution well by explaining it in your notebook.

For labs done in pairs, any and all collaboration is permissible between members of the same pair. That said, both members must understand and be able to explain in detail all aspects of their submission. For this reason, “pair programming” is highly recommended - you should not split the tasks up for each group member to complete independently. I reserve the right to meet with any student one-on-one and ask them to explain any part of their submission to me in detail.

Viewing or sharing code with anyone that you’re not paired with on any assignment is an academic honesty violation. If you’re discussing an assignment with a classmate, it is safest to do so away from computers.

University Policies

All University-wide policies apply to this course, including those outlined at http://syllabi.wwu.edu. These policies cover issues including: