CSEE4121 - Computer Systems for Data Science

Spring ’22, Columbia University


Course Overview

Data scientists and engineers increasingly have access to a broad range of powerful systems for big data analysis and machine learning at scale: from databases and large-scale analytics engines to distributed machine learning frameworks.

The goal of this class is to give data scientists and engineers who work with big data a better understanding of how the systems they will be using are built. It will also give them a better understanding of the real-world performance, availability and scalability challenges that arise when using and deploying these systems at scale. The course covers foundational ideas in designing these systems, while focusing on specific popular systems that students are likely to encounter at work or when doing research. The class will include written homework and programming assignments. One of the programming assignments will be done in pairs, and the rest will be done individually. In this course we will answer the following questions:

Instructors

Asaf Cidon and Sambit Sahu
OH: TBD (By appointment only)

Location and Time

TBD
Fridays 10:10 AM - 12:40 PM

TAs

TBD

Slack workspace

TBD

Prerequisites

Students are expected to have solid programming experience in Python or an equivalent programming language. This class is intended to be accessible to data scientists who do not necessarily have a background in databases, operating systems or distributed systems.

Syllabus

Syllabus Link

Schedule (this is a work in progress, and is likely to change)

Date      Topic                                          Homework
Jan 21    Introduction
Jan 28    Infrastructure for Big Data                    Programming Homework 1 released
Feb 4     Relational Data Model
Feb 11    Transactions and Logging                       Written Homework 1 released
Feb 18    Storage/memory hierarchy and key-value stores  Programming Homework 1 due
Feb 25    Distributed databases and file systems         Written Homework 1 due
Mar 4     Midterm
Mar 11    Challenges in scaling                          Programming Homework 2 released
Mar 18    Spring Break
Mar 25    MapReduce and stragglers
April 1   Spark and distributed analytics
April 8   Caching                                        Programming Homework 2 due, Written Homework 2 released
April 15  Security and privacy
April 22  Data quality
April 29  Review                                         Written Homework 2 due
May 6     Final Exam

Grade Breakdown

20% Programming Homework 1
10% Written Homework 1
20% Programming Homework 2
10% Written Homework 2
15% Midterm
25% Final
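
As a rough illustration of how the weights above combine, here is a minimal Python sketch. It assumes each component is scored on a 0-100 scale and that the course grade is a plain weighted average; the syllabus does not state either, and the function name and example scores are hypothetical.

```python
# Weights copied from the grade breakdown (percent of the course grade).
WEIGHTS = {
    "Programming Homework 1": 20,
    "Written Homework 1": 10,
    "Programming Homework 2": 20,
    "Written Homework 2": 10,
    "Midterm": 15,
    "Final": 25,
}


def course_grade(scores):
    """Return the weighted course grade on a 0-100 scale.

    `scores` maps each component name to a score out of 100.
    Assumption: a plain weighted average with no curve or late penalty.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example_scores = {
    "Programming Homework 1": 90,
    "Written Homework 1": 80,
    "Programming Homework 2": 100,
    "Written Homework 2": 70,
    "Midterm": 80,
    "Final": 90,
}

print(course_grade(example_scores))  # prints 87.5
```

The weights sum to 100, so dividing the weighted sum by 100 keeps the result on the same 0-100 scale as the inputs.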

Collaboration/Copying Policy

Programming assignment 1 and the written assignments will be done alone. Programming assignment 2 will be done in pairs. You may not copy answers or code. We will enforce this policy when checking the assignments (we use a code similarity system).

Course Materials

No textbook.