CSEE4121 - Computer Systems for Data Science

Spring ’22, Columbia University

Programming Homework 1 | Written Homework 1 solutions | Programming Homework 2 | Midterm solutions


Course Overview

Data scientists and engineers increasingly have access to a powerful and broad range of systems for big data analysis and machine learning at scale, from databases and large-scale analytics engines to distributed machine learning frameworks.

The goal of this class is to give data scientists and engineers who work with big data a better understanding of how the systems they will use are built, and of the real-world performance, availability, and scalability challenges that arise when using and deploying these systems at scale. The course covers foundational ideas in designing these systems while focusing on specific popular systems that students are likely to encounter at work or in research. The class will include written homework and programming assignments. One of the programming assignments will be done in pairs, and the rest will be done individually. In this course we will answer the following questions:

Instructors

Asaf Cidon and Sambit Sahu
Office hours: by appointment only

Location and Time

Asaf Cidon - Fridays 10:10 AM - 12:40 PM | 501 Northwest Corner Building
Sambit Sahu - Thursdays 7:00 PM - 9:30 PM | 402 Chandler

TAs

Please refer to Ed for office hours.

Rahul Chaudhari | Shantanu Jain
Wei Hao | Koushik Roy
Aashish Arora | Manisha Rajkumar
Harshitha Malireddi | Ruchika Goel
Joy Parikh | Suvansh Dutta
Sai Karthik Ammanamanchi | Zhejian Jin
Gaurav Sinha

Ed

The Ed link has been posted on CourseWorks.

Prerequisites

Students are expected to have solid programming experience in Python or an equivalent programming language. This class is intended to be accessible to data scientists who do not necessarily have a background in databases, operating systems, or distributed systems.

Syllabus

Syllabus Link

Schedule (this is a work in progress, and is likely to change)

Week | Topic | Homework
1 | Introduction (Slides) |
2 | Relational Data Model (Slides) | Programming Homework 1 released (February 1, 2022)
3 | Relational Data Model |
4 | Transactions and Logging (Slides) | Written Homework 1 released
5 | Storage/memory hierarchy (Slides) |
6 | Indices and Bloom filters | Programming Homework 1 due (February 25, 2022 4:59:59 PM)
7 | Distributed file systems (Slides) | Written Homework 1 due (March 6, 2022 4:59:59 PM)
8 | Midterm (all material up to Topic 4, not including RocksDB) |
9 | Spring Break |
10 | MapReduce and stragglers (Slides) |
11 | Spark and distributed analytics | Programming Homework 2 released
12 | Caching (Slides) |
13 | Machine Learning (Slides) | Written Homework 2 out
14 | Security (Slides) |
15 | Data Quality and Review | Programming Homework 2 due, Written Homework 2 due (April 29, 2022 4:59:59 PM)
16 | Final Exam: May 6 |

Grade Breakdown

20% Programming Homework 1
10% Written Homework 1
20% Programming Homework 2
10% Written Homework 2
15% Midterm
25% Final
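
As a rough illustration of how these weights combine, the sketch below computes a weighted course score. It assumes every component is scored from 0 to 100 and that the final score is a plain weighted sum; the names WEIGHTS and course_score are hypothetical and not part of any official grading tool.

```python
from typing import Mapping

# Weights (in percent) taken from the grade breakdown above.
WEIGHTS = {
    "Programming Homework 1": 20,
    "Written Homework 1": 10,
    "Programming Homework 2": 20,
    "Written Homework 2": 10,
    "Midterm": 15,
    "Final": 25,
}


def course_score(scores: Mapping[str, float]) -> float:
    """Weighted sum of component scores, each assumed to be in [0, 100]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical student who scores 80 on every component.
example = {name: 80 for name in WEIGHTS}
print(course_score(example))
```

Any curving or rounding applied to final grades is not described here, so this sketch does not model it.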

Late Submission Policy

Each student has a total of 3 late days for the entire semester. Once all late days are used, late submissions are penalized: 5% for a submission up to 24 hours after the deadline, 10% for up to 48 hours after, and 20% for up to 72 hours after. No submissions will be accepted more than 72 hours after the deadline.
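
The penalty tiers can be sketched as a small function. This is only an illustration under stated assumptions: the student has already used all late days, lateness is measured in hours past the deadline, and each tier boundary is inclusive. The function name late_penalty is hypothetical, and how late days are consumed before penalties apply is not modeled.

```python
from typing import Optional


def late_penalty(hours_late: float) -> Optional[float]:
    """Fraction of the grade deducted, or None if the submission is not accepted.

    Assumes all late days are already used and tier boundaries are inclusive.
    """
    if hours_late <= 0:
        return 0.0   # on time
    if hours_late <= 24:
        return 0.05  # 5% penalty
    if hours_late <= 48:
        return 0.10  # 10% penalty
    if hours_late <= 72:
        return 0.20  # 20% penalty
    return None      # more than 72 hours late: not accepted


for hours in (0, 10, 30, 72, 73):
    print(hours, late_penalty(hours))
```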

Collaboration/Copying Policy

Programming assignment 1 and the written assignments must be done individually. Programming assignment 2 will be done in pairs. You may not copy answers or code. We will enforce this policy when checking the assignments (we use a code similarity system).

Course Materials

No textbook.

This project is maintained by CSEE-4121-2022