---
title: Syllabus -- Spring 2025
---
<img src="assets/images/spark-logo.png" alt="Spark Logo" style="width: 100px; float: left; margin-right: 10px;">
# CS/DS-549 Spark! Machine Learning Practicum
The Spark! Machine Learning Practicum gives you hands-on experience developing solutions
for real world challenges.
---
Welcome to the Spark! Machine Learning Practicum! On this page you'll find the
course syllabus and elsewhere on this site you'll find the course schedule and
select lectures and assignments.
## Logistics
**Time:** Tue/Thu 5:00pm–6:15pm, Spring 2025<br>
**Location:** CDS B64<br>
**Course Number:** CS/DS 549<br>
**Course Credits:** 4
**Instructors:**
**Thomas Gardos** ([email protected])
* Office hours: Tuesdays and Thursdays, 1:30-3:00pm
* Location: CDS 1623
**Ali Nahvi** (TBD)
* Office hours: Thursdays 4-5pm
* Location: TBD
Project Managers and TE:
- PM: TBA
- PM: TBA
- TE: TBA
**Piazza:** TBD (access code to be shared via email)<br>
_Please use Piazza, not email, for all questions, including grading or missing class, etc. (use the private message to instructors for such requests)_
**GradeScope:** TBD (access code to be shared via email)<br>
_If you don’t have access, contact the instructors via Piazza or email._
## Course Description
Many machine learning (ML) courses underemphasize machine learning deployment practices
and software engineering principles to ensure students focus their attention on developing a solid understanding of ML.
While justifiable, this practice perpetuates an ever-widening gap between industry expectations and student skills.
The X-Lab Machine Learning Practicum affords students opportunities to work on real-world, semester-long projects while
highlighting architectural, infrastructural, and foundational considerations involved in building and deploying machine
learning solutions. Ultimately, we hope to bridge the aforementioned gap between ML theory and ML engineering
through project-based learning.
This course is organized around a six-phase _Machine Learning Project Lifecycle (PLC)._
**Phase 1: Project Definition** (~2 weeks)<br>
**Phase 2: Research and Problem Understanding** (~1 week)<br>
**Phase 3: Data Preparation and EDA** (~2 weeks)<br>
**Phase 4: Implement Proof-of-Concept AI/ML Model** (~2 weeks)<br>
**Phase 5: Model Deployment** (~3 weeks)<br>
**Phase 6: Evaluation and Delivery** (~1 week)
We briefly describe each phase here and will go into more depth in the class.
### Preliminaries
In the first couple of weeks, while project assignments are being finalized, we will focus on
preliminary practical skills such as agile software methodologies and familiarizing ourselves
with the practical tools you will likely rely on throughout your project such as
version control with git, utilizing BU's Shared Computing Cluster for model training
and getting more familiar with ML toolkits like Pandas, Scikit-Learn, PyTorch
and HuggingFace.
### Phase 1: Project Definition
After projects are assigned and you form into teams, you will meet your client(s) and
then clarify and refine the _Project Definition_. The project goals and deliverables
will likely evolve as you become more familiar with the problem and feasibility
of various solutions. We'll talk about the importance of working with your
clients to collaborate on these updates and jointly manage expectations. This
phase will take about 2 weeks.
### Phase 2: Research and Problem Understanding
With a clearer understanding of the project and client expectations, the next
phase is to research the literature, open source projects and tools that may
be applicable. The research is to help you understand the problem better as well
as understand what machine learning approaches have been applied to similar problems.
You and your team may already start developing simple ML models to understand
their strengths and weaknesses.
### Phase 3: Data Preparation and Exploratory Data Analysis (EDA)
Because this is a machine learning project, the quality of the data will be paramount.
Improving your dataset can sometimes bring better gains than improving your
ML model. In this 2-week phase you will explore and characterize your data.
This is a good time to check if there is missing data, incorrect labels or other
issues. You may also determine that you need to augment your data with other
publicly available datasets. This is a good time to develop your data ingestion
pipeline.
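
As a rough illustration of the checks described above, the following sketch counts missing values per field and flags labels that fall outside an expected set. It uses only the Python standard library; the function names, the record layout (a list of dictionaries), and the field names are hypothetical and not part of the course materials. In practice you would likely do the same with Pandas on your own dataset.

```python
from collections import Counter


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def summarize_records(records, label_field, allowed_labels):
    """Count missing values per field and labels outside the allowed set.

    `records` is assumed to be a list of dicts, one per example
    (a hypothetical layout chosen for this sketch).
    """
    missing = Counter()
    unexpected = Counter()
    for row in records:
        for field, value in row.items():
            if is_missing(value):
                missing[field] += 1
        label = row.get(label_field)
        if not is_missing(label) and label not in allowed_labels:
            unexpected[label] += 1
    return dict(missing), dict(unexpected)
```

Note that this only catches labels that are not in the expected set (for example, inconsistent capitalization). A valid label attached to the wrong example still needs manual review or model-based inspection.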
### Phase 4: Implement Proof-of-Concept AI/ML Model
If your team hasn't already, this is the phase where you will start developing
one or more proof-of-concept models. The idea is to start gaining confidence that
the model architecture and direction your team has chosen will solve the problem
posed in the project. It can be helpful here to "fail fast" and quickly try a
few different models. We'll allocate about 2 weeks to this phase.
### Phase 5: Model Deployment
By this phase you will have clarified project and client expectations, completed
literature and project research, developed an understanding of your dataset and
prepared the data ingestion pipeline. From the PoCs you should have reasonable
confidence in the model you want to fully develop, so this is the phase
where you fully define the model, train it, and tune hyperparameters. We
allocate about 3 weeks to this phase.
### Phase 6: Evaluation and Delivery
In this final phase, you will complete model training and focus on
model evaluation and delivery including cleaning up and documenting
your GitHub repository. You will also prepare your final presentation,
the report to your clients and the demo for Spark Demo Day.
## Prerequisites
To ensure that students get the most out of this class, we require students to have taken one of
- **DS 340 (Intro to Machine Learning and AI)**
- **CS 440 (Intro to AI)**
- **CS 542 (Principles of ML)**
- **DS 542 (Deep Learning for Data Science)**
- **CS 523 (Deep Learning)**
- **CS 505 (Intro to Natural Language Processing)**
- **CS 585 (Image and Video Computing)**
- **CS 640 (Artificial Intelligence)**
or have equivalent experience. You must have a strong programming background especially with proficiency in Python. Familiarity with web and/or mobile application development is helpful, though not required. Please consult with course staff during class or office hours if you have questions about the prerequisites.
## Hub Learning Outcomes
### Hub Unit #1: Teamwork/Collaboration
***Learning Outcome #1:*** _As a result of explicit training in teamwork and sustained experiences of collaborating with others, students will be able to identify the characteristics of a well-functioning team._
The X-Lab Machine Learning Practicum affords students opportunities to work on real-world, semester-long projects while highlighting architectural, infrastructural, and foundational considerations involved in building and shipping an enterprise machine learning pipeline. Students in CS/DS 549 will explore and apply the various aspects of collaboration necessary for developing an enterprise machine learning pipeline, including developing a team agreement, assigning roles and responsibilities, pitching project ideas, making use of scrum, operating in sprints, pair programming, and presenting a well-designed, functional final project to both their peers and to their clients from industry. Teams will use a team agreement and mid-term review as a framework to establish expectations and provide feedback against those agreements.
***Learning Outcome #2:*** _Students will demonstrate an ability to use the tools and strategies of working successfully with a diverse group, such as assigning roles and responsibilities, giving and receiving feedback, and engaging in meaningful group reflection that inspires collective ownership of results._
As a team, students will work together to develop an understanding of their project's needs and constraints and explore possible ML-based solutions. Considerable theoretical and practical knowledge is explored and practiced, not only to ensure students are grasping course concepts but also to ensure students are adequately making progress toward the final project. Throughout the course of project development, students will be expected to provide each other with continuous feedback and implement said feedback to improve the functionality of their project. Students will learn the value of scrum, which ensures that all team members are aware of how each individual’s work is progressing. Establishing a consistent meeting cadence as a team and with the project’s PM will provide students the opportunity to reflect on their progress, both individually and as a team.
### Research & Information Literacy
***Learning Outcome #1:*** _Students will be able to search for, select, and use a range of publicly available and discipline-specific information sources ethically and strategically to address research questions._
When developing their projects, students will use both publicly and privately-available data sets and employ popular open source tools to build and train machine learning models. Students will be encouraged to read papers and evaluate open source models to identify the machine learning approach most suited to the problem they are seeking to address. They will conduct an ethics assessment to understand the potential risks and areas of bias involved with their chosen models.
***Learning Outcome #2:*** _Students will demonstrate understanding of the overall research process and its component parts, and be able to formulate good research questions or hypotheses, gather and analyze information, and critique, interpret, and communicate findings._
The semester-long project demands that students formulate and iterate on their topic, which includes: problem definition, data preprocessing, and exploratory research; designing and developing ML pipelines; and delivery and maintenance. In order to adequately analyze data used for each project, students will learn ways to preprocess and clean data, using techniques to augment sparse data, unearth hidden correlations, and contend with vast datasets. The course will wrap up with final presentations to industry partners, delivering the final work product, which includes thorough documentation of the code and data, before the end of the semester.
### Ethical Reasoning
***Learning Outcome #1:*** _Students will be able to identify, grapple with, and make a judgment about the ethical questions at stake in at least one major contemporary public debate, and engage in a civil discussion about it with those who hold views different from their own._
Issues of bias, transparency, and fairness in the field of machine learning are gaining widespread exposure in the public sphere. Students will be provided an overview of ethics and responsibility issues in ML along with a framework to assess their work based on a set of ethics and responsibility principles. They will apply an assessment that covers issues of explainability and traceability, i.e. the ability to explain the model’s behavior at a high level and for a specific input and the ability to trace how the model was trained including underlying assumptions, acceptance criteria, and performance of the model itself. They will gain experience examining issues of bias in the collection of the underlying data as well as issues of equity and justice in the context of applying the model itself.
***Learning Outcome #2:*** _Students will demonstrate the skills and vocabulary needed to reflect on the ethical responsibilities that face individuals (or organizations, or societies or governments) as they grapple with issues affecting both the communities to which they belong and those identified as “other.” They should consider their responsibilities to future generations of humankind, and to stewardship of the Earth._
Computing and data science students are finding themselves entering workplaces that are underregulated by the public sector, leaving a vacuum to be filled by self-regulation and public pressure. Students will gain an understanding of the risks inherent to machine learning models and applications. They will acquire both the language used to describe these risks as well as the processes needed to evaluate them. They will also grapple with determining their own values and personal responsibilities, so they are better equipped to operate responsibly in workplaces and in a world where societal expectations are surpassing government regulation. They will practice this skill through dialogue around their assessments with partners/clients as well as students and instructors.
### Other Outcomes (e.g., School, Department, and/or Program Outcomes)
As a result of completing this course, students will be able to:
1. Plan, execute, and manage complex machine learning projects
2. Create reproducible and deployable ML pipelines
3. Improve teamwork and communication
## Instructional Format, Course Pedagogy, and Approach to Learning
In addition to lectures, we will also have team collaboration time. During these times, we will work on assimilating material covered in lecture into our projects. These are meant to be hands-on work sessions.
## Books and Other Course Materials
There is no required textbook for this course. Pertinent readings and lecture notes will be posted in the Course Schedule and Piazza.
## Courseware
We will be using the typical suite of software tools for this course:
* Blackboard: Used to support current grade status. However, the grade you see in Blackboard will not be completely accurate until the end of the semester as it doesn’t take into account participation and peer evaluation.
* Gradescope: Assignment grades and feedback about the assignments
* Piazza: Class discussion and assignment details. Piazza should be where you go first and has links to all information/software used in the course.
## Assignments
Assignments serve 2 purposes:
* Cement material learned in class
* Track team and project progress
Assignment due dates will be posted in the Lecture Schedule and in most cases may be submitted up to 24 hours late with a 5% late penalty. No late submissions will be accepted after 24 hours. Assignments must be submitted on Gradescope. To account for emergencies, we will drop the assignment with the lowest score from your final grade calculations. Gradescope due dates will be the final arbiter.
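
The late and drop-lowest rules can be sketched as follows. This is only an illustration under stated assumptions, not the official grading code: it assumes the 5% penalty is taken off the earned score, that a submission more than 24 hours late counts as zero, and that all assignments are weighted equally. The function names are hypothetical.

```python
from datetime import datetime, timedelta

LATE_WINDOW = timedelta(hours=24)
LATE_PENALTY = 0.05  # assumption: 5% is deducted from the earned score


def adjusted_score(raw_score, due, submitted):
    """Apply the late policy to a single assignment score."""
    if submitted <= due:
        return raw_score
    if submitted - due <= LATE_WINDOW:
        return raw_score * (1 - LATE_PENALTY)
    return 0.0  # assumption: submissions past the late window earn no credit


def assignment_average(scores):
    """Average equally weighted scores after dropping the single lowest one."""
    if not scores:
        return 0.0
    if len(scores) == 1:
        return scores[0]  # nothing left to average if the only score is dropped
    kept = sorted(scores)[1:]
    return sum(kept) / len(kept)
```

For example, a score of 100 submitted three hours late becomes 95 under these assumptions, and scores of 80, 100, and 60 average to 90 once the 60 is dropped.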
### Qualitative Assignments
While several of the assignments test your grasp of programming content, several others are relevant to collaborative work, client engagement, and, importantly, ethical reasoning. These assignments are explained in further detail below:
#### Team Agreement
We’ll have a discussion on how to facilitate effective functioning of teams in a project-based learning structure. We’ll share material on the research grounding for high performance teams and outline the GRPI model of teaming: Goals, Roles, Processes, and Interactions. Students will construct a team agreement following this format and be graded based on the completeness of the contract in addressing each component. The team will review and revise the team agreement at the mid-point in the semester and provide a final peer evaluation for the individual component of the team grade.
#### Ethics and Responsibility Assessments
Students will be provided with a foundational framework for assessing potential issues of ethics and responsibility which they will apply in a series of case study assignments and a class discussion. Three additional assignments will provide them with an opportunity to apply this ethics and responsibility assessment to their own project as well as the project of another student team in the class. Students will be asked to engage their clients/partners around issues or questions that arise, and to take part in class discussions with the instructor, other students, and a guest lecturer specialized in the topic. Students will be graded individually on the completeness of their assessment and the quality of reflection for the two assessments. Students will be asked to meet as a team to develop a mitigation plan that outlines steps they will take and recommend to the partner to address potential issues of ethics and responsibility. We expect students will be able to identify ethical considerations in the following areas:
* Representation and blind spots
* Product/project intent
* Potential unintended harms (e.g. from inaccurate assessments)
* Technical vulnerabilities, limitations, and risks
* Data collection, privacy, storage, and security
* Auditability of algorithms: automation of human processes, testing, and monitoring
* Auditability of models: purpose of models, input data and training risks and bias, ability for human termination
* Disclosures
* Accessibility
* Use of work by other creators
### Project Description
While students will be presented with an initial project description from Spark!, they will be responsible for revising and reframing the project description based on a preliminary meeting with the client and any issues that arise with the ethics and responsibility assessment. A key component of the grade will be the students' demonstration of their grasp of the project and ability to present this understanding back to the client along with any concerns or considerations regarding ethics or responsibility issues.
## Grading
### Grade Weightings
Your final grade will be a weighted sum of grades received in the following categories:
```{=html}
<table>
<tbody>
<tr>
<th> % of Grade </th>
<th> Category </th>
<th> Notes </th>
</tr>
<tr>
<td> 25% </td>
<td> Assignments </td>
<td>
<ul>
<li>Grading rubrics are available on individual assignment pages.</li>
<li> Student teams will give interim and final presentations on their projects to the rest of the class. </li>
<ul>
<li> For each team assignment, there will be an accompanying individual contribution assessment as well. </li>
</ul>
</ul>
</td>
</tr>
<tr>
<td> 10% </td>
<td> Attendance and Participation </td>
<td> Students are graded on their attendance, participation and engagement during lectures. </td>
</tr>
<tr>
<td> 65% </td>
<td> Project </td>
<td> Students are graded on the overall impact and quality of the project as well as their individual contributions (see below) </td>
</tr>
</tbody>
</table>
```
Given that this is a practicum, the project is central to this course and is
worth 65% of your final grade. We will be partnering with BU Spark! to work on a
semester-long machine learning project. Projects are sourced from external
partners and are complex enough to provide students with real-world ML experience.
### Project Grading
Projects will be graded based on a combination of overall project outcome and
individual contributions.
```{=html}
<table>
<tbody>
<tr>
<th> % of Project Grade </th>
<th> Category </th>
<th> Description/Notes </th>
</tr>
<tr>
<td> 40% </td>
<td> Project Impact and Success </td>
<td> Ultimately the goal of the project is to deliver towards the
client's expectations.
<ul>
<li> Did the project accomplish a sufficient number of (possibly
revised) objectives?</li>
<li> Was the client relationship managed well? </li>
<li> Did the implementation show innovation and rigor? </li>
</ul>
</td>
</tr>
<tr>
<td> 20% </td>
<td> Repo Software and Documentation Quality, Reproducibility </td>
<td>
<ul>
<li> Is the GitHub repository well organized and easy to navigate?</li>
<li> Is the repo well documented, especially with replication steps?</li>
<li> Can one start from a new environment and easily set up and run?</li>
</ul>
</td>
</tr>
<tr>
<td> 30% </td>
<td> Individual Contribution </td>
<td> Is there clear evidence of
<ul>
<li> Attendance and active participation in class lab time, client and team meetings?</li>
<li> Documented activities in sprint plan history?</li>
<li> Git commit history and co-authored git commits? </li>
<li> Record of individual's contributions in document and presentation revision history?</li>
</ul>
</td>
</tr>
<tr>
<td> 10% </td>
<td> Individual contribution to collaboration and teamwork</td>
<td> Is there indication, for example from peer reviews, of positive
collaborations and constructive teamwork?
</td>
</tr>
</tbody>
</table>
```
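
To make the two-level weighting concrete, here is a minimal sketch of how a final grade could be computed from the two tables above. The weights are copied from the tables; the dictionary keys and function names are hypothetical, and the sketch assumes every component is scored on a 0 to 100 scale. It is not the instructors' actual grading script.

```python
# Weights copied from the "Grade Weightings" and "Project Grading" tables.
COURSE_WEIGHTS = {"assignments": 0.25, "participation": 0.10, "project": 0.65}
PROJECT_WEIGHTS = {
    "impact": 0.40,
    "repo_quality": 0.20,
    "individual_contribution": 0.30,
    "teamwork": 0.10,
}


def weighted_sum(scores, weights):
    """Combine 0-100 scores using weights that cover the same categories."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same categories")
    return sum(scores[name] * weights[name] for name in weights)


def final_grade(assignments, participation, project_components):
    """Compute the project score first, then fold it into the course grade."""
    project = weighted_sum(project_components, PROJECT_WEIGHTS)
    course_scores = {
        "assignments": assignments,
        "participation": participation,
        "project": project,
    }
    return weighted_sum(course_scores, COURSE_WEIGHTS)
```

For instance, project component scores of 80 (impact), 90 (repo quality), 100 (individual contribution), and 100 (teamwork) give a project score of 90. Combined with 80 on assignments and 100 on participation, the final grade is 0.25 × 80 + 0.10 × 100 + 0.65 × 90 = 88.5.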
## Community of Learning: Class and University Policies
Course members’ responsibility for ensuring a positive learning environment (e.g., participation/discussion guidelines):
### Integrity & Conduct
We take the [Student Responsibilities](https://www.bu.edu/dos/policies/student-responsibilities) guide very seriously and in particular: “civility and respect for others within the University.” In this class we should all strive to be the model for what we want our University and industry to be.
### Attendance & Absences
Due to the sequential nature of the product creation experience and the goal of completing a product demo by the end of the semester, attendance is required. Missing more than 3 classes may affect your final grade. If you must miss class for any reason, please email ahead of time. Absence from project meetings should be considered equivalent to absence from lecture.
### Academic Conduct Statement
Computing is an inherently collaborative endeavor. In most cases, you will find open source projects or code snippets on the internet that you might want to use in your own projects. While this is permitted, you *must* cite your sources appropriately. You are also responsible for ensuring that you have the original author's permission to use their work. The Open Source Initiative maintains an excellent page on [the different types of software licenses](https://opensource.org/licenses) and what you can and cannot do with them.
Using code you have borrowed from the internet without permission and/or attribution is an instance of plagiarism, which is a violation of the [Academic Code of Conduct](http://www.bu.edu/academics/policies/academic-conduct-code/). If you are in doubt about whether something might be construed as plagiarism, please check with course staff and in general—err on the side of caution. Remember, source code with no mentioned license is, by default, not available for reuse.
### Collaboration on Assignments and Projects
Unless explicitly stated, collaboration on assignments and projects among teammates is both allowed and encouraged.
### Use of Generative AI
Generative AI tools are permitted for coursework, with a strong recommendation to maintain transparency by appropriately citing their usage.
### Disability Accommodations
If you are a student with a disability or believe you might have a disability that requires accommodations, please contact the Office for Disability Services (ODS) at 617-353-3658 to coordinate any reasonable accommodation requests. For more information, please see [http://www.bu.edu/disability](http://www.bu.edu/disability).
## Course Feedback
There will be a formal course evaluation at the end of the term.
We also appreciate feedback at any time. You are welcome to do that via office hours, email, Piazza or you can submit feedback anonymously [HERE](https://forms.gle/YLqRcR4khusSexEt5).
What works for you? What doesn’t? Do you have an idea how to improve something?