Issues for lesson 5 #18

jrmcgarvey · 2025-02-11T22:36:13Z

The lesson is very short. A lot should be added.

There is no explanation of the handling of outliers, which is required for the assignment.

There is no explanation of transforming a column with a function, e.g. astype(float) or equivalent, which is required for the assignment.

There is no explanation of the standardization of string appearance, e.g. LA should be Los Angeles, as is required by the assignment.
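For reference, here is a minimal sketch of what the outlier and string-standardization points could look like in pandas. The DataFrame, its column names (`City`, `Salary`), the variant spellings, and the choice of the 1.5 * IQR rule are all assumptions made for illustration. They are not taken from the lesson or the assignment.

```python
import pandas as pd

# Hypothetical data; column names and values are illustrative only.
df = pd.DataFrame({
    "City": ["LA", "Los Angeles", " la ", "New York", "NYC"],
    "Salary": [52000, 61000, 58000, 1_000_000, 55000],
})

# Standardize string appearance: trim whitespace and lowercase to build a
# lookup key, then map known variants to one canonical spelling.
canonical = {
    "la": "Los Angeles",
    "los angeles": "Los Angeles",
    "nyc": "New York",
    "new york": "New York",
}
key = df["City"].str.strip().str.lower()
# Values with no entry in the mapping fall back to the trimmed original.
df["City"] = key.map(canonical).fillna(df["City"].str.strip())

# Handle outliers with the 1.5 * IQR rule (one common convention, assumed here).
q1 = df["Salary"].quantile(0.25)
q3 = df["Salary"].quantile(0.75)
iqr = q3 - q1
in_range = df["Salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = df[in_range]  # keeps only rows whose Salary is inside the fences
```

Whether outliers should be dropped, capped, or only flagged depends on the assignment, so the filtering step is just one option.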

justinjest · 2025-02-13T00:14:57Z

I am tackling this rewrite now. I'm happy to address points 2-4, but how far into the weeds do we want to get with data cleaning? It's about 90% of data analytics, but if we add too much it might be overwhelming. I also notice the name of the lesson ends in 1. Is there an expectation of adding a 2 with more in-depth information at some point?

reidrussom · 2025-02-13T00:24:15Z

Lesson 6 is Data Cleaning II and has an overview of these topics:

Handling Missing Data
Data Transformation
Removing Duplicates
Handling Outliers
Standardizing Data
Validating Data Ranges
Handling Categorical Data
Handling Inconsistent Data
Feature Engineering

justinjest · 2025-02-13T00:47:21Z

So the question for this lesson/problem set pair is whether we want to provide additional information in lesson 5 or adjust the assignment to reflect only what is already presented. Comparing what we have in lessons 5 and 6 with the issues @jrmcgarvey brought up, I believe the following sections in lesson 6 would need to be moved, without additional editing, if we choose not to adjust the assignment.

Standardizing Data
Handling Outliers

I believe transformation of a column is already addressed in lesson 5 using
df['Age'] = df['Age'].astype(int)
which is the same operation the assignment asks for. I can move those two sections (Standardizing Data and Handling Outliers) into lesson 5 as sections 5.4 and 5.5. Given the redundant information that is already in lesson 6 under

Handling Missing Data
Data Transformation
Removing Duplicates

I believe this is a reasonable change, although we may want to remove them from lesson 6, since otherwise nearly half of that lesson would repeat material already taught.

jrmcgarvey · 2025-02-13T01:19:46Z

Chance, I think you should just use your judgement. I think it is appropriate to move stuff around to a more reasonable organization, so you may take considerable liberties, e.g. moving lesson objectives to the most appropriate place.

One point: This example: df['Age'] = df['Age'].astype(int) is a pretty spare explanation of transformation of a column. That is something that probably should be fixed in lesson 3. There are Series methods that can be used, like astype(), and perhaps most importantly, map(). Then there are Series operations, like df['column'] += 4. And finally there are numpy methods like numpy.sqrt() that operate on a Series. I think we need to explain each.
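As a rough sketch of the four kinds of column transformation listed above, something like the following could serve as a starting point. The DataFrame and its columns (`Age`, `City`, `Score`) are hypothetical and chosen only to make each call visible. The derived column names are also made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data; Age is stored as strings to make the conversion visible.
df = pd.DataFrame({
    "Age": ["34", "27", "45"],
    "City": ["LA", "NYC", "LA"],
    "Score": [9.0, 16.0, 25.0],
})

# 1. Series method astype(): convert the whole column to another dtype.
df["Age"] = df["Age"].astype(int)

# 2. Series method map(): apply a dict lookup or a function to every element.
df["City"] = df["City"].map({"LA": "Los Angeles", "NYC": "New York"})
df["AgeGroup"] = df["Age"].map(lambda age: "40+" if age >= 40 else "under 40")

# 3. Series operation: arithmetic is applied element-wise to the column.
df["Age"] += 4

# 4. NumPy function: np.sqrt accepts a Series and returns a Series.
df["ScoreRoot"] = np.sqrt(df["Score"])
```

One detail worth explaining to students: when `map()` is given a dict, any value missing from the dict becomes NaN, which is why a function or a fallback is sometimes preferable.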

One of my concerns about some of the lessons is they just give one example and move on, without really explaining. I think we need to make sure that students understand, not just emulate.
