Issues for lesson 5 #18

Open
jrmcgarvey opened this issue Feb 11, 2025 · 4 comments

Comments

@jrmcgarvey
Contributor

The lesson is very short. A lot should be added.

There is no explanation of the handling of outliers, which is required for the assignment.

There is no explanation of how to transform a column with a function, e.g. astype(float) or equivalent, which is required for the assignment.

There is no explanation of the standardization of string appearance, e.g. LA should be Los Angeles, as is required by the assignment.
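For reference, the kind of material being asked for might look like the sketch below. It is only an illustration, not the lesson's or the assignment's actual code: the DataFrame, the column names (City, Salary), the alias table, and the choice of the 1.5 * IQR rule for outliers are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical data: column names and values are invented for illustration.
df = pd.DataFrame({
    "City": ["LA", "Los Angeles", " los angeles ", "NYC", "New York"],
    "Salary": [55000, 61000, 58000, 1_000_000, 60000],
})

# Standardize string appearance: trim whitespace, normalize case, then
# replace known variants with one canonical spelling.
city_aliases = {"La": "Los Angeles", "Nyc": "New York"}  # keys match title-cased text
df["City"] = df["City"].str.strip().str.title().replace(city_aliases)

# Flag outliers with the 1.5 * IQR rule (one common convention, not the only one).
q1 = df["Salary"].quantile(0.25)
q3 = df["Salary"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df["Salary"] < lower) | (df["Salary"] > upper)

# Option A: drop the outlying rows.
df_trimmed = df[~is_outlier]

# Option B: keep every row but cap values at the fences.
df_capped = df.assign(Salary=df["Salary"].clip(lower=lower, upper=upper))
```

Whether to drop or cap outliers depends on the data set, so a lesson would probably want to show both and explain the trade-off.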

@justinjest
Contributor

I am tackling this rewrite now. I'm happy to address points 2-4, but how far into the weeds do we want to get with data cleaning? It's about 90% of data analytics, but if we add too much it might be overwhelming. I also notice the name of the lesson ends in 1; is there an expectation of adding a 2 with more in-depth information at some point?

@reidrussom
Collaborator

Lesson 6 is Data Cleaning II and has an overview of these topics:

  • Handling Missing Data
  • Data Transformation
  • Removing Duplicates
  • Handling Outliers
  • Standardizing Data
  • Validating Data Ranges
  • Handling Categorical Data
  • Handling Inconsistent Data
  • Feature Engineering

@justinjest
Contributor

So the question for this lesson/problem set pair is whether we want to provide additional information in lesson 5 or adjust the assignment to reflect only what is already presented. Comparing what we have in lessons 5 and 6 with the issues @jrmcgarvey brought up, I believe the following sections of lesson 6 would need to be moved, without additional editing, if we choose not to adjust the assignment:

  • Standardizing Data
  • Handling Outliers

I believe transformation of a column is already addressed in lesson 5 using
df['Age'] = df['Age'].astype(int)
which is the same operation the assignment asks for. I can move those two sections into lesson 5 as sections 5.4 and 5.5. Given the redundant information that is already in lesson 6 on

  • Handling missing data
  • Data Transformation
  • Removing Duplicates

I believe this is a reasonable change, although we may want to remove the repeated topics from lesson 6, since otherwise nearly half of that lesson would be information already taught.

@jrmcgarvey
Contributor Author

jrmcgarvey commented Feb 13, 2025

Chance, I think you should just use your judgement. It is appropriate to move material around into a more reasonable organization, so you may take considerable liberties, e.g. moving lesson objectives to the most appropriate place.

One point: the example df['Age'] = df['Age'].astype(int) is a pretty sparse explanation of transforming a column. That is something that should probably be fixed in lesson 3. There are Series methods that can be used, like astype() and, perhaps most importantly, map(). Then there are Series operations, like df['column'] += 4. And finally there are NumPy functions like numpy.sqrt() that operate on a Series. I think we need to explain each.
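As a rough sketch of what "explaining each" could cover, the four approaches can be shown side by side on one small DataFrame. The data, the column names other than Age, and the mapping table are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Age": ["34", "27", "45"],      # numbers stored as strings
    "City": ["LA", "NYC", "LA"],
    "Score": [9.0, 16.0, 25.0],
})

# 1. Series method astype(): convert the whole column to another dtype.
df["Age"] = df["Age"].astype(int)

# 2. Series method map(): apply a lookup dict or a function to each value.
#    With a dict, values that have no key become NaN.
df["City"] = df["City"].map({"LA": "Los Angeles", "NYC": "New York"})
df["AgeGroup"] = df["Age"].map(lambda age: "under 30" if age < 30 else "30+")

# 3. Series operation: arithmetic applies element-wise to every row.
df["Age"] += 4

# 4. NumPy function: takes a Series and returns a Series of the same length.
df["ScoreRoot"] = np.sqrt(df["Score"])

print(df)
```

Options 3 and 4 are vectorized and usually the better choice for numeric work; map() with a function is more flexible but runs Python code once per value.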

One of my concerns about some of the lessons is that they give just one example and move on, without really explaining it. I think we need to make sure that students understand, not just emulate.


3 participants