Identifying, answering & communicating relevant questions.
"Work that takes more programming skills than most statisticians have, and more statistics skills than a programmer has." - kdnuggets
Project Domain
- Data science is not a stand-alone endeavor. Unless you go into data science research, odds are you will always be working on projects in a domain outside of your expertise. Study up! Be prepared to follow and contribute to conversations that are relevant to your project but outside of your comfort zone.
Context
- The work you do, how you do it, and how you work with others will change in each professional context. Are you working in research? With a sales & marketing team? Is the project a long-term project, or a quick turnaround? Are you learning & practicing, or performing to a deadline? Will your results be used as advice, or to make the final decision? All of these contextual considerations will modify the way you approach your project.
Team
- You will be working with other humans. You will rely on other humans, and they will rely on you. Know your teammates; for better or worse, you will be stuck with each other. Be prepared to support each other when needed and ask for help before it's necessary. Failure is more likely to come from between, not within, your team members.
Questions
- The ultimate purpose of your data analyses is to answer questions relevant to your project's main objective. To do this you need to have well-defined questions that can be explored effectively by data analysis. Your whole team must agree on exactly what questions are being asked, and what qualifies as a satisfactory answer. These questions will act as the central pillar of your investigation and every decision made will have to circle back to the question in one way or another.
Data
- Know your data and how it relates to your central question. Where does it come from? How was it collected? What might be missing? How might it be corrupted? Is there extra data? Which dimensions are most relevant to your investigation? What format should it be in for your analysis? Before moving on to any analysis, minimize and simplify your data as much as possible.
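A first pass at the "what might be missing" and "is there extra data" questions can be sketched as below. This is only an illustration: the records, field names, and the `summarize_missing` helper are hypothetical, and it uses nothing beyond the Python standard library.

```python
from collections import Counter

# Hypothetical records; in practice these would come from your own source.
records = [
    {"id": 1, "age": 34, "city": "Lyon"},
    {"id": 2, "age": None, "city": "Oslo"},
    {"id": 3, "age": 51, "city": ""},
    {"id": 4, "age": None, "city": "Lima", "notes": "extra field"},
]


def summarize_missing(rows):
    """Count missing values (None, empty string, or absent field) per field."""
    # Collect every field name seen in any row, so that fields present
    # in only a few rows (possible "extra data") are also reported.
    fields = set()
    for row in rows:
        fields.update(row)

    missing = Counter()
    for row in rows:
        for field in fields:
            # row.get returns None when the field is absent from this row.
            if row.get(field) in (None, ""):
                missing[field] += 1
    return dict(missing)


missing_counts = summarize_missing(records)
print(missing_counts)
```

A field that is missing from most rows, like the hypothetical `notes` here, is a candidate for dropping before analysis; a field with a few gaps, like `age`, calls for a decision about how to handle them.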
Strategy
- How will you ask the data your question? What's the simplest possible analysis? What are the possible pitfalls of your strategy? The less complexity, the less room for error, and the easier it will be to find your mistakes when you make them. Identify key milestones in your analysis that can be used for testing and communication.
Tools
- Which tool set is best for your question, team, context, and data? Either take the time to learn the chosen tools, or find a way to do the project with tools you do know. Working with unfamiliar tools, techniques, or libraries can not only slow down a project but is also likely to lead to mistakes.
Conclusions
- Be prepared to be wrong, or to not find anything conclusive!
- Keep your conclusions tight and simple, tied directly to what the data says. Be careful not to use the results simply as support for your own ideas. You have to let the data answer your question. Your conclusion should serve only to consolidate what your analysis has uncovered.
- Make sure the whole team knows how you understand the findings. It's more than just a friendly thing to do; it will help you all learn and catch mistakes that evade even the most experienced analysts.
Audience
- Communicate to the audience you do have, not the one you'd like. What level of understanding do they have of the domain, context, and data science? What do they want from you: clear, actionable advice, or further research questions? When in doubt, ask.
General:
- 4 pillars of DS
- The data science process
- expectations vs reality
- A model of computation
- random numbers - highly under-rated
- Python or R, Python and R
Off-DataCamp dev workflow:
- Working in terminal (for unix)
- Git: Study this video, practice here
- For saving and versioning your work
- Git & GitHub - for sharing and collaborating
- Git?
- for practicing: "learngitbranching.js.org"
- for how it works behind the scenes: "THE Git video"
- for a quick reference: Roger Dudler's handbook
- GitHub?
- cloning, pushing & pulling
- pull requests & merging
- LearnGitBranching
- Visual Studio Code: Download
- Installing Python
- Jupyter Notebook
Practice:
Analysis & Inference:
- descriptive vs inferential stats
- common distributions: article, quick reference
- parametric vs non-parametric
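To make the descriptive vs inferential distinction concrete, here is a rough sketch using the standard `statistics` module. The measurements are made up for illustration, and the 1.96 multiplier assumes a normal approximation; for a sample this small a t-based interval would usually be more appropriate.

```python
import math
import statistics

# Hypothetical measurements, invented for this illustration.
sample = [4.1, 5.0, 4.7, 5.3, 4.9, 5.6, 4.4, 5.0]

# Descriptive: summarize the sample itself.
mean = statistics.mean(sample)
stdev = statistics.stdev(sample)  # sample standard deviation (n - 1)

# Inferential: say something about the population the sample came from.
# Standard error of the mean, then an approximate 95% interval.
se = stdev / math.sqrt(len(sample))
low, high = mean - 1.96 * se, mean + 1.96 * se

print(f"sample mean = {mean:.3f}, sample stdev = {stdev:.3f}")
print(f"approximate 95% interval for the population mean: ({low:.3f}, {high:.3f})")
```

The first two numbers describe only the eight values in hand; the interval is a claim about unseen data and is only as good as the assumptions behind it.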
Software Design:
- learning to develop
- functional programming?
- functional programming for DS
- unit testing
- design, testing, logging
- testing & logging (advanced)
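As a small sketch of how the functional programming and unit testing topics above fit together: a pure function (no side effects, output depends only on input) is easy to test. The `normalize` function and its tests are hypothetical examples, not taken from any of the linked resources.

```python
def normalize(values):
    """Scale numbers linearly to the range [0, 1]; pure, no side effects."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Avoid dividing by zero when all values are equal.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def test_normalize_spans_unit_interval():
    assert normalize([2, 4, 6]) == [0.0, 0.5, 1.0]


def test_normalize_constant_input():
    assert normalize([3, 3]) == [0.0, 0.0]


# A test runner such as pytest can collect functions named test_*;
# calling them directly also works for a quick check.
test_normalize_spans_unit_interval()
test_normalize_constant_input()
```

Writing the edge case (constant input) as its own test documents a decision that would otherwise be buried in the function body.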
Data Science perspectives:
- type 1 & type 2 errors
- common mistakes
- Frequentist vs Bayesian interpretations
- (over)-confidence intervals
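A type 1 error (a false positive) can be illustrated with a short simulation. This is a sketch under stated assumptions: the data are drawn from a standard normal distribution, so the null hypothesis "the mean is 0" is true by construction, the standard deviation is treated as known, and the function name, sample size, trial count, and seed are arbitrary choices made for the example.

```python
import math
import random


def false_positive_rate(trials=2000, n=30, z_crit=1.96, seed=0):
    """Fraction of two-sided z-tests that reject a true null hypothesis."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    rejections = 0
    for _ in range(trials):
        # The null is true: every sample really has mean 0 and sigma 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # z statistic for the sample mean with known sigma = 1.
        z = (sum(sample) / n) * math.sqrt(n)
        if abs(z) > z_crit:
            rejections += 1  # rejecting here is a type 1 error
    return rejections / trials


rate = false_positive_rate()
print(f"Observed type 1 error rate: {rate:.3f}")
```

With a critical value of 1.96 the observed rate should land near 0.05: even when there is nothing to find, roughly one test in twenty will report an effect.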