Statistics for Data Science

Probability

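Probability measures how likely an event is to occur. It is a number between 0 (the event cannot happen) and 1 (the event is certain to happen). When all outcomes are equally likely, it is computed as:

P(A) = n(A) / n(S)

where n(A) is the number of outcomes in event A and n(S) is the total number of outcomes in the sample space S.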
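
As a small illustration of the formula, the sketch below counts outcomes for a single roll of a fair six-sided die. The die and the event "the roll is even" are assumptions chosen for the example; they are not taken from the repository's notebooks.

```python
from fractions import Fraction

# Sample space S: all outcomes of rolling one fair six-sided die (assumed example).
sample_space = {1, 2, 3, 4, 5, 6}

# Event A: the roll is even.
event_a = {outcome for outcome in sample_space if outcome % 2 == 0}

# P(A) = n(A) / n(S)
p_a = Fraction(len(event_a), len(sample_space))

print(p_a)         # 1/2
print(float(p_a))  # 0.5
```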

Range

To get the range of values along an axis, use the NumPy function numpy.ptp(); "ptp" stands for "peak to peak". The range can be calculated using:

var = np.ptp(statement)

or

range = maximum_value - minimum_value
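
A minimal sketch of both approaches, assuming a small made-up array named data. The result is stored in value_range rather than range so that Python's built-in range() is not shadowed.

```python
import numpy as np

# Hypothetical sample values, used only for illustration.
data = np.array([4, 9, 2, 7, 5])

# Peak to peak: maximum minus minimum.
value_range = np.ptp(data)

# The same quantity computed by hand.
same_result = data.max() - data.min()

print(value_range)  # 7
print(same_result)  # 7
```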

Variance

The statistics.variance() function calculates the sample variance, that is, the variance of a sample of data drawn from a population. A large variance indicates that the data is spread out. A small variance indicates that the data is clustered closely around the mean.

var = statistics.variance(statement)
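
The sketch below contrasts the sample variance with the population variance on a made-up list of values. statistics.pvariance() is not mentioned above; it is shown here only to make the n - 1 versus n divisor visible.

```python
import statistics

# Hypothetical data, used only for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Sample variance: sum of squared deviations divided by n - 1.
sample_var = statistics.variance(data)

# Population variance: sum of squared deviations divided by n.
pop_var = statistics.pvariance(data)

print(sample_var)  # 32 / 7, roughly 4.57
print(pop_var)     # 4.0
```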

Standard deviation

Standard deviation measures how spread out the values in a data set are, in the same units as the data. A small standard deviation means that most of the values are close to the mean (average). A large standard deviation means that the values are farther from the mean. The statistics.stdev() function returns the sample standard deviation, which is the square root of the sample variance.

var = statistics.stdev(statement)
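
A short sketch on made-up data, showing that the sample standard deviation is the square root of the sample variance. statistics.pstdev(), the population version, is included for comparison and is an addition to what the text above uses.

```python
import math
import statistics

# Hypothetical data, used only for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

sample_sd = statistics.stdev(data)   # based on the n - 1 divisor
pop_sd = statistics.pstdev(data)     # based on the n divisor

# The standard deviation is the square root of the variance.
matches_variance = math.isclose(sample_sd, math.sqrt(statistics.variance(data)))

print(pop_sd)            # 2.0
print(matches_variance)  # True
```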

Quantile

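A quantile is a cut point in sorted data such that a given fraction of the values lies at or below it; for example, the 0.25 quantile is the first quartile and the 0.50 quantile is the median. NumPy computes empirical quantiles of an array with numpy.quantile(). SciPy provides an alternative in scipy.stats.mstats.mquantiles().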

var = np.quantile(statement, [0, 0.25, 0.50, 0.75, 1])
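
A minimal sketch of the call above on a made-up array. With NumPy's default linear interpolation, the five requested quantiles are the minimum, the three quartiles, and the maximum.

```python
import numpy as np

# Hypothetical data, used only for illustration.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Minimum, first quartile, median, third quartile, maximum.
q = np.quantile(data, [0, 0.25, 0.50, 0.75, 1])

print(q)  # [1. 3. 5. 7. 9.]
```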

Skewness

Skewness is a measure of asymmetry of the probability distribution about its mean and helps describe the shape of the probability distribution.

Positive : observed when the distribution has a longer right tail; typically mode < median < mean.
Negative : observed when the distribution has a longer left tail; typically mode > median > mean.
Zero (or nearly zero) : observed when the distribution is symmetric about its mean; approximately mode = median = mean.
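
The three cases can be checked numerically. The sketch below uses one common definition, the moment coefficient of skewness (third central moment divided by the second central moment raised to the power 1.5); this choice of definition and the data are assumptions for illustration. The helper name skewness is hypothetical; scipy.stats.skew is a library alternative.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data sets, used only for illustration.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]      # one large value stretches the right tail
left_skewed = [-x for x in right_skewed]      # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed) > 0)  # True
print(skewness(left_skewed) < 0)   # True
print(skewness(symmetric))         # 0.0
```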

Correlation

One way to quantify the relationship between two variables is to use the Pearson correlation coefficient, which measures the linear association between two variables. It always takes a value between -1 and 1, where:

-1 indicates a perfectly negative linear correlation between two variables
0 indicates no linear correlation between two variables
1 indicates a perfectly positive linear correlation between two variables

The further the correlation coefficient is from zero, the stronger the linear relationship between the two variables.
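
The sketch below computes the Pearson coefficient directly from its definition on made-up data. The helper name pearson_r is hypothetical. The third case shows that a coefficient of 0 only rules out a linear relationship: y depends on x exactly, but through a curve.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical data, used only for illustration.
x = [1, 2, 3, 4, 5]

r_positive = pearson_r(x, [2, 4, 6, 8, 10])  # y rises with x: approximately 1
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # y falls as x rises: approximately -1
r_none = pearson_r(x, [4, 1, 0, 1, 4])       # y = (x - 3) ** 2: no linear trend

print(round(r_positive, 6), round(r_negative, 6), round(r_none, 6))
```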

Handling outliers

When exploring data, outliers are the extreme values within the dataset: data points that differ greatly from the expected values, being either much larger or much smaller.
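
The text above does not name a detection method, so the sketch below assumes one common choice, the interquartile range (IQR) rule: a point is flagged when it lies more than 1.5 × IQR below the first quartile or above the third quartile. The data and the 1.5 multiplier are assumptions for illustration.

```python
import statistics

# Hypothetical measurements with one obviously extreme value.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]

# Quartiles (statistics.quantiles with n=4 returns Q1, median, Q3).
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# IQR rule: anything outside these fences is treated as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
cleaned = [x for x in data if lower_fence <= x <= upper_fence]

print(outliers)  # [102]
```

Whether flagged points should be removed, capped, or kept depends on why they are extreme; the rule only identifies candidates.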

Handling missing values

Missing data occurs when no information is provided for one or more items or for a whole unit. It is a common and serious problem in real-world datasets. In pandas, missing data is represented as NA (Not Available) values.
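
A minimal sketch of three common pandas operations on NA values: counting them, dropping incomplete rows, and filling a numeric column with its mean. The DataFrame, its column names, and the choice of mean imputation are assumptions for illustration.

```python
import pandas as pd

# Hypothetical table with one missing value in each column.
df = pd.DataFrame(
    {
        "age": [25.0, float("nan"), 31.0, 40.0],
        "city": ["Jakarta", "Bandung", None, "Medan"],
    }
)

# Count NA values per column.
missing_per_column = df.isna().sum()

# Option 1: drop every row that contains at least one NA value.
dropped = df.dropna()

# Option 2: fill missing ages with the mean of the observed ages.
filled = df.fillna({"age": df["age"].mean()})

print(missing_per_column)
print(len(dropped))  # 2
```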

Inferential Statistics - Confidence Intervals

A confidence interval is a range of values, computed from sample data, that is expected to contain the true population parameter a stated percentage of the time (the confidence level, for example 95%) if you repeat the experiment or re-sample the population in the same way.
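
As a rough sketch, a 95% confidence interval for a mean can be built as mean ± z × standard error. The example below assumes a normal approximation and uses made-up data; for small samples a t-distribution critical value is usually more appropriate, and the notebooks in this repository may use a different approach.

```python
import math
import statistics

# Hypothetical sample, used only for illustration.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.7, 5.4, 5.9, 6.1, 5.5]

n = len(sample)
mean = statistics.mean(sample)

# Standard error of the mean: sample standard deviation / sqrt(n).
sem = statistics.stdev(sample) / math.sqrt(n)

# Critical value for a two-sided 95% interval under the normal approximation.
z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96

lower = mean - z * sem
upper = mean + z * sem

print(f"95% CI for the mean: ({lower:.3f}, {upper:.3f})")
```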

Inferential Statistics - Hypothesis Testing

Hypothesis testing is a statistical procedure in which an analyst tests an assumption about a population parameter. The methodology used depends on the nature of the data and the reason for the analysis.

Hypothesis testing is used to assess the plausibility of a hypothesis by using sample data. Such data may come from a larger population, or from a data-generating process.
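
A minimal sketch of one such test, a two-sided one-sample z-test. It assumes the population standard deviation is known, which is rarely true in practice. The helper name one_sample_z_test, the data, the hypothesized mean, and the 0.05 significance level are all assumptions for illustration.

```python
import math
import statistics


def one_sample_z_test(sample, mu0, sigma):
    """Two-sided z-test of H0: population mean == mu0, with known sigma."""
    n = len(sample)
    z = (statistics.mean(sample) - mu0) / (sigma / math.sqrt(n))
    # Probability of a result at least this extreme if H0 is true.
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical sample; H0 says the population mean is 5.0.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.7, 5.4, 5.9, 6.1]
z, p_value = one_sample_z_test(sample, mu0=5.0, sigma=0.5)

alpha = 0.05  # significance level (assumed)
reject_null = p_value < alpha

print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

A small p-value means the sample would be unlikely if the null hypothesis were true; it does not prove the alternative.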

nurulauliyas/Statistics-for-Data-Science
