

7 Key Statistics Concepts Every Data Scientist Must Master

BY BALA PRIYA C | POSTED ON AUGUST 9, 2024

Statistics is one of the must-have skills for all data scientists. But learning statistics can be quite the task.

That's why we put together this guide to help you understand essential statistics concepts for data science. It should give you an overview of the statistics you need to know as a data scientist and point you toward specific topics to explore further.

Let's get started.

1. Descriptive Statistics

Descriptive statistics provide a summary of the main features of a dataset for preliminary data analysis. Key metrics include measures of central tendency, dispersion, and shape.

Measures of Central Tendency

These metrics describe the center or typical value of a dataset:

  • Mean: Average value, sensitive to outliers
  • Median: Middle value, robust to outliers
  • Mode: Most frequent value, indicating common patterns

Measures of Dispersion

These metrics describe data spread or variability:

  • Range: Difference between highest and lowest values, sensitive to outliers
  • Variance: Average squared deviation from the mean, indicating overall data spread.
  • Standard deviation: Square root of variance, in the same unit as the data.
    Low values indicate data points close to the mean, high values indicate widespread data.

Measures of Shape

These metrics describe the data distribution shape:

  • Skewness: Asymmetry of the distribution; positive for right-skewed, negative for left-skewed
  • Kurtosis: "Tailedness" of the distribution; high values indicate heavy tails (more outliers), low values indicate light tails

Understanding these metrics is foundational for further statistical analysis and modeling, helping to characterize the distribution, spread, and central tendencies of your data.
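
For a quick hands-on check, here is a minimal sketch of computing these summaries in Python with pandas and SciPy; the numbers are made up purely for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: daily order values (made up for illustration)
values = pd.Series([12, 15, 15, 18, 21, 22, 25, 30, 95])

# Central tendency
print("Mean:", values.mean())           # sensitive to the outlier (95)
print("Median:", values.median())       # robust to the outlier
print("Mode:", values.mode().tolist())  # most frequent value(s)

# Dispersion
print("Range:", values.max() - values.min())
print("Variance:", values.var())        # pandas uses the sample variance (ddof=1)
print("Std dev:", values.std())

# Shape
print("Skewness:", stats.skew(values))      # positive => right-skewed
print("Kurtosis:", stats.kurtosis(values))  # excess kurtosis; high => heavy tails
```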

2. Sampling Methods

You need to understand sampling for estimating population characteristics. When sampling, you should ensure that these samples accurately reflect the population. Let's go over the common sampling methods.

Random Sampling

Random sampling minimizes bias and helps ensure the sample is representative. To do this, you assign unique numbers to population members and use a random number generator to select members at random.

Stratified Sampling

Stratified sampling ensures representation of all subgroups. It divides the population into homogeneous strata (such as age or gender groups) and randomly samples from each stratum in proportion to its size.

Cluster Sampling

Cluster sampling is cost-effective for large, spread-out populations. Here, you divide the population into clusters (such as geographical areas), randomly select clusters, and then sample all members, or a random subset of members, within the chosen clusters.

Systematic Sampling

Systematic sampling is another technique that ensures evenly spread samples. You assign unique numbers to population members, determine the sampling interval (k), randomly select a starting point, and then select every k-th member.

Choosing the right sampling method leads to an effective study design and more representative samples, which in turn improves the reliability of your conclusions.
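
To make these methods concrete, here is a minimal sketch of random, stratified, and systematic sampling with pandas and NumPy; the population, its size, and the age_group column are all hypothetical (cluster sampling follows the same idea, selecting whole clusters instead of individuals).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical population of 10,000 people with an age-group label
population = pd.DataFrame({
    "id": np.arange(10_000),
    "age_group": rng.choice(["18-30", "31-50", "51+"], size=10_000, p=[0.4, 0.4, 0.2]),
})

# Simple random sampling: every member has an equal chance of selection
random_sample = population.sample(n=500, random_state=42)

# Stratified sampling: sample 5% from each age group, proportional to its size
stratified_sample = population.groupby("age_group").sample(frac=0.05, random_state=42)

# Systematic sampling: pick a random start, then take every k-th member
k = len(population) // 500        # sampling interval
start = rng.integers(0, k)        # random starting point
systematic_sample = population.iloc[start::k]

print(len(random_sample), len(stratified_sample), len(systematic_sample))
```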

3. Probability Distributions

Probability distributions represent the likelihood of different outcomes. When you’re starting out, you should learn about the normal, binomial, Poisson, and exponential distributions, each with its own properties and applications.

Normal Distribution

Many real-world variables follow a normal distribution, which has the following properties:

Symmetric around the mean, with the mean, median, and mode all equal
Characterized by its mean (µ) and standard deviation (σ)
Follows the empirical rule: ~68% of the data falls within one standard deviation of the mean, ~95% within two, and ~99.7% within three
It’s also important to mention the Central Limit Theorem (CLT) when talking about normal distributions. In simple terms, the CLT states that with a large enough sample size, the sampling distribution of the sample mean approximates a normal distribution.
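
One way to internalize the CLT is to simulate it. The sketch below, using an arbitrary right-skewed exponential population, shows that the means of repeated samples cluster into an approximately normal distribution around the population mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# A clearly non-normal (right-skewed) population: exponential with mean 2
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples of size 50 and record each sample's mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

# The distribution of sample means is approximately normal,
# centered near the population mean with a much smaller spread.
print("Population mean:", population.mean())
print("Mean of sample means:", np.mean(sample_means))
print("Std of sample means:", np.std(sample_means))  # ~ population std / sqrt(50)
```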

Binomial Distribution

The binomial distribution models the number of successes in n independent Bernoulli trials. Each trial has only two possible outcomes: success or failure. The binomial distribution is:

Defined by the number of trials (n) and the probability of success (p)
Suitable for binary outcomes like yes/no or success/failure

Poisson Distribution

The Poisson distribution is generally used to model the number of events occurring within a fixed interval of time. It’s especially suited for rare events and has the following properties:

Events are independent and have a fixed average rate (λ) of occurrence
Useful for counting events over continuous domains (time, area, volume)

Exponential Distribution

The exponential distribution is continuous and is used to model the time between events in a Poisson process.
The exponential distribution is:
Characterized by the rate parameter (λ), which is the inverse of the mean
Memoryless, meaning the probability of an event occurring in the future is independent of the past

Understanding these distributions helps in modeling various types of data.
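
As a brief sketch, all four distributions are available in scipy.stats; the parameter values below are arbitrary and chosen only to illustrate typical calculations.

```python
from scipy import stats

# Normal: mean 0, standard deviation 1
print(stats.norm(loc=0, scale=1).cdf(1.0))  # P(X <= 1), roughly 0.841

# Binomial: 10 independent trials, success probability 0.3
print(stats.binom(n=10, p=0.3).pmf(3))      # P(exactly 3 successes)

# Poisson: average rate of 4 events per interval
print(stats.poisson(mu=4).pmf(2))           # P(exactly 2 events)

# Exponential: rate lambda = 0.5, so scale = 1 / lambda = 2
print(stats.expon(scale=2).sf(3))           # P(waiting time > 3)
```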

4. Hypothesis Testing

Hypothesis testing is a method to make inferences on the population from sample data, determining if there is enough evidence to support a certain condition.

The null hypothesis (\(H_0\)) assumes no effect or difference.
Example: a new drug has no effect on recovery time compared to an existing drug.

The alternative hypothesis (\(H_1\)) assumes an effect exists.
Example: the new drug reduces recovery time compared to the existing drug.

The p-value is the probability of obtaining results at least as extreme as those observed, assuming \(H_0\) is true.

  • Low p-value (say ≤ 0.05): Strong evidence against \(H_0\); reject \(H_0\).
  • High p-value (say > 0.05): Weak evidence against \(H_0\); do not reject \(H_0\).

You should also be aware of Type I and Type II errors:

  • Type I Error (\(\alpha\)): Rejecting \(H_0\) when it is true,
    such as concluding the drug is effective when it is not.
  • Type II Error (\(\beta\)): Not rejecting \(H_0\) when it is false,
    such as concluding the drug has no effect when it actually does.

The general procedure for hypothesis testing can be summed up as follows:

[Figure: the general hypothesis-testing procedure]
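
As a minimal sketch, a two-sample t-test with scipy.stats follows this procedure for the drug example above; the recovery times below are made up for illustration.

```python
from scipy import stats

# Hypothetical recovery times in days (made-up data)
existing_drug = [12, 14, 13, 15, 16, 14, 13, 15, 17, 14]
new_drug      = [10, 12, 11, 13, 12, 11, 12, 10, 13, 11]

# H0: the new drug has no effect on mean recovery time
# H1: mean recovery times differ
t_stat, p_value = stats.ttest_ind(existing_drug, new_drug)

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value <= alpha:
    print("Reject H0: evidence of a difference in recovery time")
else:
    print("Do not reject H0: insufficient evidence of a difference")
```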

5. Confidence Intervals

A confidence interval (CI) is a range of values, derived from sample data, that is likely to contain the true population parameter.
The confidence level (e.g., 95%) represents the frequency with which the true population parameter would fall within the calculated interval if the experiment were repeated multiple times.

A 95% CI means we are 95% confident that the true population mean lies within the interval. For example, if the 95% confidence interval for the average price of houses in a city is 64.412K to 65.588K, we are 95% confident that the true average price of all houses in the city lies within this range.
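
Here is a minimal sketch of computing such an interval with scipy.stats, using a small hypothetical sample of house prices (in thousands).

```python
import numpy as np
from scipy import stats

# Hypothetical sample of house prices (in thousands)
prices = np.array([62.1, 64.8, 65.5, 63.9, 66.2, 64.0, 65.7, 63.3, 66.8, 64.5])

mean = prices.mean()
sem = stats.sem(prices)  # standard error of the mean

# 95% CI for the mean, using the t-distribution (appropriate for small samples)
ci_low, ci_high = stats.t.interval(0.95, df=len(prices) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean price: {ci_low:.3f}K to {ci_high:.3f}K")
```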

6. Regression Analysis

You should also learn regression analysis to model relationships between a dependent variable and one or more independent variables.

Linear regression models the linear relationship between a dependent variable and a single independent variable.

You can use multiple regression to include multiple independent variables. It models the relationship between one dependent variable and two or more independent variables.
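
As a rough sketch, fitting a multiple regression with scikit-learn might look like the following; the features (house size and age) and the data-generating process are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical data: predict house price from size (sq ft) and age (years)
X = rng.uniform([800, 0], [3000, 50], size=(200, 2))             # two independent variables
y = 50 + 0.1 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 5, 200)   # dependent variable + noise

model = LinearRegression().fit(X, y)

print("Coefficients:", model.coef_)  # effect of each independent variable
print("Intercept:", model.intercept_)
print("R^2:", model.score(X, y))     # proportion of variance explained
```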

Check out Step-by-Step Guide to Linear Regression in Python to learn more about building regression models in Python.

Understanding regression is, therefore, fundamental for predictive modeling and forecasting problems.

7. Bayesian Statistics

Bayesian statistics provides a probabilistic approach to inference, updating beliefs about parameters or hypotheses based on prior knowledge and observed data. Key concepts include Bayes’ theorem, prior distribution, and posterior distribution.

Bayes' theorem updates the probability of a hypothesis H given new evidence E:

\[ P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)} \]

  • P(H | E): Posterior probability of H given E
  • P(E | H): Likelihood of E given H
  • P(H): Prior probability of H
  • P(E): Probability of the evidence E

The prior distribution represents initial information about a parameter before observing data. The posterior distribution is the updated probability distribution after considering the observed data.
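
To see Bayes' theorem in action, here is a minimal sketch using a classic (hypothetical) diagnostic-test example; the prevalence, sensitivity, and false positive rate are made-up numbers.

```python
# Hypothetical diagnostic-test example of Bayes' theorem
# H: the patient has the disease, E: the test comes back positive

p_h = 0.01              # prior: 1% of the population has the disease
p_e_given_h = 0.95      # likelihood: test sensitivity
p_e_given_not_h = 0.05  # false positive rate

# Total probability of a positive test, P(E)
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Posterior: P(H | E) = P(E | H) * P(H) / P(E)
p_h_given_e = p_e_given_h * p_h / p_e
print(f"P(disease | positive test) = {p_h_given_e:.3f}")  # about 0.161
```

Even with a fairly accurate test, the low prior (1% prevalence) keeps the posterior probability small; this prior-to-posterior update is exactly what Bayesian statistics formalizes.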

Wrapping Up

I hope you found this guide helpful. This is not an exhaustive list of statistics concepts for data science, but it should serve as a good starting point.

If you’re interested in a step-by-step guide to learn statistics, check out 7 Steps to Mastering Statistics for Data Science.
