diff --git a/Research Method/Quantitative Research/Descriptive Analysis.md b/Research Method/Quantitative Research/Descriptive Analysis.md
deleted file mode 100644
index 2c2e201..0000000
--- a/Research Method/Quantitative Research/Descriptive Analysis.md
+++ /dev/null
@@ -1,6 +0,0 @@
----
-Course:
-  - PSYC10100 Introduction to Statistics for Psychological Sciences
----
-## Central Tendency
-
diff --git a/Research Method/Quantitative Research/Descriptive Statistics.md b/Research Method/Quantitative Research/Descriptive Statistics.md
new file mode 100644
index 0000000..ab6075c
--- /dev/null
+++ b/Research Method/Quantitative Research/Descriptive Statistics.md
@@ -0,0 +1,529 @@
+---
+Course:
+  - PSYC10100 Introduction to Statistics for Psychological Sciences
+---
+## 1. Size

+We denote the sizes of:
+
+- Population: $N$
+- Sample: $n$
+
+These notations distinguish the population from the sample; for the difference between the two, see [Population and Sample](Population%20and%20Sample.md).
+
+## 2. Central Tendency
+
+Central tendency provides a general description of the data set, using the mean, mode, and median.
+
+| Measure    | Advantages                          | Disadvantages                           | Best for             |
+| ---------- | ----------------------------------- | --------------------------------------- | -------------------- |
+| **Mean**   | Uses all data, algebraic properties | Sensitive to outliers                   | Normal distributions |
+| **Median** | Robust to outliers                  | Ignores magnitude of extreme values     | Skewed distributions |
+| **Mode**   | Works for categorical data          | May not be unique, ignores other values | Nominal data         |
+
+### 2.1. Mean
+
+The mean is the average of a variable in a data set. To tell them apart, the population mean is denoted $\mu$, while the sample mean is denoted $\bar{x}$.
+
+$$
+\mu = \frac{\sum_{i=1}^{N}{x_i}}{N}
+$$
+$$
+\bar{x} = \frac{\sum_{i=1}^{n}{x_i}}{n}
+$$
+
+In some articles the mean is also treated as the expectation of the variable, written $E[x] = \mu$.
+In most situations, $\mu$ is estimated by the sample mean $\bar{x}$, but doing so costs one degree of freedom (DoF).
+
+To reduce the influence of outliers, a *trimmed mean* averages the data after removing a specified percentage of the highest and lowest values.
+
+``` R
+library(psych)
+describe(data, trim = 0.1) # Set the proportion trimmed from each tail
+```
+
+### 2.2. Mode
+
+The **mode** is the value that appears most frequently in a data set. Unlike the mean and median, the mode need not be unique; a data set can have:
+
+- **Unimodal**: One mode
+- **Bimodal**: Two modes
+- **Multimodal**: Multiple modes
+- **No mode**: All values appear with equal frequency
+
+**Characteristics**:
+- Useful for categorical data and discrete distributions
+- Not affected by extreme values (outliers)
+- May not exist or may not be unique
+- For continuous data, mode is often estimated using kernel density estimation
+
+**Example**: For the data set {1, 2, 2, 3, 4, 4, 4, 5}, the mode is 4.
+
+### 2.3. Median
+
+The **median** is the middle value in an ordered data set. It divides the data into two equal halves:
+
+- For an odd number of observations: the middle value
+- For an even number of observations: the average of the two middle values
+
+**Calculation**:
+- Order the data from smallest to largest
+- If $n$ is odd: $\text{median} = x_{(n+1)/2}$
+- If $n$ is even: $\text{median} = \frac{x_{n/2} + x_{n/2 + 1}}{2}$
+
+**Characteristics**:
+- Robust to outliers and skewed distributions
+- More appropriate than the mean for ordinal data
+- Represents the 50th percentile (second quartile)
+
+**Example**:
+- Data: {3, 1, 7, 4, 2} → Ordered: {1, 2, 3, 4, 7} → Median = 3
+- Data: {3, 1, 7, 4} → Ordered: {1, 3, 4, 7} → Median = (3+4)/2 = 3.5
+
+## 3. Measures of Dispersion
+
+### 3.1. Variance
+
+Variance measures the average squared deviation from the mean. We distinguish between population and sample variance because of degrees-of-freedom considerations.
+
+#### 3.1.1. Population Variance ($\sigma^2$)
+The variance of the entire population:
+
+$$
+\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}
+$$
+
+where:
+
+- $N$ = population size
+- $\mu$ = population mean
+- $x_i$ = individual values
+
+**Example**:
+Population {1, 3, 5, 7, 9}, $\mu = 5$.
+$\sigma^2 = \frac{(1-5)^2 + (3-5)^2 + (5-5)^2 + (7-5)^2 + (9-5)^2}{5} = \frac{16+4+0+4+16}{5} = 8$
+#### 3.1.2. Sample Variance ($s^2$)
+The variance estimated from a sample, using $n-1$ in the denominator for unbiased estimation:
+
+$$
+s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}
+$$
+
+where:
+- $n$ = sample size
+- $\bar{x}$ = sample mean
+- $x_i$ = individual values
+
+**Key Differences**:
+- **Denominator**: Population uses $N$, sample uses $n-1$ (Bessel's correction)
+- **Purpose**: Population variance describes the entire population; sample variance estimates the population variance
+- **Bias**: Using $n$ instead of $n-1$ in the sample variance creates a biased estimator
+
+**Why $n-1$ for sample variance?**
+
+Using $n-1$ (degrees of freedom) corrects for the fact that we estimate the population mean from the sample, making $s^2$ an unbiased estimator of $\sigma^2$.
+
+**Example**:
+A sample {1, 3, 5} from this population, $\bar{x} = 3$.
+$s^2 = \frac{(1-3)^2 + (3-3)^2 + (5-3)^2}{3-1} = \frac{4+0+4}{2} = 4$
+
+**R Implementation**:
+
+```R
+sample_var <- var(data) # Sample variance (built-in function)
+```
+
+### 3.2. Standard Deviation
+
+Standard deviation is the square root of variance, providing a measure of dispersion in the original units of the data. Like variance, it distinguishes between population and sample standard deviation.
+
+#### 3.2.1. Population Standard Deviation ($\sigma$)
+The standard deviation of the entire population:
+
+$$
+\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}
+$$
+
+#### 3.2.2. Sample Standard Deviation ($s$)
+The standard deviation estimated from a sample:
+
+$$
+s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}
+$$
+
+**Interpretation**:
+
+- Approximately 68% of data falls within $\pm1$ standard deviation from the mean (for normal distributions)
+- Approximately 95% of data falls within $\pm2$ standard deviations from the mean
+- Approximately 99.7% of data falls within $\pm3$ standard deviations from the mean
+
+**Advantages over Variance**:
+
+- Same units as original data (easier to interpret)
+- More intuitive measure of spread
+- Directly comparable to the mean
+
+**Example** (continuing from variance examples):
+
+- Population: $\sigma = \sqrt{8} \approx 2.83$
+- Sample: $s = \sqrt{4} = 2$
+
+**R Implementation**:
+```R
+sample_sd <- sd(data) # Sample standard deviation (built-in function)
+```
+
+### 3.3. Median Absolute Deviation
+
+Median Absolute Deviation (MAD) is a robust measure of statistical dispersion that uses the median rather than the mean, making it resistant to outliers.
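This robustness is easy to see on a small vector with one extreme value — a minimal sketch in base R (the data are made up for illustration):

```r
x <- c(1, 2, 3, 4, 100)  # 100 is an extreme outlier

sd(x)                    # standard deviation is inflated by the outlier (~43.6)
mad(x, constant = 1)     # raw MAD stays at 1, unaffected by the outlier
```

The same vector reappears in the worked example below.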
+
+**Definition**:
+MAD is the median of the absolute deviations from the data's median:
+
+$$
+\text{MAD} = \text{median}(|x_i - \text{median}(x)|)
+$$
+
+**Scale Factor**:
+For normally distributed data, MAD needs to be scaled by approximately 1.4826 to be comparable to the standard deviation:
+
+$$
+\text{MAD}_{\text{normal}} = 1.4826 \times \text{MAD}
+$$
+
+**Why Use MAD?**
+
+- **Robustness**: Not influenced by extreme values
+- **Breakdown point**: 50% - up to half the data can be outliers without affecting MAD
+- **Interpretation**: Similar to standard deviation but for median-based analysis
+
+**Comparison with Standard Deviation**:
+
+| Aspect          | Standard Deviation          | MAD                 |
+| --------------- | --------------------------- | ------------------- |
+| **Sensitivity** | Sensitive to outliers       | Robust to outliers  |
+| **Center**      | Uses mean                   | Uses median         |
+| **Breakdown**   | 0% (any outlier affects it) | 50%                 |
+| **Units**       | Original data units         | Original data units |
+
+**Example**:
+Data: {1, 2, 3, 4, 100} (100 is an outlier)
+
+- Median = 3
+- Absolute deviations: |1-3|=2, |2-3|=1, |3-3|=0, |4-3|=1, |100-3|=97
+- Sorted deviations: {0, 1, 1, 2, 97}
+- MAD = median{0, 1, 1, 2, 97} = 1
+
+**R Implementation**:
+```R
+# Raw (unscaled) MAD
+raw_mad <- mad(data, constant = 1)
+
+# Normal-consistent MAD: R's default constant is already 1.4826,
+# so mad(data) returns the value comparable to the standard deviation
+scaled_mad <- mad(data)
+```
+### 3.4. Interquartile Range
+
+Interquartile Range (IQR) is a robust measure of statistical dispersion that describes the spread of the middle 50% of the data. It is particularly useful for identifying outliers and understanding the data distribution.
+
+**Definition**:
+IQR is the difference between the third quartile (Q3) and the first quartile (Q1):
+
+$$
+\text{IQR} = Q_3 - Q_1
+$$
+
+where:
+- $Q_1$ = First quartile (25th percentile)
+- $Q_3$ = Third quartile (75th percentile)
+
+**Quartile Calculation Methods**:
+There are several methods to calculate quartiles. Common approaches include:
+
+1. **Method 1 (Inclusive)**: $Q_1$ at position $(n+1)/4$, $Q_3$ at position $3(n+1)/4$
+2. **Method 2 (Exclusive)**: $Q_1$ at position $n/4$, $Q_3$ at position $3n/4$
+3. **Tukey's Method**: Median of lower half and upper half
+
+**Outlier Detection**:
+IQR is commonly used to identify outliers using the "1.5×IQR rule":
+- **Lower fence**: $Q_1 - 1.5 \times \text{IQR}$
+- **Upper fence**: $Q_3 + 1.5 \times \text{IQR}$
+- Values outside these fences are considered potential outliers
+
+**Box Plot Relationship**:
+IQR forms the "box" in box plots:
+- Box extends from Q1 to Q3
+- Line inside box represents median
+- Whiskers extend to most extreme non-outlier values
+
+**Advantages**:
+- **Robust**: Not affected by extreme values
+- **Intuitive**: Easy to interpret and visualize
+- **Standardized**: Widely used in exploratory data analysis
+
+**Example**:
+Data: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
+
+- Q1 (25th percentile) = 3.25
+- Q3 (75th percentile) = 7.75
+- IQR = 7.75 - 3.25 = 4.5
+- Lower fence = 3.25 - 1.5×4.5 = -3.5
+- Upper fence = 7.75 + 1.5×4.5 = 14.5
+- No outliers in this dataset
+
+**R Implementation**:
+```R
+# Calculate IQR (built-in function)
+iqr_value <- IQR(data)
+
+# Calculate quartiles
+q1 <- quantile(data, 0.25)
+q3 <- quantile(data, 0.75)
+manual_iqr <- q3 - q1
+
+# Outlier detection
+lower_fence <- q1 - 1.5 * iqr_value
+upper_fence <- q3 + 1.5 * iqr_value
+outliers <- data[data < lower_fence | data > upper_fence]
+
+# Box plot visualization
+boxplot(data, main = "Box Plot with IQR")
+```
+
+**Comparison with Other Dispersion Measures**:
+
+| Measure | Robustness | Interpretation | Best Use |
+|---------|------------|----------------|----------|
+| **Range** | Not robust | Total spread | Quick overview |
+| **Variance/SD** | Not robust | Average deviation | Normal distributions |
+| **MAD** | Very robust | Median-based spread | Outlier-prone data |
+| **IQR** | Robust | Middle 50% spread | Exploratory analysis |
+
+## 4. Distribution Shape
+
+### 4.1. Skewness
+
+Skewness measures the asymmetry of a probability distribution around its mean. It indicates whether data are concentrated more on one side of the distribution.
+
+**Types of Skewness**:
+
+- **Positive Skew (Right Skew)**: Tail extends to the right, mean > median > mode
+- **Negative Skew (Left Skew)**: Tail extends to the left, mean < median < mode
+- **Zero Skew**: Symmetrical distribution, mean = median = mode
+
+**Calculation**:
+The most common measure is Pearson's moment coefficient of skewness:
+
+$$
+\text{Skewness} = \frac{E[(X - \mu)^3]}{\sigma^3}
+$$
+
+For sample data:
+
+$$
+g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}
+$$
+
+**Interpretation**:
+
+- **|Skewness| < 0.5**: Approximately symmetric
+- **0.5 ≤ |Skewness| < 1**: Moderately skewed
+- **|Skewness| ≥ 1**: Highly skewed
+
+**R Implementation**:
+```R
+# Using moments package
+library(moments)
+skewness_value <- skewness(data)
+
+# Using psych package
+library(psych)
+skewness_value <- describe(data)$skew
+
+# Manual calculation of g1 (population SD in the denominator, matching the formula above)
+manual_skew <- function(x) {
+  n <- length(x)
+  mean_x <- mean(x)
+  sd_pop <- sqrt(sum((x - mean_x)^2) / n)
+  (sum((x - mean_x)^3) / n) / (sd_pop^3)
+}
+```
+
+### 4.2. Kurtosis
+
+Kurtosis measures the "tailedness" of a probability distribution, indicating how much data are in the tails compared to a normal distribution.
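A quick numerical illustration (a base-R sketch; the moment ratio below is the population formula, and the simulated samples are assumptions for demonstration): a heavy-tailed $t$-distribution sample shows markedly higher kurtosis than a normal one.

```r
set.seed(42)
# Fourth standardized moment: E[(X - mu)^4] / sigma^4
kurt <- function(x) mean((x - mean(x))^4) / mean((x - mean(x))^2)^2

kurt(rnorm(1e5))       # close to 3: mesokurtic
kurt(rt(1e5, df = 5))  # noticeably above 3: leptokurtic, heavy tails
```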
+
+**Types of Kurtosis**:
+
+- **Mesokurtic**: Normal distribution, kurtosis = 3 (excess kurtosis = 0)
+- **Leptokurtic**: Heavy tails and sharp peak, kurtosis > 3 (excess kurtosis > 0)
+- **Platykurtic**: Light tails and flat peak, kurtosis < 3 (excess kurtosis < 0)
+
+**Calculation**:
+Pearson's moment coefficient of kurtosis:
+
+$$
+\text{Kurtosis} = \frac{E[(X - \mu)^4]}{\sigma^4}
+$$
+
+Excess kurtosis (more commonly used):
+
+$$
+\text{Excess Kurtosis} = \frac{E[(X - \mu)^4]}{\sigma^4} - 3
+$$
+
+For sample data:
+
+$$
+g_2 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^2} - 3
+$$
+
+**Interpretation**:
+- **Excess Kurtosis = 0**: Normal distribution
+- **Excess Kurtosis > 0**: Heavy tails (more outliers)
+- **Excess Kurtosis < 0**: Light tails (fewer outliers)
+
+**R Implementation**:
+```R
+# Using moments package
+library(moments)
+kurtosis_value <- kurtosis(data) # Note: returns raw kurtosis (about 3 for normal data), not excess
+
+# Using psych package (returns excess kurtosis)
+library(psych)
+kurtosis_value <- describe(data)$kurtosis
+
+# Manual calculation of g2 (population SD in the denominator, matching the formula above)
+manual_kurtosis <- function(x) {
+  n <- length(x)
+  mean_x <- mean(x)
+  sd_pop <- sqrt(sum((x - mean_x)^2) / n)
+  (sum((x - mean_x)^4) / n) / (sd_pop^4) - 3
+}
+```
+
+## 5. Estimation Accuracy
+
+### 5.1. Standard Error
+
+Standard Error (SE) measures the precision of a sample statistic as an estimate of the population parameter. It quantifies the variability of the sampling distribution.
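The definition can be checked by simulation: the standard deviation of many sample means should match $\sigma/\sqrt{n}$. A sketch using simulated standard-normal data ($\sigma = 1$ is an assumption for illustration):

```r
set.seed(7)
n <- 25
means <- replicate(10000, mean(rnorm(n)))  # sampling distribution of the mean

sd(means)    # empirical standard error, close to 0.2
1 / sqrt(n)  # theoretical sigma / sqrt(n) = 0.2
```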
+
+**Standard Error of the Mean (SEM)**:
+The most common standard error, measuring how much the sample mean varies from sample to sample:
+
+$$
+\text{SEM} = \frac{s}{\sqrt{n}}
+$$
+
+where:
+
+- $s$ = sample standard deviation
+- $n$ = sample size
+
+**Interpretation**:
+
+- A smaller SE indicates a more precise estimate
+- SE decreases as sample size increases
+- Used to construct confidence intervals
+
+**Other Standard Errors**:
+
+- **Standard Error of Proportion**: $\text{SE}_p = \sqrt{\frac{p(1-p)}{n}}$
+- **Standard Error of Difference**: $\text{SE}_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
+
+**R Implementation**:
+```R
+# Standard Error of the Mean
+sem <- function(x) {
+  sd(x) / sqrt(length(x))
+}
+
+# Using built-in functions
+sample_mean <- mean(data)
+sample_sd <- sd(data)
+n <- length(data)
+sem_manual <- sample_sd / sqrt(n)
+
+# For proportions
+se_prop <- function(p, n) {
+  sqrt(p * (1 - p) / n)
+}
+```
+
+### 5.2. Confidence Interval
+
+Confidence Interval (CI) provides a range of values that likely contains the population parameter with a specified level of confidence.
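What "confidence" means can itself be demonstrated by simulation: across many repeated samples, about 95% of the computed intervals should contain the true mean. A sketch with simulated data whose true mean is 5 (the sample size and number of replications are assumptions for illustration):

```r
set.seed(123)
covered <- replicate(2000, {
  x <- rnorm(20, mean = 5)  # sample of n = 20 from a population with mu = 5
  ci <- t.test(x)$conf.int  # default 95% t-based confidence interval
  ci[1] <= 5 && 5 <= ci[2]  # did this interval capture mu?
})

mean(covered)  # proportion of intervals covering the true mean, close to 0.95
```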
+
+**Confidence Interval for Mean**:
+For large samples or known population variance:
+
+$$
+\text{CI} = \bar{x} \pm z_{\alpha/2} \times \frac{s}{\sqrt{n}}
+$$
+
+For small samples with unknown population variance:
+
+$$
+\text{CI} = \bar{x} \pm t_{\alpha/2, df} \times \frac{s}{\sqrt{n}}
+$$
+
+where:
+- $\bar{x}$ = sample mean
+- $s$ = sample standard deviation
+- $n$ = sample size
+- $z_{\alpha/2}$ = critical z-value
+- $t_{\alpha/2, df}$ = critical t-value with $df = n-1$
+
+**Common Confidence Levels**:
+- **90% CI**: $\alpha = 0.10$, $z_{0.05} = 1.645$
+- **95% CI**: $\alpha = 0.05$, $z_{0.025} = 1.96$
+- **99% CI**: $\alpha = 0.01$, $z_{0.005} = 2.576$
+
+**Interpretation**:
+- "We are 95% confident that the true population mean lies between [lower, upper]"
+- Does NOT mean there's a 95% probability that the specific interval contains the parameter
+
+**R Implementation**:
+```R
+# Using t.test for confidence interval
+result <- t.test(data)
+ci <- result$conf.int
+
+# Manual calculation for 95% CI
+manual_ci <- function(x, conf_level = 0.95) {
+  n <- length(x)
+  mean_x <- mean(x)
+  sd_x <- sd(x)
+  se <- sd_x / sqrt(n)
+  t_critical <- qt(1 - (1 - conf_level)/2, df = n - 1)
+  lower <- mean_x - t_critical * se
+  upper <- mean_x + t_critical * se
+  return(c(lower, upper))
+}
+
+# For proportions (x is a vector of 0s and 1s)
+prop_ci <- function(x, conf_level = 0.95) {
+  p <- mean(x)
+  n <- length(x)
+  se <- sqrt(p * (1 - p) / n)
+  z_critical <- qnorm(1 - (1 - conf_level)/2)
+  lower <- p - z_critical * se
+  upper <- p + z_critical * se
+  return(c(lower, upper))
+}
+```
+
+## 6. Implementation in R
+
+The `psych` package provides a convenient set of descriptive statistics:
+
+``` R
+library(psych)
+data <- c(1, 1, 6, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9)
+View(describe(data))
+```
+
+However, the package does not report the mode of a data set; you can compute it yourself:
+
+``` R
+# Most frequent value (ties broken by taking the first)
+mode_val <- as.numeric(names(sort(table(data), decreasing = TRUE)[1]))
+```
\ No newline at end of file
diff --git a/Research Method/Quantitative Research/Population and Sample.md b/Research Method/Quantitative Research/Population and Sample.md
new file mode 100644
index 0000000..cf2248d
--- /dev/null
+++ b/Research Method/Quantitative Research/Population and Sample.md
@@ -0,0 +1,14 @@
+---
+Course:
+  - PSYC10100 Introduction to Statistics for Psychological Sciences
+---
+## Population
+
+A population is the entire group of individuals a study aims to draw conclusions about.
+
+- Population size: $N$
+## Sample
+
+- Sample size: $n$
+
+## Sampling
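A simple random sample of size $n$ can be drawn from a population of size $N$ with base R's `sample()` — a minimal sketch (the toy population of integers is an assumption for illustration):

```r
population <- 1:100                 # toy population, N = 100
set.seed(1)
s <- sample(population, size = 10)  # simple random sample without replacement, n = 10

length(s)  # n = 10
mean(s)    # sample mean, an estimate of the population mean
```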