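Confidence Interval for a Single-Sample Proportion

The sample proportion is denoted by \(\hat{p}\) and the population proportion by \(p\).

Asymptotic Normal Method

For large samples, the distribution of the sample proportion \(\hat{p}\) is approximately normal: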

\[ \hat{p} \ \dot{\sim} \ N\left(p, \frac{\hat{p}(1-\hat{p})}{n}\right) \]

The confidence interval is constructed from the \(z\) distribution. For a two-sided interval:

\[ z = \frac{\hat{p} - p}{\sqrt{\hat{p}(1-\hat{p})/n}} \sim N(0, 1) \]
\[ \begin{align} L & = \hat{p} - z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \\ U & = \hat{p} + z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \end{align} \]

Define the width of the confidence interval as \(d\); then:

\[ d = \min(U, 1) - \max(L, 0) \]
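For a one-sided interval with only a lower confidence limit, \(z_{1-\alpha/2}\) is replaced by \(z_{1-\alpha}\):

\[ \begin{align} L & = \hat{p} - z_{1-\alpha} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \\ U & = 1 \end{align} \]

Define the distance from the point estimate \(\hat{p}\) to the confidence limit as \(d\); then: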

\[ d = \hat{p} - \max(L, 0) \]
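For a one-sided interval with only an upper confidence limit:

\[ \begin{align} L & = 0 \\ U & = \hat{p} + z_{1-\alpha} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \end{align} \]

Define the distance from the point estimate \(\hat{p}\) to the confidence limit as \(d\); then: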

\[ d = \min(U, 1) - \hat{p} \]
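The three cases of the asymptotic normal method can be collected in one small function. This is a minimal sketch rather than a reference implementation: the function name `wald_interval`, the `side` labels, and the example inputs (\(\hat{p} = 0.9\), \(n = 100\)) are assumptions made for illustration, and the normal quantile comes from Python's standard-library `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(p_hat, n, alpha=0.05, side="two-sided"):
    """Return (L, U, d) for the asymptotic normal method.

    side (hypothetical labels): "two-sided", "lower" (lower limit only)
    or "upper" (upper limit only).
    """
    se = sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error of p_hat
    if side == "two-sided":
        z = NormalDist().inv_cdf(1 - alpha / 2)
        lower, upper = p_hat - z * se, p_hat + z * se
        d = min(upper, 1) - max(lower, 0)  # interval width
    elif side == "lower":
        z = NormalDist().inv_cdf(1 - alpha)
        lower, upper = p_hat - z * se, 1.0
        d = p_hat - max(lower, 0)  # distance from p_hat to the lower limit
    elif side == "upper":
        z = NormalDist().inv_cdf(1 - alpha)
        lower, upper = 0.0, p_hat + z * se
        d = min(upper, 1) - p_hat  # distance from p_hat to the upper limit
    else:
        raise ValueError(f"unknown side: {side!r}")
    return lower, upper, d


print(wald_interval(0.9, 100))
print(wald_interval(0.9, 100, side="lower"))
```

The limits are returned unclipped, as in the formulas; only \(d\) applies the \(\min(\cdot, 1)\) and \(\max(\cdot, 0)\) truncation.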

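Asymptotic Normal Method (Continuity Correction)

This method adds the continuity-correction term \(1/(2n)\) to the asymptotic normal method. For a two-sided interval: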

\[ z = \frac{\hat{p} - p \pm \frac{1}{2n}}{\sqrt{\hat{p}(1-\hat{p})/n}} \sim N(0, 1) \]
\[ \begin{align} L & = \hat{p} - z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} - \frac{1}{2n} \\ U & = \hat{p} + z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} + \frac{1}{2n} \end{align} \]

Define the width of the confidence interval as \(d\); then:

\[ d = \min(U, 1) - \max(L, 0) \]
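For a one-sided interval with only a lower confidence limit, \(z_{1-\alpha/2}\) is replaced by \(z_{1-\alpha}\):

\[ \begin{align} L & = \hat{p} - z_{1-\alpha} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} - \frac{1}{2n} \\ U & = 1 \end{align} \]

Define the distance from the point estimate \(\hat{p}\) to the confidence limit as \(d\); then: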

\[ d = \hat{p} - \max(L, 0) \]
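For a one-sided interval with only an upper confidence limit:

\[ \begin{align} L & = 0 \\ U & = \hat{p} + z_{1-\alpha} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} + \frac{1}{2n} \end{align} \]

Define the distance from the point estimate \(\hat{p}\) to the confidence limit as \(d\); then: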

\[ d = \min(U, 1) - \hat{p} \]
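The effect of the correction term, and of the truncation in \(d\), is easiest to see with a small sample. The sketch below covers the two-sided case only; the function name `wald_cc_interval` and the inputs \(\hat{p} = 0.9\), \(n = 20\) are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def wald_cc_interval(p_hat, n, alpha=0.05):
    """Two-sided asymptotic normal limits with the 1/(2n) continuity correction."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Half-width: normal-approximation term plus the continuity correction.
    half = z * sqrt(p_hat * (1 - p_hat) / n) + 1 / (2 * n)
    lower, upper = p_hat - half, p_hat + half
    d = min(upper, 1) - max(lower, 0)  # width after truncating to [0, 1]
    return lower, upper, d


lower, upper, d = wald_cc_interval(0.9, 20)
print(lower, upper, d)  # upper exceeds 1 here, so d is measured up to 1
```

With these inputs the unclipped upper limit lies above 1, so the reported width is \(1 - L\) rather than \(U - L\).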

Clopper-Pearson

\(F_{x;v_1,v_2}\) denotes the \(x\) quantile of the \(F\) distribution with \(v_1\) and \(v_2\) degrees of freedom. For a two-sided interval:

\[ \begin{align} L & = \left[ 1 + \frac{n - n\hat{p} + 1}{n\hat{p} F_{\alpha/2;\ 2n\hat{p},\ 2(n - n\hat{p} + 1)}} \right]^{-1} \\ U & = \left[ 1 + \frac{n - n\hat{p}}{(n\hat{p} + 1) F_{1-\alpha/2;\ 2(n\hat{p} + 1), \ 2(n - n\hat{p})}} \right]^{-1} \end{align} \]

Define the width of the confidence interval as \(d\); then:

\[ d = U - L \]
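For a one-sided interval with only a lower confidence limit, \(\alpha/2\) is replaced by \(\alpha\):

\[ \begin{align} L & = \left[ 1 + \frac{n - n\hat{p} + 1}{n\hat{p} F_{\alpha;\ 2n\hat{p},\ 2(n - n\hat{p} + 1)}} \right]^{-1} \\ U & = 1 \end{align} \]

Define the distance from the point estimate \(\hat{p}\) to the confidence limit as \(d\); then: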

\[ d = \hat{p} - L \]
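For a one-sided interval with only an upper confidence limit:

\[ \begin{align} L & = 0 \\ U & = \left[ 1 + \frac{n - n\hat{p}}{(n\hat{p} + 1) F_{1-\alpha;\ 2(n\hat{p} + 1), \ 2(n - n\hat{p})}} \right]^{-1} \end{align} \]

Define the distance from the point estimate \(\hat{p}\) to the confidence limit as \(d\); then: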

\[ d = U - \hat{p} \]
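The two-sided Clopper-Pearson limits can be evaluated directly from \(F\) quantiles. The sketch below assumes SciPy is available and uses `scipy.stats.f.ppf` for the quantiles; the function name `clopper_pearson` and the parameter `x` (the number of successes, \(x = n\hat{p}\)) are illustrative. The formulas are undefined for \(x = 0\) and \(x = n\), so the sketch falls back to the conventional limits \(L = 0\) and \(U = 1\) in those cases, which is an assumption not stated in the text above.

```python
from scipy.stats import beta, f


def clopper_pearson(x, n, alpha=0.05):
    """Two-sided Clopper-Pearson limits from F quantiles; x = n * p_hat successes."""
    if x == 0:
        lower = 0.0  # conventional value, the F-based formula is undefined here
    else:
        f_low = f.ppf(alpha / 2, 2 * x, 2 * (n - x + 1))
        lower = 1 / (1 + (n - x + 1) / (x * f_low))
    if x == n:
        upper = 1.0  # conventional value, the F-based formula is undefined here
    else:
        f_up = f.ppf(1 - alpha / 2, 2 * (x + 1), 2 * (n - x))
        upper = 1 / (1 + (n - x) / ((x + 1) * f_up))
    return lower, upper


lower, upper = clopper_pearson(90, 100)
print(lower, upper, upper - lower)

# Cross-check against the equivalent beta-quantile form of the same limits.
print(beta.ppf(0.025, 90, 11), beta.ppf(0.975, 91, 10))
```

The cross-check relies on the standard relationship between beta and \(F\) quantiles, so both lines should print the same limits up to floating-point error.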

Wilson Score

\[ \begin{align} L & = \frac{\left(2n\hat{p} + z_{1-\alpha/2}^2\right) - z_{1-\alpha/2} \sqrt{z_{1-\alpha/2}^2 + 4n\hat{p}(1-\hat{p})}}{2\left(n + z_{1-\alpha/2}^2\right)} \\ U & = \frac{\left(2n\hat{p} + z_{1-\alpha/2}^2\right) + z_{1-\alpha/2} \sqrt{z_{1-\alpha/2}^2 + 4n\hat{p}(1-\hat{p})}}{2\left(n + z_{1-\alpha/2}^2\right)} \end{align} \]
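As a reasoning aid, the two limits above can be recovered as the roots in \(p\) of the score equation, in which the standard error is evaluated at \(p\) rather than at \(\hat{p}\). Writing \(z\) for \(z_{1-\alpha/2}\):

\[ \begin{align} \frac{(\hat{p} - p)^2}{p(1-p)/n} = z^2 \quad & \Longleftrightarrow \quad (n + z^2)\,p^2 - (2n\hat{p} + z^2)\,p + n\hat{p}^2 = 0 \\ p & = \frac{(2n\hat{p} + z^2) \pm z \sqrt{z^2 + 4n\hat{p}(1-\hat{p})}}{2(n + z^2)} \end{align} \]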

Define the width of the confidence interval as \(d\); then:

\[ d = U - L \]
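For a one-sided interval with only an upper confidence limit, \(z_{1-\alpha/2}\) is replaced by \(z_{1-\alpha}\):

\[ \begin{align} L & = 0 \\ U & = \frac{\left(2n\hat{p} + z_{1-\alpha}^2\right) + z_{1-\alpha} \sqrt{z_{1-\alpha}^2 + 4n\hat{p}(1-\hat{p})}}{2\left(n + z_{1-\alpha}^2\right)} \end{align} \]

Define the distance from the point estimate \(\hat{p}\) to the confidence limit as \(d\); then:

\[ d = U - \hat{p} \]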
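For a one-sided interval with only a lower confidence limit:

\[ \begin{align} L & = \frac{\left(2n\hat{p} + z_{1-\alpha}^2\right) - z_{1-\alpha} \sqrt{z_{1-\alpha}^2 + 4n\hat{p}(1-\hat{p})}}{2\left(n + z_{1-\alpha}^2\right)} \\ U & = 1 \end{align} \]

Define the distance from the point estimate \(\hat{p}\) to the confidence limit as \(d\); then:

\[ d = \hat{p} - L \]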
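A compact implementation of the Wilson score limits might look as follows. It is a sketch under stated assumptions: the function name `wilson_interval` and the `side` labels are hypothetical, and the inputs \(\hat{p} = 0.9\), \(n = 100\) are chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(p_hat, n, alpha=0.05, side="two-sided"):
    """Return (L, U) for the Wilson score method.

    side (hypothetical labels): "two-sided", "lower" or "upper".
    """
    if side not in ("two-sided", "lower", "upper"):
        raise ValueError(f"unknown side: {side!r}")
    # Two-sided intervals use z_{1-alpha/2}; one-sided limits use z_{1-alpha}.
    q = 1 - alpha / 2 if side == "two-sided" else 1 - alpha
    z = NormalDist().inv_cdf(q)
    centre = 2 * n * p_hat + z**2
    spread = z * sqrt(z**2 + 4 * n * p_hat * (1 - p_hat))
    denom = 2 * (n + z**2)
    lower, upper = (centre - spread) / denom, (centre + spread) / denom
    if side == "lower":  # lower confidence limit only
        upper = 1.0
    elif side == "upper":  # upper confidence limit only
        lower = 0.0
    return lower, upper


lower, upper = wilson_interval(0.9, 100)
print(lower, upper, upper - lower)       # two-sided limits and width d
print(wilson_interval(0.9, 100, side="upper"))
```

Unlike the asymptotic normal limits, these limits stay inside \([0, 1]\), which is why the width here is simply \(U - L\) without truncation.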

Wilson Score with Continuity Correction

\[ \begin{align} L & = \frac{\left(2n\hat{p} + z_{1-\alpha/2}^2 - 1\right) - z_{1-\alpha/2} \sqrt{z_{1-\alpha/2}^2 - \frac{1}{n} + 4n\hat{p}(1-\hat{p}) + 4\hat{p} - 2}}{2\left(n + z_{1-\alpha/2}^2\right)} \\ U & = \frac{\left(2n\hat{p} + z_{1-\alpha/2}^2 + 1\right) + z_{1-\alpha/2} \sqrt{z_{1-\alpha/2}^2 - \frac{1}{n} + 4n\hat{p}(1-\hat{p}) - 4\hat{p} + 2}}{2\left(n + z_{1-\alpha/2}^2\right)} \end{align} \]

Define the width of the confidence interval as \(d\); then:

\[ d = U - L \]
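For a one-sided interval with only an upper confidence limit, \(z_{1-\alpha/2}\) is replaced by \(z_{1-\alpha}\):

\[ \begin{align} L & = 0 \\ U & = \frac{\left(2n\hat{p} + z_{1-\alpha}^2 + 1\right) + z_{1-\alpha} \sqrt{z_{1-\alpha}^2 - \frac{1}{n} + 4n\hat{p}(1-\hat{p}) - 4\hat{p} + 2}}{2\left(n + z_{1-\alpha}^2\right)} \end{align} \]

Define the distance from the point estimate \(\hat{p}\) to the confidence limit as \(d\); then:

\[ d = U - \hat{p} \]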
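For a one-sided interval with only a lower confidence limit:

\[ \begin{align} L & = \frac{\left(2n\hat{p} + z_{1-\alpha}^2 - 1\right) - z_{1-\alpha} \sqrt{z_{1-\alpha}^2 - \frac{1}{n} + 4n\hat{p}(1-\hat{p}) + 4\hat{p} - 2}}{2\left(n + z_{1-\alpha}^2\right)} \\ U & = 1 \end{align} \]

Define the distance from the point estimate \(\hat{p}\) to the confidence limit as \(d\); then:

\[ d = \hat{p} - L \]

Width of the Continuity-Corrected Wilson Score Confidence Interval as a Function of Sample Size \(n\)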

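Taking \(p = 0.9\) as an example, the width of the two-sided 95% confidence interval is plotted against the sample size \(n\):

[Figure: width of the continuity-corrected Wilson score confidence interval versus sample size \(n\)]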

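If \(n\) is treated as a continuous variable, the width of the confidence interval first increases and then decreases as \(n\) grows, which can complicate numerical root finding.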

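If the target width of the confidence interval is set to \(0.8\), there are in theory two numerical solutions, and the larger one should be taken as the sample size estimate.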

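brentq requires the function values at the two endpoints of the root-finding interval to have opposite signs, a condition that fails when the interval contains both solutions. In that case, first use minimize_scalar to locate the maximum of the width within the interval, take the maximizer as the lower bound of the root-finding interval, and then apply brentq to solve numerically.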
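The two-step procedure can be sketched as follows, assuming SciPy's `scipy.optimize.minimize_scalar` and `scipy.optimize.brentq`. The function name `wilson_cc_width` and the search interval \([0.45, 200]\) are illustrative assumptions; the lower end was picked by hand so that both square roots stay real for \(\hat{p} = 0.9\), and the width is below \(0.8\) at both ends, so brentq cannot be applied to the full interval directly.

```python
from math import ceil, sqrt
from statistics import NormalDist

from scipy.optimize import brentq, minimize_scalar


def wilson_cc_width(n, p_hat=0.9, alpha=0.05):
    """Width U - L of the two-sided continuity-corrected Wilson score interval."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    base = z**2 - 1 / n + 4 * n * p_hat * (1 - p_hat)  # shared part of both radicands
    denom = 2 * (n + z**2)
    lower = (2 * n * p_hat + z**2 - 1 - z * sqrt(base + 4 * p_hat - 2)) / denom
    upper = (2 * n * p_hat + z**2 + 1 + z * sqrt(base - 4 * p_hat + 2)) / denom
    return upper - lower


target = 0.8                 # desired interval width
n_min, n_max = 0.45, 200.0   # hypothetical search interval for continuous n

# Step 1: locate the peak of the width curve by minimizing the negated width.
peak = minimize_scalar(
    lambda n: -wilson_cc_width(n), bounds=(n_min, n_max), method="bounded"
)

# Step 2: to the right of the peak the width decreases, so [peak.x, n_max]
# brackets only the larger root and the endpoint signs differ.
n_root = brentq(lambda n: wilson_cc_width(n) - target, peak.x, n_max)

print(peak.x, n_root, ceil(n_root))
```

For these inputs the larger root lies between 2 and 3; rounding up with `ceil` to obtain an integer sample size is a design choice of this sketch, not something prescribed above.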