Sample Selection and Size: Outliers vs. Cherry-Picking, Sampling vs. Inflation
- Sample Selection
As discussed above, the integrity of any statistical analysis depends heavily on the sample used. Removing genuine outliers is often legitimate, but it can easily slide into cherry-picking: leaving out sample points because they work against your desired results. Intentionally selecting sample points to reach a preferred outcome is unethical and is a form of p-hacking.
To avoid this, document your data selection process clearly and visualise the outliers before removing them. Another option is to present two sets of results, with and without the excluded points, so it is clear that every left-out sample point was truly an outlier.
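The "report both sets of results" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data values are made up, the helper names (`iqr_outlier_mask`, `summarise`) are hypothetical, and the 1.5 x IQR fence (Tukey's rule) is one common outlier criterion, not necessarily the one used elsewhere in this lesson.

```python
import statistics

def iqr_outlier_mask(values, k=1.5):
    """Return a list of booleans, True where a value lies outside the IQR fences.

    Assumption: outliers are defined by Tukey's rule (k * IQR beyond Q1 or Q3).
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]

def summarise(values):
    """Basic summary so both versions of the analysis can be reported side by side."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }

# Hypothetical measurements; the last value looks suspicious.
data = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3, 5.1, 12.4]

mask = iqr_outlier_mask(data)
removed = [v for v, is_out in zip(data, mask) if is_out]
kept = [v for v, is_out in zip(data, mask) if not is_out]

# Report both analyses and list exactly which points were excluded and why.
report = {
    "rule": "1.5 * IQR fences",
    "removed_points": removed,
    "all_points": summarise(data),
    "without_outliers": summarise(kept),
}
print(report)
```

Fixing the rule in code and listing the removed points lets a reader check that the exclusion was mechanical rather than chosen to favour a particular result.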
- Sample Size
Sampling means choosing a sample that is large enough to represent the population of data. Inflation, however, involves duplicating or fabricating data samples to artificially lower p-values and make a result appear significant. The ideal sample size can be calculated using the formula below to prevent undersampling. Inflation can be controlled by preregistering the study and by visualising the data clearly, which makes duplicated points easier to spot. Supporting p-values with other statistical tools, as mentioned in Lesson 1, also mitigates such errors.
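To illustrate why duplicated data distorts a significance test, the sketch below runs the same test on a small sample and on a copy of it repeated three times. It is a rough demonstration under assumptions: the values are invented, the function names are hypothetical, and a two-sided z-test with a normal approximation is used for simplicity (for a sample this small a t-test would normally be preferred).

```python
import math
import statistics
from collections import Counter

def z_test_p_value(sample, mu0):
    """Two-sided p-value for H0: mean == mu0, using a normal approximation."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.mean(sample) - mu0) / standard_error
    return 2 * (1 - statistics.NormalDist().cdf(abs(z)))

def duplicated_values(values):
    """Return each value that appears more than once, with its count."""
    counts = Counter(values)
    return {value: count for value, count in counts.items() if count > 1}

# Hypothetical measurements, tested against a reference mean of 5.0.
original = [5.4, 4.6, 5.9, 4.9, 5.6, 4.8, 5.7, 5.1]
inflated = original * 3  # the same eight points copied three times

p_original = z_test_p_value(original, mu0=5.0)
p_inflated = z_test_p_value(inflated, mu0=5.0)

print(f"p-value, original sample (n={len(original)}): {p_original:.4f}")
print(f"p-value, inflated sample (n={len(inflated)}): {p_inflated:.4f}")

# A simple duplicate check flags the problem before any test is run.
print("duplicates in original:", duplicated_values(original))
print("duplicates in inflated:", duplicated_values(inflated))
```

The inflated sample contains no new information, yet its standard error shrinks because n is larger, so the p-value drops. Exact repeats are not always fabricated (coarse measurements can legitimately coincide), so a duplicate check like this is a prompt for inspection rather than proof of inflation.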