Is Your Data Lying To You? Avoid Bias with Smarter Data Sampling

Published in AI Mind · 5 min read · Apr 23, 2024

We need data that accurately describes what we are studying to make informed decisions, solve complex problems, and uncover hidden truths. We begin by identifying a population that we are interested in. In statistics, a population is the entire group of individuals, objects, or events you want to study and draw conclusions about.

Data sampling is choosing a smaller group (a sample) from a larger group (a population) to learn more about the larger group. This sample should represent the whole group, so you don’t have to study the entire population for meaningful insights.

Figure 1 shows how we begin with a larger population and then create a smaller sample using an appropriate sampling method.

A diagram showing how sample data is extracted from a larger population using one or more sampling methods. — Figure 1: Populations and Samples
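
As a minimal sketch of the idea, the snippet below draws a simple random sample from a simulated population and compares the two means. The population (10,000 heights in centimetres) and the sample size of 200 are invented purely for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights (cm) of 10,000 people.
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Simple random sample of 200 people, drawn without replacement.
sample = rng.sample(population, k=200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.1f}")
print(f"sample mean:     {sample_mean:.1f}")
```

Although the sample holds only 2% of the population, its mean should land close to the population mean, which is the point of sampling.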

Why not use the entire population?

We’ll examine some standard methods for sampling data, but first, there is an important question to answer: Why don’t we use the whole population as our dataset?

There are several reasons for choosing a sample of the data, including:

Time and Cost

Examining an entire population can be time-consuming and expensive, and is often impossible. Sampling is significantly more cost-effective, especially for large populations.

Destructive Testing

In some scenarios, the testing process destroys the item being tested (for example, crash-testing a car). Sampling becomes essential in these cases.

Diminishing Returns

Increasing the sample size beyond a certain point often results in only minimal gains in accuracy. A well-designed, representative sample can often provide results that are very close to those obtained from studying the entire population.
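
One way to see the diminishing returns is through the approximate margin of error for a proportion estimated from a simple random sample, which shrinks with the square root of the sample size. The sketch below assumes a 95% confidence level (z = 1.96) and the most conservative proportion (p = 0.5); the function name `margin_of_error` is illustrative only.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)

# Each step quadruples the sample size.
for n in (100, 400, 1_600, 6_400):
    print(f"n={n:>5}: +/- {margin_of_error(n) * 100:.2f} percentage points")
```

Each fourfold increase in sample size only halves the margin of error, so the cost of extra precision rises quickly.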

There are times when analysing an entire population is appropriate. For example, a governing body should consider every vote during an election.

Methods for sampling data

There are two broad ways of sampling data, as Figure 2 illustrates: probability sampling and non-probability sampling.

Probability Sampling

For probability sampling, each item in the population has a known, non-zero (and often equal) chance of being selected as part of the sample. Figure 3 shows common approaches to probability sampling.

Table 1 (below) describes how each probability sampling method works and lists the pros and cons of each approach.

A table describing four common types of probability sampling: simple random sampling, systematic sampling, stratified sampling and cluster sampling. — Table 1: Probability Sampling Methods
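
The four methods named in Table 1 can be sketched in a few lines each. This is an illustrative sketch rather than a reference implementation: the customer population, the `region` and `store` fields, and the function names are all hypothetical.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 customers grouped by region and store.
region_sizes = {"north": 400, "south": 300, "east": 200, "west": 100}
population = []
for region, size in region_sizes.items():
    for _ in range(size):
        i = len(population)
        population.append({"id": i, "region": region, "store": i % 20})

def simple_random(pop, n):
    """Every item has the same chance of being selected."""
    return rng.sample(pop, n)

def systematic(pop, n):
    """Pick every k-th item after a random starting point."""
    k = len(pop) // n
    start = rng.randrange(k)
    return pop[start::k][:n]

def stratified(pop, n, key):
    """Sample each stratum in proportion to its share of the population.
    Rounding can make the total differ slightly from n."""
    strata = defaultdict(list)
    for item in pop:
        strata[item[key]].append(item)
    sample = []
    for members in strata.values():
        share = round(n * len(members) / len(pop))
        sample.extend(rng.sample(members, share))
    return sample

def cluster(pop, n_clusters, key):
    """Randomly choose whole clusters and keep every item inside them."""
    cluster_ids = sorted({item[key] for item in pop})
    chosen = set(rng.sample(cluster_ids, n_clusters))
    return [item for item in pop if item[key] in chosen]

srs = simple_random(population, 100)
sys_sample = systematic(population, 100)
strat = stratified(population, 100, "region")
clus = cluster(population, 2, "store")
print(len(srs), len(sys_sample), len(strat), len(clus))
```

Note that systematic sampling can go wrong when the list order has a repeating pattern that lines up with the step size, so it is worth checking how the population is ordered first.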

Non-Probability Sampling

For non-probability sampling, the chance of selecting an item is unknown. Furthermore, the researcher’s judgement may influence the selection process.

Table 2 (below) describes how each non-probability sampling method works and lists the pros and cons of each approach.

A table describing four common types of non-probability sampling: Convenience sampling, voluntary response sampling, purposive sampling and snowball sampling. — Table 2: Non-Probability Sampling Methods
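
Snowball sampling is the most algorithmic of these methods, so a small sketch may help. The referral network and names below are invented, and the function is a simplified illustration that recruits seed participants first and then follows their referrals wave by wave.

```python
# Hypothetical referral network: person -> people they can refer.
contacts = {
    "ana": ["ben", "cho"],
    "ben": ["ana", "dev"],
    "cho": ["eli"],
    "dev": ["fay", "gus"],
    "eli": [],
    "fay": ["gus"],
    "gus": [],
    "hal": ["ida"],  # not connected to the seed used below
    "ida": ["hal"],
}

def snowball_sample(network, seeds, max_size):
    """Recruit the seeds, then follow referrals one wave at a time."""
    sample = []
    seen = set()
    wave = list(seeds)
    while wave and len(sample) < max_size:
        next_wave = []
        for person in wave:
            if person in seen or len(sample) >= max_size:
                continue
            seen.add(person)
            sample.append(person)
            next_wave.extend(network.get(person, []))
        wave = next_wave
    return sample

sample = snowball_sample(contacts, seeds=["ana"], max_size=6)
print(sample)
```

In this toy network, "hal" and "ida" can never be recruited from the seed "ana", however large the sample is allowed to grow. That is the kind of hidden exclusion that makes selection chances unknown in non-probability sampling.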

Non-probability sampling favours feasibility over representativeness. In other words, we should consider these methods when probability sampling is impossible. Non-probability methods may also be suitable if you are looking for qualitative insights. I write about the importance of data types in a separate article.

Whenever using non-probability methods, be aware of the limitations they may impose and take care when applying results to the population.

Ensuring the representativeness of your sample

Sample representativeness is not just a concept; it is a responsibility. It means that the characteristics of your sample closely match those of the larger population you’re studying. A representative sample is like a miniature version of the population, allowing you to draw accurate conclusions about the whole group and supporting the validity of your findings.

Avoiding bias

Bias in a sample means the sample does not accurately represent the larger population you’re trying to study. Understanding the common causes of bias will help you take steps to reduce it.

A table describing common types of bias that can affect how samples are created from a larger population. The types of bias are selection bias, non-response bias, self-selection bias and survivorship bias. Table 3: Common Types of Bias
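
To make one of these concrete, the simulation below sketches non-response bias. Everything in it is assumed for illustration: the satisfaction scores are invented, and the rule that happier customers reply more often is a deliberate simplification.

```python
import random
import statistics

rng = random.Random(3)  # fixed seed so the sketch is repeatable

# Hypothetical satisfaction scores (1 to 10) for 10,000 customers.
population = [rng.randint(1, 10) for _ in range(10_000)]

# Everyone is invited, but (by assumption) happier customers reply more often:
# a score of 10 replies 50% of the time, a score of 1 only 5% of the time.
respondents = [score for score in population if rng.random() < score / 20]

true_mean = statistics.mean(population)
respondent_mean = statistics.mean(respondents)

print(f"true mean:       {true_mean:.2f}")
print(f"respondent mean: {respondent_mean:.2f}")
```

Even though every customer was invited, the respondents' average overstates satisfaction because the people who replied differ systematically from those who did not.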

How to achieve a representative sample

Bias can occur for many reasons and is almost impossible to avoid completely. However, we should always be vigilant and try to reduce bias where we can and report it where we can’t. This caution is crucial in maintaining the integrity of your research.

Favour probability sampling where possible over non-probability sampling. If probability sampling is not entirely possible, consider employing a combination of probability and non-probability sampling methods.
Decide which factors you need to represent proportionally in your sample. Think about how they may be under- or overrepresented.
Decide how large your sample should be. Larger is generally better, although the effort to create a sample is proportional to its size. There are statistical tools that can help you determine sample size, given your desired levels of accuracy.
You can apply weights to underrepresented groups as part of your analysis, but exercise caution: large weights amplify the influence of a small number of observations.
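
Two of the steps above lend themselves to a short sketch: choosing a sample size and weighting groups. The first function uses Cochran's formula for estimating a proportion, with an optional finite-population correction; the second computes simple post-stratification weights. The function names, the age groups and their shares are all hypothetical, and real studies may need more careful methods.

```python
import math

def required_sample_size(margin, population_size=None, p=0.5, z=1.96):
    """Cochran's formula for a proportion; z=1.96 corresponds to 95% confidence.
    p=0.5 is the most conservative assumption about the true proportion."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    if population_size is not None:
        # Finite-population correction: small populations need fewer samples.
        n0 = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n0)

def post_stratification_weights(population_share, sample_counts):
    """Weight for each group = its population share / its sample share."""
    total = sum(sample_counts.values())
    return {
        group: population_share[group] / (count / total)
        for group, count in sample_counts.items()
    }

print(required_sample_size(0.05))                          # large population
print(required_sample_size(0.05, population_size=10_000))  # population of 10,000

# Hypothetical case: under-40s are 60% of the population but 30% of the sample.
weights = post_stratification_weights(
    population_share={"under_40": 0.6, "40_plus": 0.4},
    sample_counts={"under_40": 30, "40_plus": 70},
)
print(weights)
```

Here each under-40 respondent counts roughly twice, which corrects the imbalance but also doubles the influence of any unusual answers in that group. That is one reason to treat weighting with caution.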

Data sampling, when executed thoughtfully, is a potent tool. It allows us to unlock insights about vast populations while saving significant time and resources.

Understanding your research goals, the nuances of your population, and the potential pitfalls of various sampling techniques is essential. By acknowledging sampling limitations and striving for strong representativeness, you can extract reliable and meaningful conclusions from your sample data. After all, if you put garbage in, you should not be surprised to get garbage out.