Sampling: Data
We last talked about data, but let’s take a moment to dive deeper into where it actually comes from. By definition, data is simply a collection of facts and figures. But how do we collect this data?
To understand how data is gathered, we first need to step back and think about what we’re trying to find using statistics. To do that, we need a dataset that’s large enough for us to study. And this is where the concepts of population and sample come into play!
Population
A population in statistics refers to the entire set of items or observations you’re interested in studying.
For example, let’s say you want to study the average height of all humans. (Key term: all humans). In this case, the population would include every human being on Earth! (Good luck with that!)
In some studies, the population is so large that it is effectively infinite. In the real world, collecting data on that scale is difficult: it's challenging to gather, access, and analyze, and the time and cost involved can quickly overwhelm the researcher.
That said, the population is what we ultimately care about. Conclusions about the whole population are the ones that generalize, so they are the ones that can be applied to everyone on a larger scale.
So, what's the solution, smart guy? Samples!
Sample
You’ve probably visited a mall and seen a stall offering a small sample of food. You take a bite, and based on that tiny taste, you decide whether the product is worth buying. That’s exactly how a sample works in statistics!
A sample is simply a smaller subset of the population. Since studying the entire population is often impractical due to limits on time, cost, and access, researchers take a smaller group to study instead.
But wait... a smaller group? Isn’t small the key word here? Does it even represent the entire population? The trick is that the sample is carefully selected to mirror the population in terms of its characteristics.
To be precise, the main goal is to select a sample that truly represents the population, capturing the diversity and variation that exists in the larger group!
So, in essence, statistics is all about studying a sample to make inferences, draw conclusions, and develop theories about the larger population!
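To see that idea in action, here is a minimal Python sketch. Everything in it is an assumption made up for illustration: the 100,000 simulated heights, the average of 170 cm, and the sample size of 500. It simply compares the average of a small random sample with the average of the whole (simulated) population.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 simulated heights in cm (made-up numbers).
population = [random.gauss(170, 10) for _ in range(100_000)]

# Study only 500 randomly chosen individuals instead of all 100,000.
sample = random.sample(population, k=500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean: {population_mean:.2f} cm")
print(f"Sample mean:     {sample_mean:.2f} cm")
```

The two averages should land close to each other, even though the sample is only 0.5% of the population.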
Perfect! But that still doesn't answer our original question: How do we even collect this data? Even a sample can be huge (though smaller than the population), so what about that, Sherlock?
Now that we understand samples, let’s introduce the term sampling.
Sampling
Sampling is the process of selecting a smaller group (a sample) from a larger population or group to study and draw conclusions about, you guessed it, the entire population!
The main challenge, however, is collecting a sample in such a way that it accurately represents the population. After all, we want the results to be generalizable to the entire population, free from any bias.
To tackle this, we classify sampling into two types:
- Probability Sampling
- Non-Probability Sampling
Probability Sampling
What better way to create a dataset than to leave it to chance and let it work in your favor?
In probability sampling, every individual in the population has a known, non-zero chance of being selected. Because selection is left to chance rather than to the researcher, every part of the population has an opportunity to appear in the sample, which means we can generalize the findings from the sample to the entire population with more confidence.
This type of sampling minimizes bias and increases the reliability of your results!
Let me give you a flavor of it!
Simple Random Sampling
In simple random sampling, every individual in the population has an equal chance of being selected. The best part? You can do it with a random number generator, or even by drawing lots!
For example, imagine you're in a room with 10 people, and you need to pick someone to be the official 'cookie tester' for a batch of cookies you just baked. To make it fair, you decide to use simple random sampling.
You write everyone’s name on a piece of paper, toss them into a hat, and then randomly pull one slip out! (That's it!)
Do this repeatedly on a large scale, and you'll have your random sample ready.
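Here is a rough Python sketch of the hat trick, using the standard `random` module. The ten names are hypothetical stand-ins for the people in the room, and the seed is there only so the run is repeatable.

```python
import random

# Hypothetical names standing in for the 10 people in the room.
people = ["Ava", "Ben", "Cara", "Dev", "Eli", "Fay", "Gus", "Hana", "Ivan", "Jo"]

rng = random.Random(7)  # seeded only so the sketch is reproducible

# Pull one slip out of the hat: every name has a 1-in-10 chance.
cookie_tester = rng.choice(people)

# Pull three slips without putting any back: a simple random sample of size 3.
tasting_panel = rng.sample(people, k=3)

print("Cookie tester:", cookie_tester)
print("Tasting panel:", tasting_panel)
```

`rng.sample` draws without replacement, so nobody can end up on the tasting panel twice.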
Systematic Sampling
In systematic sampling, individuals are selected at regular intervals from a list. That is, every n-th individual is selected.
For instance, imagine you work in a toothpaste factory. To collect your data, you might pick every 4th toothpaste tube coming off the production line. Not the 5th, not the 3rd, just every 4th one!
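A small Python sketch of the "every n-th item" idea is below. The helper name `systematic_sample` and the 20 labeled tubes are made up for illustration. Picking the starting position at random is a common way to avoid always favoring the same items, but treat that as an assumption of this sketch rather than a rule.

```python
import random

def systematic_sample(items, step, start=None, rng=random):
    """Return every `step`-th item, beginning at index `start`.

    If `start` is not given, a random starting index in [0, step) is used.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    if start is None:
        start = rng.randrange(step)
    return items[start::step]

# Hypothetical production line of 20 tubes, labeled in the order they come off.
tubes = [f"tube-{i}" for i in range(1, 21)]

# Start at index 3 (the 4th tube), then take every 4th tube after it.
every_fourth = systematic_sample(tubes, step=4, start=3)
print(every_fourth)
# ['tube-4', 'tube-8', 'tube-12', 'tube-16', 'tube-20']
```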
Stratified Sampling
Stratified sampling is like sorting people into groups based on certain characteristics before choosing a sample from each group. The goal is to make sure each important subgroup of the population is properly represented in your sample.
Let’s say (again) you’re conducting a survey about favorite cookie flavors, and you want to be sure that people with different taste preferences are represented. You divide your population into two 'strata': chocolate chip lovers and non-chocolate chip lovers.
Now, instead of picking a random person from the whole group, you randomly sample from each of these strata. You might pick 5 people from the chocolate chip lovers group and 5 people from the non-chocolate chip lovers group.
This ensures that both groups are equally represented and gives you a more balanced view of cookie preferences!
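The cookie survey could be sketched in Python like this. The function name `stratified_sample`, the group sizes (60 and 40), and the person labels are all hypothetical. The only real idea is "run a simple random sample inside each stratum".

```python
import random

def stratified_sample(strata, per_stratum, rng):
    """Draw a simple random sample of size `per_stratum` from each stratum."""
    return {
        name: rng.sample(members, k=per_stratum)
        for name, members in strata.items()
    }

# Hypothetical population, already split into the two strata.
strata = {
    "chocolate_chip_lovers": [f"cc-{i}" for i in range(1, 61)],
    "non_chocolate_chip_lovers": [f"other-{i}" for i in range(1, 41)],
}

rng = random.Random(0)  # seeded only so the sketch is reproducible
cookie_sample = stratified_sample(strata, per_stratum=5, rng=rng)

for name, picked in cookie_sample.items():
    print(name, picked)
```

Taking 5 people from each group is equal allocation. When the groups differ a lot in size, a common alternative is proportional allocation, where each stratum contributes in proportion to its share of the population.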
Cluster Sampling
Cluster Sampling works by dividing the population into separate clusters, and then randomly selecting entire clusters to sample. Instead of selecting individual items from each cluster, you choose the entire cluster.
Basically, you’re sampling every data point within a selected cluster.
For example, imagine you're conducting a national survey on students’ academic performance in different schools. You want to gather data quickly, but visiting every school in the country is impossible.
So, you use cluster sampling. You divide the country into clusters based on schools or school districts, and then you randomly select a few districts to survey. Once a district is chosen, you survey all the students within that district.
It’s faster, but it might not give you the most variety.
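Here is a minimal Python sketch of the school survey. The eight districts and their five students each are invented for illustration. Notice that the randomness is applied to districts, not to students: once a district is picked, everyone in it is surveyed.

```python
import random

# Hypothetical districts (clusters), each with a made-up list of students.
districts = {
    f"district-{d}": [f"d{d}-student-{s}" for s in range(1, 6)]
    for d in range(1, 9)
}

rng = random.Random(1)  # seeded only so the sketch is reproducible

# Step 1: randomly select whole clusters (here, 2 of the 8 districts).
chosen_districts = rng.sample(sorted(districts), k=2)

# Step 2: survey every student inside each chosen district.
surveyed_students = [
    student
    for district in chosen_districts
    for student in districts[district]
]

print("Chosen districts:", chosen_districts)
print("Students surveyed:", len(surveyed_students))
```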
Non-Probability Sampling
In Non-Probability Sampling, individuals are not randomly selected, and not everyone in the population has a known or equal chance of being included in the sample.
Because chance no longer decides who gets in, this approach can introduce bias, which makes the findings less generalizable.
Convenience Sampling
(The name says it all!) In convenience sampling, researchers select individuals who are easiest to access or most willing to participate.
This method is often used in the early stages of research when quick results are needed.
Judgmental (or Purposive) Sampling
In Judgmental (or Purposive) Sampling, the researcher handpicks individuals or groups that they believe are the most relevant to the study. This type of sampling is often used when the researcher has specific criteria and is looking for people with particular characteristics or expertise.
For example, let’s say you’re researching famous chefs' opinions on the best cooking techniques. You don’t want to ask just any cook off the street, right? Instead, you purposefully choose top chefs who are known for their unique culinary skills and perspectives.
In this case, you're using judgmental sampling because you're selecting participants based on your expertise or judgment about who will provide the best insights.
It’s efficient and focused, but there’s a downside: it can introduce bias since the sample is based on the researcher’s subjective choices. So, while it’s great for specific studies, the findings may not apply to the broader population.
Snowball Sampling
Now, Snowball Sampling is kind of like getting a recommendation for a secret club, you know, that cool underground group that you wouldn't find unless someone told you about it.
This method is used when you’re studying a hard-to-reach population, like people with rare diseases or individuals involved in secretive activities. Here's how it works:
Let’s say you’re conducting a survey of extreme sports enthusiasts (think skydivers, cliff divers, etc.). You don’t know many people in this niche group, so you start by interviewing one person who is into extreme sports. After the interview, you ask them to refer you to others who might be willing to participate.
Each person you interview refers you to more people, and before you know it, you have a growing network of participants, like a snowball rolling down a hill, gathering more and more participants as it goes!
While this is great for accessing hidden or hard-to-find groups, it can also create bias. The sample tends to be homogeneous (everyone is referred by someone else with similar traits), which means the findings might not represent the wider population of extreme sports enthusiasts.
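The referral chain can be sketched in Python as a walk through a "who refers whom" network. The names, the network itself, and the helper `snowball_sample` are all hypothetical. Real studies handle referrals in many different ways, so treat this as one simple illustration.

```python
from collections import deque

# Hypothetical referral network: who each person says they can refer.
referrals = {
    "Sky": ["Cliff", "Wave"],
    "Cliff": ["Sky", "Rock"],
    "Wave": ["Reef"],
    "Rock": [],
    "Reef": ["Wave", "Dune"],
    "Dune": [],
    "Solo": [],  # an enthusiast nobody in the network knows
}

def snowball_sample(referrals, seed_person, max_size):
    """Recruit participants by following referrals outward from one person."""
    recruited = [seed_person]
    seen = {seed_person}
    queue = deque([seed_person])
    while queue and len(recruited) < max_size:
        person = queue.popleft()
        for referred in referrals.get(person, []):
            if referred not in seen and len(recruited) < max_size:
                seen.add(referred)
                recruited.append(referred)
                queue.append(referred)
    return recruited

participants = snowball_sample(referrals, seed_person="Sky", max_size=10)
print(participants)
# ['Sky', 'Cliff', 'Wave', 'Rock', 'Reef', 'Dune']
```

"Solo" never shows up, because nobody in the chain refers them. That is the bias in miniature: people outside the starting person's network have no chance of being selected.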