Ensuring Data Quality in Scraped Datasets: A Statistical Approach to Sample Testing

The Challenge of Variable Dataset Sizes

David Martin Riveros
Jul 18, 2024

Imagine trying to build a skyscraper on a shaky foundation; that’s what making business decisions with poor-quality data is like. This is where a robust sample testing strategy comes in, giving us statistical confidence that errors in scraped datasets remain below a critical threshold of 5%.

Picture this: You’re collecting data from various sources, and you have no idea how large your dataset will get. It’s like trying to catch water with a net — the flow is unpredictable, and the volume keeps increasing.

Ensuring the quality of this ever-growing dataset requires a clever approach. We focus on two main types of errors: missing data and formatting issues, which are often inherent to the source and typically fixed during post-processing.
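
As a quick, hypothetical sketch of what we count as an error, the check below flags a record when a required field is empty (missing data) or a value fails a simple pattern test (formatting issue). The field names and price pattern here are illustrative assumptions, not a fixed schema:

import re

# Illustrative required fields and price pattern -- assumptions, not a fixed schema
REQUIRED_FIELDS = ["product_name", "price", "scraped_at"]
PRICE_PATTERN = re.compile(r"^\d+(\.\d{1,2})?$")  # e.g. "19.99"

def record_has_error(record: dict) -> bool:
    """Return True if the record has missing data or a formatting issue."""
    # Missing data: a required field is absent or empty
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            return True
    # Formatting issue: the price does not look like a plain decimal number
    if not PRICE_PATTERN.match(str(record["price"])):
        return True
    return False

print(record_has_error({"product_name": "Lamp", "price": "19.99", "scraped_at": "2024-07-18"}))  # False
print(record_has_error({"product_name": "Lamp", "price": "N/A", "scraped_at": "2024-07-18"}))    # True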

By examining a small, representative subset of the entire dataset, we can infer the quality of the whole with a degree of confidence. It’s like solving a mystery by analyzing a few crucial clues.
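
When the final size of the dataset is unknown, one common way to keep a fixed-size, uniformly random subset is reservoir sampling. This is a minimal sketch of that idea, not necessarily the exact sampling mechanism used in production:

import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Each later item replaces an existing one with probability k / (i + 1)
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Example: keep 384 records (the sample size derived below) from a growing stream
stream = ({"id": i} for i in range(100_000))
sample = reservoir_sample(stream, 384)
print(len(sample))  # 384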

Determining Sample Size

To crack the case, we need to determine the right sample size (N). This depends on the confidence level we aim for and the margin of error we’re willing to accept. For our purposes, we shoot for a 95% confidence level with a 5% margin of error.

Here’s a simple formula we use:

n = (z² × p × (1 - p)) / E²

Where:

  • n is the sample size.
  • z is the Z-score for the desired confidence level (1.96 for 95% confidence).
  • p is the estimated proportion of errors in the dataset (we often start with 0.5 for maximum variability).
  • E is the margin of error (0.05 for 5%).

Plugging in the numbers:

import math

# Parameters
Z = 1.96  # Z-score for 95% confidence
p = 0.5   # Estimated proportion of errors (0.5 gives maximum variability)
E = 0.05  # Margin of error

# Calculating sample size
n = (Z**2 * p * (1 - p)) / E**2  # n = 384.16
print(f"Required sample size: {round(n)}")
Required sample size: 384

So, we need a sample size of approximately 384 entries to achieve our desired confidence level and margin of error.

Acceptable Error Rate in Samples

To ensure our skyscraper stands tall, we need to determine how many errors we can tolerate within our sample.

At our 5% threshold, we could tolerate roughly 19 errors in a sample of 384 entries (5% of 384). However, because this is only a sample, the observed error rate carries sampling uncertainty, so we need to be stricter and work out the lower and upper bounds of the acceptable range:

  1. Determine the Sample Error Margin: Calculate the margin of error for the proportion of errors in the sample.
  2. Adjust for Confidence Interval: Use the confidence interval to determine the range of acceptable errors.

Here’s the Python code for this calculation:

import math

# Parameters
Z = 1.96  # Z-score for 95% confidence
p = 0.05  # Target error rate
n = 384   # Sample size

# Calculating margin of error
margin_of_error = Z * math.sqrt((p * (1 - p)) / n)
print(f"Margin of Error: {margin_of_error:.4f}")

# Confidence interval range
lower_bound = p - margin_of_error
upper_bound = p + margin_of_error
print(f"Acceptable error range: {lower_bound*100:.2f}% to {upper_bound*100:.2f}%")

# Acceptable number of errors
acceptable_errors_lower = lower_bound * n
acceptable_errors_upper = upper_bound * n
print(f"Acceptable number of errors: {math.ceil(acceptable_errors_lower)} to {math.ceil(acceptable_errors_upper)}")
Margin of Error: 0.0218
Acceptable error range: 2.82% to 7.18%
Acceptable number of errors: 11 to 28

Conclusion

By applying this statistical approach, we ensure that the error rate in our dynamically growing datasets remains within an acceptable range.

With a sample size of 384, we can confidently state that if we find 11 errors or fewer, the true error rate in the entire dataset will likely be less than 5%, assuming the sample is randomly selected and representative of the population.
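
As a minimal sketch, that acceptance check can be reduced to a single function that recomputes the lower bound and compares it against the number of errors found in the sample. The function name is illustrative, and the parameter defaults simply mirror the numbers above:

import math

def passes_quality_check(errors_found, sample_size=384, target_rate=0.05, z=1.96):
    """Accept the dataset when the sample's error count stays at or below the
    lower bound of the confidence interval around the target error rate."""
    margin = z * math.sqrt(target_rate * (1 - target_rate) / sample_size)
    max_errors = math.ceil((target_rate - margin) * sample_size)  # 11 with these defaults
    return errors_found <= max_errors

print(passes_quality_check(9))   # True  -- within the 11-error bound
print(passes_quality_check(25))  # False -- too many errors in the sample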

This method provides our customers with the assurance that their data quality is rigorously maintained, empowering them to make informed decisions based on reliable information.

By using these calculations, we build a robust foundation for data quality, much like constructing a skyscraper with precision and care.
