How we deal with Data Quality using Circuit Breakers
Imagine a business metric showing a sudden spike: is the spike real, or is it a data quality problem? Analysts and data engineers today can spend hours, days, or even weeks analyzing whether a given metric is correct! In other words, Time-to-Reliable-Insights today is unbounded, and this is a widespread pain point across the industry. At Intuit, we are working on addressing the data quality problem at scale, and we presented our platform (called QuickData SuperGlue) at the Strata Conference in New York in 2018.
Analogous to the circuit breaker pattern in microservices architectures, we are designing circuit breakers for data pipelines. In the presence of data quality issues, the circuit opens, preventing low-quality data from propagating to downstream processes. The result is that data will be missing from reports for time periods of low quality, but if present, it is guaranteed to be correct. This proactive approach bounds Time-to-Reliable-Insights to minutes by automatically making data availability directly proportional to data quality. It also eliminates the unsustainable fire-fighting required to verify and fix metrics and reports on a case-by-case basis. The rest of this post describes the details of implementing and deploying circuit breakers, and is divided into three sections:
- Data Pipelines Ground realities
- Circuit Breaker Pattern for Data Pipelines
- Implementing Circuit Breakers in Production
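Before getting into the details, the core idea described above can be illustrated with a minimal sketch. It is not the QuickData SuperGlue implementation. The class name `DataCircuitBreaker`, the two quality checks, the `amount` column, and the partition dates are all hypothetical, chosen only to show a batch being held back when a check fails.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]
Check = Callable[[List[Row]], bool]  # returns True when the batch passes


@dataclass
class DataCircuitBreaker:
    """Hypothetical gate between a pipeline stage and its downstream consumers."""
    checks: Dict[str, Check]
    published: Dict[str, List[Row]] = field(default_factory=dict)
    quarantined: Dict[str, List[str]] = field(default_factory=dict)

    def submit(self, partition: str, rows: List[Row]) -> bool:
        """Publish the partition only if every quality check passes."""
        failed = [name for name, check in self.checks.items() if not check(rows)]
        if failed:
            # Circuit is open for this partition: nothing flows downstream,
            # and the failed check names are recorded for follow-up.
            self.quarantined[partition] = failed
            return False
        # Circuit is closed: the partition becomes visible to reports.
        self.published[partition] = rows
        self.quarantined.pop(partition, None)
        return True


# Illustrative checks (assumptions, not taken from the original post).
def non_empty(rows: List[Row]) -> bool:
    return len(rows) > 0


def no_null_amount(rows: List[Row]) -> bool:
    return all(r.get("amount") is not None for r in rows)


breaker = DataCircuitBreaker(
    checks={"non_empty": non_empty, "no_null_amount": no_null_amount}
)

good_batch = [{"amount": 10.0}, {"amount": 12.5}]
bad_batch = [{"amount": 10.0}, {"amount": None}]

breaker.submit("2018-09-01", good_batch)  # passes, so it is published
breaker.submit("2018-09-02", bad_batch)   # fails a check, so it is held back
```

In this sketch, a downstream report reading `breaker.published` would simply have no data for the second partition, which mirrors the trade-off described above: missing data for low-quality periods rather than incorrect data.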
Data Pipelines Ground realities
A data pipeline is a logical abstraction representing the sequence of data transformations required to convert raw data into insights. In our data platform, we have thousands of data pipelines running daily. Each pipeline ingests data from different sources and applies a sequence of ETL and analytical queries to generate insights in the form of reports, dashboards, ML models, and output tables. These insights are used for both data-driven business operations and in-product customer experiences.
We ingest 4 types of data, collected across hundreds of relational databases as well as NoSQL stores (key-value, document):
- User Entered Data (UED): Data entered by customers while using the products
- Behavioral Analytics Data: Clickstream data capturing usage of the product
- Enterprise data: Data from back-office systems for customer care, billing, etc.