Reliability data
Now, let's compute the first health data. The toy shop sample project already has re_data configuration defined, but let's go over it and see what we needed to configure.
re_data configuration
For re_data pending orders models, we do have in file configuration defined:
{{
config(
re_data_monitored=true,
re_data_time_filter='time_created',
re_data_anomaly_detector={'name': 'z_score', 'threshold': 2.2},
)
}}
select o.id, o.amount ...
This configuration makes the pending_orders_per_customers
model monitored by re_data and uses column named: time_created
as timestamp column when computing re_data stats.
We also optionally configured an anomaly detector to check for anomalies for this table. Other additional things we can configure are:
re_data_columns
- to specify for what table columns re_data metrics should be computedre_data_metrics
- to add additional metrics to be computed for the table
For re_data seed files, let's look into the toy_shop/seeds/schema.yml
file:
version: 2
seeds:
- name: customers
config:
re_data_monitored: true
re_data_time_filter: null
columns:
- name: id
tests:
- not_null
- unique
- name: orders
config:
re_data_monitored: true
re_data_time_filter: time_created
re_data_anomaly_detector:
name: modified_z_score
threshold: 3.5
...
We do see the re_data_monitored
set for both of those tables. For the customers
table as data, there doesn't have any timestamp column we set re_data_time_filter
to null to compute stats globally for the whole table. For the orders
table similarly to the pending orders table we set it to the time_created
column.
Additionally for the orders
table, we configure the re_data_anomaly_detector
method and set up a specific threshold to be used when looking for anomalies in this table.
Apart from configuration specific to some of the tables (or table groups) re_data also has a possibility to define global configuration which will apply to all data computed. This is defined in the vars
section of dbt_project.yml
.
vars:
re_data:anomaly_detector:
name: modified_z_score
threshold: 3
Here we set the anomaly_detector method for re_data globally. This will have less priority than model defined configurations. In general re_data configuration follows the rule that the more specific configuration, the more important it is.
Now with this explained, we can compute re_data models for the first time:
First re_data run
We choose to run re_data for the first day of 2021.
dbt run --models package:re_data --vars \
'{
"re_data:time_window_start": "2021-01-01 00:00:00",
"re_data:time_window_end": "2021-01-02 00:00:00"
}'
Notice that re_data:time_window_start
and re_data:time_window_end
are another global configuration parameters. They can be defined also in the dbt_project.yml
vars file, but here we want them to be more dynamic and we use a dbt --vars
option to pass them. If we don't pass time window parameters re_data will compute stats for the previous day. (from yesterday's 00:00 AM up until today 00:00 AM)
Anytime re_data computes its models, it detects tables being monitored, their configuration and that inside your database as a re_data model.
postgres=> SELECT * FROM toy_shop_re.re_data_monitored;
name | schema | database | time_filter | metrics | columns | anomaly_detector
-----------------------------+----------+----------+--------------+---------+---------+------------------------------------------------
pending_orders_per_customer | toy_shop | postgres | time_created | {} | [] | {"name": "z_score", "threshold": 2.2}
customers | toy_shop | postgres | | {} | [] | {"name": "modified_z_score", "threshold": 3}
orders | toy_shop | postgres | time_created | {} | [] | {"name": "modified_z_score", "threshold": 3.5}
You would notice that here we will also see final configuration applied when actually computing models (taking into account priorties from different configuration levels). More information on configuration can be found here.
Notice that table was created inside toy_shop_re
schema. re_data tables are by default created with this schema suffix, except toy_shop_re_internal
tables which are internal tables not be used directly by you. You can change this behaviour however you want, we use following dbt config for our models. (which can be overwritten)
models:
re_data:
+schema: re
internal:
+schema: re_internal
Metrics have been computed for the window between 2021-01-01 and 2021-01-02, let us see how many rows we have for the tables being monitored. The row_count
metric gives us that.
postgres=> SELECT table_name, metric, value, time_window_start, time_window_end from toy_shop_re.re_data_metrics where metric in( 'row_count', 'global__row_count');
table_name | metric | value | time_window_start | time_window_end
-----------------------------------------------------+-------------------+-------+---------------------+---------------------
"postgres"."toy_shop"."customers" | global__row_count | 15 | 2021-01-01 00:00:00 | 2021-01-02 00:00:00
"postgres"."toy_shop"."orders" | row_count | 20 | 2021-01-01 00:00:00 | 2021-01-02 00:00:00
"postgres"."toy_shop"."pending_orders_per_customer" | row_count | 5 | 2021-01-01 00:00:00 | 2021-01-02 00:00:00
Note, if a model being monitored has no time filter specified, re_data will compute the metric over the whole table. A global__
prefix would be added to that metric.
re_data run for ten first days of January
On production, we would set up re_data to run daily/hourly/etc. For toy shop, by using the re_data python package command we backfill daily data for the past.
re_data run --start-date 2021-01-01 --end-date 2021-01-11
Looking into anomalies
And now let's look into the re_data_anomalies
table to see if re_data
found anything suspicious for us:
postgres=> select table_name, metric, z_score_value, modified_z_score_value, last_value, time_window_end from toy_shop_re.re_data_anomalies;
table_name | metric | z_score_value | modified_z_score_value | last_value | time_window_end
-----------------------------------------------------+------------+------------------+------------------------+------------------+---------------------
"postgres"."toy_shop"."orders" | stddev | 2.26512659456492 | 33.1559468401042 | 76623.0542133031 | 2021-01-08 00:00:00
"postgres"."toy_shop"."orders" | max | 1.90332380570845 | 5.39599999982013 | 220 | 2021-01-08 00:00:00
"postgres"."toy_shop"."orders" | avg | 2.23609834440421 | 12.5165457875454 | 83714.2857142857 | 2021-01-08 00:00:00
"postgres"."toy_shop"."orders" | variance | 2.17679662067361 | 14.3162734688659 | 5933.28235294118 | 2021-01-08 00:00:00
"postgres"."toy_shop"."orders" | avg_length | 1.99359828729157 | 4.83162244794424 | 12.1428571428571 | 2021-01-08 00:00:00
"postgres"."toy_shop"."orders" | stddev | 2.12566830905018 | 12.0321401712802 | 77.0278024672986 | 2021-01-08 00:00:00
"postgres"."toy_shop"."orders" | row_count | 1.97360659802582 | 5.05874999974706 | 35 | 2021-01-08 00:00:00
"postgres"."toy_shop"."orders" | variance | 2.26750636806239 | 96.2432445786529 | 5871092436.97479 | 2021-01-08 00:00:00
"postgres"."toy_shop"."pending_orders_per_customer" | avg | 2.23118408449371 | 11.0541393078969 | 114782.608695652 | 2021-01-08 00:00:00
"postgres"."toy_shop"."pending_orders_per_customer" | max | 2.25075673466898 | 27.6544999999994 | 250000 | 2021-01-08 00:00:00
We can see there are a couple of things re_data flagged for us. Recall that re_data would flag anomalies based on the method of detection and the threshold set.
Running tests
Before moving on and investigating it in re_data UI. Let's run tests to see if they point to any problems in our data.
First we update the dbt_project.yml and enable storing tests history (simply by adding re_data:save_test_history
set to true
).
vars:
...
re_data:save_test_history: true
$ dbt test --select package:toy_shop
...
22:36:20 1 of 7 START test accepted_values_orders_status__PENDING_PAYMENT__PAID__SHIPPED__DELIVERED [RUN]
22:36:20 2 of 7 START test not_null_customers_id......................................... [RUN]
22:36:20 3 of 7 START test not_null_orders_amount........................................ [RUN]
22:36:20 4 of 7 START test not_null_orders_customer_id................................... [RUN]
22:36:21 2 of 7 PASS not_null_customers_id............................................... [PASS in 0.12s]
22:36:21 3 of 7 PASS not_null_orders_amount.............................................. [PASS in 0.13s]
22:36:21 4 of 7 PASS not_null_orders_customer_id......................................... [PASS in 0.13s]
22:36:21 1 of 7 PASS accepted_values_orders_status__PENDING_PAYMENT__PAID__SHIPPED__DELIVERED [PASS in 0.13s]
22:36:21 5 of 7 START test not_null_orders_status........................................ [RUN]
22:36:21 6 of 7 START test not_null_orders_time_created.................................. [RUN]
22:36:21 7 of 7 START test unique_customers_id........................................... [RUN]
22:36:21 5 of 7 PASS not_null_orders_status.............................................. [PASS in 0.05s]
22:36:21 6 of 7 PASS not_null_orders_time_created........................................ [PASS in 0.05s]
22:36:21 7 of 7 PASS unique_customers_id................................................. [PASS in 0.05s]
22:36:21
22:36:21 Finished running 7 tests in 0.56s.
22:36:21
22:36:21 Completed successfully
22:36:21
22:36:21 Done. PASS=7 WARN=0 ERROR=0 SKIP=0 TOTAL=7
Ok, so all of the tests are actually passing. Let's move to the next chapter and investigate what's going on.