Skip to main content

Configuration โš™๏ธ

To run re_data you would need to configure what tables should be monitored and set up some properties of this monitoring. You may also want/need to update some of the defaults vars use by re_data to run it for specific time windows or compute types of metrics you need.

Model level configโ€‹

Currently re_data supports dbt native configuration, by leveraging dbt models & sources configs.

If you are not familiar with dbt models & sources configuration we encourage to check the dbt: model-configs and source-configs documentation.

re_data dbt native cofiguration follows the same rules as dbt configuration, config block inside model will have the most priority and configuration in dbt_project.yml will have the least priority.

<model_name>.sql
{{
config(
re_data_monitored=true,
re_data_time_filter='creation_time',
re_data_columns=['amount', 'status'],
re_data_metrics={'table': ['orders_obove_100'], 'column': { 'status': ['distinct_values'],
re_data_anomaly_detector={'name': 'modified_z_score', 'threshold': 3.0} }},
re_data_owners=['datateam']
)
}}


select ...
info

dbt 1.1.0 introduced config for sources so Property file configuration for sources in available since dbt 1.1.0

You can define configuration on many levels; it's common to enable re_data for a group of tables (for example, all sources). It's also common to override some of the properties for specific groups.

Now let's go over the configuration you can set on models

Monitoring propertiesโ€‹

re_data_monitoredโ€‹

Set to true to enable monitoring for a given table or set of tables.

re_data_time_filter (optional)โ€‹

SQL expression (for example, column name) to filter records of the table to a specific time range. It can be set to null if you wish to compute metrics on the whole table. This expression will be compared to re_data:time_window_start and re_data:time_window_end vars during the run. (described below)

re_data_columns (optional)โ€‹

Set of columns for which re_data should compute metrics. If not specified, re_data will compute stats for all columns with either numeric or text types.

re_data_metrics (optional)โ€‹

Additional metrics to be computed for a given table (set of tables). Those can be either whole table level or column level. (Check out metrics section to learn distinction between the two)

You can be pass any number of already defined or your custom metrics to be computed. Check out extra metrics section for available metrics and custom metrics for ways to define your own metrics.

re_data_anomaly_detector (optional)โ€‹

Alternative anomaly dector with it's parameters to use when detecting anomalies in a given model (set of models)

For details about configuration look into Anomaly Detection

re_data_owners (optional)โ€‹

Group of single person which should receive and alert about problem with a given model.

For details about configuration look into Notifications

Global config varsโ€‹

Apart from model specific config re_data enables you to edit global configuration for some of the parameters. All of them are optional so we start with sensible defaults and let you override if there is a need.

Parameters of re_data configuration
vars:
# (optional) if not passed, stats for last day will be computed
re_data:time_window_start: '{{ (run_started_at - modules.datetime.timedelta(1)).strftime("%Y-%m-%d 00:00:00") }}'
re_data:time_window_end: '{{ run_started_at.strftime("%Y-%m-%d 00:00:00") }}'

# (optional) override standard metrics computed for all your tables
re_data:metrics_base:
table:
- row_count
- freshness
- buy_count # my own custom metric

column:
numeric:
- min
- max
- avg
- stddev
- variance
- nulls_count
- diff # my own custom metric

text:
- min_length
- max_length
- avg_length
- nulls_count
- missing_count

# (optional) tells how much hisory you want to consider when looking for anomalies
re_data:anomaly_detection_look_back_days: 30

# (optional) configuring storing tests history
re_data:save_test_history: true

# (optional) configuring owners
re_data:owners_config:
datateam:
- type: slack
identifier: U02FHBSXXXX
name: user1
backend:
- type: email
identifier: user1@getre.io
name: user1

re_data:time_window_start, re_data:time_window_endโ€‹

re_data metrics are time-based. (re_data filters all your table data to a specific time window.) In general, we advise setting up a time window this way that all new data is monitored. It's also possible to compute metrics from overlapping data for example last 7 days.

By default, re_data computes daily stats from the last day (it actually uses exact configuration from example for that)

re_data:metrics_baseโ€‹

This is a set of metrics to compute for all monitored tables. Some metrics like row_count are table level, others are specified per column type: so that expression will be run for all columns of this type.

Here you only add metrics if you want to compute them for all tables which are monitored. You can also set up metrics to be computed for specific tables only (with table specific config)

info

schema changes are currently not set up using this parameter. Every monitored table is scanned for schema changed currently.

re_data:anomaly_detectorโ€‹

See Anomaly Detection

re_data:anomaly_detection_look_back_daysโ€‹

The period which re_data considers when looking for anomalies. (By default, it's 30 days)

re_data:save_test_historyโ€‹

Variable to enable storing test history. See re_data tests history for more details.

re_data:owners_configโ€‹

Variable to configure owners for your data. See re_data notifications for more details.