Configuration ⚙️

To run re_data, you need to configure which tables should be monitored and set some properties of that monitoring. You may also want to update some of the default vars used by re_data, for example to run it for specific time windows or to compute the types of metrics you need.

Model level config

Currently re_data supports dbt native configuration by leveraging dbt model and source configs.

If you are not familiar with dbt model and source configuration, we encourage you to check the dbt model-configs and source-configs documentation.

re_data dbt native configuration follows the same rules as dbt configuration: a config block inside a model has the highest priority, and configuration in dbt_project.yml has the lowest priority.

Config block
Property file
Project file

<model_name>.sql
{{
    config(
      re_data_monitored=true,
      re_data_time_filter='creation_time',
      re_data_columns=['amount', 'status'],
      re_data_metrics_groups=['table_metrics', 'column_metrics'],
      re_data_metrics={'table': ['orders_above_100'], 'column': {'status': ['distinct_values']}},
      re_data_anomaly_detector={'name': 'modified_z_score', 'threshold': 3.0},
      re_data_owners=['datateam']
    )
}}


select ...

schema.yml
version: 2

models:
  - name: pending_orders
    config:
      re_data_monitored: true
      re_data_time_filter: created_at
      re_data_columns:
        - amount
        - status
      re_data_metrics_groups:
        - table_metrics
      re_data_metrics:
        table:
          - orders_above_100
        column:
          status:
            - distinct_values
      re_data_anomaly_detector:
        name: modified_z_score
        threshold: 3
          
          

dbt_project.yml
models:
  toy_shop:
    revenue:
      +re_data_monitored: true
      +re_data_time_filter: created_at
      +re_data_anomaly_detector:
        name: modified_z_score
        threshold: 3
      +re_data_metrics_groups:
        - table_metrics

      orders_per_age:
        +re_data_metrics:
          table:
            - orders_above_100

sources:
  toy_shop:
    toy_shop_sources:
      toy_shop_customers:
        +re_data_monitored: true
        +re_data_time_filter: joined_at
  
seeds:
  toy_shop:
    order_items:
      +re_data_monitored: true
      +re_data_time_filter: added_at
      +re_data_anomaly_detector:
        name: z_score
        threshold: 3
      +re_data_columns:
        - name
        - amount

Info: dbt 1.1.0 introduced config for sources, so property file configuration for sources is available since dbt 1.1.0.

You can define configuration on many levels; it's common to enable re_data for a group of tables (for example, all sources). It's also common to override some of the properties for specific groups.
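
As a sketch of this layering, the dbt_project.yml fragment below enables monitoring for every source in the toy_shop project and then overrides the time filter for one table. The names reuse the toy_shop examples above; the project-wide created_at time filter is an assumption made only for illustration.

```yaml
# dbt_project.yml (illustrative sketch; names reuse the toy_shop examples above)
sources:
  toy_shop:
    # Applies to every source in the project.
    +re_data_monitored: true
    +re_data_time_filter: created_at  # assumed default column, for illustration only

    toy_shop_sources:
      toy_shop_customers:
        # The more specific setting overrides the project-wide value.
        +re_data_time_filter: joined_at
```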

Now let's go over the configuration you can set on models.

Monitoring properties

re_data_monitored

Set to true to enable monitoring for a given table or set of tables.

re_data_time_filter (optional)

SQL expression (for example, a column name) used to filter the table's records to a specific time range. Set it to null if you wish to compute metrics on the whole table. During the run, this expression is compared to the re_data:time_window_start and re_data:time_window_end vars (described below).
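
For a table without a usable time column, the null setting could look like the property file sketch below. The model name pending_orders is taken from the earlier example; the rest is a minimal illustration rather than a required layout.

```yaml
# schema.yml (illustrative sketch)
version: 2

models:
  - name: pending_orders
    config:
      re_data_monitored: true
      # null disables time filtering, so metrics are computed on the whole table
      re_data_time_filter: null
```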

re_data_columns (optional)

Set of columns for which re_data should compute metrics. If not specified, re_data will compute stats for all columns with either numeric or text types.

re_data_metrics_groups (optional)

List of groups of metrics to compute. You can use any group defined in re_data:metrics_groups in your vars here. If not specified, re_data will compute the metrics defined by the re_data:default_metrics variable.

re_data_metrics (optional)

Additional metrics to be computed for a given table (or set of tables). These can be either table-level or column-level metrics. (Check out the metrics section to learn the distinction between the two.)

You can pass any number of already defined metrics or your own custom metrics to be computed. Check out the extra metrics section for available metrics and the custom metrics section for ways to define your own.

In many cases when you extend the set of computed metrics, we recommend creating a new group in re_data:metrics_groups in your vars, adding your metrics to it, and then setting re_data_metrics_groups to use it for a set of models. This approach is usually more flexible when adding new metrics for a given model.
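
The recommended approach could be sketched as follows. The group name orders_metrics is hypothetical, and orders_above_100 is the custom metric used in the examples above (it is assumed to be defined elsewhere in your project). Because redefining re_data:metrics_groups replaces the default definition, the sketch repeats the default table_metrics group it still relies on.

```yaml
# dbt_project.yml (illustrative sketch; "orders_metrics" is a hypothetical group name)
vars:
  re_data:metrics_groups:
    # Redefining this var replaces the defaults, so repeat any default
    # group (table_metrics, column_metrics) you still want to use.
    table_metrics:
      table:
        - row_count
        - freshness
    orders_metrics:
      table:
        - orders_above_100  # custom metric, assumed to be defined in your project

models:
  toy_shop:
    revenue:
      +re_data_monitored: true
      +re_data_metrics_groups:
        - table_metrics
        - orders_metrics
```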

re_data_anomaly_detector (optional)

Alternative anomaly detector, with its parameters, to use when detecting anomalies in a given model (or set of models).

For details about configuration, see Anomaly Detection.

re_data_owners (optional)

Group or single person who should receive an alert about a problem with a given model.

For details about configuration, see Notifications.
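
A minimal sketch of how the two pieces might fit together, assuming the names listed in re_data_owners refer to groups defined in the re_data:owners_config var (as the matching datateam name in the examples on this page suggests). The Slack identifier is the placeholder used elsewhere on this page.

```yaml
# dbt_project.yml (illustrative sketch; identifiers are placeholders)
vars:
  re_data:owners_config:
    datateam:
      - type: slack
        identifier: U02FHBSXXXX
        name: user1

models:
  toy_shop:
    revenue:
      +re_data_monitored: true
      # Assumed to reference the "datateam" group defined above.
      +re_data_owners:
        - datateam
```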

Global config vars

Apart from model-specific config, re_data lets you edit the global configuration for some parameters. All of them are optional: re_data starts with sensible defaults and lets you override them if needed.

Parameters of re_data configuration
vars:
  # (optional) if not passed, stats for last day will be computed
  re_data:time_window_start: '{{ (run_started_at - modules.datetime.timedelta(1)).strftime("%Y-%m-%d 00:00:00") }}'
  re_data:time_window_end: '{{ run_started_at.strftime("%Y-%m-%d 00:00:00") }}'
  
  # (optional) restricting which monitored models re_data computes metrics for
  re_data:select:
    - model_name1
    - model_name2
    - source_name1

  # (optional) tells how much history you want to consider when looking for anomalies
  re_data:anomaly_detection_look_back_days: 30

  # (optional) configuring storing tests history
  re_data:save_test_history: true

  # (optional) querying db for failing rows
  re_data:query_test_failures: true

  # (optional) limit the number of failed rows returned per test
  re_data:test_history_failures_limit: 10

  # (optional) configuring storing table samples
  re_data:store_table_samples: true

  # (optional) configuring owners
  re_data:owners_config:
    datateam:
      - type: slack
        identifier: U02FHBSXXXX
        name: user1
    backend:
      - type: email
        identifier: [email protected]
        name: user1

re_data:time_window_start, re_data:time_window_end

re_data metrics are time-based: re_data filters all your table data to a specific time window. In general, we advise setting up the time window so that all new data is monitored. It's also possible to compute metrics over overlapping data, for example the last 7 days.

By default, re_data computes daily stats for the last day (it uses exactly the configuration shown in the example above).
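
An overlapping 7-day window could be sketched by changing only the timedelta argument of the default configuration shown above. This is an illustration derived from that default, not a recommended setting.

```yaml
# dbt_project.yml (illustrative sketch)
vars:
  # Start of the window: midnight seven days before the run started.
  re_data:time_window_start: '{{ (run_started_at - modules.datetime.timedelta(7)).strftime("%Y-%m-%d 00:00:00") }}'
  # End of the window: midnight of the day the run started.
  re_data:time_window_end: '{{ run_started_at.strftime("%Y-%m-%d 00:00:00") }}'
```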

re_data:select

This list lets you additionally restrict re_data to compute metrics and anomalies only for certain models. Each model listed here still needs re_data_monitored=true to be monitored. If the list is not passed, re_data will compute stats for all models with re_data_monitored=true.

List elements can be either model/source names or dbt tags that your models have. An example select section with tags looks like this:

vars:
  ...

  re_data:select:
    - tag:my_tag_name
    - tag:my_other_tag_name
    - specific_table
    - other_specific_table

re_data:metrics_groups

Groups of metrics to compute. By default, table_metrics and column_metrics are defined here, with the following definitions:

re_data:metrics_groups:
  table_metrics:
    table:
      - row_count
      - freshness

  column_metrics:
    column:
      numeric:
        - min
        - max
        - avg
        - stddev
        - variance
        - nulls_count
        - nulls_percent
      text:
        - min_length
        - max_length
        - avg_length
        - nulls_count
        - missing_count
        - nulls_percent
        - missing_percent

You can redefine this var any way you want. If you remove the table_metrics or column_metrics group, you will no longer be able to use it in re_data_metrics_groups settings.

re_data:default_metrics

Default metrics to compute for each model if no re_data_metrics_groups is specified. You can use any of the metrics groups defined in re_data:metrics_groups here. The default re_data configuration is as follows:

re_data:default_metrics:
  - table_metrics
  - column_metrics
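
As a sketch of overriding this default, the fragment below limits the default to table-level metrics only. Models that still need column metrics would then list column_metrics in their own re_data_metrics_groups setting.

```yaml
# dbt_project.yml (illustrative sketch)
vars:
  # Compute only the table_metrics group unless a model overrides it.
  re_data:default_metrics:
    - table_metrics
```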

re_data:anomaly_detector

See Anomaly Detection

re_data:anomaly_detection_look_back_days

The period which re_data considers when looking for anomalies. (By default, it's 30 days)

re_data:save_test_history

Variable to enable storing test history. See re_data tests history for more details.

re_data:query_test_failures

Variable to configure if re_data should query failed rows (true by default)

re_data:test_history_failures_limit

Variable to configure how many failed rows to fetch per table (10 by default)

re_data:store_table_samples

This is used to enable storing sample data of monitored tables.

re_data:owners_config

Variable to configure owners for your data. See re_data notifications for more details.
