# Data Filtering
Data filtering is the process of selecting a subset of your dataset and using that subset for viewing or analysis.
Filtering may be used to:
- Look at results for a particular period of time.
- Exclude erroneous or "bad" observations from an analysis.
- Extract erroneous or "bad" observations from an analysis for manual (by data stewards) or AI-augmented Data Quality Management.
re_data provides the following macros for filtering data. Check out the list of currently available filters, and let us know on Slack or GitHub if you could use different ones.
## filter_remove_duplicates

(source code)
Arguments:
- relation: dbt model to perform the filtering on
- unique_cols: List of columns that uniquely identify each row
- sort_columns: Order in which to sort the partitioned rows, e.g. `created_at DESC` or `created_at ASC` to choose the latest or earliest row based on the timestamp column
Return type: table with filtered rows
This macro allows you to remove duplicate rows from a dbt model based on certain conditions.
 id | status    | updated_at
----+-----------+-------------
  1 | pending   | 13:00:45
  2 | completed | 13:05:23
  1 | completed | 13:10:35
  2 | pending   | 13:04:49
  3 | completed | 13:30:00

=> select id, status, updated_at from {{ re_data.filter_remove_duplicates(ref('duplicated'), ['id'], ['updated_at desc']) }} duplicates

-- After filtering, the resulting rows are:

 id | status    | updated_at
----+-----------+-------------
  1 | completed | 13:10:35
  2 | completed | 13:05:23
  3 | completed | 13:30:00
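Conceptually, the macro deduplicates with a window function: number the rows within each group of `unique_cols`, ordered by `sort_columns`, and keep only row number 1. The sketch below reproduces that logic in plain SQLite via Python using the docs' example data; the exact SQL re_data generates depends on your warehouse adapter.

```python
import sqlite3

# Example data from the docs, loaded into an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE duplicated (id INTEGER, status TEXT, updated_at TEXT);
INSERT INTO duplicated VALUES
  (1, 'pending',   '13:00:45'),
  (2, 'completed', '13:05:23'),
  (1, 'completed', '13:10:35'),
  (2, 'pending',   '13:04:49'),
  (3, 'completed', '13:30:00');
""")

# Keep only the latest row per id: partition by the unique columns,
# order by the sort columns, take row number 1.
rows = conn.execute("""
SELECT id, status, updated_at FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM duplicated
)
WHERE rn = 1
ORDER BY id
""").fetchall()

for r in rows:
    print(r)
```

Running this prints the same three rows as the filtered table above: the latest row for each `id`.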
## filter_get_duplicates

(source code)
Arguments:
- relation: dbt base model to perform the filtering on
- unique_cols: List of columns that uniquely identify each row
- sort_columns: Order in which to sort the partitioned rows, e.g. `created_at DESC` or `created_at ASC` to choose the latest or earliest row based on the timestamp column
Return type: table with duplicate rows
Along with the fields of the base model, the macro returns duplication context in new fields:
- re_data_duplicate_group_row_count - total number of duplicates sharing the current key set
- re_data_duplicate_group_row_number - position of the current duplicate row inside its group of duplicates
This macro allows you to get duplicate rows from a dbt model based on certain conditions.
 id | status    | updated_at
----+-----------+-------------
  1 | pending   | 13:00:45
  2 | completed | 13:05:23
  1 | completed | 13:10:35
  2 | pending   | 13:04:49
  3 | completed | 13:30:00

=> select id, status, updated_at,
       re_data_duplicate_group_row_count,
       re_data_duplicate_group_row_number
   from {{ re_data.filter_get_duplicates(ref('duplicated'), ['id'], ['updated_at desc']) }} duplicates

-- After filtering, the resulting rows are:

 id | updated_at | status    | re_data_duplicate_group_row_count | re_data_duplicate_group_row_number
----+------------+-----------+-----------------------------------+------------------------------------
  1 | 13:10:35   | completed |                                 2 |                                  1
  1 | 13:00:45   | pending   |                                 2 |                                  2
  2 | 13:05:23   | completed |                                 2 |                                  1
  2 | 13:04:49   | pending   |                                 2 |                                  2
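The same example can be reproduced conceptually in plain SQL: count and number rows within each key group, then keep only the groups that actually contain duplicates (group count > 1). The sketch below runs the docs' example data through SQLite via Python; the SQL re_data actually generates depends on your warehouse adapter.

```python
import sqlite3

# Example data from the docs, loaded into an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE duplicated (id INTEGER, status TEXT, updated_at TEXT);
INSERT INTO duplicated VALUES
  (1, 'pending',   '13:00:45'),
  (2, 'completed', '13:05:23'),
  (1, 'completed', '13:10:35'),
  (2, 'pending',   '13:04:49'),
  (3, 'completed', '13:30:00');
""")

# Number rows within each key group and count the group size, then keep
# only groups that actually contain duplicates.
rows = conn.execute("""
SELECT * FROM (
  SELECT id, updated_at, status,
         COUNT(*)     OVER (PARTITION BY id)
           AS re_data_duplicate_group_row_count,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC)
           AS re_data_duplicate_group_row_number
  FROM duplicated
)
WHERE re_data_duplicate_group_row_count > 1
ORDER BY id, re_data_duplicate_group_row_number
""").fetchall()

for r in rows:
    print(r)
```

Row 3 is excluded because its key appears only once; the remaining rows match the result table above.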
## Your ideas

If you have other suggestions for filtering data that you would like to see supported, let us know on Slack!