ETL Table Design Best Practices

易成双
2023-12-01

Not so long ago, the approach to table design in source systems (application databases) used to be: we don't care about ETL. Figure it out yourselves, we'll concentrate on building the application. The last couple of years have been great for the development of ETL methodologies, with a lot of open-source tools coming in from big tech companies like Airbnb, LinkedIn, Google, Facebook and so on. And with cloud going mainstream, providers like Amazon, Google and Microsoft have made sure they build upon and support all the open-source technologies in the data engineering space.

I have been a part of many ETL projects; some have failed miserably and the rest have succeeded. There are many ways an ETL project can go wrong. We'll talk about one of the most important aspects today: table design in the source system.

ETL pipelines are only as good as the source systems they're built upon.

This statement holds true irrespective of the effort one puts into the T layer of the ETL pipeline. The transform layer is often misunderstood as the layer that fixes everything that is wrong with your application and the data it generates. That is absolutely untrue. Without further ado, let's look at the bare minimum you should take into account while designing tables that are going to be ETL'd to a target system:

Enforce uniqueness

This should go without saying, but I have seen systems where uniqueness is not enforced (as part of the design). It doesn't matter whether the unique key is a single column or composite in nature. Without a unique key, one could be asked to do a full load of the table every time and infer the changes after each full load; that solution is actually worse than it sounds.
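
As a quick illustration, here is a minimal sketch (using SQLite, with a hypothetical `orders` table and column names of my choosing) of what enforcing a composite unique key at the database level buys you: a second insert of the same business key is rejected outright, instead of silently creating a duplicate for the ETL to untangle later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id INTEGER NOT NULL,
        order_ref TEXT NOT NULL,
        amount REAL NOT NULL,
        UNIQUE (customer_id, order_ref)  -- composite unique key
    )
""")

conn.execute(
    "INSERT INTO orders (customer_id, order_ref, amount) VALUES (1, 'A-100', 9.99)"
)

# A second row with the same business key violates the constraint
try:
    conn.execute(
        "INSERT INTO orders (customer_id, order_ref, amount) VALUES (1, 'A-100', 9.99)"
    )
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The same effect is achieved in other databases with a `UNIQUE` constraint or index; the point is that the database, not downstream ETL code, is the gatekeeper.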

Enable incremental ETL

Enable data engineers to identify new and updated records through simple fields like created_timestamp and updated_timestamp. Make sure both of these fields are populated by the database and not by the application. If you want a timestamp populated from the application, keep it as a separate datetime or timestamp field. The audit fields should be defined something like:

created_timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
updated_timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP

Append-only tables with an always-incrementing primary key will work as they are; you don't need audit timestamp columns for those.
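
To make the incremental pattern concrete, here is a hedged sketch (SQLite, with a hypothetical `customers` table) of a watermark-based extract: each run pulls only rows whose updated_timestamp is newer than the high-water mark recorded by the previous run. Note that SQLite has no `ON UPDATE CURRENT_TIMESTAMP`, so the sketch sets the column explicitly where a real database would maintain it for you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        created_ts TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        updated_ts TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.executemany(
    "INSERT INTO customers (id, name, updated_ts) VALUES (?, ?, ?)",
    [(1, "alice", "2023-01-01 00:00:00"),
     (2, "bob",   "2023-06-01 00:00:00")],
)

def extract_incremental(conn, watermark):
    """Pull only rows changed since the previous run's high-water mark."""
    rows = conn.execute(
        "SELECT id, name, updated_ts FROM customers "
        "WHERE updated_ts > ? ORDER BY updated_ts",
        (watermark,),
    ).fetchall()
    # Advance the watermark to the latest change we saw (if any)
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

# First run: everything is new
rows, wm = extract_incremental(conn, "1970-01-01 00:00:00")

# Second run with the advanced watermark: nothing changed, nothing extracted
rows2, _ = extract_incremental(conn, wm)
```

One known caveat of a strict `>` comparison: rows sharing the exact watermark timestamp get skipped, which is why production pipelines often use `>=` plus key-based deduplication.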

Define & document the relationships

In most cases, we're dealing with relational databases as source systems. It is therefore of utmost importance to understand the data flow and lineage within the source system. In most cases the same flow and lineage carry over to the target system during the load, although that's not mandatory. Two things help here: a service architecture diagram for the application that produces the data, and an ER diagram for the source database.

So?

All of these problems can be completely or partially worked around even if none of these points are taken care of in the source system, but none of those workarounds are going to be sustainable. And yes, that's all it takes to get started on building a neat ETL pipeline.

As a rule, ETL systems should be tasked only with moving data from one place to another, applying a general layer of transformations (one that does not patch bugs or special one-off cases). Remember the universal principle of garbage in, garbage out: if you have buggy data in the source system, your target system will have buggy data.

A practice I have seen in many places is that data teams try to patch the buggy data and handle it for the time being. That usually works until the sprint is over, and then the same issue invariably comes back to haunt you. Resist the temptation to get into the habit of fixing issues like that. The right way is to report the issue, fix the data in the source system, and do a clean reload for the time period in question. A reload is easier to justify than the application UI and the BI dashboard showing different numbers.
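
A clean reload for a given window can be as simple as delete-then-reinsert inside a single transaction. A rough sketch (SQLite, with a hypothetical `sales` table) of reloading one day after the source data has been fixed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT NOT NULL, amount REAL NOT NULL)")
# The target already holds a buggy load for 2023-12-01
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("2023-11-30", 10.0), ("2023-12-01", -999.0)])

def reload_window(conn, day, fresh_rows):
    """Replace all target rows for one day with a fresh extract from the fixed source."""
    with conn:  # one transaction: either both steps land or neither does
        conn.execute("DELETE FROM sales WHERE sale_date = ?", (day,))
        conn.executemany("INSERT INTO sales VALUES (?, ?)", fresh_rows)

# Re-extract the fixed day from the source and reload it
reload_window(conn, "2023-12-01", [("2023-12-01", 42.0)])
```

Keeping the delete and reinsert in one transaction means readers never see a half-reloaded window, which is part of what makes the reload easy to justify.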

Translated from: https://towardsdatascience.com/table-design-best-practices-for-etl-200accee9cc9