Data lakes are being adopted by more and more companies looking for an efficient way to store their data. The theory behind them is quite simple, in contrast to the industry-standard data warehouse. This post explains the logical foundation behind data lakes and presents a practical use case with a tool called Delta Lake. Enjoy!
What is a data lake?
A centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics — from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
The rationale behind a data lake is similar to that of the widely used data warehouse. Although the two fall into the same category, the logic behind them is quite different. Information stored in a data warehouse is already pre-processed: the reason for storing it has to be known and the data model has to be well defined up front. A data lake takes a different approach: neither the reason for storing the data nor the data model has to be defined in advance. The two variants compare as follows:
+-----------+----------------------+-----------------------------+
|           | Data Warehouse       | Data Lake                   |
+-----------+----------------------+-----------------------------+
| Data      | Structured           | Structured and unstructured |
| Schema    | Schema on write      | Schema on read              |
| Storage   | High-cost storage    | Low-cost storage            |
| Users     | Business analysts    | Data scientists             |
| Analytics | BI and visualization | Data science                |
+-----------+----------------------+-----------------------------+
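The "schema on write" versus "schema on read" row is the core of the difference, and a small sketch may make it more concrete. The code below is only an illustration in plain Scala: the `SchemaSketch` object, the `Order` record, and the `id,amount` line format are all hypothetical, and in-memory collections stand in for real storage.

```scala
import scala.util.Try

// Hypothetical illustration: plain Scala collections stand in for storage.
object SchemaSketch {
  // The data model a warehouse would require before storing anything.
  final case class Order(id: Long, amount: Double)

  // Parses an "id,amount" line; returns None for anything that does not fit.
  def parseOrder(line: String): Option[Order] =
    line.split(",").map(_.trim) match {
      case Array(id, amount) => Try(Order(id.toLong, amount.toDouble)).toOption
      case _                 => None
    }

  // Schema on write: input is validated at ingest, non-matching lines are dropped.
  def ingestToWarehouse(raw: Seq[String]): Seq[Order] =
    raw.flatMap(parseOrder)

  // Lake-style ingest: everything is kept exactly as it arrived.
  def ingestToLake(raw: Seq[String]): Seq[String] = raw

  // Schema on read: structure is applied only when a consumer queries the data.
  def readOrdersFromLake(lake: Seq[String]): Seq[Order] =
    lake.flatMap(parseOrder)
}
```

With an input such as `Seq("1,9.99", "free-form log line", "2,20.00")`, the warehouse-style ingest keeps only the two lines that match the model, while the lake-style ingest keeps all three and leaves the interpretation to whoever reads them later.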
Using Delta Lake OSS to create a data lake
Now let's apply that theoretical knowledge using Delta Lake OSS. Delta Lake is an open-source framework built on Apache Spark, used to retrieve, manage, and transform data in a data lake. Getting started is quite simple: you will need an Apache Spark project (use this link for more guidance). First, add Delta Lake as an SBT dependency:
libraryDependencies += "io.delta" %% "delta-core" % "0.5.0"
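For a project that does not have Spark set up yet, a fuller build definition might look like the sketch below. The project name and the Scala and Spark versions are assumptions chosen to match Delta Lake 0.5.0, which targets Spark 2.4.x; adjust them to your environment.

```scala
// build.sbt (sketch; the name and the Scala/Spark versions are assumptions)
name := "delta-lake-demo"
scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  // Spark SQL provides SparkSession and the DataFrame API
  "org.apache.spark" %% "spark-sql" % "2.4.4",
  // Delta Lake itself, the same dependency as above
  "io.delta" %% "delta-core" % "0.5.0"
)
```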
Saving data to Delta
Next, let's create a first table. For this, you will need a Spark DataFrame, which can be an arbitrary dataset or data read from another format, such as JSON or Parquet.
val data = spark.range(0, 50)
data.write.format("delta").save("/data/delta-table")
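The snippet above assumes a `spark` session already exists, as it does in `spark-shell`. For a standalone application, a minimal sketch could look like the following. The `SaveToDelta` object, the `writeAndCount` helper, the local master, and the temporary directory are all assumptions made for illustration; they are not part of the original walkthrough.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object SaveToDelta {
  // Writes ids 0 to 49 as a Delta table at `path` and returns the row count read back.
  def writeAndCount(spark: SparkSession, path: String): Long = {
    val data = spark.range(0, 50)
    data.write.format("delta").save(path)
    spark.read.format("delta").load(path).count()
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation; on a cluster the master is set elsewhere.
    val spark = SparkSession.builder()
      .appName("delta-save-sketch")
      .master("local[*]")
      .getOrCreate()

    // A fresh temporary location is assumed here instead of /data/delta-table.
    val path = Files.createTempDirectory("delta-demo").resolve("delta-table").toString

    println(s"Rows written to $path: ${writeAndCount(spark, path)}")
    spark.stop()
  }
}
```

Writing to a fresh path matters because the default save mode fails if a table already exists at the target location.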
Reading data from Delta
Reading the data is as simple as writing it. Just specify the path and the correct format, the same as you would with CSV or JSON data.
val df = spark.read.format("delta").load("/data/delta-table")
df.show()
Updating the data in Delta
Delta Lake OSS supports a range of update options, thanks to its ACID transaction model. Let's use that to run a batch update that overwrites the existing data. We do this with the following code:
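// Overwrite the table with a new batch of rows (ids 0 to 99)
val newData = spark.range(0, 100)
newData.write.format("delta").mode("overwrite").save("/data/delta-table")
// Re-read the table to see the overwritten contents
spark.read.format("delta").load("/data/delta-table").show()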
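Overwriting is only one of the update options. As a sketch of another, the `DeltaTable` API from `io.delta.tables` can update rows in place based on a condition. The `ConditionalUpdate` object and the `bumpEvenIds` helper below are hypothetical names, and the rule "add 100 to every even id" is just an example.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

object ConditionalUpdate {
  // Adds 100 to every even id in the Delta table stored at `path`.
  def bumpEvenIds(spark: SparkSession, path: String): Unit = {
    val table = DeltaTable.forPath(spark, path)
    table.update(
      condition = expr("id % 2 == 0"),     // which rows to change
      set = Map("id" -> expr("id + 100")) // new value for each changed column
    )
  }
}
```

Unlike the overwrite above, this touches only the matching rows and leaves the rest of the table as it was.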
Summary
I hope you found this post useful. If so, don't hesitate to like or share it. You can also follow me on social media if you fancy :)
Sources: https://docs.delta.io/latest/quick-start.html https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
Translated from: https://medium.com/swlh/delta-lake-and-data-lakes-getting-started-41ce957ed0da