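For this semester's database systems course I had to write an article on NoSQL databases. After a couple of days of reading I finally squeezed out about 2000 words. My understanding is not deep, but I now have some grasp of the topic. The article first introduces the motivation behind the rise of NoSQL databases and the situations in which they are recommended or discouraged. It then covers a few basic NoSQL concepts: the characteristics a NoSQL system should have, the categories of data models, and the differences between consistency models. Finally, it looks at the CRUD (create/read/update/delete) operations of two concrete databases, MongoDB and Google BigTable, and discusses their pros and cons relative to traditional relational databases. The writing is rough and only scratches the surface. Interested readers can consult the articles listed in the references.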
1 Introduction
Relational databases have been around for more than 30 years and have been essential to various fields, such as business and education. Almost all database systems in use today are RDBMSs, including Oracle, SQL Server, and MySQL. The reasons for this dominance are not trivial. With their constraints and normalization model, relational databases have continually offered the best mix of simplicity, robustness, flexibility, performance, scalability, and compatibility in managing generic data [1]. As a result, although several so-called revolutions have flared up briefly, all of them fizzled out without making a dent in the dominance of relational databases.
However, a good mix of these benefits does not mean that an RDBMS performs better in each area than an alternative solution pursuing one of these benefits in isolation. This was not much of a problem before, because the universal dominance of RDBMSs outweighed the need to push any of these boundaries. Recently, however, especially with the rise of Web 2.0 applications, one of these benefits has become more and more critical: scalability. One of the most significant differences between Web 2.0 and the traditional WWW is the greater collaboration among Internet users, content providers, and enterprises [2]. This leads to scalability requirements that can, first, change very quickly and, second, grow very large. Because scaling a traditional relational DBMS is hard, we need a data management system that scales well horizontally, that is, one that scales OLTP-style workloads to thousands or millions of users across hundreds or thousands of nodes. This is the most important motivation for "NoSQL" ("Not only SQL") databases.
There are also other motivations for NoSQL databases, which reveal further advantages. One is agility, or speed of development. Companies have always looked to adapt to the market more quickly and to embrace agile development methodologies. In this respect, NoSQL databases have far more relaxed, or even nonexistent, data model restrictions compared with RDBMSs. As a result, application changes and database schema changes do not have to be managed as one complicated change unit, which in theory allows applications to iterate faster [3]. In addition, many companies are driven by the desire to find viable alternatives to expensive proprietary RDBMS software. NoSQL databases are much more economical because they run on clusters of cheap commodity servers. Consequently, the cost per gigabyte or per transaction/second for NoSQL can be many times lower than for an RDBMS, allowing you to store and process more data at a much lower price point [3].
In all, it seems clear when the use of NoSQL databases is recommended: when your foremost concern is large-scale, distributed scalability, which is typically the case when you have lots of data and a large number of active users. Imagine that you have developed a very successful application that attracts many users and accumulates lots of data. It is then more reasonable to use the relatively cheap, massively scalable data store platforms provided by web services vendors such as Google and Amazon.
On the other hand, there are cases where we discourage the use of NoSQL databases, because they still face many challenges. One is maturity. RDBMSs are stable and richly functional, so their maturity is more reassuring than that of NoSQL databases. In comparison, most NoSQL alternatives are in pre-production versions with many key features yet to be implemented. Another concern is the limitation on analytics. Data in an application has value to the business that goes beyond the CRUD cycle of a typical web application. Businesses mine information in corporate databases to improve their efficiency and competitiveness, and business intelligence (BI) is a key IT issue for all medium to large companies [3]. However, NoSQL databases offer few facilities for analysis-style queries, which means that even simple demands such as tracking usage patterns and providing recommendations based on user histories are difficult at best, and impossible at worst, with this type of database system.
We have introduced the motivation for NoSQL databases and the circumstances under which they are recommended or discouraged. The rest of this paper discusses NoSQL databases in more detail, taking two common modern NoSQL systems, MongoDB and Google BigTable, as examples. We first briefly introduce some basic concepts of NoSQL databases. After that, we discuss CRUD in MongoDB and Google BigTable respectively. Finally, we compare the pros and cons of CRUD in NoSQL databases and RDBMSs.
2 NoSQL Database
2.1 Basic concepts
What is a NoSQL database? According to Wikipedia, a NoSQL (often interpreted as "Not only SQL") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases [5]. In particular, NoSQL systems generally have six key features [2]:
the ability to horizontally scale "simple operation" throughput over many servers,
the ability to replicate and to distribute (partition) data over many servers,
a simple call level interface or protocol (in contrast to a SQL binding),
a weaker concurrency model than the ACID transactions of most relational (SQL) database systems,
efficient use of distributed indexes and RAM for data storage, and
the ability to dynamically add new attributes to data records.
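The first two features, horizontal scaling and partitioning with replication, can be illustrated with a minimal sketch. The node names, the `replicas=2` default, and the modulo placement rule are assumptions made for illustration only; production systems generally use more elaborate schemes such as consistent hashing.

```python
import hashlib

# Hypothetical cluster of three servers.
NODES = ["node-a", "node-b", "node-c"]

def node_for_key(key, nodes):
    """Pick a home node index with a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % len(nodes)

def replica_nodes(key, nodes, replicas=2):
    """Return the home node plus the next nodes in ring order as replicas."""
    start = node_for_key(key, nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

# Each key is stored on two distinct nodes, so simple operations on
# different keys can be served by different servers in parallel.
placement = {key: replica_nodes(key, NODES) for key in ["user:1", "user:2", "user:3"]}
```

A drawback of this simple rule is that changing the number of nodes moves most keys, which is one reason it should be read as a sketch rather than a recommended design.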
The primary way in which NoSQL databases differ from relational databases is the data model [4]. There are roughly three categories of data model:
Document Model. Document databases store data in a complex data structure known as a document. Documents can contain many different key-value pairs, key-array pairs, or even nested documents. Document databases are useful for a wide variety of applications because of the flexibility of the data model, the ability to query on any field, and the natural mapping of the document data model to objects in modern programming languages. MongoDB, which we will see later, belongs to this category.
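As a rough illustration of the document model, the following sketch represents one document as a nested Python dictionary. The field names and the `get_path` helper are hypothetical and are not part of any database API.

```python
# One document: key-value pairs, a key-array pair, and a nested document.
user = {
    "_id": 1,
    "name": "Alice",
    "emails": ["alice@example.com", "a@example.org"],
    "address": {"city": "Shanghai", "zip": "200000"},
}

def get_path(doc, path):
    """Follow a dotted path such as 'address.city' into nested documents."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

city = get_path(user, "address.city")
```

The document maps directly onto the dictionaries and lists of the programming language, which is the "natural mapping to objects" mentioned above.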
Graph Model. Graph databases use graph structures with nodes, edges, and properties to represent data. Because data is modeled as a network of relations between specific elements, it is easier to model relations between entities in an application. Graph databases are useful where relationships are core to the application, as in social networks. Examples are Neo4j and HyperGraphDB.
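A minimal sketch of the graph model, using plain Python structures rather than the API of Neo4j or HyperGraphDB. The people, the `FOLLOWS` relation, and the helper names are invented for illustration.

```python
# Nodes carry properties; edges carry a relation type and their own properties.
nodes = {
    "alice": {"label": "Person", "age": 30},
    "bob": {"label": "Person", "age": 25},
    "carol": {"label": "Person", "age": 28},
}
edges = [
    ("alice", "FOLLOWS", "bob", {"since": 2012}),
    ("bob", "FOLLOWS", "carol", {"since": 2013}),
]

def neighbors(node_id, relation):
    """Nodes reachable from node_id over one edge of the given relation."""
    return [dst for src, rel, dst, _props in edges if src == node_id and rel == relation]

def friends_of_friends(node_id):
    """Two-hop traversal, a typical social network query."""
    result = set()
    for friend in neighbors(node_id, "FOLLOWS"):
        result.update(neighbors(friend, "FOLLOWS"))
    result.discard(node_id)
    return result
```

Here the relationship query is a direct traversal of edges, whereas a relational schema would typically express it as a self-join on a link table.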
Key-Value and Wide Column Model. Key-value stores are the most basic type of NoSQL database [1]. Every item in the database is stored as an attribute name, or key, together with its value. Wide column stores, or column family stores, use a sparse, distributed, multi-dimensional sorted map to store data. The appeal of key-value and wide column systems is their performance and scalability, but they are also constrained to a narrow set of applications that only query data by a single key value. Google BigTable is an example of a wide column database.
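The single-key limitation can be seen in a small sketch of a key-value store. The session keys and the `find_by_user` helper are hypothetical; the point is only that a lookup by key is direct, while a query on anything inside the value has to scan every item.

```python
store = {}

def put(key, value):
    """Store a value under its key."""
    store[key] = value

def get(key):
    """Direct lookup by key: the operation the model is built for."""
    return store.get(key)

put("session:42", {"user": "alice", "cart": ["book"]})
put("session:43", {"user": "bob", "cart": []})

def find_by_user(name):
    """No secondary index exists, so every item must be inspected."""
    return [key for key, value in store.items() if value["user"] == name]
```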
One last important concept is the consistency model. NoSQL systems typically maintain multiple copies of the data for availability and scalability, and they offer different guarantees regarding the consistency of the data across copies. There are two kinds of consistency model: consistent systems and eventually consistent systems. In a consistent system, writes by the application are immediately visible in subsequent queries. In an eventually consistent system, writes are not immediately visible. Most applications and development teams expect consistent systems. Eventually consistent systems, meanwhile, provide some advantages for writes at the cost of making reads and updates more complex. The two models pose different trade-offs for applications between consistency and availability.
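The difference between the two models can be sketched with a toy store that keeps one primary copy and one replica. The class and method names are assumptions for illustration, and real systems propagate writes asynchronously over the network rather than through an explicit `propagate()` call.

```python
class EventuallyConsistentStore:
    """Toy model: writes reach the replica only when propagate() runs."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes accepted but not yet copied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def read_from_primary(self, key):
        # Behaves like a consistent system: the write is visible immediately.
        return self.primary.get(key)

    def read_from_replica(self, key):
        # May return stale data until propagation has happened.
        return self.replica.get(key)

    def propagate(self):
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

kv = EventuallyConsistentStore()
kv.write("x", 1)
stale = kv.read_from_replica("x")   # replica has not seen the write yet
kv.propagate()
fresh = kv.read_from_replica("x")   # replica has caught up
```

The write returns without waiting for the replica, which is the advantage for writes; the cost is that a reader may briefly see the old state.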
2.2 MongoDB
As a document-oriented database, MongoDB handles CRUD (create/read/update/delete) by operating on documents. Formally, MongoDB documents are BSON documents, where BSON is a binary representation of JSON with additional type information [6]. A document consists of field and value pairs; the value of a field can be any of the BSON data types, including other documents, arrays, and arrays of documents. MongoDB stores all documents in collections. A collection is a group of related documents that share a set of common indexes. Collections are analogous to tables in relational databases.
Read operation. Read operations, or queries, retrieve data stored in the database. Queries specify criteria, or conditions, that identify the documents that MongoDB returns to the client. A query may include a projection that specifies the fields from the matching documents to return. The projection limits the amount of data that MongoDB returns to the client over the network.
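The roles of criteria and projection can be sketched in plain Python. The `find` function below is a hypothetical stand-in that only supports equality criteria; it is not MongoDB's implementation or driver API, although it loosely mirrors the shape of a query with a filter and a list of fields to return.

```python
def find(collection, criteria, projection=None):
    """Return documents matching every field in criteria, trimmed to projection."""
    results = []
    for doc in collection:
        if all(doc.get(field) == value for field, value in criteria.items()):
            if projection is None:
                results.append(dict(doc))
            else:
                results.append({f: doc[f] for f in projection if f in doc})
    return results

# Sample collection, invented for illustration.
users = [
    {"_id": 1, "name": "Alice", "age": 30, "city": "Shanghai"},
    {"_id": 2, "name": "Bob", "age": 25, "city": "Beijing"},
    {"_id": 3, "name": "Carol", "age": 30, "city": "Beijing"},
]

# Criteria select the documents; the projection keeps only the "name" field,
# so less data would travel back to the client.
matches = find(users, {"age": 30}, projection=["name"])
```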
Write operation. There are three classes of write operations in MongoDB: insert, update, and remove. Insert operations add new data to a collection, update operations modify existing data, and remove operations delete data from a collection. All write operations in MongoDB are atomic at the level of a single document. Note that read isolation in MongoDB is weak: MongoDB allows clients to read documents inserted or modified before it commits these modifications to disk, regardless of write concern level or journaling configuration.
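The three classes of write operations and the single-document atomicity guarantee can be mimicked with a toy in-memory collection. The `Collection` class is hypothetical, and the single lock is a simplification chosen for brevity; it says nothing about how MongoDB implements atomicity internally.

```python
import threading

class Collection:
    """Toy collection whose writes are atomic per document."""

    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()  # one lock for the whole collection, for simplicity

    def insert(self, doc):
        with self._lock:
            self._docs[doc["_id"]] = dict(doc)

    def update(self, doc_id, changes):
        # All changed fields of one document are applied together or not at all.
        with self._lock:
            if doc_id not in self._docs:
                return False
            self._docs[doc_id].update(changes)
            return True

    def remove(self, doc_id):
        with self._lock:
            return self._docs.pop(doc_id, None) is not None

    def get(self, doc_id):
        with self._lock:
            doc = self._docs.get(doc_id)
            return dict(doc) if doc is not None else None

pages = Collection()
pages.insert({"_id": 1, "title": "Home", "visits": 0})
pages.update(1, {"title": "Start", "visits": 1})
```

A reader never observes the new title together with the old visit count for document 1, but nothing here coordinates a change that spans two documents.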
2.3 BigTable
BigTable is a distributed storage system that is structured as a large table: one that may be petabytes in size and distributed among tens of thousands of machines [7]. One sentence in [7] describes the characteristics of BigTable very well: "A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map". In particular, "map" means that keys are associated with values; "sorted" means that data is ordered by key; "multi-dimensional" means that one key is formed from several values (rows, column families, and columns); "persistent" means that data is stored persistently on disk once written; "distributed" means that data is stored across many independent nodes; and "sparse" means that many values are not defined.
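The "sparse, multi-dimensional sorted map" can be approximated with an ordinary dictionary keyed by tuples. In this sketch the key is assumed to be (row, "family:qualifier", timestamp), and the row and column names are illustrative; persistence and distribution are left out entirely.

```python
# Map from (row key, column, timestamp) to value. Only cells that exist are
# stored, which is what makes the table sparse.
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def read_latest(row, column):
    """Return the value with the highest timestamp for one cell, if any."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    if not versions:
        return None
    return max(versions, key=lambda pair: pair[0])[1]

def scan_rows(start, end):
    """Row keys in sorted order within [start, end), as a sorted map allows."""
    return sorted({r for (r, _c, _ts) in table if start <= r < end})

put("com.cnn.www", "contents:", 1, "<html>v1")
put("com.cnn.www", "contents:", 2, "<html>v2")
put("com.cnn.www", "anchor:cnnsi.com", 1, "CNN")
put("org.example", "contents:", 1, "<html>example")
```

Because keys are kept in sorted order in the real system, a range of adjacent rows can be read together; the dictionary above only emulates this by sorting on demand.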
Because BigTable is a proprietary Google system, we can only access it through the open API of the Google Datastore. However, there is a Hadoop counterpart of BigTable called HBase, which is very similar to BigTable, and some literature describes how it reads and writes data. As discussed in [8], when reading data by row key, the client queries the RegionServer that serves the corresponding region. After that, all HRegions that store a column family whose data is requested by the query must be checked. Finally, only the latest version of each value is returned. When writing, rows are written to the in-memory map of the corresponding RegionServer. An update operation is equivalent to writing a new version of the data. Row deletion depends on where the row is located: rows in the in-memory map are simply deleted, whereas rows in HBase files, which are immutable, are flagged with deletion markers.
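The write and delete behavior described above can be sketched as a toy region with an in-memory map and a list of immutable files. `TinyRegion`, `TOMBSTONE`, and `flush()` are hypothetical names, and the logic follows the description in the text rather than the actual HBase source code.

```python
TOMBSTONE = object()  # deletion marker

class TinyRegion:
    def __init__(self):
        self.memstore = {}  # in-memory map: row -> newest value
        self.files = []     # flushed snapshots, oldest first, never modified

    def put(self, row, value):
        # An update is simply a newer version written to memory.
        self.memstore[row] = value

    def flush(self):
        """Freeze the in-memory map into a new immutable file."""
        if self.memstore:
            self.files.append(dict(self.memstore))
            self.memstore = {}

    def delete(self, row):
        if any(row in f for f in self.files):
            # Files cannot be changed, so the delete is recorded as a marker.
            self.memstore[row] = TOMBSTONE
        else:
            # The row exists only in memory and can just be dropped.
            self.memstore.pop(row, None)

    def get(self, row):
        # Check the newest layer first and return the latest version.
        for layer in [self.memstore] + self.files[::-1]:
            if row in layer:
                value = layer[row]
                return None if value is TOMBSTONE else value
        return None

region = TinyRegion()
region.put("row1", "v1")
region.flush()
region.put("row1", "v2")   # newer version shadows the flushed one
region.delete("row1")      # marker hides both versions
```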
2.4 Pros and cons
Pros. Compared with RDBMSs, NoSQL databases are most notable for their scalability: they are very fast at adding new data and at simple operations and queries. In addition, NoSQL data models can provide a very flexible schema, whereas schemas in an RDBMS are highly restricted by relations.
Cons. First, without inherent constraints, the responsibility for ensuring data integrity in NoSQL databases falls entirely to the application, which is unlikely to be bug-free in practice. In an RDBMS, by contrast, data that violate integrity constraints cannot physically be entered into the database, which ensures the robustness of the database. Second, like other NoSQL systems, MongoDB and BigTable cannot provide the ACID transactional properties that guarantee that database transactions are processed reliably. This is the trade-off between consistency and scalability. Furthermore, without relationships among entities, NoSQL databases cannot provide complex query operations such as entity joins.
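The first and last points can be illustrated together: when the store enforces no foreign keys and offers no joins, the application has to check references and combine entities itself. The `users` and `orders` data and both helper functions are invented for this sketch.

```python
users = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
orders = [
    {"order_id": 100, "user_id": 1, "total": 20.0},
    {"order_id": 101, "user_id": 3, "total": 5.0},  # refers to a user that does not exist
]

def dangling_orders(orders, users):
    """Integrity check the application must run, since the store accepted the bad row."""
    return [o["order_id"] for o in orders if o["user_id"] not in users]

def join_orders_with_users(orders, users):
    """Application-side join: one lookup per order instead of a SQL JOIN."""
    return [
        {"order_id": o["order_id"], "name": users[o["user_id"]]["name"]}
        for o in orders
        if o["user_id"] in users
    ]

bad = dangling_orders(orders, users)
joined = join_orders_with_users(orders, users)
```

In an RDBMS a foreign key constraint would have rejected order 101 at insert time, and the join would be a single declarative query.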
3 References
[1]. Is the Relational Database Doomed? http://readwrite.com/2009/02/12/is-the-relational-database-doomed
[2]. Cattell, Rick. "Scalable SQL and NoSQL data stores." ACM SIGMOD Record 39.4 (2011): 12-27.
[3]. 10 things you should know about NoSQL databases. http://www.techrepublic.com/blog/10-things/10-things-you-should-know-about-nosql-databases/
[4]. MongoDB. "Top 5 considerations when evaluating NoSQL Databases." White Paper.
[5]. Wikipedia: http://en.wikipedia.org/wiki/NoSQL
[6]. MongoDB CRUD Introduction: http://docs.mongodb.org/manual/core/crud-introduction/
[7]. Chang, Fay, et al. "Bigtable: A distributed storage system for structured data." ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 4.
[8]. Storage of Structured Data: BigTable and HBase: http://lsd.ls.fi.upm.es/lsd/nuevas-tendencias-en-sistemas-distribuidos/HBase 2.pdf