Sunday, January 4, 2015

The Anatomy of "No SQL"

Continuing from where i stopped about NO SQL in my previous post, in this article we will have a closer look at what options we have, when it comes to storing data for Big Data storage/processing. Well, the preference is obviously for NoSQL databases.

Defination of NoSQL:
the following five characters of NoSQL gives better understanding before the definition

"Non-Relational", "Open-Source", "Cluster-Friendly", "21st Century Web" and "Schema-Less"

NoSQL databases are a way of persisting data in non-relational way. Here, the data is no longer stored in rigid schemas of tables and columns distributed across various tables. Instead, related data is stored together in a fluid schema-less fashion. NoSQL databases tend to be schema-less (key-value stores) or have structured content, but without a formal schema (document stores).

How to Data Model in NOSQL:
Yet another complicated question asked by people worked on RDBMS technologies. in a broad sense NOSQL Data Modeling falls under four categories to choose from. This also helps us to choose the NO SQL under laying technologies.

Different Types of NoSQL Data Modeling
Broadly it has four types of Data Modeling
1>Document type - MongoDB CouchDB and Raven DB
Apache Cassandra, Google Big Table, HBase
2>Column family - Apache Cassandra, Google Big Table, HBase
3>Graph -Neo4j
4>Key-Value - Redis and Riak

In this article we will focus in detail about the key-value pair and document oriented database, as these are the most commonly used ones

Cassandra
Used by NetFlix, eBay, Twitter, Reddit and many others, is one of today’s most popular NoSQL-databases in use. According to the website, the largest known Cassandra setup involves over 300 TB of data on over 400 machines. Cassandra provides a scalable, high-availability data store with no single point of failure. Interestingly, Cassandra forgoes the widely used Master-Slave setup, in favor of a peer-to-peer cluster. This contributes to Cassandra having no single-point-of-failure, as there is no master-server which, when faced with lots of requests or when breaking, would render all of its slaves useless. Any number of commodity servers can be grouped into a Cassandra cluster. There are only two ways to query, by key or by key-range.
Data Modeling in Cassandra
Data storage in Cassandra is row-oriented, meaning that all content of a row is serialized together on disk. Every row of columns has its unique key. Each row can hold up to 2 billion columns [²]. Furthermore, each row must fit onto a single server, because data is partitioned solely by row-key.
  • The following layout represents a row in a Column Family (CF):
Column Family
  • The following layout represents a row in a Super Column Family (SCF):
Super Column Family
  • The following layout represents a row in a Column Family with composite columns. Parts of a composite column are separated by ‘|’. Note that this is just a representation convention; Cassandra’s built-in composite type encodes differently, not using ‘|’. (Btw, this post doesn’t require you to have detailed knowledge of super columns and composite columns.)
Column Family with composite columns
Use cases
Now if we quickly discuss the use cases where you would use the Key Value kind of database is probably where you would only have a query based on the key. The database does not care what is stored as the value. The indexes are only on the key and you always retrieve and insert values as one big chunk of black box.
- See more at: http://blog.aditi.com/data/what-why-how-of-nosql-databases/#sthash.2szaRw8d.dpuf
Cassandra

Cassandra
Used by NetFlix, eBay, Twitter, Reddit and many others, is one of today’s most popular NoSQL-databases in use. According to the website, the largest known Cassandra setup involves over 300 TB of data on over 400 machines. Cassandra provides a scalable, high-availability data store with no single point of failure. Interestingly, Cassandra forgoes the widely used Master-Slave setup, in favor of a peer-to-peer cluster. This contributes to Cassandra having no single-point-of-failure, as there is no master-server which, when faced with lots of requests or when breaking, would render all of its slaves useless. Any number of commodity servers can be grouped into a Cassandra cluster. There are only two ways to query, by key or by key-range.

Data Modeling in Cassandra Data storage in Cassandra is row-oriented, meaning that all content of a row is serialized together on disk. Every row of columns has its unique key. Each row can hold up to 2 billion columns [²]. Furthermore, each row must fit onto a single server, because data is partitioned solely by row-key.

 

MongoDB

This is a NoSQL database which supports the notion of documents. Documents are JSON structures, to be precise in case of MongoDB it is BSON (Binary equivalent of JSON).

Below is the terminology used in Mongo DB and its analogy with respect to normal RDBS:-

TABLE — > Collection

ROW — > Document

Primary Key — > _id

A sample document looks like the picture below, which is nothing but key value pairs, but unlike key-value database, here you can index and have a query individual key within the document.

{ “item”: “pencil”, “qty”: 500, “type”: “no.2″ }

For document stores, the structure and content of each “document” are independent of other documents in the same “collection”. Adding a field is usually a code change rather than a database change: new documents get an entry for the new field, while older documents are considered to have a null value for the non-existent field. Similarly, “removing” a field could mean that you simply stop referring to it in your code rather than going through the trouble of deleting it from each document (unless space is at a premium, and then you have the option of removing only those with the largest content). Contrast this to how an entire table must be changed to add or remove a column in a traditional row/column database.

Documents can also hold lists as well as other nested documents. Here’s a sample document from MongoDB (a post from a blog or other forum), represented as JSON:

{ _id : ObjectId(“4e77bb3b8a3e000000004f7a”), when : Date(“2011-09-19T02:10:11.3Z”), author : “alex”, title : “No Free Lunch”, text : “This is the text of the post. It could be very long.”, tags : [ “business”, “ramblings” ], votes : 5, voters : [ “jane”, “joe”, “spencer”, “phyllis”, “li” ], comments : [ { who : “jane”, when : Date(“2011-09-19T04:00:10.112Z”), comment : “I agree.” }, { who : “meghan”, when : Date(“2011-09-20T14:36:06.958Z”), comment : “You must be joking. etc etc …” } ] }

Note how “comments” is a list of nested documents with their own independent structure. Queries can “reach into” these documents from the outer document, for example to find posts that have comments by Jane, or posts with comments from a certain date range.

Some of the notable advanced features of MongoDB include, automatic master slave replication, auto sharing of data, very rich query language, supports 2nd level of indexes on documents ensuring efficient retrievals, in-built support for Map-Reduce. It also offers very fine grained control over the reliability and durability for someone who does not like the auto pilot mode.



No comments:

Post a Comment