Showing posts with label Riak. Show all posts

Sunday, January 4, 2015

The Anatomy of "No SQL"

Continuing from where I stopped on NoSQL in my previous post, in this article we will take a closer look at the options available for Big Data storage and processing. The preference, unsurprisingly, is for NoSQL databases.

Definition of NoSQL:
Before a formal definition, the following five characteristics of NoSQL give a better understanding:

"Non-Relational", "Open-Source", "Cluster-Friendly", "21st Century Web" and "Schema-Less"

NoSQL databases are a way of persisting data in non-relational way. Here, the data is no longer stored in rigid schemas of tables and columns distributed across various tables. Instead, related data is stored together in a fluid schema-less fashion. NoSQL databases tend to be schema-less (key-value stores) or have structured content, but without a formal schema (document stores).
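The distinction above can be sketched in a few lines of Python. This is a hypothetical illustration with plain dicts standing in for a real store, not any particular database's API:

```python
# Hypothetical illustration: the same kind of record in a rigid relational row
# versus a schema-less document. Plain Python stand-ins, no real database.

# Relational thinking: every row has exactly these columns, even when empty.
relational_row = ("u42", "Alice", "alice@example.com", None)  # id, name, email, phone

# Document thinking: related data travels together, and each document
# may carry a different set of fields.
doc_a = {"_id": "u42", "name": "Alice", "email": "alice@example.com"}
doc_b = {"_id": "u43", "name": "Bob", "phone": "555-0100",
         "addresses": [{"city": "Pune"}, {"city": "Bangalore"}]}  # nested, extra fields

# No schema migration was needed for doc_b's extra fields.
print(doc_a.get("phone"))             # None: absent fields simply read as missing
print(doc_b["addresses"][0]["city"])  # related data stored together
```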

How to Data Model in NoSQL:
This is yet another complicated question, especially for people who have worked on RDBMS technologies. Broadly, NoSQL data modeling falls under four categories, and choosing among them also helps us choose the underlying NoSQL technology.

Different Types of NoSQL Data Modeling
Broadly, there are four types of data modeling:
1>Document - MongoDB, CouchDB and RavenDB
2>Column family - Apache Cassandra, Google BigTable, HBase
3>Graph - Neo4j
4>Key-Value - Redis and Riak

In this article we will focus in detail on the key-value and document-oriented databases, as these are the most commonly used ones.

Cassandra
Used by Netflix, eBay, Twitter, Reddit and many others, Cassandra is one of today's most popular NoSQL databases. According to its website, the largest known Cassandra setup involves over 300 TB of data on over 400 machines. Cassandra provides a scalable, highly available data store with no single point of failure. Interestingly, Cassandra forgoes the widely used master-slave setup in favor of a peer-to-peer cluster. This contributes to Cassandra having no single point of failure: there is no master server which, when overloaded or broken, would render all of its slaves useless. Any number of commodity servers can be grouped into a Cassandra cluster. There are only two ways to query: by key or by key-range.
Data Modeling in Cassandra
Data storage in Cassandra is row-oriented, meaning that all content of a row is serialized together on disk. Every row of columns has its own unique key. Each row can hold up to 2 billion columns. Furthermore, each row must fit onto a single server, because data is partitioned solely by row key.
  • The following layout represents a row in a Column Family (CF):
[Figure: Column Family row layout]
  • The following layout represents a row in a Super Column Family (SCF):
[Figure: Super Column Family row layout]
  • The following layout represents a row in a Column Family with composite columns. Parts of a composite column are separated by '|'. Note that this is just a representation convention; Cassandra's built-in composite type encodes differently, not using '|'. (This post does not require detailed knowledge of super columns and composite columns.)
[Figure: Column Family with composite columns]
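The row-oriented model and its two query styles can be illustrated roughly in Python. This is a toy sketch with a dict standing in for a column family, not an actual Cassandra driver; the row keys are made up:

```python
# Toy sketch of Cassandra's row-oriented model (pure Python, not a driver).
# A column family maps row-key -> {column-name: value}; row keys are kept
# ordered, and the only lookups offered are by key or by key-range.
from bisect import bisect_left, bisect_right

column_family = {
    "user:alice": {"age": "30", "city": "Pune"},
    "user:bob":   {"age": "25", "city": "Delhi"},
    "user:carol": {"age": "41", "city": "Mumbai"},
}

def get(row_key):
    """Query by key: fetch one row's columns."""
    return column_family.get(row_key)

def key_range(start, end):
    """Query by key-range: all rows whose keys fall between start and end."""
    keys = sorted(column_family)
    return {k: column_family[k]
            for k in keys[bisect_left(keys, start):bisect_right(keys, end)]}

print(get("user:bob"))                            # one whole row
print(list(key_range("user:alice", "user:bob")))  # ['user:alice', 'user:bob']
```

There is no way in this model to ask, say, "all users in Pune" without building that index yourself, which mirrors the query restrictions described above.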
Use cases
Now, to quickly discuss the use cases: you would use a key-value database where you only ever query by the key. The database does not care what is stored as the value; indexes exist only on the key, and you always retrieve and insert values as one big opaque chunk.
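The key-value flavour can be sketched in a few lines. A dict and JSON serialization stand in here for a store like Redis or Riak; the key names are hypothetical:

```python
# Minimal key-value sketch (think Redis/Riak), using a dict and JSON.
# The store indexes only the key; the value is an opaque blob it never inspects.
import json

store = {}

def put(key, obj):
    store[key] = json.dumps(obj)   # whole object goes in as one chunk

def get(key):
    blob = store.get(key)
    return json.loads(blob) if blob is not None else None

put("cart:42", {"items": ["pencil", "eraser"], "total": 12})
cart = get("cart:42")              # whole object comes out; no query on "items"
print(cart["total"])
```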

MongoDB

This is a NoSQL database built around the notion of documents. Documents are JSON structures; to be precise, in MongoDB's case they are BSON (the binary equivalent of JSON).

Below is the terminology used in MongoDB and its analogy with a conventional RDBMS:

Table -> Collection

Row -> Document

Primary Key -> _id

A sample document looks like the snippet below, which is nothing but key-value pairs. Unlike a key-value database, however, here you can index and query individual keys within the document.

{ "item": "pencil", "qty": 500, "type": "no.2" }

For document stores, the structure and content of each “document” are independent of other documents in the same “collection”. Adding a field is usually a code change rather than a database change: new documents get an entry for the new field, while older documents are considered to have a null value for the non-existent field. Similarly, “removing” a field could mean that you simply stop referring to it in your code rather than going through the trouble of deleting it from each document (unless space is at a premium, and then you have the option of removing only those with the largest content). Contrast this to how an entire table must be changed to add or remove a column in a traditional row/column database.
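The "code change rather than database change" point can be sketched as follows. These are plain Python dicts standing in for documents in a collection; the field names are made up:

```python
# Hedged sketch: how a field is "added" or "removed" in a document store.
# Older documents never change on disk; the code treats a missing field as null.

collection = [
    {"_id": 1, "item": "pencil", "qty": 500},                  # old document
    {"_id": 2, "item": "pen", "qty": 200, "colour": "blue"},   # new field appears
]

for doc in collection:
    # .get() yields None for documents written before "colour" existed,
    # mirroring the "null value for the non-existent field" behaviour above.
    print(doc["item"], doc.get("colour"))

# "Removing" a field can be just as lazy: simply stop reading it, and only
# strip it from documents where space actually matters.
trimmed = dict(collection[1])
trimmed.pop("colour", None)
```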

Documents can also hold lists as well as other nested documents. Here’s a sample document from MongoDB (a post from a blog or other forum), represented as JSON:

{
  _id : ObjectId("4e77bb3b8a3e000000004f7a"),
  when : Date("2011-09-19T02:10:11.3Z"),
  author : "alex",
  title : "No Free Lunch",
  text : "This is the text of the post. It could be very long.",
  tags : [ "business", "ramblings" ],
  votes : 5,
  voters : [ "jane", "joe", "spencer", "phyllis", "li" ],
  comments : [
    { who : "jane", when : Date("2011-09-19T04:00:10.112Z"), comment : "I agree." },
    { who : "meghan", when : Date("2011-09-20T14:36:06.958Z"), comment : "You must be joking. etc etc ..." }
  ]
}

Note how “comments” is a list of nested documents with their own independent structure. Queries can “reach into” these documents from the outer document, for example to find posts that have comments by Jane, or posts with comments from a certain date range.
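Such a "reach into" query can be mimicked in pure Python. In MongoDB itself the equivalent filter would be written with dot notation, `db.posts.find({"comments.who": "jane"})`; here a list of dicts stands in for the collection:

```python
# Pure-Python sketch of querying into nested documents: find posts that have
# a comment by a given author. The data mirrors the blog-post example above.

posts = [
    {"title": "No Free Lunch",
     "comments": [{"who": "jane", "comment": "I agree."},
                  {"who": "meghan", "comment": "You must be joking."}]},
    {"title": "Second Post",
     "comments": [{"who": "joe", "comment": "Nice."}]},
]

def posts_with_comment_by(author):
    # any() "reaches into" each post's nested comment documents
    return [p for p in posts
            if any(c["who"] == author for c in p.get("comments", []))]

print([p["title"] for p in posts_with_comment_by("jane")])  # ['No Free Lunch']
```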

Some of the notable advanced features of MongoDB include automatic master-slave replication, auto-sharding of data, a very rich query language, secondary indexes on documents ensuring efficient retrieval, and built-in support for Map-Reduce. It also offers very fine-grained control over reliability and durability for anyone who does not like the autopilot mode.



Sunday, July 15, 2012

The Rise of "No SQL"

One of the most discussed topics among architects nowadays is "No SQL". The views expressed during these discussions tend to remain in favor of using RDBMS, and Big Data remains a somewhat unclear technology for many of them. In fact, I had a similar argument with my team, and it took hard effort to convince them of the advantages of using Big Data.

The Computing Layer (Application Layer) is getting transformed, so why not database technology?
The application computing layer has changed in fundamental ways over the last 10 years. This transformation happened rapidly, from mainframe systems to desktop applications to web technologies to the current mobile application trend. One of the main reasons for this rapid transformation is growing online business needs, with millions of users moving to the web and mobile. A modern web application can support millions of concurrent users by spreading load across a collection of application servers behind a load balancer. Changes in application behavior can be rolled out incrementally, without application downtime, by gradually replacing the software on individual servers. Adjustments to application capacity are easily made by changing the number of application servers.

Now comes database technology, which has not kept pace. This age-old "scale up" technology, still in widespread use today, was optimized for the applications, users and infrastructure of that era. Because it was designed for the centralized computing model, to handle more users one must buy a bigger server (increasing CPU, memory and I/O capacity). Big servers tend to be highly complex, proprietary, and disproportionately expensive pieces of engineered machinery, unlike the low-cost, commodity hardware typically deployed in web- and cloud-based architectures. And, ultimately, there is a limit to how big a server one can purchase, even given an unlimited willingness and ability to pay.

Upgrading a server is an exercise that requires planning, acquisition and application downtime to complete. Given the relatively unpredictable user growth rate of modern software systems, inevitably there is either over- or under-provisioning of resources. Too much and you've overspent; too little and users can have a bad application experience or the application can outright fail. And with all the eggs in a single basket, fault tolerance and high-availability strategies are critically important to get right.

Also, not to forget how rigid the database schema is, and how difficult it is to change the schema after inserting records. Want to start capturing new information you didn't previously consider? Want to make rapid changes to application behavior requiring changes to data formats and content? With RDBMS technology, changes like these are extremely disruptive and therefore frequently avoided, the opposite of the behavior desired in a rapidly evolving business and market environment.

Some tactics used to claim that RDBMS still works!
In an effort to argue that RDBMS still works with the current application layer, here are a few tactics.

Sharding
A technique where we split data across servers by doing horizontal partitioning. For example, we store 1 lakh records of users who belong to India on server 1 and the remaining records (rows) on server 2, so whenever there is a need to fetch records belonging to India, we get them from server 1.
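The routing side of sharding can be sketched in a few lines. Here a stable hash, rather than geography, decides which server owns a key; the server names are hypothetical:

```python
# Sketch of application-level sharding: the application, not the database,
# decides which server holds which rows. Server names are made up.
import hashlib

SHARDS = ["server1", "server2"]

def shard_for(user_id):
    # A stable hash keeps the same user on the same server across requests;
    # geography-based routing (India -> server 1) works the same way in spirit.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:1001"))  # always the same shard for the same key
```

Note that every query must now go through a routing function like this one, which is exactly the bookkeeping burden the next paragraph complains about.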

Well, this approach has serious problems when it comes to joins and normalization. Also, you have to create and maintain a schema on every server. If there is new information you want to collect, you must modify the database schema on every server, then normalize, retune and rebuild the tables. What was hard with one server is a nightmare across many. For this reason, the default behavior is to minimize the collection of new information.

Denormalizing
A normalized database farm is hard to implement. That is, if you are planning to "scale out", it is nearly impossible to achieve this on a normalized database, which also results in a lot of concurrency issues.

To support concurrency and sharding, data is frequently stored in a denormalized form when an RDBMS is used behind web applications. This approach potentially duplicates data in the database, requiring updates to multiple tables when a duplicated data item is changed, but it reduces the amount of locking required and thus improves concurrency.

Now, denormalizing a database defeats the purpose of being an RDBMS.
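The update-multiple-places cost of denormalization can be sketched as follows. Two dicts stand in for database tables, and the user's name is the deliberately duplicated item:

```python
# Sketch of the denormalization trade-off: the user's name is copied into
# every order row (so reads need no join), which means renaming the user
# must touch every duplicate. Pure-Python stand-in for two tables.

users = {"u1": {"name": "Alice"}}
orders = [
    {"order_id": 1, "user_id": "u1", "user_name": "Alice"},  # duplicated field
    {"order_id": 2, "user_id": "u1", "user_name": "Alice"},
]

def rename_user(user_id, new_name):
    users[user_id]["name"] = new_name
    for o in orders:                       # every duplicated copy must be updated
        if o["user_id"] == user_id:
            o["user_name"] = new_name

rename_user("u1", "Alicia")
print({o["user_name"] for o in orders})    # every order now reads 'Alicia'
```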

Distributed caching
Another tactic used to extend the useful scope of RDBMS technology has been to employ distributed caching technologies, such as Memcached.

Although this technique works well performance-wise, it falls flat cost-wise. Also, to me it looks like yet another tier to manage.
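The extra tier's read path is usually the cache-aside pattern, sketched below. Two dicts stand in for the cache tier and the RDBMS behind it; the key name is hypothetical:

```python
# Cache-aside sketch: check the cache tier first, fall back to the database,
# then populate the cache. Dicts stand in for Memcached and the RDBMS.

cache = {}                                  # the distributed-cache tier
database = {"user:1": {"name": "Alice"}}    # the (slower) RDBMS behind it
db_reads = 0

def fetch(key):
    global db_reads
    if key in cache:
        return cache[key]                   # cache hit: RDBMS untouched
    db_reads += 1
    value = database.get(key)               # cache miss: go to the database
    cache[key] = value                      # ...and remember it for next time
    return value

fetch("user:1")
fetch("user:1")
print(db_reads)   # 1 -> the second read was served entirely from the cache
```

Keeping `cache` consistent with `database` on every write is the management burden of this extra tier.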

Now comes the rise of  "No SQL"
The techniques used to extend the useful scope of RDBMS technology fight the symptoms, not the disease itself. Sharding, denormalizing, distributed caching and other tactics all attempt to paper over one simple fact: RDBMS technology is a forced fit for modern interactive software systems. Technology giants like Google, Facebook and Amazon are already moving away from RDBMS. Also, with Windows Azure Table Storage, Microsoft is serious about Big Data too.

Although the implementation differs in a big way compared to an RDBMS, NoSQL database management systems offer this common set of characteristics:

1>No schema: Data can be inserted into a NoSQL database without first defining a rigid database schema. As a corollary, the format of the data being inserted can be changed at any time, without application disruption. This provides immense application flexibility, which ultimately delivers substantial business flexibility.

2>Auto-sharding (also called "elasticity"): A NoSQL database automatically spreads data across servers, without requiring applications to participate. Servers can be added or removed from the data layer without application downtime, with data (and I/O) automatically spread across the servers.

3>Distributed query support: "Sharding" an RDBMS can reduce, or in certain cases eliminate, the ability to perform complex data queries. NoSQL database systems retain their full query expressive power.

4>Integrated caching: To reduce latency and increase sustained data throughput, advanced NoSQL database technologies transparently cache data in system memory. This behavior is transparent to the application developer and the operations team, in contrast to RDBMS technology, where the caching tier is usually a separate infrastructure tier that must be developed against, deployed on separate servers, and explicitly managed by a separate team.

What's more, NoSQL is also available free of cost under open-source licenses. Beyond Google, Amazon and Microsoft, a number of commercial and open-source database technologies such as Couchbase (a database combining the leading NoSQL data management technologies CouchDB, Membase and Memcached), MongoDB, Cassandra, Riak and others are now available, and they increasingly represent the "go to" data management technology behind new interactive web applications.