History and Trends in the Development of Databases

In this series about databases, I will introduce the development of this area of computing. We will move from the past through the present to the latest trends. The first part serves as an introduction to the topic.

Since the topic of databases is very broad, I have decided to divide it into three thematically independent parts. In this part, we will move together from the original need to sort data into the computer age. We will mention the development of database models, database management systems and query languages. The next part will cover the different types of database models, database management systems and their query languages from the technical side. We will also recapitulate their advantages and problems and get to less common types of databases, such as time-series databases, and the possibilities of their use. In the third part, you can expect an excursion into the increasingly discussed concept of Big Data; we will take a look at specialized database file systems and sorting algorithms, you will be introduced to the open-source Apache technologies (HBase, Hadoop, etc.), and there will be a little surprise, too.

Deep History

The establishment of databases as we know them today dates back to the 1950s. However, the need to store and sort data is much older.

Leaving aside prehistoric cave paintings and isolated finds of clay tablets, the library of Ugarit (a city in what is now Syria) can be considered the first documented example of a comprehensive effort to store data. A large number of clay tablets were found there, including diplomatic texts and literary works dating from the twelfth century BC. This documents an effort to collect data, but not yet to sort it. The effort to sort data is first objectively confirmed in the library at the Roman Forum. However, this deep history is only a grain of sand in time.

It is the index card that is considered the predecessor of computer databases. Its history does not reach back nearly as far. It is associated with the naturalist Carl Linnaeus, who in the 18th century introduced a system for sorting his records in which each species was recorded on a separate sheet of paper. Thanks to it, he could easily organize and extend his records.

Involvement of Electromechanical Machines

Index cards had one big drawback: they had to be handled manually, which was very restrictive. Therefore, in 1890, the American statistician and inventor Herman Hollerith created a tabulating machine for the needs of the public authorities in the census. This machine used punched cards to store information and represented the first electromechanical data processing. In 1911, four companies, one of which was Hollerith’s firm, merged to form the Computing-Tabulating-Recording Company. The company later changed its name to International Business Machines Corporation; it still exists and operates under its acronym, IBM. Electromechanical data processing remained at the technological forefront for the next half century.

Digitalisation Breakthrough

State authorities contributed to a shift in technology and digitization once again. Before World War II (in 1935), the obligation to record employees’ Social Security numbers was enacted in the USA, and IBM created new machines at the request of the authorities. A further milestone came in 1951, again in connection with the census: the UNIVAC I, which holds a privileged position in computer history as the first mass-produced digital computer for commercial use. Eventually, however, it became clear that writing dedicated program code for every database task was inefficient.

Arrival of Programming Languages

On 28 and 29 May 1959, a conference was held whose aim was to discuss and consider the possibilities of introducing a common computer programming language. A consortium, the Conference on Data Systems Languages (known by the acronym CODASYL), was established for this purpose, and in 1960 the language it described was named the Common Business-Oriented Language, abbreviated COBOL (version COBOL-60). I would like to mention the historic contribution of US Navy officer Grace Hopper, who came up with the idea of a programming language close to English. This idea subsequently influenced most programming languages and has survived to this day. The development of programming languages was accompanied by the transition from magnetic tapes to magnetic disks, which allowed direct access to data. This laid the foundations of modern database systems as we know them today.

The Database Management System

In 1961, Charles Bachman of General Electric introduced the first data store (the Integrated Data Store, IDS). At the CODASYL conference in 1965, elements of Bachman’s approach to managing this data store led to the concept of database systems. The Database Task Group committee was established to develop this concept, and Bachman was one of its members. In 1971, the committee published a report that defined the database management system. It introduced concepts such as data integrity, data model, database schema, atomicity and entity, and it described the architecture of the network database system.

During this period (1965-1970), database systems were divided into two basic models: the network model (most CODASYL database systems) and the hierarchical/tree model (IBM’s implementation). The hierarchical concept was limited when modelling reality, and the network concept also had its drawbacks. Edgar F. Codd, an IBM employee, realized this, and in 1970 he came up with the idea of the relational database model. The relational theory included basic operations for working with data (selection, projection, join, union, Cartesian product and difference), which make it possible to carry out all necessary operations. Codd was also inspired by Grace Hopper’s idea and suggested building the query language on understandable terms based on common English.

SQL and the Golden Age of Databases

The main feature of these database systems was the storage of structured data. The introduction of relational algebra, relational calculus and understandable terms led to the creation of the SEQUEL language (Structured English Query Language). Its revision to version 2 resulted in the Structured Query Language (SQL), which was standardized by ANSI in 1986 and by ISO in 1987. Since the 1970s, thanks to the success of SEQUEL, query languages have been inspired by the ideas and theories of E. F. Codd. The Ingres project at the University of California, Berkeley, and its QUEL language resulted in the development of another type of database model: the object-relational model, known as Postgres. Postgres and SQL were connected in 1995, when the QUEL query language interpreter was replaced by an SQL interpreter. In 1996, the project was renamed PostgreSQL.

Everything to Objects

The development of the object-relational model was closely associated with both previously implemented models. The object model experienced massive development mainly in the 1990s, although its origins lie in the golden age of databases. The push towards object-oriented programming in the 1990s carried over into the database world, and this was the time of the boom of purely object-oriented databases. Communication uses the Object Query Language (OQL), which builds on the SQL-92 standard, and the Object Definition Language (ODL), which is based on the programming languages used.

The ODMG-93 standard defines these languages as a superset of a more general model, the Common Object Model (COM). The same language used for programming is thus also used to communicate with the database and to acquire and present its data. Using the same language reduces the errors that may occur when mapping the relational model to objects. The advantage of object-oriented database management systems is the possibility of directly expressing the complexity of the modelled reality in the database, but more on that next time.

Migration to Unstructured “NoSQL” Databases

At the turn of the millennium, there was a certain shift in the perception of data. The limits of the approach built on a structured data model (the relational/object-oriented model) and object-oriented applications started to show. This led to dusting off the idea of the unstructured database, an idea that originated in the late 1960s but whose use was very rare at that time.

Today’s unstructured databases, called NoSQL (the term NoSQL was first coined in 1998 by Carlo Strozzi), share only this basic idea. The boom of unstructured databases is associated with Google, which presented its database design called BigTable, intended for large amounts of data. Amazon followed with its own unstructured database project, called Dynamo.

These database designs later became the basis of the NoSQL databases used today. Their biggest difference was that they were not row-oriented like SQL databases, but column-oriented. More about the difference between row and column data storage will come next time.

Summary

The issue of properly chosen data storage has been discussed a lot recently. An appropriately chosen data store, together with the appropriate use of technology for working with data, is a necessary condition for the smooth and efficient running of applications. Next time, we will say more about the technical solutions of the various database models and the advantages and disadvantages of relational database management systems and query languages.

We will go through and define the terms already mentioned, as well as terms such as database integrity, OLAP (online analytical processing) and database scalability.

Relational and non-relational data model in the context of business intelligence

Business intelligence provides valuable insight, allows for better interpretation and presentation of data and, at the same time, can help with better decision-making, which can ultimately mean a competitive advantage. What do the relational and non-relational models mean in this context?

History has shown us how this sector has responded to the increasingly demanding requirements of both the professional and the general public. With the advent of computers and the Internet, the amount of analyzable data has increased enormously, and this data has gained ever greater importance and value. A suitable analysis provides valuable insight, allows for better interpretation and presentation and, at the same time, can help with better decision-making, which can ultimately mean a competitive advantage.

At the end, our view of the advantages and problems of the technologies already mentioned will be put into the context of integrating web systems for decision-making support (business intelligence). This context should help in better understanding the issue. Let’s start with a description of several important terms.

Vertical and horizontal scaling

Scaling is the answer to the question of how to improve performance, while maintaining efficiency, in response to the evolving needs of the system. Vertical scaling means a qualitative change of the existing elements of the system (an upgrade/downgrade of the computer hardware, a change in the throughput of the network connectivity, and so on). Horizontal scaling means a quantitative change of the elements of the system (under high load, another computer may help with data processing).

Transaction

A transaction is an indivisible set of operations which are either all performed, or none of them is performed and the affected entities remain unchanged (e.g. a money transfer at a bank). This approach ensures the consistency and integrity of data over time.
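
As an illustration, here is a minimal sketch of such a bank-transfer transaction using Python’s built-in sqlite3 module (the table and account names are invented for the example): either both updates are committed, or the rollback leaves both accounts unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    # Both operations belong to one transaction.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()          # make both changes permanent at once
except sqlite3.Error:
    conn.rollback()        # on any failure, neither account is changed
```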

Relational database model

The term “relational” database model comes from set theory. The basic building blocks of relational databases are relations (tables) that contain records (rows). Most relational databases can be scaled horizontally, i.e. spread over multiple nodes (machines), which, however, brings certain difficulties.

To create relationships between the data, so-called foreign keys are used. Based on a defined relationship, individual tables can be combined into a single logical unit (by means of JOIN operations). Joining tables of a large horizontally scaled database may be expensive and slow, as the final joining of the relations must be done on a single node and a large data transfer between the various nodes of the database system can occur.
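
A minimal sketch of a foreign key and a JOIN, again with sqlite3 (the customers/orders schema is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # sqlite3 enforces FKs only with this pragma
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE orders (
                    id INTEGER PRIMARY KEY,
                    customer_id INTEGER REFERENCES customers(id),  -- foreign key
                    item TEXT)""")
conn.execute("INSERT INTO customers VALUES (1, 'alice')")
conn.execute("INSERT INTO orders VALUES (10, 1, 'book')")

# JOIN combines the two relations into one logical unit.
rows = conn.execute("""SELECT c.name, o.item
                       FROM customers c
                       JOIN orders o ON o.customer_id = c.id""").fetchall()
print(rows)   # [('alice', 'book')]
```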

Relational databases support transactions. The properties of transactions in the relational model include atomicity, consistency, isolation and durability (ACID), explained below. Safety, consistency and integrity of the data over time are in many cases paid for by lower database performance.

Non-relational database model

It is referred to as NoSQL, presented at its beginning as No-SQL (not SQL) and later rather as Not-only SQL. Even Carlo Strozzi, the author of the term, argued in 2010 that the label NoSQL was highly imprecise and suggested the more accurate label NoREL, as in non-relational. NoSQL is a newer database concept that allows for “fast and efficient processing of data with a focus on performance, reliability and agility”.

In the beginning, NoSQL databases were the antithesis of relational databases. They focused on availability with the shortest possible response time and on the ability to run on commodity hardware, without the possibility of defining a data structure, without support for relations between individual records and without the possibility of performing joining operations (JOINs).

Relational databases use transactions with an emphasis on ACID; NoSQL uses the BASE approach (explained below). This approach is not as strict as ACID: it allows temporary data inconsistency in exchange for increased availability and performance. As it does not block writes (unlike ACID transactions), it manifests itself as rapid responses, even at the cost of temporary inconsistencies in the stored data.

Descriptions of the basic types of NoSQL databases (document-oriented, key-value, BigTable, graph and time-series databases), as well as their advantages and problems, will be discussed in the next part, focused on BigData.

The high availability and scalability of data in NoSQL databases is closely related to the so-called CAP theorem.

ACID vs. BASE

In the previous section we learned that relational and non-relational databases use different approaches to transactions. However, the division into ACID and BASE is not strictly dependent on the type of database model. Today, the approaches merge in some databases (in some relational databases, the level of transaction isolation can be configured, whereas some NoSQL databases allow transactions with ACID properties and the definition of relations).

The acronym ACID:

  • Atomicity – The transaction is performed either completely or not at all, i.e. it is atomic.
  • Consistency – The transaction transforms the database from one consistent state to another. Only valid data that does not violate the integrity constraints is guaranteed to be written.
  • Isolation – Changes made within one transaction are not visible to other transactions before it completes.
  • Durability – Changes made by a successfully completed transaction are stored in the database permanently.

The acronym BASE:

  • Basically Available – it basically operates all the time, …
  • Soft State – … it may not be consistent the whole time, …
  • Eventual Consistency – … but it will eventually end up in a state known to us.

A nicely arranged comparison of the two approaches:

ACID:

  • strong consistency
  • isolation
  • orientation to commit
  • nested transactions
  • availability?
  • conservative (pessimistic)
  • complex evolution (e.g. schema)

BASE:

  • weak consistency (stale data)
  • availability first
  • approximate answers are OK
  • simpler, faster
  • delivery of data “as soon as possible”
  • aggressive (optimistic)
  • simpler evolution

CAP, also Brewer’s Theorem

The CAP theorem divides databases into three groups. Brewer states that a distributed system can meet at most two of the following three properties:

  • Consistency – at any one moment, the same data is available on all nodes (not to be confused with consistency in the description of the ACID properties).
  • Availability – every request is answered, successfully or unsuccessfully, within a negligibly short time.
  • Partition tolerance – the system remains functional despite network failures or outages of individual nodes.

This indicates that most relational databases are of the CA type, whereas NoSQL databases are more or less of the AP type.

Relational vs. NoSQL

The properties mentioned above indicate that relational databases can be used wherever we work with data whose structure changes minimally. Distributing a relational database across multiple parts is possible, but very inefficient because of the very characteristics of relational databases (interrelated data, JOINs and ACID transactions, blocking on writes).

Because of their poorer behaviour under horizontal scaling, it is recommended to work with an amount of data acceptable for the hardware performance, and data consistency should be the main requirement. When scaling horizontally, it is good to shard the data so that the nodes have as few links between them as possible and are as autonomous and independent as possible. Logically interrelated data should be stored on a single node.

In contrast, NoSQL databases can access data very quickly, but at the cost of possible inconsistency between nodes. Almost unlimited horizontal scaling while maintaining linear performance growth is possible. Since the amount of data that needs to be stored, analyzed and processed is rising sharply, this feature is very important.

We have presented the relational and NoSQL database models along with their advantages and disadvantages. The question of whether to choose a relational or a NoSQL database is a misleading one. The purpose of this description is only to show the advantages and disadvantages of the individual approaches and to outline how to get the most out of both. NoSQL databases cannot fully replace relational databases yet, but they offer many new directions for working with data.

Database in BI

At the beginning of the article, we mentioned the decision support system. Such a system can include a large number of tools and systems: enterprise resource planning (ERP), accounting systems, customer relationship management (CRM), records of users’ movements on the web, or other data that is analytically important and interesting for the company. Data from these systems can be appropriately transformed for subsequent analysis and, if necessary, complemented with analyzed data that adds value.

It should be noted that we should never perform analysis directly over production data. Production data is intended for fast access to detailed information (purchase orders, invoices, goods, …) and for the required operations (updating, inserting a new record, …). The difference is in the access pattern: the production database may be accessed by several thousand users at the same time, whereas the analytical data is accessed only by a limited number of them.

Furthermore, analytical data is generally not changed, only appropriately supplemented. During analysis, we want the flexibility to change the criteria and to have even computationally demanding results available almost immediately.

The way production data is handled is called OLTP (online transaction processing), whereas the analytical approach is called OLAP (online analytical processing). The basic principle of this technology is the OLAP cube (a multidimensional table) enabling fast and flexible changes across each dimension.
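
A tiny sketch of the OLAP-cube idea using pandas (the sales data and dimension names are made up): a pivot table lets us quickly re-slice the same facts along different dimensions.

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["north", "north", "south", "south"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [100, 120, 80, 90],
})

# One "slice" of the cube: regions as rows, quarters as columns.
cube = pd.pivot_table(sales, values="amount",
                      index="region", columns="quarter", aggfunc="sum")
print(cube)
```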

Online analysis can be used for two basic needs: reporting and viewing analytical data about the company, or in applications responding to the current requirements of the user. Let’s demonstrate it with two specific cases.

Analytical process in a transport enterprise

Imagine a transportation company which has a monitoring device built into every car. The basic data is complemented with geolocation data. In addition to fuel consumption and mileage, we can also monitor the traffic situation in individual locations depending on the time, or a worsening technical condition of a vehicle indicated by higher fuel consumption per mile. In ad hoc reports, we can respond flexibly and capture sudden deviations from the normal state.

We can implement a system of notifications and other automatic response processes.

Product recommendations by analyzing social networks

Product data of an online store can be extended with data on user behavior on social networks, which is an increasingly popular practice (a description of data mining from social networks is not the subject of this article). Based on user-defined relationships, we are able to offer our clients products based on their close friends’ interests. In addition, a suitable analysis can be used to gain insight into the behavior and actions of our competitors, to target advertising campaigns more appropriately and in a more personalized way, and so on.

In the first example, where we work with geolocation information, it is appropriate to use a database with built-in geolocation functions (e.g. the non-relational, document-oriented MongoDB database). For processing data from social networks, non-relational graph databases are a perfect fit. For both presented examples, a combination of a relational production database and a non-relational database for advanced analysis and some internal processes suggests itself.
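
For the transport example, a hedged sketch of how such a geolocation query might look with MongoDB and the pymongo driver (the connection string, database, collection and field names are assumptions for illustration): a 2dsphere index allows us to ask for vehicles near a given point.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")    # assumed local instance
positions = client["fleet"]["positions"]             # assumed db/collection names

positions.create_index([("location", "2dsphere")])   # geospatial index

# Vehicles reported within 500 m of a point of interest (GeoJSON: [lon, lat]).
nearby = positions.find({
    "location": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [14.42, 50.09]},
            "$maxDistance": 500,
        }
    }
})
for doc in nearby:
    print(doc["vehicle_id"], doc["location"])
```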

In conclusion

We could invent a great number of examples like these of combining relational and NoSQL databases; however, this should be sufficient to give you an idea. Next time, we will discuss the above-mentioned kinds of non-relational databases, as well as other issues, with a focus on the concept of BigData.

BigData, NoSQL, Resource Utilisation and a Few Words in Conclusion

In the previous text, I promised a closer look at the concept of BigData and the individual kinds of non-relational databases. I will add a brief description of resource utilisation to this. In the conclusion, I will quickly summarise what I have covered in this three-part series about databases and database models. Unfortunately, due to its length, I will not get to all the aforementioned topics in this part; I will therefore cover the Apache Software Foundation products for easier data handling, and the promised surprise, in the next post.

BigData

Although I am very interested in BigData, I have not found an exact definition of the term. I therefore prefer Gartner’s definition, which describes BigData using the 3Vs – high Volume, high Velocity and high Variety (source 5). As you can see, it is not just about the amount of data, but also about the speed of reading/writing and processing and about the variety of structures of the stored data. An example of BigData is the CERN Large Hadron Collider, which should generate approximately 15 PB (1 PB = 2^50 bytes) of data per year (source 4). Although not all of this data is stored in databases, it is an enormous amount.

Models of non-relational databases

Development in this sector has been very rapid. For ease of reference, we can divide non-relational databases according to the database model used. The basic models are the document-oriented model, the key-value model, the BigTable model, the graph model and the time-series model.

Document-oriented model

The basic element of this model is the document, which contains semi-structured data. In addition to the data itself, a description of its structure (metadata) is also stored in the document. Databases of this type are usually used for storing data in the XML, YAML, JSON or BSON formats. The information in the database can be indexed, and most databases do this automatically. This makes it possible to query either by key or by the actual content of the document. To organise the data, it is possible to group documents into logical sets (collections) and assign rules to them (such as access rights, etc.).

The structure of each document is independent, and documents with different structures may coexist in one collection. Databases also allow the user to work with only parts of a document. Document-oriented databases are a highly flexible NoSQL solution suitable for storing unstructured data, with the possibility to define a structure where necessary. It is appropriate to use them where the emphasis is on a universal, scalable solution with the possibility of searching large texts.
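
A minimal sketch of the document idea in plain Python (no particular database assumed, the data is invented): documents in one collection can have different structures, and we can still query by their content.

```python
# A "collection" of semi-structured documents with differing fields.
articles = [
    {"_id": 1, "title": "NoSQL intro", "tags": ["db", "nosql"]},
    {"_id": 2, "title": "Hadoop basics", "author": "jane", "views": 1200},
]

# Query by content: documents tagged "nosql".
hits = [doc for doc in articles if "nosql" in doc.get("tags", [])]
print(hits)   # [{'_id': 1, 'title': 'NoSQL intro', 'tags': ['db', 'nosql']}]
```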

Key-value model

This model can be thought of as a dictionary, or as the equivalent of a two-column table in an RDBMS. The difference is in the approach to the data. While the relational model allows querying by both the key and the value, in the non-relational model it was originally possible to query only by the key (nowadays even this rule is broken thanks to secondary indices, e.g. in Riak). The biggest advantages of this model are its simplicity and its ease of horizontal scaling. There are many databases using the key-value model.

Some of them offer full support for ACID transactions and ensure consistency (e.g. HyperDex). In some cases they may therefore replace a relational database and add the advantages of easy scalability and higher performance to an existing solution. Others, due to their properties, can be used, for example, as a cache (e.g. Memcached).
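
A sketch of why the key-value model scales horizontally so easily (pure Python, invented node names): because access always goes through the key, the key alone can decide which node stores the value.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]
stores = {node: {} for node in NODES}          # one dict stands in for each "node"

def node_for(key: str) -> str:
    # Hash the key and map it onto one of the nodes.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

def put(key: str, value) -> None:
    stores[node_for(key)][key] = value

def get(key: str):
    return stores[node_for(key)].get(key)

put("user:42", {"name": "alice"})
print(node_for("user:42"), get("user:42"))
```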

BigTable model

Sometimes it is also called the column-family model. Since these databases are mainly clones of the BigTable database (developed by Google), I personally prefer the designation BigTable model. Databases using this model store data as multidimensional maps. These maps can be imagined as a table where the key consists of a “row” identifier and a “column” identifier.

A timestamp is also added to this key; it is used to detect write conflicts and for expiration. The key further contains a group identifier, the so-called column family. BigTable databases scale extremely well and can handle very large amounts of data. Another advantage is how easily new records can be added. The disadvantages are lower performance when using a small number of nodes (a minimum of 3, source 1) and the lack of indexing. Because of this design, searching without knowing the key is also very inefficient. Today, this model finds application, for example, in geolocation databases or databases tracking user behaviour on the web.
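
A sketch of the BigTable-style multidimensional map in plain Python (the row key, "family:qualifier" column and timestamp form the composite key; all names are invented): reading requires knowing the key, which is why searching without it is inefficient.

```python
import time

# (row key, "family:qualifier", timestamp) -> value
table = {}

def put(row, column, value):
    table[(row, column, time.time_ns())] = value   # older versions are kept

def get_latest(row, column):
    # Pick the newest timestamp for the given row/column.
    versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    return max(versions)[1] if versions else None

put("user#42", "profile:name", "alice")
put("user#42", "profile:name", "alice b.")        # newer version of the same cell
print(get_latest("user#42", "profile:name"))      # 'alice b.'
```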

Graph model

Graph databases differ from the databases mentioned above. This model represents the structure of the data as a graph: records are nodes and links are edges, which are directed and supplemented by attributes and a weighting of the relationship (based on graph theory). A relationship between nodes of the graph always has a direction, which determines its meaning, and attributes providing useful information. The direction can be imagined in terms of relationships between people.

John likes Jane, but it does not have to be true that Jane likes John. In some situations, if required, we can ignore the direction of the relationship.

Since the graph model focuses on the links between data, these databases can be extremely difficult to scale. Compared with relational databases, however, they handle and process highly linked data far more efficiently. NoSQL is therefore not only about scalability and data structure. It is suitable to use this kind of database wherever we want to search for shortest paths, analyse the links between objects and gain additional benefits from graph analyses.
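
A minimal sketch of what graph databases optimise for, in plain Python (the names are invented): the data is an adjacency structure of directed edges, and a shortest path is found by breadth-first search.

```python
from collections import deque

# Directed "likes" edges between people.
edges = {
    "john": ["jane", "paul"],
    "jane": ["paul"],
    "paul": ["anna"],
    "anna": [],
}

def shortest_path(start, goal):
    # Breadth-first search over the directed graph.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in edges.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None

print(shortest_path("john", "anna"))   # ['john', 'paul', 'anna']
```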

Time-series model

Time series are materially and spatially comparable observations (data points) that are uniquely ordered in time from the past to the present (source 4). We can distinguish several types of time series – interval, instantaneous or derived. Data points are usually identified by a name, tags and metadata. The value is usually limited to a simple data type without a complex structure. These databases support the basic statistics: sum, average, maximum and minimum.

Since the collected data is mostly regular in character, it can be aggregated and scaled very well. Given the nature of the data, it is possible to process a large number of records in near real time.

Another advantage of the time-series model is that it does not matter which type we use.
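
A small sketch of the typical time-series workload in plain Python (the measurement data is invented): regular data points are bucketed by hour and reduced to the basic statistics mentioned above.

```python
from datetime import datetime

# (timestamp, value) measurements, e.g. fuel consumption readings.
points = [
    (datetime(2015, 6, 1, 10, 5), 7.9),
    (datetime(2015, 6, 1, 10, 35), 8.4),
    (datetime(2015, 6, 1, 11, 10), 7.1),
]

# Aggregate per hour: sum, average, maximum, minimum.
buckets = {}
for ts, value in points:
    hour = ts.replace(minute=0, second=0, microsecond=0)
    buckets.setdefault(hour, []).append(value)

for hour, values in sorted(buckets.items()):
    print(hour, sum(values), sum(values) / len(values), max(values), min(values))
```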

Resource utilisation

As mentioned above, demands on the amount of data and the speed of its processing for presentation are growing; ideally, data is presented in real time. The speed and efficiency of resource utilisation is therefore crucial. When we want to get data from a database, there is a big difference in speed depending on whether the data is served from main memory or from physical disks (solid-state or hard drives).

Generally, NoSQL databases try to keep as much relevant data as possible in main memory. With appropriate settings and an appropriately chosen data distribution when scaling the system, we can achieve very efficient data retrieval where most of the data is served from memory. However, even with NoSQL databases we cannot avoid working with physically stored data (reading/writing), and this is usually the stumbling block. New demands on working with data on physical storage therefore emerge.

Performance improvements could be brought by new technologies in the field of file systems. Copy-on-write file systems have been developed for Solaris systems by Sun/Oracle (the ZFS file system) and for Linux systems (the Btrfs file system). These systems never modify the original blocks; they only modify copies of them, and they handle metadata in the same way. Upon completion, they invalidate the original blocks and validate the new ones.

The file system is therefore always consistent and avoids the duplicate writes that occur in conventional journaling file systems. However, we still have to wait some time to assess their actual contribution to resource utilisation.
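
A very rough sketch of the copy-on-write idea in plain Python (simplified far beyond a real file system): a write creates a new block, and only the final pointer switch makes it visible, so a reader always sees a consistent version.

```python
# Simplified copy-on-write: blocks are never modified once written.
blocks = {0: b"old data"}
metadata = {"current_block": 0}        # which block is the valid one
next_block_id = 1

def cow_write(new_data: bytes) -> None:
    global next_block_id
    new_id = next_block_id
    next_block_id += 1
    blocks[new_id] = new_data           # write a copy, the original stays untouched
    metadata["current_block"] = new_id  # the switch validates the new block

def read() -> bytes:
    return blocks[metadata["current_block"]]

cow_write(b"new data")
print(read())   # b'new data'; block 0 still exists until it is reclaimed
```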

Summary

In this mini-series about databases, I went over the history of their development and attempted to outline the new possibilities that the development of this sector brings us. Not everything old is bad, and not everything new is a salvation. With a suitable combination, however, it is possible to reach interesting results both in the speed of data processing and in resource utilisation savings.

New trends are gradually heading towards using and linking the ideas of the non-relational and relational models, which we have put into the context of business intelligence and which can represent a competitive advantage. Furthermore, they can bring savings in the use of system and hardware resources, and they also show new possibilities in the attempt to get the most out of these resources.

Adhering only to the relational model, or a complete transition to the non-relational model, proves in many cases to be less effective than combining the two models. In the next article, in addition to introducing the Apache Software Foundation tools, I will describe the thought process that led my colleague and me to use a relational database combined with several types of non-relational databases.

I will then support some of my statements with graphs and numbers.

Distributed Data Processing Using Apache Software Foundation Tools

Now we will focus on the Apache Software Foundation technologies that make processing large amounts of data easier. I am going to describe the Apache Hadoop ecosystem, the MapReduce model for distributed processing and its implementation used in Hadoop. Furthermore, I am going to introduce the HDFS file system, the HBase column-oriented database, the Hive data warehouse and ZooKeeper, a centralized service for configuration management, distributed synchronization and group services.

Apache Hadoop

Hadoop is a framework for running parallel applications, designed for processing large amounts of data on ordinary PCs. It includes several mutually independent tools, each of which solves a certain part of the process of capturing, storing and processing data (e.g. the freely available non-relational database Cassandra). Hadoop was created in 2005 by Mike Cafarella and Doug Cutting.

It is designed to detect and handle failures at the application level in order to ensure higher reliability. Currently, it is a project unique in its functionality and popularity, and a lot of commercial solutions are based on it. The largest “players”, such as Microsoft, Yahoo, Facebook, eBay, etc., use it in their solutions and are also involved in its development.

Hadoop Common

It provides a set of auxiliary functions and interfaces for distributed file systems and general input/output, which is used by all the framework libraries (serialization tools, Java RPC – Remote Procedure Call, robust data structures, etc.).

Hadoop YARN

It is the task scheduling and monitoring manager of the Hadoop framework. It introduces the concept of a resource manager which, based on information from the scheduler and the node managers, handles the distribution of tasks to the individual nodes (usually represented by single machines) in the cluster. The resource manager receives information not only about the condition of the nodes, but also about the success or failure of individual tasks.

It also monitors the condition of the nodes on which Hadoop is running, and in case of a failure of one node, it forwards the request to another node of the cluster.

Hadoop MapReduce

The task of processing large amounts of text data is basically very simple. However, if the main requirement is to finish the calculation in a reasonable time, it is necessary to divide it among multiple machines, which makes this relatively simple task complicated.

Google therefore came up with an abstraction that hides parallelization, load balancing and data splitting inside a library with the MapReduce programming model, which can be run on ordinary computers. Its makers took the inspiration for the name from the “primitives” of functional programming languages (Haskell, Lisp).

The “Map” in the name refers to mapping a unary function over a list and is used to divide the task among the individual slave nodes. “Reduce” means reducing these lists using a binary function and serves to combine the results on the master node (removing duplicate indexes and lines).

It is evident that the master/slave principle is applied here. In essence, the master node receives a user request and forwards the map function to its slave servers to execute. It then collects the results from the slave servers, reduces them and returns the final result to the user.
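
A toy sketch of the MapReduce model in plain Python (no Hadoop involved, the documents are invented): the map function emits (word, 1) pairs, the pairs are grouped by key as a shuffle stage would do, and the reduce function combines the counts.

```python
from collections import defaultdict

documents = ["big data is big", "data about data"]

def map_phase(doc):
    # Emit a (key, value) pair for every word.
    return [(word, 1) for word in doc.split()]

def reduce_phase(word, counts):
    # Combine all values emitted for one key.
    return word, sum(counts)

# "Shuffle": group the mapped pairs by key, as would happen between the phases.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(results)   # e.g. [('big', 2), ('data', 3), ('is', 1), ('about', 1)]
```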

Hadoop MapReduce is based on Hadoop YARN, thus it acquires the following characteristics:

  • Distributability – even distribution of tasks to the individual computing nodes with regard to their performance.
  • Scalability – the possibility of adding additional nodes to the cluster.
  • Robustness – task recovery in case of a local failure.

Hadoop Distributed File System (HDFS)

HDFS is a distributed file system designed to store very large files. It is an application written in Java that forms another layer on top of the operating system’s file system. The HDFS client provides an application programming interface (API) similar to the POSIX standards; however, the authors deliberately did not follow POSIX exactly, in favour of better performance. HDFS stores the information about the data and the data itself separately.

Metadata is stored on a dedicated server called the “NameNode”, while the application data is located on multiple servers called “DataNodes”. All servers are interconnected and communicate with each other using TCP-based protocols.

Briefly, this file system can be described as:

  • Economical – it runs on standard servers.
  • Reliable – it solves data redundancy and recovery in case of failure (without RAID).
  • Expandable – it automatically transfers data to new nodes when the cluster is expanded.

These tools are the basis for a wide range of solutions, which give us the freedom to choose the most efficient way to access large data. In terms of price/performance, they allow us to build truly high-end solutions that are trusted by the biggest companies. Now I would like to mention a couple of them that are, in my opinion, crucial.

Apache HBase

Apache HBase is a non-relational (column-oriented) distributed database. HBase uses HDFS as its storage and supports both batch-oriented computing using MapReduce and point queries (random reads). For its typical use, response speed is more important than temporary consistency and completeness.

Apache Hive

It is a distributed data warehouse in which data is stored in HDFS. Using HQL (a query language based on SQL, translated on the fly into MapReduce tasks), Apache Hive makes it possible to create tables, charts and reports directly over the data, enabling a structured approach.

Apache ZooKeeper

Apache ZooKeeper is a centralized service for configuration management, naming, distributed synchronization and group services. It simplifies the management of the servers on which Hadoop is running and reduces the risk of inconsistent settings across servers.

Besides these solutions, a lot of other projects and extensions are based on the Hadoop framework. The non-relational database Apache Cassandra enjoys great popularity, as does Apache Ambari, a user-friendly web interface for cluster management and monitoring.

It is possible to achieve very good results with the Apache Hadoop family of tools at very low cost. Since Hadoop is designed to run on commodity hardware, there is no need to process data on specialized computers designed for working with large data volumes.
