
Big Data Technologies Based on MapReduce and Hadoop

Since the World Wide Web became publicly available in the 1990s, the volume of information created globally has grown enormously, and the proliferation of online activity and content creation means new data is produced continuously. Users and consumers of this multimedia data need systems that can quickly retrieve relevant, contextualized results from vast data stores.

What Is Big Data?

Data sets that are too large and complex to be processed with conventional methods are called “big data” (Elmasri & Navathe, 2017). Sensors, social media platforms, and other unstructured data sources commonly supply the raw material for big data sets. Several shared properties make big data sets challenging to process with conventional tools: high volume, velocity, variety, and veracity.

  • Volume: the sheer amount of data generated. Consider the daily influx of posts and interactions produced by users on social media platforms.
  • Velocity: the rate at which new data is produced and gathered. Sensor data from automobiles, for instance, can arrive at around 1 GB per hour.
  • Variety: the various forms the data takes. Data collected from social media platforms, for example, may consist of text, photographs, and videos.
  • Veracity: how reliable the data is. Sensor data, for instance, may be more trustworthy than human-reported data.

Introduction to MapReduce and Hadoop

MapReduce

Google is credited with popularizing MapReduce, a parallel programming model whose map and reduce operations are borrowed from functional programming languages. MapReduce is a framework for processing massive data collections in parallel (Elmasri & Navathe, 2017): a Map function filters and sorts the data, and a Reduce function aggregates the intermediate results. MapReduce is used for tasks such as reversing web-link graphs, searching for patterns in data, and constructing inverted indexes.
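The model can be illustrated with a minimal, single-process sketch in Python (the function and variable names are illustrative, not part of any Hadoop API): the map function emits (word, 1) pairs, a shuffle step groups the pairs by key, and the reduce function sums each group, producing a word count.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_doc_id, text):
    # Map: emit a (word, 1) pair for every word in the document.
    for word in text.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all counts gathered for a single key.
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase over all input records.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # Shuffle/sort phase: group intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reduce call per distinct key.
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, (v for _, v in group)))
    return dict(output)

docs = [("d1", "big data needs big systems"), ("d2", "data data data")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'big': 2, 'data': 4, 'needs': 1, 'systems': 1}
```

In a real Hadoop job the map and shuffle phases run on many machines in parallel; this sketch only shows the data flow between the two user-supplied functions.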

Hadoop Releases

Hadoop is a framework for executing MapReduce applications across a distributed system. Several versions of Hadoop have been released, the most recent at the time of writing being Hadoop 2.0. Hadoop 2.0 provides a new MapReduce runtime that operates atop the YARN resource manager, and HDFS has gained federation support and improved availability.

Hadoop Distributed File System (HDFS)

HDFS Preliminaries

The Hadoop Distributed File System (HDFS) was designed to run on a cluster of inexpensive commodity machines. To favor high-throughput streaming access to data, it relaxes some POSIX requirements while borrowing concepts from the UNIX file system. By separating file system metadata from application data, HDFS speeds up access to massive datasets. File blocks are replicated across multiple DataNodes for fault tolerance and to allow processing to be colocated with storage.

Architecture of HDFS

HDFS follows a master-slave architecture. The NameNode, the master, maintains the file system namespace and coordinates access to files across the network. DataNodes handle client read and write requests and create, delete, and replicate blocks per the NameNode’s instructions. A cluster can serve thousands of HDFS clients and DataNodes simultaneously.
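A toy Python sketch can make this division of labor concrete. All names here are illustrative (not the HDFS API), and the round-robin placement is a stand-in for HDFS’s actual rack-aware replica policy; the point is that the NameNode holds only metadata — which blocks make up a file and where each replica lives — while the DataNodes hold the block contents.

```python
REPLICATION = 3  # HDFS's default replication factor

class NameNode:
    """Toy master: tracks the namespace and block locations, never file bytes."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.namespace = {}   # file path -> list of block ids
        self.block_map = {}   # block id  -> list of DataNode names

    def create_file(self, path, num_blocks):
        blocks = [f"{path}#blk{i}" for i in range(num_blocks)]
        self.namespace[path] = blocks
        for i, blk in enumerate(blocks):
            # Round-robin placement; real HDFS uses a rack-aware policy.
            replicas = [self.datanodes[(i + r) % len(self.datanodes)]
                        for r in range(REPLICATION)]
            self.block_map[blk] = replicas
        return blocks

    def locate(self, path):
        # Clients ask the NameNode for block locations, then read from or
        # write to the DataNodes directly.
        return {blk: self.block_map[blk] for blk in self.namespace[path]}

nn = NameNode(["dn1", "dn2", "dn3", "dn4"])
nn.create_file("/logs/day1", num_blocks=2)
print(nn.locate("/logs/day1"))
# {'/logs/day1#blk0': ['dn1', 'dn2', 'dn3'], '/logs/day1#blk1': ['dn2', 'dn3', 'dn4']}
```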

File I/O Operations and Replica Management in HDFS

HDFS uses a write-once approach: files can be read but not edited in place. During a write, data flows through a pipeline of DataNodes in 64-KB packets, each carrying a checksum. On reads, data is fetched from the replica nearest to the client to reduce network load. When a corrupt block is detected, the NameNode is informed and reconstruction from a healthy replica begins.
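The packet-and-checksum idea can be sketched in Python. This is a simplification under stated assumptions: the helper names are hypothetical, and CRC32 stands in for HDFS’s own checksum mechanism — the sketch only shows how per-packet checksums let a reader detect corruption.

```python
import zlib

PACKET_SIZE = 64 * 1024  # 64-KB packets, as in the HDFS write pipeline

def write_packets(data):
    # Split the byte stream into packets and attach a CRC32 checksum to each,
    # loosely mimicking how the write pipeline ships checksummed packets.
    return [(chunk, zlib.crc32(chunk))
            for chunk in (data[i:i + PACKET_SIZE]
                          for i in range(0, len(data), PACKET_SIZE))]

def read_packets(packets):
    # On read, recompute each checksum; a mismatch marks the data as corrupt.
    # (In HDFS the client would then report the bad replica to the NameNode.)
    out = bytearray()
    for chunk, checksum in packets:
        if zlib.crc32(chunk) != checksum:
            raise IOError("corrupt packet detected")
        out.extend(chunk)
    return bytes(out)

data = b"x" * (150 * 1024)            # 150 KB -> 3 packets (64 + 64 + 22 KB)
packets = write_packets(data)
assert read_packets(packets) == data  # round trip succeeds
```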

HDFS Scalability

The NameNode uses about 200 bytes of RAM per block, plus another 200 bytes for each i-node. The metadata for 200 million blocks referenced by 100 million files would therefore occupy more than 60 GB of RAM. For a 10,000-node cluster to store 60 PB of data, each node must hold about 6 TB, which eight 0.75-TB drives provide.
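These figures can be checked with simple arithmetic; all constants below come from the text, and the 60 GB figure is a lower bound before any runtime overhead.

```python
# Back-of-the-envelope check of the NameNode memory and cluster sizing figures.
BYTES_PER_BLOCK_META = 200   # NameNode RAM per block
BYTES_PER_INODE_META = 200   # NameNode RAM per i-node (file)
files = 100_000_000
blocks = 200_000_000

# NameNode RAM needed for metadata alone:
ram_bytes = blocks * BYTES_PER_BLOCK_META + files * BYTES_PER_INODE_META
print(ram_bytes / 1e9)       # 60.0 GB -> "more than 60 GB" once overhead is added

# Storing 60 PB across 10,000 nodes:
per_node_tb = 60_000 / 10_000   # 60 PB = 60,000 TB
print(per_node_tb)              # 6.0 TB per node
print(per_node_tb / 0.75)       # 8.0 -> eight 0.75-TB drives per node
```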

The Hadoop Ecosystem

The best-known Hadoop components are the MapReduce programming model, the Hadoop runtime environment, and the Hadoop Distributed File System. The Hadoop ecosystem also includes many supplementary projects that extend these core pieces, most of them popular and actively used Apache open-source projects. The high-level interfaces Pig and Hive simplify data processing and analysis (Elmasri & Navathe, 2017): Pig provides a dataflow language, while Hive offers an SQL-like query language. Oozie schedules and manages workflows from one central location. Sqoop is a library and runtime environment for moving data between HDFS and relational databases. HBase, a column-oriented key-value store, is built on top of HDFS.

Hadoop v2 alias YARN

Rationale behind YARN

Hadoop v1 suffers from reliability and efficiency problems and lacks multitenancy, which hinders its adoption in business settings (Elmasri & Navathe, 2017). Hadoop v2, commonly known as YARN, was developed to fix these problems while remaining compatible with existing applications, and it is now the standard way to run Hadoop programs.

YARN as a Data Service Platform

YARN’s goal is to let several data processing applications share the same Hadoop cluster, exploiting the cluster’s capacity and the data locality provided by HDFS (Elmasri & Navathe, 2017). This flexibility makes Hadoop a data service platform supporting use cases well beyond traditional MapReduce batch processing.

Challenges Faced by Big Data Technologies

While big data technologies are widely used for analytics, experts have raised concerns about several open difficulties (Elmasri & Navathe, 2017). These challenges include, but are not limited to, information heterogeneity, privacy and confidentiality, better human interfaces and visualization, and inconsistent or missing data.

Summary

Big data has benefited significantly from new frameworks and systems, among them Hadoop/MapReduce, which now runs on YARN. For YARN to fulfil its potential as a general data services platform it must overcome the challenges above, but its prospects are strong, and it is likely to shape the big data industry significantly in the coming years.

References

Anagnostopoulos, C., & Triantafillou, P. (2017). Query-driven learning for predictive analytics of data subspace cardinality. ACM Transactions on Knowledge Discovery from Data (TKDD), 11(4), 1-46.

Bhuiyan, M. Z. A., Zaman, A., Wang, T., Wang, G., Tao, H., & Hassan, M. M. (2018, May). Blockchain and big data to transform healthcare. In Proceedings of the International Conference on Data Processing and Applications (pp. 62-68).

Caicedo, C. V., Prieto, Y., Pezoa, J. E., Sobarzo, S. K., & Ghani, N. (2019). A Novel Framework for SDN teaching and research: A Chilean University case study. IEEE Communications Magazine, 57(11), 67-73.

Elmasri, R., & Navathe, S. B. (2017). Fundamentals of database systems (7th ed.). Pearson.

Karamoozian, A., Wu, D., Chen, C. P., & Luo, C. (2019). An approach for risk prioritization in construction projects using analytic network process and decision-making trial and evaluation laboratory. IEEE Access, 7, 159842-159854.

Mahmoud, M. M., Rodrigues, J. J., Ahmed, S. H., Shah, S. C., Al-Muhtadi, J. F., Korotaev, V. V., & De Albuquerque, V. H. C. (2018). Enabling technologies on a cloud of things for smart healthcare. IEEE Access, 6, 31950-31967.

Rizwan, A., Zoha, A., Zhang, R., Ahmad, W., Arshad, K., Ali, N. A., … & Abbasi, Q. H. (2018). A review on the role of nano-communication in future healthcare systems: A big data analytics perspective. IEEE Access, 6, 41903-41920.

Stankovic, J. A. (2016). Research directions for cyber-physical systems in wireless and mobile healthcare. ACM Transactions on Cyber-Physical Systems, 1(1), 1-12.
