As mentioned in the previous section, Big Data is usually stored on thousands of commodity servers, so traditional programming models such as the Message Passing Interface (MPI) [40] cannot handle it effectively. Big Data processing is a group of techniques and programming models implemented to extract useful information from the large sets of data available, in order to aid and support decision making. It involves data organization, modification, storage, and final presentation of the desired information. In this chapter, we first provide an overview of existing Big Data processing and resource management systems.

The scale involved is striking. Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day, and a single jet engine can generate … Big Data is also more diverse, containing systematic, partially structured, and unstructured data (diversity). Processing information like this illustrates why Big Data has become so important: most data collected now is unstructured and requires different storage and processing than that found in traditional relational databases.

Several engines target streaming workloads: Apache Flink is an engine that processes streaming data, and Apache Samza also processes distributed streams of data (Figure 11.7 illustrates the application process of Apache Storm). One benefit of these newer systems is that the computers monitor themselves, as opposed to having a person monitoring them 24/7 to ensure the system does not drop out. Even so, for system administrators the deployment of data-intensive frameworks onto computer hardware can still be a complicated process, especially if an extensive stack is required. Future higher-level APIs will continue to allow data-intensive frameworks to expose optimized routines to application developers, enabling increased performance with minimal effort from the end user. The improvement of the MapReduce programming model is generally confined to a particular aspect, which is why a shared-memory platform was needed.

On the data warehouse side, preparing and processing Big Data for integration requires standardizing the data, which improves its quality. The analysis stage is the data discovery stage, in which Big Data is processed and prepared for integration into the structured analytical platforms or the data warehouse. Context processing relates to exploring the context in which data occurs within the unstructured or Big Data environment. When we examine data from the unstructured world, there are many probabilistic links that can be found within the data and its connections to the data in the structured world, and there are many techniques to link structured and unstructured data sets using metadata and master data; where such a connection is definite, it represents a strong link. Two further points belong on any metadata checklist: if the repository is to be replicated, the extent of the replication should be noted, and the extent to which metadata maintenance is integrated into the warehouse development life cycle, including the versioning of metadata, should be recorded.

Historically, Doug Cutting created Lucene in 1999, making it free by way of Apache in 2001. One of the key lessons from MapReduce is that it is imperative to develop a programming model that hides the complexity of the underlying system but provides flexibility by allowing users to extend functionality to meet a variety of computational requirements.
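To make that model concrete, here is a minimal, self-contained sketch of the MapReduce idea in Python: the map step emits key-value pairs, the pairs are grouped by key, and the reduce step aggregates each group. The two-line sample input is illustrative, not a real distributed job.

```python
# Word count in the MapReduce style: "map" emits (word, 1) pairs,
# the pairs are grouped by key, and "reduce" sums each group.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Emit a (key, value) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Sum the values for each key; pairs must be sorted by key."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    text = ["big data processing", "big data systems process big data"]
    for word, count in reduce_phase(map_phase(text)):
        print(word, count)
```

In a real Hadoop deployment the grouping and sorting between the two phases is what the framework performs transparently — which is exactly the complexity-hiding lesson described above.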
Processed data is often in the form of tables, diagrams, and reports. Big data is a combination of structured, semistructured, and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modeling, and other advanced analytics applications. The benefit gained from the ability to process large amounts of information is the main attraction of big data analytics. Organizations often turn to these platforms because storing and processing the data becomes too expensive in traditional databases, but that is not the only reason: data volumes are many times larger than before (volume), and the rapid generation of Big Data places more real-time requirements on the underlying access platform. If you could run a demand forecast taking into account 300 factors rather than 6, could you predict demand better? The IDC predicts Big Data revenues will reach $187 billion in 2019. This semester, I'm taking a graduate course called Introduction to Big Data.

While the problem of working with data that exceeds the capacity of a single machine is not new, the scale at which it now arises is. Data is acquired from multiple sources, including real-time systems, near-real-time systems, and batch-oriented applications, and it needs to be processed across several program modules simultaneously. It is worth noting that several of the best Big Data processing tools are developed in open source communities.

Hadoop allows for the efficient and cost-effective storage of large datasets. By using its file system, data is located close to the processing node, minimizing communication overhead. The existing Hadoop scheduling algorithms focus largely on fairness, and Hadoop can be further optimized with multicore processors and high-speed storage devices. Moreover, Starfish's Elastisizer can automate the decision making for creating optimized Hadoop clusters, using a mix of simulation and model-based estimation to find the best answers to what-if questions about workload performance. Apache Storm is designed to easily process unbounded streams of data, and while Flink can handle batch processes, it does this by treating them as a special case of streaming data. Though Samza comes with Kafka and YARN, it also has a pluggable API allowing it to work with other messaging systems; the article "Storm vs. Spark vs. Samza" compares the three systems and describes Samza as underrated. The overall structure of such systems is similar to the general model discussed in the previous section, consisting of a source, a cluster of processing nodes, and a sink, and any type of data can be transferred directly between nodes. If coprocessors are to be used in future Big Data machines, the data-intensive framework APIs will, ideally, hide this from the end user, and future Big Data applications will require access to an increasingly diverse range of data sources. Lastly, some open questions are also proposed and discussed.

Several threads from the warehouse discussion continue here. Another type of linkage, more common in processing Big Data, is called a dynamic link. A best-practice strategy is to adopt the concept of a master repository of metadata. The model shows the relationship that John Doe has with the company — whether he is an employee or not — where the probability of a relationship is either 1 or 0, respectively. For analytics, we would store the data in columnar format because sequential reads on disk are fast, and what we want to do is read one column at a time.
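As a toy illustration of that columnar point, the following Python sketch contrasts a row layout and a column layout for the same made-up records; averaging one field in the columnar form touches only a single contiguous list.

```python
# Row-oriented vs. column-oriented layout for the same records.
# Averaging one field needs only one column when data is stored
# column-wise, instead of touching every full row.
rows = [
    {"user_id": 1, "age": 34, "city": "Oslo"},
    {"user_id": 2, "age": 28, "city": "Lima"},
    {"user_id": 3, "age": 45, "city": "Pune"},
]

# Column-oriented: one contiguous list per field.
columns = {
    "user_id": [1, 2, 3],
    "age": [34, 28, 45],
    "city": ["Oslo", "Lima", "Pune"],
}

# Row layout: every record is visited to pull out one field.
avg_row = sum(r["age"] for r in rows) / len(rows)

# Column layout: a single sequential read of the "age" column.
ages = columns["age"]
avg_col = sum(ages) / len(ages)

assert avg_row == avg_col
print(avg_col)  # 35.666...
```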
Different resource allocation policies can have significantly different impacts on performance and fairness. MapReduce was proposed by Google and further developed at Yahoo. Amazon Elastic MapReduce (EMR) provides the Hadoop framework on Amazon EC2 and offers a wide range of Hadoop-related tools. Samza uses a simple API: unlike the majority of low-level messaging-system APIs, it offers a simple, callback-based "process message" interface. Since Spring XD is a unified system, it has special components, referred to as taps and jobs, to address the differing requirements of batch processing and real-time stream processing of incoming data streams. Users should be able to write their application code and have the framework select the most appropriate hardware to run it on. I personally subscribe to the vision that data streaming can subsume many of today's batch applications, and Flink has added many features to make that possible.

In the Hadoop origin story, Cutting and Cafarella pulled the processing and storage components of the web crawler Nutch out of Lucene and applied them to Hadoop, along with the programming model MapReduce (developed by Google in 2004 and shared per the Open Patent Non-Assertion Pledge). This also laid the foundation for an alternative method of Big Data processing.

On the integration side, linkage of different units of data from multiple data sets is not in itself a new concept. In a nutshell, we will either discover extremely strong relationships or no relationships, and this is the primary difference between data linkage in Big Data and in RDBMS data. The presence of a strong linkage between Big Data and the data warehouse does not mean that a clearly defined business relationship exists between the environments; rather, it indicates that a type of join within some context is present. Big Data is distributed to downstream systems by processing it within analytical applications and reporting systems. Once the data is processed through the metadata stage, a second pass is normally required with the master data set and semantic library to cleanse the data that was just processed, along with its applicable contexts and rules; the end result is a trusted data set with a well-defined schema. To effectively create the metadata-based integration, a checklist helps create the roadmap: outline the objectives of the metadata strategy; define the scope of the metadata strategy; determine who will sign off on the documents and tests; and identify the constraints that exist today for processing metadata. The stages and their activities are described in detail in the following sections, including the use of metadata, master data, and governance processes.

Big Data means complex data whose volume, velocity, and variety are too great to be handled in traditional ways, and that complexity requires many algorithms to process the data quickly and efficiently. Big Data processing is a set of techniques or programming models for accessing large-scale data to extract useful information that supports decision making; the course provides an introduction to systems used to process Big Data and a broad introduction to the exploration and management of … In the next section we will discuss the use of machine learning techniques to process Big Data. Dryad, meanwhile, is a distributed execution engine that runs big data applications in the form of a directed acyclic graph (DAG).
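The DAG execution style that Dryad popularized can be sketched in a few lines of Python: vertices are processing steps, edges carry data between them, and execution follows a topological order of the graph. The vertex names and functions below are hypothetical.

```python
# A toy DAG job in the spirit of Dryad: each vertex consumes the
# outputs of its predecessors, and vertices run in topological order.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def extract():   return [3, 1, 4, 1, 5]          # source vertex
def double(xs):  return [2 * x for x in xs]       # transform vertex
def total(xs):   return sum(xs)                   # sink vertex

# Edges: "total" depends on "double", which depends on "extract".
dag = {"double": {"extract"}, "total": {"double"}}

results = {}
for vertex in TopologicalSorter(dag).static_order():
    if vertex == "extract":
        results[vertex] = extract()
    elif vertex == "double":
        results[vertex] = double(results["extract"])
    elif vertex == "total":
        results[vertex] = total(results["double"])

print(results["total"])  # 28
```

A real Dryad (or Storm) runtime would schedule independent vertices on different machines; the topological order is what guarantees each vertex has its inputs before it runs.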
Big Data is the buzzword nowadays, but there is a lot more to it. Spark [49], developed at the University of California at Berkeley, is an alternative to Hadoop designed to overcome the disk I/O limitations of earlier systems and improve their performance. Current data-intensive frameworks such as Spark have been very successful at reducing the amount of code required to create a specific application. There are also several new implementations of Hadoop intended to overcome its performance issues, such as slowness in loading data and the lack of data reuse [47,48]. This trend reveals that a simple Hadoop setup is not efficient for big data analytics, and new tools and techniques to automate provisioning decisions should be designed and developed; the implementation and optimization of the MapReduce model on distributed mobile platforms will also be an important research direction. We show that the proposed resource allocation policies can meet all desired properties and achieve good performance results.

Storm is a distributed real-time computation system whose applications are designed as directed acyclic graphs; Figure 11.7 represents its core concepts, and it manages the distributed environment and cluster state via Apache ZooKeeper. Samza is built on Apache Kafka for messaging and uses YARN for cluster resource management. Mesh controls and manages the flow, partitioning, and storage of big data throughout the data warehousing life cycle, which can be carried out in real time.

Though linkage processing is the best technique known today for processing textual and semistructured data, its reliance on quality metadata and master data, along with external semantic libraries, proves to be a challenge. Big Data is ambiguous by nature due to the lack of relevant metadata and context in many cases; even Big Data within the corporation exhibits this ambiguity, though to a lesser degree. Data is prepared in the analyze stage for further processing and integration, and data of different structures needs to be processed. The most important step in integrating Big Data into a data warehouse is the ability to use metadata, semantic libraries, and master data as the integration links. The linkage process can be repeated multiple times for a given data set, as the business rule for each component is different. When a query executes, it iterates through one part of the linkage in the unstructured data and then looks for the other part in the structured data. It is easy to process and create static linkages using master data sets. Consider two texts: "Blink University has released the latest winners list for Dean's list, at deanslist.blinku.edu" and "Contact the Dean's staff via deanslist.blinku.edu." The shared address becomes the linkage: it can be used to join the two texts and additionally connect the records to the student or dean's subject areas in the higher-education ERP platform.
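A rough Python sketch of this kind of linkage follows: addresses are extracted from unstructured text and used as join keys. Note that the sketch renders the shared address in email form (deanslist@blinku.edu) purely so that a standard regular expression applies; the documents and address are illustrative.

```python
# Metadata-based linkage: two unstructured texts are joined because
# they share the same extracted address.
import re

ADDRESS = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

texts = [
    "Blink University has released the latest winners list for "
    "Dean's list, at deanslist@blinku.edu",
    "Contact the Dean's staff via deanslist@blinku.edu",
]

# Build a linkage index: address -> list of document ids.
links = {}
for doc_id, text in enumerate(texts):
    for address in ADDRESS.findall(text):
        links.setdefault(address.lower(), []).append(doc_id)

# Any address appearing in more than one document forms a link.
for address, docs in links.items():
    if len(docs) > 1:
        print(f"{address} links documents {docs}")
```

The same index could then be joined against a master data table (students, deans) to connect the texts to the ERP platform's subject areas.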
Here is Gartner's definition, circa 2001 (which is still the go-to definition): Big Data is data that contains greater variety, arriving in increasing volumes and with ever-higher velocity. The term is also used to describe data storage and processing solutions that differ from traditional data warehouses.

The linkage here is both binary and probabilistic in nature. There are multiple types of probabilistic links, and depending on the data type and the relevance of the relationships, we can implement one or a combination of linkage approaches with metadata and master data. Standardization of data requires processing the data with master data components, and the situation is worse if a change is made from an application that is not connected to the current platform. Tagging is the process of applying a term to an unstructured piece of information, giving the data a metadata-like attribution. This is discussed in the next section.

First came Apache Lucene, which was, and still is, a free, full-text, downloadable search library. In 2002, after Lucene became popular, Doug Cutting was joined by Mike Cafarella to make improvements on it. The two developed the underlying systems and framework using Java, and then adapted Nutch to work on top of them.

Today a number of other well-known websites use Hadoop as well, and Spark is fast becoming another popular system for Big Data processing. Hadoop's software works with Spark's processing engine, replacing MapReduce. The smaller problems are solved in parallel, and the combined results then provide a final answer to the large problem; when a computer in the cluster drops out, the YARN component transparently moves its tasks to another computer. LinkedIn uses Samza, stating that it is critical to ensuring their members have a positive experience with the notifications and emails they receive from LinkedIn. Data access platform optimization is another active area, and the AWS Cloud offers a range of services and resources for Big Data processing. As a standalone processor, Spark does not come with its own distributed storage layer, but it can use Hadoop's distributed file system (HDFS).
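Assuming a PySpark installation and an illustrative HDFS path, a minimal job that reads from HDFS and counts words might look like the following sketch:

```python
# A minimal PySpark job: read a text file from HDFS and count word
# occurrences. The input path is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/sample.txt")
counts = (lines.flatMap(lambda line: line.split())   # split into words
               .map(lambda word: (word, 1))          # (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # sum per word

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```

The same program runs unchanged against a local file path, which is part of Spark's appeal: the storage layer is pluggable rather than built in.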
There are additional layers of hidden complexity that must be addressed as each system is implemented, since the complexities differ widely between systems and applications. Big Data processing is typically done on large clusters of shared-nothing commodity machines; this parallel processing improves the speed and reliability of the cluster, returning solutions more quickly and more dependably. Hadoop has become the most important platform for Big Data processing, while MapReduce on top of Hadoop is a popular parallel programming model; the main advantage of this programming model is its simplicity, so users can easily utilize it. The most popular platform is still Hadoop, and its development has initiated a new industry of products, services, and jobs. At present, HDFS and HBase can support both structured and unstructured data, and the file system is used as the source of data, to store intermediate processed results, and to persist the final calculated results. Kafka creates ordered, re-playable, partitioned, fault-tolerant streams, while YARN provides a distribution environment for Samza. The use of a GUI also raises other interesting possibilities, such as real-time interaction and visualization of datasets; this could also include pushing all or part of the workload into the cloud as needed. The various frameworks have a fair amount of compatibility and can be used experimentally, in a mix-and-match fashion, to produce the desired results. In the following, we review some tools and techniques that are available for Big Data analysis in datacenters.

Big Data is more than high-volume, high-velocity data, and linkage between data sets can take several forms. A static link is stable over time: the customer will always update his or her email address, so the linkage holds. A probabilistic link, by contrast, is based on the theory of probability: a relationship can potentially exist, but there is no binary confirmation of whether the probability is 100% or 10% (Figure 11.8). The higher the probability score, the more likely the relationship between the different data sets; the lower the score, the lower the confidence. Consider two texts: "long John is a better donut to eat" and "John Smith lives in Arizona." If we run a metadata-based linkage between them, the common word found is "John," and the two texts will be related even though there is no real probability of any linkage or relationship. This weakness can be overcome over a period of time as the data is processed effectively through the system multiple times, increasing the quality and volume of content available for reference processing, and adding metadata, master data, and semantic technologies will enable more positive trends in the discovery of strong relationships.

The biggest advantage of this kind of processing is the ability to process the same data for multiple contexts, and then to look for patterns within each result set for further data mining and data exploration; you can apply several rules for processing on the same data set based on the contextualization and the patterns you look for, and data from different regions needs to be processed as well. Consider the context carried by a message such as: "Dear sir, we are very sorry to inform you that due to your poor customer service we are moving our business elsewhere." Tagging — a common practice that has been prevalent on the Internet since 2003 for data sharing — is one way to capture that context as metadata.
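A minimal sketch of such rule-based tagging is shown below, applied to the complaint message above; the tag names and keyword lists are invented for illustration.

```python
# Rule-based tagging: each rule attaches a metadata-like tag when its
# keywords occur in the text. The same data set can be re-processed
# under different rule sets for different contexts.
TAG_RULES = {
    "complaint": ["sorry to inform", "poor customer service"],
    "churn_risk": ["moving our business elsewhere"],
}

def tag(text):
    """Return the set of tags whose keywords appear in the text."""
    found = set()
    lowered = text.lower()
    for tag_name, keywords in TAG_RULES.items():
        if any(keyword in lowered for keyword in keywords):
            found.add(tag_name)
    return found

message = ("Dear sir, we are very sorry to inform you that due to your "
           "poor customer service we are moving our business elsewhere.")
print(tag(message))  # {'complaint', 'churn_risk'}
```

Swapping in a different TAG_RULES dictionary reprocesses the same text under a new context — the multiple-contexts pattern described above.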
Since you have learned what Big Data is, it is important to understand how data comes to be categorized as Big Data. Big Data is a huge volume of data — structured, semistructured, or unstructured — that is difficult to store and manage with traditional databases. Every interaction on the … An introduction to Big Data therefore starts with questions such as: What is Big Data? Why Big Data? What does Big Data signify? Why are companies moving to Big Data from legacy systems, and is it worth learning Big Data technologies?

On the integration side, there is no special emphasis on data quality beyond the use of metadata, master data, and semantic libraries to enhance and enrich the data. Tagging creates a rich, nonhierarchical data set that can be used to process the data downstream in the process stage; the next step of processing is to link the data to the enterprise data set. The focus of this section was to provide readers with insights into how, by using a data-driven approach and incorporating master data and metadata, you can create the strong, scalable, and flexible data processing architecture needed for processing and integrating Big Data and the data warehouse. Figure 11.6 shows the example of departments and employees in a company, and Figure 11.7 shows an example of connecting Big Data with the data warehouse to create the next-generation data warehouse. For example, classifying all customer data in one group helps optimize the processing of unstructured customer data.

Among the platforms, Starfish is a self-tuning system that adapts to user requirements and system workloads without requiring users to configure or change settings or parameters; it also uses job profiling and workflow optimization to reduce the impact of unbalanced data during job execution. However, the computation in real applications often requires still higher efficiency. The goal of Spring XD is to simplify the development of big data applications. As an alternative system, Spark can circumvent MapReduce's imposed linear dataflow, providing a more flexible data screening system, and if you are processing streaming data in real time, Flink is the better choice. Mesh is a powerful big data processing framework that requires no specialist engineering or scaling expertise, and Amazon DynamoDB offers highly scalable NoSQL data stores with submillisecond response latency. Future APIs will need to hide this complexity from the end user and allow seamless integration of different data sources (structured and semi- or nonstructured) being read from a range of locations (HDFS, stream sources, and databases). Could a system of this type automatically deploy a custom data-intensive software stack onto the cloud when a local resource becomes full, and run applications in tandem with the local resource?

Apache Storm is a distributed real-time Big Data processing system designed to process vast amounts of data in a fault-tolerant and horizontally scalable manner, with very high ingestion rates [16]. It can process over a million tuples a second, per node, is highly scalable, and one of its main highlights is that it is a fault-tolerant, fast, no "Single Point of Failure" (SPOF) distributed application [17].
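The spout-and-bolt flavor of a Storm topology can be imitated with Python generators. This is only a single-process sketch of the idea, with hypothetical sensor data — not Storm's actual API.

```python
# A toy stream pipeline in the spirit of Storm: a "spout" emits an
# unbounded stream of tuples and "bolts" transform them one at a time.
import itertools
import random

def sensor_spout():
    """Hypothetical source: emits an endless stream of readings."""
    for i in itertools.count():
        yield {"sensor": i % 3, "value": random.uniform(0.0, 100.0)}

def filter_bolt(stream, threshold=50.0):
    """Pass through only readings above the threshold."""
    for tup in stream:
        if tup["value"] > threshold:
            yield tup

def format_bolt(stream):
    """Format each tuple for downstream consumers."""
    for tup in stream:
        yield f"sensor={tup['sensor']} value={tup['value']:.1f}"

# Wire the topology: spout -> filter bolt -> format bolt.
pipeline = format_bolt(filter_bolt(sensor_spout()))
for line in itertools.islice(pipeline, 5):  # take 5 for the demo
    print(line)
```

In Storm proper, each spout and bolt would run as many parallel tasks across the cluster, with ZooKeeper tracking cluster state.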
At the end of the course, you will be able to:
*Retrieve data from example database and big data management systems
*Describe the connections between data management operations and the big data processing patterns needed to utilize them in large-scale analytical applications
*Identify when a big data problem needs data integration
*Execute simple big data integration and processing on …

Hadoop's efficiency comes from working with batch processes set up in parallel. A MapReduce job splits a large dataset into independent chunks and organizes them into key-value pairs for parallel processing. Apache Pig, a structured query language (SQL)-like environment developed at Yahoo [41], is used by many organizations, including Yahoo, Twitter, AOL, and LinkedIn. This, in turn, can lead to a variety of alternative processing scenarios, which may include a mixture of algorithms and tools from the two systems; Cloudera is one example of a business replacing Hadoop's MapReduce with Spark. Data standardization occurs in the analyze stage, which forms the foundation for the distribute stage, where the data warehouse integration happens. Two further questions belong on the metadata checklist: who owns the metadata processes and standards, and how the maintenance of metadata is to be achieved. Looking ahead, APIs will need to continue to develop in order to hide the complexities of increasingly heterogeneous hardware, and future research is required to investigate methods to atomically deploy a modern big data stack onto computer hardware.

Storm is written in Clojure, an all-purpose language that emphasizes functional programming but is compatible with all programming languages. Another distribution technique involves exporting data as flat files for use in other applications, such as web reporting and content management platforms; linking a customer's electric bill with the data in the ERP system is an example of such a link in practice. At LinkedIn, instead of each application sending emails to members directly, all emails are sent through a central Samza email distribution system, which combines and organizes the email requests and then sends a single summarized email to the member, based on windowing criteria and specific policies.
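That digest pattern can be sketched in plain Python: bucket events per member into time windows, then emit one summary per bucket. The window length and event records below are illustrative.

```python
# Windowed digest: instead of one email per event, events are grouped
# per (member, time window) and summarized into a single message.
from collections import defaultdict

WINDOW_SECONDS = 3600  # hypothetical one-hour window

events = [
    {"member": "alice", "time": 100,  "text": "new connection request"},
    {"member": "alice", "time": 900,  "text": "post liked"},
    {"member": "bob",   "time": 200,  "text": "job suggestion"},
    {"member": "alice", "time": 5000, "text": "message received"},
]

# Bucket events by (member, window); each bucket becomes one email.
digests = defaultdict(list)
for event in events:
    window = event["time"] // WINDOW_SECONDS
    digests[(event["member"], window)].append(event["text"])

for (member, window), notes in sorted(digests.items()):
    body = "; ".join(notes)
    print(f"To {member} (window {window}): {len(notes)} updates: {body}")
```

In the real system the stream of events arrives through Kafka and the windowing state is managed by Samza; the grouping logic, however, is exactly this shape.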
Healthcare big data analytics drive quicker responses to emerging diseases and improve direct patient care, the customer experience, and administrative, insurance, and payment processing… There are multiple solutions for processing Big Data, and organizations need to compare them to find what suits their individual needs best. If you are processing data that is owned by the enterprise, such as contracts, customer data, or product data, the chances of finding matches with the master data are extremely high, and the data output from the standardization process can be easily integrated into the data warehouse. Classification helps to group data into subject-oriented data sets for ease of processing, while categorization is the external organization of data from a storage perspective, where the data is physically grouped by both its classification and its data type.

Future research should consider the characteristics of the Big Data system, integrating multicore technologies, multi-GPU models, and new storage devices into Hadoop for further performance enhancement of the system. The major feature of Spark that makes it unique is its ability to perform in-memory computations; a typical analytical query of the kind discussed earlier would require performing the average function on a single column.
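Assuming PySpark and an illustrative input file of one integer per line, the following sketch shows how caching keeps that single column in memory across two actions:

```python
# Spark in-memory computation: caching a dataset keeps it in memory
# across actions, so repeated passes avoid re-reading from disk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheDemo").getOrCreate()

numbers = (spark.sparkContext
                .textFile("hdfs:///data/ages.txt")  # illustrative path
                .map(lambda s: int(s))
                .cache())  # keep the parsed column in memory

count = numbers.count()          # first action materializes the cache
average = numbers.sum() / count  # second action reuses cached data

print(average)
spark.stop()
```

Without the cache() call, both actions would re-read and re-parse the file — the disk I/O limitation that Spark was designed to avoid.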
A few further points survive from the discussion of these systems. Highlights of Flink's 1.0 release include "savepoints," which record the state of the stream processor at certain points in time so that a job can later be restarted from that point, and the CEP (Complex Event Processing) library. In a Storm topology, the output of one bolt can be fed into another bolt as input. Spring XD likewise uses its own terms, such as the XD admin and XD nodes that carry out processing. Big data plays a role in all areas of human endeavour; and because Lucene maintains an index, a search immediately knows all the places where a given term has existed.
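The inverted-index idea behind that Lucene capability can be sketched in a few lines of Python; the two toy documents are invented.

```python
# An inverted index maps each term to the documents (and positions)
# where it occurs, so a search "immediately knows" every place a
# term exists without scanning the documents.
from collections import defaultdict

docs = {
    1: "big data needs new processing models",
    2: "stream processing handles unbounded data",
}

index = defaultdict(list)
for doc_id, text in docs.items():
    for position, term in enumerate(text.lower().split()):
        index[term].append((doc_id, position))

print(index["processing"])  # [(1, 4), (2, 1)]
print(index["data"])        # [(1, 1), (2, 4)]
```

At query time the index, not a document scan, answers the lookup — which is what made Lucene, and the systems that grew out of it, practical at scale.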