MapReduce is a popular programming model widely used in data services and cloud frameworks. It plays a central role in the processing of big data sets, using distributed algorithms and potentially massive parallel operations. Though it is based on quite simple principles, it is recognised by many engineers as one of the most important tech innovations in recent years. However, some of the originators of the technology have now abandoned it in favour of different data processing frameworks. So is MapReduce still as vital as it was a few years ago?
MapReduce is used for a wide variety of data-intensive tasks that are vital to today’s cloud operations and to rapidly developing machine learning applications. It can be used for searching, sorting, indexing and numerous other statistical procedures. Given the prevalence of cloud-based systems and frameworks, MapReduce is also well-suited to multi-cluster, parallel and other distributed environments where performance is key.
For this article, we’ve teamed up with leading technology company, Luxoft, to understand what MapReduce is and how it’s used, especially in big data and machine learning.
Introduction to MapReduce
In a world where parallel processing, cloud infrastructures and vast datasets have become the norm, MapReduce plays an essential role in helping engineers to sort through the masses of information. It has features that are well-suited for massive networked infrastructures.
What is MapReduce?
MapReduce replaces some functions that might previously be carried out by search operations in relational database management systems (RDBMS). But MapReduce is more than simply a replacement for databases. It is not a data storage system as such, and it can interact equally with databases or filesystems. It is, rather, an algorithmic framework for processing datasets into more usable structures.
Though there are several different implementations, the term ‘MapReduce’ was first used by Google for its own proprietary web indexing technology. Google used it to supersede a diverse set of algorithms that it had developed piecemeal over many years. MapReduce brought greater clarity and simplicity to its indexing and analysis. Google filed a patent on it in 2004, but numerous open-source implementations of the model have since been developed, including Apache Hadoop and CouchDB.
How it works
At its core, MapReduce is a relatively simple data processing framework that utilises a Split-Apply-Combine strategy. This approach, also known as ‘divide and conquer’, means breaking a problem down into small chunks, processing and then reassembling the pieces. For MapReduce, the problem domain is typically huge and often unstructured datasets, where it is necessary to find or create some order to make them usable.
The functionality can be broken down, as the name suggests, into two key processes, ‘map’ and ‘reduce’, that are commonly used in functional programming. Crucially, both these procedures can be carried out either sequentially or in parallel, yielding enormous speed advantages when implemented in cloud-based infrastructures.
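As a purely illustrative sketch (not any particular framework’s API), a sequential word count in Python shows the two phases; the function names are invented for the example:

```python
from functools import reduce

# Map phase: turn each document into (word, 1) pairs.
def map_words(document):
    return [(word, 1) for word in document.split()]

# Reduce phase: fold the pairs together, summing counts that share a key.
def reduce_counts(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

documents = ["the cat sat", "the cat ran"]
pairs = [pair for doc in documents for pair in map_words(doc)]
totals = reduce(reduce_counts, pairs, {})
print(totals)  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

The same two functions could just as easily be applied to separate chunks of the input in parallel, which is precisely what distributed MapReduce implementations do.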
Advantages and disadvantages
MapReduce has proved more broadly useful than initially intended. Google first introduced (and patented) it just for their indexing operations. However, it’s since proved invaluable for a whole host of large-scale data problems. Its logic is used to model everything from human genome decoding to natural language processing. And it powers the analysis of the huge datasets that are needed for today’s AI applications and machine learning systems. In each of these situations, its main benefit is its amenability to massively parallel processing architectures.
However, some have considered it too disk-oriented, potentially involving a lot of resource-intensive IO operations. It is also suggested that its tendency to rebuild indexes rather than make incremental changes may be unnecessary and wasteful. For these reasons, Google itself has moved on to utilise different frameworks like Percolator and MillWheel, which leverage streaming and updates rather than batch processing, avoiding full index reconstruction. However, Google’s use cases may not be typical. Not all data processors favour this approach and MapReduce still has many adherents.
Clearly, implementations need care with their partition functions, since excess data writing at this stage can seriously impact performance. But MapReduce still offers a simple model with excellent speed, scalability and flexibility as well as the resilience and security that are important for public-facing applications.
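To make that partitioning concern concrete, here is a small, hypothetical hash partitioner; the skewed key distribution is invented to show how imbalance concentrates records (and therefore intermediate writes and transfers) on a single partition:

```python
# Hypothetical hash partitioner: assigns each intermediate key to one of
# num_reducers partitions so that identical keys always land together.
def partition(key, num_reducers):
    return hash(key) % num_reducers

# A skewed key distribution sends most records to one partition, the kind of
# imbalance that inflates intermediate write and transfer costs.
keys = ["user_1"] * 90 + ["user_2"] * 5 + ["user_3"] * 5
loads = {}
for k in keys:
    p = partition(k, 4)
    loads[p] = loads.get(p, 0) + 1
print(loads)  # one partition receives the bulk of the records
```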
MapReduce and Big Data
Big Data is where MapReduce really comes into its own. Wrangling huge datasets can be computationally expensive, so this is where it’s vital to make the best use of resources.
Is MapReduce the answer to your big data problem?
The modern digital landscape comprises vast amounts of data. Machine learning relies on detecting patterns in these enormous datasets, so a technology that can bring order to unstructured data is imperative for effective systems. But such systems need to work at scale, which means leveraging massively parallel processing and extensive cloud distribution. MapReduce makes use of distributed file systems rather than centralised processing to make the necessary performance gains.
Distributed algorithms: how they work
As we already noted, the key operations of the MapReduce framework are quite simple in principle and inherently suited to distributed processing. Processing can occur across nodes within a local network or may be distributed globally through cloud infrastructures. Often, each node has access only to data gathered locally, thus reducing transfer costs across the network. The process then rests on the following three steps (a short sketch follows the list):
- The processing nodes use the map function to sort or filter data based on chosen properties.
- Nodes then redistribute the intermediate data so that all values sharing a key generated by the map function end up together. This is known as the shuffle step.
- Data is then processed with the reduce operation to generate summarised results like counts, averages or other output requirements.
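As an illustration only, here is a minimal Python sketch of the three steps run over two simulated nodes; the data and names are invented for the example:

```python
from collections import defaultdict

# Each "node" holds only its local slice of the data (web log lines, say).
node_data = [
    ["GET /home", "GET /cart"],       # node 1
    ["GET /home", "POST /checkout"],  # node 2
]

# 1. Map: each node independently emits (key, value) pairs.
mapped = [[(line.split()[1], 1) for line in lines] for lines in node_data]

# 2. Shuffle: pairs are regrouped so all values for a key sit together.
shuffled = defaultdict(list)
for node_output in mapped:
    for key, value in node_output:
        shuffled[key].append(value)

# 3. Reduce: each key's values are summarised (here, a simple count).
results = {key: sum(values) for key, values in shuffled.items()}
print(results)  # {'/home': 2, '/cart': 1, '/checkout': 1}
```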
Apache Hadoop is a good example of a framework that employs this approach. It’s an open-source framework for distributed computing that uses the MapReduce model to enable scaling from a single machine to potentially thousands of cloud-based nodes. It supports distributed shuffles, works with distributed file systems such as HDFS to increase performance, and is designed to detect and handle node failures caused by hardware problems.
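Hadoop itself is written in Java, but its Hadoop Streaming utility lets the map and reduce stages be any executable that reads from standard input and writes to standard output. As a hedged sketch, a word-count job could be expressed as the following pair of scripts (file names are illustrative):

```python
#!/usr/bin/env python3
# mapper.py -- emits one "word<TAB>1" line per word (Hadoop Streaming convention).
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming delivers input sorted by key, so counts can be
# summed over each run of identical keys.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

In a typical invocation, scripts like these would be passed to the streaming jar via its -mapper, -reducer, -input and -output options.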
How Big e-Commerce uses MapReduce
An important use case of MapReduce is e-commerce. Companies like Amazon, eBay and Alibaba use the framework along with cloud technologies to generate sales initiatives. A typical example is the identification of targetable products based on users’ interests or previous buying behaviour. Such assessments can draw on a wide array of data from many different sources. Amazon’s Elastic MapReduce (EMR) is a commonly used implementation for such applications.
EMR uses a cluster of Amazon EC2 nodes. The master node coordinates the distribution of tasks and data to the core and task nodes that carry out the processing. The data is typically stored as files on each node and passes sequentially through the processing stages. On completion, the results can be written to a location such as an Amazon S3 bucket. EMR thus provides an easily deployed, efficient and scalable solution for big e-commerce data processing requirements. Crucially, EMR can also use server-side and client-side encryption to protect sensitive customer data.
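For a flavour of how such a cluster might be launched programmatically, here is a hedged sketch using boto3, AWS’s Python SDK. The bucket names, script locations, instance types and release label are placeholders, and the IAM roles shown are the EMR defaults; real accounts will differ:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is a placeholder

response = emr.run_job_flow(
    Name="word-count-demo",                 # hypothetical job name
    ReleaseLabel="emr-6.15.0",              # pick a current EMR release
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # tear the cluster down when the step ends
    },
    Steps=[{
        "Name": "streaming-word-count",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hadoop-streaming",
                     "-files", "s3://example-bucket/mapper.py,s3://example-bucket/reducer.py",
                     "-mapper", "mapper.py", "-reducer", "reducer.py",
                     "-input", "s3://example-bucket/input/",
                     "-output", "s3://example-bucket/output/"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])  # identifier of the newly created cluster
```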
Uses in Machine Learning and Multicore
Machine learning depends on large datasets for its effectiveness. But the processing involved can be computationally expensive. Traditionally, programmers looking for performance gains have endeavoured to speed up individual algorithms. While great ingenuity has yielded some impressive results, there are limits to how much can be done in a single computational space.
We are by now very familiar with the concept of multicore processors, but they are only as useful as the algorithms developed to make good use of them. For many years, programmers working on real-time systems have needed to consider carefully how to manage multi-threaded applications. The solutions are not always obvious. In particular, it has taken some work to optimise machine learning processes to use multicore processors. This extends naturally to massive parallelisation across networks. MapReduce has been key in the development of such approaches.
It is an indication of MapReduce’s flexibility and broad applicability that it can be used as readily in multicore frameworks as in networked parallelisation. With multicore use, however, less failover provision is required, so the architecture can be lighter. The MapReduce framework enables a number of algorithmic approaches, such as locally weighted linear regression (LWLR), naive Bayes (NB) and neural networks, that have been important for the development of ML technologies.
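As a toy illustration of the multicore case, the standard-library sketch below maps shards of labelled documents to per-class word counts on separate cores, then reduces the partial counts into the kind of sufficient statistics a naive Bayes classifier would use; the dataset and names are invented:

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_shard(shard):
    """Map: count (label, word) occurrences within one shard of labelled text."""
    counts = Counter()
    for label, text in shard:
        for word in text.split():
            counts[(label, word)] += 1
    return counts

def merge(a, b):
    """Reduce: combine two partial count tables by summing their counts."""
    a.update(b)
    return a

if __name__ == "__main__":
    shards = [
        [("spam", "win money now"), ("ham", "meeting at noon")],
        [("spam", "win a prize"), ("ham", "lunch at noon")],
    ]
    with Pool(processes=2) as pool:
        partials = pool.map(map_shard, shards)   # map runs on separate cores
    totals = reduce(merge, partials, Counter())  # reduce merges the partial counts
    print(totals[("spam", "win")])               # 2
```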
As data collection continues to expand, amalgamation and processing techniques need to scale to match so that ML applications can benefit. MapReduce is still proving an invaluable framework in the cloud for ML and data analytics of all sorts that power today’s e-Commerce, IoT and numerous web services. To discover more, check out Luxoft to see how they use MapReduce in innovative real-world technologies.