Hadoop: Uncovering Hidden Patterns to Make Intelligent Data-Driven Decisions

Priyadarsanie Ramasubramanian, Head - Engineering and Technology, Tesco

Big data is the new competitive advantage for large-scale businesses, and Hadoop is a very compelling technology solution. The ability to examine large amounts of data to uncover hidden patterns, correlations and other insights is key to staying ahead of the game and making intelligent data-driven decisions. Organisations are interested in data preparation and discovery, advanced analytics, in-memory computing, parallelism and data security, and Hadoop caters to all these needs. Whilst there are many big data solution choices, Hortonworks Hadoop wins as a 100 percent open source distributed computing platform. Hadoop also brings significant cost advantages when it comes to storing large amounts of data.

Supply Chain at Tesco focuses on three KPIs: improving availability, reducing wastage and stockholding. The key to getting these KPIs right is forecasting sales accurately, which requires complex predictions over a sea of historical sales data and the many factors that influence sales, such as price, promotions, events, season and weather. With 7000+ stores across countries, several thousand products ranged in each store, and millions of customers served each week, this data amounts to a really large volume. Predictions on data of this scale need a distributed computing system like Hadoop. We’ve started the journey on a Hadoop platform.

The building blocks of Hadoop are:

1) HDFS, which stores these millions of records across multiple machines;

2) The MapReduce programming model, which extracts useful information from this data in a timely manner by processing it in parallel across multiple machines;

3) YARN, the resource manager, which allocates resources such as memory efficiently and lets the infrastructure scale with growth.

Many small computers (nodes) come together and work as a single entity (a cluster) to meet this need. Companies like Google, Facebook and Amazon are building vast server farms where the actual processing of data takes place in parallel. In a typical cycle, users define the map and reduce algorithms using the available MapReduce APIs and trigger the jobs; YARN decides where to run them and launches them on the cluster, and the results are stored back into HDFS or into a query store.


Dwelling a little more on MapReduce: the processing of data is broken down into two stages, the map, which runs on multiple nodes, and the reduce, which takes the output of the map phase and processes it further to produce the desired result. The map process runs on each node against the data stored locally on that node. It works on one record at a time and produces key-value pairs. The reduce phase collates these key-value pairs, brings all the values for a given key together on a single node, and combines them (sum, average, etc.). The programmer only has to write two functions (the map and the reduce), and Hadoop takes care of the rest (fault tolerance, replication, distribution).
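The two stages can be sketched in a few lines of plain Python. This is only an illustration of the programming model, not Hadoop's API: the sales records, the `NODE_BLOCKS` layout and the function names are all hypothetical, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical sales records (product, units), split into the blocks
# that two different nodes might hold.
NODE_BLOCKS = [
    [("milk", 3), ("bread", 2), ("milk", 1)],
    [("bread", 5), ("eggs", 12), ("milk", 2)],
]


def map_record(record):
    """Map: turn one input record into (key, value) pairs."""
    product, units = record
    yield product, units


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_key(key, values):
    """Reduce: combine all values seen for one key (here, a sum)."""
    return key, sum(values)


def run_job(blocks):
    # Each block is mapped independently; on a real cluster this would
    # happen in parallel on the nodes that hold the blocks.
    mapped = chain.from_iterable(
        map_record(record) for block in blocks for record in block
    )
    grouped = shuffle(mapped)
    return dict(reduce_key(key, values) for key, values in grouped.items())


totals = run_job(NODE_BLOCKS)
print(totals)
```

Only `map_record` and `reduce_key` correspond to what the programmer writes; the grouping and job orchestration stand in for the work the framework does.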

At Tesco, where the algorithms to be run on the big data are complex and large scale, Spark is preferred over MapReduce. The difference is that Spark processes data in memory, which makes it much faster: less time is spent moving data in and out of disk, leading to a significant reduction in latency. It is also more capable of supporting streaming data alongside distributed processing. We are again spoilt for choice on the programming language for Spark: Scala, Python, Java and R. Python scores very well for the out-of-the-box machine learning and statistics packages needed. Python is more analytically oriented while Scala is more engineering oriented, but both are great languages for building data science applications.
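Why in-memory processing helps a multi-pass algorithm can be shown with a toy comparison in plain Python. This is not Spark code: the file, the helper names and the five-pass loop are assumptions made purely for illustration. It simply counts how often the input has to be read from disk under each approach.

```python
import csv
import os
import tempfile


def write_sales(path):
    """Write a tiny hypothetical sales file to stand in for data on disk."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["product", "units"])
        writer.writerows([["milk", 3], ["bread", 2], ["eggs", 6]])


def load_units(path):
    """Read the units column from disk."""
    with open(path, newline="") as f:
        return [float(row["units"]) for row in csv.DictReader(f)]


def disk_based(path, passes):
    """Re-read the input on every pass, like a chain of disk-backed jobs."""
    reads, total = 0, 0.0
    for _ in range(passes):
        units = load_units(path)
        reads += 1
        total = sum(units)
    return total, reads


def in_memory(path, passes):
    """Read the input once, then run every pass against the in-memory copy."""
    units = load_units(path)
    reads, total = 1, 0.0
    for _ in range(passes):
        total = sum(units)
    return total, reads


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sales.csv")
    write_sales(path)
    disk_result = disk_based(path, passes=5)
    memory_result = in_memory(path, passes=5)

print(disk_result, memory_result)
```

Both approaches produce the same answer; the in-memory version touches the disk once instead of once per pass, which is the effect that matters for iterative workloads such as forecasting models.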

As more people use Hadoop, the need for more flexibility has led to the growth of several tools that work alongside it, such as the following:

HBase: A database on Hadoop

Hive: Provides a query interface to data stored on HDFS

Pig: Converts unstructured data into structured data

Oozie: A workflow management system

Flume/Sqoop: Enable moving data into and out of Hadoop

In summary, Hadoop is a great platform for all businesses that need faster, better decision making, which its speed and in-memory analytics provide. With the ability to gauge customer needs and satisfaction through analytics comes the power to give customers what they want.