Free trial *Internet Service Required

Running Hadoop Jobs on Windows Azure, Importing Data from Windows Azure Marketplace, and Analysing the Data with the Excel Hive Add-In

This tutorial shows how to run MapReduce programs in a cluster using Apache™ Hadoop™-based Services for Windows Azure in two ways and how to use Excel to analyze data imported into the cluster employing Hive-based connectivity.

The first way to run a MapReduce program is with a Hadoop jar file by using the Create Job UI. The second way is with a query by using the fluent API layered on Pig that is provided by the Interactive Console. The first approach uses a MapReduce program written in Java; the second uses a script written in Javascript. The tutorial also shows how to upload files to the HDFS cluster that are needed as input for a MapReduce program and how to read the MapReduce output files from the HDFS cluster to examine the results of an analysis.

The Windows Azure Marketplace collects data, imagery, and real-time web services from leading commercial data providers and authoritative public data sources. It simplifies the purchase and consumption of a wide variety of data that includes demographic, environment, financial, retail, and sports data. This tutorial shows how to upload these data into a Hadoop on Windows Azure and how to query it using Hive scripts.

A key feature of Microsoft Big Data Solution is the integration of Hadoop with components of Microsoft Business Intelligence (BI). A good example of Hadoop integration with Microsoft BI is the ability for Excel to connect to the Hive data warehouse framework in the Hadoop cluster. This tutorial shows how to use Excel via the Hive ODBC driver to access and view data in the cluster.

You will learn:

  • How to run a basic Java MapReduce program using a Hadoop jar file.
  • How to upload input files to the HDFS cluster and read output files from the HDFS cluster.
  • How to run a JavaScript MapReduce script with a query by using the fluent API on Pig that is provided by the Interactive JavaScript Console.
  • How to import data from DataMarket into an Hadoop on Windows Azure cluster by using the Interactive Hive Console.
  • How to use Excel to query data stored in an Hadoop on Windows Azure cluster.

This tutorial is composed of the following segments:

  1. How to run a basic Java MapReduce program using a Hadoop jar file with the Create Job UI.
  2. How to run a JavaScript MapReduce script using the Interactive Console.
  3. How to import data with the Hive Interactive Console from DataMarket.
  4. How to connect to and query Hive data in a cluster from Excel.

 

Setup and configuration

You must have an account to access Hadoop on Windows Azure and have created a cluster to work through this tutorial. To obtain an account and create an Hadoop cluster, follow the instructions outlined in the Getting started with Microsoft Hadoop on Windows Azure section of the Introduction to Apache Hadoop-based Sevice for Windows Azure topic.

 

How to run a basic Java MapReduce program using a Hadoop jar file with the Create Job UI

From your Account page, click on the Create Job icon in the Your Tasks section to bring up the Create Job UI. CreateJobUI

To run a MapReduce program, specify the Job Name and the JAR File to use. Parameters are added to specify the name of the MapReduce program to run, the location of input and code files, and an output directory.

To see a simple example of how this interface is used to run the MapReduce job, let's look at the Pi Estimator sample. Return to your Account page. Scroll down to the Samples icon in the Manage your account section and click it.

From your Account page, scroll down to the Samples icon in the Manage your account section and click it.

Click the Pi Estimator sample icon in the Hadoop Sample Gallery. PiEstimatorSample

On the Pi Estimator page, information is provided about the application and downloads that are available for Java MapReduce programs and the jar file that contains the files needed by Hadoop on Windows Azure to deploy the application.

Click the Deploy to your cluster button on the right side to deploy the files to the cluster. PiEstimatorCreateJob

The fields on the Create Job page are populated for you in this example. The first parameter value defaults to "pi 16 10000000". The first number indicates how many maps to create (default is 16) and the second number indicates how many samples are generated per map (10 million by default). So this program uses 160 million random points to make its estimate of Pi. The Final Command is automatically constructed for you from the specified parameters and jar file.

To run the program on the Hadoop cluster, simply click the blue Execute job button on the right side of the page.

The status of the job is displayed on the page and changes to Completed Successfully when it is done. The result is displayed at the bottom of the Output(stdout) section. For the default parameters, the result is Pi = 3.14159155000000000000 which is accurate to 8 decimal place, when rounded.

PiEstimatorSampleResult

 

How to run a JavaScript MapReduce script using the Interactive Console

This segment shows how to run a MapReduce job with a query by using the fluent API layered on Pig that is provided by the Interactive Console. This example requires an input data file. The WordCount sample that you use here has already had this file uploaded to the cluster. But the sample does require that the .js script be uploaded to the cluster and you use this step to show the procedure for uploading files to HDFS from the Interactive Console.

First download a copy of the WordCount.js script to your local machine. Store it locally to upload it to the cluster. Click here and save a copy of the WordCount.js file to your local ../downloads directory. In addition download the The Notebooks of Leonardo Da Vinci, available here.

To get to the Interactive JavaScript console, return to your Account page. Scroll down to the Your Cluster section and click the Interactive Console icon to bring up the Interactive JavaScript console. InteractiveJsPage

To upload the JavaScript.js file to the cluster, enter the upload command fs.put() at the js> console and select the Wordcount.js form your downloads folder, for the Destination parameter use ./WordCount.js/.

WordCountSampleUploadWindow
Click the Browse button for the Source, navigate to the ../downloads directory and select the WordCount.js file. Enter the Destination value as shown and click the Upload button.

Repeat this step to upload the davinci.txt file by using ./example/data/ for the Destination.

Execute the MapReduce program from the js> console by using the following command:

pig.from("/example/data/davinci.txt").mapReduce("WordCount.js", "word, count:long").orderBy("count DESC").take(10).to("DaVinciTop10Words")

Scroll to the right and click view log if you want to observe the details of the job's progress. This log also provides diagnostics if the job fails to complete.

To display the results in the DaVinciTop10Words directory once the job completes, use the file = fs.read("DaVinciTop10Words") command at the js> prompt.

WordCountSampleReadTop10

 

How to import data with the Hive Interactive Console from DataMarket

Open the Windows Azure Marketplace page in a browser and sign in with a valid Windlows Live ID.

DataMarketFrontPage

Click the MyAccount tab and complete the Registration form to open a subscription account.

DataMarketRegistration

Note the value of the default Account key assigned to your account. Account keys are used by applications to access your Windows Azure Marketplace dataset subscriptions.

DataMarketAccountKeys

Click the Data menu icon in the middle of the menu bar near the top of the page. Enter "crime" into the search the marketplace box on the upper right of the page and Enter.

DataMarketCrimeDataSearch

Select the 2006-2008 Crime in the United States (Data.gov) date.

DataMarketCrimeData

Press the SUBSCRIBE button on the right side of the page. Note that there is no cost for subscribing. Agree to the conditions on the Sign Up page and click the Sign Up button.

DataMarketCrimeDataExplore

This brings up the RECEIPT page. Press the EXPLORE THIS DATASET button to bring up a window where you can build your query.

DataMarketQueryCrimeData

Press the RUN QUERY button on the right side of the page to run the query without any parameters. Note the name of the query and the name of the table, and then click the DEVELOP tab to reveal the query that was auto generated. Copy this query.

DataMarketQueryCrimeResult

Return to your Hadoop on Windows Azure Account page, scroll down to the Your Cluster section and click on Manage Cluster icon.
ManageClosterIconOnAccountPage

Select the DataMarket icon option for importing data from Windows Azure DataMarket.

ManageClusterDataMarketOption

Enter the subscription User name and passkey, Query, and Hive table name obtained from your DataMarket account. Your user name is the email used for you Live ID. The value for the passkey is the account key default value assigned to you when you opened your Marketplace account. It can also be found as the Primary Account Key value on your Marketplace Account Details page. After the parameters are entered, press the Import Data button.

ImportFromDataMarket

The progress made importing is reported on the Data Market Import page that appears.

CrimeDataImportProgress

When the import task completes, return to your Account page and select the Interactive Console icon from the Your Cluster section. Then press the Hive option on the console page.

crimedatahiveconsole

Enter this code snippet -

create table crime_results as select city, max(violentcrime) 
as maxviolentcrime from crime_data 
group by city order by maxviolentcrime desc limit 10

Then press the Evaluate button.

CrimeDataHiveConsoleWithQuery

 

How to connect to and query Hive data in a cluster from Excel

Return to your Account page and select the Open Ports icon from the Your Cluster section to open the Configure Ports page. Open the ODBC Server on port 10000 by clicking its Toggle button.

OpenODBCPort10000

Return to your Account page and select the Downloads icon from the Manage your account section and select the appropriate .msi file to install the Hive ODBC drivers and Excel Hive Add-In.

DownloadLinkHiveODBC

Select the Run anyway option from the SmartScreen Filter window that pops up.

RunawayHiveODBC

Open Excel after the installation completes. Under data menu click the Hive Panel.

ExcelDataMenuHivePanel

Press the Enter Cluster Details button on the Hive Query panel on the left.

ExcelHiveCredentials

Provide a description for your cluster, enter localhost for the Host value and 10000 for the port number. Enter your credentials for your Hadoop on Windows Azure cluster in the Unsername/Password option in the Authentication section. Then click OK.

ExcelODBCHiveSetupWindow

From the Hive Query panel, select crime_results from Select the Hive Object to Query menu. Then check city and maxviolentcrime in the Columns section.
Click on the HiveQL to reveal the query: select city, mazviolentcrime from crime_results limit 200

Click on Execute Query button.

ExcelExecuteHiveQuery

You can see the cities with the most violent crime. From the Insert menu, select the Bar option to insert a bar chart on to the page to obtain a visualization of the data.

ExcelHiveQueryResults


 

Summary

In this tutorial, you have seen two ways to run MapReduce jobs by using the Hadoop on Windows Azure portal. One used the Create Job UI to run a Java MapReduce program by using a jar file. The other used the Interactive Console to run a MapReduce job by using a .js script within a Pig query. You have also seen how to upload this data into Hadoop on Windows Azure and query it using Hive scripts from the Interactive Console. Finally, you have seen how to use Excel via the Hive ODBC driver to access and view data that is stored in the HDFS cluster.

Rss Newsletter