Running Hadoop Jobs on Windows Azure, Importing Data from Windows Azure Marketplace, and Analyzing the Data with the Excel Hive Add-In
This tutorial shows two ways to run MapReduce programs on a cluster by using Apache™ Hadoop™-based Services for Windows Azure, and how to use Excel to analyze data imported into the cluster through Hive connectivity.
The Windows Azure Marketplace collects data, imagery, and real-time web services from leading commercial data providers and authoritative public data sources. It simplifies the purchase and consumption of a wide variety of data, including demographic, environmental, financial, retail, and sports data. This tutorial shows how to import such data into a Hadoop on Windows Azure cluster and how to query it by using Hive scripts.
A key feature of the Microsoft Big Data Solution is the integration of Hadoop with components of Microsoft Business Intelligence (BI). A good example of this integration is the ability for Excel to connect to the Hive data warehouse framework in the Hadoop cluster. This tutorial shows how to use Excel via the Hive ODBC driver to access and view data in the cluster.
You will learn:
- How to run a basic Java MapReduce program using a Hadoop jar file.
- How to upload input files to the HDFS cluster and read output files from the HDFS cluster.
- How to import data from DataMarket into a Hadoop on Windows Azure cluster by using the Interactive Hive Console.
- How to use Excel to query data stored in a Hadoop on Windows Azure cluster.
This tutorial is composed of the following segments:
- How to run a basic Java MapReduce program using a Hadoop jar file with the Create Job UI.
- How to run a JavaScript MapReduce job with the Interactive Console.
- How to import data with the Hive Interactive Console from DataMarket.
- How to connect to and query Hive data in a cluster from Excel.
Setup and configuration
You must have an account to access Hadoop on Windows Azure and must have created a cluster to work through this tutorial. To obtain an account and create a Hadoop cluster, follow the instructions outlined in the Getting started with Microsoft Hadoop on Windows Azure section of the Introduction to Apache Hadoop-based Service for Windows Azure topic.
How to run a basic Java MapReduce program using a Hadoop jar file with the Create Job UI
From your Account page, click on the Create Job icon in the Your Tasks section to bring up the Create Job UI.
To run a MapReduce program, specify the Job Name and the JAR File to use. Parameters are added to specify the name of the MapReduce program to run, the location of input and code files, and an output directory.
To see a simple example of how this interface is used to run a MapReduce job, let's look at the Pi Estimator sample.
From your Account page, scroll down to the Samples icon in the Manage your account section and click it.
Click the Pi Estimator sample icon in the Hadoop Sample Gallery.
The Pi Estimator page provides information about the application, along with downloads for the Java MapReduce program and the jar file that contains the files Hadoop on Windows Azure needs to deploy the application.
Click the Deploy to your cluster button on the right side to deploy the files to the cluster.
The fields on the Create Job page are populated for you in this example. The first parameter value defaults to "pi 16 10000000". The first number indicates how many maps to create (16 by default) and the second indicates how many samples are generated per map (10 million by default), so the program uses 160 million random points to estimate Pi. The Final Command is constructed automatically from the specified parameters and jar file.
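The generated command is equivalent to invoking the examples jar directly from a Hadoop command prompt. As a sketch (the examples jar file name here is illustrative and varies with the Hadoop release on the cluster):
hadoop jar hadoop-examples.jar pi 16 10000000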
To run the program on the Hadoop cluster, simply click the blue Execute job button on the right side of the page.
The status of the job is displayed on the page and changes to Completed Successfully when it is done. The result is displayed at the bottom of the Output (stdout) section. For the default parameters, the result is Pi = 3.14159155000000000000, which matches Pi to five decimal places.
How to run a JavaScript MapReduce job with the Interactive Console
This segment shows how to run a MapReduce job with a query by using the fluent API layered on Pig that the Interactive Console provides. The example requires a .js script that implements the MapReduce program and an input data file; uploading both also demonstrates the procedure for moving files into HDFS from the Interactive Console.
First, download a copy of the WordCount.js script to your local machine so that you can upload it to the cluster: click here and save the WordCount.js file to your local ../downloads directory. In addition, download The Notebooks of Leonardo da Vinci, available here.
Enter fs.put() at the js> prompt to open the upload dialog. Click the Browse button for the Source field, navigate to the ../downloads directory, and select the WordCount.js file. Use ./WordCount.js for the Destination parameter and click the Upload button.
Repeat this step to upload the davinci.txt file by using ./example/data/ for the Destination.
Execute the MapReduce program from the js> console by using the following command:
pig.from("/example/data/davinci.txt").mapReduce("WordCount.js", "word, count:long").orderBy("count DESC").take(10).to("DaVinciTop10Words")
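In this command, "word, count:long" describes the schema of the (key, value) pairs that the script emits. For reference, the WordCount.js script defines JavaScript map and reduce functions roughly along these lines (a sketch based on the published sample; your downloaded copy may differ in detail):
var map = function (key, value, context) {
    // Split each input line into words and emit a (word, 1) pair per word.
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};
var reduce = function (key, values, context) {
    // Sum the counts emitted for each distinct word.
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};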
Scroll to the right and click view log if you want to observe the details of the job's progress. This log also provides diagnostics if the job fails to complete.
To display the results in the DaVinciTop10Words directory once the job completes, enter the following command at the js> prompt:
file = fs.read("DaVinciTop10Words")
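If you want a quick visualization without leaving the console, the js> console in this release also includes charting helpers; assuming the parse() and graph.bar() helpers are available, something like the following draws a bar chart of the ten words:
data = parse(file.data, "word, count:long")
graph.bar(data)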
How to import data with the Hive Interactive Console from DataMarket
Open the Windows Azure Marketplace page in a browser and sign in with a valid Windows Live ID.
Click the My Account tab and complete the registration form to open a subscription account.
Note the value of the default Account key assigned to your account. Account keys are used by applications to access your Windows Azure Marketplace dataset subscriptions.
Click the Data menu icon in the middle of the menu bar near the top of the page. Enter "crime" into the search the marketplace box on the upper right of the page and press Enter.
Select the 2006-2008 Crime in the United States (Data.gov) dataset.
Press the SUBSCRIBE button on the right side of the page. Note that there is no cost for subscribing. Agree to the conditions on the Sign Up page and click the Sign Up button.
This brings up the RECEIPT page. Press the EXPLORE THIS DATASET button to bring up a window where you can build your query.
Press the RUN QUERY button on the right side of the page to run the query without any parameters. Note the name of the query and the name of the table, and then click the DEVELOP tab to reveal the query that was auto-generated. Copy this query.
Return to your Hadoop on Windows Azure Account page, scroll down to the Your Cluster section, and click the Manage Cluster icon.
Select the DataMarket icon option for importing data from Windows Azure DataMarket.
Enter the subscription User name, Passkey, Query, and Hive table name obtained from your DataMarket account. The user name is the email address of your Live ID. The passkey is the default Account key assigned to you when you opened your Marketplace account; it can also be found as the Primary Account Key value on your Marketplace Account Details page. After the parameters are entered, press the Import Data button.
The DataMarket Import page appears and reports the progress of the import.
When the import task completes, return to your Account page and select the Interactive Console icon from the Your Cluster section. Then press the Hive option on the console page.
Enter the following HiveQL snippet, which creates a crime_results table containing the ten cities with the highest maximum incidence of violent crime:
create table crime_results as select city, max(violentcrime)
as maxviolentcrime from crime_data
group by city order by maxviolentcrime desc limit 10
Then press the Evaluate button.
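When the evaluation completes, you can inspect the new table from the same console. A quick sanity check might look like this (assuming the import step created the crime_data table and the query above created crime_results):
show tables
select * from crime_results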
How to connect to and query Hive data in a cluster from Excel
Return to your Account page and select the Open Ports icon from the Your Cluster section to open the Configure Ports page. Open the ODBC Server on port 10000 by clicking its Toggle button.
Return to your Account page, click the Downloads icon in the Manage your account section, and select the appropriate .msi file to install the Hive ODBC driver and the Excel Hive Add-In.
Select the Run anyway option from the SmartScreen Filter window that pops up.
Open Excel after the installation completes. On the Data menu, click the Hive Panel.
Press the Enter Cluster Details button on the Hive Query panel on the left.
Provide a description for your cluster, enter localhost for the Host value and 10000 for the port number. Enter your credentials for your Hadoop on Windows Azure cluster in the Username/Password option in the Authentication section. Then click OK.
From the Hive Query panel, select crime_results from the Select the Hive Object to Query menu. Then check city and maxviolentcrime in the Columns section.
Click the HiveQL link to reveal the query:
select city, maxviolentcrime from crime_results limit 200
Click the Execute Query button.
You can now see the cities with the most violent crime. To visualize the data, select the Bar option from the Insert menu to add a bar chart to the page.
In this tutorial, you have seen two ways to run MapReduce jobs by using the Hadoop on Windows Azure portal: one used the Create Job UI to run a Java MapReduce program packaged in a jar file; the other used the Interactive Console to run a MapReduce job with a .js script in a Pig query. You have also seen how to import DataMarket data into Hadoop on Windows Azure and query it by using Hive scripts from the Interactive Console. Finally, you have seen how to use Excel via the Hive ODBC driver to access and view data stored in the HDFS cluster.