
Hadoop 3.3.6 Installation and Configuration on a standalone AMD64 running Ubuntu 20.04

Hadoop is a distributed data processing framework used for big data analytics.

  • Hadoop: Hadoop is based on the MapReduce processing model. It processes data in two phases: Map phase, where data is split into key-value pairs and processed in parallel across multiple nodes, and Reduce phase, where the results from the Map phase are aggregated and combined to produce the final output.

  • Spark: Spark also supports the MapReduce model, but it introduces the concept of Resilient Distributed Datasets (RDDs). RDDs are immutable distributed collections of objects that can be processed in parallel. Spark allows for in-memory data processing, reducing the need for costly disk I/O between stages, which can make it significantly faster than Hadoop's MapReduce for certain use cases.

How Hadoop Streaming works

Mapper Phase:

Hadoop reads data from the input files and divides them into chunks called InputSplits, typically one per HDFS block.

For each InputSplit, Hadoop launches an instance of the external program (e.g., a Python script) as the Mapper. The external Mapper reads data from stdin, processes it, and writes intermediate key-value pairs to stdout.

Shuffling and Sorting:

Hadoop collects the intermediate key-value pairs from all Mappers and performs a shuffling and sorting phase. The output of the Mapper is sorted by keys so that all values for the same key are grouped together.

Reducer Phase:

Hadoop launches instances of the external program as Reducers; each Reducer is handed the sorted stream of intermediate pairs for its partition of keys.

The external Reducer reads data from stdin, aggregates the values associated with each key, and writes the final key-value pairs to stdout.

Output:

The final output from the Reducers is collected and stored in the Hadoop Distributed File System (HDFS) or other output destinations.
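To make this concrete, below is a minimal sketch of what a streaming Mapper and Reducer for a word count might look like. This is an illustrative sketch only (it assumes Python 3 and tab-separated key-value pairs, the Hadoop Streaming default); the actual mapper.py and reducer.py used later in this post may differ.

#!/usr/bin/env python3
# mapper.py (sketch): emit "<word>\t1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py (sketch): sum the counts for each word.
# Hadoop's shuffle/sort guarantees that all lines with the same key arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Because streaming scripts simply read stdin and write stdout, the pipeline can be tested locally before submitting a job, for example: cat word_count_data.txt | ./mapper.py | sort | ./reducer.py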


To install Hadoop on Ubuntu and start all Hadoop daemons, follow these steps:

1. Update the system: Open a terminal and run the following command to update the package list and upgrade installed packages:

sudo apt update && sudo apt upgrade -y

2. Install Java: Hadoop requires Java to run. Install OpenJDK 8 using the following command:

sudo apt install openjdk-8-jdk

The following additional packages will be installed:

openjdk-8-jdk-headless openjdk-8-jre openjdk-8-jre-headless

...

...

...
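You can verify the installation with java -version. For this package, JAVA_HOME is typically /usr/lib/jvm/java-8-openjdk-amd64 on an AMD64 machine; that path is used again in step 8 below.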

3. Download the latest stable version of Hadoop, 3.3.6 (a roughly 730 MB file).

4. tar xfz hadoop-3.3.6.tar.gz

5. sudo mv hadoop-3.3.6 /usr/local/hadoop

6. Add the following environment variables to ~/.bashrc and source it:

export HADOOP_HOME=/usr/local/hadoop

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME
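Then reload the file so the variables take effect in the current shell, for example:

source ~/.bashrc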

7. cd /usr/local/hadoop/etc/hadoop

8. Set the following environment variable in hadoop-env.sh:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
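Note: for a single-node setup, two more things are usually required before the first start, otherwise you will run into the NameNode-related errors documented in the Troubleshooting section below: set fs.defaultFS in core-site.xml to the HDFS URI of the NameNode (hdfs://localhost:9000 in this setup), and format the NameNode once (this erases any existing HDFS data):

hdfs namenode -format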

9. start-all.sh


10. After successful installation, when you type

jps

you should see the following processes running

141399 ResourceManager

143044 NodeManager

184580 DataNode

184820 SecondaryNameNode

189725 NameNode


To run the Python word count application using Hadoop Streaming, follow these steps:

0. In the folder /shared/TRANZ1/CONTENT-TRAINING-TRANZ1/HADOOP-MAP-REDUCE you will find the files

mapper.py, reducer.py, word_count_data.txt

Make the scripts executable:

chmod 777 mapper.py reducer.py
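If word_count_data.txt is not yet in HDFS, create the target directory and copy the file in first (the directory name below is the one used by the commands that follow):

hdfs dfs -mkdir -p /word_count_in_python

hdfs dfs -put word_count_data.txt /word_count_in_python/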

1. Before each run, remove the output folder from HDFS (the job fails if the output directory already exists):

hdfs dfs -rm -r /word_count_in_python/output

2. Run Hadoop streaming for the Python word_count application

hadoop jar /shared/TRANZ1/CONTENT-TRAINING-TRANZ1/HADOOP-MAP-REDUCE/hadoop-streaming-2.7.3.jar -input /word_count_in_python/word_count_data.txt -output /word_count_in_python/output -mapper /shared/TRANZ1/CONTENT-TRAINING-TRANZ1/HADOOP-MAP-REDUCE/mapper.py -reducer /shared/TRANZ1/CONTENT-TRAINING-TRANZ1/HADOOP-MAP-REDUCE/reducer1.py
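Note: the streaming jar version used here (2.7.3) is the one kept in the shared training folder; Hadoop 3.3.6 also ships its own streaming jar, typically under $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar, which can be used instead.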


3. Check whether the output files were written to HDFS:

hdfs dfs -ls /word_count_in_python/output

4. Display the output file from HDFS:

hdfs dfs -cat /word_count_in_python/output/part-00000
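Each line of part-00000 contains a word and its count separated by a tab, the default key-value separator used by Hadoop Streaming.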

5. Other HDFS commands: to copy a file from the local filesystem into HDFS:

hdfs dfs -put ~/example.txt /word_count_in_python/output4


Troubleshooting


The various logs provided by Hadoop are a rich source of information for troubleshooting. They are kept in /usr/local/hadoop/logs.


TS1:

jps showed that the NameNode was not running.

netstat -tuln | grep 9000 returned nothing.


Check for NameNode Formatting: If this is the first time setting up Hadoop or if you suspect that the NameNode metadata might be corrupted, you can try formatting the NameNode using the following command (This will delete all HDFS data, so use with caution):


hdfs namenode -format

After formatting, try starting Hadoop again using start-dfs.sh.


start-dfs.sh

Starting namenodes on [dhruv]

netstat -tuln | grep 9000

tcp 0 0 192.168.0.102:9000 0.0.0.0:* LISTEN

Now the jps command lists all the running Hadoop processes

141399 ResourceManager

143044 NodeManager

184580 DataNode

184820 SecondaryNameNode

189725 NameNode

TS2:

If you get a "Connection refused" problem, refer to

https://cwiki.apache.org/confluence/display/HADOOP2/ConnectionRefused


TS3

ERROR

ERROR namenode.SecondaryNameNode: Failed to start secondary namenode java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.defaultFS): file:/// has no authority.

Solution

Check and modify the fs.defaultFS property: Look for the fs.defaultFS property within the core-site.xml file. It should be set to the HDFS URI of your active NameNode. For a pseudo-distributed setup on a single node, it would typically be:


<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>

Replace localhost with the hostname or IP address of your NameNode if you are running a distributed setup. The default port for the NameNode is 9000.

Retry starting the secondary namenode: After updating the core-site.xml file, try starting the secondary namenode again:


hdfs secondarynamenode

TS4

ERROR

Call From dhruv/192.168.0.102 to dhruv:9000 failed on connection exception: java.net.ConnectException: Connection refused;

Solution

A "Connection refused" on dhruv:9000 usually means that nothing is listening at the NameNode address, typically because the NameNode is not running (see TS1 above for starting it and, if needed, formatting it) or because fs.defaultFS points to the wrong host or port. The ConnectionRefused wiki page linked in TS2 lists further possible causes.


TS5

Problem

The jps command does not show the NameNode; only ResourceManager, NodeManager, DataNode, and SecondaryNameNode are running.

Solution

Start Hadoop NameNode: If the NameNode is not running, you can manually start it using the following command:


start-dfs.sh

If the problem persists, format the NameNode as described in TS1 (hdfs namenode -format; this deletes all HDFS data, so use with caution), then start Hadoop again with start-dfs.sh.

TS6

Error

Starting secondary namenodes [dhruv] dhruv: ERROR: Cannot set priority of secondarynamenode process 181356

Solution:

The error "Cannot set priority of secondarynamenode process" during the startup of the secondary namenode in Hadoop indicates that the process is unable to set its priority using the nice command due to insufficient permissions. Like the previous cases, this issue is usually caused by incorrect configurations or missing permissions for the user running Hadoop.

To resolve this error, follow the steps below:

  1. Ensure you are running Hadoop as a non-root user: Running Hadoop as the root user is not recommended for security reasons. Make sure you are logged in as a regular user with sufficient permissions to execute Hadoop processes.

  2. Grant permissions to set process priorities (using nice): You can grant the necessary permissions by adding the user to the sudo group and modifying the sudoers file to allow the specific command without a password prompt. Follow these steps:

sudo visudo

Add a line like the following, replacing <username> with the user running Hadoop:

<username> ALL=(ALL) NOPASSWD: /usr/bin/nice

3. Retry starting the secondary namenode:

hdfs secondarynamenode


TS7

Cannot run program "/shared/TRANZ1/CONTENT-TRAINING-TRANZ1/HADOOP-MAP-REDUCE/reducer1.py": error=13, Permission denied

Solution

chmod 777 /shared/TRANZ1/CONTENT-TRAINING-TRANZ1/HADOOP-MAP-REDUCE/reducer1.py
