Beginner's Tutorial for the Hadoop File System with Python


Introduction to HDFS

The Hadoop Distributed File System (HDFS) is the first and most essential concept of Hadoop. It is a Java-based distributed file system whose design is based on the Google File System, and it is used to store large amounts of data on clusters of commodity hardware. It is also known as the storage layer of Hadoop.

Features of HDFS:

  • Reliability : The Hadoop file system provides highly reliable data storage and can hold hundreds of petabytes of data. Data is stored in blocks, and the blocks are kept on nodes that are grouped into racks within a cluster. A cluster can contain a large number of such nodes. Replicas of the blocks are created on different machines in the cluster for fault tolerance, so data stays available to users without loss.

  • Fault Tolerance : Fault tolerance is how a system handles unfavorable situations. The Hadoop file system is highly fault tolerant because of its block-based design. The data in HDFS is divided into blocks, and multiple copies of each block are created on different machines. This replication is configurable and is done to avoid loss of data. If a machine holding a block goes down, the client can access the data from another machine that holds a copy of that block.

    HDFS creates replicas of data blocks on different racks, so if a machine fails, the user can access the data from a slave node in a different rack.

  • High Availability : The Hadoop file system is highly available. The block architecture is designed to keep data available: block replication provides access to data when a machine fails. Whenever a client wants to access data, it can retrieve it from the nearest node in the cluster that holds a copy. During a machine failure, data can be accessed from the replicated blocks on another slave node in another rack of the cluster.

  • Replication : Replication is an essential feature of the Hadoop file system. It was added to prevent data loss caused by hardware failures, node crashes, etc. HDFS creates replicas of blocks on different machines in the cluster and regularly maintains them. The default replication factor is three, i.e. the cluster holds three copies of each block.

  • Scalability : The Hadoop file system is highly scalable. As the volume of data grows, the resources required in the cluster, such as CPU, memory and disk, also grow. When the data volume is high, more machines are added to the cluster.

  • Distributed Storage : HDFS is a distributed file system. It stores files as blocks of a fixed size, and these blocks are stored across a cluster of several machines. HDFS follows a Master-Slave architecture in which the slave nodes (also called Data Nodes) form the cluster, which is managed by the master node (also called the Name Node).
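
The block and replication ideas in the list above can be made concrete with a little arithmetic. The sketch below is only an illustration, not HDFS code: the function name `plan_blocks` is hypothetical, and the 128 MB block size is an assumption (it is a common default in recent Hadoop releases, but it is configurable and is not stated in this tutorial). The replication factor of three is the default mentioned above.

```python
import math

BLOCK_SIZE_MB = 128      # assumed block size; configurable in a real cluster
REPLICATION_FACTOR = 3   # default replication factor mentioned above


def plan_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                replication=REPLICATION_FACTOR):
    """Return (number of blocks, size of last block, raw storage used) for one file."""
    if file_size_mb <= 0:
        return 0, 0, 0
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block only occupies as much space as the data it holds.
    last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb
    # Every block is stored `replication` times across the cluster.
    raw_storage_mb = file_size_mb * replication
    return num_blocks, last_block_mb, raw_storage_mb


blocks, last_block, raw = plan_blocks(500)
print(blocks, last_block, raw)
```

Under these assumptions, a 500 MB file is split into four blocks (three full blocks and one of 116 MB) and occupies 1500 MB of raw storage across the cluster.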

Architecture of HDFS


As mentioned earlier, HDFS follows a Master-Slave architecture in which the master node is called the Name Node and the slave nodes are called Data Nodes. The Name Node and the Data Node(s) are the building blocks of HDFS.

There is exactly one Name Node and a number of Data Nodes. The Data Nodes hold the blocks of files in a distributed manner. The Name Node is responsible for managing the blocks of files and for allocating and deallocating storage for the file blocks.

  1. Master/Name Node: The Name node stores the metadata of the whole file system. This includes where each block of a file and its replicas are stored, the number of blocks of data, the access rights of different users for a particular file, the date of creation, the date of modification, etc. All the Data nodes send a Heartbeat message to the Name node at a fixed interval to indicate that they are alive. Each Data node also sends a block report to the Name node, which contains information about all the file blocks on that particular Data node.

    There are 2 files associated with the Name node:

    • FsImage: It stores an image (snapshot) of the file system state held by the Name node.

    • EditLogs: It stores the recent changes made to the file system, along with the files and blocks that they affect.

    The Name node is also responsible for maintaining the replication factor of the file blocks. In addition, if a Data node fails, the Name node removes it from the cluster, handles the reallocation of resources and redirects the traffic to another Data node.

  2. Slave/Data Node: A Data node stores the data in the form of blocks of files. All read and write operations on files are performed on the Data nodes and managed by the Name node. All the Data nodes send a heartbeat message to the Name node to indicate their health. The default interval for the heartbeat is 3 seconds, but it can be changed as needed.
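
The heartbeat mechanism described above can be sketched as a toy model. This is not how the Name node is implemented: the class `HeartbeatMonitor`, its methods, the node names and the 30-second timeout are all hypothetical and chosen only for illustration. Only the 3-second heartbeat interval comes from the text.

```python
HEARTBEAT_INTERVAL_S = 3   # default interval mentioned above
DEAD_AFTER_S = 30          # hypothetical timeout, chosen for this example only


class HeartbeatMonitor:
    """Toy stand-in for the Name node's view of which Data nodes are alive."""

    def __init__(self, dead_after_s=DEAD_AFTER_S):
        self.dead_after_s = dead_after_s
        self.last_seen = {}  # data node name -> time of last heartbeat (seconds)

    def heartbeat(self, node, now):
        """Record that `node` reported in at time `now`."""
        self.last_seen[node] = now

    def live_nodes(self, now):
        """Return the nodes whose last heartbeat is recent enough."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen <= self.dead_after_s)

    def dead_nodes(self, now):
        """Return the nodes that have been silent for longer than the timeout."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.dead_after_s)


monitor = HeartbeatMonitor()
# Both nodes report at t=0; only datanode-1 keeps sending heartbeats afterwards.
monitor.heartbeat("datanode-1", now=0)
monitor.heartbeat("datanode-2", now=0)
for t in range(HEARTBEAT_INTERVAL_S, 61, HEARTBEAT_INTERVAL_S):
    monitor.heartbeat("datanode-1", now=t)

print(monitor.live_nodes(now=60))
print(monitor.dead_nodes(now=60))
```

In this model, a node that stops sending heartbeats is reported by `dead_nodes` once the timeout has passed, which is the point at which the Name node would stop directing traffic to it.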

HDFS Commands

Given below are the basic HDFS commands:

  • HDFS get command: This command is used to copy data from the Hadoop file system to the local file system.

    Syntax: hdfs dfs -get <source> <local_destination>

    Example: hdfs dfs -get /users/temp/file.txt "This PC/Desktop/"

  • HDFS put command: This command is used to copy data from the local file system to the Hadoop file system.

    Syntax: hdfs dfs -put <local_source> <destination>

    Example: hdfs dfs -put "This PC/Desktop/file.txt" /users/temp/

  • HDFS ls command: This command is used to list the contents of a directory. When no path is given, it lists the current user's home directory in HDFS.

    Syntax: hdfs dfs -ls [<path>]

    Example: hdfs dfs -ls

  • HDFS mkdir command: This command is used to create a new directory.

    Syntax: hdfs dfs -mkdir /directory_name

    Example: hdfs dfs -mkdir /my_new_directory

  • HDFS du command: This command is used to check the size of a file or directory. The -s option prints one summarized total instead of one line per entry.

    Syntax: hdfs dfs -du -s /path/to/file

    Example: hdfs dfs -du /my_new_directory/small_file

There are many more commands in HDFS. Given above are just the basic ones.
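
When these commands are assembled from a program, it helps to keep each argument as a separate list item so that paths containing spaces stay intact. The sketch below uses hypothetical helper names (`hdfs_command`, `get_cmd`, `put_cmd`, `mkdir_cmd`, `du_cmd`) and only builds and prints command lines; it does not run them. `shlex.join` requires Python 3.8 or later.

```python
import shlex


def hdfs_command(*args):
    """Build the argument list for an `hdfs dfs` command (hypothetical helper)."""
    return ["hdfs", "dfs", *args]


def get_cmd(source, local_destination):
    return hdfs_command("-get", source, local_destination)


def put_cmd(local_source, destination):
    return hdfs_command("-put", local_source, destination)


def mkdir_cmd(directory):
    return hdfs_command("-mkdir", directory)


def du_cmd(path, summary=False):
    # -s asks for one summarized total instead of one line per entry.
    flags = ["-du", "-s"] if summary else ["-du"]
    return hdfs_command(*flags, path)


# shlex.join shows each command as it would be typed in a POSIX shell,
# quoting any argument that contains spaces.
print(shlex.join(get_cmd("/users/temp/file.txt", "This PC/Desktop/")))
print(shlex.join(mkdir_cmd("/my_new_directory")))
print(shlex.join(du_cmd("/my_new_directory/small_file", summary=True)))
```

Keeping the arguments in a list means a local path such as `This PC/Desktop/` is passed as a single argument rather than being split at the space.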


Running HDFS commands using Python

import subprocess as sp
# The subprocess library in Python allows the user to start new processes,
# connect to their input/output/error pipes and obtain their return codes.

# Function containing the functionality to run an HDFS command
def run_hdfs_command(arguments):
    # join() connects the given arguments using the delimiter it is called on
    # and returns a string
    print('HDFS command: {}'.format(' '.join(arguments)))
    # Popen() starts the process; PIPE redirects its standard output and
    # standard error to pipes that we can read from
    command = sp.Popen(arguments, stdout=sp.PIPE, stderr=sp.PIPE)
    # communicate() waits for the process to finish and reads its output and errors
    (output, errors) = command.communicate()
    return (output, errors)

# HDFS get command in Python
(output, error) = run_hdfs_command(['hadoop', 'fs', '-get', 'source', 'local_destination'])

# HDFS put command in Python
(output, error) = run_hdfs_command(['hadoop', 'fs', '-put', 'local_source', 'destination'])

# We can give different HDFS commands to the function run_hdfs_command()
# in the format shown above.
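
The function above returns the raw output and error streams but does not look at the return code. A minimal variant that does, assuming Python 3.7 or later, is sketched below. The name `run_command` is hypothetical, and the demonstration at the end runs the Python interpreter itself rather than `hadoop`, so that the sketch can be tried on a machine without a Hadoop installation.

```python
import subprocess
import sys


def run_command(arguments):
    """Run a command and return (return_code, stdout, stderr) as text."""
    result = subprocess.run(
        arguments,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the captured bytes to str
    )
    return result.returncode, result.stdout, result.stderr


# Stand-in command: ask the current Python interpreter to print a line.
code, out, err = run_command(
    [sys.executable, "-c", "print('hello from a child process')"]
)
if code == 0:
    print(out.strip())
else:
    print("command failed:", err.strip())

# With Hadoop installed, the same helper could be called as, for example:
# code, out, err = run_command(["hdfs", "dfs", "-ls", "/"])
```

A non-zero return code signals that the command failed, in which case the error text is usually the most useful thing to show or log.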
