
Combine HDFS and Linux Command to Query Data in Big Data Application

Introduction: HDFS and Linux commands for text processing

This tutorial shows how to combine Hadoop Distributed File System (HDFS) commands and Linux commands to query data in a Big Data application. Read the full post and try it out yourself.

HDFS commands interact with HDFS and other file systems supported by Hadoop. For instance, basic commands let us move, copy, and delete folders in HDFS. The main purpose of this blog is to show how to query and scan data in HDFS with these commands.

On Linux, we use basic commands such as grep, sort, and uniq to process text data. They work well with local files, but how can we use them with data stored in HDFS?

This blog shows how to combine the advantages of both sets of commands to query data in HDFS.
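Before touching HDFS, here is a minimal local sketch of the kind of text pipeline we will reuse throughout this post (the file name sample_local.txt and its contents are made up for illustration):

```shell
# Create a tiny local sample file (hypothetical name and contents)
printf '2;Music 2;Rock\n1;Music 1;Pop\n2;Music 2;Rock\n' > sample_local.txt

# grep filters matching lines; sort orders them; uniq -c counts duplicates
grep 'Rock' sample_local.txt
sort sample_local.txt | uniq -c
```

Later, the same pipelines run against HDFS simply by replacing the local file with the output of hadoop fs -text.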


Environment:

Java: JDK 1.7

Cloudera version: CDH 4.6.0

Initial steps

1. Prepare an input data file containing user music information. Open a new file in a Linux terminal:


Enter some input data in the format: id;music;viewTime;duration;price;musicType

1;Music 1;2;4;70;Pop
2;Music 2;3;5;66;Rock
3;Music 3;1;5;87;Classic
4;Music 4;3;5;90;Dance
5;Music 5;2;3;34;Rock
5;Music 5;2;3;34;Rock
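Instead of typing the records in an editor, the file (named file1 here, matching the put command in the next step) can also be created non-interactively with a heredoc. A sketch:

```shell
# Write the sample records to file1 (same data as above)
cat > file1 <<'EOF'
1;Music 1;2;4;70;Pop
2;Music 2;3;5;66;Rock
3;Music 3;1;5;87;Classic
4;Music 4;3;5;90;Dance
5;Music 5;2;3;34;Rock
5;Music 5;2;3;34;Rock
EOF

wc -l file1   # should report 6 records
```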

2. Put the local file into the Hadoop Distributed File System (HDFS) with these commands:

hadoop fs -mkdir -p /data/mysample/
hadoop fs -put file1 /data/mysample/

Script and verify the result

We will implement scripts for the queries below.

List all user information for users who listen to Rock or Pop:

hadoop fs -text /data/mysample/* | grep -E "(Rock|Pop)"

List all user information where viewTime/duration is less than 0.3:

hadoop fs -text /data/mysample/* | awk -F ";" '{if($3 != "" && $4 != "" && ($3/$4 < 0.3)) print $0;}'

Select music and music type from the data set (note that cut needs -d ";" to split on the semicolon delimiter):

hadoop fs -text /data/mysample/* | cut -d ";" -f2,6

Count the duplicate records in the data set:

hadoop fs -text /data/mysample/* | sort | uniq -c

List the first 5 records in natural order:

hadoop fs -text /data/mysample/* | head -n 5

List the last 5 records in natural order:

hadoop fs -text /data/mysample/* | tail -n 5

List all records with a price greater than 50:

hadoop fs -text /data/mysample/* | awk -F ";" '{if($5 != "" && ($5 > 50)) print $0;}'

Remove duplicate records (distinct) from the data set:

hadoop fs -text /data/mysample/* | sort | uniq
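The awk filters above can be checked locally before running them against HDFS, by piping sample records directly instead of hadoop fs -text. For instance, the viewTime/duration filter (field 3 divided by field 4):

```shell
# Ratio filter: keep records where viewTime/duration < 0.3
# 2/4 = 0.5 is filtered out; 1/5 = 0.2 passes
printf '1;Music 1;2;4;70;Pop\n3;Music 3;1;5;87;Classic\n' | \
  awk -F ";" '{if($3 != "" && $4 != "" && ($3/$4 < 0.3)) print $0;}'
# prints: 3;Music 3;1;5;87;Classic
```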

To automate the queries above, we can put them all into one shell (.sh) file and run it from the command line, or register the .sh file as a cron job so everything runs automatically.

vi auto.sh

#!/bin/bash
hadoop fs -text /data/mysample/* | grep -E "(Rock|Pop)"
hadoop fs -text /data/mysample/* | awk -F ";" '{if($3 != "" && $4 != "" && ($3/$4 < 0.3)) print $0;}'
hadoop fs -text /data/mysample/* | cut -d ";" -f2,6
hadoop fs -text /data/mysample/* | sort | uniq -c
hadoop fs -text /data/mysample/* | head -n 5
hadoop fs -text /data/mysample/* | tail -n 5
hadoop fs -text /data/mysample/* | awk -F ";" '{if($5 != "" && ($5 > 50)) print $0;}'
hadoop fs -text /data/mysample/* | sort | uniq
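To schedule auto.sh with cron, one sketch (assuming auto.sh from the step above; the install path and log file are examples, adjust to your environment):

```shell
# Make the script executable (assumes auto.sh exists in the current directory)
chmod +x auto.sh

# Hypothetical cron entry: run the queries every day at 01:00,
# appending all output to a log file
echo '0 1 * * * /home/user/auto.sh >> /tmp/auto_query.log 2>&1' > mycron.txt

# Install the schedule with: crontab mycron.txt
cat mycron.txt
```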

We hope this shows how to combine Linux shell commands and HDFS commands to query data in HDFS without a Hive mapping. We use this approach when we need ad-hoc queries.

All the examples and code are shared by Big Data solutions experts and Hadoop architects from India for reference purposes only. You can ask them your questions (if any) and get answers quickly.
