How to deploy a Spring Boot application on an AWS EC2 server quickly


Hello friends, today in this tutorial we will see how to deploy a Spring Boot application on AWS EC2 easily. Nowadays, Spring Boot and AWS have gained a lot of momentum, and many projects are being deployed on AWS with Spring Boot as the underlying framework. We will work with a very simple application and go through the configuration steps in detail so that the application gets deployed smoothly.

What we will do in the hands-on section

A Spring Boot project can be easily deployed on AWS using a service called Elastic Beanstalk, a PaaS offering from AWS that takes care of provisioning and deploying our application with a few clicks.

Today we will not go the Elastic Beanstalk route; rather, we will deploy the application manually, using an EC2 User Data script to automate the environment setup and a few more steps.

So, in today's demo we will mainly do the following things to test our application:

  1. Create a Spring Boot project from scratch
  2. Create one sample application; this could be any application, but for simplicity we will expose one very simple REST API and deploy it on AWS EC2.
  3. Test the application locally
  4. Develop the required configuration components for AWS
  5. Deploy and test the application from the AWS deployed URL

A Few Key Terminologies

Before starting the actual hands-on part, let's first get familiar with the technologies we will use today. Readers who have reached this section of the article are probably already familiar with these terms, but here are some quick bullet points anyway.

AWS – Amazon Web Services is the cloud offering from Amazon and is currently very popular. There are around 100 services under the AWS umbrella, covering technical domains such as IaaS, PaaS, SaaS, networking, compute, storage, big data, machine learning, AI, etc. AWS is currently the market leader in the cloud domain, and many companies are adopting it for cost effectiveness and other cloud benefits such as easy scalability, high availability, and fault tolerance. The AWS cost model is mostly pay-per-use, so customers pay only for the services they have used, based on the pricing attributes of each service.

EC2 – Elastic Compute Cloud is the AWS offering for Infrastructure as a Service (IaaS). It provides virtual servers in the cloud that are very cost effective and highly customizable based on demand.

Spring Boot – An application framework based on the Spring Framework with lots of extra features. It is currently one of the most popular frameworks for web development, especially for microservices.

So now let's start the hands-on part and proceed with the steps mentioned above.

Creating the Spring Boot project

As a first step we need to create one Spring Boot project which will be deployed on AWS EC2. The easiest way to do that is to:

  1. Go to https://start.spring.io/ and generate a Spring Boot project skeleton by providing the Group ID, Artifact ID, and required dependencies; for our testing we just need to select the Web starter dependency. After entering all that information, we can generate the project using the Generate Project button.
  2. Once we have downloaded the project, we need to unzip it to a suitable location and import it into Eclipse (or your favorite IDE) as a Maven project. Now run an initial mvn clean install on the project to ensure that all the Spring Boot related jars get downloaded properly.

Adding a sample REST controller

Now we will write one simple REST endpoint for our testing. The code is very simple; it is just the famous hello world, so here it is:

package com.example.awsspringbootdemo;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class HelloWorldRestController {

    @GetMapping("/hello")
    public String sayHello(@RequestParam String msg) {
        return "Hello " + msg;
    }
}

Test the REST controller locally

Now we need to test this REST API once to verify that everything is working as expected. To do that, just run the application as a Spring Boot application from Eclipse by running the @SpringBootApplication annotated class present in the same package, which was generated automatically when we created the project in the earlier step.

Once the application is running on the local machine, it will listen on port 8080 by default. Access the below URL from a browser; if everything goes well you should get a response.

http://localhost:8080/hello?msg=sajal
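You can also hit the endpoint from a terminal. A quick check with curl (assuming the application is running on the default port 8080) should return the greeting built by our controller:

curl "http://localhost:8080/hello?msg=sajal"
# Expected response: Hello sajal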

Create Required Configurations for AWS deployment

As already highlighted at the beginning, we will not use Elastic Beanstalk to deploy the application on AWS; rather, we will go the conventional way and deploy the application manually, using a script that is provided while creating the EC2 server.

For this I have created a script in the file EC2_USER_DATA.md, placed in the home directory of the attached project. Below is the content of that file. In short, it will perform the following steps while the EC2 server is being created:

  • Install required updates to the EC2 operating system, so that we don't face any OS related issues
    • yum update -y
  • Install Java 1.8 and the httpd server on EC2
  • Remove the old version of Java
  • Create a directory on the EC2 server
  • Copy the Spring Boot application jar from S3 to the directory newly created on the EC2 server
  • Give the required permissions to the newly copied application file
  • Now make the necessary changes in httpd to redirect port 80 traffic to 8080, as our actual application will run on 8080 but we will access it externally from port 80
  • Then start httpd as a service, so that it starts again automatically when EC2 restarts
  • And finally start the Spring Boot application as a service as well, so that the application comes up properly when EC2 restarts

We also need one more configuration in pom.xml to make the jar fully executable, by adding this section:

<build>
    <plugins>
        <plugin>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-maven-plugin</artifactId>
            <configuration>
                <executable>true</executable>
            </configuration>
        </plugin>
    </plugins>
</build>

Here is the User Data script. User Data is a textual script provided while creating an EC2 server, so that a few things can be automated in the newly created server.

#!/bin/bash
# install updates
yum update -y
# install apache httpd
yum install httpd -y
# install java 8
yum install java-1.8.0 -y
# remove java 1.7
yum remove java-1.7.0-openjdk -y
# create the working directory
mkdir /opt/spring-boot-ec2-demo
# create configuration specifying the used profile
echo "RUN_ARGS=--spring.profiles.active=ec2" > /opt/spring-boot-ec2-demo/spring-boot-ec2-demo.conf
# download the maven artifact from S3
aws s3 cp s3://springbootprojects/aws-spring-boot-demo-0.0.1-SNAPSHOT.jar /opt/spring-boot-ec2-demo/ --region=ap-south-1
# create a springboot user to run the app as a service
useradd springboot
# springboot login shell disabled
chsh -s /sbin/nologin springboot
chown springboot:springboot /opt/spring-boot-ec2-demo/aws-spring-boot-demo-0.0.1-SNAPSHOT.jar
chmod 500 /opt/spring-boot-ec2-demo/aws-spring-boot-demo-0.0.1-SNAPSHOT.jar
# create a symbolic link
ln -s /opt/spring-boot-ec2-demo/aws-spring-boot-demo-0.0.1-SNAPSHOT.jar /etc/init.d/spring-boot-ec2-demo
# forward port 80 to 8080
echo "<VirtualHost *:80>
  ProxyRequests Off
  ProxyPass / http://localhost:8080/
  ProxyPassReverse / http://localhost:8080/
</VirtualHost>" >> /etc/httpd/conf/httpd.conf
# start the httpd and spring-boot-ec2-demo
service httpd start
service spring-boot-ec2-demo start
# automatically start httpd and spring-boot-ec2-demo if this ec2 instance reboots
chkconfig httpd on
chkconfig spring-boot-ec2-demo on
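Once the instance is up, you can SSH in and do a quick sanity check. This is a minimal sketch, assuming the service names and paths used in the script above:

# confirm both services started
sudo service httpd status
sudo service spring-boot-ec2-demo status
# hit the application through the httpd proxy on port 80
curl "http://localhost/hello?msg=test"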

Now do the final build and upload the file to S3

Now we will deploy this application to AWS. To do that, we first need to upload the deployable artifact to S3, from where the above-mentioned script will copy the jar and place it on the EC2 server.

To do that

  • First build the artifact on the local machine with the Maven command mvn clean install
  • Now create one bucket called springbootprojects in AWS S3 and upload the jar file generated in the target directory of the project (see the CLI sketch after this list)
  • Make sure you create the bucket in the ap-south-1 region, as our User Data script is written for that region; if you create it in a different region, please change the region in the User Data script, otherwise EC2 will not be able to copy the file.
  • Here is the guideline on how to create buckets and upload files in AWS S3: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/upload-objects.html
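If you prefer the command line over the console, the build and upload can also be done with the AWS CLI. This is a rough sketch, assuming the CLI is configured with credentials that can create and write to the bucket, and that the bucket name springbootprojects is still available (bucket names are globally unique):

# build the jar locally
mvn clean install
# create the bucket in the ap-south-1 region
aws s3 mb s3://springbootprojects --region ap-south-1
# upload the artifact produced in the target directory of the project
aws s3 cp target/aws-spring-boot-demo-0.0.1-SNAPSHOT.jar s3://springbootprojects/ --region ap-south-1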

Create EC2 role for S3

Now we need to create one IAM role for S3 access, which needs to be assigned to the EC2 server so that it can access the S3 bucket we created earlier. This will enable the EC2 User Data script to copy the application's executable jar file to the EC2 directory for deployment. I am not describing it in detail here; there is good documentation on this at https://docs.databricks.com/administration-guide/cloud-configurations/aws/iam-roles.html Please follow it and create one EC2 role with S3 read-only access.
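For reference, roughly the same role can be created from the AWS CLI. This is only a sketch; the role and instance profile name ec2-s3-readonly is a made-up example, and the trust policy simply allows EC2 to assume the role:

# trust policy letting EC2 assume the role
cat > trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "ec2.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF
aws iam create-role --role-name ec2-s3-readonly --assume-role-policy-document file://trust.json
# attach the AWS managed S3 read-only policy
aws iam attach-role-policy --role-name ec2-s3-readonly --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
# EC2 attaches roles through an instance profile
aws iam create-instance-profile --instance-profile-name ec2-s3-readonly
aws iam add-role-to-instance-profile --instance-profile-name ec2-s3-readonly --role-name ec2-s3-readonly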

Create one Security Group for accessing the Application from outside AWS

The next prerequisite step is to open port 80 to the outside world. This is essentially the standard firewall port opening in networking terminology. In AWS this is achieved by creating or modifying a Security Group and assigning it to the EC2 instance. In a security group, we should only expose the required ports to the outside world, for security reasons. For our scenario, we will only expose ports 80 and 22, for HTTP and SSH access respectively, so that we can log in to the EC2 instance through a terminal and also access the web application from a browser, REST client, or another outside client application.

AWS has very good documentation on how to create a security group as well; you can follow the below links or any other resources available online.

https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html

https://aws.amazon.com/premiumsupport/knowledge-center/connect-http-https-ec2/
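The same security group can also be created from the AWS CLI. A rough sketch, assuming the default VPC and a hypothetical group name spring-boot-demo-sg:

aws ec2 create-security-group --group-name spring-boot-demo-sg --description "HTTP and SSH for the Spring Boot demo"
# allow HTTP from anywhere
aws ec2 authorize-security-group-ingress --group-name spring-boot-demo-sg --protocol tcp --port 80 --cidr 0.0.0.0/0
# allow SSH (ideally restrict the CIDR to your own IP)
aws ec2 authorize-security-group-ingress --group-name spring-boot-demo-sg --protocol tcp --port 22 --cidr 0.0.0.0/0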

Create the EC2 instance and deploy the application

Now we have completed the required prerequisites: uploading the jar file to S3, creating the security group to open the port for accessing the application, and finally the EC2 role for S3 access.

Now we need to create one EC2 instance and deploy the application. To create EC2 instance we need to do the below steps

  • Log in to the AWS console and select the EC2 service
  • Click the Launch Instance button on the EC2 landing page
  • Then select the AMI and go to the next page
  • Now select the instance type (choose only a free tier eligible type, else charges will apply)
  • Now, on the instance details page, we need to select two important things; the rest (VPC, subnet, etc.) you can keep as default. A CLI sketch equivalent to these console steps is shown after this list.
    • On this page, first select the IAM role that we created earlier; in my case I named that role AmazonS3FullAccess
    • Then we need to paste the USER DATA into the Advanced Details section. You can copy the whole script as mentioned above or from the attached code base.
  • Now choose/edit the security group to allow HTTP traffic on port 80
  • Finally, choose one key pair and finish the instance creation by clicking the Launch Instance button
  • Now go back to the EC2 landing page; you will see the instance is being launched. Wait until the instance state is running and all status checks have completed. You can see those on the EC2 page itself
  • Now note down the public DNS/IP address from the instance description tab of the same page; we will use this DNS/IP to access the application
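For completeness, the same launch can be scripted with the AWS CLI. This is a sketch only; the AMI ID, key pair name, security group ID, and instance profile name below are placeholders, and EC2_USER_DATA.md is the User Data script file from the attached project:

aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --instance-type t2.micro \
    --key-name my-key-pair \
    --security-group-ids sg-xxxxxxxx \
    --iam-instance-profile Name=ec2-s3-readonly \
    --user-data file://EC2_USER_DATA.md \
    --count 1 \
    --region ap-south-1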

Access the application deployed in AWS

Since we noted down the IP/DNS of the server where the application has been deployed, we can easily access the application using it.
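For example, from any machine outside AWS you can hit the same /hello endpoint. The host below is a placeholder for the public DNS you noted down; the request goes to port 80, from where httpd proxies it to the application on 8080:

curl "http://<ec2-public-dns>/hello?msg=sajal"
# Expected response: Hello sajal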


Final Note –

Congratulations! If everything went well, you have now developed a sample Spring Boot REST web service, deployed it on a single EC2 instance, and accessed that application from outside AWS. This is one way of deploying an application to EC2; there are obviously other ways as well, which we will see in upcoming articles.

Advanced Web Scraping Tutorials in Python

Web Scraping in Python


Introduction:

Some call it "screen scraping", others "web scraping", and yet others simply call it "data extraction" from documents, whether online or offline. These terms are often used interchangeably, but they have slightly different meanings, and the tactics used for extracting such data differ slightly as well. For the course of this discussion we will concentrate on "web scraping", which basically means extracting data from websites (i.e., HTML documents). Later on, we will take a peek at how to extract data from Microsoft Word and Excel documents, and PDF files.

Web Scraping using urllib, urllib2, and BeautifulSoup:

Let us dive straight into the topic of "web scraping". Now, there are multiple ways of doing this using Python, and we will take a look at each of them briefly, but our main focus will be on the following modules: urllib, its half-brother urllib2, and BeautifulSoup (3.2.1). You may use BeautifulSoup version 4 too; it is called bs4, and I don't like the name for obvious reasons, so I will stick with the 3.2.1 version. It is not very different from bs4, so if you want to use that, please go ahead and use it with almost the same code that we write using version 3.2.1.

Now, the actual workhorse here is the urllib2 module, which makes the HTTP(S) connections, extracts the content (be it HTML, a Word or XLS file, or a PDF), and stores it in a variable in your program. In the case of an MS Word, Excel, or PDF document, it will actually download the entire document as a whole and store it somewhere; you then need a Python module to extract the content from it, and later in this discussion we will see how to do that.

Let’s get to the code now. For now you might not understand every line of it, but don’t worry, by the end of this blog I will explain everything I put down in the code and you will be able to grasp it.

[Note: the code here is taken from the repository of Python and Perl code at https://github.com/supmit13, which is my personal repository of code that I have written over the course of the past few years, so you can go ahead and have a look at other code in that repo. Not all of it is production grade, since it is really a repo of code that I normally write to test stuff, but some of it is in production. However, I own the code, and it is in the public domain, so you are free to make use of it, even without informing me. Just fork the repo if you want and you can start doing your stuff. However, please do not change anything in the original versions.]

Listing #1

import os, sys, re, time  # we might not use all of them in our code here, but it is a good practice to have the basic tools imported at the outset
import urllib, urllib2
from BeautifulSoup import BeautifulSoup
import StringIO
import gzip

# Let's start with something simple, like yellowpages.
url = "https://www.yellowpages.com/"  # Note that the URL begins with https
httpHeaders = {
    'User-Agent' : r'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.10) Gecko/20111103 Firefox/3.6.24',
    'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language' : 'en-us,en;q=0.5',
    'Accept-Encoding' : 'gzip,deflate',
    'Accept-Charset' : 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
    'Keep-Alive' : '115',
    'Connection' : 'keep-alive',
}

You don't need to specify this many header parameters, but it might be a good idea if you are trying to make any log parser on the server side think that you are not a bot. Please note that the 'User-Agent' key has a value that is a fingerprint of the Firefox browser, so a person looking at the server logs will not be able to tell that your program was actually a bot. Of course there are other measures that you need to take in order to fool the guy on the server side, but more on that as we move on.

pageRequest = urllib2.Request(url, None, httpHeaders)
# In the above line, we create an HTTP request object that we are going to use to scrape the yellowpages website.
# The second parameter is the data we want to send the website in the request, and since
# we are making a GET request, we are not sending any explicit data. So it is None for now.
# Later, we will see how to make POST requests, and in those requests, we will see what we
# send in the 'data' param.

try:
    pageResponse = urllib2.urlopen(pageRequest)
except:
    print "Error retrieving page pointed to by URL: %s Error: %s"%(url, sys.exc_info()[1].__str__())

# Now, at this point we should be able to see what the content of the pageResponse variable
# is, but it will be in a gzipped encoded format. Still, let us see what it contains.
pageContent = pageResponse.read()
print pageContent

The content of the pageContent variable will be something like this:

'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xec\xbd\xebr\x1bG\x920\xfa\xdfOQ\xc6F\xc8\xe41\xba\xd9U} \'\r\xfbP\x105\xf2.8\xf6J6gl\xadB\x01\x82M\x02v\x13\xe0\x00 %RV\xc4\xf7\x0e\xdf\x1b\x9e\'9y\xadn\x90 D\xd9\x9a\x99\xdd\x8d\xb1\xccFuu\xdd++3+3+\xeb\xab\xcf\x9f|\xd7\xff\xe1\xa7\xef\x0f\xccxy^\x7f\xfd\x15>M=\x9c\x9e\xf5: \xd5\xb4\x03\xef\xd5\xf0\xe4\xeb\xaf\xce\xab\xe5\x10\x12,/\x82\xeao\x97\x93\xab^\xe7\xaf\xc1\x8f\xfbA\ x7fv~1\\N\x8e\xeb\xaacF\xb3\xe9\xb2\x9a.{\x9do\x0fz\xd5\xc9Y\x05Y\x17\xa3\xf9\xe4bi\x96\xd7\x17U\xef\x8be\ xf5v\xb9\xf3\xcb\xf0j\xc8\xb1_|\xfdf2=\x99\xbd\t\xff\xfc\xfc\xe0\xc7\xc3\xdf~\xdb\xa2\xdf\xde\xbb\xf7\xdb {\x14\n\'\xd3\xd3\x99\xe9\x99w\x9d\xe1\x19\x14\xdb\xd9\xedt\xba\x9d\xe3j\x08\xd5@\xf8xx\x1eN\xe7\xc1\xc9p9\x0c\ xa7\xd5\x12>U\xf3\xf9l\xfe\xf8\xfe\xef\xf5dTM\x17\xd5\x7fT\xd7\xf0y\x98\x1f\x8f"\x1b%\xd9\x08\xbe\x0c/.\xe0#tc6\ xfd\xf6\t|\xcc\x8b(\xb1Q\xd4\xe5_\xbb\x9a\xe2\x87\xc9y\xd5\xd9-\\\x18\xa5.\xb5\xb6\xdbY\xce\x87\xd3\xc5p\x84\ xdf\xfe<\xc4o\x9d\xc3\xb3\xe7/~~|\xf8\xf4\xaf\xbf\x1e>;\x1a\xff|\xf2\x97\xb3\xe5w\'?\x9e=\x7fZ\x1f=yzT <\xffS\xfc\xed\xd9\x9f\x9f\x17\xa7P\xf0\xdf.\xab\xcb\x8a\x8b\x8c\xa0\xa8\xe5\x9f.\'\'PB\x1c\x8d\xb2\xe3$rv\x98\ xc0\x7f\xd8\x00\x1c\x82\x1ff\xbf\xc2\x84\xecN/\xeb\xfa\xfd\x9e\xb9o\xf4\xba\xaf_O\xe7\xaf\xe78K\xf3\xaawz9\xa5\xa6mU\ xddiw\xb9\xfdN\xdf\xcd|\x0b\xde&\xa7[\x9fO_._m\xbf\xbb\x1a\xce\xcd\xac\x87\xe1\xde\xbb\xea\xed\xc5l\xbe\\\xec\xbe{\xff~\ xaf\x82\x98\x97\xd1\xabp4\xac\xeb\xadY(\x9f\xba\xbe\xd8\xa9f\xa5\x84\xf6\xd5\xcb\xe9\xab\xbdy\xb5\xbc\x9cc\r\xb3\xdf~\x9bn\xbf\xef\xce\xba> \xe3\xf6{\xf9\x885i\xe4{hFGK\xec\xf4z\x08.\xb3S\xd3\xee\xc7\xb6dk\xc7\xed\x9d\xce\xe6[\\y\xb47\xfbj\x19\xd6\xd5\xf4l9\xde\x9b}\xf9\xe56\xf4 \xee\xe5\xec\xd5\xb6o\xca\xfb\xadwv\xf7\xe5\x86\xc1\xd8~\xf7\xde\xbf\xcd\xf4\xbb\xe4\xf6\xd9|\xcc\x04R\xbc\x1c\x85\xd3\xd9\x9b\xadm\x18\x9b\ xd9\x14`c\xebrk8?\xbb<\x87\x89Zlow\xa7\xdf\xe0<\xed.\xc7\x93\x05\x14\x05\xafW\xb3\xc9\x89\x89(\xe2\xfd{l\xf6\xa4Wmu\xc6\xc3\xe9\t,\x9e\xed \xee\x10\xde\xe2\xed\xee%\xfc$\xdb\xddS\xfcVA|xV-\xb7\x10\xc6F\xd5\x1cR\x8d0\xbe\x9e\rO\xe8m\xd1\xa3Y\xdf\xeb\\NO\xaa\xd3\xc9\xb4:iFO\ xc0cZ\xbd\x99W\x00\xbb\x8f\x1emi\xb0\xb7\xd8\xde\xc3\xfa/z/;\x8bj\xf9=\xc0\xd6\xd1\xa4zC\xc0\xdb\xc5\x98\xfe\xe5b9;\xdf_.\xe7\x93\xe3\ xcb\xa5D\x1e\xe0\xf2zF\xad\x9dC\x0c\xd46Y\x8c\xa1B\x80\xcd\x93\x93\x1ff?`\x0b\xe1e2\xad\xa1\x1d\xcf&K\xfe\xf0\xbc\xaa\xab\xe1\xa2\xea\xb...’

For our purposes, this is garbage and we can't do anything useful with it. So, in order to get the actual content (in English, not gibberish), we need to add the following code:

Listing #2

# Remember we imported StringIO – here is where we use it.
responseStream = StringIO.StringIO(pageContent)
decodedContent = ""

try:
    gzipper = gzip.GzipFile(fileobj=responseStream)
    decodedContent = gzipper.read()
except:
    # Maybe this isn't gzipped content after all....
    decodedContent = pageContent

print decodedContent
# This will print the contents in English

Now, this is the type of code (the decoding part) we need on a routine basis. Hence it is best to create a function out of it.

Listing #3

def decodeGzippedContent(encoded_content):
    response_stream = StringIO.StringIO(encoded_content)
    decoded_content = ""
    try:
        gzipper = gzip.GzipFile(fileobj=response_stream)
        decoded_content = gzipper.read()
    except:
        # Maybe this isn't gzipped content after all....
        decoded_content = encoded_content
    return(decoded_content)

Next time we encounter this scenario, we will call decodeGzippedContent with the encoded content as a param. Having done this, let us now concentrate on extraction of data. We will need BeautifulSoup here.

soup = BeautifulSoup(decodedContent)
# Let's say we are interested in all the anchor tags here...
alllinks = soup.findAll("a")
for link in alllinks:
    if link.has_key("href"):
        print link['href']
    else:
        pass

BeautifulSoup has enormous data-extraction capabilities, and it would not be possible to show them all within the narrow scope of this document. For example, it can extract data based on a tag name and an attribute of that tag. Let us suppose you want to get all the data contained in all "div" tags in an HTML document, but you want to consider only those div tags that have their "class" attribute set to "news". In order to do that you could write the following code:

Listing #4

soup = BeautifulSoup(html)
allnewsdivs = soup.findAll("div", {'class' : 'news'})
content = ""
for newsdiv in allnewsdivs:
    content += "\n\n" + newsdiv.text  # Just put 2 newlines between the contents of each div.

To know more about BeautifulSoup, I would suggest you take a look at its documentation (which is exhaustive); only that can give you a precise idea of how handy it is in your daily scraping tasks. The link to the docs is: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Please note that this is the version 4 documentation, but if you look around a bit you will find the docs for version 3.2.1 as well.

Now let us get back to urllib and urllib2 for a more in-depth discussion. As you might have noticed, in our last example we used the urlopen method of the urllib2 module to make a request. urlopen uses the default opener object, and you can replace it with one that suits your requirements. For example, in the following code I am going to declare a class called NoRedirectHandler, which suppresses the redirects that automatically take place when you use the default opener. Sometimes automatic redirection is convenient (you don't need to worry about a page redirect, since it happens automatically), but in certain situations you might want to know exactly what is happening when the redirect occurs. So here is the code, and I will walk you through it.

Listing #5

import os, sys, re, time, gzip
import urllib, urllib2, httplib
from urlparse import urlparse, urlsplit
from StringIO import StringIO

class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        infourl = urllib.addinfourl(fp, headers, req.get_full_url())
        infourl.status = code
        infourl.code = code
        return infourl

    http_error_300 = http_error_302
    http_error_301 = http_error_302
    http_error_303 = http_error_302
    http_error_307 = http_error_302

class Bot(object):
    def __init__(self, siteUrl):
        self.opener = urllib2.build_opener()
        self.no_redirect_opener = urllib2.build_opener(urllib2.HTTPHandler(), \
            urllib2.HTTPSHandler(), NoRedirectHandler())
        self.httpHeaders = {
            'User-Agent' : r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36',
            'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language' : 'en-US,en;q=0.8',
            'Accept-Encoding' : 'gzip,deflate,sdch',
            'Connection' : 'keep-alive',
            'Host' : 'www.courts.mo.gov'
        }
        # Code to create the pageRequest goes here... it is the same as in the earlier example.
        self.pageResponse = self.no_redirect_opener.open(self.pageRequest)
        # Some other code....

The point to note above is the definition of NoRedirectHandler and its usage in pulling data. Using the urllib2.build_opener() method, you can create a custom opener object for yourself and use it. Yes, you need to write quite a bit of code for that, but on the positive side it gives you flexibility and independence from the cut-and-dried defaults, which might not suit your purpose.

Other Libraries:

Python has quite a few libraries that allow you to do the same in much less code, but that comes with its own set of constraints. Firstly, you need to learn the library, figure out what it does and how it does it, and then implement your solution accordingly. There are frameworks like "scrapy", and libraries like "requests" and "mechanize", that handle a lot of stuff for you. Again, I can only give you some pointers on how to make basic use of these modules/frameworks, and I am also going to list a few advantages and disadvantages of each.

Let's look at Scrapy first: This is a framework that was designed explicitly for web scraping.

Listing #6

from scrapy.http import Request
from crawling.spiders.lxmlhtml import CustomLxmlLinkExtractor as LinkExtractor
from scrapy.conf import settings
from crawling.items import RawResponseItem
from crawling.spiders.redis_spider import RedisSpider

link_extractor = LinkExtractor(
    allow_domains=response.meta['allowed_domains'],
    allow=response.meta['allow_regex'],
    deny=response.meta['deny_regex'],
    deny_extensions=response.meta['deny_extensions'])

for link in link_extractor.extract_links(response):
    # Do something with the links.
    pass

As you can see, Scrapy hides the technical details and provides the programmer with an infrastructure where she/he can focus on the functionality of the application under consideration. Again, however, the downside is that Scrapy doesn't really take care of everything you need. For instance, if you are extracting links from an HTML document and you want to go 5 levels below the target page to extract links, Scrapy will efficiently do that for you. However, if you want to know which links came from which level, Scrapy plays dumb. It doesn't keep track of the level at which a certain link is found. And that can be a serious problem if you are trying to assign a score reflecting the relevance of the link, where links on the top pages have more weight than the others in the lower levels.

You may also use the "requests" library, which is very easy to use:

Listing #7

import os, sys, re, time
import requests

resp = requests.get("https://www.yellowpages.com")
pageContent = resp.text

# You can actually make any type of request using this module – like POST, PUT, HEAD...

Now let us go back to urllib2 one more time and see how POST requests are made. A POST request typically contains data, and it can be a large volume of data. The upload might take time, so you may want to increase the timeout parameter to a value that allows all the data to be uploaded.

Let's get into the code:

Listing #8

pageRequest = urllib2.Request(requestUrl, encodedPageFormData, httpHeaders)

The variable encodedPageFormData contains data in the following format:

param1=val1&param2=val2&param3=val3....

Now, what you can do is first collect your data and place it in a dictionary, like so:

d = {'param1' : 'val1', 'param2' : 'val2', 'param3': 'val3'...}

In order to get the data in the above-mentioned format, you can pass the dictionary to urllib.urlencode():

encodedPageFormData = urllib.urlencode(d)

The subsequent code is similar to the code we explained above.

If you are interested in scraping newspaper articles and their metadata (like the date on which the article was printed in the newspaper, the name of the author, his/her occupation, etc.), this can be achieved using a module called "newspaper". You can easily install the newspaper module using "pip install newspaper". Once that is done, you may write the following code to extract the content of an article.

Listing #9

from newspaper import Article

url = "https://indianexpress.com/article/opinion/columns/unemployment-really-cmie-surjit-bhalla-demonetisation-job-cuts-5399928/"
article = Article(url)  # Default language is English, 'en'.
article.download()
article.parse()
article.nlp()  # Uses Natural Language Processing, so that is wonderful stuff.
print article.text  # This will give you the textual content of the paper.
# If you want the title, the date published, a summary and stuff like that,
# you would need to make calls to the following methods/attributes:
print article.title
print article.summary

Scraping Sites that are behind an Authentication Mechanism:

In order to scrape content from a website that is behind an authentication mechanism (meaning you have to log in using your username and password), you need to send the login URL your username, password, and any cookie sent by the server to the browser. In such a case, you need to keep track of the cookie(s) that are sent to the scraper/bot every