This is how I made my first Web Crawler!


What’s a web crawler? Why do we need one? In this post, I am going to explain how to build your first web crawler.

What is a web crawler?

A web crawler is a bot that crawls the internet in order to index and download website content for scraping. Crawlers are also known as web spiders or crawling bots. A web crawler is given a list of starting websites; it indexes those pages and then follows the links found on them to discover new pages.

The Library Analogy

As an example, consider all of the websites on the internet to be books in a library. A web crawler is the librarian whose duty is to enter information about the books into a catalogue so that they can be found easily when needed. To organise the books, the librarian records each book’s title, description, and category in the catalogue. A web crawler does the same with web pages. A web crawler has achieved its purpose when it has indexed every page on the internet, something almost impossible to attain!

Creating a Web Crawler

I’ll be coding in Python in this blog. Python has a number of frameworks for web crawling and scraping; Scrapy is what I’m going to use.

Installing scrapy:

$ pip install scrapy

1. Create a Python application using Scrapy

To create a Scrapy project, run the following command. Here, the name of my application is my_first_web_crawler.

$ scrapy startproject my_first_web_crawler

This should produce scrapy boilerplate code and a folder structure that looks like this:

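The original screenshot isn’t reproduced here, but scrapy startproject generates the standard project layout (exact files may vary slightly between Scrapy versions):

my_first_web_crawler/
    scrapy.cfg              # deploy/configuration file
    my_first_web_crawler/   # the project's Python module
        __init__.py
        items.py            # item definitions
        middlewares.py      # spider and downloader middlewares
        pipelines.py        # item pipelines
        settings.py         # project settings
        spiders/            # the folder where spiders live
            __init__.py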

2. Creating a Web Crawler

The folder named spiders contains the files which scrapy uses to crawl the websites. I will create a file named spider1.py in this directory and write the following lines of code:

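The exact source is in the GitHub link below; as a rough sketch (only the spider name and the start URLs are taken from this post, the class name and the file-saving logic are my own illustration), the spider looks something like this:

import scrapy


class BlogsSpider(scrapy.Spider):
    # the spider name used with the "scrapy crawl" command
    name = "blogs"

    # the list of starting URLs to crawl; add as many as you like
    start_urls = [
        "https://gourav-dhar.com",
        "https://gourav-dhar.com/profile",
    ]

    def parse(self, response):
        # save the downloaded page to a local HTML file
        filename = response.url.strip("/").split("/")[-1] or "index"
        with open(f"{filename}.html", "wb") as f:
            f.write(response.body)
        self.log(f"Saved {response.url} as {filename}.html")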

You can find the actual code here: https://github.com/gouravdhar/my-first-web-crawler/blob/main/test_spider.py

I’ve given the spider the URLs of the websites to crawl; links to my blogs can be found on these pages. Because this is a list, you can provide any number of URLs. I’ll be crawling the following URLs:

https://gourav-dhar.com 
https://gourav-dhar.com/profile

The code above crawls through and downloads the web pages provided in the URLs.

To run the crawler, use the following command:

scrapy crawl <your-spider-name>

My spider name is blogs (defined by the name attribute in the spider code above).

And tada!!! The pages at those links have been downloaded into the project folder.

But that’s not enough; I also want to download the data from the links on this page. To do so, I’ll have to scrape all of the links on the main page and crawl through them. I’ll use the Scrapy shell to work out the code for scraping webpage information.

Note: Scrapy Shell is an interactive shell where you can try and debug scraping code very quickly.

To start scrapy shell, just write :

$ scrapy shell 'https://gourav-dhar.com'

i.e. scrapy shell followed by the URL.

Once the shell is opened, type response to confirm that you get a 200 response.

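Typing response at the prompt prints the response object; for a successful fetch it looks like this:

>>> response
<200 https://gourav-dhar.com>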

Links are usually found in the href attribute of the HTML a (anchor) tags. I need to scrape all of the href values on this page, so I’ll type this to see what comes back:

>>> response.css('a::attr(href)')
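The command returns a SelectorList, one Selector per href attribute; the output looks roughly like this (the data values below are illustrative, not the real links):

[<Selector xpath='descendant-or-self::a/@href' data='https://gourav-dhar.com/profile'>,
 <Selector xpath='descendant-or-self::a/@href' data='/some-blog-post'>,
 ...]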

This is a list of selectors for the href attributes on the page. To get a clean result containing only the links, we use the getall() method:

>>> response.css('a::attr(href)').getall()

The result should look like this:

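(Illustrative output: getall() returns a plain Python list of the href strings; the actual values will be the links on the page.)

['https://gourav-dhar.com/profile',
 '/some-blog-post',
 '/another-blog-post',
 ...]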

This will give me a list of all the href values.

To download all of the pages in this list, I’ll change the parse function in the spider code to use the above command to collect all of the links and follow them. This is how the changed parse function looks:

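Again, the exact code is in the repository linked below; a sketch of the modified parse method (using Scrapy’s response.follow to queue every extracted link) looks like this:

    def parse(self, response):
        # save the current page, as before
        filename = response.url.strip("/").split("/")[-1] or "index"
        with open(f"{filename}.html", "wb") as f:
            f.write(response.body)

        # extract every href on the page and follow it;
        # each followed page is handled by this same parse method
        for link in response.css("a::attr(href)").getall():
            yield response.follow(link, callback=self.parse)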

The GitHub link for the project can be found here: https://github.com/gouravdhar/my-first-web-crawler

Now again run the following command in the terminal:

$ scrapy crawl blogs

This time I was also able to extract the content of all of the links on my homepage. This function can be extended into an endless loop that keeps following links and crawls more and more of the internet’s websites.
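In practice you would bound such a crawl rather than let it run forever. Scrapy’s standard knobs for this include the spider’s allowed_domains attribute and the DEPTH_LIMIT setting; for example (illustrative values):

# in the spider class: only follow links on these domains
allowed_domains = ["gourav-dhar.com"]

# in settings.py: stop following links more than two hops deep
DEPTH_LIMIT = 2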

Summarising Web Crawlers

A web crawler is a sophisticated tool for storing and indexing online page content. It has a wide range of applications.

Note: You can also control which crawlers are allowed to crawl your site, and which parts of it, by adding allow/disallow rules for specific user agents in your site’s robots.txt file.
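On the crawler side, Scrapy respects this by default: the settings.py that scrapy startproject generates turns on robots.txt compliance, so the spider skips anything the site disallows.

# settings.py
ROBOTSTXT_OBEY = True  # respect the target site's robots.txt rules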

Search engines use web crawling to index pages and store their meta titles and descriptions in their databases so that results for a user’s query can be displayed quickly. Google, Bing, Yahoo, and DuckDuckGo are examples of major search engines. On top of these results, each search engine adds its own ranking and recommendation system, which is what makes each search engine’s algorithms unique.

Web crawling is also used to detect copyright and plagiarism violations. Web analytics and data mining are two of its most common applications. It can even help detect web malware, such as phishing attacks: if you own facebook.com, you can scour the internet to see if anyone else is running a website that looks like facebook.com and could be used for phishing attempts.

The GitHub link for the project can be found here: https://github.com/gouravdhar/my-first-web-crawler

To get updates on more such interesting topics, follow me here. Feel free to comment and share your experiences and thoughts. Don’t forget to check out my website at https://gourav-dhar.com for more tech-related blogs.

This article was first published on medium.com (Gitconnected), and I am republishing it on my site. Please follow Gourav Dhar on his website and on medium.com.
