What is a web crawler? Why do we need one? In this post, I am going to explain how to build your first web crawler.
What is a web crawler?
A web crawler is a bot that browses the internet in order to index and download website content for scraping. Crawlers are also known as web spiders or crawling bots. A web crawler is given a list of starting websites; it indexes those pages, then crawls the links found on them to discover new pages.
The Library Analogy
As an example, consider all of the websites on the internet to be books in a library. A web crawler is the librarian whose duty is to enter information about the books into a catalogue so that they can be found easily when needed. To organise the books, the librarian records the title, description, and category of each book in the catalogue. A web crawler does the same with web pages. A web crawler has achieved its purpose once it has indexed every page on the internet, an almost impossible goal to attain!
Creating a Web Crawler
I’ll be coding in Python in this blog. Python includes a number of frameworks for web crawling and scraping. Scrapy is what I’m going to use.
$ pip install scrapy
1. Create a python application using scrapy
To create a Scrapy project, run the following command. Here, the name of my application is my_first_web_crawler:
$ scrapy startproject my_first_web_crawler
This should produce scrapy boilerplate code and a folder structure that looks like this:
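The original post shows the generated layout as an image; it typically looks like this (exact files may vary slightly between Scrapy versions):

```
my_first_web_crawler/
├── scrapy.cfg            # deploy configuration
└── my_first_web_crawler/ # the project's Python module
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # the folder your spiders go in
        └── __init__.py
```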
2. Creating a Web Crawler
The folder named spiders contains the files which Scrapy uses to crawl the websites. I will create a file named spider1.py in this directory and write the following lines of code:
You can find the above code here: https://github.com/gouravdhar/my-first-web-crawler/blob/main/test_spider.py
I’ve provided the URLs of the websites to be crawled; links to my blogs can be found on these pages. Because this is a list, you can provide any number of URLs. I’ll be crawling the following URLs:
The code above crawls through and downloads the web pages provided in the URLs.
Run the following command to run the code:
$ scrapy crawl <your-spider-name>
My spider name is blogs (defined in the name attribute of the spider code).
And tada!!! The data from the links has been downloaded into the project folder.
But that’s not enough; I also want to download the data from the links on these pages. To do so, I’ll have to scrape all of the links on the main page and crawl through them. I’ll use the Scrapy shell to write and debug the scraping scripts.
Note: Scrapy Shell is an interactive shell where you can try and debug scraping code very quickly.
To start the Scrapy shell, just write scrapy shell followed by the URL:

$ scrapy shell 'https://gourav-dhar.com'
Once the shell is opened, type response to confirm that you get a 200 response.
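If the page was fetched successfully, evaluating response in the shell prints its HTTP status and URL, along the lines of:

```
>>> response
<200 https://gourav-dhar.com>
```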
Links to other pages are usually found in the href attribute of the HTML’s a tags. I need to scrape all of the href values on this page, so I’ll run a selector on the response to see what comes back.
This returns a list of the a elements on the page. To get a cleaned-out result of only the links, we have to use the getall() method, which gives a plain list of all the href values.
To download all of the pages in this list, I’ll change the parse function in the spider code to use the above command to collect all of the links. This is how the changed parse function looks:
The GitHub link for the project can be found here: Github Link
Now again run the following command in the terminal:
$ scrapy crawl blogs
I was also able to extract the content of all of the links on my homepage. This function could be extended into an endless loop that crawls across websites all over the internet.
Summarising Web Crawlers
A web crawler is a sophisticated tool for storing and indexing online page content. It has a wide range of applications.
Note: You can also control which crawlers may access your site by listing blacklisted/whitelisted crawlers and paths in the robots.txt file of your site.
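For example, a robots.txt that blocks one crawler from the whole site while allowing everyone else might look like this (the bot name is illustrative):

```
User-agent: BadBot
Disallow: /

User-agent: *
Disallow:
```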
Search engines use web crawling to index and store meta titles and descriptions in their databases so that the results of a user’s query can be displayed quickly. Google, Bing, Yahoo, and DuckDuckGo are examples of major search engines. On top of these results, each search engine adds its own recommendation system, which is what makes each search engine’s algorithm unique.
Web crawling is also used to detect copyright and plagiarism violations. Web analytics and data mining are two of its most common applications. It is also capable of detecting web malware, such as phishing attacks. If you own facebook.com, you can scour the internet to check whether anyone else is running a lookalike of facebook.com that could be used for phishing attempts.
To get updates on more such interesting topics, follow me here. Feel free to comment and share your experiences and thoughts. Don’t forget to check out my website at https://gourav-dhar.com for more tech-related blogs.
This article was first published on Medium.com (Gitconnected) and is now republished on my site. Please follow Gourav Dhar on his website and on Medium.com.