Scraping & Crawling

Abstract
Combining information from multiple data sources to enable targeted marketing and better-informed business decisions has become possible thanks to Big Data. Processes such as scraping and crawling are essential for collecting and analyzing the information these purposes require.

This article presents an introduction to the Web Scraping and Web Crawling processes.

Introduction

Imagine that you have a store and you want to know whether your prices are in the same range as your competitors’ prices. You would need to open each store’s website to check the prices, which takes a lot of time and does not scale. This manual process can be automated using Web Scraping and Crawling.

In an ideal world, these processes would not be necessary, because every website would provide an API for querying its structured data. Some websites already provide APIs, but that is not always enough, since an API often exposes only a restricted subset of the data rather than all of the site’s content.

Big Data analytics, machine learning, search engine indexing, and many other fields of modern data operations require data crawling and scraping. Many scrapers are written in Python, which makes further processing of the collected data easier.

Web Crawling

In general terms, Web Crawling describes a program that navigates web pages on its own, without a narrowly defined target, exploring what a website has to offer. In other words, it visits pages, extracts their links, and puts those links in a queue to be visited in turn.

Figure 1 – Web Crawling Process

Returning to the store example, where you want to know your competitors’ prices, the first step would be to list the competitors’ sites and then, on each site, follow all the links that you will later need to check for prices. This process of starting from a list of sites and gathering all the links you need to look at is web crawling.
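The visit, extract, and queue loop can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the `crawl` function, the `fetch` callable, and the `shop.example` pages are assumptions made for the example, and an in-memory dictionary stands in for real HTTP requests so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl that stays on the start URL's host.

    `fetch` is any callable that takes a URL and returns HTML text,
    or None when the page cannot be retrieved.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "site" standing in for real HTTP responses.
FAKE_SITE = {
    "https://shop.example/": '<a href="/shoes">Shoes</a> <a href="/hats">Hats</a>',
    "https://shop.example/shoes": '<a href="/">Home</a> <a href="https://other.example/">Ad</a>',
    "https://shop.example/hats": '<a href="/shoes#reviews">Shoes</a>',
}

pages = crawl("https://shop.example/", FAKE_SITE.get)
print(pages)
```

In a real crawler, `fetch` would download the page over HTTP. The `seen` set is what keeps the crawler from queuing the same link twice.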

A good way to map the links of a website is to check its sitemap file. It is an efficient way to crawl a website; however, take care, because this file may be outdated or missing.
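As a rough sketch, a sitemap that follows the standard sitemaps.org XML format can be read with Python's built-in XML parser. The `sitemap_entries` helper and the sample content below are assumptions for illustration. A real sitemap would be downloaded from the site, commonly from `/sitemap.xml`.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_entries(xml_text):
    """Return (url, lastmod) pairs; lastmod is None when it is not given."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc, lastmod))
    return entries


# Hypothetical sitemap content for the example store.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://shop.example/shoes</loc><lastmod>2018-05-01</lastmod></url>
  <url><loc>https://shop.example/hats</loc></url>
</urlset>"""

entries = sitemap_entries(SAMPLE_SITEMAP)
print(entries)
```

The optional `lastmod` value is kept because it gives a hint about whether the sitemap is still being maintained.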

Web Scraping

Web scraping is the practice of gathering data through a program that interacts with an API or, like a web browser, requests pages and extracts the needed information from them. Web scrapers are excellent at gathering and quickly processing large amounts of data.

Figure 2 – Web Scraping Process

Nowadays, some websites provide an API that gives computer programs structured access to their data. However, not every website does. The main reasons to build a web scraper to gather the data are:
● The website doesn’t provide an API
● The API provided isn’t free (and the website is)
● The API provided limits the number of requests
● The API doesn’t expose all the data needed (and the website does)

Returning to your store, we now have the list of links gathered by the crawler that we need to check. It’s time to open each link, search for the specific information, extract it, and save it in a structured way.
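A minimal sketch of this extract-and-save step, using only the Python standard library, might look like the following. The HTML snippet, the `product-name` and `price` class names, and the `scrape_product` helper are all assumptions for illustration. Real pages will need their own selectors, and many projects use a dedicated HTML parsing library instead.

```python
import csv
import io
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Captures the text inside elements whose CSS class is in FIELDS."""

    # Assumed class names mapped to output field names.
    FIELDS = {"product-name": "name", "price": "price"}

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        for cls in css_class.split():
            if cls in self.FIELDS:
                self._current = self.FIELDS[cls]

    def handle_data(self, data):
        # Store the first non-empty text that follows a matching tag.
        if self._current and data.strip():
            self.record[self._current] = data.strip()
            self._current = None


def scrape_product(url, html):
    """Extract name and price from one page into a flat record."""
    parser = ProductParser()
    parser.feed(html)
    record = dict(parser.record, url=url)
    # Turn a price such as "$59.90" into a number (assumes a price was found).
    record["price"] = float(record["price"].lstrip("$ ").replace(",", ""))
    return record


# Hypothetical product page markup.
PAGE = '<h1 class="product-name">Running Shoes</h1><span class="price">$59.90</span>'
row = scrape_product("https://shop.example/shoes", PAGE)

# Save the record in a structured way; a StringIO buffer stands in for a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["url", "name", "price"])
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Writing to CSV is only one option. The same record could just as well be stored as JSON or in a database.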

Before you start developing a scraper, there are some important points to check:

As we know, web scraping, also known as web data extraction, is an automated technique for extracting information from the web, whereas web crawling just discovers and indexes pages using bots. However, some websites block web crawlers, so it is important to know who owns the site, check whether that owner is known to block crawlers, and adjust your download rate accordingly. The python-whois module can help you find out who the owner of a website is.

It is important to estimate the size of the website, since it affects how efficient the crawl needs to be. A quick way to do this is to search for the website in Google, for example with a site: query.

Identifying how the website is built helps determine the best way to crawl it. The builtwith Python module can help here: given the site’s URL, it returns the technologies used by the website.

The complexity of scraping depends on how much JavaScript there is to handle. For simple pages without JavaScript, the main difficulty is identifying the relevant URLs. Pages with partial JavaScript and simple pagination can often be handled by changing the URL to load the specific information and incrementing the page number in the link. For pages built entirely in JavaScript, where the data is only available after scripts run, more sophisticated tools such as Selenium should be used. CAPTCHAs can be another difficulty; they can sometimes be handled by changing the IP address of the scraper or by using an Optical Character Recognition (OCR) process. Python has an OCR module named pytesseract that can help much of the time, depending on the image quality.
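The pagination case mentioned above, where the page number is incremented in the link, can be sketched as follows. The `page` query parameter and the `shop.example` search URL are assumptions. Each site names and structures its pagination differently, so inspect the real URLs first.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def page_urls(base_url, first=1, last=3, param="page"):
    """Yield base_url once per page number, with `param` set to that number."""
    parts = urlparse(base_url)
    query = parse_qs(parts.query)
    for number in range(first, last + 1):
        query[param] = [str(number)]
        new_query = urlencode(query, doseq=True)
        yield urlunparse(parts._replace(query=new_query))


# Hypothetical search URL for the example store.
urls = list(page_urls("https://shop.example/search?q=shoes", first=1, last=3))
for url in urls:
    print(url)
```

Each generated URL can then be fetched and scraped like any static page, without running the site's JavaScript.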

Summary

Scrapers must be prepared to deal with sites of any complexity, so the data to be extracted should be analyzed beforehand.

There are many practical applications of having access to nearly unlimited data: market forecasting, machine translation, HR and employee analytics, and even medical diagnostics have benefited from the ability to retrieve and analyze data from sources such as news sites.

Scraping data from websites needs to be done correctly: in a way that the requests do not cause problems for the site, respecting the robots.txt file, and taking care over what can be republished. The legal side is still being established, so in any case, look for additional information on the website that you want to scrape.
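Respecting the robots file can be automated with the `urllib.robotparser` module from the Python standard library. The sketch below parses an inline, made-up robots.txt so that it runs offline. The `price-bot` user agent and the rules themselves are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 2
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "price-bot"  # assumed name of our scraper

can_scrape_product = rules.can_fetch(USER_AGENT, "https://shop.example/shoes")
can_scrape_cart = rules.can_fetch(USER_AGENT, "https://shop.example/checkout/cart")
delay_seconds = rules.crawl_delay(USER_AGENT)

print(can_scrape_product, can_scrape_cart, delay_seconds)
```

Against a live site, `set_url()` followed by `read()` downloads the real file instead of `parse()`. Waiting the requested delay between requests is one simple way to keep the scraper from overloading the site.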

References

[Practical Web Scraping for Data Science: Best Practices and Examples with Python] By Seppe vanden Broucke, Bart Baesens, 2018, Apress

[Web Scraping with Python: Collecting More Data from the Modern Web] By Ryan Mitchell, 2nd Edition, 2018, O’Reilly

[Big Data: What is Web Scraping and how to use it] https://towardsdatascience.com/big-data-what-is-web-scraping-and-how-to-use-it-74e7e8b58fd6

[Python-WhoIs Module] https://pypi.org/project/whois/

[Python-Builtwith Module] https://pypi.org/project/builtwith/

[Python-Pytesseract Module] https://pypi.org/project/pytesseract/
