In early 2008, Scrapy was released into this world and it soon became the #1 web scraping tool for beginners. Why? Because it's simple enough for beginners yet powerful enough for the pros. Here are some of the use cases -
Ecommerce (Amazon) - Scrape product names, prices and reviews
Data - Get a huge collection of data/images for Machine Learning
Email Addresses - Big companies scrape them and use them for lead generation
Come learn with me and I'll show you how you can bend Scrapy to your will. This course is great for beginners in Python at any age and any level of computer literacy.
The goal is simple: learn Scrapy by working on real projects step-by-step while we explain every concept along the way. For the duration of this course we will take you on a journey and you're going to learn how to:
Scrape Data from nearly Any Website
Build your own Spiders from scratch for all types of Web Scraping purposes
Export the data that you have scraped to JSON, CSV and XML
Store the data in databases - SQLite3, MySQL and MongoDB
Create Web Crawlers and follow links on any web page
Log in to websites
Bypass restrictions & bans by using User-Agents and Proxies
Internalize the concepts by scraping Amazon from start to finish and get ready to scrape more advanced websites.
In this video we cover the terms Python web scraping, spiders and web crawling. We also see an example of Amazon being scraped using Scrapy.
In this video we look behind the scenes of web scraping and see how our Scrapy Python program visits a website to extract data.
In this video we look at a file called robots.txt and how Scrapy treats it to make sure you are following a website's scraping policies. We also learn how to bypass those rules.
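For reference, the behaviour described here comes down to a single setting. A minimal sketch, assuming a project generated with scrapy startproject:

# settings.py (created by `scrapy startproject`)
# With this set to True, Scrapy downloads /robots.txt first and skips
# any URL that the file disallows.
ROBOTSTXT_OBEY = True

# Setting it to False tells Scrapy to ignore robots.txt entirely.
# Only do this when you are sure you are allowed to scrape the site.
# ROBOTSTXT_OBEY = False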
In this video we learn how to install Scrapy using my favourite IDE, PyCharm.
In this video we install Scrapy from the terminal so you can use it with Sublime Text, VS Code or any other IDE.
In this video we go over the project structure of Scrapy and look into the different files like Items, Pipelines and Settings.
In this video we will create our very first spider/crawler using Scrapy!
In this video we will run our very first spider/crawler and finally scrape a website using Scrapy.
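As a rough sketch of what such a first spider can look like (the name and URL here are just examples, quotes.toscrape.com being the practice site used later in the course):

import scrapy


class QuotesSpider(scrapy.Spider):
    # `name` is what you pass on the command line: scrapy crawl quotes
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # parse() receives the downloaded page; for now just prove the
        # request worked by yielding the page title.
        yield {"title": response.css("title::text").get()}

Running scrapy crawl quotes from the project folder prints the scraped item in the log.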
In this video we will scrape quotes from a website and select the elements that need to be scraped using CSS selectors. We will also learn about a tool called SelectorGadget that is going to make your life so much easier!
There are two types of selectors: CSS selectors and XPath selectors. One of the main uses of XPath selectors is getting the values of HTML tags and their attributes.
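For example, the two selector styles look like this in a scrapy shell session on quotes.toscrape.com (the class names below match that site):

# scrapy shell "http://quotes.toscrape.com"

# CSS selector: the text of every <span class="text"> element
quotes_css = response.css("span.text::text").getall()

# XPath selector: the same elements, plus attribute values such as href
quotes_xpath = response.xpath("//span[@class='text']/text()").getall()
tag_links = response.xpath("//a[@class='tag']/@href").getall()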
In this video we will be scraping quotes and authors from our website using the concepts we have learned in the previous python web scraping videos.
In this video we are going to learn how to put that extracted data into containers called items.
Now why exactly do we need to put it in containers? We have already extracted the data, so can't we just put it straight into some kind of database? The answer is yes, you can. But there can be a few problems when you store the data directly in the database, especially when you are working on big or multiple projects.
Scrapy spiders can return the extracted data as Python dictionaries, which is what we have been doing in our quotes project. But the problem with Python dictionaries is that they lack structure: it is easy to make a typo in a field name or return inconsistent data, especially in a larger project with many spiders.
So it's a good idea to move the scraped data into temporary containers first and then store it in the database. These temporary containers are called items.
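A minimal sketch of such a container for the quotes project (the field names are just the ones used here):

# items.py
import scrapy


class QuoteItem(scrapy.Item):
    # Declaring fields up front gives the data structure: assigning to a
    # field that is not declared here raises a KeyError instead of
    # silently accepting a typo.
    quote = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

In the spider you then fill the item (item = QuoteItem(); item["quote"] = ...) and yield it just like a dictionary.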
Now that we have successfully scraped data from the quotes website and stored it in these temporary containers, we can finally go to the next step and learn how to store the scraped data in some kind of database or file.
So in this video we are going to learn how to store this extracted data in JSON, XML and CSV files.
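The quickest way is the -o flag on the command line, for example scrapy crawl quotes -o quotes.json. In recent Scrapy versions the same thing can be configured once in settings.py (the file names below are just examples):

# settings.py -- export the scraped items to all three formats
FEEDS = {
    "quotes.json": {"format": "json"},
    "quotes.csv": {"format": "csv"},
    "quotes.xml": {"format": "xml"},
}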
Now before we go on to learn about storing the scraped data in our database, we need to learn about pipelines.
The flow of our scraped data looks something like this: the data first gets scraped by our spider, then it is stored inside the temporary containers called items, and from there it can be written to a JSON file. But if we want to send this data to a database we have to add one more step to the flow. After storing the data inside item containers, we send it through a pipeline, where the process_item method is automatically called and the item argument contains our scraped data.
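A minimal sketch of what that looks like in code (the class name is just an example):

# pipelines.py
class QuotesPipeline:
    def process_item(self, item, spider):
        # Called once for every item the spider yields. Clean, validate
        # or store the item here, then return it so that any later
        # pipelines receive it too.
        return item

Remember that a pipeline only runs once it is activated under ITEM_PIPELINES in settings.py.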
In this video we are going to learn about the basics of SQLite3 so that we can store the scraped data in a database.
In this video we will be integrating Scrapy with SQLite3 and finally storing the data inside a database using pipelines.
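A minimal sketch of such a pipeline, assuming the quote/author/tags fields from our item and a database file called quotes.db:

# pipelines.py
import sqlite3


class SQLitePipeline:
    def open_spider(self, spider):
        # Runs once when the spider starts: open the database and
        # create the table if it does not exist yet.
        self.conn = sqlite3.connect("quotes.db")
        self.cur = self.conn.cursor()
        self.cur.execute(
            "CREATE TABLE IF NOT EXISTS quotes (quote TEXT, author TEXT, tags TEXT)"
        )

    def process_item(self, item, spider):
        self.cur.execute(
            "INSERT INTO quotes VALUES (?, ?, ?)",
            (item["quote"], item["author"], ",".join(item["tags"])),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Runs once when the spider finishes
        self.conn.close()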
In this video we are going to learn how to store our scraped data inside a MySQL database. Before watching this video make sure that you have watched the previous two videos, in which we cover how to store data inside an SQLite database, because a lot of the concepts taught there are used again here and I don't want to go over them a second time.
The first thing we need to do is install MySQL on our computer. You can use the first link below if you are on Windows, or the second link if you are using Linux. I am going to cover only the Windows installation because the Linux installation is pretty easy.
Just click the link to start the installation. I go through it pretty quickly because it's simple. While installing, make sure that you choose the Developer Default option, because we want everything installed on our computer, including the connectors, routers, servers and MySQL Workbench, a GUI tool for creating and managing connections.
Also, when you are asked to choose the root password, you can pick whatever you want, but make sure you remember it because we are going to use the same password everywhere. If you forget this password it is difficult to reset.
Steps -
1) Install MySQL https://dev.mysql.com/downloads/installer/
Linux - https://support.rackspace.com/how-to/installing-mysql-server-on-ubuntu/
- Make sure you go with default options.
- Remember the root password
2) Install mysql-connector-python
3) Create a new connection using MySQL Workbench
4) Create a new database called myquotes using MySQL Workbench
5) Write the pipeline code (see the sketch after this list)
6) Scrape the data
7) View the data in MySQL Workbench
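A rough sketch of the pipeline code from step 5, assuming the mysql-connector-python package and the myquotes database created in MySQL Workbench (replace the password placeholder with your own root password):

# pipelines.py
import mysql.connector


class MySQLPipeline:
    def open_spider(self, spider):
        self.conn = mysql.connector.connect(
            host="localhost",
            user="root",
            password="your_root_password",  # the root password chosen during installation
            database="myquotes",
        )
        self.cur = self.conn.cursor()
        self.cur.execute(
            "CREATE TABLE IF NOT EXISTS quotes "
            "(quote TEXT, author VARCHAR(255), tags VARCHAR(255))"
        )

    def process_item(self, item, spider):
        self.cur.execute(
            "INSERT INTO quotes (quote, author, tags) VALUES (%s, %s, %s)",
            (item["quote"], item["author"], ",".join(item["tags"])),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()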
In this video we will be learning how to store the scraped data inside a MongoDB database using Python.
Instructions -
1) Install MongoDB - https://docs.mongodb.com/manual/administration/install-community/
Make sure you install everything including mongodb compass https://www.mongodb.com/products/compass
2) Create a folder /data/db
3) Run mongod.exe once
4) Install PyMongo in PyCharm
5) Make sure your pipeline is activated
6) Write the MongoDB pipeline code (see the sketch after this list)
7) View the saved data in MongoDB Compass
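A rough sketch of the pipeline code from step 6, using PyMongo against a local MongoDB server (the database and collection names are just examples):

# pipelines.py
import pymongo


class MongoDBPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.db = self.client["myquotes"]

    def process_item(self, item, spider):
        # MongoDB stores documents, so converting the item to a plain
        # dict is all the preparation it needs.
        self.db["quotes"].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()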
In this web crawling video we will learn how to follow links given on any webpage and also how to scrape multiple pages using Scrapy Python.
In this web scraping video we learn how to scrape multiple pages from websites that use pagination.
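A minimal sketch of a parse method that does both, with selectors that match quotes.toscrape.com:

def parse(self, response):
    # Scrape every quote on the current page
    for quote in response.css("div.quote"):
        yield {
            "text": quote.css("span.text::text").get(),
            "author": quote.css("small.author::text").get(),
        }

    # Pagination: follow the "Next" link until there is no next page
    next_page = response.css("li.next a::attr(href)").get()
    if next_page is not None:
        # response.follow resolves the relative URL for us
        yield response.follow(next_page, callback=self.parse)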
In this video we are going to learn how to log in to websites using Scrapy, and we will be using the quotes.toscrape.com website to practise. As you can see, on the left there is a login button, and clicking on it takes us to a form that asks for a username and a password.
Now why exactly are we learning to log in? A lot of websites put the content you might want to scrape behind a login page, so to reach that restricted data it's a good idea to learn how to log in to websites using Scrapy.
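A minimal sketch of the login flow on quotes.toscrape.com (that site accepts any username and password, so the credentials below are just placeholders):

import scrapy


class LoginSpider(scrapy.Spider):
    name = "login"
    start_urls = ["http://quotes.toscrape.com/login"]

    def parse(self, response):
        # from_response pre-fills the hidden fields of the login form
        # (such as the CSRF token) and only overrides the ones we pass in.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"username": "admin", "password": "admin"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Requests yielded from here carry the logged-in session cookies
        self.log(f"Logged in, now on {response.url}")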
By this point you already have a very good understanding of Scrapy. Now, just to internalize the concepts we have learned, we will be working on a complete real-life project: scraping amazon.com.
We will be scraping the books department of Amazon, more specifically the collection of books released in the last 30 days. If you are following along, you don't have to choose books; you can pick any department on Amazon.
I have already created the project 'AmazonTutorial' in PyCharm and installed Scrapy. If you don't remember how to install Scrapy you can always go back to my installing Scrapy video.
Now before we run our spider, I want to tell you that our program might not work. If you have scraped Amazon before, it's probably not going to work, but if this is your first time then the above code should work. The reason is that Amazon places restrictions on you when you try to scrape a lot of its data. We are going to bypass those restrictions by using something known as user-agents. But before we get into that, let's actually run our program.
In the last video we scraped the books section of Amazon and used something known as a user-agent to bypass the restrictions. So what exactly is this user-agent and how is it able to bypass the restrictions placed by Amazon?
Whenever a browser like Chrome or Firefox visits a website, that website asks for the identity of your browser. That identity is known as a user-agent. If we keep presenting the same identity to a website like Amazon, it places restrictions and sometimes bans the computer from visiting Amazon at all.
So there are two ways to trick Amazon. The first is to use a user-agent that Amazon allows. For example, Amazon has to let Google crawl its website if it wants its products to show up in Google Search. So we can replace our user-agent with Google's crawler identity, known as Googlebot, and trick Amazon into thinking that Google is crawling the website and not us. That is exactly what we did in the last video: we looked up Googlebot's user-agent string with a quick Google search and then used it as our own.
The other way is to keep rotating our user-agents. If Amazon identifies our computer by its user-agent, then we can use fake user-agents in rotation and trick Amazon into thinking that many different browsers are visiting the website instead of just one. That is what we will be learning in this video.
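One way to do that is a small downloader middleware; the sketch below is an illustration rather than the exact code from the video, and the user-agent strings are just examples:

# middlewares.py
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]


class RotateUserAgentMiddleware:
    def process_request(self, request, spider):
        # Downloader middlewares see every outgoing request, so this is
        # the place to swap the User-Agent header before it is sent.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)

The middleware has to be enabled under DOWNLOADER_MIDDLEWARES in settings.py; alternatively, a single fixed identity can be set with the USER_AGENT setting.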
In the last video we bypassed the scraping restrictions by using user-agents, and in this video we will learn how to bypass them by using something known as proxies.
Before we get into proxies, you need to understand what an IP address is. An IP address is basically the address of your computer on the network. You can find your own IP address by going to Google and typing 'What is my IP'.
Whenever you connect to a website you automatically reveal your IP address to it. A website like Amazon can recognize your IP address and ban you if you try to scrape a lot of its data. But what if we used another IP address instead of our own? Even better, we can use a whole set of IP addresses that are not our own and put them in rotation, so that every time we send a request to Amazon it goes out with a different IP address.
When you use an IP address that is not your own, that other address is known as a proxy. If we look up the definition of proxy on Google it says 'the authority to represent someone else'. So basically we are hiding our address and using someone else's.
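In Scrapy, a proxy is attached per request through request.meta; here is a minimal sketch with placeholder proxy addresses (substitute proxies you actually have access to):

# middlewares.py
import random

PROXIES = [
    "http://111.111.111.111:8080",
    "http://222.222.222.222:8080",
]


class RotateProxyMiddleware:
    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware sends the request through
        # whatever address is placed in request.meta["proxy"].
        request.meta["proxy"] = random.choice(PROXIES)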
In this last video we will scrape the rest of the pages on Amazon.
Thank you for joining me in this video series :)
In this video we go into Object Oriented Programming (OOP) and how to use it to create classes and objects. We also discover the difference between an instance and an object, and at the end we cover class variables and instance variables.
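A small sketch of the idea (the class and variable names here are just examples):

class Dog:
    # Class variable: shared by every Dog object
    species = "Canis familiaris"

    def __init__(self, name, age):
        # Instance variables: unique to each object
        self.name = name
        self.age = age


# Two objects (instances) created from the same class
buddy = Dog("Buddy", 3)
rex = Dog("Rex", 5)
print(buddy.species, buddy.name)  # Canis familiaris Buddy
print(rex.species, rex.name)      # Canis familiaris Rex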
In this video we are going to learn about inheritance and how one class can inherit the methods and attributes of another class by creating a subclass. We are also going to cover nested inheritance. Let's get started.
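A small sketch of inheritance and nested inheritance (again with example class names):

class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        return f"{self.name} makes a sound"


class Dog(Animal):
    # Dog inherits __init__ and speak from Animal, and overrides speak
    def speak(self):
        return f"{self.name} barks"


class Puppy(Dog):
    # Nested inheritance: Puppy -> Dog -> Animal, nothing new added
    pass


print(Puppy("Rex").speak())  # "Rex barks", inherited through Dog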