In early 2008, Scrapy was released into this world and it soon became the #1 web scraping tool for beginners. Why? Because it's simple enough for beginners yet powerful enough for the pros. Here are some of the use cases -
Ecommerce (Amazon) - Scrape product names, prices and reviews
Data - Get a huge collection of data/images for Machine Learning
Email Addresses - Big companies scrape them and use them for lead generation
Come learn with me and I'll show you how you can bend Scrapy to your will. This course is great for beginners in Python at any age and any level of computer literacy.
The goal is simple: learn Scrapy by working on real projects step-by-step while we explain every concept along the way. For the duration of this course we will take you on a journey and you're going to learn how to:
Scrape Data from nearly Any Website
Build your own Spiders from scratch for all types of Web Scraping purposes
Export the data that you have scraped to JSON, CSV and XML
Store the data in databases - SQLite3, MySQL and MongoDB
Create Web Crawlers and follow links on any web page
Log in to websites
Bypass restrictions & bans by using User-Agents and Proxies
Internalize the concepts by scraping Amazon from start to finish and get ready to scrape more advanced websites.
In this video we cover the terms Python web scraping, spiders and web crawling. We also see an example of Amazon being scraped using Scrapy.
In this video we look behind the scenes of web scraping and see how our Scrapy Python program visits a website to extract data.
In this video we look at a file called robots.txt and how Scrapy treats it to make sure you are following a website's scraping policies. We also learn how to bypass those rules.
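For reference, the behaviour described here comes down to a single setting. A minimal sketch, assuming a project generated with scrapy startproject:

# settings.py (created by `scrapy startproject`)
# With this set to True, Scrapy downloads /robots.txt first and skips
# any URL that the file disallows.
ROBOTSTXT_OBEY = True

# Setting it to False tells Scrapy to ignore robots.txt entirely.
# Only do this when you are sure you are allowed to scrape the site.
# ROBOTSTXT_OBEY = False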
In this video we learn how to install Scrapy using my favourite IDE, PyCharm.
In this video we install Scrapy from the terminal so you can use it with Sublime Text, VS Code or any other IDE.
In this video we go over the project structure of Scrapy and look into the different files like Items, Pipelines and Settings.
In this video we will create our very first spider/crawler using Scrapy!
In this video we will run our very first spider/crawler and finally scrape a website using Scrapy.
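As a rough sketch of what such a first spider can look like (the name and URL here are just examples, quotes.toscrape.com being the practice site used later in the course):

import scrapy


class QuotesSpider(scrapy.Spider):
    # `name` is what you pass on the command line: scrapy crawl quotes
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # parse() receives the downloaded page; for now just prove the
        # request worked by yielding the page title.
        yield {"title": response.css("title::text").get()}

Running scrapy crawl quotes from the project folder prints the scraped item in the log.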
In this video we will scrape quotes from a website and select the elements that need to be scraped using CSS selectors. We will also learn about a tool called SelectorGadget that is going to make your life so much easier!
There are two types of selectors: CSS selectors and XPath selectors. One of the main uses of XPath selectors is getting the values of HTML tags and their attributes.
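For example, the two selector styles look like this in a scrapy shell session on quotes.toscrape.com (the class names below match that site):

# scrapy shell "http://quotes.toscrape.com"

# CSS selector: the text of every <span class="text"> element
quotes_css = response.css("span.text::text").getall()

# XPath selector: the same elements, plus attribute values such as href
quotes_xpath = response.xpath("//span[@class='text']/text()").getall()
tag_links = response.xpath("//a[@class='tag']/@href").getall()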
In this video we will be scraping quotes and authors from our website using the concepts we have learned in the previous python web scraping videos.
In this video we are going to learn how to put that extracted data into containers called items.
Now why exactly do we need to put it in containers? We have already extracted the data, so can't we just put it straight into some kind of database? The answer is yes, you can. But there can be a few problems when you store the data directly in the database, especially when you are working on big or multiple projects.
Scrapy spiders can return the extracted data as Python dictionaries, which is what we have been doing in our quotes project. But the problem with Python dictionaries is that they lack structure: it is easy to make a typo in a field name or return inconsistent data, especially in a larger project with many spiders.
So it's a good idea to move the scraped data into temporary containers first and then store it in the database. These temporary containers are called items.
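A minimal sketch of such a container for the quotes project (the field names are just the ones used here):

# items.py
import scrapy


class QuoteItem(scrapy.Item):
    # Declaring fields up front gives the data structure: assigning to a
    # field that is not declared here raises a KeyError instead of
    # silently accepting a typo.
    quote = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

In the spider you then fill the item (item = QuoteItem(); item["quote"] = ...) and yield it just like a dictionary.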
Now that we have successfully scraped data from the quotes website and stored it in these temporary containers, we can finally go to the next step and learn how to store the scraped data in some kind of database or file.
So in this video we are going to learn how to store this extracted data in JSON, XML and CSV files.
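The quickest way is the -o flag on the command line, for example scrapy crawl quotes -o quotes.json. In recent Scrapy versions the same thing can be configured once in settings.py (the file names below are just examples):

# settings.py -- export the scraped items to all three formats
FEEDS = {
    "quotes.json": {"format": "json"},
    "quotes.csv": {"format": "csv"},
    "quotes.xml": {"format": "xml"},
}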
Now before we go on to learn about storing the scraped data in our database, we need to learn about pipelines.
The flow of our scraped data looks something like this: the data first gets scraped by our spider, then it is stored inside the temporary containers called items, and from there it can be written to a JSON file. But if we want to send this data to a database we have to add one more step to the flow. After storing the data inside item containers, we send it through a pipeline, where the process_item method is automatically called and the item argument contains our scraped data.
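A minimal sketch of what that looks like in code (the class name is just an example):

# pipelines.py
class QuotesPipeline:
    def process_item(self, item, spider):
        # Called once for every item the spider yields. Clean, validate
        # or store the item here, then return it so that any later
        # pipelines receive it too.
        return item

Remember that a pipeline only runs once it is activated under ITEM_PIPELINES in settings.py.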
In this video we are going to learn about the basics of SQLite3 so that we can store the scraped data in a database.
In this video we will be integrating Scrapy with SQLite3 and finally storing the data inside a database using pipelines.
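A minimal sketch of such a pipeline, assuming the quote/author/tags fields from our item and a database file called quotes.db:

# pipelines.py
import sqlite3


class SQLitePipeline:
    def open_spider(self, spider):
        # Runs once when the spider starts: open the database and
        # create the table if it does not exist yet.
        self.conn = sqlite3.connect("quotes.db")
        self.cur = self.conn.cursor()
        self.cur.execute(
            "CREATE TABLE IF NOT EXISTS quotes (quote TEXT, author TEXT, tags TEXT)"
        )

    def process_item(self, item, spider):
        self.cur.execute(
            "INSERT INTO quotes VALUES (?, ?, ?)",
            (item["quote"], item["author"], ",".join(item["tags"])),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Runs once when the spider finishes
        self.conn.close()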
In this video we are going to learn how to store our scraped data inside a MySQL database. Before watching this video make sure that you have watched the previous two videos, in which we cover how to store data inside an SQLite database, because a lot of the concepts taught there are used again here and I don't want to go over them a second time.
The first thing we need to do is install MySQL on our computer. You can use the first link below if you are on Windows, or the second link if you are using Linux. I am going to cover only the Windows installation because the Linux installation is pretty easy.
Just click the link to start the installation. I go through it pretty quickly because it's simple. While installing, make sure that you choose the Developer Default option, because we want everything installed on our computer, including the connectors, routers, servers and MySQL Workbench, a GUI tool for creating and managing connections.
Also, when you are asked to choose the root password, you can pick whatever you want, but make sure you remember it because we are going to use the same password everywhere. If you forget this password it is difficult to reset.
Steps -
1) Install MySQL https://dev.mysql.com/downloads/installer/
Linux - https://support.rackspace.com/how-to/installing-mysql-server-on-ubuntu/
- Make sure you go with default options.
- Remember the root password
2) Install mysql-connector-python
3) Create a new connection using MySQL Workbench
4) Create a new database called myquotes using MySQL Workbench
5) Write the pipeline code (see the sketch after this list)
6) Scrape the data
7) View the data in MySQL Workbench
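A rough sketch of the pipeline code from step 5, assuming the mysql-connector-python package and the myquotes database created in MySQL Workbench (replace the password placeholder with your own root password):

# pipelines.py
import mysql.connector


class MySQLPipeline:
    def open_spider(self, spider):
        self.conn = mysql.connector.connect(
            host="localhost",
            user="root",
            password="your_root_password",  # the root password chosen during installation
            database="myquotes",
        )
        self.cur = self.conn.cursor()
        self.cur.execute(
            "CREATE TABLE IF NOT EXISTS quotes "
            "(quote TEXT, author VARCHAR(255), tags VARCHAR(255))"
        )

    def process_item(self, item, spider):
        self.cur.execute(
            "INSERT INTO quotes (quote, author, tags) VALUES (%s, %s, %s)",
            (item["quote"], item["author"], ",".join(item["tags"])),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()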
In this video we will be learning how to store the scraped data inside a MongoDB database using Python.
Instructions -
1) Install MongoDB - https://docs.mongodb.com/manual/administration/install-community/
Make sure you install everything including mongodb compass https://www.mongodb.com/products/compass
2) Create a folder /data/db
3) Run mongod.exe once
4) Install PyMongo in PyCharm
5) Make sure your pipeline is activated
6) Write the MongoDB pipeline code (see the sketch after this list)
7) View the saved data in MongoDB Compass
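A rough sketch of the pipeline code from step 6, using PyMongo against a local MongoDB server (the database and collection names are just examples):

# pipelines.py
import pymongo


class MongoDBPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.db = self.client["myquotes"]

    def process_item(self, item, spider):
        # MongoDB stores documents, so converting the item to a plain
        # dict is all the preparation it needs.
        self.db["quotes"].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()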
In this web crawling video we will learn how to follow links given on any webpage and also how to scrape multiple pages using Scrapy Python.
In this web scraping video we learn how to scrape multiple pages from websites that use pagination.
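A minimal sketch of a parse method that does both, with selectors that match quotes.toscrape.com:

def parse(self, response):
    # Scrape every quote on the current page
    for quote in response.css("div.quote"):
        yield {
            "text": quote.css("span.text::text").get(),
            "author": quote.css("small.author::text").get(),
        }

    # Pagination: follow the "Next" link until there is no next page
    next_page = response.css("li.next a::attr(href)").get()
    if next_page is not None:
        # response.follow resolves the relative URL for us
        yield response.follow(next_page, callback=self.parse)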
In this video we are going to learn how to log in to websites using Scrapy, and we will be using the quotes.toscrape.com website to practise. As you can see, on the left there is a login button, and clicking on it takes us to a form that asks for a username and a password.
Now why exactly are we learning to log in? A lot of websites put the content you might want to scrape behind a login page, so to reach that restricted data it's a good idea to learn how to log in to websites using Scrapy.
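A minimal sketch of the login flow on quotes.toscrape.com (that site accepts any username and password, so the credentials below are just placeholders):

import scrapy


class LoginSpider(scrapy.Spider):
    name = "login"
    start_urls = ["http://quotes.toscrape.com/login"]

    def parse(self, response):
        # from_response pre-fills the hidden fields of the login form
        # (such as the CSRF token) and only overrides the ones we pass in.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"username": "admin", "password": "admin"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Requests yielded from here carry the logged-in session cookies
        self.log(f"Logged in, now on {response.url}")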
By this point you already have a very good understanding of Scrapy. Now, just to internalize the concepts we have learned, we will be working on a complete real-life project: scraping amazon.com.
We will be scraping the books department of Amazon, more specifically the collection of books released in the last 30 days. If you are following along, you don't have to choose books; you can pick any department on Amazon.
I have already created the project 'AmazonTutorial' in PyCharm and installed Scrapy. If you don't remember how to install Scrapy you can always go back to my installing Scrapy video.
Now before we run our spider, I want to tell you that our program might not work. If you have scraped Amazon before, it's probably not going to work, but if this is your first time then the above code should work. The reason is that Amazon places restrictions on you when you try to scrape a lot of its data. We are going to bypass those restrictions by using something known as user-agents. But before we get into that, let's actually run our program.
In the last video we scraped the books section of Amazon and used something known as a user-agent to bypass the restrictions. So what exactly is this user-agent and how is it able to bypass the restrictions placed by Amazon?
Whenever a browser like Chrome or Firefox visits a website, that website asks for the identity of your browser. That identity is known as a user-agent. If we keep presenting the same identity to a website like Amazon, it places restrictions and sometimes bans the computer from visiting Amazon at all.
So there are two ways to trick Amazon. The first is to use a user-agent that Amazon allows. For example, Amazon has to let Google crawl its website if it wants its products to show up in Google Search. So we can replace our user-agent with Google's crawler identity, known as Googlebot, and trick Amazon into thinking that Google is crawling the website and not us. That is exactly what we did in the last video: we looked up Googlebot's user-agent string with a quick Google search and then used it as our own.
The other way is to keep rotating our user-agents. If Amazon identifies our computer by its user-agent, then we can use fake user-agents in rotation and trick Amazon into thinking that many different browsers are visiting the website instead of just one. That is what we will be learning in this video.
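One way to do that is a small downloader middleware; the sketch below is an illustration rather than the exact code from the video, and the user-agent strings are just examples:

# middlewares.py
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]


class RotateUserAgentMiddleware:
    def process_request(self, request, spider):
        # Downloader middlewares see every outgoing request, so this is
        # the place to swap the User-Agent header before it is sent.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)

The middleware has to be enabled under DOWNLOADER_MIDDLEWARES in settings.py; alternatively, a single fixed identity can be set with the USER_AGENT setting.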
In the last video we bypassed the scraping restrictions by using user-agents, and in this video we will learn how to bypass them by using something known as proxies.
Before we get into proxies, you need to understand what an IP address is. An IP address is basically the address of your computer on the network. You can find your own IP address by going to Google and typing 'What is my IP'.
Whenever you connect to a website you automatically reveal your IP address to it. A website like Amazon can recognize your IP address and ban you if you try to scrape a lot of its data. But what if we used another IP address instead of our own? Even better, we can use a whole set of IP addresses that are not our own and put them in rotation, so that every time we send a request to Amazon it goes out with a different IP address.
When you use an IP address that is not your own, that other address is known as a proxy. If we look up the definition of proxy on Google it says 'the authority to represent someone else'. So basically we are hiding our address and using someone else's.
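In Scrapy, a proxy is attached per request through request.meta; here is a minimal sketch with placeholder proxy addresses (substitute proxies you actually have access to):

# middlewares.py
import random

PROXIES = [
    "http://111.111.111.111:8080",
    "http://222.222.222.222:8080",
]


class RotateProxyMiddleware:
    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware sends the request through
        # whatever address is placed in request.meta["proxy"].
        request.meta["proxy"] = random.choice(PROXIES)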
In this last video we will scrape the rest of the pages on Amazon.
Thank you for joining me in this video series :)
In this video we go into Object Oriented Programming (OOP) and how to use it to create classes and objects. We also discover the difference between an instance and an object, and at the end we cover class variables and instance variables.
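A small sketch of the idea (the class and variable names here are just examples):

class Dog:
    # Class variable: shared by every Dog object
    species = "Canis familiaris"

    def __init__(self, name, age):
        # Instance variables: unique to each object
        self.name = name
        self.age = age


# Two objects (instances) created from the same class
buddy = Dog("Buddy", 3)
rex = Dog("Rex", 5)
print(buddy.species, buddy.name)  # Canis familiaris Buddy
print(rex.species, rex.name)      # Canis familiaris Rex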
In this video we are going to learn about inheritance and how one class can inherit the methods and attributes of another class by creating a subclass. We are also going to cover nested inheritance. Let's get started.
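A small sketch of inheritance and nested inheritance (again with example class names):

class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        return f"{self.name} makes a sound"


class Dog(Animal):
    # Dog inherits __init__ and speak from Animal, and overrides speak
    def speak(self):
        return f"{self.name} barks"


class Puppy(Dog):
    # Nested inheritance: Puppy -> Dog -> Animal, nothing new added
    pass


print(Puppy("Rex").speak())  # "Rex barks", inherited through Dog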