Web Scraping with Python

What is Web Scraping?

Web scraping is a technique for extracting information from websites with software. It mostly focuses on transforming unstructured data on the web (HTML) into structured data (a database or spreadsheet). Python has several options for HTML scraping. They include:
  1. BeautifulSoup
  2. Mechanize
  3. Scrapemark
  4. Scrapy
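
To make the "unstructured HTML to structured data" idea concrete, here is a minimal sketch that uses only the Python standard library rather than any of the libraries listed above. The TableParser class, the SAMPLE_HTML string, and the variable names are made up for illustration. It collects the cells of an HTML table into rows and writes them out as CSV text.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>Library</th><th>Purpose</th></tr>
  <tr><td>BeautifulSoup</td><td>Parsing</td></tr>
  <tr><td>Scrapy</td><td>Crawling</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows

# Write the structured rows as CSV (spreadsheet-friendly) text.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The libraries described in the following sections take care of most of this bookkeeping for you and cope far better with messy real-world HTML.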

BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favourite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it commonly saves programmers hours or days of work. It helps you pull particular content from a webpage, remove the HTML markup, and save the information. You can scrape information that appears on web pages as tables, lists, or paragraphs, and add filters to extract only the specific information you need. Beautiful Soup does not fetch pages itself, so it is usually combined with a module that can fetch URLs, such as urllib2 (in Python 3, its functionality lives in urllib.request).
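
As a rough illustration of navigating and filtering the parse tree, the sketch below parses a small inline HTML string with the bs4 package and Python's built-in "html.parser". The SAMPLE_HTML content, the class names in it, and the variable names are invented for this example. On a real site you would first fetch the page (for instance with urllib.request.urlopen in Python 3) and pass the downloaded HTML to BeautifulSoup instead.

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for a fetched web page.
SAMPLE_HTML = """
<html><body>
  <h1>Fruit prices</h1>
  <ul id="fruits">
    <li class="item"><a href="/apple">Apple</a> <span class="price">1.20</span></li>
    <li class="item"><a href="/pear">Pear</a> <span class="price">0.80</span></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Navigate straight to the first <h1> and strip the markup.
title = soup.h1.get_text(strip=True)

# Filter: only <li> elements with class "item".
items = []
for li in soup.find_all("li", class_="item"):
    link = li.find("a")
    price = li.find("span", class_="price")
    items.append({
        "name": link.get_text(strip=True),
        "url": link["href"],
        "price": float(price.get_text(strip=True)),
    })

print(title)
print(items)
```

The resulting list of dictionaries is already structured data, ready to be written to a spreadsheet or a database.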

Mechanize

Mechanize is a very useful Python module for navigating through web forms. It acts like a browser, allowing you to do web scraping, functional testing of websites, and things no one has thought of yet.

Scrapemark

Scrapemark is a super-convenient way to scrape webpages in Python. It utilizes an HTML-like markup language to extract the data you need. You get your results as plain old Python lists and dictionaries. Scrapemark internally utilizes regular expressions and is super-fast.

Scrapy

Scrapy is a free and open-source web crawling framework for large-scale web scraping, written in Python. It gives you all the tools you need to efficiently extract data from websites, process it as you want, and store it in your preferred structure and format.