Web scraping / web data extraction from websites using Python

Scraping table data from a website

Web scraping: Web scraping is a technique for fetching data from websites. It is also known as web data extraction.

 

There are many ways to do web scraping. Table data can be fetched with urllib and BeautifulSoup, but that approach is lengthy, complex, and time-consuming.

 

To avoid these disadvantages, we can do the same task with the pandas library.

             read_html(): read_html() is a pandas function that extracts the tables from a web page when given its URL. It reads the HTML tables into a list of DataFrame objects.


             Syntax: pandas.read_html(url, other parameters)         
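As a minimal sketch of the syntax above, the example below parses a small HTML snippet instead of a live URL so that it runs offline. The table contents are made up for illustration, and the sketch assumes pandas and an HTML parser such as lxml are installed. For a real page, the URL string would be passed in place of the StringIO object.

```python
from io import StringIO

import pandas as pd

# Hypothetical page content; a real call would pass a URL instead.
html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>90</td></tr>
  <tr><td>Bob</td><td>85</td></tr>
</table>
"""

# read_html always returns a list, with one DataFrame per table found.
tables = pd.read_html(StringIO(html))
df = tables[0]
print(len(tables))
print(df)
```

Because the result is a list even when the page has a single table, the table we want is picked out by index.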






        Important Parameters:

  • match: A string or regular expression. Only tables containing text that matches it are returned.
  • header: The row (or rows) to use as the column headers.
  • skiprows: The number of rows to skip.
  • attrs: A dictionary of HTML attributes used to identify the table, for example its id or class.
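The header and skiprows parameters could be combined as in the sketch below. The table is invented for illustration: its first row is a note rather than data, so skiprows=1 drops it and header=0 takes the next remaining row as the column names. It assumes pandas and an HTML parser such as lxml are installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table whose first row is a note, not a header.
html = """
<table>
  <tr><td>Report generated for demo</td><td></td></tr>
  <tr><td>City</td><td>Population</td></tr>
  <tr><td>Pune</td><td>100</td></tr>
  <tr><td>Nagpur</td><td>50</td></tr>
</table>
"""

# Skip the note row, then use the first remaining row as the header.
df = pd.read_html(StringIO(html), skiprows=1, header=0)[0]
print(df)
```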

  • Finding a specific table: 
         If a page contains many tables, finding a particular one can be tedious. read_html() offers two ways to narrow the result.
          
         1.    Pass the match parameter to read_html(). Only tables containing text that matches the given string or regular expression are returned.
 
         2.    Pass the attrs parameter to read_html(). It takes a dictionary of HTML attributes that identify the table.
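The two approaches above can be sketched as follows. The page content, the table ids, and the column names are hypothetical, and the sketch assumes pandas and an HTML parser such as lxml are installed. Note that the keyword in pandas is spelled attrs.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables, each with its own id attribute.
html = """
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Pen</td><td>2</td></tr>
</table>
<table id="stock">
  <tr><th>Item</th><th>Quantity</th></tr>
  <tr><td>Pen</td><td>40</td></tr>
</table>
"""

# 1. match: keep only tables whose text matches the string or regex.
by_text = pd.read_html(StringIO(html), match="Quantity")

# 2. attrs: keep only tables whose HTML attributes match the dictionary.
by_attr = pd.read_html(StringIO(html), attrs={"id": "prices"})

print(by_text[0])
print(by_attr[0])
```

Both calls still return a list, here with one DataFrame each.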

      





How to inspect a table:  
Before using the attrs parameter, we need to inspect the table to find its attributes. To inspect a table, follow these steps. 

  1. Right-click on the target table.
  2. Select the Inspect option.
  3. In the inspector, find the tag for that table and note the attributes (such as id or class) that identify it.
·   Parsing date columns with parse_dates: By default, a date column is read as the object data type. To read it as dates, we can use the parse_dates argument to specify a list of date columns.
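A short sketch of parse_dates on an invented table follows. The column names date and value are assumptions for illustration, and pandas plus an HTML parser such as lxml are assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table with an ISO-formatted date column.
html = """
<table>
  <tr><th>date</th><th>value</th></tr>
  <tr><td>2021-01-05</td><td>10</td></tr>
  <tr><td>2021-02-10</td><td>20</td></tr>
</table>
"""

# Without parse_dates the date column stays as text.
default = pd.read_html(StringIO(html))[0]

# With parse_dates the listed columns are converted to datetimes.
parsed = pd.read_html(StringIO(html), parse_dates=["date"])[0]

print(default.dtypes)
print(parsed.dtypes)
```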


·   Explicit typecasting with converters: Sometimes you may want to typecast a column explicitly to ensure dtype integrity. The converters argument does this by mapping column names to conversion functions.
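One case where converters helps is numeric-looking text with leading zeros. The sketch below uses an invented table with a zip column, and it assumes pandas and an HTML parser such as lxml are installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table where the zip codes start with zeros.
html = """
<table>
  <tr><th>id</th><th>zip</th></tr>
  <tr><td>1</td><td>00501</td></tr>
  <tr><td>2</td><td>02134</td></tr>
</table>
"""

# By default the column is inferred as integers, so the zeros are lost.
default = pd.read_html(StringIO(html))[0]

# converters maps a column name to a function applied to each cell.
kept = pd.read_html(StringIO(html), converters={"zip": str})[0]

print(default["zip"].tolist())
print(kept["zip"].tolist())
```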

·   Handling missing values: By default, empty strings are treated as missing values and read as NaN. To keep them as empty strings, we can set the keep_default_na argument to False.


Sometimes a table uses other characters to represent missing values. If we know which characters are used, we can handle them with the na_values parameter.
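The missing-value options can be compared side by side, as in this sketch. The table and the "?" placeholder are made up for illustration, and pandas plus an HTML parser such as lxml are assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table: one empty cell and one "?" placeholder.
html = """
<table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td>Ann</td><td></td></tr>
  <tr><td>Raj</td><td>?</td></tr>
  <tr><td>Lee</td><td>7</td></tr>
</table>
"""

# Default behaviour: the empty cell becomes NaN, "?" stays as text.
default = pd.read_html(StringIO(html))[0]

# keep_default_na=False: the empty cell is kept as an empty string.
kept = pd.read_html(StringIO(html), keep_default_na=False)[0]

# na_values: also treat "?" as a missing value.
custom = pd.read_html(StringIO(html), na_values=["?"])[0]

print(default["score"].isna().tolist())
print(kept["score"].tolist())
print(custom["score"].isna().tolist())
```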

·   Malicious web scraping: Web scraping is considered malicious when data is extracted without the permission of the website owner. The two most common use cases are price scraping and content theft.

 

   Conclusion: In this article, we learned how to scrape tabular data from websites with the read_html() function. 


         Thank you very much for reading.😊  Please read other articles.


    


