How to scrape a website with a searchbar

How do I scrape a website that basically looks like Google, with just a giant search bar in the middle of the screen? From it you can search for various companies and their stats.

I have a list of 1000 companies I want to get information about. I want a bot to search for each company from my list in the search bar, open that company's info window, and extract a certain company code that exists on each company's page.

Is there any easy and (of course) legal way to do it?

Topic scraping crawling data-mining

Category Data Science


Here are some scraping services that offer free credits or a free trial:

https://www.scraping-bot.io

https://www.scrapingbee.com/

https://www.scraperapi.com/

https://www.octoparse.com/

There are also good companies that build custom scrapers tailored to each client's requirements:

https://www.zyte.com/

https://apify.com/

https://data-ox.com/

https://www.diffbot.com/


I would suggest using a combination of rvest and RSelenium, depending on how the web page is set up.

  • RSelenium to navigate the page (if needed)
  • rvest to scrape the data from the page
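The "scrape the data from the page" half can be sketched in Python, the language used for the other code in this thread, with only the standard library. This is an illustration under assumptions, not the rvest API: the `company-code` id and the sample HTML are hypothetical, and the HTML string would in practice come from whatever navigates the site (in Python, Selenium plays the role RSelenium plays in R).

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collect the text inside the first element that has a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target element
        self.done = False   # True once the target element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_code(html, element_id="company-code"):
    """Return the stripped text of the element, or None if it is absent."""
    parser = TextById(element_id)
    parser.feed(html)
    text = "".join(parser.parts).strip()
    return text or None


# Hypothetical detail page; the real markup has to be inspected in the browser.
sample = ('<html><body><h1>Acme Corp</h1>'
          '<span id="company-code"> <b>AC</b>-1234 </span></body></html>')
print(extract_code(sample))
```

The parser assumes reasonably well-nested HTML; for messy real-world pages a dedicated library is usually the more robust choice.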

I think the easiest way to do that is with an automated tool, that is, a bot.

Recently I found one called Octoparse and the solution goes like this:

  1. Drop a “Loop Item” into the Workflow Designer in the bot.

  2. Then select a “loop mode” > Choose “text list”

  3. Enter the terms you want to search in the search bar.

  4. Next, click on the search box. Choose “Enter text value”.

  5. Drag “Enter text value” into the “Loop Item” box so that the program will loop to enter the keywords, and automatically search them in the search box.

  6. Then select “Use current loop text to fill the text box”. Then click “Save”.

  7. Next, capture the term entered. Click the search box and select “Extract value of this item”.

The search item you just captured will be added to the extracted result.

  8. Click the search button > choose “Click an item”.

  9. The information I want is on the detail page, so I need to create a list of items to get to that page. Click on the title > select “Create a list of items” > “Add current item to the list” > “Continue to edit the list”.

Click on the second title > “Add current item to the list” again. Then click “Loop”.

You may check the link to see if that's what you want.


Thanks guys, but I found a program called Mozenda that even idiots like me understand :) You basically click on the search bar, import an Excel list of the things you want to search for, and then just click on the data field you want to extract.


  1. I would suggest reading about HTTP request methods, specifically GET and POST. You can pass parameters in the query string and open the company page directly.

    For example:

    http://google.com/search?q=GET+and+POST

    where q=GET+and+POST is a parameter.

  2. Once you have the page, you can parse it with your favorite library (for example, BeautifulSoup).
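As a rough sketch of step 1, the search URLs for a list of companies could be built like this. The endpoint `https://example.com/search` and the parameter name `q` are placeholders, not the real site's; the actual values can usually be found by running one search by hand and looking at the address bar or the form's markup.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter name; replace with the real site's values.
SEARCH_URL = "https://example.com/search"
QUERY_PARAM = "q"


def build_search_url(company):
    """Return a GET URL that searches for one company name."""
    # urlencode escapes spaces, ampersands and non-ASCII characters.
    return SEARCH_URL + "?" + urlencode({QUERY_PARAM: company})


def build_post_body(company):
    """Return the same parameter encoded as a form body for a POST request."""
    return urlencode({QUERY_PARAM: company}).encode("ascii")


companies = ["Acme Corp", "Smith & Sons"]
urls = [build_search_url(name) for name in companies]
for url in urls:
    print(url)
```

If the site's search form submits with POST instead of GET, the body from `build_post_body` can be passed as the `data` argument of `urllib.request.Request`.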

EXAMPLE:

Getting the number of results for a couple of Google queries with Python 3 and BeautifulSoup:

import urllib.request
from urllib.parse import quote_plus

from bs4 import BeautifulSoup

# List of Google queries I want to make
desired_google_queries = ['Word', 'lifdsst', 'yvou', 'should', 'load', 'from']

for query in desired_google_queries:
    # Construct the HTTP query, URL-encoding the search term
    url = 'http://google.com/search?q=' + quote_plus(query)
    # Send a User-Agent header to avoid a 403 error
    req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})
    with urllib.request.urlopen(req) as response:
        html = response.read()
    # Parse the response
    soup = BeautifulSoup(html, 'html.parser')
    # Extract the number of results; the element is missing if the page layout differs
    result_stats = soup.find(id='resultStats')
    print(result_stats.get_text() if result_stats else 'resultStats not found')
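For the 1000-company list from the question, the same loop could be wrapped in a small batch driver that pauses between requests and saves the results. This is a minimal sketch, not a finished tool: `scrape_companies` and its `lookup` argument are names made up here, and `lookup` stands for whatever function fetches one company's page and returns its code (or None). The demo uses a fake lookup so it runs without network access.

```python
import csv
import io
import time


def scrape_companies(companies, lookup, out_file, delay_seconds=1.0):
    """Look up each company, write (company, code) rows as CSV, return the hit count."""
    writer = csv.writer(out_file)
    writer.writerow(["company", "code"])
    found = 0
    for i, name in enumerate(companies):
        if i and delay_seconds > 0:
            time.sleep(delay_seconds)  # pause between requests to go easy on the server
        try:
            code = lookup(name)
        except Exception as exc:  # one failed lookup should not abort the whole run
            print(f"lookup failed for {name!r}: {exc}")
            code = None
        if code is not None:
            found += 1
        writer.writerow([name, code or ""])
    return found


# Demo with a fake lookup table instead of real HTTP requests.
fake_codes = {"Acme Corp": "AC-1234", "Globex": "GX-0042"}
buffer = io.StringIO()
count = scrape_companies(["Acme Corp", "Globex", "Unknown Ltd"],
                         fake_codes.get, buffer, delay_seconds=0)
print(count)
print(buffer.getvalue())
```

For a real run, `out_file` would be a file opened with `open("codes.csv", "w", newline="")`, and companies with an empty code column can be retried or checked by hand afterwards.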
