Data scraping & NLP?

Question

Shilpa Kancharla

April 17, 2019, 13:43

I'm scraping data from Bing search results (for non-commercial purposes, of course) in Python using BeautifulSoup. I've entered the name of an Indian dessert, 'rasmalai', as the search term. The code I'm using returns the title and a description of each web page. I've also extracted the links for the results. Here is the code I used:

from bs4 import BeautifulSoup
from urllib.parse import quote_plus
from urllib.request import Request, urlopen

def bing_search(query):
    # Build the search URL with the query safely URL-encoded.
    address = "http://www.bing.com/search?q=%s" % quote_plus(query)

    # Send a browser-like User-Agent header with the request.
    getRequest = Request(address, None, {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 Chrome/65.0.3325.162 Safari/537.36'})

    urlfile = urlopen(getRequest)
    htmlResult = urlfile.read(200000)
    urlfile.close()

    soup = BeautifulSoup(htmlResult, 'html.parser')

    # Remove every <span> element from the parsed page.
    for s in soup('span'):
        s.extract()
    #unwantedTags = ['a', 'strong', 'cite']
    #for tag in unwantedTags:
        #for match in soup.find_all(tag):
           # match.unwrap()

    # Each organic search result is an <li class="b_algo"> element.
    results = soup.find_all('li', {"class": "b_algo"})
    for result in results:
        # Replace non-breaking spaces with regular spaces before printing.
        print("# TITLE: " + str(result.find('h2')).replace("\xa0", " ") + "\n#")
        print("# DESCRIPTION: " + str(result.find('p')).replace("\xa0", " "))
        print("# ___________________________________________________________\n#")

    return results

if __name__ == '__main__':
    links = bing_search('rasmalai')

Now that I have the links, web page title, and a short description, I want to extract keywords using NLP. In the end, I'd like to produce a CSV file with the dish name and associated keywords. Could someone guide me to some resources on how to do this part?

Thank you so much in advance.

Topic scraping csv nlp python

Category Data Science

the_cat_lady · Accepted Answer · April 17, 2019, 13:43

NLTK

A great starting point for keyword extraction is the NLTK (Natural Language Toolkit) library. To extract keywords, you will probably need to tokenize your data, splitting the text into individual word tokens, and drop the most common or unimportant words, known as "stopwords". Assuming you're searching for keywords across a large number of query results, you can then identify the most important terms in each document using TF-IDF (term frequency–inverse document frequency), which scores a term highly when it is frequent in one document but rare across the others. There are tools and tutorials for this in the NLTK documentation. Sort the resulting token scores and take the highest-scoring tokens; these are a good starting set of keywords.
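
To make those steps concrete, here is a rough sketch of the tokenize, filter, score, and sort pipeline, ending with the dish-plus-keywords CSV the question asks for. It is only an illustration under some assumptions: it uses a regex tokenizer and a tiny hand-written stopword list in place of NLTK's `word_tokenize` and `stopwords.words('english')` so that it runs without downloading any NLTK data, the TF-IDF formula is the plain textbook variant, the three descriptions are made up to stand in for scraped snippets, and the function names (`tokenize`, `tfidf_keywords`, `keywords_to_csv`) are hypothetical.

```python
import csv
import io
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption for this sketch);
# NLTK's stopwords corpus is far more complete.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "with",
             "to", "it", "this", "for", "or"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def tfidf_keywords(documents, top_n=3):
    """Return the top_n highest-scoring TF-IDF tokens for each document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    keywords = []
    for toks in tokenized:
        counts = Counter(toks)
        # TF = share of the document's tokens; IDF = log(N / document frequency).
        scores = {
            tok: (count / len(toks)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        }
        # Sort by score (descending), then alphabetically for a stable order.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_n])
    return keywords

def keywords_to_csv(dish, keyword_lists):
    """Build CSV text with one row per document: dish name, then its keywords."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["dish", "keywords"])
    for kws in keyword_lists:
        writer.writerow([dish, " ".join(kws)])
    return buffer.getvalue()

# Made-up descriptions standing in for scraped result snippets.
descriptions = [
    "Rasmalai is a Bengali dessert of soft paneer discs soaked in sweetened milk.",
    "This rasmalai recipe uses saffron and cardamom to flavor the milk.",
    "Rasmalai is served chilled and garnished with pistachios.",
]
keywords = tfidf_keywords(descriptions)
print(keywords_to_csv("rasmalai", keywords))
```

Note that the search term itself ('rasmalai') appears in every description, so its inverse document frequency is zero and it never ranks as a keyword, which is usually what you want here. To save to disk instead of printing, the same `csv.writer` calls work on a file opened with `open(path, "w", newline="")`.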
