Data scraping & NLP?
I'm scraping Bing search results (for non-commercial purposes, of course) in Python using BeautifulSoup. My search term is 'rasmalai', the name of an Indian dessert. The code below prints the title and description of each result page. I've also extracted the links for the results. Here is the code I used:
from bs4 import BeautifulSoup
import urllib, urllib2

def bing_search(query):
    address = "http://www.bing.com/search?q=%s" % (urllib.quote_plus(query))
    getRequest = urllib2.Request(address, None, {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 Chrome/65.0.3325.162 Safari/537.36'})
    urlfile = urllib2.urlopen(getRequest)
    htmlResult = urlfile.read(200000)
    urlfile.close()
    soup = BeautifulSoup(htmlResult)
    [s.extract() for s in soup('span')]
    #unwantedTags = ['a', 'strong', 'cite']
    #for tag in unwantedTags:
    #    for match in soup.findAll(tag):
    #        match.replaceWithChildren()
    results = soup.findAll('li', {"class" : "b_algo" })
    for result in results:
        print "# TITLE: " + str(result.find('h2')).replace(" ", " ") + "\n#"
        print "# DESCRIPTION: " + str(result.find('p')).replace(" ", " ")
        print "# ___________________________________________________________\n#"
    return results

if __name__ == '__main__':
    links = bing_search('rasmalai')
Now that I have the links, web page title, and a short description, I want to extract keywords using NLP. In the end, I'd like to produce a CSV file with the dish name and associated keywords. Could someone guide me to some resources on how to do this part?
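To make the goal concrete, here is a minimal sketch of the kind of output I'm after. Unlike the Python 2 scraper above, it is written for Python 3 and uses only the standard library. It is a plain word-frequency count rather than real NLP, and the stopword list, the sample descriptions, and the function names `extract_keywords` and `write_keywords_csv` are all placeholders I made up for illustration.

```python
import csv
import io
import re
from collections import Counter

# Tiny hand-picked stopword list, for illustration only (an assumption);
# a real run would use a fuller list, e.g. one shipped with an NLP library.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "with",
             "for", "it", "this", "that", "on", "as", "are", "or"}


def extract_keywords(texts, dish_name, top_n=5):
    """Return the top_n most frequent non-stopword tokens across texts."""
    counts = Counter()
    dish = dish_name.lower()
    for text in texts:
        # Lowercase and keep alphabetic runs only; this drops digits and punctuation.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(
            t for t in tokens
            if t not in STOPWORDS and t != dish and len(t) > 2
        )
    return [word for word, _ in counts.most_common(top_n)]


def write_keywords_csv(rows, out_file):
    """Write (dish, keywords) pairs as CSV, one row per dish."""
    writer = csv.writer(out_file)
    writer.writerow(["dish", "keywords"])
    for dish, keywords in rows:
        # Join keywords with ';' so they stay inside a single CSV column.
        writer.writerow([dish, ";".join(keywords)])


# Hypothetical descriptions standing in for the scraped result snippets.
sample_descriptions = [
    "Rasmalai is a Bengali dessert made with soft paneer balls soaked in sweet milk.",
    "A rich dessert of cottage cheese dumplings in saffron milk.",
    "Easy rasmalai recipe with milk, saffron and cardamom.",
]

keywords = extract_keywords(sample_descriptions, "rasmalai", top_n=3)

# Write to an in-memory buffer here; open("keywords.csv", "w", newline="")
# would be used for a real file.
buffer = io.StringIO()
write_keywords_csv([("rasmalai", keywords)], buffer)
print(buffer.getvalue())
```

Raw frequency counts like this tend to favour generic words, which is why I'm asking about proper keyword-extraction techniques to replace the `extract_keywords` step.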
Thank you so much in advance.