Is it possible to build a regression model for predicting movie gross using sections of their Wikipedia pages?

I got this as an assignment from a company recruiter, and I've successfully scraped a dataset of about 650 movies with their 'Plot', 'Music' and 'Marketing' sections and their gross. I've tried TF-IDF and count vectorizers and performed LSA/PCA to reduce the dimensionality, which is originally around 20k terms. This is really boggling me: with only 650 instances, I guess the number of features should be around 100, or at least < 600, but that is a drastic reduction of dimensions using PCA …
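A minimal sketch of the pipeline described above, assuming a DataFrame df with the scraped text in a 'Plot' column and the target in 'gross' (column names are illustrative); TruncatedSVD is the usual LSA choice here because, unlike plain PCA, it works on the sparse TF-IDF matrix directly:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    model = make_pipeline(
        TfidfVectorizer(max_features=20000, stop_words='english'),
        TruncatedSVD(n_components=100, random_state=0),  # ~20k terms -> 100 dims
        Ridge(alpha=1.0),  # regularization matters with only ~650 rows
    )

    # Predicting log-gross usually behaves better than raw dollar amounts.
    scores = cross_val_score(model, df['Plot'], np.log1p(df['gross']),
                             cv=5, scoring='r2')
    print(scores.mean())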
Category: Data Science

How can I "automate" a search for global gross based on movie name and put my search results into my dataframe?

My team and I (first-year CS uni students) are currently doing a project based on the IMDB 5000 dataset obtained from Kaggle: https://www.kaggle.com/datasets/carolzhangdc/imdb-5000-movie-dataset. While doing EDA, we realised that the gross entries are inconsistent. For example, some movies like "My Date with Drew" have a gross of 85222, which, we found, was based on its "opening weekend". Then we have movies like "Compliance" with a gross of 318622, but this refers to "Gross US & Canada". My question is, …
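One hedged way to automate the lookup is a movie-data API rather than scraping search results; a sketch with OMDb (the API key is a placeholder, and note that OMDb's BoxOffice field is US-only, so it fixes consistency but not the "global" part):

    import requests
    import pandas as pd

    API_KEY = 'your_omdb_key'  # placeholder; OMDb issues free keys

    def lookup_gross(title):
        # Query OMDb by exact title and return its reported box office.
        resp = requests.get('http://www.omdbapi.com/',
                            params={'t': title, 'apikey': API_KEY})
        return resp.json().get('BoxOffice', 'N/A')

    df = pd.read_csv('movie_metadata.csv')  # the Kaggle file
    # Titles in this dataset carry trailing non-breaking spaces; strip them first.
    df['box_office'] = df['movie_title'].str.strip().apply(lookup_gross)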
Category: Data Science

Automate downloading datasets via Colab

My desktop computer recently broke, and I'm currently working on a small laptop with barely 500 MB of space left. I need to download about 100 GB of files from the DFAUST dataset. I was wondering if there was a way to write a script that did this. Wget doesn't work because the downloads must be done on the website itself, behind a login. Is there a way to use a form of data scraping to get behind this and automate the …
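A minimal sketch of the usual workaround, assuming the site uses a plain form login (the login URL, form fields, and file list are all placeholders to fill in after inspecting the site; files go straight to Google Drive so the laptop's disk is never touched):

    import requests
    from google.colab import drive

    drive.mount('/content/drive')  # persist downloads outside the Colab VM

    USER, PASS = 'your_username', 'your_password'  # placeholders
    file_urls = []  # fill with the dataset's download links

    session = requests.Session()
    # Log in once; the session object keeps the cookies for every download.
    session.post('https://dfaust.example/login',             # placeholder URL
                 data={'username': USER, 'password': PASS})  # placeholder fields

    for url in file_urls:
        fname = '/content/drive/MyDrive/dfaust/' + url.split('/')[-1]
        with session.get(url, stream=True) as r, open(fname, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1 << 20):  # 1 MB chunks
                f.write(chunk)

If the login is JavaScript-heavy, the same idea works with a headless browser instead of requests.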
Category: Data Science

How to scrape a website with a search bar

How do I scrape a website that basically looks like Google, with just a giant search bar in the middle of the screen? From it you can search for various companies and their stats. I have a list of 1000 companies I want to get information about. I want some bot to search for each company from my list in the search bar, open the specific company's info window, and extract a certain company code that exists on each page for each …
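A hedged Selenium sketch of that loop (the site URL and both selectors are placeholders you would need to read off the actual page):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    companies = ['Acme AB', 'Example Ltd']  # your list of 1000 names
    driver = webdriver.Chrome()
    codes = {}
    for company in companies:
        driver.get('https://the-site.example')       # placeholder URL
        box = driver.find_element(By.NAME, 'q')      # placeholder selector
        box.send_keys(company, Keys.ENTER)
        # Wait for the info page, then read the code; '.company-code' is a placeholder.
        codes[company] = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, '.company-code'))
        ).text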
Category: Data Science

How to scrape an IMDb webpage?

I am teaching myself web scraping with Python as part of an effort to learn data analysis, and I am trying to scrape an IMDb page using the BeautifulSoup module. Following is the code I am using (reformatted; note that the original print statement referenced a rating variable that was never extracted):

    import requests
    from bs4 import BeautifulSoup

    r = requests.get(url)  # where url is the IMDb list page
    bs = BeautifulSoup(r.text, 'html.parser')
    for movie in bs.findAll('td', 'title'):
        title = movie.find('a').contents[0]
        genres = [g.contents[0] for g in movie.find('span', 'genre').findAll('a')]
        runtime = movie.find('span', 'runtime').contents[0]
        year = movie.find('span', 'year_type').contents[0]
        print(title, genres, runtime, year)
    …
Category: Data Science

Face recognition - How to make an image classifier with large number of classes?

I am planning to make an image classifier that identifies the face of every player in the English Premier League. I have a couple of questions (since until now I have only worked with small or academic datasets). My questions: How do I download this many different images? Since it's pretty hard to manually download the pictures individually, is there a way to automate it? I'm following this platform and am required to make a different class for each player. …
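One way to automate the downloads is a crawler library; a hedged sketch with icrawler (the player list and image counts are illustrative, and the results still need manual cleaning for a face dataset):

    from icrawler.builtin import BingImageCrawler

    players = ['Mohamed Salah', 'Erling Haaland']  # one entry per class
    for name in players:
        # Each player gets its own folder, which maps directly to a class label.
        crawler = BingImageCrawler(storage={'root_dir': 'data/' + name})
        crawler.crawl(keyword=name + ' face', max_num=200)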
Category: Data Science

LinkedIn web scraping

I recently discovered a new R package for connecting to the LinkedIn API. Unfortunately the LinkedIn API seems pretty limited to begin with; for example, you can only get basic data on companies, and this is detached from data on individuals. I'd like to get data on all employees of a given company, which you can do manually on the site but is not possible through the API. import.io would be perfect if it recognised the LinkedIn pagination (see end …
Category: Data Science

Getting an error while scraping Amazon using Selenium and bs4

I'm working on a class project using BeautifulSoup and webdriver to scrape disposable diapers on Amazon for the name of the item, price, reviews, and rating. My goal is to have something like this, where I will split the info into different columns: Diapers Size 4, 150 Count - Pampers Swaddlers Disposable Baby Diapers, One Month Supply 4.0 out of 5 stars 1,982 $43.98 ($0.29/Count). Unfortunately, after the first 50 items appear I get this message: message: no such element: unable to …
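The "no such element" failure usually means Selenium looked for the next-page element before it rendered, or after the last page. A hedged sketch of the usual guards (the pagination selector is a placeholder to verify against the live page):

    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException, TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get('https://www.amazon.com/s?k=disposable+diapers')

    while True:
        # ... parse the current page with BeautifulSoup here ...
        try:
            # Wait up to 10 s for the pagination link instead of failing instantly.
            next_btn = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, 'a.s-pagination-next'))
            )
            next_btn.click()
        except (TimeoutException, NoSuchElementException):
            break  # no next page: stop cleanly instead of crashing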
Category: Data Science

How to scrape a table from a webpage?

I need to scrape a table off of a webpage and put it into a pandas data frame, but I am not able to do it. Let me first give you a hint of how the table is encoded in the HTML document:

    <tbody>
      <tr>
        <th colspan="2">United States Total<strong>**</strong></th>
        <td><strong>15,069.0</strong></td>
        <td><strong>14,575.0</strong></td>
        <td><strong>100.0</strong></td>
        <td></td>
        <td></td>
      </tr>
      <tr>
        <th colspan="7">Arizona</th>
      </tr>
      <tr>
        <td>Pinal Energy, LLC</td>
        <td>Maricopa, AZ</td>
        <td>50.0</td>
        <td>50.0</td>
        <td>NA</td>
        <td>2012-07-01</td>
        <td>2014-03</td>
      </tr>
      <tr>
        <td colspan="2"><strong>Arizona Total</strong></td>
        <td>50.0</td>
        <td>50.0</td>
        <td>NA</td>
        <td></td>
        <td></td>
    …
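For tables like this, pandas can often parse the markup directly; a minimal sketch (the URL is a placeholder, and read_html needs lxml or html5lib installed):

    import pandas as pd

    # read_html returns one DataFrame per <table> found on the page.
    tables = pd.read_html('https://example.com/ethanol-plants')  # placeholder URL
    df = tables[0]

    # Rows that span all columns (like the 'Arizona' header above) come
    # through as mostly-NaN rows; drop or forward-fill them afterwards.
    print(df.head())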
Category: Data Science

How can you automate collecting curriculum vitae data?

I'm doing a machine learning project, for which I need data from thousands of curricula vitae. For this, I need to collect data from the employees of some 50 specific companies. From each company, I require data on thousands of employees. This data simply consists of what positions they have previously held, and with which company; what qualifications they have (e.g. a Computer Science BSc from the University of Oxford); and what skills they have. Initially I thought about using a web scraper …
Category: Data Science

Data scraping & NLP?

I'm scraping data from Bing search results (for non-commercial purposes, of course) in Python using BeautifulSoup. I've entered an Indian dessert name, 'rasmalai', as the word I am focusing on. The code I'm using returns the title and a description of the web page; I've also extracted the links for the results. Here is the code I used:

    from bs4 import BeautifulSoup
    import urllib, urllib2

    def bing_search(query):
        address = "http://www.bing.com/search?q=%s" % (urllib.quote_plus(query))
        getRequest = urllib2.Request(address, None, {'User-Agent': 'Mozilla/5.0 …
Category: Data Science

How to do web scraping in R on this webpage?

I am quite new to R and I am trying to learn web scraping. I basically need to extract documents from this website. Ideally, the data needs to be structured in three columns: YEAR, DATE, and INTRODUCTORYSTATEMENT_CONTENT. Can anyone help with the coding?
Category: Data Science

Crawling customer reviews from Amazon

I want to know if there is any way that I can crawl customer reviews for particular products from Amazon without being blocked. At the moment, my crawler gets blocked after a few requests. Any ideas would be appreciated.
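There is no guaranteed way, but here is a hedged sketch of the usual politeness measures (browser-like headers, randomized delays, and backing off when throttled; all values are illustrative):

    import random
    import time
    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # look like a browser
        'Accept-Language': 'en-US,en;q=0.9',
    }
    review_urls = []  # your list of review-page URLs

    for url in review_urls:
        resp = requests.get(url, headers=headers)
        if resp.status_code != 200:  # Amazon answers 503 when it throttles you
            time.sleep(60)           # back off, then try the next page
            continue
        # ... parse resp.text with BeautifulSoup here ...
        time.sleep(random.uniform(3, 8))  # randomized delay between requests

Amazon's official Product Advertising API is the sanctioned route for product data, and the only one that avoids blocking entirely.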
Category: Data Science

Algorithm to auto-download articles from the internet

I have a homework assignment, and I was wondering whether there is an existing algorithm, or whether I can create a new one, that takes keywords like "germany" and "pollution" and searches Google Scholar. It would parse, for example, the first 10 results, and each time it finds the keywords in a specific part of the article (just in the introduction) it would download it. Can anyone help me with any information that can help me in …
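For the search step, the scholarly package can query Google Scholar; a minimal sketch, with the caveat that checking the introduction would still require downloading and parsing each PDF (not shown):

    from scholarly import scholarly

    # Query Google Scholar and take the first 10 results.
    results = scholarly.search_pubs('germany pollution')
    for _ in range(10):
        pub = next(results)
        title = pub['bib']['title']
        pdf_url = pub.get('eprint_url')  # a direct link, when Scholar exposes one
        print(title, pdf_url)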
Category: Data Science

Capture a pattern in Python

I would like to capture the following pattern using Python:

    anyprefix-emp-<employee id>_id-<designation id>_sc-<scale id>

Example data:

    strings = ["humanresourc-emp-001_id-01_sc-01",
               "itoperation-emp-002_id-02_sc-12",
               "Generalsection-emp-003_id-03_sc-10"]

Expected output:

    [('emp-001', 'id-01', 'sc-01'), ('emp-002', 'id-02', 'sc-12'), ('emp-003', 'id-03', 'sc-10')]

How can I do it using Python?
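A minimal sketch with re, using one capture group per field:

    import re

    strings = ["humanresourc-emp-001_id-01_sc-01",
               "itoperation-emp-002_id-02_sc-12",
               "Generalsection-emp-003_id-03_sc-10"]

    # Capture the three dash-separated fields; search() skips any prefix.
    pattern = re.compile(r'(emp-\d+)_(id-\d+)_(sc-\d+)')
    print([pattern.search(s).groups() for s in strings])
    # -> [('emp-001', 'id-01', 'sc-01'), ('emp-002', 'id-02', 'sc-12'),
    #     ('emp-003', 'id-03', 'sc-10')]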
Category: Data Science

Periodically executing a scraping script with Python

Here is my idea and my early work. My target: fetch 1-hour-resolution air pollution data from China's government continuously. The website's data, which is collected from monitoring sites across the country, updates every hour. My code: right now I can grab the useful information for a single hour. First, input the website links for the different pollutants (co, no2, pm10, etc.):

    html_co = urllib.urlopen("http://www.pm25.in/api/querys/co.json?city=beijing&token=5j1znBVAsnSf5xQyNQyq").read().decode('utf-8')
    html_no2 = urllib.urlopen("http://www.pm25.in/api/querys/no2.json?city=beijing&token=5j1znBVAsnSf5xQyNQyq").read().decode('utf-8')
    html_pm10 = urllib.urlopen("http://www.pm25.in/api/querys/pm10.json?city=beijing&token=5j1znBVAsnSf5xQyNQyq").read().decode('utf-8')

Then get the content of the HTML doc:

    soup_co = BeautifulSoup(html_co) …
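For the "periodically" part, a hedged sketch with the schedule package (a cron job or a systemd timer would do the same; fetch_all_pollutants stands in for the fetching code above):

    import time
    import schedule

    def fetch_all_pollutants():
        # placeholder: the urlopen + BeautifulSoup code above goes here,
        # appending each hour's readings to a CSV or database
        pass

    schedule.every().hour.do(fetch_all_pollutants)

    while True:
        schedule.run_pending()
        time.sleep(60)  # check the schedule once a minute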
Category: Data Science

Complex HTML Data Extraction with Python

Does anybody know a way of extracting data with Python from more convoluted website structures? For example, I'm trying to extract data from the players' ATP profiles, but it's just so complicated I quit. I think they're pulling data from some database in a script, and I suspect that even if I tried I wouldn't be able to get it. I then started using specialized software called ParseHub, which pulls the data somewhat visually. It's a pretty …
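When a page fills itself in from a script, the data usually arrives from a JSON endpoint that can be called directly; a hedged sketch (the URL below is purely illustrative; the real one is found in the browser's DevTools Network tab while the profile loads):

    import requests

    # Watch the XHR requests in DevTools and copy the one returning JSON.
    url = 'https://www.atptour.com/some/json/endpoint'  # placeholder
    data = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).json()

    # JSON is far easier to walk than the rendered HTML.
    for item in data.get('players', []):  # key name is an assumption
        print(item)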
Category: Data Science
