How to arrange web scraped data in a table using R?

Original code:

library(netstat)
library(RSelenium)
library(tidyverse)

obj <- rsDriver(browser = "chrome", chromever = "101.0.4951.15", verbose = F, port = free_port())
remDr <- obj$client
remDr$navigate('https://www.imdb.com/search/title/?year=2022&title_type=feature&')
Title <- remDr$findElements(using = 'css', '.lister-item-header a')
lapply(Title, function(x) {
  x$getElementText() %>% unlist()
})

Output:

[[1]]
[1] "Doctor Strange in the Multiverse of Madness"

[[2]]
[1] "Senior Year"

My attempts to arrange the data in tabular form:

1. movies = data.frame(Title, stringsAsFactors = FALSE)
   view(movies)

   Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class ‘structure("webElement", package = "RSelenium")’ to a data.frame

2. movies = data.frame(x, stringsAsFactors = FALSE)
   view(movies)

   Error in data.frame(X, stringsAsFactors = FALSE) : object 'X' not found

3. Part of the original code tweaked:

   lapply(Title, function(x) {
     t <- list(x$getElementText() %>% unlist())
   })
   l = data.frame("movie" = t, stringsAsFactors …
Category: Data Science

What kind of hypothesis testing in Python can be used to validate that 4 job titles are significantly different based on their skillset?

I have 4 job titles, and for each of them I scraped hundreds of job descriptions and classified them by whether they contain words from a predefined list of skills. For each job description, I now have a True/False flag indicating whether it mentions each of the skills. How can I validate that there is a significant difference between job descriptions that represent different job titles? I'm very new to this topic and all I could think of is using dummy …
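Since each job description carries a True/False flag per skill, one common choice is a chi-squared test of independence on the job-title × skill-mention contingency table, run once per skill (with a multiple-testing correction such as Bonferroni if many skills are tested). A minimal sketch with made-up counts:

import numpy as np
from scipy.stats import chi2_contingency

# Rows = the 4 job titles; columns = count of descriptions that do /
# do not mention the skill. The counts below are invented for illustration.
observed = np.array([
    [120,  80],
    [ 60, 140],
    [ 90, 110],
    [ 30, 170],
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.2f}, p={p:.4g}, dof={dof}")
# A small p-value suggests the skill-mention rate differs across titles.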
Category: Data Science

Use Python to Scrape Data from a Website Ending with .aspx

https://blue.kingcounty.com/Assessor/eRealProperty/default.aspx This is the website I want to scrape. For example, when I input 500 4TH AVE, I just want to get the Parcel Number and Lot Size. I tried editing the link to https://blue.kingcounty.com/Assessor/eRealProperty/search=5004THAVE.aspx but it did not work. Please demonstrate. Thank you.
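For what it's worth, ASP.NET search pages usually submit a POST form (carrying hidden __VIEWSTATE/__EVENTVALIDATION fields) rather than encoding the query in the URL, which is why editing the address fails. A rough sketch of the usual pattern; the address-box field name below is a hypothetical placeholder that has to be read from the real form via the browser's dev tools:

import requests
from bs4 import BeautifulSoup

URL = "https://blue.kingcounty.com/Assessor/eRealProperty/default.aspx"

with requests.Session() as s:
    # GET the page first to collect ASP.NET's hidden state fields.
    soup = BeautifulSoup(s.get(URL).text, "html.parser")
    data = {inp["name"]: inp.get("value", "")
            for inp in soup.find_all("input", attrs={"type": "hidden"})
            if inp.get("name")}
    data["ctl00$SearchBox"] = "500 4TH AVE"  # hypothetical field name
    result = BeautifulSoup(s.post(URL, data=data).text, "html.parser")
    # Parse the Parcel Number and Lot Size out of `result` here.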
Topic: web-scraping
Category: Data Science

Extracting Public Page Posts from Facebook

I am a data science student working on my capstone research project. I am trying to collect posts from a number of public organization pages on Facebook. I have looked at the Graph API, and it does not appear to have an endpoint for this use case. The page/feed endpoint requires either moderation access to the pages (which I do not have) or public page read access, which requires a business application. The CrowdTangle process is not accepting researchers outside of very …
Category: Data Science

BeautifulSoup: iterating through scraped data

I have this HTML code that repeats multiple times:

<div class="Company_line-logo image-loader-target" data-image-loader-height="47" data-image-loader-height-mobile="47" data-image-loader-src="/var/fiftyPartners/storage/images/startups/woleet/3261-1-fre-FR/Woleet_company_line_logo.png" data-image-loader-src-mobile="/var/fiftyPartners/storage/images/startups/woleet/3261-1-fre-FR/Woleet_company_line_logo_mobile.png" data-image-loader-width="189" data-image-loader-width-mobile="189" style='background-image:url("http://en.50partners.fr/var/fiftyPartners/storage/images/startups/woleet/3261-1-fre-FR/Woleet_company_line_logo.png");'></div>
<h5 class="Company_line-title">Woleet</h5>
<div class="Company_line-description">

By using:

for blocks in soup:
    block = soup.find('a', class_='Company_line logo-contains-name').find('h5').get_text()

I can get what I want, that is, "Woleet" between the h5 tags. I tried to iterate this to get all of them:

block = soup.find('a', class_='Company_line logo-contains-name')
for name in block:
    names = block.find_all('h5')

but it returns only one <h5 class="Company_line-title">Woleet</h5>, whereas I should …
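find returns only the first match, which is why the loop keeps yielding the same single h5. One fix (a sketch assuming, as in the code above, that every repeated block sits inside an anchor carrying those two classes) is a single CSS select over the whole page:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # `html` = the full page source

# select() returns every match, so no manual iteration over find() is needed.
names = [h5.get_text(strip=True)
         for h5 in soup.select("a.Company_line.logo-contains-name h5.Company_line-title")]
print(names)  # e.g. ['Woleet', ...]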
Category: Data Science

How to Scrape Website Usage/User Data for a Website that Does Not Have an API

I am an aspiring data analyst and I was wondering how to web scrape website usage/user data such as: total people who visited the website, total visitors per page, how long each visitor stayed on each page, etc. The website I am interested in scraping does not have an API, in case that affects your answer. I have searched online, watched many YouTube tutorials, etc., but none of it pertains to the data I need. I don't know the …
Category: Data Science

Data extraction using crawlers

I have a rather simple data scraping task, but my knowledge of web scraping is limited. I have an Excel file containing the names of 500 cities in a column, and I'd like to find their distance from a fixed city, say Montreal. I have found this website, which gives the desired distance (in both km and miles). For each of these 500 cities, I'd like to read the name from the Excel file, enter it in the "to" box, …
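One way this is typically wired together, sketched below: read the city column with pandas, then submit one request per city. The endpoint and parameter names here are hypothetical placeholders; the real ones have to be taken from the distance site's form (via the browser's network tab), and a Selenium-driven browser is the fallback if the site requires JavaScript.

import pandas as pd
import requests
from bs4 import BeautifulSoup

cities = pd.read_excel("cities.xlsx")["city"]  # column name assumed

distances = {}
for city in cities:
    # Hypothetical URL and parameters; inspect the real site to find them.
    resp = requests.get("https://example-distance-site.com/distance",
                        params={"from": "Montreal", "to": city})
    soup = BeautifulSoup(resp.text, "html.parser")
    # Pull the distance out of the result page here, e.g.:
    # distances[city] = soup.select_one(".distance-km").get_text()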
Category: Data Science

Find the business vertical of a website just by its URL, or cluster similar websites by URL

I have been exploring the problem of using only a website's URL to tag or cluster sites by their business domain. For example:

amazon.com => e-commerce
bbc.co.uk => news
adidas.com => sports apparel

I have read through some research papers which try to cluster URLs using different unsupervised clustering algorithms, such as CLUE (link here). One way to approach it is to create a repository of labeled websites and then build a model to tag similar websites using this …
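As a cheap baseline before building a labeled repository, the URL string itself can be featurized with character n-grams and clustered; domain names alone carry limited signal, but the mechanics look like this minimal scikit-learn sketch with made-up URLs:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

urls = ["amazon.com", "ebay.com", "bbc.co.uk", "cnn.com",
        "adidas.com", "nike.com"]

# Character n-grams pick up fragments like "shop" or "news" in domains.
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).fit_transform(urls)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(urls, labels)))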
Category: Data Science

Suggestions for studying clickstream data

I've essentially been handed a dataset of website access history and I'm trying to draw some conclusions from it. The data supplied gives me the web URL, the datetime when it was accessed, and the unique ID of the user accessing that page. This means that for a given user ID, I can see a timeline of how they moved through the website and what pages they looked at. I'd quite like to try clustering these users into different …
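A common starting point for that kind of clustering is to collapse the event log into one feature row per user and cluster those rows; a minimal pandas/scikit-learn sketch, with the column names assumed from the description above:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Assumed schema: one row per page view (user_id, url, timestamp).
log = pd.read_csv("access_log.csv", parse_dates=["timestamp"])

features = log.groupby("user_id").agg(
    n_views=("url", "size"),
    n_unique_pages=("url", "nunique"),
    span_hours=("timestamp",
                lambda t: (t.max() - t.min()).total_seconds() / 3600),
)

X = StandardScaler().fit_transform(features)
features["cluster"] = KMeans(n_clusters=4, n_init=10,
                             random_state=0).fit_predict(X)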
Category: Data Science

Looking for a public dataset for the stock market

I want to do some modelling and data visualization on historical stock data, including price, volume, financials, etc. Is there a public dataset available for stock price history? I have looked at a few, but they either have a high cost or I'm not sure they are reliable. Free would be preferred, as well as established and reliable. If not, what are some good options for collecting the data myself? Maybe web scraping, or public APIs, etc.
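For price and volume history, one free and widely used option is the yfinance package, which pulls Yahoo Finance data into a pandas DataFrame; a minimal sketch:

import yfinance as yf

# Daily OHLCV history for one ticker, returned as a pandas DataFrame.
prices = yf.download("AAPL", start="2015-01-01", end="2022-01-01")
print(prices.head())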
Category: Data Science

Good database/plug-in to scrape for academic paper info?

I am trying to scrape the web to find information about a set of academic papers. Unfortunately, for many of the papers, I only have the author's name, a year, and part of a title. (For example, [BANDURA A, 1997, SELF EFFICACY EXERCI].) I have tried using a web driver to look up the papers on Scopus, but many of the papers are not there. Does anyone know of a good academic paper database and/or plug-in that I could search …
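One free database worth trying alongside Scopus is Crossref, whose REST API accepts fuzzy bibliographic queries, so even a fragment like the one above can still match; a minimal sketch:

import requests

resp = requests.get(
    "https://api.crossref.org/works",
    params={"query.bibliographic": "Bandura 1997 Self efficacy", "rows": 3},
)
for item in resp.json()["message"]["items"]:
    title = (item.get("title") or ["(no title)"])[0]
    print(item.get("DOI"), title)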
Category: Data Science

Best way to store scraped web pages?

I'm going to scrape the HTML code from a large number of URLs and store it on my computer for machine learning purposes (basically, I'm going to use Python and PyTorch to train a neural network on this data). What is the best way to store the HTML code for all the web pages? I want to be able to see which URLs I have already scraped, so that I don't have to scrape them again, and for each piece …
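One simple scheme that covers both requirements is one file per page, named by a hash of the URL, plus a small index mapping each URL to its file and metadata; a minimal sketch (file and directory names are arbitrary):

import hashlib
import json
import pathlib

STORE = pathlib.Path("pages")
STORE.mkdir(exist_ok=True)
INDEX = STORE / "index.json"

def save_page(url, html, meta=None):
    index = json.loads(INDEX.read_text()) if INDEX.exists() else {}
    if url in index:  # already scraped: skip it
        return
    name = hashlib.sha256(url.encode()).hexdigest() + ".html"
    (STORE / name).write_text(html, encoding="utf-8")
    index[url] = {"file": name, **(meta or {})}
    INDEX.write_text(json.dumps(index, indent=2))

At larger scale the same idea maps naturally onto a SQLite table instead of a JSON index.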
Topic: web-scraping
Category: Data Science

BeautifulSoup find_all returns TypeError: list indices must be integers or slices, not str

This should have an easy solution but I can't understand how to avoid it.

content1 = soup.find('div', class_='fusion-text fusion-text-6')
content2 = soup.find('div', class_='fusion-text fusion-text-7')

for para in content1:
    comp_name = para.find_all('a')['href']
    print(comp_name)

The error comes from indexing the list returned by find_all with ['href']. comp_name = para.find('a')['href'] doesn't return an error and gives the right output (a URL), but only the first one. Since I want to scrape all of them inside content1, I wanted to use find_all …
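The TypeError arises because find_all returns a plain Python list, which can only be indexed with integers; the ['href'] lookup has to happen on each tag inside it. A sketch of the usual pattern (note it also loops over find_all directly instead of over the div's children):

# One 'href' per <a> tag inside content1.
urls = [a.get('href') for a in content1.find_all('a')]
for u in urls:
    print(u)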
Category: Data Science

Python writing to Excel file: writerow() takes no keyword arguments

I have this script:

import requests
from requests import get
from bs4 import BeautifulSoup
import csv
import pandas as pd

f = open('olanda.csv', 'wb')
writer = csv.writer(f)

url = ('https://www......')
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
type(soup)

table = soup.find('table', id='tablepress-94').text.strip()
print(table)

writer.writerow(table.split(), delimiter = ',')
f.close()

When it writes to a CSV file, it writes everything in a single cell, like this:

Sno.,CompanyLocation,1KarifyNetherlands,2Umenz,Benelux,BVNetherlands,3TovertafelNetherlands,4Behandeling,BegrepenNetherlands,5MEXTRANetherlands,6Sleep.aiNetherlands,7OWiseNetherlands,8Healthy,WorkersNetherlands,9&thijs,|,thuis,in,jouw,situatieNetherlands,10HerculesNetherlands, etc.

I wanted to have the output in a single column with each value (separated …
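The immediate error is that delimiter is an option of csv.writer, not of writerow(), and splitting the flattened .text of the table throws away its row structure anyway. A sketch of one common fix, walking the table row by row:

import csv
import requests
from bs4 import BeautifulSoup

url = 'https://www......'  # URL elided in the question
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
table = soup.find('table', id='tablepress-94')

with open('olanda.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)  # delimiter would be configured here
    for tr in table.find_all('tr'):
        # One CSV row per table row, one cell per <th>/<td>.
        writer.writerow(cell.get_text(strip=True)
                        for cell in tr.find_all(['th', 'td']))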
Category: Data Science

Web scraping using Beautiful Soup

I have this code and I want to extract holidays, petrol prices, and temperature, but I don't know where the problem is. I need your help as soon as possible, please. I want to add these extracted variables to my dataset, which is organized around date columns, by matching the scraped data against the dates I already have. I also want to test the impact of each variable (holidays, temperature, …).

import requests
import re
import json
import datefinder
from googletrans import …
Category: Data Science

Scraping financial web data

I recently started working as a data scientist and I am starting a web scraping and NLP project using Python. The idea is to create a program that searches for public information on the company's clients. This information can come from various sources: annual reports, income statements, articles, and so on. I will have to deal with two formats: HTML and PDF. For now I will focus on retrieving the revenue of the company. After a month of research and tests, …
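For the PDF side, a typical first pass is to extract raw text and scan it for revenue lines; a minimal sketch with pdfplumber and a deliberately naive regex (real filings need far more robust matching, and the filename is a placeholder):

import re
import pdfplumber

with pdfplumber.open("annual_report.pdf") as pdf:  # placeholder file
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)

# Naive: a number with an optional scale word near the word "revenue".
pattern = re.compile(r"revenue[^\n]{0,60}?[\d][\d.,]*\s*(?:million|billion|m|bn)?",
                     re.IGNORECASE)
for match in pattern.finditer(text):
    print(match.group(0))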
Category: Data Science

Web Scraping: Multiple small files or one large file?

I plan to scrape some forums (Reddit, 4chan) for a research project. We will scrape the newest posts every 10 minutes for around 3 months. I am wondering how best to store the JSON data from each scrape so that pre-processing (via Python) later is as simple as possible. My options are the following: dump the data from each scrape into a fresh file (timestamp as filename), resulting in 12,960 files of approx. 150 kB each, OR maintain one single large …
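A third option sitting between the two is a single append-only JSON Lines file (one JSON object per line): writes stay cheap because nothing is rewritten, and pandas can load the whole file in one call later. A minimal sketch:

import json
import time

def append_scrape(records, path="scrapes.jsonl"):
    # One JSON object per line; appending never touches earlier data.
    with open(path, "a", encoding="utf-8") as f:
        for rec in records:
            rec["scraped_at"] = time.time()
            f.write(json.dumps(rec) + "\n")

# Later: df = pd.read_json("scrapes.jsonl", lines=True)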
Category: Data Science

Scraping mixed elements and passing to SQL

I'm running web scrapes via Python which retrieve data from CSVs hosted on the web. I'd like to pass the data into an MSSQL database. An issue I have is the mixed elements/data types in the CSVs. Here is an example of the data:

Item  Val1  Val2
A     100   200
B     101   201
C     Null  -2/2(%)
D     Null  2019-Nov-18

I would like to import all of this data into the db, but the critical data is in the "Val2" column. …
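One pragmatic route is to load every column as text so mixed values like "-2/2(%)" and "2019-Nov-18" survive intact, push the frame into MSSQL as-is, and cast individual values downstream; a minimal pandas/SQLAlchemy sketch (the connection string is a placeholder):

import pandas as pd
from sqlalchemy import create_engine

# dtype=str keeps the mixed Val2 column from being half-coerced.
df = pd.read_csv("source.csv", dtype=str)

engine = create_engine(
    "mssql+pyodbc://user:pass@server/db"
    "?driver=ODBC+Driver+17+for+SQL+Server")  # placeholder credentials
df.to_sql("scraped_items", engine, if_exists="append", index=False)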
Category: Data Science
