I have 4 job titles, and for each of them I scraped hundreds of job descriptions and classified them by whether they contain words related to a predefined list of skills. For each job description, I now have a True/False indicator of whether it mentions one of the skills. How can I validate that there is a significant difference between job descriptions that represent different job titles? I'm very new to this topic and all I could think of is using dummy …
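One standard way to check this, sketched minimally below (assuming the data sits in a pandas DataFrame with a job_title column and a boolean mentions_skill column; those names and the toy rows are illustrative), is a chi-square test of independence on the title-by-skill contingency table:

    import pandas as pd
    from scipy.stats import chi2_contingency

    # Toy data standing in for the real scrape: one row per job description
    df = pd.DataFrame({
        "job_title": ["analyst", "analyst", "engineer", "engineer", "scientist", "scientist"],
        "mentions_skill": [True, False, True, True, False, True],
    })

    # Contingency table: counts of True/False per job title
    table = pd.crosstab(df["job_title"], df["mentions_skill"])

    # A small p-value suggests the share of descriptions mentioning the skill differs by title
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(table)
    print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")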
https://blue.kingcounty.com/Assessor/eRealProperty/default.aspx This is the website I want to scrape. For example, I input 500 4TH AVE and I just want to get the Parcel Number and Lot Size. I tried editing the link to https://blue.kingcounty.com/Assessor/eRealProperty/search=5004THAVE.aspx, but it did not work. Could someone please demonstrate how to do this? Thank you.
I am a data science student working on my capstone research project. I am trying to collect posts from a number of public organization pages on Facebook. I am looking at the Graph API, and it does not appear to have an endpoint for this use case. The page/feed endpoint requires either moderation access to the pages (which I do not have) or public page read access, which requires a business application. The CrowdTangle process is not accepting researchers outside of very …
I have this HTML code that repeats multiple times:

    <div class="Company_line-logo image-loader-target" data-image-loader-height="47" data-image-loader-height-mobile="47" data-image-loader-src="/var/fiftyPartners/storage/images/startups/woleet/3261-1-fre-FR/Woleet_company_line_logo.png" data-image-loader-src-mobile="/var/fiftyPartners/storage/images/startups/woleet/3261-1-fre-FR/Woleet_company_line_logo_mobile.png" data-image-loader-width="189" data-image-loader-width-mobile="189" style='background-image:url("http://en.50partners.fr/var/fiftyPartners/storage/images/startups/woleet/3261-1-fre-FR/Woleet_company_line_logo.png");'></div>
    <h5 class="Company_line-title">Woleet</h5>
    <div class="Company_line-description">

By using:

    for blocks in soup:
        block = soup.find('a', class_='Company_line logo-contains-name').find('h5').get_text()

I can get what I want, that is, "Woleet" between the h5 tags. I tried to iterate this to get all of them:

    block = soup.find('a', class_='Company_line logo-contains-name')
    for name in block:
        names = block.find_all('h5')

but it returns only one <h5 class="Company_line-title">Woleet</h5>, whereas I should …
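For reference, a minimal BeautifulSoup sketch of the usual fix (assuming every company block is wrapped in an <a class="Company_line logo-contains-name"> as in the working find above; the URL is a placeholder): find returns only the first match, so iterate over find_all instead:

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL for the page being scraped
    html = requests.get("http://en.50partners.fr/").text
    soup = BeautifulSoup(html, "html.parser")

    names = []
    # find_all returns every matching anchor, not just the first one
    for block in soup.find_all("a", class_="Company_line logo-contains-name"):
        h5 = block.find("h5", class_="Company_line-title")
        if h5 is not None:
            names.append(h5.get_text(strip=True))

    print(names)  # e.g. ['Woleet', ...]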
I am trying to build a scraper that will run continuously and save tweets from a list of users instantaneously, or within seconds of a user tweeting. It could save the tweet details to a continuously updated CSV file.
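One possible starting point, sketched with tweepy's streaming client (this assumes a Twitter API v2 bearer token; the token and the usernames in the rule are placeholders):

    import csv
    import tweepy

    BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder

    class CsvStream(tweepy.StreamingClient):
        def on_tweet(self, tweet):
            # Append each incoming tweet to a running CSV file
            with open("tweets.csv", "a", newline="", encoding="utf-8") as f:
                csv.writer(f).writerow([tweet.id, tweet.created_at, tweet.text])

    stream = CsvStream(BEARER_TOKEN)
    # One filter rule covering the accounts of interest
    stream.add_rules(tweepy.StreamRule("from:user_one OR from:user_two"))
    stream.filter(tweet_fields=["created_at"])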
I want to scrape images from several Facebook groups; some of them are public and some are not. I am new to web scraping, but I tried to look for solutions with Selenium, BeautifulSoup, or Scrapy and I didn't find anything.
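For the public groups, a bare-bones Selenium sketch (the group URL is a placeholder, and this only sees images already rendered on the page; posts loaded by infinite scroll need extra scrolling logic):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    GROUP_URL = "https://www.facebook.com/groups/example"  # placeholder

    driver = webdriver.Chrome()
    driver.get(GROUP_URL)

    # Collect the src attribute of every <img> currently on the page
    image_urls = [img.get_attribute("src")
                  for img in driver.find_elements(By.TAG_NAME, "img")]
    print(image_urls)
    driver.quit()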
I am an aspiring Data Analyst and I was wondering how to web scrape website usage/user data such as: total people who visited the website, total visitors per page, how long each visitor stayed on each page, etc. The website I am interested in scraping does not have an API, in case that affects your answer. I have searched online, watched many YouTube tutorials, etc., but nothing pertains to the data I need. I don't know the …
I have a rather simple data scraping task, but my knowledge of web scraping is limited. I have an Excel file containing the names of 500 cities in a column, and I'd like to find their distance from a fixed city, say Montreal. I have found this website which gives the desired distance (in both km and miles). For each of these 500 cities, I'd like to read the name in the Excel file, enter it in the "to" box, …
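If driving that website turns out to be awkward, one alternative sketch is to skip the site entirely and compute the distances with geopy (the file and column names below are hypothetical):

    import pandas as pd
    from geopy.geocoders import Nominatim
    from geopy.distance import geodesic

    # Hypothetical file/column names: a single column of city names
    cities = pd.read_excel("cities.xlsx")["city"]

    geolocator = Nominatim(user_agent="distance-script")
    montreal = geolocator.geocode("Montreal, Canada")
    origin = (montreal.latitude, montreal.longitude)

    distances = []
    for name in cities:
        loc = geolocator.geocode(name)   # Nominatim expects roughly 1 request/second
        if loc is None:
            distances.append(None)       # city not found
            continue
        distances.append(geodesic(origin, (loc.latitude, loc.longitude)).km)

    pd.DataFrame({"city": cities, "distance_km": distances}).to_csv("distances.csv", index=False)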
I have been exploring the problem of tagging or clustering websites by their business domain using only the website URL. For example: amazon.com => e-commerce, bbc.co.uk => news, Adidas.com => sports apparel. I have read through some research papers which try to cluster websites using different unsupervised clustering algorithms, like CLUE (link here). One way to think about it is to create a repository of labeled websites and then create a model to tag similar websites using this …
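As a rough illustration of the repository idea, a minimal sketch that scrapes each homepage's visible text and clusters the sites with TF-IDF + k-means (the URLs are the examples from the question; a real pipeline would need crawling etiquette, error handling, and far more sites):

    import requests
    from bs4 import BeautifulSoup
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    urls = ["https://www.amazon.com", "https://www.bbc.co.uk", "https://www.adidas.com"]

    def homepage_text(url):
        # Visible text of the landing page
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        return soup.get_text(" ", strip=True)

    docs = [homepage_text(u) for u in urls]
    X = TfidfVectorizer(stop_words="english", max_features=5000).fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(dict(zip(urls, labels)))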
I've essentially been handed a dataset of website access history and I'm trying to draw some conclusions from it. The data supplied gives me the web URL, the datetime of when it was accessed, and the unique ID of the user accessing that data. This means that for a given user ID, I can see a timeline of how they went through the website and what pages they looked at. I'd quite like to try clustering these users into different …
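A minimal sketch of one way to start (the column names are assumptions for the log described above): build one feature vector per user from page-visit counts and run k-means on it:

    import pandas as pd
    from sklearn.cluster import KMeans

    # Hypothetical columns: user_id, url, datetime
    log = pd.read_csv("access_log.csv")

    # One row per user, one column per URL, values = visit counts
    features = pd.crosstab(log["user_id"], log["url"])

    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
    clusters = pd.Series(labels, index=features.index, name="cluster")
    print(clusters.value_counts())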
I want to do some modelling and data visualization on historical stock data, including price, volume, financials, etc. Is there a public dataset available for stock price history? I looked at a few, but either they have a high cost or I am not sure they would be reliable. Free would be preferred, and also established and reliable. If not, what are some good options for collecting the data myself? Maybe web scraping, or public APIs, etc.
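For the collect-it-yourself route, a small sketch with the yfinance package (the tickers and dates are just examples):

    import yfinance as yf

    # Daily OHLCV history pulled from Yahoo Finance
    data = yf.download(["AAPL", "MSFT"], start="2015-01-01", end="2020-12-31")
    print(data["Close"].tail())

    # Per-ticker fundamentals-style fields are also exposed
    print(yf.Ticker("AAPL").info.get("marketCap"))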
I am trying to scrape the web to find information about a set of academic papers. Unfortunately, for many of the papers, I only have the author's name, a year, and part of a title. (For example, [BANDURA A, 1997, SELF EFFICACY EXERCI]) I have tried using a web driver to look up the papers on Scopus but many of the papers are not there. Does anyone know of a good academic paper database and/or plug-in that I could search …
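One option worth sketching is the free Crossref REST API, which accepts exactly this kind of partial author/year/title query (the query string below is built from the example reference in the question):

    import requests

    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": "Bandura 1997 self efficacy", "rows": 3},
        timeout=10,
    )
    # Print DOI and title of the top candidate matches
    for item in resp.json()["message"]["items"]:
        print(item.get("DOI"), item.get("title", ["?"])[0])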
I'm going to scrape the HTML code from a large number of URLs and store them on my computer for machine learning purposes (basically, I'm going to use Python and PyTorch to train a neural network on this data). What is the best way to store the HTML code for all the web pages? I want to be able to see which URLs I have already scraped, so that I don't have to scrape them again, and for each piece …
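One simple layout, sketched under the assumption that a flat directory plus an index file is enough: save each page under a hash of its URL and keep a JSON-lines index of what has already been scraped:

    import hashlib
    import json
    import pathlib
    import requests

    OUT = pathlib.Path("pages")
    OUT.mkdir(exist_ok=True)
    INDEX = pathlib.Path("index.jsonl")   # one JSON record per scraped URL

    scraped = set()
    if INDEX.exists():
        scraped = {json.loads(line)["url"] for line in INDEX.open()}

    def save_page(url):
        if url in scraped:
            return                        # already scraped, skip
        html = requests.get(url, timeout=10).text
        name = hashlib.sha1(url.encode()).hexdigest() + ".html"
        (OUT / name).write_text(html, encoding="utf-8")
        with INDEX.open("a") as f:
            f.write(json.dumps({"url": url, "file": name}) + "\n")
        scraped.add(url)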
This should have an easy solution but I can't understand how to avoid it.

    content1 = soup.find('div', class_='fusion-text fusion-text-6')
    content2 = soup.find('div', class_='fusion-text fusion-text-7')
    for para in content1:
        comp_name = para.find_all('a')['href']
        print(comp_name)

The error comes because I am indexing ['href'] on the list returned by find_all. comp_name = para.find('a')['href'] doesn't return an error and gives the right output (a URL), but only the first one. Since I want to scrape all of them inside my content1, I wanted to use find_all …
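For reference, a minimal sketch of the usual pattern (using content1 and content2 from the snippet above): find_all returns a list of <a> tags, so the ['href'] lookup has to happen on each tag inside a loop, not on the list itself:

    for content in (content1, content2):
        if content is None:
            continue
        for a in content.find_all('a'):
            print(a.get('href'))   # .get avoids a KeyError on anchors without href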
I have this script:

    import requests
    from requests import get
    from bs4 import BeautifulSoup
    import csv
    import pandas as pd

    f = open('olanda.csv', 'wb')
    writer = csv.writer(f)
    url = ('https://www......')
    response = get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    type(soup)
    table = soup.find('table', id='tablepress-94').text.strip()
    print(table)
    writer.writerow(table.split(), delimiter = ',')
    f.close()

When it writes to the CSV file, it writes everything in a single cell, like this: Sno.,CompanyLocation,1KarifyNetherlands,2Umenz,Benelux,BVNetherlands,3TovertafelNetherlands,4Behandeling,BegrepenNetherlands,5MEXTRANetherlands,6Sleep.aiNetherlands,7OWiseNetherlands,8Healthy,WorkersNetherlands,9&thijs,|,thuis,in,jouw,situatieNetherlands,10HerculesNetherlands, etc. I wanted to have the output in a single column and each value (separated …
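A sketch of one way to get one table row per CSV row (still assuming the same 'tablepress-94' table and the url from the script above): iterate over the <tr>/<td> elements instead of splitting the raw text:

    import csv
    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get(url).text, 'html.parser')   # url as in the script above
    table = soup.find('table', id='tablepress-94')

    with open('olanda.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for tr in table.find_all('tr'):
            cells = [c.get_text(strip=True) for c in tr.find_all(['th', 'td'])]
            writer.writerow(cells)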
I have this code and I want to extract holidays, petrol prices and temperature, but I don't know where the problem is. I need your help as soon as possible, please. I want to add this extracted data to my dataset, which is based on a date column, comparing the scraped data with the dates that I have in my dataset. I also want to test the impact of each variable (holidays, temperature, ...).

    import requests
    import re
    import json
    import datefinder
    from googletrans import …
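For the holidays part specifically, a small sketch that sidesteps scraping by using the holidays package (the country and the 'date' column name are assumptions):

    import holidays
    import pandas as pd

    # Hypothetical dataset keyed by a 'date' column, as described above
    df = pd.DataFrame({"date": pd.date_range("2021-01-01", periods=5, freq="D")})

    fr_holidays = holidays.France()   # swap in the relevant country
    df["is_holiday"] = df["date"].dt.date.apply(lambda d: d in fr_holidays)
    print(df)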
I recently started working as a data scientist and I am starting a web scraping and NLP project using Python. The idea is to create a program that searches for public information on the company's clients. This information can come from various sources: annual reports, income statements, articles, and so on. I will have to deal with two types of formats: HTML and PDF. For now I will focus on retrieving the revenue of the company. After a month of research and tests, …
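For the PDF side, a rough sketch with pdfminer.six (the file name is a placeholder, and the regex is only a crude first pass at spotting revenue figures):

    import re
    from pdfminer.high_level import extract_text

    text = extract_text("annual_report.pdf")   # hypothetical file

    # Lines mentioning 'revenue' with a number nearby; real reports need sturdier parsing
    for match in re.finditer(r"revenue[^\n]{0,80}?[\d][\d.,]*", text, re.IGNORECASE):
        print(match.group(0))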
I plan to scrape some forums (Reddit, 4chan) for a research project. We will scrape the newest posts every 10 minutes for around 3 months. I am wondering how best to store the JSON data from each scrape, so that pre-processing (via Python) later would be as simple as possible. My options are the following: dump the data from each scrape into a fresh file (timestamp as filename), resulting in 12,960 files of approx. 150 KB each, OR maintain one single large …
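A third option worth sketching (an assumption on my part, not something from the question): append each scrape as one JSON line to a single file, which keeps writes trivial and lets pandas read everything back in one call for pre-processing:

    import json
    import time

    def save_scrape(posts, path="scrapes.jsonl"):
        # One line per scrape: a timestamp plus the list of post dicts
        record = {"scraped_at": time.time(), "posts": posts}
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    # Later, for pre-processing:
    # import pandas as pd
    # df = pd.read_json("scrapes.jsonl", lines=True)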
I'm running web scrapes via Python which retrieve data from CSVs hosted on the web. I'd like to pass the data into an MSSQL database. An issue I have is the mixed elements/data types in the CSV. Here is an example of the data:

    Item  Val1  Val2
    A     100   200
    B     101   201
    C     Null  -2/2(%)
    D     Null  2019-Nov-18

I would like to import all of this data into the db, but the critical data is in the "Val2" column. …
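One way to sketch the load (the file name and connection string are placeholders): read everything as text with pandas and push it to MSSQL with Val2 stored as NVARCHAR, deferring any type conversion to SQL or a later step:

    import pandas as pd
    from sqlalchemy import create_engine, types

    # Read everything as text so mixed values like '-2/2(%)' and '2019-Nov-18' survive
    df = pd.read_csv("scraped.csv", dtype=str)

    engine = create_engine(
        "mssql+pyodbc://user:pass@server/db?driver=ODBC+Driver+17+for+SQL+Server"
    )

    df.to_sql(
        "scraped_data",
        engine,
        if_exists="append",
        index=False,
        dtype={"Val2": types.NVARCHAR(length=100)},   # keep Val2 as text in MSSQL
    )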