I have created the following function, which converts an XML file to a DataFrame. The function works well for files smaller than 1 GB; for anything larger, the RAM (13 GB on Google Colab) is exhausted and the session crashes. The same happens if I try it locally in a Jupyter Notebook (4 GB of laptop RAM). Is there a way to optimize the code?

Code

#Libraries
import pandas as pd
import xml.etree.cElementTree as ET

#Function to convert XML file to Pandas Dataframe
def xml2df(file_path):
    #Parsing XML File and …
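A minimal sketch of a streaming alternative with iterparse, which avoids holding the whole tree in memory (the record tag name and the row-building logic are placeholders, not taken from the original file):

```python
# Streaming sketch: process one record element at a time and clear it, so memory
# stays roughly flat regardless of file size. "record" and the per-row dict are
# placeholders for the real tag names in the XML.
import pandas as pd
import xml.etree.cElementTree as ET

def xml2df_streaming(file_path, record_tag="record", chunk_size=50_000):
    rows, frames = [], []
    for event, elem in ET.iterparse(file_path, events=("end",)):
        if elem.tag == record_tag:
            rows.append({child.tag: child.text for child in elem})
            elem.clear()  # free the parsed element so the tree does not grow
            if len(rows) >= chunk_size:
                frames.append(pd.DataFrame(rows))
                rows = []
    if rows:
        frames.append(pd.DataFrame(rows))
    return pd.concat(frames, ignore_index=True)
```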
I have a list of books and I want to parse the genres of books from well-known online stores in my country.

def add_plus(title):
    return "+".join(title.split(" "))

def get_category(title):
    url_search = "https://www.kitapal.kz/search?search=" + add_plus(title)
    request_page_name = requests.get(url_search)
    soup = BeautifulSoup(request_page_name.content, "html.parser")
    url_book = soup.find("div", class_="item-books").parent.get('href')
    url_book = "https://www.kitapal.kz" + url_book
    request_book_page = requests.get(url_book)
    soup_book = BeautifulSoup(request_book_page.content, "html.parser")
    soup_category = soup_book.find_all(lambda tag: tag.name == "span" and "Сериясы" in tag.text)
    return soup_category

When I call get_category("Как я выиграл жизнь") I get this result:

[<span class="d-block">Сериясы: …
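A minimal sketch of turning those matched spans into plain category strings, assuming they all look like the one above with a "Сериясы:" label before the value:

```python
# Take each matched <span>, get its text, and strip the leading label so only
# the series/genre name remains.
def extract_category_names(spans):
    names = []
    for span in spans:
        text = span.get_text(strip=True)
        names.append(text.split(":", 1)[-1].strip())  # "Сериясы: X" -> "X"
    return names
```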
I have recently developed an interest in automatic semantic role labelling. Most introductory texts (e.g. Jurafsky and Martin, 2008) present approaches based on supervised machine learning, often using FrameNet (Baker et al., 1998) and PropBank (Kingsbury & Palmer, 2002). Intuitively, however, I would imagine that the same problem could be tackled with a grammar-based parser. Why is this not the case? Or rather, why are these supervised solutions preferred? Thanks in advance.

References

Jurafsky, D., & Martin, J. H. …
I am interested in an unsupervised approach to training a POS tagger. Labelling is very difficult, and I would like to test a tagger for my specific domain (chats), where users typically write in lower case, etc. If it matters, the data is mostly in German. I have read about older techniques like HMMs, but maybe there are newer and better approaches?
The dateparser package fails to detect texts like the following and generate a date range:

'last 2 weeks of 2020': should return 18th December 2020 - 31st December 2020
'first three quarters of 2018': should return 1st January 2018 - 30th September 2018
'last 3 days of September 2020': should return 28th September - 30th September 2020

Is there a simple rule-based parser in Python that can detect the words last, first, 2, 3, 2020, 2018, etc. and give a …
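A minimal rule-based sketch, not a full grammar: it only handles phrases of the form "first|last N day(s)|week(s)|quarter(s) of [Month] YEAR", and the small number-word table and the 91-day quarter are simplifying assumptions.

```python
# Regex-driven rule: find side (first/last), count, unit and optional month,
# then compute the range from the start or end of the month/year.
import re
import calendar
from datetime import date, timedelta

WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4}
UNIT_DAYS = {"day": 1, "week": 7, "quarter": 91}  # quarters approximated as 91 days

PATTERN = re.compile(
    r"(first|last)\s+(\d+|one|two|three|four)\s+(day|week|quarter)s?\s+of\s+"
    r"(?:([A-Za-z]+)\s+)?(\d{4})",
    re.IGNORECASE,
)

def parse_range(text):
    m = PATTERN.search(text)
    if not m:
        return None
    side, num, unit, month_name, year = m.groups()
    n = int(num) if num.isdigit() else WORD_NUMBERS[num.lower()]
    year = int(year)
    if month_name:
        month = list(calendar.month_name).index(month_name.capitalize())
        start = date(year, month, 1)
        end = date(year, month, calendar.monthrange(year, month)[1])
    else:
        start, end = date(year, 1, 1), date(year, 12, 31)
    span = timedelta(days=n * UNIT_DAYS[unit.lower()] - 1)
    return (start, start + span) if side.lower() == "first" else (end - span, end)

# parse_range("last 3 days of September 2020") -> (2020-09-28, 2020-09-30)
# parse_range("last 2 weeks of 2020")          -> (2020-12-18, 2020-12-31)
```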
I have a data chunk (~30k items) in which HTML pages and PNGs for websites are saved in folders. These folders are named with randomly generated hashes. My supervisor wants me to crunch through this data, extract some attributes from each HTML page, and store them in a DB for future use. The attributes to be extracted are the page title and the copyright section of the HTML. As I understand it, this data is unstructured …
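A minimal sketch of one way to do this pass, assuming each hash folder contains an index.html; the folder layout, table name and the "copyright" heuristic are placeholders:

```python
# Walk the folders, pull the <title> and the first text node mentioning a
# copyright notice, and store both in SQLite keyed by the folder hash.
import re
import sqlite3
from pathlib import Path
from bs4 import BeautifulSoup

conn = sqlite3.connect("pages.db")
conn.execute("CREATE TABLE IF NOT EXISTS pages (folder TEXT, title TEXT, copyright TEXT)")

for html_file in Path("data").glob("*/index.html"):
    soup = BeautifulSoup(html_file.read_text(errors="ignore"), "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else None
    # crude heuristic: first text node containing © or the word "copyright"
    node = soup.find(string=re.compile(r"©|copyright", re.IGNORECASE))
    conn.execute(
        "INSERT INTO pages VALUES (?, ?, ?)",
        (html_file.parent.name, title, node.strip() if node else None),
    )
conn.commit()
```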
For experimentation we'd like to use the emoji embedded in many tweets as ground truth/training data for simple quantitative sentiment analysis. Tweets are usually too unstructured for NLP to work well. Anyway, there are 722 emoji in Unicode 6.0, and probably another 250 will be added in Unicode 7.0. Is there a database (like e.g. SentiWordNet) that contains sentiment annotations for them? (Note that SentiWordNet does allow for ambiguous meanings, too. Consider e.g. funny, which is not just positive: …
Parsing is often used to understand the sentiment of complex sentences, such as those containing double negations or very elaborate constructions. There are two main ways of parsing a sentence: constituency and dependency parsing. Which has been the most successful for sentiment analysis?
I need to apply around 1.6k regex expressions, such as the pair shown below, to around 7k documents (half a page long each, on average). Right now I am using

library(rebus)
library(stringr)

regex_exp <- rebus::or1("(?i-mx:\\b(?:actroid\\b))", "(?i-mx:\\b(?:robot\\*w\\b)))")
regex_exp <- BOUNDARY %R% regex_exp %R% BOUNDARY

stringr::str_extract_all("This is my text talking about technology, but also about the actroid", regex_exp)

to find matches, but it takes approx. 3.5 minutes per file, …
I'm looking for suggestions on how to segregate resume layouts into different types. How does one proceed with such a task? Resumes are usually available in PDF or DOCX format, and when we parse text from these documents we lose a lot of information about layout and metadata. So how could one build a system to segregate resumes based on layout? It'll be really helpful if you have any suggestions.
from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import gazetteers

class LocationChunker(ChunkParserI):
    def __init__(self):
        self.locations = set(gazetteers.words())
        self.lookahead = 0
        for loc in self.locations:
            nwords = loc.count(' ')
            if nwords > self.lookahead:
                self.lookahead = nwords

What is ChunkParserI in nltk.chunk, and what exactly is it being used for here? Also, please explain the code. What is the difference between chunking and parsing?
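For context, ChunkParserI is an interface: a subclass is expected to implement a parse() method that takes POS-tagged tokens and returns an nltk.Tree. A minimal sketch of how such a class is typically completed (the matching here is deliberately simplified to single-word gazetteer lookups, unlike the lookahead logic above):

```python
# Sketch of a complete ChunkParserI subclass: build (word, pos, IOB-tag) triples
# and let conlltags2tree turn them into a chunk Tree.
from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import gazetteers

class SimpleLocationChunker(ChunkParserI):
    def __init__(self):
        self.locations = set(gazetteers.words())

    def parse(self, tagged_tokens):
        iob = [
            (word, pos, "B-LOCATION" if word in self.locations else "O")
            for word, pos in tagged_tokens
        ]
        return conlltags2tree(iob)

# chunker = SimpleLocationChunker()
# chunker.parse([("I", "PRP"), ("visited", "VBD"), ("France", "NNP")])
```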
I was wondering whether one could use reinforcement learning (as it is becoming more and more trendy with Google DeepMind and AlphaGo) to parse and extract information from text. For example, could it be a competitive approach to structured prediction tasks such as:

named entity recognition (NER), i.e. the task of labelling New York as a "city" and New York Times as an "organization"
part-of-speech tagging (POS), i.e. classifying words as determiners, nouns, etc.
information extraction, i.e. finding and …
My problem is turning a string that looks like this:

"a OR (b AND c)"

into

a OR bc

and, if the expression is "a AND (b OR c)", into

ab OR ac

I haven't been able to design a correct set of loops using regex matching. The crux of the issue is that the code has to be completely general, because I cannot assume how long the string pattern will be, nor the exact places of OR and AND in …
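A sketch of a recursive approach instead of regex loops: parse the expression into OR-of-AND form and distribute AND over OR as the recursion unwinds. It assumes simple word variables, the operators AND/OR and parentheses, nothing else.

```python
# Tokenize, then parse with two mutually recursive levels (OR above AND).
# Each expression is represented as a list of AND-terms; AND of two such lists
# is their cross product, which is exactly the distribution step.
import re

def tokenize(expr):
    return re.findall(r"\(|\)|AND|OR|\w+", expr)

def parse_or(tokens):
    terms = parse_and(tokens)
    while tokens and tokens[0] == "OR":
        tokens.pop(0)
        terms += parse_and(tokens)
    return terms

def parse_and(tokens):
    terms = parse_atom(tokens)
    while tokens and tokens[0] == "AND":
        tokens.pop(0)
        right = parse_atom(tokens)
        # distribute: (t1 OR t2) AND (r1 OR r2) -> t1r1 OR t1r2 OR t2r1 OR t2r2
        terms = [t + r for t in terms for r in right]
    return terms

def parse_atom(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        inner = parse_or(tokens)
        tokens.pop(0)  # consume ")"
        return inner
    return [[tok]]

def to_dnf(expr):
    terms = parse_or(tokenize(expr))
    return " OR ".join("".join(term) for term in terms)

# to_dnf("a OR (b AND c)")  -> "a OR bc"
# to_dnf("a AND (b OR c)")  -> "ab OR ac"
```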
Could someone help me with the following? I would appreciate it. I would like to use SyntaxNet to analyse a sentence. I installed Ubuntu on Windows following this link. What do I have to do next to make SyntaxNet produce a result? The sentence is, for example, "David put a book on shelf". Thank you in advance.
I have thousands of CVs/resumes. We want to build a parser that can extract company names from a resume. So far we have tried:

Maintaining a list of common words present in company names (e.g. Org, Ltd, Limited, Technologies) and using them to identify probable companies. But this list is limited, and many companies don't get extracted.

Using the HTML of the CV, we have tried to give a higher score to probable companies which have a certain …
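A sketch of a complementary pass using a pretrained NER model rather than keyword lists; it assumes English resumes and the spaCy en_core_web_sm model being installed, and is only a candidate generator, not the scoring logic described above:

```python
# Collect every span spaCy labels as ORG as a candidate company name.
import spacy

nlp = spacy.load("en_core_web_sm")

def candidate_companies(resume_text):
    doc = nlp(resume_text)
    return sorted({ent.text for ent in doc.ents if ent.label_ == "ORG"})

# candidate_companies("Worked at Acme Technologies Ltd as a data engineer.")
```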
I want to create a neural net that can extract some specific words from a PDF document into JSON or XML. For example, let's assume that I have a PDF containing some information about countries, and I want to recover each country's name and population to obtain something like this:

<countries>
  <country>
    <name> France </name>
    <population> 70m </population>
  </country>
  ...
</countries>

Should I build a neural net and train it myself? If so, can you give a …
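For comparison, a rule-based baseline without any neural net, assuming the PDF text happens to contain lines like "France population: 70m"; the regex and line format are assumptions about the document, not something general:

```python
# Extract the raw text with pdfminer.six, regex out (name, population) pairs,
# and emit the XML structure shown above with xml.etree.
import re
import xml.etree.ElementTree as ET
from pdfminer.high_level import extract_text

def pdf_to_countries_xml(pdf_path):
    text = extract_text(pdf_path)
    root = ET.Element("countries")
    for name, population in re.findall(r"([A-Z][A-Za-z ]+) population:\s*(\S+)", text):
        country = ET.SubElement(root, "country")
        ET.SubElement(country, "name").text = name.strip()
        ET.SubElement(country, "population").text = population
    return ET.tostring(root, encoding="unicode")
```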
Updated: see the bottom of this question. Not sure if I am in the right place here, but I have good hopes that someone might be able to help me. I am trying to process and analyse data from a Footscan system that is exported from the Footscan 9 Gait Essentials software to a .rsdb database. This database is a standard SQLite database with a different file extension. I have no problems accessing the data via Python, but the most interesting …
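For reference, a minimal sketch of inspecting the .rsdb file with the standard library and pandas; the file name and table name below are placeholders for whatever the Footscan export actually contains:

```python
# List the tables in the SQLite-format .rsdb file, then load one into a DataFrame.
import sqlite3
import pandas as pd

conn = sqlite3.connect("measurement.rsdb")
tables = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type = 'table'", conn)
print(tables)

# df = pd.read_sql_query("SELECT * FROM some_table", conn)  # hypothetical table name
```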
I presently receive files from a device in a semi-CSV format. I have written a simple recursive descent parser for getting information out of these files. Every time the device's firmware is updated, I need a new version of the parser to handle the changes the update brings. Down the road, we will be taking data from other devices, which means another parser and more firmware updates. I'm wondering if I could define a basic structure of "this is the …
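A sketch of the data-driven direction this seems to be heading toward: describe each device/firmware format as a declarative spec and drive one generic parser from it. The field names, separator and converters here are hypothetical, not taken from the actual device files.

```python
# One spec entry per (device, firmware); adding a format means adding data,
# not writing another parser.
FORMATS = {
    ("device_a", "fw_2.1"): {
        "separator": ";",
        "fields": [("timestamp", str), ("temperature", float), ("status", str)],
    },
}

def parse_line(line, spec):
    values = line.strip().split(spec["separator"])
    return {
        name: convert(raw)
        for (name, convert), raw in zip(spec["fields"], values)
    }

def parse_file(lines, device, firmware):
    spec = FORMATS[(device, firmware)]
    return [parse_line(line, spec) for line in lines if line.strip()]

# parse_file(["2024-01-01T00:00;21.5;OK"], "device_a", "fw_2.1")
```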
I'm working with a shell script (#!/bin/sh) and I want to know whether there is a way to access variables, with their values, from an Rscript that I have called in my shell script. If that doesn't make sense: I want to create, for example, a data frame

data = data.frame(a = seq(1, 5), b = seq(1, 5))

in a script called test.r and then use that variable, with its contents, in my shell script, e.g. to print it with an echo:

echo $data
I'm trying to extract NPs from transcribed spoken text, such as

um it's the bl- it's the blue one in the right no left hand corner

which contains, e.g., fillers (um) and disfluencies (bl-, right no left hand corner) that are not commonly seen in written text. Ideally, I'd like to get something like the three sequences it, the blue one and the left hand corner (or at the very least the right no left hand corner). I'm …
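A minimal sketch of one baseline, assuming a spaCy pipeline: strip a small (assumed) filler list and broken-off tokens first, then take spaCy's noun_chunks as the NPs. This only reduces noise; it does not repair the "right no left" self-correction.

```python
# Drop fillers and word fragments, then run the cleaned utterance through spaCy
# and return its noun chunks.
import spacy

nlp = spacy.load("en_core_web_sm")
FILLERS = {"um", "uh", "er"}  # assumed filler inventory

def noun_phrases(utterance):
    tokens = [
        t for t in utterance.split()
        if t.lower() not in FILLERS and not t.endswith("-")  # drop um, bl-, etc.
    ]
    doc = nlp(" ".join(tokens))
    return [chunk.text for chunk in doc.noun_chunks]

# noun_phrases("um it's the bl- it's the blue one in the right no left hand corner")
```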