I have a structured dataset with rows as different samples and columns as different attributes of the samples. Interestingly, the attributes are highly inter-correlated (i.e. a complex system). I want to understand the system by training many classifier models, each taking one column as the target and all the other columns as the features (I call this kind of modeling "all-to-all"). Because the attributes and targets are highly correlated, many of these models should reach reasonable accuracy. Before actually …
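A minimal sketch of the "all-to-all" setup described here, assuming scikit-learn, a pandas DataFrame `df` of discrete/categorical attributes, and an arbitrary choice of estimator (the function name and encoding are illustrative, not prescriptive):

```python
# Hypothetical sketch: train one classifier per column ("all-to-all"),
# using all remaining columns as features. Assumes `df` is a pandas DataFrame
# whose columns are categorical/discrete; model choice is illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def all_to_all_scores(df: pd.DataFrame, cv: int = 5) -> dict:
    scores = {}
    for target in df.columns:
        X = pd.get_dummies(df.drop(columns=[target]))   # naive one-hot encoding of features
        y = df[target]
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        scores[target] = cross_val_score(clf, X, y, cv=cv).mean()
    return scores

# Columns with high cross-validated scores are the ones well explained
# by the other attributes:
# scores = all_to_all_scores(df)
```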
Association rule mining is considered an old AI technique; rules are mined based on statistical support. How can deep learning be applied to this? What are approaches for structured data (in a graph-like format such as XML)? XML documents are structured by tags. My goal is to extract a rule saying that tag x is often combined with tags y and z. Then I later want to apply these rules, and if a tag y and z is …
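For reference, a sketch of the classical support-based baseline on XML tag sets, assuming `mlxtend` and the standard library's `xml.etree`; the file paths and thresholds are illustrative:

```python
# Mine "tag y and z often occur with tag x" rules from XML documents,
# treating each document as a transaction of its distinct tags.
import xml.etree.ElementTree as ET
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

def tag_set(path: str) -> list:
    return sorted({elem.tag for elem in ET.parse(path).getroot().iter()})

docs = ["doc1.xml", "doc2.xml"]              # hypothetical document paths
transactions = [tag_set(p) for p in docs]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
frequent = apriori(onehot, min_support=0.3, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
# A rule {y, z} -> {x} means: documents containing tags y and z usually also contain x.
```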
I want to do an RL project in which the agent will learn to drop duplicates in tabular data. But I couldn't find any examples of RL being used that way - I checked whether RL-based recommendation systems use a user-item interaction matrix as in collaborative filtering. I am wondering if it's really possible and how to define the problem (e.g. whether it's episodic; does an episode terminate when the agent is done iterating over all data samples, etc.). Can …
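One possible (purely hypothetical) episodic formulation, sketched as a Gymnasium environment: the agent scans rows one at a time, chooses keep or drop, and the episode terminates after the last row. The reward shaping here assumes duplicates are known at training time and is only an illustration:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class DedupEnv(gym.Env):
    """Agent sees one row at a time; action 0 = keep, 1 = drop."""

    def __init__(self, data: np.ndarray):
        self.data = data
        self.observation_space = spaces.Box(-np.inf, np.inf,
                                            shape=(data.shape[1],), dtype=np.float32)
        self.action_space = spaces.Discrete(2)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.i, self.kept = 0, []
        return self.data[self.i].astype(np.float32), {}

    def step(self, action):
        row = self.data[self.i]
        is_dup = any(np.array_equal(row, k) for k in self.kept)
        # +1 for the correct keep/drop decision, -1 otherwise (illustrative reward).
        reward = 1.0 if (action == 1) == is_dup else -1.0
        if action == 0:
            self.kept.append(row)
        self.i += 1
        terminated = self.i >= len(self.data)      # episode ends after the last row
        obs = self.data[min(self.i, len(self.data) - 1)].astype(np.float32)
        return obs, reward, terminated, False, {}
```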
Convert natural language text to structured data. I'm developing a bot to assist users in identifying apparel. The problem is to convert natural language text to structured data (a list of apparel items) and query the store's inventory to find the closest match for each item. For example, consider the following user input to the bot: "I would like to order regular fit blue jeans with hip size 32 inches", and the desired output would be the following: [ { "quantity": …
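As a starting point, a minimal rule-based sketch of the extraction step (the field names and vocabularies below are my own assumptions, since the desired schema in the question is truncated; a learned NER/slot-filling model would replace the lookups and regex):

```python
import re

# Hypothetical vocabularies; in practice these would come from the inventory.
FITS = ["regular fit", "slim fit", "loose fit"]
COLORS = ["blue", "black", "white", "red"]
ITEMS = ["jeans", "shirt", "jacket"]

def parse_order(text: str) -> dict:
    t = text.lower()
    item = {
        "item": next((i for i in ITEMS if i in t), None),
        "fit": next((f for f in FITS if f in t), None),
        "color": next((c for c in COLORS if c in t), None),
    }
    size = re.search(r"(hip|waist)\s+size\s+(\d+)\s*inches?", t)
    if size:
        item["size"] = {"type": size.group(1), "value": int(size.group(2)), "unit": "inches"}
    return item

print(parse_order("I would like to order regular fit blue jeans with hip size 32 inches"))
# {'item': 'jeans', 'fit': 'regular fit', 'color': 'blue',
#  'size': {'type': 'hip', 'value': 32, 'unit': 'inches'}}
```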
I wonder whether there are significant differences that ought to be known when preprocessing nominal vs ordinal vs interval vs ratio data. Intuitively, it seems like nominal values should be encoded using one-hot encoding so as not to introduce ordering assumptions artificially, and ordinal data (bad, better, best) using ordinal encoding (1, 2, 3) to preserve the order (although it does introduce a scale, effectively turning ordinal data into interval data, it appears). Also, scaling the data seems problematic - if I were to encode labels …
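A short sketch of the two encodings under discussion, using scikit-learn (assuming a recent version with `sparse_output`); the column names and category ordering are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "color": ["red", "blue", "red"],        # nominal: no inherent order
    "quality": ["bad", "best", "better"],   # ordinal: has an order
})

# One column per colour, no implied ordering.
onehot = OneHotEncoder(sparse_output=False).fit_transform(df[["color"]])

# Preserves bad < better < best, but imposes equal spacing (an interval-like scale).
ordinal = OrdinalEncoder(categories=[["bad", "better", "best"]]).fit_transform(df[["quality"]])
# ordinal -> [[0.], [2.], [1.]]
```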
I am analysing tweets and have collected them in an unstructured format. What is the best way to structure this data so I can begin the data mining process? Somebody suggested using Python packages such as spaCy, but I'm not sure how to go about using this.
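A minimal sketch of what "structuring" the tweets with spaCy might look like, assuming the `en_core_web_sm` model is installed; the choice of columns is just an example:

```python
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

tweets = ["Loving the new phone from Apple!", "Traffic in London is terrible today"]

rows = []
for doc in nlp.pipe(tweets):
    rows.append({
        "text": doc.text,
        "tokens": [t.text for t in doc if not t.is_punct],
        "lemmas": [t.lemma_ for t in doc if not t.is_stop and not t.is_punct],
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
    })

df = pd.DataFrame(rows)   # one row per tweet, ready for counting / mining
```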
I have a body of PDF documents of differing vintage. Our group had exported the documents as text to feed them into a natural-language parser (I think) to pull out subject-verb-predicate triples. This hasn't performed as well as hoped, so I exported the documents as XML using Acrobat Pro, hoping to capture the semantic document structure in order to pass it in as a hint to the text parser. One document looked pretty good (something like this): <TaggedPDF-doc> <bookmark-tree>...</bookmark-tree> <Sect>...</Sect> …
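One way to turn that exported XML into parser hints, sketched with the standard library's `xml.etree`: walk the tree and record, for each text run, the chain of structural tags above it (tag names beyond the snippet above are assumptions about Acrobat's output):

```python
import xml.etree.ElementTree as ET

def text_with_context(path: str):
    """Return each text run together with its ancestor tag chain, e.g. Sect > H1."""
    root = ET.parse(path).getroot()
    out = []

    def walk(elem, ancestors):
        if elem.text and elem.text.strip():
            out.append({"tags": ancestors + [elem.tag], "text": elem.text.strip()})
        for child in elem:
            walk(child, ancestors + [elem.tag])

    walk(root, [])
    return out

# for rec in text_with_context("document.xml"):
#     print(" > ".join(rec["tags"]), "|", rec["text"][:60])
```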
What are some systematic ways to categorise variables as categorical or numeric? I believe relying only on intuition in such scenarios can often lead to major irreversible errors. What are the best strategies when categorising variables? For example, the dataframe I'm working with has several categorical variables such as is_holiday, which has labels for several holidays. However, certain variables like visibility_in_miles suggest that they too may need to be treated as categorical. Part of the reason is that while most variables …
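A rough, illustrative heuristic (the thresholds are arbitrary): treat a column as categorical if it is non-numeric, or if it is numeric but takes only a handful of distinct values relative to the number of rows.

```python
import pandas as pd

def infer_kind(s: pd.Series, max_unique: int = 20, max_ratio: float = 0.05) -> str:
    if not pd.api.types.is_numeric_dtype(s):
        return "categorical"
    n_unique = s.nunique(dropna=True)
    if n_unique <= max_unique or n_unique / max(len(s), 1) <= max_ratio:
        return "categorical"   # e.g. an integer-coded is_holiday flag
    return "numeric"           # e.g. visibility_in_miles with many distinct values

# kinds = {col: infer_kind(df[col]) for col in df.columns}
```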