Building a graph out of a large text corpus

I'm given a large number of documents on which I should perform various kinds of analysis. Since the documents are to be used as the foundation of a final product, I thought about building a graph out of this text corpus, with each document corresponding to a node. One way to build a graph would be to use models such as USE to first find text embeddings, and then form a link between two nodes (texts) whose similarity is beyond …
Category: Data Science
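
A minimal sketch of the thresholding idea described above, assuming the embeddings are already computed and that cosine similarity is the measure being thresholded (the excerpt does not say which one is used). The names `build_similarity_graph` and `toy_embeddings` are hypothetical, and the 2-D vectors are made up for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def build_similarity_graph(embeddings, threshold):
    """Link every pair of documents whose similarity exceeds the threshold.

    embeddings: dict mapping a document id to its embedding vector
    (hypothetical input; in practice it would come from a model such as USE).
    Returns an undirected graph as a dict of neighbour sets.
    """
    graph = {doc_id: set() for doc_id in embeddings}
    for a, b in combinations(embeddings, 2):
        if cosine_similarity(embeddings[a], embeddings[b]) > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Toy 2-D vectors standing in for real embeddings (made up for illustration).
toy_embeddings = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
}
graph = build_similarity_graph(toy_embeddings, threshold=0.8)
```

The all-pairs loop is quadratic in the number of documents, so for a large corpus some form of nearest-neighbour search would probably have to replace it.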

Data Analytics: how to read an ECDF graph

Hi there, my question is about how to read ECDF graphs. I am still quite unsure what the jumps / zig-zags in the graph mean, what is happening when there is a horizontal line, and so on. I would be happy if someone could explain to me how I am supposed to read this graph and what information I can get from it. Thank you
Category: Data Science
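
To make the jumps and flat stretches concrete, here is a small sketch that computes an ECDF by hand. The sample values are made up, and `ecdf` and `ecdf_at` are hypothetical helper names. The curve jumps at every observed value (by 1/n per observation, so tied values produce a taller jump) and stays flat over ranges where no observations fall.

```python
def ecdf(values):
    """Return the sorted values and the ECDF height reached at each one."""
    xs = sorted(values)
    n = len(xs)
    # After the i-th smallest observation (1-based), a fraction i/n of the data is <= it.
    heights = [(i + 1) / n for i in range(n)]
    return xs, heights


def ecdf_at(values, x):
    """Fraction of observations less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)


sample = [2, 3, 3, 5, 8]  # made-up data
xs, heights = ecdf(sample)
```

With this sample, `ecdf_at(sample, 3)` and `ecdf_at(sample, 4)` are both 0.6: the curve jumps by 0.4 at 3 (two tied observations) and then stays flat until the next observation at 5.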

Task of regression on graphs

Which tools are available to extract features from a graph? After that, I would like to perform regression on those features. Initially, I was thinking about using the adjacency matrix of the graph, but maybe there is a smarter way of doing feature extraction on graphs.
Category: Data Science
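
As one possible alternative to feeding in the raw adjacency matrix, the sketch below computes a few fixed-size structural features per graph that could serve as regression inputs. The choice of features and the name `graph_features` are assumptions for illustration, and the code assumes a simple undirected graph without self-loops.

```python
from itertools import combinations


def graph_features(edges):
    """Summarise an undirected graph, given as (u, v) pairs, as a small feature dict."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    n = len(adj)
    degrees = [len(neighbours) for neighbours in adj.values()]
    m = sum(degrees) // 2  # every edge contributes to two degrees

    # Count connected neighbour pairs; each triangle is seen once from each corner.
    corner_hits = sum(
        1
        for neighbours in adj.values()
        for v, w in combinations(neighbours, 2)
        if w in adj[v]
    )

    return {
        "num_nodes": n,
        "num_edges": m,
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
        "mean_degree": sum(degrees) / n if n else 0.0,
        "max_degree": max(degrees, default=0),
        "triangles": corner_hits // 3,
    }


# A triangle a-b-c with a pendant node d attached to c (made-up example).
features = graph_features([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
```

Because the feature vector has the same length for every graph, it sidesteps the problem of adjacency matrices having different sizes.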

Are there any graph embedding algorithms like this already?

I wrote an algorithm for generating node embeddings based on the graph's topology. Most of the explanation is in the readme file and the examples. The question is: Am I reinventing the wheel? Does this approach have any practical advantages over existing solutions for embedding generation? Yes, I'm aware there are many algorithms for this based on random walks, but this one is purely deterministic linear algebra and it is quite simple, from my perspective. In short, the algorithm …
Category: Data Science

How can I store sources, effective dates, and confidence for every property in a knowledge graph?

I want to ensure that every property in a knowledge base comes from at least one source. I would like to ensure that every edge is spawned (or at least explained) by some event, like a "claim" or "measurement" or "birth." I'd like to rate, on a scale, the confidence that some property is correct, which could also be inherited from the source's confidence rating. Finally, I want to ensure that effective date(s) are known or …
Category: Data Science

How to apply K-Medoids to many CFGs?

I have around 1000 DAGs (directed acyclic graphs) from different files showing java.io.BufferedReader usage. The following is a representation of one of the graphs: digraph G { 9 [ label="9 : ROOT:setup()#0" ]; 10 [ label="10 : START IF" ]; 12 [ label="12 : java.net.URL.openConnection()#1" ]; 11 [ label="11 : END IF" ]; 13 [ label="13 : java.net.URL.openConnection()#0" ]; 14 [ label="14 : START IF" ]; 16 [ label="16 : java.net.HttpURLConnection.setRequestProperty()#2" ]; 15 [ label="15 : END IF" ]; 17 [ label="17 …
Category: Data Science

Growth Edge in Link Prediction

I have 2 CSV files representing edges in a social network in 2 consecutive generations. I am trying to predict future edges. My initial thought is to train a linear regression on the first generation with some indicators like the Adamic-Adar index or cosine similarity between the nodes of the edge I am trying to predict. I cannot run all the possible combinations of 2 nodes, so I was wondering how many edges I need to add between 2 generations? Is …
Category: Data Science
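
A rough sketch of the Adamic-Adar indicator mentioned above, together with one common way to avoid scoring every pair of nodes: only consider non-adjacent pairs that already share a neighbour. That restriction is an assumption on my part rather than something from the question, the helper names are hypothetical, and the edge list is made up.

```python
import math
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency dict from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def adamic_adar(adj, u, v):
    """Sum of 1 / log(degree) over the common neighbours of u and v."""
    score = 0.0
    for w in adj[u] & adj[v]:
        degree = len(adj[w])
        if degree > 1:  # log(1) == 0 would cause a division by zero
            score += 1.0 / math.log(degree)
    return score


def candidate_pairs(adj):
    """Non-adjacent pairs that share at least one neighbour (labels must be sortable)."""
    pairs = set()
    for neighbours in adj.values():
        for u, v in combinations(sorted(neighbours), 2):
            if v not in adj[u]:
                pairs.add((u, v))
    return pairs


# Made-up first-generation edges.
adj = build_adjacency([("a", "c"), ("b", "c"), ("a", "d"), ("b", "d"), ("d", "e")])
pairs = candidate_pairs(adj)
score_ab = adamic_adar(adj, "a", "b")
```

Each candidate pair could then be labelled by whether it appears as an edge in the second generation, which gives a training set for the regression without enumerating all combinations.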

Return the gradient and y intercept (m, b) to create two lines to best fit the data

I have been working on this task for a few hours now and have been unsuccessful in getting the target result. I have tried multiple ways of splitting the dataset, using different clustering methods and logistic regression, with no luck. I thought noncontinuous piecewise linear regression might work, but found no good resources on how to implement it. The task is: given a 2D NumPy array of x, y data points, determine the gradient and y-intercept for …
Category: Data Science
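
One way to read "noncontinuous piecewise linear regression" is a brute-force search over split points: sort by x, try every split, fit an ordinary least-squares line on each side, and keep the split with the lowest total squared error. The sketch below assumes the two lines cover separate x-ranges, which the truncated question does not confirm; if the two lines overlap in x, this approach would not apply. Function names and data are hypothetical, and the input is a plain list of (x, y) pairs (a 2D NumPy array could be converted with `.tolist()`).

```python
def fit_line(points):
    """Ordinary least-squares fit; returns (m, b, sum_of_squared_errors)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    m = sxy / sxx if sxx else 0.0  # all x equal: fall back to a flat line
    b = mean_y - m * mean_x
    sse = sum((y - (m * x + b)) ** 2 for x, y in points)
    return m, b, sse


def fit_two_lines(points, min_size=2):
    """Try every split of the x-sorted points and keep the lowest total error."""
    pts = sorted(points)
    best = None
    for k in range(min_size, len(pts) - min_size + 1):
        left = fit_line(pts[:k])
        right = fit_line(pts[k:])
        total = left[2] + right[2]
        if best is None or total < best[0]:
            best = (total, left[:2], right[:2])
    if best is None:
        raise ValueError("need at least 2 * min_size points")
    return best[1], best[2]


# Made-up data: y = 2x + 1 for x in 0..4, then y = -x + 20 for x in 5..9.
toy_points = [(x, 2 * x + 1) for x in range(5)] + [(x, -x + 20) for x in range(5, 10)]
first, second = fit_two_lines(toy_points)
```

Here `first` and `second` are the (m, b) pairs for the left and right segments. The search refits both lines for every split, which is fine for small arrays but would need an incremental update for large ones.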

How to tell how much information I lose when I simplify a graph data structure, relative to the unsimplified graph?

I have the following problem: I have some data (which I can't publish here, but which are in the form of points with XYZ coordinates) and I can represent them as a collection of graphs, i.e. $Q = \{G_1, G_2, \ldots, G_t\}$, where every node has an associated set of features, e.g. node $u_i$ has feature vector $\mathcal{F}_i$, and the features change between graphs (but the graph structure does not). The resulting graphs are big in …
Topic: pca graphs
Category: Data Science

How to perform node classification using Graph Neural Networks

I am trying to perform node classification using graph neural network methods. My initial plan was to convert my graphs to adjacency matrices and train my network on those, with the node features being my target. However, my graphs all have a different number of nodes, so I believe adjacency matrices will not work. I then found information about node embeddings and applications in biology (see here). It implies that once you embed your nodes, graph size no longer matters. …
Category: Data Science

Using iGraph to build a Distribution Model

I would like to analyze the distribution of a shop's customers if the shop is closed or terminated. Consider the following sample dataset:

| ShopID | MonthlyCVisitCount | Lat      | Lng       |
|--------|--------------------|----------|-----------|
| A1     | 15000              | 39.84349 | 116.33986 |
| A2     | 24560              | 39.84441 | 116.33995 |
| A3     | 14789              | 39.84615 | 116.34012 |
| A4     | 35479              | 39.84891 | 116.34039 |

I would like to build a distribution model using …
Category: Data Science

Why is sliding window evaluation important in time series analysis?

I have been working on a survey of dynamic graph neural networks, and what I realized is that all of the well-known papers (from prestigious universities) do not use sliding window evaluation on dynamic graph models. They only use simple train-test splits. I find this very confusing. Then I started asking why sliding window evaluation is important in time series analysis in the first place. From my own experience, I know for a fact that dynamic graph models are VERY VERY sensitive to …
Category: Data Science

How to perform inductive train/test split for GraphSAGE classification

Let's say I have a network that consists of a single weakly connected component. From various papers I've seen that if you want to use inductive GNNs like GraphSAGE, it is advisable to split your train/test data into two separate graphs or components. Since I've seen that there are different approaches for node classification and link prediction tasks, I am specifically interested in node classification tasks, possibly multiclass classification. So the train/test split graphs would need to ensure some sort …
Category: Data Science

Semi supervised learning on graphs

I have the following semi-supervised problem: I have a graph of persons and their relations. Some of those persons have a predefined risk classification. The task is to classify the risk of the other nodes. I know risk is kind of arbitrary, which is why I'm open to any ideas. For example, suppose I have a person with classification critical (10) and I want to find the risk classification of their neighborhood. I thought of doing something like for every node, for every fixed …
Category: Data Science

Algorithms for Vertex or Node Correspondence

Given a graph G, and another graph G’ with the same number of vertices, one can define a vertex correspondence function f, from the vertex set of G to the vertex set of G’. The correspondence function f needs to be bijective, and its purpose is to give information about the relationship between the two graphs. One example of this: given two isomorphic graphs G and G’, the actual isomorphism would serve as the vertex correspondence function. I …
Category: Data Science

How to approach mapping families of vectors on a lattice and forecast resulting value

I describe here a model of how neighbours influence a node. I wish to implement it to attempt forecasting the values associated with nodes. I am posting here to ask for suggestions on mathematical models and machine learning techniques that may already have considered a similar approach but that I am not aware of, and for hints on their implementation (Python). Suppose you have a square 2D lattice (a grid of 9 squares for simplicity), and: for each time t from each cell …
Category: Data Science

What are graph embeddings?

I recently came across graph embeddings such as DeepWalk and LINE. However, I still do not have a clear idea as to what is meant by graph embeddings and when to use them (applications). Any suggestions are welcome!
Topic: graphs
Category: Data Science

Fraud risk propagation in large scale network

What's the best approach to do some graph analytics and risk propagation in a network using Python, where multiple accounts are connected through relationships, a few of the accounts in the network are marked as bad accounts, and the rest are unknown? I tried using networkx but it seems to run forever. I have about 8MM edges and 40K nodes
Category: Data Science
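
For illustration only, here is a sketch of one simple propagation rule: known bad accounts stay clamped at risk 1.0, and every other account repeatedly takes a damped average of its neighbours' scores. The update rule, the `damping` value and the function name `propagate_risk` are all assumptions, not an established method for this dataset.

```python
def propagate_risk(edges, bad_nodes, iterations=20, damping=0.85):
    """Iteratively spread risk scores from known bad nodes to their neighbours.

    Each unknown node's score becomes damping * (mean score of its neighbours);
    known bad nodes stay clamped at 1.0. This rule is one hypothetical choice.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    risk = {node: (1.0 if node in bad_nodes else 0.0) for node in adj}
    for _ in range(iterations):
        updated = {}
        for node, neighbours in adj.items():
            if node in bad_nodes:
                updated[node] = 1.0
            else:
                # Synchronous update: only scores from the previous round are read.
                updated[node] = damping * sum(risk[n] for n in neighbours) / len(neighbours)
        risk = updated
    return risk


# Made-up chain a - b - c - d where only "a" is known to be bad.
risk = propagate_risk([("a", "b"), ("b", "c"), ("c", "d")], bad_nodes={"a"})
```

Each iteration touches every edge a constant number of times, so the cost grows linearly with the edge count. The same update can also be written as a sparse matrix-vector product, which may be faster than Python loops at the scale mentioned in the question.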

Probability distributions for Directed Cyclic Graphs

Given a directed cyclic graph where vertex A is 'infected', and there are different infection probabilities between each pair of nodes, what is the best approach towards computing the conditional probability $p(F|A)$? Do I have to transform it into an acyclic graph and use Bayesian network methods? How would I proceed in order to design an algorithm for computing probabilities like this one, and are there approaches to this that are computationally feasible for very large networks?
Category: Data Science
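
One approach that works on cyclic graphs without converting them is Monte Carlo simulation. The sketch below assumes an independent-cascade style model, in which each edge gets exactly one independent chance to transmit the infection; the question does not specify the infection model, so that is an assumption. The function name and the toy graph are hypothetical.

```python
import random


def estimate_infection_probability(edges, source, target, trials=10_000, seed=0):
    """Monte Carlo estimate of P(target infected | source infected).

    edges: dict mapping (u, v) to the probability that u infects v.
    Assumes each edge transmits independently and is tried at most once.
    """
    rng = random.Random(seed)
    out_edges = {}
    for (u, v), p in edges.items():
        out_edges.setdefault(u, []).append((v, p))

    hits = 0
    for _ in range(trials):
        infected = {source}
        frontier = [source]
        while frontier:
            node = frontier.pop()
            for neighbour, p in out_edges.get(node, []):
                # The infected set stops cycles from re-infecting a node.
                if neighbour not in infected and rng.random() < p:
                    infected.add(neighbour)
                    frontier.append(neighbour)
        if target in infected:
            hits += 1
    return hits / trials


# Made-up cyclic graph: A -> B -> F -> A.
toy_edges = {("A", "B"): 0.5, ("B", "F"): 0.5, ("F", "A"): 0.9}
estimate = estimate_infection_probability(toy_edges, "A", "F")
```

In this toy graph the only route from A to F is A, B, F, so the exact answer is 0.5 × 0.5 = 0.25 and the estimate should land close to it. Each trial costs at most one pass over the edges, so the total cost scales with the number of trials times the graph size.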

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.