Azure Cloud SQL - Querying a large number of rows with Python

I have a Python Flask application that connects to an Azure Cloud SQL Database, and uses the Pandas read_sql method with SQLAlchemy to perform a select operation on a table and load it into a dataframe:

recordsdf = pd.read_sql(recordstable.select(), connection)

The recordstable has around 5000 records, and the function is taking around 10 seconds to execute (I have to pull all records every time). However, the exact same operation with the same data takes around 0.5 seconds when I'm selecting …
Category: Data Science

How can I store sources, effective dates, and confidence for every property in a knowledge graph?

I want to ensure that every property in a knowledge base comes from at least one source. I would like to ensure that every edge is spawned (or at least explained) by some event, like a "claim" or "measurement" or "birth." I'd like to rate on a scale the confidence that some property is correct, which could also be inherited from the source's confidence rating. Finally, I want to ensure that effective date(s) are known or …
Category: Data Science

Database System for Manual Entry

I'm in charge of setting up a patient register (100K+ patients) for a non-profit project with little money. This register should provide the basis for later data science. I'm not sure what database solution will work well in the long run. It must be possible for various clinics to enter the data manually into the system. Since I have experience with Django, I have developed a webapp prototype with Django and an SQLite DB (it is not expected that many users …
Topic: databases
Category: Data Science

Treating highly correlated features to the label feature

We work on a dataset with >1k features, where some features are temporal/non-linear aggregations of other features. For example, one feature might be the salary s, while another is the mean salary over four months (s_4_m). We try to predict which employees are more likely to get a raise by applying a regression model on the salary. Still, our models are extremely biased toward features like s_4_m, which are highly correlated with the label feature. Are there best practices for …
Category: Data Science

Feature extraction from relational database

In order to build a classifier, I need to extract a few features from the data stored in a MySQL database. I need to join multiple tables and it is taking a lot of time. I have joined 2 tables at a time and have got results in multiple cases. I need to combine them. Would writing a script be the best option? How do people extract features from large relational databases? Am I missing something? Thanks.
Category: Data Science

ML model deployment architecture?

I come from a software development background, and we have separate servers of the same database (dev, test, prod). The reason is that we develop our apps against the dev DB, run tests against the test DB, and prod is prod. This way we create a clear separation and won't bring down prod while trying to build our app. Do you guys train your models the same way? Have 3 environments of the same database and as your …
Category: Data Science

I Have Issues Installing Basemap

I tried to install Basemap and it gives me this:

preparing transaction: done
verifying transaction: done
executing transaction: failed
ERROR conda.core.link:_execute(507): An error occurred while uninstalling package 'defaults::conda-4.5.12-py37_0'.
PermissionError (13, Access is denied)
Attempting to roll back
Rolling back transaction: done
PermissionError (13, Access is denied)

Question: What should I do next? I would appreciate your response as I have been on this for some time now. Thanks. NOTE: I have also tried to install cartopy but I ran …
Category: Data Science

Data source for financial data mining

I plan to do data modeling in the financial area for my master's dissertation. I am thinking of finding the connection between certain company or country characteristics (x values) and their creditworthiness (here I am still looking for a y variable such as credit score, bankruptcy occurrence, etc.). Do you know in which databases I could find the required input? Company data would be great; however, country data will most likely be more accessible, and then I could also do …
Category: Data Science

Solution to discover/infer the usage/meanings of tables in an unknown database?

This is a situation I have run into often recently: customers give me a database with many tables that they don't quite understand either, then ask me to build a model to predict future revenue, classify which users may be valuable, or something else. To be honest, extracting useful data from an unknown database has left me exhausted. For example, I need to figure out which table is the user table, product table, or transaction table ... which column can be used to join (there …
Category: Data Science

How to strategize model training with new data coming in every day?

I have a MySQL database in which new records are added to the raw data every day. This raw data is cleaned and an ML model is trained on it once a week. What is the best strategy to capture new data in the model without fetching all records (old and new) and retraining from scratch? I'm saving the models every week with pickle; can I just fit the previously saved model on the new records? Is this an efficient methodology …
Category: Data Science

Decision Tree taking too long to execute

I am training a Decision Tree Regressor on a relatively small dataset. The dimensions of my train and test sets are (34164, 10) and (8514, 10). Here is the relevant code:

y = np.log(data2['price'])
data2.drop(['price'], axis = 1, inplace = True)
num_cols = [cname for cname in data2.columns if data2[cname].dtype in ['int64', 'float64']]
cat_cols = [cname for cname in data2.columns if data2[cname].dtype == 'object']
num_trans = SimpleImputer(strategy = 'mean')
cat_trans = Pipeline(steps = [('impute', SimpleImputer(strategy = 'most_frequent')), ('onehotencode', OneHotEncoder(handle_unknown = …
Category: Data Science

Microsoft Access Partial Unique Index

In many databases (MongoDB comes to mind) there's a way to specify a partial unique index, which expresses the sentiment: "Please make sure no two records in this table are duplicates with respect to this set of fields, as long as this condition on the record holds true (otherwise don't consider this record in the uniqueness constraint)." Does Microsoft Access have a way of expressing this kind of a constraint?
Category: Data Science
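
To make the constraint described in the question concrete, here is a minimal sketch using SQLite rather than Microsoft Access. SQLite accepts a WHERE clause on CREATE UNIQUE INDEX, which gives exactly this "unique only when the condition holds" behaviour. The table and column names (members, email, active) are hypothetical and chosen only for illustration; nothing here shows how Access itself would express the constraint.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a partial unique index (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE members (id INTEGER PRIMARY KEY, email TEXT, active INTEGER)"
    )
    # Uniqueness on email is enforced only for rows where active = 1.
    conn.execute(
        "CREATE UNIQUE INDEX idx_active_email ON members (email) WHERE active = 1"
    )
    return conn


def try_insert(conn, email, active):
    """Return True if the row was accepted, False if the unique index rejected it."""
    try:
        conn.execute(
            "INSERT INTO members (email, active) VALUES (?, ?)", (email, active)
        )
        return True
    except sqlite3.IntegrityError:
        return False


conn = build_demo_db()
results = [
    try_insert(conn, "a@example.com", 1),  # first active row: accepted
    try_insert(conn, "a@example.com", 0),  # inactive duplicate: outside the index
    try_insert(conn, "a@example.com", 0),  # another inactive duplicate: accepted
    try_insert(conn, "a@example.com", 1),  # second active duplicate: rejected
]
print(results)
```

Rows that fail the index condition are simply not part of the index, so any number of them may share the same value.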

Connect Orange 3.20 to PostgreSQL database

I installed Orange 3.20 on Windows 7. It works so far; the problem is connecting it to a server-based Postgres database. While the connection can be made at the moment, when you try to load a table the message "missing extension quantile" comes up. A few problems are coming up with this message. It seems like it is not possible to install this extension on a Windows server without a lot of stress. The extension seems not to be actual …
Category: Data Science

Do I need to read an entire database for a recommendation system?

Let's say I have a database with approx 100000 rows. I want to build a content-based recommendation system. Do I really need to read the entire database to calculate similarity? That would be very expensive to do when hosted on AWS, Azure, etc. Additionally, my data is always changing (new data being added, old removed), so I can't just use a constant file. Is there a more cost-effective way?
Category: Data Science

Best image recognition API to implement for eCommerce Lifestyle/Sculpture site

I'm planning an eCommerce site currently. We are likely running WooCommerce and looking to implement Algolia for our search features. We feel that for our particular purposes, a visual search would be a crucial feature to implement, due to our product types. For the purpose of my question, I will use the example of sculptures and ceramics, with various forms both abstract and utilitarian, textures, colors, and so forth. The idea is a customer can upload a photo of their …
Category: Data Science

What is the ideal database that allows fast cosine distance?

I'm currently trying to store many feature vectors in a database so that, upon request, I can compare an incoming feature vector against many others (if not all) stored in the db. I would need to compute the Cosine Distance and only return, for example, the first 10 closest matches. Each vector will be of size ~1000 or so. Every request will have a feature vector and will need to run a comparison against all feature vectors belonging to a …
Category: Data Science
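
As a point of reference for the comparison described in the question, the following is a brute-force sketch in plain Python: it computes the cosine distance from a query vector to every stored vector and keeps the k closest. It is only a baseline under stated assumptions, not a recommendation for any particular database. The function names, the in-memory dict standing in for the database, and the convention of returning distance 1.0 for zero vectors are all assumptions made for illustration.

```python
import heapq
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen here for zero vectors
    return 1.0 - dot / (norm_a * norm_b)


def closest_matches(query, stored, k=10):
    """Return the k (distance, key) pairs with the smallest cosine distance.

    `stored` is a hypothetical in-memory stand-in for the database:
    a dict mapping a record key to its feature vector.
    """
    scored = ((cosine_distance(query, vec), key) for key, vec in stored.items())
    return heapq.nsmallest(k, scored)


# Tiny 2-dimensional example; real vectors would have ~1000 components.
stored = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
    "d": [-1.0, 0.0],
}
top2 = closest_matches([1.0, 0.1], stored, k=2)
print([key for _, key in top2])
```

This scans every stored vector on each request, so its cost grows linearly with the number of vectors; it mainly serves as a correctness check against which any faster approach can be compared.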

How to work with hundreds of CSVs with millions of rows in each?

So I'm doing a project on the COVID-19 Tweets dataset from the IEEE port and I plan to analyse the tweets over the time period from March 2020 to date. The thing is, there are more than 300 CSVs for each data, with each having millions of rows. Now I need to hydrate all of these tweets before I can go and filter through them. Hydrating just 1 CSV alone took more than two hours today. I wanted to know if …
Category: Data Science

How can I create a table from an existing table in SQL but using cells from the old table as columns in the new table?

I have a table, and I want to create a new table such as the one below (from the table above). In SQL, I tried using the following commands. I am able to generate a table with only one column like this:

CREATE TABLE table2 AS SELECT balance FROM table1 WHERE balance='currency'

But if I try to use multiple WHERE clauses, it doesn't seem to work. I tried to do:

CREATE TABLE table2 AS SELECT balance, category FROM table1 WHERE …
Topic: sql databases
Category: Data Science
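
Turning cell values into columns is commonly done with conditional aggregation (one CASE expression per new column) rather than multiple WHERE clauses. The sketch below shows that pattern with Python's sqlite3 module. The original tables are not shown in the question, so the schema here is an assumption: table1 is given hypothetical columns account, category, and balance, and the category values 'currency' and 'stock' are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical source table: one row per (account, category) pair.
conn.execute("CREATE TABLE table1 (account TEXT, category TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [
        ("acc1", "currency", 100.0),
        ("acc1", "stock", 250.0),
        ("acc2", "currency", 40.0),
    ],
)

# Each CASE picks out the balance for one category; SUM collapses the
# rows of an account into a single row with one column per category.
conn.execute(
    """
    CREATE TABLE table2 AS
    SELECT account,
           SUM(CASE WHEN category = 'currency' THEN balance END) AS currency,
           SUM(CASE WHEN category = 'stock' THEN balance END) AS stock
    FROM table1
    GROUP BY account
    """
)

rows = conn.execute(
    "SELECT account, currency, stock FROM table2 ORDER BY account"
).fetchall()
print(rows)
```

An account with no row for a given category ends up with NULL (None in Python) in that column, since SUM over no non-NULL values yields NULL.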

DBMS or Software for privacy-sensitive data

We have a dataset of highly privacy-sensitive personal data and want to build a database with it. The data protection department in our company doesn't like the idea that the data scientists are able to see any data specific to a person (even if anonymized). We can't preaggregate the data in the database because there are hundreds of different possible aggregations that could be interesting. Is there software or a DBMS that could ensure that users can only query …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.