Unsupervised anomaly detection for univariate high-frequency time series data?

I have a univariate time series (one value per sampling instant; sampling time: 66.66 microseconds, 151 samples per sampling window) coming from a Scala consumer. The series contains time frames, each of which is 8K (frequencies) × 151 (time samples) per 0.5 s (about 1.2288 million samples per half second). I need to find anomalies across the different rows (frequencies) and report which rows (frequencies) are anomalous, using an unsupervised learning method. Do you have an …
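A minimal unsupervised sketch, assuming each half-second frame arrives as an 8K × 151 matrix of Doubles (the function name and threshold are hypothetical): score each row (frequency) by its mean energy and flag the rows whose score sits more than three standard deviations from the cross-row mean.

def anomalousRows(frame: Array[Array[Double]], threshold: Double = 3.0): Seq[Int] = {
  // score each frequency row by its mean energy across the 151 time samples
  val scores = frame.map(row => row.map(v => v * v).sum / row.length)
  val mean   = scores.sum / scores.length
  val sigma  = math.sqrt(scores.map(s => (s - mean) * (s - mean)).sum / scores.length)
  // a row is anomalous when its score deviates by more than `threshold` sigmas
  scores.zipWithIndex.collect {
    case (score, idx) if math.abs(score - mean) > threshold * sigma => idx
  }.toSeq
}
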
Category: Data Science

Multidimensional regression in Scala

I have a queuing model written in Scala where different categories of people end up at different queues. We have a dataset mapping features to the number of people ending up at each queue, i.e. multiple inputs to multiple outputs (continuous values). I have some experience using MLlib for single-value predictions in Scala, but I can't see that multiple outputs are supported. It doesn't even look to me like MLlib has continuous-value output support as …
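Spark ML regressors predict a single continuous label, so one hedged workaround is to fit an independent model per output column. A sketch, assuming the data sits in a DataFrame with one column per feature and one per queue (all column names hypothetical):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame

def fitPerQueue(df: DataFrame, featureCols: Array[String], queueCols: Seq[String]) = {
  // pack the feature columns into the single vector column Spark ML expects
  val assembled = new VectorAssembler()
    .setInputCols(featureCols)
    .setOutputCol("features")
    .transform(df)
  // one independent regression model per output (queue) column
  queueCols.map(queue => queue -> new LinearRegression().setLabelCol(queue).fit(assembled))
}
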
Category: Data Science

Plotting libraries for Scala on Zeppelin

My main question: it looks like Zeppelin limits the display of results to 1000 rows. I know that I can change this number, but when I do, Zeppelin becomes slow. It also looks like Zeppelin's default plotting tool only plots the first 1000 results. Is there a configuration or a way to make the plotting tool plot all the data? If not, is there any equivalent of Matplotlib for Scala on Zeppelin?
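For what it's worth, a hedged pointer: in the Spark interpreter settings, the row cap that the built-in charts respect is the zeppelin.spark.maxResult property (the value below is illustrative); raising it sends more rows to the front end, which is also why rendering slows down.

zeppelin.spark.maxResult = 100000
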
Category: Data Science

Outlier Elimination in Spark With InterQuartileRange Results in Error

I have the following function that is supposed to calculate the outliers for a given dataset:

def interQuartileRangeFiltering(df: DataFrame): DataFrame = {
  @scala.annotation.tailrec
  def inner(cols: List[String], acc: DataFrame): DataFrame = cols match {
    case Nil => acc
    case column :: xs =>
      val quantiles = acc.stat.approxQuantile(column, Array(0.25, 0.75), 0.0) // TODO: values should come from config
      println(s"$column ${quantiles.size}")
      val q1 = quantiles(0)
      val q3 = quantiles(1)
      val iqr = q1 - q3
      val lowerRange = q1 - 1.5 * iqr …
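For comparison, a corrected sketch with the same structure: the interquartile range is q3 - q1 (the original computes q1 - q3, which is negative, so the resulting bounds can never contain a row), and each step should filter the accumulator before recursing. Note also that approxQuantile only works on numeric columns, a likely source of the error if the DataFrame contains string columns.

import org.apache.spark.sql.DataFrame

def interQuartileRangeFiltering(df: DataFrame): DataFrame = {
  @scala.annotation.tailrec
  def inner(cols: List[String], acc: DataFrame): DataFrame = cols match {
    case Nil => acc
    case column :: xs =>
      val quantiles  = acc.stat.approxQuantile(column, Array(0.25, 0.75), 0.0)
      val (q1, q3)   = (quantiles(0), quantiles(1))
      val iqr        = q3 - q1                        // not q1 - q3
      val lowerRange = q1 - 1.5 * iqr
      val upperRange = q3 + 1.5 * iqr
      inner(xs, acc.filter(s"$column >= $lowerRange AND $column <= $upperRange"))
  }
  inner(df.columns.toList, df) // assumes every column is numeric
}
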
Category: Data Science

Spark Dataframe APIs vs Spark SQL

I have a relatively complex query which runs against a database and contains multiple join statements, lead/lag functions, subqueries, etc. These tables are available as individual files in my object store. I am trying to run a Spark job that performs the same query. Is it advisable to convert the SQL query into Spark SQL (which I was able to do by making a few changes), or is it better to use the DataFrame API to reconstruct the query and …
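Both routes compile to the same Catalyst logical plan, so performance is normally identical and the choice comes down to readability and compile-time checking. A toy illustration of the equivalence (table and column names hypothetical, with the files registered as a temporary view):

import org.apache.spark.sql.functions.avg

// the same aggregation, expressed two ways
val bySql = spark.sql("SELECT dept, avg(salary) AS avg_salary FROM employees GROUP BY dept")
val byApi = spark.table("employees").groupBy("dept").agg(avg("salary").as("avg_salary"))
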
Category: Data Science

Find a column by name in a row in scala spark

I have a Seq[Row]. Each Row is an array of structs, and each struct has four fields: a, b, c and d, all of which are Strings. The data in a particular row looks like this: [{"a":"ahahk","b":"ridj","c":"qpsj","d":"qmdjdh"},{"a":"lyev","b":"ehsa","c":"pkeg","d":"apht"}] I want to check whether fields named 'c' and 'a' are present when I loop over the Seq of Rows. What are some possible solutions for such a scenario if I want to create a udf in spark which takes Seq[Row] and finds the presence of …
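One possible UDF, sketched under the assumption that the array elements arrive as Rows carrying their struct schema (as they do for array-of-struct columns):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// true when every struct in the array has fields named "a" and "c"
val hasAandC = udf { (rows: Seq[Row]) =>
  rows.forall { row =>
    val names = row.schema.fieldNames
    names.contains("a") && names.contains("c")
  }
}

// usage (column name hypothetical): df.withColumn("ok", hasAandC(col("structs")))
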
Category: Data Science

Issue while reading a csv file through Spark

I am trying to read a CSV file through Spark. However, one of the columns has data in the format below, and because of the commas it is being split into multiple columns. The input CSV file is comma-delimited. "[{"code": "100", "name": "CLS1", "type": "PRIMARY"}]" Could you please help me parse this column in Spark with Scala? I tried using option("escape","") and option("quoteMode","ALL"), but they didn't work as expected.
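One commonly suggested fix, sketched on the assumption that the embedded JSON field is wrapped in double quotes with its inner quotes doubled: set both the quote and the escape character to the double quote, so the reader keeps the bracketed JSON as a single field.

val df = spark.read
  .option("header", "true")       // assumption: the file has a header row
  .option("quote", "\"")
  .option("escape", "\"")
  .csv("/path/to/input.csv")      // hypothetical path
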
Category: Data Science

Has anyone succeeded in finding a good Scala/Spark kernel for Jupyter?

The ones I've tried so far: Almond: works very well for plain Scala, but you have to import dependencies, and it gets tedious after a while; unfortunately, it can't run when using Spark with YARN instead of local mode. Spylon-kernel: the kernel connects, but gets stuck in the initializing stage. Apache Toree: I would have loved this one if only it worked; lots of language support, magics, incubated by Apache. However, this kernel doesn't connect and gets stuck on the "Kernel Connecting" stage. …
Category: Data Science

Determining batch size, sending time, and memory for data sent from Scala to the ML section

I have a time series (sampling time: 66.66 microseconds, 151 samples per sampling window) in which I would like to detect anomalies. The inputs come from a Scala consumer on a message bus. I would like to know how I can determine the batch size, the sending time, and the memory needed in the Scala consumer or the ML/AI section.
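As a starting point, a back-of-envelope sizing sketch under stated assumptions (66.66 µs per sample, 8-byte Double values, a hypothetical half-second batch window; all numbers illustrative):

val samplePeriodSec = 66.66e-6
val samplesPerSec   = 1.0 / samplePeriodSec                    // ≈ 15,000 samples/s
val batchWindowSec  = 0.5                                      // hypothetical batch window
val samplesPerBatch = (samplesPerSec * batchWindowSec).toLong  // ≈ 7,500
val bytesPerBatch   = samplesPerBatch * 8L                     // 8 bytes per Double
println(s"$samplesPerBatch samples ≈ ${bytesPerBatch / 1024} KiB per batch")
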
Category: Data Science

Saving Large Spark ML Pipeline to HDFS

I'm having trouble saving a large (relative to spark.rpc.message.maxSize) Spark ML pipeline to HDFS. Specifically, when I try to save the model to HDFS, I get an error related to Spark's maximum message size:

scala> val mod = pipeline.fit(df)
mod: org.apache.spark.ml.PipelineModel = pipeline_936bcade4716

scala> mod.write.overwrite().save(modelPath.concat("model"))
18/01/08 10:00:32 WARN TaskSetManager: Stage 8 contains a task of very large size (755610 KB). The maximum recommended task size is 100 KB.
org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 2606:0 was …
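One hedged workaround is simply to raise the cap when building the session; spark.rpc.message.maxSize takes a value in MiB (default 128, maximum 2047):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pipeline-save")                     // hypothetical app name
  .config("spark.rpc.message.maxSize", "512")   // MiB
  .getOrCreate()

This only raises the ceiling; if the fitted model is inherently that large, it is worth checking which pipeline stage serializes so much state.
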
Category: Data Science

how to calculate the cosine similarity between two files?

I am using Spark and Scala to implement this. The files contain phrases or sentences, and I want to use a domain-based method to calculate the cosine similarity between tags. I convert the two files into strings and then calculate the similarity:

val lines = Source.fromURL(Source.getClass().getResource("file:///usr/local/spark/dataset/algorithm3/comedy")).mkString("\n")
val lines2 = Source.fromURL(Source.getClass().getResource("file:///usr/local/spark/dataset/algorithm3/funny")).mkString("\n")
val result = textCosine(lines, lines2)
println("The cosine similarity score: " + result)
}

def module(vec: Vector[Double]): Double = {
  math.sqrt(vec.map(math.pow(_, 2)).sum)
}

def innerProduct(v1: Vector[Double], v2: Vector[Double]): Double = {
  val listBuffer = ListBuffer[Double]()
  for (i <- 0 until v1.length; j <- 0 until v2.length; if i == j) {
    if (i == j) {
      listBuffer.append(v1(i) * v2(j))
    }
  } …
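For reference, a more direct sketch of the same computation: the nested loop with if i == j walks the diagonal in O(n²), which a zip reduces to O(n).

def cosine(v1: Vector[Double], v2: Vector[Double]): Double = {
  require(v1.length == v2.length, "vectors must have the same length")
  val dot   = v1.zip(v2).map { case (a, b) => a * b }.sum
  val norm1 = math.sqrt(v1.map(x => x * x).sum)
  val norm2 = math.sqrt(v2.map(x => x * x).sum)
  dot / (norm1 * norm2)
}
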
Category: Data Science

Data Science Tools Using Scala

I know that Spark is fully integrated with Scala; its use case is specifically for large data sets. Which other tools have good Scala support? Is Scala best suited to larger data sets, or is it also suited to smaller ones?
Category: Data Science

How to install Polynote on Windows?

I've been searching around the Internet for a while, but I have not been able to find detailed instructions on how to install Polynote (the polyglot notebook with first-class Scala support) on Windows with multiple languages mixed in, Python and Scala. GitHub link for Polynote. Official website. According to the official website: "Polynote is currently only tested on Linux and MacOS, using the Chrome browser as a client. We hope to be testing other platforms and browsers soon." Feel free to …
Category: Data Science

Scala RDD operation

I am new to Scala. I have a CSV file stored in HDFS, which I am reading in Scala using

val salesdata = sc.textFile("hdfs://localhost:9000/home/jayshree/sales.csv")

Here is a small sample of the sales data, where a is the customer id, b the transaction id, c the item id, and d the item price:

a   b    c  d
5   199  1  500
33  235  1  500
20  249  3  749
35  36   4  757
19  201  4  757
17  94   5  763
39  146  5  763
42  162  5  763
49  41   6  824 …
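A minimal parsing sketch, assuming the file is comma-separated and using a hypothetical case class for the four fields:

case class Sale(customerId: Int, transactionId: Int, itemId: Int, itemPrice: Int)

val sales = salesdata
  .filter(!_.startsWith("a"))     // drop the header row, if present
  .map(_.split(","))
  .map(f => Sale(f(0).trim.toInt, f(1).trim.toInt, f(2).trim.toInt, f(3).trim.toInt))
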
Category: Data Science

Scala vs Java if you're NOT going to use Spark?

I'm facing some indecision in choosing how to allocate my scarce learning time over the next few months between Scala and Java, and I would like help objectively understanding the practical trade-offs. The reason I am interested in Java is that I think some of my production, frequently refreshed forecasts and analyses at work would run much faster in Java (compared to R or Python), and by becoming more proficient in Java I would enable myself to work on interesting side …
Topic: java scala
Category: Data Science

Is there a way to use a pom.xml file to update spark configuration?

I am trying to update my Spark configuration to solve some dependency problems. This pom.xml file seems to be useful for that purpose. I am using a Spark Docker image. ls /spark/conf gives:

docker.properties.template
slaves.template
fairscheduler.xml.template
spark-defaults.conf.template
log4j.properties.template
spark-env.sh.template
metrics.properties.template

I've searched for pom.xml in the container using find / -name "pom.xml" and got nothing. Is there a way to use a pom.xml file to update the Spark configuration?
Category: Data Science

XGBoost not learning

I have developed a training set for XGBoost to apply a learning-to-rank function on top of, with the following parameters:

eta = 0.5
estimators = 150
max_depth = 5
objective = rank:pairwise
gamma = 1.0
eval = ndcg

and applied this function to train:

def trainSearchModel(trainingDataPath: String, modelPath: String) = {
  val trainMat: DMatrix = new DMatrix(trainingDataPath)
  val round: Int = 200
  val watches = new mutable.HashMap[String, DMatrix]
  watches += "train" -> trainMat
  watches += "test" -> …
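A hypothetical completion of the truncated call, following the XGBoost4J-Scala API with the booster parameters listed above (paths and round count mirror the question):

import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}
import scala.collection.mutable

def trainSearchModel(trainingDataPath: String, modelPath: String): Unit = {
  val trainMat = new DMatrix(trainingDataPath)
  val params: Map[String, Any] = Map(
    "eta"         -> 0.5,
    "max_depth"   -> 5,
    "objective"   -> "rank:pairwise",
    "gamma"       -> 1.0,
    "eval_metric" -> "ndcg"
  )
  val round = 200
  val watches = new mutable.HashMap[String, DMatrix]
  watches += "train" -> trainMat
  val model = XGBoost.train(trainMat, params, round, watches.toMap)
  model.saveModel(modelPath)
}
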
Category: Data Science
