Unsupervised anomaly detection for univariate high-frequency time series data?

I have a univariate time series (one value per sampling instant; sampling time: 66.66 microseconds, 151 samples per sampling window) coming from a Scala consumer. The series contains time frames, each of which is 8K (frequencies) × 151 (time samples) per 0.5 s (about 1.2288 million samples per half second). I need to find anomalies across the different rows (frequencies) and report which rows (frequencies) are anomalous, using an unsupervised learning method. Do you have an …
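A minimal unsupervised sketch, assuming each half-second frame arrives as an 8K × 151 matrix of Doubles (the function name and threshold are hypothetical): score each row (frequency) by its mean energy and flag the rows whose score sits more than three standard deviations from the cross-row mean.

def anomalousRows(frame: Array[Array[Double]], threshold: Double = 3.0): Seq[Int] = {
  // score each frequency row by its mean energy across the 151 time samples
  val scores = frame.map(row => row.map(v => v * v).sum / row.length)
  val mean   = scores.sum / scores.length
  val sigma  = math.sqrt(scores.map(s => (s - mean) * (s - mean)).sum / scores.length)
  // a row is anomalous when its score deviates by more than `threshold` sigmas
  scores.zipWithIndex.collect {
    case (score, idx) if math.abs(score - mean) > threshold * sigma => idx
  }.toSeq
}
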
Category: Data Science

Multidimensional regression in Scala

I have a queuing model written in Scala where different categories of people end up at different queues. We have a dataset mapping features to the number of people ending up at each queue, i.e. multiple inputs to multiple outputs (continuous values). I have some experience using MLlib for single-value predictions in Scala, but I can't see that multiple outputs are supported. It doesn't even look to me like MLlib has continuous-value output support as …
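Spark ML regressors predict a single continuous label, so one hedged workaround is to fit an independent model per output column. A sketch, assuming the data sits in a DataFrame with one column per feature and one per queue (all column names hypothetical):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame

def fitPerQueue(df: DataFrame, featureCols: Array[String], queueCols: Seq[String]) = {
  // pack the feature columns into the single vector column Spark ML expects
  val assembled = new VectorAssembler()
    .setInputCols(featureCols)
    .setOutputCol("features")
    .transform(df)
  // one independent regression model per output (queue) column
  queueCols.map(queue => queue -> new LinearRegression().setLabelCol(queue).fit(assembled))
}
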
Category: Data Science

Plotting libraries for Scala on Zeppelin

My main question: it looks like Zeppelin limits the display of results to 1000 rows. I know that I can change this number, but when I do, Zeppelin becomes slow. It also looks like Zeppelin's default plotting tool only plots the first 1000 results. Is there a configuration or a way to make the plotting tool plot all the data? If not, is there any equivalent of Matplotlib for Scala on Zeppelin?
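For what it's worth, a hedged pointer: in the Spark interpreter settings, the row cap that the built-in charts respect is the zeppelin.spark.maxResult property (the value below is illustrative); raising it sends more rows to the front end, which is also why rendering slows down.

zeppelin.spark.maxResult = 100000
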
Category: Data Science

Outlier Elimination in Spark With InterQuartileRange Results in Error

I have the following function that is supposed to calculate the outliers for a given dataset:

def interQuartileRangeFiltering(df: DataFrame): DataFrame = {
  @scala.annotation.tailrec
  def inner(cols: List[String], acc: DataFrame): DataFrame = cols match {
    case Nil => acc
    case column :: xs =>
      val quantiles = acc.stat.approxQuantile(column, Array(0.25, 0.75), 0.0) // TODO: values should come from config
      println(s"$column ${quantiles.size}")
      val q1 = quantiles(0)
      val q3 = quantiles(1)
      val iqr = q1 - q3
      val lowerRange = q1 - 1.5 * iqr …
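For comparison, a corrected sketch with the same structure: the interquartile range is q3 - q1 (the original computes q1 - q3, which is negative, so the resulting bounds can never contain a row), and each step should filter the accumulator before recursing. Note also that approxQuantile only works on numeric columns, a likely source of the error if the DataFrame contains string columns.

import org.apache.spark.sql.DataFrame

def interQuartileRangeFiltering(df: DataFrame): DataFrame = {
  @scala.annotation.tailrec
  def inner(cols: List[String], acc: DataFrame): DataFrame = cols match {
    case Nil => acc
    case column :: xs =>
      val quantiles  = acc.stat.approxQuantile(column, Array(0.25, 0.75), 0.0)
      val (q1, q3)   = (quantiles(0), quantiles(1))
      val iqr        = q3 - q1                        // not q1 - q3
      val lowerRange = q1 - 1.5 * iqr
      val upperRange = q3 + 1.5 * iqr
      inner(xs, acc.filter(s"$column >= $lowerRange AND $column <= $upperRange"))
  }
  inner(df.columns.toList, df) // assumes every column is numeric
}
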
Category: Data Science

Spark Dataframe APIs vs Spark SQL

I have a relatively complex query which runs against a database and contains multiple join statements, lead/lag functions, subqueries, etc. These tables are available as individual files in my object store. I am trying to run a Spark job that performs the same query. Is it advisable to convert the SQL query into Spark SQL (which I was able to do by making a few changes), or is it better to use the DataFrame API to reconstruct the query and …
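Both routes compile to the same Catalyst logical plan, so performance is normally identical and the choice comes down to readability and compile-time checking. A toy illustration of the equivalence (table and column names hypothetical, with the files registered as a temporary view):

import org.apache.spark.sql.functions.avg

// the same aggregation, expressed two ways
val bySql = spark.sql("SELECT dept, avg(salary) AS avg_salary FROM employees GROUP BY dept")
val byApi = spark.table("employees").groupBy("dept").agg(avg("salary").as("avg_salary"))
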
Category: Data Science

Find a column by name in a row in scala spark

I have a Seq[Row]. Each Row is an array of structs, and each struct has four fields: a, b, c and d, all of which are Strings. The data in a particular row looks like this: [{"a":"ahahk","b":"ridj","c":"qpsj","d":"qmdjdh"},{"a":"lyev","b":"ehsa","c":"pkeg","d":"apht"}] I want to check whether fields named 'c' and 'a' are present when I loop over the Seq of Rows. What are some possible solutions for such a scenario if I want to create a udf in spark which takes Seq[Row] and finds the presence of …
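One possible UDF, sketched under the assumption that the array elements arrive as Rows carrying their struct schema (as they do for array-of-struct columns):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// true when every struct in the array has fields named "a" and "c"
val hasAandC = udf { (rows: Seq[Row]) =>
  rows.forall { row =>
    val names = row.schema.fieldNames
    names.contains("a") && names.contains("c")
  }
}

// usage (column name hypothetical): df.withColumn("ok", hasAandC(col("structs")))
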
Category: Data Science

Issue while reading a csv file through Spark

I am trying to read a CSV file through Spark. However, one of the columns has data in the format below, and because of the commas it is being split into multiple columns. The input CSV file is comma-delimited. "[{"code": "100", "name": "CLS1", "type": "PRIMARY"}]" Could you please help me parse this column in Spark with Scala? I tried using option("escape","") and option("quoteMode","ALL"), but they didn't work as expected.
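One commonly suggested fix, sketched on the assumption that the embedded JSON field is wrapped in double quotes with its inner quotes doubled: set both the quote and the escape character to the double quote, so the reader keeps the bracketed JSON as a single field.

val df = spark.read
  .option("header", "true")       // assumption: the file has a header row
  .option("quote", "\"")
  .option("escape", "\"")
  .csv("/path/to/input.csv")      // hypothetical path
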
Category: Data Science

Has anyone succeeded in finding a good Scala/Spark kernel for Jupyter?

The ones I've tried so far: Almond: works very well for plain Scala, but you have to import dependencies, and it gets tedious after a while; unfortunately, it can't run when using Spark with YARN instead of local mode. Spylon-kernel: the kernel connects, but gets stuck in the initializing stage. Apache Toree: I would have loved this one if only it worked; lots of language support, magics, incubated by Apache. However, this kernel doesn't connect and gets stuck on the "Kernel Connecting" stage. …
Category: Data Science

Determining batch size, sending time, and memory for data sent from Scala to the ML section

I have a time series (sampling time: 66.66 microseconds, 151 samples per sampling window) in which I would like to detect anomalies. The inputs come from a Scala consumer on a message bus. I would like to know how I can determine the batch size, the sending time, and the memory needed in the Scala consumer or the ML/AI section.
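As a starting point, a back-of-envelope sizing sketch under stated assumptions (66.66 µs per sample, 8-byte Double values, a hypothetical half-second batch window; all numbers illustrative):

val samplePeriodSec = 66.66e-6
val samplesPerSec   = 1.0 / samplePeriodSec                    // ≈ 15,000 samples/s
val batchWindowSec  = 0.5                                      // hypothetical batch window
val samplesPerBatch = (samplesPerSec * batchWindowSec).toLong  // ≈ 7,500
val bytesPerBatch   = samplesPerBatch * 8L                     // 8 bytes per Double
println(s"$samplesPerBatch samples ≈ ${bytesPerBatch / 1024} KiB per batch")
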
Category: Data Science

Saving Large Spark ML Pipeline to HDFS

I'm having trouble saving a large (relative to spark.rpc.message.maxSize) Spark ML pipeline to HDFS. Specifically, when I try to save the model to HDFS, I get an error related to Spark's maximum message size:

scala> val mod = pipeline.fit(df)
mod: org.apache.spark.ml.PipelineModel = pipeline_936bcade4716

scala> mod.write.overwrite().save(modelPath.concat("model"))
18/01/08 10:00:32 WARN TaskSetManager: Stage 8 contains a task of very large size (755610 KB). The maximum recommended task size is 100 KB.
org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 2606:0 was …
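One hedged workaround is simply to raise the cap when building the session; spark.rpc.message.maxSize takes a value in MiB (default 128, maximum 2047):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pipeline-save")                     // hypothetical app name
  .config("spark.rpc.message.maxSize", "512")   // MiB
  .getOrCreate()

This only raises the ceiling; if the fitted model is inherently that large, it is worth checking which pipeline stage serializes so much state.
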
Category: Data Science

how to calculate the cosine similarity between two files?

I am using Spark and Scala to implement this. The files contain phrases or sentences, and I want to use a domain-based method to calculate the cosine similarity between tags. I convert the two files into strings and then calculate the similarity:

val lines = Source.fromURL(Source.getClass().getResource("file:///usr/local/spark/dataset/algorithm3/comedy")).mkString("\n")
val lines2 = Source.fromURL(Source.getClass().getResource("file:///usr/local/spark/dataset/algorithm3/funny")).mkString("\n")
val result = textCosine(lines, lines2)
println("The cosine similarity score: " + result)
}

def module(vec: Vector[Double]): Double = {
  math.sqrt(vec.map(math.pow(_, 2)).sum)
}

def innerProduct(v1: Vector[Double], v2: Vector[Double]): Double = {
  val listBuffer = ListBuffer[Double]()
  for (i <- 0 until v1.length; j <- 0 until v2.length; if i == j) {
    if (i == j) {
      listBuffer.append(v1(i) * v2(j))
    }
  } …
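For reference, a more direct sketch of the same computation: the nested loop with if i == j walks the diagonal in O(n²), which a zip reduces to O(n).

def cosine(v1: Vector[Double], v2: Vector[Double]): Double = {
  require(v1.length == v2.length, "vectors must have the same length")
  val dot   = v1.zip(v2).map { case (a, b) => a * b }.sum
  val norm1 = math.sqrt(v1.map(x => x * x).sum)
  val norm2 = math.sqrt(v2.map(x => x * x).sum)
  dot / (norm1 * norm2)
}
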
Category: Data Science

Data Science Tools Using Scala

I know that Spark is fully integrated with Scala; its use case is specifically for large data sets. Which other tools have good Scala support? Is Scala best suited to larger data sets, or is it also suited to smaller ones?
Category: Data Science

How to install Polynote on Windows?

I've been searching around the Internet for a while, but I have not been able to find detailed instructions on how to install Polynote (the polyglot notebook with first-class Scala support) on Windows with multiple languages mixed in, Python and Scala. GitHub link for Polynote. Official website. According to the official website: "Polynote is currently only tested on Linux and MacOS, using the Chrome browser as a client. We hope to be testing other platforms and browsers soon." Feel free to …
Category: Data Science

Scala RDD operation

I am new to Scala. I have a CSV file stored in HDFS, which I am reading in Scala using

val salesdata = sc.textFile("hdfs://localhost:9000/home/jayshree/sales.csv")

Here is a small sample of the sales data, where a is the customer id, b the transaction id, c the item id, and d the item price:

a   b    c  d
5   199  1  500
33  235  1  500
20  249  3  749
35  36   4  757
19  201  4  757
17  94   5  763
39  146  5  763
42  162  5  763
49  41   6  824 …
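A minimal parsing sketch, assuming the file is comma-separated and using a hypothetical case class for the four fields:

case class Sale(customerId: Int, transactionId: Int, itemId: Int, itemPrice: Int)

val sales = salesdata
  .filter(!_.startsWith("a"))     // drop the header row, if present
  .map(_.split(","))
  .map(f => Sale(f(0).trim.toInt, f(1).trim.toInt, f(2).trim.toInt, f(3).trim.toInt))
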
Category: Data Science

Scala vs Java if you're NOT going to use Spark?

I'm facing some indecision in choosing how to allocate my scarce learning time over the next few months between Scala and Java, and I would like help objectively understanding the practical trade-offs. The reason I am interested in Java is that I think some of my production, frequently refreshed forecasts and analyses at work would run much faster in Java (compared to R or Python), and by becoming more proficient in Java I would enable myself to work on interesting side …
Topic: java scala
Category: Data Science

Is there a way to use a pom.xml file to update spark configuration?

I am trying to update my Spark configuration to solve some dependency problems. This pom.xml file seems to be useful for that purpose. I am using a Spark Docker image. ls /spark/conf gives:

docker.properties.template
slaves.template
fairscheduler.xml.template
spark-defaults.conf.template
log4j.properties.template
spark-env.sh.template
metrics.properties.template

I've searched for pom.xml in the container using find / -name "pom.xml" and got nothing. Is there a way to use a pom.xml file to update the Spark configuration?
Category: Data Science

XGBoost not learning

I have developed a training set for XGBoost to apply a learning-to-rank function on top of, with the following parameters:

eta = 0.5
estimators = 150
max_depth = 5
objective = rank:pairwise
gamma = 1.0
eval = ndcg

and applied this function to train:

def trainSearchModel(trainingDataPath: String, modelPath: String) = {
  val trainMat: DMatrix = new DMatrix(trainingDataPath)
  val round: Int = 200
  val watches = new mutable.HashMap[String, DMatrix]
  watches += "train" -> trainMat
  watches += "test" -> …
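A hypothetical completion of the truncated call, following the XGBoost4J-Scala API with the booster parameters listed above (paths and round count mirror the question):

import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}
import scala.collection.mutable

def trainSearchModel(trainingDataPath: String, modelPath: String): Unit = {
  val trainMat = new DMatrix(trainingDataPath)
  val params: Map[String, Any] = Map(
    "eta"         -> 0.5,
    "max_depth"   -> 5,
    "objective"   -> "rank:pairwise",
    "gamma"       -> 1.0,
    "eval_metric" -> "ndcg"
  )
  val round = 200
  val watches = new mutable.HashMap[String, DMatrix]
  watches += "train" -> trainMat
  val model = XGBoost.train(trainMat, params, round, watches.toMap)
  model.saveModel(modelPath)
}
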
Category: Data Science
