Suppose there is a state S with two transitions under action A, but both resulting states are S'. The tricky part is that the two rewards are different. In this case, how should I construct the probability and reward matrices?
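One common convention (not stated in the question, so treat $p_1, p_2, r_1, r_2$ below as hypothetical labels for the two transitions' probabilities and rewards) is to merge the duplicate $(S, A, S')$ entries: the probability matrix gets the summed probability, and the reward matrix gets the probability-weighted average reward,

$$P(S' \mid S, A) = p_1 + p_2, \qquad R(S, A, S') = \frac{p_1 r_1 + p_2 r_2}{p_1 + p_2}.$$

This keeps the expected immediate reward of the pair $(S, A)$ unchanged.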
Say we've previously used a neural network or some other classifier C with $N$ training samples $I := \{I_1, \dots, I_N\}$ (which have a sequence or context, but it is ignored by C), belonging to $K$ classes. Assume, for some reason (probably some problem with training or with how the classes were declared), C is confused and doesn't perform well. The way we assign a class to each test sample $I$ using C is: $\mathrm{class}(I) := \arg\max_{1 \leq j \leq K} p_j(I)$, where $p_j(I)$ is the …
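As a minimal illustration of that decision rule (the probability array below is made up; the only real part is the $\arg\max$ over the $K$ class probabilities):

    import numpy as np

    # hypothetical class probabilities p_j(I) from classifier C for 5 samples and K = 3 classes
    p = np.array([[0.2, 0.5, 0.3],
                  [0.7, 0.1, 0.2],
                  [0.1, 0.1, 0.8],
                  [0.4, 0.4, 0.2],
                  [0.3, 0.3, 0.4]])

    # class(I) := argmax_j p_j(I), i.e. pick the most probable class for each sample
    predicted_class = np.argmax(p, axis=1)
    print(predicted_class)  # [1 0 2 0 2]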
I have some time-series data, which I need to use to predict a continuous value for a given time-stamp. I was initially doing it using a Multivariate Regression Model, but I later figured that a time-series-based problem could be better solved using Hidden Markov Models. The dataset consists of a time-stamp label, around 30 features collected from IoT sensors, and one target, which is a continuous variable. The problem is how do I determine the …
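If you do try the HMM route, a common starting point is a Gaussian HMM fitted on the sensor columns. This is only a sketch under assumptions not in the question: it uses the hmmlearn package, random placeholder data in place of your 30 features, and a guessed number of hidden states.

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    # X: (n_timestamps, n_features) array built from the ~30 sensor columns (placeholder data here)
    X = np.random.rand(500, 30)

    # the number of hidden states is a modeling choice; compare log-likelihood/BIC across values
    model = GaussianHMM(n_components=4, covariance_type="diag", n_iter=100)
    model.fit(X)

    hidden_states = model.predict(X)  # most likely hidden-state sequence (Viterbi)
    print(model.score(X))             # log-likelihood of the observations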
Ok, what is wrong with my code? I am trying to calculate transition probabilities for each leg. The code works for a small array, but for the actual dataset I get a memory error. I have the 64-bit version of Python and have maximized the memory usage, so I believe I need help writing the code more efficiently.

    import numpy as np
    # sequence with 3 states -> 0, 1, 2
    arr = [0, 1, 0, 0, 0, 2, 2, 1, 1, 1, 0, 0, 0, 0, …
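Since the snippet is cut off, here is only a hedged sketch of a memory-light way to get the same transition probabilities: count consecutive pairs with np.add.at into a small K x K matrix instead of building any large intermediate arrays (the arr values are the ones shown above; n_states is assumed to be 3).

    import numpy as np

    arr = np.array([0, 1, 0, 0, 0, 2, 2, 1, 1, 1, 0, 0, 0, 0], dtype=np.int64)
    n_states = 3

    # count transitions arr[i] -> arr[i+1] into a K x K matrix, no big temporaries
    counts = np.zeros((n_states, n_states), dtype=np.int64)
    np.add.at(counts, (arr[:-1], arr[1:]), 1)

    # row-normalize to get transition probabilities (rows with no outgoing transitions stay 0)
    row_sums = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, row_sums, out=np.zeros_like(counts, dtype=float), where=row_sums > 0)
    print(probs)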
What is the difference between Reinforcement Learning (RL) and Supervised Learning? Does RL have more difficulty in finding a stable solution? Does Q-learning have more difficulty in finding a stable solution? Does getting stuck in a local minimum happen more in supervised learning? Is this figure correct in saying that Supervised Learning is part of RL?
I have data on each page visited by a customer in a session; my objective is to find the optimal path, i.e. the one with the maximum conversion rate. My idea is to use a Markov chain to identify it, and probably a mixture of Markov models to avoid bias towards any particular set of customers. Please let me know if I am heading in the wrong direction.
I have sequential data from time T1 to T6. The rows contain the sequence of states for 50 customers. There are only 3 states in my data. For example, it looks like this:

            T1  T2  T3  T4  T5  T6
    Cust1   C   B   C   A   A   C

My transition matrix X looks like this:

         A    B    C
    A    0.3  0.6  0.1
    B    0.5  0.2  0.3
    C    0.7  0.1  0.2

Now, we see that at time T6 the state is at …
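Since the question is cut off, here is just a sketch of one way to use X for the next step: treat the row of the current state as the distribution over the state at T7 (the matrix values are the ones from the question; the assumption is that rows are the current state and columns the next state, in the order A, B, C).

    import numpy as np

    # transition matrix X from the question (rows = current state, columns = next state; order A, B, C)
    X = np.array([[0.3, 0.6, 0.1],
                  [0.5, 0.2, 0.3],
                  [0.7, 0.1, 0.2]])

    # Cust1 is in state C at T6, i.e. all probability mass is currently on C
    current = np.array([0.0, 0.0, 1.0])

    # one-step-ahead distribution over A, B, C at T7
    next_dist = current @ X
    print(next_dist)                     # [0.7 0.1 0.2]
    print("ABC"[np.argmax(next_dist)])   # most likely next state: A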
Right now I am trying to better understand how Bayesian modeling works, starting with just the basics. I found through reading tutorials that some very basic Bayesian models, like Bayesian Hierarchical Modeling, use something called the "Gibbs sampling algorithm", which is a Markov Chain Monte Carlo method. I know that, if I am going to do anything with Markov chains, then I have to test whether the data or a parameter violates the memorylessness assumption. However, I am uncertain what exactly …
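As a toy illustration of what Gibbs sampling does (not taken from any particular tutorial; the target here is a standard bivariate normal and the correlation rho is a made-up value), each step draws one coordinate from its conditional distribution given the other:

    import numpy as np

    rho = 0.8           # correlation of the target bivariate normal (illustrative value)
    n_samples = 5000
    rng = np.random.default_rng(0)

    x, y = 0.0, 0.0
    samples = np.empty((n_samples, 2))
    for i in range(n_samples):
        # conditional of each coordinate given the other is N(rho * other, 1 - rho^2)
        x = rng.normal(rho * y, np.sqrt(1 - rho**2))
        y = rng.normal(rho * x, np.sqrt(1 - rho**2))
        samples[i] = x, y

    # after discarding burn-in, the empirical correlation should be close to rho
    print(np.corrcoef(samples[1000:].T)[0, 1])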
The instructions for the question: State A is absorbing. A transition to A from state 1 or 4 yields an immediate reward of 12. All other transitions incur a reward of 1. Transitions are deterministic (i.e. each action maps a state s to a unique successor state s'). For the remainder of this question, we will assume $\gamma = 1$. On this MDP, consider a policy that assigns transition probabilities as indicated in the figure below. E.g.: (move to A | currently in …
I would like to find some good courses, but also a quick response on how to model a transition matrix given the states. Imagine having 4 states and the following array: [1, 2, 4, 1, 3, 4, 2, etc.]. What calculations are possible with only an array of states? You can make the array as long as you want; I just gave a random example. Python, Excel, and blog solutions are welcome.
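As a quick response for the Python part, a minimal sketch (the array is just the example from the question, and the states are assumed to be labeled 1..4): count how often each state is followed by each other state, then normalize the rows.

    import numpy as np

    seq = np.array([1, 2, 4, 1, 3, 4, 2])   # example state sequence, states labeled 1..4
    n_states = 4

    # count how often state i is immediately followed by state j
    counts = np.zeros((n_states, n_states))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a - 1, b - 1] += 1

    # turn counts into transition probabilities by normalizing each row
    row_sums = counts.sum(axis=1, keepdims=True)
    transition_matrix = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    print(transition_matrix)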
Let's start with the following hypothetical preconditions:

- There is traffic: normal and anomalous.
- Each traffic sample contains a list of events (of variable size).
- Events happen in order; the set of possible events has ~40,000 elements.
- The solution should run on relatively small amounts of memory and processing power.

Given a traffic sample (of 1000 events max), what is the best machine learning algorithm that fits these preconditions to identify whether it's an anomaly? Given my limited knowledge in machine learning …
So what I'm looking for is the best approach to predict a future state. Say we have three states: A, B, C. I want to predict whether in the next time interval (e.g. a day or a week) the state will become C. My (historical) data looks like this:

    ID  Date        State
    1   2021-12-01  A
    1   2021-12-02  B
    1   2021-12-06  A
    1   2021-12-24  C
    2   2021-12-05  A
    2   2021-12-12  B
    2   2021-12-27  C

For a new ID, the history could look …
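One simple baseline (only a sketch, and it ignores the irregular gaps between dates) is a first-order Markov model: pair each state with the next state observed for the same ID, estimate transition probabilities, then read off the probability of moving to C from the new ID's current state. The column names match the example above.

    import pandas as pd

    # the example rows from the question
    df = pd.DataFrame({
        "ID":    [1, 1, 1, 1, 2, 2, 2],
        "Date":  pd.to_datetime(["2021-12-01", "2021-12-02", "2021-12-06", "2021-12-24",
                                 "2021-12-05", "2021-12-12", "2021-12-27"]),
        "State": ["A", "B", "A", "C", "A", "B", "C"],
    })

    # pair each state with the next state observed for the same ID, in date order
    df = df.sort_values(["ID", "Date"])
    df["NextState"] = df.groupby("ID")["State"].shift(-1)
    pairs = df.dropna(subset=["NextState"])

    # empirical transition probabilities, e.g. P(next = C | current = B)
    transitions = pd.crosstab(pairs["State"], pairs["NextState"], normalize="index")
    print(transitions)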
This is the value function expression for a stochastic policy: $\displaystyle v_{\pi}(s)=\sum_{a \in \mathcal{A}}\pi(a|s)\bigg(\mathcal{R}_s^a+\gamma \sum_{s' \in \mathcal{S}} \mathbb{P}_{ss'}^a v_{\pi}(s')\bigg) $ Question: What is the form of the value function when the policy is deterministic?
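One way to see the deterministic case: if the policy is written as a function $a = \pi(s)$, then $\pi(a|s)$ equals 1 for that single action and 0 for all others, so the outer sum collapses to a single term:

$$v_{\pi}(s) = \mathcal{R}_s^{\pi(s)} + \gamma \sum_{s' \in \mathcal{S}} \mathbb{P}_{ss'}^{\pi(s)} \, v_{\pi}(s').$$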
In Sutton & Barto's book Reinforcement Learning: An Introduction, there is the following problem: I have this question: why are the policies to be considered here deterministic?
I have seen several examples of deploying RL agents in deceptive environments or games, and the agent learns to perform its task regardless. What about the other way around? Can RL be used to create deceptive agents? An example could be asking an agent a question, "What color is this?", and it replies with a lie. I am interested in a higher level of "deception" and not a simple if-else program that doesn't tell you what you need …
For the above Markov decision process, under the given action policy $a_1$, how can I determine the value of state $s_1$ using the state-value definition $v(s)=\mathbb{E}[G_t \mid S_t=s]$, where $G_t$ is the return? Assume there is no discounting (i.e., $\gamma=1$).
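Without the figure, only the definition itself can be expanded; with $\gamma = 1$ the return is the undiscounted sum of rewards collected along the trajectory that the policy induces from $s_1$:

$$v(s_1) = \mathbb{E}\left[G_t \mid S_t = s_1\right] = \mathbb{E}\left[R_{t+1} + R_{t+2} + R_{t+3} + \dots \mid S_t = s_1\right].$$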
I am studying reinforcement learning and I am working methodically through Sutton and Barto's book plus David Silver's lectures. I have noticed a minor difference in how the Markov Decision Processes (MDPs) are defined in those two sources, that affects the formulation of the Bellman equations, and I wonder about the reasoning behind the differences and when I might choose one or the other. In Sutton and Barto, the expected reward function is written $R^a_{ss'}$, whilst in David Silver's lectures …
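Assuming the truncated contrast is with Silver's $\mathcal{R}_s^a$ notation, where the expected reward depends only on the state and action, the two conventions are related by averaging the Sutton & Barto reward over the successor state:

$$\mathcal{R}_s^a = \sum_{s' \in \mathcal{S}} \mathbb{P}_{ss'}^a \, R_{ss'}^a.$$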
I've essentially been handed a dataset of website access history and I'm trying to draw some conclusions from it. The data supplied gives me the web URL, the datetime when it was accessed, and the unique ID of the user accessing that data. This means that for a given user ID, I can see a timeline of how they went through the website and what pages they looked at. I'd quite like to try clustering these users into different …
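If the clustering idea is worth a try, one hedged starting point is to turn each user's page sequence into a fixed-length vector of page-to-page transition frequencies and cluster those vectors; everything below (column names, URLs, the choice of KMeans and 2 clusters) is a made-up illustration, not a description of your dataset.

    import pandas as pd
    from sklearn.cluster import KMeans

    # placeholder data with the three fields described: user ID, access datetime, URL
    df = pd.DataFrame({
        "user_id":  [1, 1, 1, 2, 2, 2],
        "datetime": pd.to_datetime(["2023-01-01 10:00", "2023-01-01 10:01", "2023-01-01 10:05",
                                    "2023-01-02 09:00", "2023-01-02 09:02", "2023-01-02 09:04"]),
        "url":      ["/home", "/products", "/cart", "/home", "/blog", "/home"],
    })

    # build "current page -> next page" pairs per user, in time order
    df = df.sort_values(["user_id", "datetime"])
    df["next_url"] = df.groupby("user_id")["url"].shift(-1)
    pairs = df.dropna(subset=["next_url"]).copy()
    pairs["edge"] = pairs["url"] + " -> " + pairs["next_url"]

    # one row per user, one column per observed transition, values = relative frequencies
    features = pd.crosstab(pairs["user_id"], pairs["edge"], normalize="index")

    # cluster users on their transition profiles (2 clusters purely for illustration)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
    print(dict(zip(features.index, labels)))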