why there are two calls to the policy, also where is the non intrinsic characteristic of intrinsic reward? #13

Open
mehdimashayekhi opened this issue Jan 10, 2019 · 1 comment


mehdimashayekhi commented Jan 10, 2019

Hi, thanks for sharing. Could you explain why we need two calls to apply_policy in cnn_gru_policy_dynamics.py? The first is here:

self.apply_policy(self.ph_ob[None][:,:-1],
and the second is here:
self.apply_policy(self.ph_ob[None],

I also have another question. According to the paper, the intrinsic reward should be non-episodic, while the extrinsic reward is treated as episodic. I couldn't find where this non-episodic characteristic of the intrinsic reward is handled in the implementation. Shouldn't we also add the episodic reward (i.e., eprews) to the external reward (i.e., rews_ext)?

eprews = MPI.COMM_WORLD.allgather(np.mean(list(self.I.statlists["eprew"])))

I would really appreciate your response.

mehdimashayekhi changed the title from "apply policy call" to "why there is two call to the policy, also where is the non intrinsic characteristic of intrinsic reward?" on Jan 12, 2019
mehdimashayekhi changed the title from "why there is two call to the policy, also where is the non intrinsic characteristic of intrinsic reward?" to "why there are two calls to the policy, also where is the non intrinsic characteristic of intrinsic reward?" on Jan 12, 2019
harri-edwards (Contributor) commented

There are two graphs created for the policy / predictor, one for rollout and one for optimization. This is because at rollout time the time dimension has size 1 and is better treated separately.
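To illustrate the rollout/optimization split, here is a minimal NumPy sketch. It is not the repository's code: it assumes a plain tanh recurrent cell instead of a GRU, and the names `apply_policy`, `W_x`, `W_h`, and all shapes are hypothetical. It only shows that the same weights can be applied either one timestep at a time (rollout, time dimension of size 1) or over a whole sequence at once (optimization), and that both give the same features.

```python
import numpy as np

rng = np.random.default_rng(0)
OB_DIM, HID_DIM = 3, 4  # hypothetical sizes

# One set of weights, shared by the rollout path and the optimization path.
W_x = rng.normal(size=(OB_DIM, HID_DIM))
W_h = rng.normal(size=(HID_DIM, HID_DIM))

def apply_policy(obs, state):
    """Apply a simple tanh recurrent cell over the time axis.

    obs:   array of shape (batch, time, OB_DIM)
    state: array of shape (batch, HID_DIM)
    Returns (features of shape (batch, time, HID_DIM), final state).
    """
    outs = []
    for t in range(obs.shape[1]):
        state = np.tanh(obs[:, t] @ W_x + state @ W_h)
        outs.append(state)
    return np.stack(outs, axis=1), state

T = 5
obs = rng.normal(size=(2, T, OB_DIM))
state0 = np.zeros((2, HID_DIM))

# Rollout style: one timestep per call, carrying the recurrent state forward.
state = state0
rollout_feats = []
for t in range(T):
    feat, state = apply_policy(obs[:, t:t + 1], state)
    rollout_feats.append(feat[:, 0])
rollout_feats = np.stack(rollout_feats, axis=1)

# Optimization style: the whole sequence in a single call.
opt_feats, _ = apply_policy(obs, state0)

same = np.allclose(rollout_feats, opt_feats)
print(same)
```

In a static-graph framework the two call sites would build two subgraphs over shared variables, which is presumably why `apply_policy` appears twice.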

If you look at

self.I.buf_advs = self.int_coeff*self.I.buf_advs_int + self.ext_coeff*self.I.buf_advs_ext
you'll see the intrinsic and extrinsic advantages are combined there.
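As a rough sketch of how such a combination can give the intrinsic stream a non-episodic character, the example below computes generalized advantage estimates twice: once ignoring episode boundaries (intrinsic) and once cutting the return at them (extrinsic). This is an assumption about the mechanism, not a quote from the repository; `gae_advantages`, the `episodic` flag, the toy data, and the coefficient values are all hypothetical.

```python
import numpy as np

def gae_advantages(rews, vpreds, last_vpred, dones, gamma, lam, episodic):
    """Generalized advantage estimation over one trajectory of length T.

    dones[t] == 1 means the episode ended after step t.
    If episodic is False, episode boundaries are ignored, so value
    bootstraps across them (non-episodic returns).
    """
    T = len(rews)
    advs = np.zeros(T)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t] if episodic else 1.0
        next_v = vpreds[t + 1] if t + 1 < T else last_vpred
        delta = rews[t] + gamma * next_v * nonterminal - vpreds[t]
        lastgaelam = delta + gamma * lam * nonterminal * lastgaelam
        advs[t] = lastgaelam
    return advs

# Toy data: the episode ends after step 1.
rews = np.ones(4)
vpreds = np.zeros(4)
dones = np.array([0.0, 1.0, 0.0, 0.0])

# Intrinsic stream: non-episodic, the return flows across the boundary.
advs_int = gae_advantages(rews, vpreds, 0.0, dones, gamma=0.5, lam=1.0, episodic=False)
# Extrinsic stream: episodic, the return is cut at the boundary.
advs_ext = gae_advantages(rews, vpreds, 0.0, dones, gamma=0.5, lam=1.0, episodic=True)

int_coeff, ext_coeff = 1.0, 2.0  # hypothetical weights
advs = int_coeff * advs_int + ext_coeff * advs_ext
print(advs_int, advs_ext, advs)
```

Under this reading, the two reward streams are never summed directly. Each gets its own return and advantage computation, and only the advantages are mixed.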
