Intro

Speaker: 고한규

  • KAIST, Department of Computer Science
  • Formerly at ETRI; now at the LG Electronics AI Research Lab
  • Was our professor's boss, interviewer, and mentor at work
  • Knows nearly all of AI theory

Subject: AI and Robot Manipulation

What Is Reinforcement Learning?

  • Given a goal, the agent tries many actions at random; when an action happens to suit the goal, it learns to take that action more often (the "reinforcement")
  • Can learn without prior knowledge of the environment
  • Strong on problems that require making decisions sequentially
    • Solving a sequential decision problem means finding the optimal policy
  • Learning requires states, actions, rewards, and a policy (see the sketch below)
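
A minimal sketch of those four ingredients, assuming a made-up 1-D grid world (not from the lecture): states are positions, actions move left or right, reward appears only at the goal, and the learned Q-table induces the policy.

```python
import random

# Hypothetical 1-D grid world: states 0..4, goal at state 4.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]  # move left / move right

# Q-table: estimated return for each (state, action) pair.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration rate

for episode in range(500):
    s = 0
    while s != GOAL:
        # Policy: mostly greedy on Q, occasionally random (exploration).
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == GOAL else 0.0  # reward only when the goal is reached
        # Q-learning update: actions that led toward reward get reinforced.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# The learned policy: the highest-value action in each state (should be all +1).
print([max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES)])
```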

Why AI That Teaches Itself to Achieve a Goal Is the Next Big Thing

by Kathryn Hume, Matthew E. Taylor

Summary: What’s the difference between the creative power of game-playing AIs and the predictive AIs most companies seem to use? How they learn. The AIs that thrive at games like Go, creating never before seen strategies, use an approach called reinforcement learning — a mature machine learning technology that’s good at optimizing tasks in which an agent takes a series of actions over time, where each action is informed by the outcome of the previous ones, and where you can’t find a “right” answer the way you can with a prediction. It’s a powerful technology, but most companies don’t know how or when to apply it. The authors argue that reinforcement learning algorithms are good at automating and optimizing in dynamic situations with nuances that would be too hard to describe with formulas and rules.

  • Reinforcement learning is a different concept from predictive AI; it is a mature machine learning technology.
  • RL algorithms are good at automating and optimizing in dynamic situations with nuances that are too hard to capture in formulas and rules

From the AlphaGo story

  • Indeed, beyond just feeding the algorithm past examples of Go champions playing games, DeepMind developers trained AlphaGo by having it play many millions of matches against itself.
  • During these matches, the system had the chance to explore new moves and strategies, and then evaluate if they improved performance.
    • While AlphaGo played many millions of matches against itself, it had the chance to explore new moves and strategies and then evaluate whether they improved performance.
  • Through all this trial and error, it discovered a way to play the game that surprised even the best players in the world.

Continuing

  • Put simply, it works by trying different approaches and latching onto — reinforcing — the ones that seem to work better than the others.
    • RL tries different approaches and takes the better-performing ones more often
  • With enough trials, you can reinforce your way to beating your current best approach and discover a new best way to accomplish your task.
    • With enough trials, you can reinforce your way past your current best approach and discover a new best way to accomplish the task (see the bandit sketch below)
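
A minimal sketch of "latching onto what works", assuming a made-up 3-armed bandit where each arm (approach) pays off with a hidden probability; all numbers are invented for illustration.

```python
import random

# Three hypothetical "approaches", each paying off with a hidden probability.
true_payoff = [0.3, 0.5, 0.7]
counts = [0, 0, 0]           # how often each approach has been tried
estimates = [0.0, 0.0, 0.0]  # running estimate of each approach's value

for trial in range(10_000):
    # Explore a random approach 10% of the time; otherwise exploit the best estimate.
    if random.random() < 0.1:
        arm = random.randrange(3)
    else:
        arm = estimates.index(max(estimates))
    reward = 1.0 if random.random() < true_payoff[arm] else 0.0
    counts[arm] += 1
    # Incremental average: approaches that keep paying off get reinforced.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(counts)  # the best approach (index 2) ends up tried far more often
```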

Problems to Consider

  • Consider the many real-world problems that require deciding how to act over time, where there is something to maximize (or minimize), and where you’re never explicitly given the correct solution.
    • Many real-world problems require deciding how to act over time, have something to maximize (or minimize), and never hand you the correct answer explicitly

Key point!

  • Significantly, because of how they learn, they don’t need mountains of historical data — they’ll experiment and create their own data along the way.
    • Given just the situation, RL learns on its own, so it needs no historical data!?
    • RL generates actions, and through them its own data

How to Spot an Opportunity for Reinforcement Learning

  • Make a list.
    • Create an inventory of business processes that involve a sequence of steps and clearly state what you want to maximize or minimize. Focus on processes with dense, frequent actions and opportunities for feedback and avoid processes with infrequent actions and where it’s difficult to observe which worked best to collect feedback. Getting the objective right will likely require iteration.
    • Write out the business processes that involve a sequence of steps, and clearly state what you want to maximize or minimize
  • Consider other options.
    • Don’t start with reinforcement learning if you can tackle a problem with other machine learning or optimization techniques. Reinforcement learning is helpful when you lack sufficient historical data to train an algorithm. You need to explore options (and create data along the way).
    • RL is useful when historical data is lacking, so don't start with RL if other machine learning or optimization techniques can solve the problem.
  • Be careful what you wish for.
    • If you do want to move ahead, domain experts should closely collaborate with technical teams to help design the inputs, actions, and rewards. For inputs, seek the smallest set of information you could use to make a good decision. For actions, ask how much flexibility you want to give the system; start simple and later expand the range of actions. For rewards, think carefully about the outcomes — and be careful to avoid falling into the traps of considering one variable in isolation or opting for short-term gains with long-term pains.
    • Think carefully about reward outcomes: don't chase short-term gains that carry long-term pains, and don't consider a single variable in isolation (see the reward sketch after this list)
  • Ask whether it’s worth it.
    • Will the possible gains justify the costs for development? Many companies need to make digital transformation investments to have the systems and dense, data-generating business processes in place to really make reinforcement learning systems useful. To answer whether the investment will pay off, technical teams should take stock of computational resources to ensure you have the compute power required to support trials and allow the system to explore and identify the optimal sequence. (They may want to create a simulation environment to test the algorithm before releasing it live.) On the software front, if you’re planning to use a learning system for customer engagement, you need to have a system that can support A/B testing. This is critical to the learning process, as the algorithm needs to explore different options before it can latch onto which one works best. Finally, if your technology stack can only release features universally, you likely need to upgrade before you start optimizing.
    • You also need to work out whether RL will actually pay off; development is costly and compute resources matter!
  • Prepare to Be Patient.
    • And last but not least, as with many learning algorithms, you have to be open to errors early on while the system learns. It won’t find the optimal path from day one, but it will get there in time — and potentially find surprising, creative solutions beyond human imagination when it does.
    • RL will go through a lot of trial and error, so be patient even when it makes early mistakes
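
As a hedged sketch of the reward-design advice above, here is a hypothetical composite reward for a customer-engagement agent; every weight and field name is an assumption for illustration, not something from the article.

```python
# Hypothetical reward design for a customer-engagement agent. All weights
# and field names below are assumptions for illustration only.

def reward(outcome: dict) -> float:
    """Blend several outcomes instead of optimizing one variable in isolation.

    Rewarding clicks alone (one variable, short-term gain) could teach the
    agent to spam users; a heavy penalty on unsubscribes steers it back
    toward long-term value.
    """
    return (
        1.0 * outcome.get("purchase", 0)       # the long-term goal
        + 0.1 * outcome.get("click", 0)        # weak short-term signal
        - 5.0 * outcome.get("unsubscribe", 0)  # long-term pain, penalized hard
    )

print(reward({"click": 1, "unsubscribe": 1}))  # -4.9: a net-negative outcome
```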

Why RL Is Hard

  • Horribly Sample Inefficient
    • Reaching human-level performance or better takes an enormous number of samples
  • Better Solved by Other Methods
    • RL generalizes broadly, but often underperforms domain-specific techniques
    • e.g. Boston Dynamics, the robot-control company acquired by Hyundai
  • Difficult Reward Function Design
    • Hard to design a reward function that matches the final objective across diverse situations
    • Trade-offs that depend on reward density
  • Local Optima
    • The exploitation vs. exploration problem
    • Even if RL plays Atari games well, can a single RL agent solve every Atari game?

Research on Overcoming RL's Limits

  • Model-based RL
    • Learn a dynamics model and a model-predictive controller (MPC); sample efficient (see the sketch below)
  • Reward Function
    • Research on recovering a reward function from expert or human demonstrations
  • Transfer Learning
    • If a well-trained model already exists and can solve a similar problem, reuse it
  • Meta Learning
    • Learning how to learn..?
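
A rough sketch of the model-based idea named above, assuming random-shooting MPC on top of a one-step dynamics model; the point-mass dynamics and cost here are invented stand-ins for a learned model.

```python
import numpy as np

# Stand-in for a learned one-step dynamics model: predicts the next state
# from (state, action). Here it is hand-written point-mass physics.
def model(state, action):
    pos, vel = state
    vel = vel + 0.1 * action
    return np.array([pos + 0.1 * vel, vel])

def cost(state):
    return state[0] ** 2  # drive the position toward zero

# Random-shooting MPC: sample candidate action sequences, roll each one out
# through the model, and execute the first action of the cheapest sequence.
def mpc_action(state, horizon=10, n_samples=200, rng=np.random.default_rng(0)):
    best_cost, best_first = float("inf"), 0.0
    for _ in range(n_samples):
        actions = rng.uniform(-1.0, 1.0, size=horizon)
        s, total = state.copy(), 0.0
        for a in actions:
            s = model(s, a)
            total += cost(s)
        if total < best_cost:
            best_cost, best_first = total, actions[0]
    return best_first

state = np.array([1.0, 0.0])
for t in range(50):
    state = model(state, mpc_action(state))  # replan at every step
print(state)  # position should end up near zero
```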

Reinforcement-Learning-Based Robot Manipulation