
Learning and Efficiency of Outcomes in Games

December 4, 2019


>>Thank you all for coming to today's distinguished lecture. It is a week of attrition, since AAAI is happening at the same time and WSDM is happening at the same time. Eric Horvitz said AAAI; Susan Dumais said WSDM; Harry Shum said WSDM. So yeah, it's awesome that [inaudible] in Building 99 has come on down. Are there any announcements before we get started? Okay. So, it's a privilege to host Eva Tardos for
the distinguished lecture today. Eva is a professor at Cornell, and she was chair; 2010, I remember for sure. And she's won so many awards. I noticed the bio is not yet updated. She's won things like the Fulkerson Prize and the Gödel Prize, but the EATCS Award is not yet up on the website, or it's not showing up on the bio; she won the EATCS Award last year. And she's done foundational work in algorithms, most famously on network flow. The first computer science textbook I picked up was Kleinberg and Tardos. It was the first time I actually saw algorithms named after people and things like that. You know, like [inaudible] there was the Lovász number, and there was the Tardos function, and I was like, "No, I'm going to go to AI instead." And the first book I picked up in grad school was
Algorithmic Game Theory, so she's played a foundational role in that field as well. Her contributions are too numerous to list, but one thing I did realize in reading through her bio is that it's definitely incomplete, because it does not make any mention of her unwavering efforts at diversity and inclusion and STEM outreach and things like that. I think the first email I got from you was the Expanding Your Horizons email, encouraging grad students to go and teach in middle schools and things like that. So yeah, it's an honor to have Eva come here to talk today. I'll hand it over to her.

>>Thank you very
much. Thank you, Hadid. Thank you all for coming, and I hope you'll find the discussion interesting. One comment on Hadid's summary: I hope you all do work at improving diversity and outreach. I think there's a lot we all collectively can do. I do not keep this kind of thing on my CV because I think of my CV as my professional, scientific contributions. It's not because I don't think these are important. I think they're very important, and I hope you all do this, in particular the people who are not minorities or women. This is an effort we should all participate in, especially the non-minorities, and I hope you care. Anyway, I will limit my discussion to
the particular topic. So the thing I want to talk about today is how I think about learning in the context of game theory. It really comes in two parts. Are people learning in a setting where there are games, and if I can assume that they are learning, what kind of things can I say about what happens in games if people are nice enough to actually play in this relatively wise manner, where I have to be careful about what I really mean by whether it's wise to learn or not? That's what I'm going to start with. So one part is efficiency: what can I say about the outcome of a game, or the efficiency of outcomes? And the other part is learning in games. So what do I mean by games? Unfortunately, I thought
I had a slide for this, but apparently I do not. What I mean is a game of some sort where there are some players; depending on the application, a particular player can be either a person driving a car or a router routing traffic. Someone, a human or an algorithm, is making decisions as you go, one decision at a time, trying to repeatedly optimize some objective function. To make it more mathematical, I will assume that there are a couple of things that
are written on the slide. The example here is a congestion game. The assumption is that cars or [inaudible] try to follow a shortest path: they try to figure out how they get to their destination the fastest. Of course this is a game. Congestion: your speed depends on who else is going on the same path, on what the current traffic is. But there is an implicit assumption in here, which is that you're not trying to hurt other people; your goal isn't that other people don't get to their destination. You just want to get to your destination. And then one part that will certainly come in in our results, or that I think is important to think about in the context of games, is that the traffic isn't necessarily as stable as you might hope. Depending on whether the Super Bowl just ended, or some news item went up on the New York Times website, or whatever happened, traffic does change. And so, even if you are sending traffic optimally, you're going to have to learn, or adjust to a changing environment, and that's going to be an important feature as I get later into the talk.
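For a concrete feel of why congestion turns routing into a game, here is the standard two-route illustration from the routing literature (the classic Pigou example; it is not a slide from this talk): one unit of traffic chooses between a wide road with fixed delay 1 and a narrow road whose delay equals the fraction of traffic x that uses it.

```latex
% Pigou's example: selfish routing vs. optimal routing on two parallel roads.
% Selfish (Nash): the narrow road's delay x is never more than 1, so all traffic
% takes it and everyone's delay is 1.
% Optimal: split the traffic in half, average delay = (1/2)(1) + (1/2)(1/2) = 3/4.
\text{Nash cost} = 1,\qquad \text{optimal cost} = \tfrac{3}{4},\qquad
\frac{\text{Nash cost}}{\text{optimal cost}} = \tfrac{4}{3}.
```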
I will alternate between using traffic routing in my examples, or actually using auctions as my example, where by auctions I mostly mean advertisement auctions. I want to think of a setup where learning will be relevant: where we are repeating a game, such as the advertisement auction or traffic routing, and where the individual outcomes of a single play of the game are not very important. In ad auctions, the particular value of one ad is a tiny fraction of a cent. In traffic routing with cars this might not quite apply, but certainly in packet traffic, if one packet goes bad, the internet drops it and tries again; not that big a deal. If a whole stream of packets goes bad, that's kind of a problem; it's a big problem. Similarly, in ad auctions, if your whole campaign tanks, that's a problem; if one ad didn't work out, oh well, one ad doesn't matter. So that's the kind of setup where learning will make a lot of sense. And here again, the advertisers change, not very fast, but they can leave and join the system, and there is an ecosystem where things can change as we go. So, to make it
slightly mathematical, the game will be that there are actions that people can take: put up ads, choose a route to drive. Then something happens, and you repeat this action over time. And the assumption is that players somehow take the past data and figure out from the past data what to do next. So that's the model of learning. I assume that the goal is that they try to, on average, do well. Again, they're not very sensitive to outliers, but most of the time they should do well. And that's because the value or cost in each period is small, so exploration is a good way to figure out what to do. And again, the idea would be that in the beginning they don't know what's up, and then later on they know. So that's the main topic: this sort of repeated game, and what happens in repeated games. And I want to start
with something which, given that this is the AI colloquium, maybe I shouldn't spend too much time on: why do we want to do learning, and what might be the goal of learning? Learning and games is a topic combination that we weren't the first to start. Economists started it before AI was really such a dominant field. But they were asking a different question, not the question that I might want to ask. They said learning is cool because players might learn to find the Nash equilibrium, never mind what the exact model of learning is. They weren't actually thinking of the kind of game I'm thinking of, which we naturally have to play with small stakes; instead they were thinking of what they called pre-play. So, they were
thinking of a giant, enormous auction going on; think of the FCC spectrum auction that just happened. Before it happened, for the previous couple of years, there were all kinds of training exercises going on to train the bidders on how to bid in this game. So they were thinking: let's think of this training as learning. If you run the training before the real auction, then the players might, via the learning, figure out what the Nash equilibrium is. And then, when the real auction comes, they can play the Nash equilibrium. That was the goal. So they were thinking of learning just the way we're thinking of learning, but they were hoping that learning would find the Nash equilibrium, and that's basically what they hoped. And just to realize how early they started: Robinson in '51 studied
fictitious play, which means what I said there: you best respond to the past history. You watch what happened in the past, you think of the past history as a sample distribution of what other people are doing, you assume they're going to keep sampling from that very same distribution, and you just best respond to that. This is a sort of Bayesian-style learning assumption: the previous behavior is a Bayesian sample from something, and you react to that Bayesian sample. That's what Robinson was studying, and she proved (by the way, picking up on the earlier comment: she proved; Robinson was a woman) results about whether this pre-play learning leads to a Nash. That's what she was studying: when does it lead to a Nash?
And what do I mean by "lead to a Nash"? I hope that maybe it doesn't need a definition either, but just in case (oops, what did I do?), what it means is that they will find a solution that is stable; stable in the following sense: stable as a one-shot game. There is a slight awkwardness in alternating between routing and auctions as my two examples: one is an example where we are minimizing delays, some sort of cost, and the other is an example where everyone maximizes their utility. When we want to minimize delays, a Nash solution a is a vector, with a_i being the component for player i. Then this is what it means to be Nash: for every player i, if they choose to do something else instead of their own component a_i, say they choose some x instead, that would have a higher cost. That is, they don't want to do this other thing. That's what it means to be Nash. And again, the notation a_{-i} here is the (n-1)-dimensional vector with the i-th component missing, which gets replaced by x. And that's what it means to be Nash.
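In symbols, the condition on the slide reads (my restatement of the definition just given, in the cost-minimization version, with cost_i for player i's cost):

```latex
% One-shot Nash equilibrium: no player can lower their cost by a unilateral move.
\mathrm{cost}_i(a) \;\le\; \mathrm{cost}_i(x, a_{-i})
\qquad \text{for every player } i \text{ and every alternative strategy } x,
% where a_{-i} is the (n-1)-dimensional vector of everyone else's strategies,
% and (x, a_{-i}) is the profile a with the i-th component replaced by x.
```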
And what they hoped is that somehow learning will either achieve this one-shot-game Nash equilibrium, so that would be playing the equilibrium, repeatedly playing the same thing over and over again, or at least converge to it in some sort of distance, like a metric distance; I guess we would call this a no-regret style assumption. And the question was: does fictitious play, that is, best responding to the past history, lead to a Nash equilibrium? The unfortunate answer is mostly not, but Robinson proved that sometimes it does. In '51 she proved that if it's a two-person game with two strategies each, then it does. Kind of nice: a two-by-two game, matching pennies, which is, on top of it, zero-sum. It doesn't have to be zero-sum: any two-person game, or a generic two-person game with algebraically independent numbers or something, and then it does. And then in '61, ten years later, rock-paper-scissors, that is, any two-person zero-sum game. And it's kind of over here; remember that
line that says mostly not, but sometimes it does. There have been other attempts in this line of work, which I will not survey, going on even today, where the goal is still to find the Nash equilibrium, and so they come up with weirder and weirder definitions of learning where really your goal is to find a Nash equilibrium rather than to model natural behavior. That's not what I want to do. I think fictitious play is a pretty reasonable model of what people might be doing; the more modern variants are not. And I disagree with the goal. I disagree with the goal of finding Nash equilibrium: I don't know why we would want or expect players to find the Nash equilibrium, and there are a lot of troubles with it. One is that there are multiple Nashes, so why would I expect them to find one of them? If it's a multiplayer game of the kind we think about on the internet, with many, many players, it's an incredible amount of knowledge even to know who else is playing, let alone what they're thinking. And then there are
computational difficulties: if something is algorithmically hard to find, this sort of distributed learning algorithm is not going to succeed in finding it either. But further, if you look at behavior, we know it's not happening. This is from Microsoft data, which, because I don't work for Microsoft, has some secrets on the screen of the kind I don't know the answer to: the bids are normalized and I don't know the keyword. But this is advertisers' bids over seven days, for a keyword on which some six or seven people are advertising, and these are days, seven consecutive days, and this is what their bids look like. These are advertisers who frequently change their bid. There's ample evidence that they must be using some sort of online tool, because they change their bid too often for human intervention. Also, these are people who have insanely high budgets, which they practically never hit, so they're really using their budget as some sort of insurance, and the thing they play with is that number that you see on the screen. To me those numbers, especially the red one, or this guy here or that guy here, totally look like a gradient-descent-style algorithm: every now and then they adjust a little bit, like gradient descent would. They're learning; they're trying to figure out what to do. So I don't want to model these people as playing
a Nash equilibrium, for many, many reasons: because learning doesn't find the Nash, and because they don't appear to be finding a Nash; they're not that stable. Instead, what I want to think of is repeated play, where they repeatedly experiment with the game. Now, if you're from more of an econ background, you might tell me: yes, yes, yes, of course they are not stable, because this is a repeated game; they're doing something much more complicated than finding a Nash equilibrium of the repeated game, which is a very, very complicated object. But I think the pictures are not suggesting that they're doing that. That would be sort of a threat of the form: if Hadid and I are advertising for the same keyword, I tell Hadid, "Look, on even days it's my turn; you bid zero. On odd days it's your turn; I'm bidding zero." And this way, of course, we both pay low prices, and that would be kind of cool. There's no evidence in the data that they're doing this. Those strategies are very, very complicated, and I doubt they're doing it. Alternately, I could claim
changed to low or that doesn’t
seem very likely. So none of these seem
very viable assumptions. Instead, I want to assume that they’re learning
and they’re doing something like no-regret
learning or fictitious play. And there are couple
of slides here, which I’m going to flip
through really fast given that this is
the AI of the answer, maybe this is something that
all of you know very well. Instead of playing
fictitious play which is a deterministic strategy, that is you deterministically best respond to what happened, something that
works much better, at least in analyzing and maybe more likely something
similar that they’re doing, is they’re not quite best responding deterministic
doing something similar to it. For example, they
randomize between some relatively
similar playoffs. So a fictitious play
would tell you to choose the best strategy, the what’s called often
smooth fictitious play, put in a normalization
factor and somehow try to be seen similar
strategies, increased the entropy. This is also the same as what otherwise known as
multiplicative weight, and has this very nice property, what’s called no regret
or small regret property. Remember, and here is the awkwardness switch
to utilities right now. Nash equilibrium would require deleting this thing
from the slide, that the utility of
a what they played is better than the single
best thing this hindsight. This thing here says, that’s not true individually
every single time step. So in Nash equilibrium, something like these and other should delete
the sums over there. It’s not true
every single timestep, but if I add it up over time, then it’s almost true. Almost true meaning there’s
a small regret error. So let’s try to analyze
this is a little bit better. There’s a fixed. I can change my strategies
as often as I play. Every player is
allowed to do this. But compared to a fixed
strategy with hindsight, your behavior should
be reasonably good, that is the error that will be better should
be kind of small. And in particular, that rule and many other rules get through
error than to, somehow, root T. And maybe, hopefully, many of you know about
this sort of no regret learning. I guess if you don’t, let me say one more
sentence before I try to put it in context. What this means is if there is a really good strategy that’s consistently good, I’m asking. Either find it sooner
or later not too fast, not too slow that is
after not so much time. Or do something better. That is if I’m driving to work and I don’t know
the highway numbers here so let’s go California area maybe you guys
will know those numbers. If it turns out that we’re
driving in Route 101 is ideal, I don’t have to drive on 101, I can drive in 280 and
alternate between things. But if I’m consistently versed in 101 and
there’s this 101, it’s so simple I
should just do it, then I should discover it. This is what it’s asking
and it’s giving you a little error
because it might take you a while to
discover this fact which is of course natural because you’re learning
from past data only. So, that’s what
no-regret learning is. And I guess a couple
of things about no-regret learning again maybe putting in context
of where I’m heading. So, the first thing
I want to tell you is maybe one slide of game theory tells you of why
this might be a nice thing, but more importantly
are people doing this? Is the data based at that? Is this a reasonable assumption? Will I recommend people
to do no-regret learning? So asking is no-regret learning a behavioral assumption
that I want to assume? Do I want to
recommend to people? How good is this? And then, I want to also tell
you a little bit of what can I tell you about games, if I can make the assumption that people
are doing no-regret learning. So, actually maybe
I should cheat and flip through this and
come back to it later. So, let’s start with is
no-regret learning an assumption I want to make about
the data or do I like it? So, again the utility terms
I want to abstract away from the exact error term which was root T not so important whether it’s
root T or something else. If I played for
capital return T, I want the error to be kind of small compared to T.
So the way I wrote it right nice a more
generous epsilon times T. A small fraction
should come from the error and as
utilities go, time goes, the toy utility you’re
gaining is growing with time, this probably growing with time thereby should grow
a little bit slower with time, if at all at least
an epsilon fraction. Now, what are good things
about this assumption? Or you can do
no-regret learning. There are many algorithms multiplicative rate is one of them there are million of them. And if you don’t know
enough of them go ask Seb or ask many of the other
people in the audience. There are million great
algorithms, very simple ones. The behavior to me the sort of, if there is a good strategy
please notice it. It seems so natural that it’s even believable that
humans can do this, not just algorithms,
very natural. And one thing that
maybe you’ll see at the end is it’s
a very handy behavior model. It’s such a nice equation
or inequality I can use it. It’s a nice assumption that I can build a whole theory on. The same way
Nash equilibrium was a beautiful assumption and economist built
a beautiful theory on Nash equilibrium as
a behavioral assumption which sometimes they’re
questioning but nowadays whether that’s
a reasonable assumption. But there is a beautiful
theory built in it because this is
such a nice equation. The same applies to this, but I also should
confront it with data and the data
is the data set that I already showed you one being
advertisement data set. It’s coming from, I already showed you
one picture from it. It’s coming from a paper
of Denis Nekipelov and Vasilis Syrgkanis
and maybe you’ve seen some of this if Vasilis
talked about this but let me remind you that maybe
the most dominant feature of the data set. So, this is again
advertisers that I can think of as
optimizing something, they are changing
arbitrary frequently. Also can think of them as
a single parameter advertisers. The budget is so
sky high that that’s an insurance that’s not
a strategic parameter, it’s the bid that they’re
changing all the time. And I look at the regret, how much error they
have their regret. Now, this is a plot that you should definitely think about. I guess the blue bars are
probably easier to see. So, the regret relative to what they gained.
Did that make sense? If you gained $100. One dollar loss of regret is different than if
you only gained $1 and your regret is $1 loss
or so normalizing to your gain to me makes
sense as a measure. If I look at someone out here, that poor guy is doing horribly This is the $1 gain, $1 regret. Okay, $1 gain, 77 cents
regret, shitty. If I look at someone down here, the guy has
practically no-regret. These people literally have no-regret and maybe a
better way to put it, the plot is maybe broken. These guys could actually
have negative regret, that is they could
be doing better than no-regret which is
possible because they’ll have to change
the way they bid all the time and the regret measure is against a single bid
with SignSite. So these guys might actually
have negative regret here. Thirty percent of the bidders have negative
non-positive regret. So, these guys are
doing very well. I’m reasonably happy and
can say that these guys here maybe I don’t know
how far I should go, maybe till here, 40, 50 percent, they’e
doing really well. These guys are not. Now, you have to
be a little careful here or I should be
a little careful. I don’t actually know what
this guy’s values are. I infer the value
from for this plot. So, what I really know that these guys are doing
badly, that’s a fact. No matter what the value is, the regret is close to 50
percent of the utility. So these guys are
all doing very badly. These guys here might be
doing well, reasonably well. What does this tell me? It certainly
doesn’t tell me that the little or no-regret
is a good assumption. There was only 30 percent
of the people down here and 70 percent
have some regret. And if I think about it, it may suddenly makes me decide that I like this
alternate version, which I put here on
the slide and I guess we have a paper a year ago, NIPS. Working this alternate
assumption which we called approximate no-regret. Instead of comparing
utility additively I said, “Well, you know
a little bit of error.” You can be that sensitive. If you’re within 10 percent, come on that doesn’t
matter anymore.” So one way to think about
this comparison here. If I change my classically
regret is an additive measure, I just compare
the difference between my utility and the optimum
utility that’s the top line. And as I pointed out, the classical measure
gives you a routine regret after time T. If
I’m willing to give you a little bit of
multiplicative error that is I’m willing to give you
a parameter epsilon and say, “You know, within 10 percent
I don’t care so much. Please make the additive error
much as small as you can, but I give you 10 percent multiplicative error
in addition.” Then you can significantly
push down the additive error, in particular I can push
it down to a constant. So this has all kinds
of advantages, I’ll come back to
this in a second. But looking at the data, I think this is actually probably the better model: multiplicative regret, allowing a little bit of slack. The data would suggest that over 50 percent of the participants have less than 10 percent regret error, whereas if I had to go with literally zero regret, there was practically no one.
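As a rough formalization of this approximate, multiplicative notion of regret (my paraphrase of the definition being described, with epsilon as the multiplicative slack):

```latex
% Approximate (epsilon-multiplicative) no-regret for player i over T rounds:
\sum_{t=1}^{T} u_i(a^t) \;\ge\; (1-\epsilon)\,\max_{x} \sum_{t=1}^{T} u_i(x, a^t_{-i}) \;-\; \text{error},
% where, once the epsilon fraction of multiplicative slack is allowed, the
% additive error can be pushed down from O(sqrt(T)) to a constant that is
% independent of T (though it will depend on epsilon).
```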
Yeah.

>>Can't you say that here people are reaching something like a correlated equilibrium, and this definition seems almost the same as that, right?

>>Yes, I'll come back to this. Give me one second; I am coming back to the connection to correlated equilibrium when we get to the games part. I want to spend one more slide on the negatives of this issue, whether it is a good assumption or not, but I'll come back to
your point in one second. Yes, the answer is yes and I’ll come back to
you in one second. This is an assumption. I can offer you one more thing
for the assumption and then I’m going to offer you the main argument
against it. This is reacting to
what he’d asked me, try to talk about
the general picture here. This is an assumption
here additional slides. I like it too, but to prompt discussion I should be more honest that there
is the downside. It’s not clear how good
this assumption is. Yeah?>>Is it going back to this, to back chart there to show
and the square root detail, because you normalize it, it’s hard to say. The noodle is going
zero there, right? There are square root T, it’s hard to say if you allow
some kind of square root T, what would this
distribution look like?>>One very actually
coming even harder, coming more up from you, actually this question that I don’t know how to
ask on this dataset. Is Vada nurses T. What is the moment I
am looking at this? And I guess one thing we’re looking at which I
can’t show data yet, but I guess we now
have some chart, some datasets showing,
what happened, I want to distinguish between
this 50% versus that 50%. 50% that I know for sure has high regret according
to what I’m looking at. What are they doing, the next seven days
and my ingoing assumption which I
think the data shows its true that the people
who have high regret, more drastically
changed a bit on the next seven days than people who I think are doing okay. That is, they too
think they’re having a problem and so they’re
varying their bits very widely. And part of it is, maybe something changed
and they just walk up to it like this regret is
a good assumption if you think you’re against
the study distribution of some sort and if the
best response changes, you should adjust and
maybe for these guys here, something did change
and they slowly waking up to this
and effectively the T smaller than
these other guys T, but I don’t actually know and I think this new data viewer currently collecting
or more precisely. We’ll get back to dealing
with after the EC deadline. We’ll hopefully give a better
answer to this question. There are a couple other
datasets other than my own that I can rely on. There is Er’ev and Roth
in 86 in a lab experiment with two person
coordination game that I think actually it’s
pretty good support for this. I originally started
to think about some of the newer results
I’m going to tell you about. Listening to a talk at EC’14, about a lab experiment
was trying to tell us that people are actually not, no regret learners and in fact that’s what
the people does. It’s a buyer and seller game. Again it’s a human
subject experiment, the buyer and seller game, where only the
salaries are human, the buyer is automate, the game is that the buyer has a uniformly random value
between zero and 10. You’re the seller,
you’re the buyer, you’re supposed to
make him an offer. No, you’re the seller,
you’re supposed to make him an offer and if you make
an offer below his value, he’ll accept and
otherwise he doesn’t, and what offer should you make? And if you do
a little analytics, you should figure
what offer you should make. That’s not hard, but most people don’t do this and especially if they weren’t told that it is
a uniform distribution they would have to observe that it was the uniform distribution, they instead make offers
and see what happens and a better mold that fits their behavior is an incredibly
recency bias version. That is, they don’t best
respond to past history. They best respond
to what the hell happened in the last run, or last couple of runs.>>Do they know
how many buyers are going to show up or is it an infinitely.>>I don’t actually know.
I should look at the paper.>>You didn’t know
that thing against the park.>>I don’t think they knew that. No, they didn’t know
the thing against the park. They said they saw
the thing against the norm but it was
a heavily recency bias. The behavior showed
incredibly strong recency bias and I’ll come back to
that point in a minute. But I think recency bias is not crazy in
a changing environment. Recent experience is more
relevant than past one, which I think is fine and
then there is a recent goal in Nissan experiment
which actually played with in
particular ad auction as a human subject experiment and mostly they
recovered the values, they notice that buyers
that they gave or sellers that they give tiny tiny
values on experiments, so small that they really
should have dropped that. They should lose all the time. They behaved irrationally. They didn’t play no regret. Bombay, I would
explain this that this poor guys in a lab experiment were
forced to sit there for, however long the experiment was and it’s frustrating
to keep losing. They were paid to sit there, but still it’s frustrating
to keep losing. This doesn’t quite happen
in real life because people whose value is too
low for this can just drop out and
advertise elsewhere. I think that’s less
relevant for us. But to show to you
that it's not as good as I'm making it out to be, let me actually show you what happens in a very, very simple two-person game, which is almost rock-paper-scissors. This is the standard notation for rock-paper-scissors: this square here tells you that if scissors plays against rock, then scissors loses a dollar and rock gains a dollar. The game I really want, because remember I already told you that in zero-sum games things work out okay, is not a zero-sum game: it's rock-paper-scissors except that we each pay nine dollars if we actually match. As it turns out, this game has the very same Nash equilibrium, 1/3, 1/3, 1/3. That is, a third of the time we match and together pay 18 dollars. It's a good game for whoever we're paying, not for the players.
started off from the 1/3, 1/3 Equilibrium and I want
to actually depict this in the 2D plane because your strategy here
is in the 2D plane, because probabilities
should sum to one. This is one probability
one on rock, probability one on paper, probably one on scissor
and everything in between that point
here is the 1/3, 1/3, 1/3 and let’s assume you start
it off of Nash and I start off on Nash but I’m very bad at randomizing as we are
all of the rest of us. And as it turns out, a little bit too
heavily play paper. And say I’m playing
against Nikkiu. Nikkiu’s learning algorithm will discover that I’m very heavily playing paper and as
a result his playing, learning algorithm we’re very heavier pay the thing that beats paper that’s
called scissor. However then my learning
algorithm will discover that Nikkiu’s learning algorithm always the scissor
to be a scissor, I probably should create
pay rocks and cube. Pay rock and what we are going to do is actually
cycle around like this and the dynamic in
this 2D plane will come out to the cycling behavior
on the boundary. On the plus part which is cool and actually this is better
than playing the Nash. Better for the two of us because we’re not paying the nine dollars
to him Nash would. We think avoiding
the diagonal in an interesting way but in
particular we’re not playing the Nash as you guys see
and what we’re playing instead is a correlated Nash. The play does not converge. What we’re doing is
correlating our shared history. What we’re doing is
something of this form. We boost learning so we boost satisfying
this inequality that our cost or utility cost
is less than any other cost model or
a little bit of learning error. But we correlated, we’re not doing it independently,
hence it’s not a Nash. Okay. Last slides
on the model of learning and remember
the slide title is negative things
about learning. So this is kind of
a nice property, here is the reason
they’re not doing it. In order to get no-regret, I have to randomize. If I were to be deterministic, Nakeel knowing
my deterministic algorithm would beat me every single time. It would know that I
do and would do the up, do the beating thing. I have to randomize. And if you look at those plots, I said they look like gradient descent
which kind of they do, but they surely don’t look
like they’re randomizing. They’re not wiggling like crazy, that people are not randomizing. So that’s a problem. Would I actually recommend people this algorithm
to do this? So this was the question, “Are they running
no-regret algorithms?” And actually no,
they’re not running no-regret algorithms
because an algorithm to be no-regret, you
have to randomize. And they’re clearly
not randomizing. Second question, “Would
I tell them that, hey guys, you guys should
run a no-regret algorithm?” And honestly probably I
should not, for many reasons. One reason is what do you mean, no-regret against a fixed
strategy you might do better, maybe you should
try to do better. And what’s worse,
no-regret is not that bad, it sort of adopts to
other people strategies. But maybe I should do
better and I actually denote the rock-paper-scissors
example of us here exactly to
show this to you. If instead of automatically reacting to what Nikhil
is doing right now. I know he’s doing
a no-regret algorithm, I can predict what he’s
going to do tomorrow. I should react to that instead. That’s a better move than
just reacting to his past. I know his algorithm, I should they react
to his algorithm, to his next step
which I kind of know.>>I should do the same.>>Yes. Well, yeah, if we’ll get caught if
we keep people do this. So I’m not so sure
I would react to it but I want to make
a different point and that’s my coming back to being positive about
no-regret learning. I did not make
the assumption that they are running my no-regret algorithm, which they clearly are not. I didn’t recommend they running
the no-regret algorithm. I was asking, can I mold
them as achieving no-regret, and it seems like I can. They can do better
than getting no-regret and I would recommend they
do better than no-regret. But they do seem to
achieve a no-regret like condition without
necessarily running the algorithm. So what I’m going to do
in modeling this is not to recommend they run
this algorithm but instead model behavior
with this condition, which is not
the same assumption, and I think a much
better assumption than the recommendation
that they reach. Okay, so I’m going to
give you one other part. And I don’t know
how many minutes I have for one another
part but use a few, some amount of minutes. One benefit of this which I already mentioned in
this one state English but I want to give you
a little context for it, is that this no-regret
assumption is such a clean beautiful
mathematical assumption. It’s so usable. And how many things
one can derive from it? And I would say someone
trained as a mathematician, is that having a clean
mathematical model is definitely good for thinking about what follows from
this assumption. This is where I want to
go back to your point and actually I don’t know if
I can go back fast enough. There is a slide
that I skipped in the middle, exactly this one. There I wanted to pick exactly the point that
you asked me about. If I take the absolute limit, when relative regret
went down to zero because root D that is grew
more slowly than time, then there’s
absolutely no-regret. The assumption here was that the utility
of what they play, if I take the history of play as an empirical distribution, then that distribution will have the Nash like
no-regret property. It will have this property that the expected utility expectation
upwards from the history of player i is better than any fixed strategy
with hindsight. This is called
a correlated equilibrium. It’s like a Nash Equilibrium, same inequality, except
that the player correlated. And the way they correlated is that they both pulled from, both based their behavior on the same history and
they got them correlated, as we saw in the example. And going back to where I was, sorry for doing things
in the wrong order. So the sense that
they’re converging to or getting close to
a correlated equilibrium, we can ask a bunch of
interesting questions. What can we say about quality of a
correlated equilibrium? And what can we say
about the speed at which they converge to
this correlated equilibrium? And maybe one of
the most fun of, most fend off and one to most emphasize is, can
we adjust this? And the answer is we can, even if I actually don’t
keep the game stable. That is, learning is good because learners can adjust
to a changing environment. If you’re learning,
you’re not blindly playing a Nash Equilibrium and hoping for the best that
it stays the same game. You’re watching what happens and if the environment changes, you will learn to adjust. That’s what learning
is good for. Can we actually make
this mathematically precise and say learner’s adjust to the environment and
they constantly achieve, or over average time, achieve high social welfare? And the answer is yes. And in
the last, may be couple of, minutes I want to give
you a sense of how the no-regret learning actually gives you enough power to
prove these kind of things. So hopefully that’s still okay. So what is the kind of
thing I want a proof? This is something that often
goes under the name of price of anarchy and
maybe one nice simple example of myself and Tim Roughgarden is
for traffic routing. And it said that if you take
the static routing example, then the cost of the Nash Equilibrium is
no worse than the cost of an optimally designed strategy where we’re sending
twice as much traffic. You might say that that
two over there was not your favorite
degradation constant, that you have to run this
twice as much traffic. I want to point out that
this is as opposed to much, much, much worse things like, what’s usually known as
the tragedy of the commons, where the degradation is, degrades with more participants,
the worse you’d get. Price of anarchy formally
defined as follows, classically it’s about
Nash Equilibrium. They take the utility of Nash and compare it to the optimum. Then a sequence of papers originally studied
by Blom and co-authors, and then a very nice paper
with Tim Roughgarden, got generalized to
learning outcomes. What they said is, we repeated the same game
capital T times, take the utilities of
the player summed over time, and compare it T
times the optimum. At that point the game and
the players were stable, so the optimum didn’t change. And what I really would
like to do and this is the last one is I’m going to
keep changing the game too. So instead of having T
times the optimum down here, I have to change it some over the optimums of all those games. And that’s what I really
would like to shoot for. Our actual model
goes as follows, it’s the same game like we’re still
playing traffic routing every single step but
players get replaced. And I guess we call
this dynamic population. Every iteration, there is a parameter which I can tell you a little bit
more about in a sec. With some probability P, every player vanishes, and gets replaced with
an arbitrary new player. So what I mean here, because every player vanishes
with some probability P, if there are n players than a noticeable fraction
like Pn of them, lots of them, vanish
every single step. So something changes
all the time. And because the change
is arbitrary, if you stayed in the game
you have to keep watching. The situation might change
and you have to keep learning. But at the same time, because you only vanish
with probability P, any particular pair will stay long enough that
learning will be viable. And in particular, I have to keep them in long enough that in my no-regret
learning assumption, the regret error will not
dominate what they got. That is, so that their regret
goes down to zero. So I have to keep them around long enough then in expectation, they regret it or goes
down to zero. Yes?>>Things changed
dramatically if players were getting replaced
with higher probability, if they had already
incurred a very high regret? Sort of modeling this idea
of ruin, like gambling. People who are currently feeling like they’re
losing in this marketplace, they are likely to be-.>>More likely to leave. No. We would be very, like as long as, you know if anything
it would help me. It certainly won’t
hurt. It’s very tolerant of which fraction, very important that you not replace a very big fraction at a time so that most of
the people can learn. One way that the results are not as good as can be
is that I’m only looking at social
welfare which means I’m averaging
everyone’s behavior. So there could be people who are doing really
badly and my average doesn’t hurt because
there’s so many of them and you’re actually helping in that category
not hurting. So your way of replacing them
would only make it better. Problematic people would vanish.>>Maybe you also, have people leaving because
they’re winning.>>Right. That I
need to think about, that might actually hurt. Like, I have to be careful whether I can keep
the result that way. Deet is helping because his is making the problematic
people go away. Good questions. So, we’re
doing this in a class of games that team called
Smooth games that actually I’m using
my version on auctions. I won’t spend a lot of time explaining what
these games are maybe a one sentence summary is and there will be two sentence summary
between two things. One sentence summary is
that almost all the Price of Anarchy regrets
are of this form. No actually that’s the next
sentence so it’s not here, are of this class but given that I probably shouldn’t
keep talking for too long, I won’t spend too much time
explaining what the class is. I want to tell you
one line of how it gets proven which I think is important to sort of think about what is there to prove. So, all we have is the no regret assumption and
all these proofs need is, so here these guys in this particular
example of playing a sort of auction game
that they want to win one of those colors
things and maybe they have different values for
different colors and they bidding on one of them and that’s a game because
you have to decide how much to bid and which
one to bid on and, of course, as usual
with any of these, the best action might
depend on what the, how the other guys are doing, but the interesting thing in the proof that in
order to prove things, I don’t have to always
figure out what’s best, I don’t have to do what’s best. There is a particular action
that I should not regret, is the optimal solution. So, if my optimal thing is to win that red assignment over here then [inaudible] I should
not regret bidding on that guy. That is you’re not
asked to be on that guy, you’re not asked to
know what that guy is, but you’re not regretting
anything, in particular, you’ll not regret
that one either and that’s the only thing
that proof uses. That is, that there is one thing the optimum should not regret that behavior and that’s the assumption that
the proof uses, there’s lots and lots of
papers that use this kind of proof technique and I’m going to maybe go to the skipping a couple of things
I might come back to. Showing you the one
technical other thing I wanted to show you. What I want to do with this
changing population ideally, I would like you to do
this not just not regret the optimum but not
regret the optimum every single step and that’s
actually went too far. Even if I take one guy out, the optimal might
drastically change just to give you an example of
replaying a matching game. Here’s the optimum,
one guy left, of course, the optimal changed for
everyone’s cause and I’m running pass and
everything changed. But if you look careful you can design something
and I’m definitely cheating here called a Stable Optimum that is a sequence
of optimal solutions, that stay stable, that is, don’t change much
or rarely change or for any one player
change very low. I have two techniques or we have two techniques to generate such a thing one is specialized to particular
things like matching, for example, the greedy
matching is much more stable than
the optimal matching. So, running a greedy
like system is better but maybe an
even better there’s a whole technique
developed out there called Differential Privacy which
since Cynthia Dwork is a Microsoft person or used to be a Microsoft person
hopefully you guys all know about
differential privacy. If you think out
what differential privacy is trying to do, hey, it’s the same thing,
it’s a different argument. They want the solution
to be not sensitive to whether one guy
participates or not. That’s literally what I wanted too for a different reason. So, I can inherit the whole differential
privacy literature and generate theorems
of the form and maybe that there’s many many of these games we can have the Price of Anarchy error
which has to be covered. Even this changing population, I need to have
the probability not go below something
like one in lock B, this is because I want
everyone to stay around for at least log B time so
that the regret error will not dominate
the behavior in my bounds and the bounds I get is a product
of three parameters. Of course, I lose the Price
of Anarchy bounds. It’s the same Price
of Anarchy proof if there was a bound there, I’m recovering kind
of the same bound. I have two losses both of which go to one as things get better. There’s a regret error trouble, I have to accommodate
the regret error if the probability is high then
they don’t stay long enough, regret error is relatively high as I keep
people around longer, regret error goes to zero
and the second is the; how much it cost me to
have a stable solution. As the game gets bigger, a higher population
as you might know from Cynthia’s work or Cynthia work has worked with
many other people, the stable solution
get closer closer to optimum as it’s
a large population game. In a smaller population it’s harder to make it
differentially private. So again, if
the population is high and I kept people around
relatively long time then I recover the Price
of Anarchy bound otherwise there is
extra little bit of degradation around. And I guess I want
us to end with, I hope is the beginning I did convince you that learning
is actually good assumption. It is not a bad way to adapt. Is much better than Nash. If I play Nash against
a bad opponent and I will not pick on either Adat or Nikhil as playing
the bad solution. If I play Nash in
rock-paper-scissors no matter what
the opponent does, I get 0 as my payoff. If the opponents’ playing
badly I should beat him. So, it’s a better way to
adapt to the opponent, no need for the common prior and that is a tiny bit of the hours and
minutes I tried to convince you that learning
actually prepares do well in even dynamically changing environments which I
think is a nice feature. What I didn’t do and that’s
a slide I skipped through in the middle is
the open question part and maybe I should end with
that is that I did criticize learning for the kind of
thing Nikhil picked up on, that I should predict what the opponent does
and react to that. There are a few papers that
trying to say things of what happens when you
do such a prediction. At the moment they
always get beaten, that is they show
something you could do with such prediction
and then say hey but that’s happening anyhow even if they’re not
doing any prediction. So, at the moment we are not aware even though there
were attempts at this, I’m not aware of any papers that says something
better happens to either the players or the game overall if people play in a way that’s more
predicts the opponent. But I think there is a pastor and there’s
something good should happen. It’s clearly the right
thing to do and at the moment I can’t prove
anything interesting, so I’m going to stop
here and thank you.>>If you go down that road of trying to predict what
the opponent is going to play and how he’s going
to or she is going to respond to your
[inaudible] will you then have to model
the occupation time? Do you have to think
about limited resources?>>I certainly should. I mean the actual
attempts that are there over a very much
more simplistic attempts. What they say is any of these no-regret
learning algorithms and multiplicative aid, for example, but all of them they adjust the probability
distribution very slowly. That’s an interesting fact. Know this fact, that is, know that the distribution today is virtually
the same as yesterday, and as a result, ran something that they called the Optimistic Mirror Descent. Optimistic, meaning that
it’s recency biased, that this very heavily takes
into account what you did today and your proof
syncs with that. That doesn’t need,
it’s just optimistic. It doesn’t need
extra computation. It didn’t do the game that Miguel was inducing
me to think about, that if I’m going to take
advantage of what you’re doing, you should predict what
I’m doing is react to that. You just assume that we’re both running this form of
Optimistic Mirror Descent, rather than having one of us run a best response to the Optimistic
Mirror Descent, which is not itself
Optimistic Mirror Descent. So, you’re right, there’s
a computational problem here, once I take this to
the proper level, these papers do not
at the moment. Yes>>Can you connect that with the counterfactual regret
matching kinds of algorithms that,
like recently, the Poker Texas, the Heads-up Poker thing was
solved by Libratus at CMU. And they were essentially doing that kind of
reasoning and they essentially through
insane amounts of compute, and the group forced
trying to reason through all steps of
bidding and betting. And to quote, it was
a regret matching algorithm, but they had to precisely think about what would
the player’s best response be to this and [inaudible]>>Yes. So, I guess as Moore
actually points out that, an issue that I swept under the rug despite other issues
that I focused on, is that the games
I was playing here, which is both traffic routing and this bidding in ad auctions, are compared to poker, they’re a super simple game. It’s like you’re
choosing a pass, you’re choosing a bid, and this multi-line
games such as, even that particular
simple poker, they’re multi-line, the strategy space is
much, much more complex. Mainly the routing game, which has many paths, but you still choosing
edges really it’s, it’s a Shortest
Path Computation. It’s not, you’re keeping parameters on the edges,
that’s very manageable. Poker is a much, much
more complex game, and there is certainly a lot, even running these algorithms
I’m talking about will be problematic in a poker because there are
so many different strategies. They doable in pass or doable
in low-dimensional spaces. That’s a
higher-dimensional space. So, there is a
computational problem there, even with these
simple algorithms. They’re doing something
a bit more sophisticated, but the sophistication
is more coming in that they’re actually
capable of running this. Yes.>>Hey, so sorry
if I missed this, but what information is
the learning player being given? Do they get to see
all the opponent’s actions and utilities, or is that all hidden from them?>>Excellent question. Definitely, they don’t
see the opponent’s action. To get the actual bands to the extent there were bands
at all on these slides, they get information
of what would have happened if they have
done- so they need, there is the, what I
often call in learning, a full information
form feedback. That is, you don’t necessarily know what
the other guys are doing, but you know what would have
happened had you done that. That the sinking
of traffic routing, you drive on a pass, but for whatever reason, you actually get
traffic reports from everywhere including
where you were not.>>So all actions
I could have taken, I know what I will be recording?>>You know what
would have happened. This is based on traffic report, may be reasonable
on traffic routing. I’m not so sure it’s reasonable on internet traffic routing, but I guess they’re doing pics, which maybe can
approximate this. It is actually quite reasonable
in ad auction because the feedback thing they’re getting is giving
them payoff curves. So, they’re getting
this kind of, you guys, that is your company, is offering this feedback
to the bidders. So, that doesn’t say
who the opponent is, they have to go on the web. If they’re a pizza
company, honestly, probably, all of you know
the local pizza names. So they know who’s
advertising for pizza, but if not, they have to figure
it out the way you do it. Go on the web, but they do get the counterfactual effect as a form of feedback and
that’s what this assumes.>>So, they get a curve with
all the discontinuities as their bids drop below
other people’s bids, how things are suddenly changed?>>Yes. What they
get is feedback, is what would have happened
maybe not instantaneously, but aggregated over
some period and honestly, actually, I don’t know
the periods lengths. Maybe over the day or over, maybe some of you
know this here. What I’m reacting to, is you said with the drops, syncs are random enough that there aren’t
that many drops. I’ve seen those curves.
They’re pretty smooth.>>I see.>>And because it’s aggregated, I think on a day, or an hour or something. There’s some amount of aggregation on what
they’re getting.>>Okay.>>But, your company, as does Google,
feeds back curves. Curves like that and that’s
what these results made.>>Thanks.>>So the main->>Yes.>>So the way to look of the social behavior
whereas people are doing this naughty
Curve behavior. So, is there a main take-away, it’s like they converge, compared to like
doing a best response as in Nash Equilibrium
doesn’t converge? Because, the [inaudible] if you
make like these game are smooth, then they would have held it even for pure Nash
Equilibrium kind of thing. So, like what is
this, separates this? Is it just the convergence that this modeling explains
that, okay people converge? Other, like if the Nash Equilibrium wouldn’t
explain that, okay, why would people who converge
to a Nash Equilibrium?>>So, I if I think
of traffic routing. So Traffic Routing
is one of those games where best response or learning on a single game
does converge to a Nash, because it’s a potential game. So they’re the
only plus getting, their are two pluses
will be getting added. The speed of convergence, that is, what’s
the regret error? How fast are they converging? And second, and maybe
that’s the end and I really rush through it that
I can switch the player set. That I’m switching
the player set so fast, then we roughly resend players. I keep them around
for Vanin Logan time. That is almost a
constant fraction of players vanishes
every single pair of timestamps. There is no way they’re
converging to any Nash, because the Nash keeps changing. Yet, they keep
the social welfare high. Right? They said
that I can do this, they can’t converge
to Nash because the game isn’t stable enough. If one player is around
for quite a while, so, he can do something, but the Nash moved
around because I kept deleting
and adding players. So, they can’t possibly
converge to the Nash, because there isn’t
such a thing as the Nash. Every single step
the Nash is different, and yet, they can make
statements about rougher.>>But that somewhat is, I feel, okay at least the way I look at is this
somewhat different. That’s much more like
stability because, like if again it’s smooth, if, even if you don’t
converge to exact Nash, but you can converge
to approximate Nash, you will still be good. Like the price of an RT bond
will still go through, if you make this
assumption that game is a smooth game. So,
I feel like this.>>So, let me actually,
so, inverting, indeed it would go to
a Nash and would go to approximate Nash
reasonably fast, and the only plus here is, I can swap the players in an ad, which I think you need to prove to me that
the Nash moves slow enough. I don’t think so, but if
you take auction game, auction game is not
a potential game.>>I see.>>So, there they’re not
converging to a Nash, in fact there is no guarantee
they’re converging period. They can alternate the way I showed you in the rock, paper, scissor, things never converge. That’s not converging. So, when something
is not converging, then this is
a separate statement, and I guess one thing I didn’t emphasize but
maybe useful to say, that what this game does, as I did say, it converging to the set
of correlated equilibrium. Now, what do I mean?
Correlated equilibrium is a convex set, internal set. It’s a convex set with
a very high dimensional space. So, when you converge to
a convex set or any set, you can do two things. You can converge to a point in that set and if they
were to do that, which they do it in routing, they do not do it
in ad auctions, or in item auctions. Then, the thing that they
converge to is a real Nash, but they can converge to
a set in a different way. Here is my set, and here
is how converge to it. I get closer and
closer to the set maybe I never went into
it and I certainly don’t, didn’t converge in
the classical sense, I just converge to the set. I’m really close the set, I’m circling around it. That’s what’s happening
in rock, paper, scissor. In that sense, they
didn’t stabilize, they didn’t converge, but every set in this whole
correlated equilibrium, the thing they're getting close to, has good social welfare. There I prove, or the system proves, something much stronger: they're not converging, period, and yet the social welfare is good.>>I see. So, we don't know
such results for like Nash equilibrium,
you’re saying? Like if it doesn’t, okay it doesn’t maybe it
doesn’t even make sense.>>Well, it would make sense. Like I feel like I don’t
know such result because they couldn’t be true where there’s a little cheating
what I’m saying, if playing with converge
to a Nash Equilibrium, I almost would
prove that finding a Nash Equilibrium is
computationally easy. Not quite because I would
say approximating it easy and what we know right
now is finding it hard, not that approximating is hard. But sort of under
the belief that this difference is minor and really approximating
it is also hard, then that thing
should not be true. Learning should not
converge to a Nash, because finding a Nash is hard. This is also true to add
on the item auction game, finding a Nash is hard. So, this method cannot, I don’t have an example to prove to you that this
doesn’t converge to a Nash, but I know it cannot,
because this is doable and that one’s
computationally hard.>>Yes. Okay, thanks.>>Okay?
>>Thank the speaker.>>Thank you.
