All right, so feature flags. They're awesome, right? What could possibly go wrong? Some of you may remember: in November 2013 we had a bunch of new functionality we wanted to roll out for our launch event, and we said, hey, we want to keep this stuff super secret, super hidden. The keynote was at, I think, 9 or 10 a.m. that morning in New York, and we were going to flip the feature flags at 8 a.m., right beforehand. It didn't go well. We had all kinds of problems, and it took us a couple of weeks to fully root-cause everything. The worst thing was, this was on launch day, Wednesday, November the 13th. We sent the system into a tailspin because we flipped on so many different feature flags, so many new features. Clearly we hadn't understood all the interactions, and certainly not at production scale. We had all kinds of things happen; it turns out, as we dug into it, we were running out of SNAT ports in the SLB, for example. When you look at TCP/IP, you've got 16 bits for the port: you hit 64K, 65535, you're going to run out. We had all kinds of problems.

So we realized from that experience that flipping on a bunch of feature flags right before somebody gets on stage is a bad idea. Brian Harry was on the phone; he was in New York, I'm sitting at my desk. He calls me, and in so many words: what's going on? So we're trying to sort this out. Thankfully, somebody had the good foresight to put together a deck with screenshots in case things weren't working, so his manager at the time, Soma, could do his talk without the system live. So we learned: don't do that. And to make it worse, and I'll take a question here in a second, to make this worse, remember we talked about how we only had one instance at one time? This was that time. We had two things at that point: in the spring of 2013 we had factored out SPS, so we had SPS, and we had one instance of TFS. That was it. So this problem affected the whole world; the blast radius was global.
So what do we do now? We turn these things on incrementally, and at least 24 hours ahead of any event we turn all that stuff on. Now, we may hide a few last things, like a button that doesn't show up or something, but everything else is on, and of course we've solicited some of you to go through and try this stuff out. As a good counterexample showing that we learned from this: in November 2015 we rolled out the Marketplace. We asked some of you, hey, go try out the Marketplace, we're going to roll this thing out, please don't tell anybody. And you did, and that one went really, really well. We had everything turned on in that case, I think a full 48 hours before the event. We had all tested it in production, we tried it, we had other people try it, etc. Totally different experience: a major change, but executed much, much better.

Question: But wasn't ring 0 supposed to actually solve this problem, I mean, address this problem?

Wouldn't that be great? Yeah, sometimes that doesn't happen. This was a scale problem; we had all kinds of things kind of come together at once. You can go back and read about it on Brian's blog; Brian wrote an RCA on it. Basically, what Brian does when he posts RCAs on his blog is take the internal RCA, try to add enough context so it will make sense to you, and post it; otherwise it's the same thing that we have internally. So please go back and take a look at it, it's interesting. All kinds of things happened. In the interest of time, I'm going to skip it.

Question: So you've got feature flags in VSTS, then you have multiple instances and all these feature flags, and then you have test plans. How do you stitch it all together so you know what you've actually tested?

Good question.
So the teams have to think through how they need to test things. You've got to test feature flags both ways, online upgrade, and the fact that the ATs (application tiers) work with both the old DB and the new DB. Some of these things we handle centrally. For example, with deployment, and being able to test whether or not the binaries actually work with both versions of the DB schema, we have what we call AT/DT compat runs, meaning: do the VMs, ATs, and job agents work properly with the database that's old? We'll take the new binaries and run them against an old database; obviously we've tested new binaries against a new database, so that one's kind of a given. We do runs like that. But for feature flags themselves, it's really up to the teams to make sure they've tested the combinations, and they have to keep track of it. It's not done centrally.
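To make that combination tracking concrete, here's a minimal sketch of enumerating the matrix a team would have to cover. The flag names and the two-schema split are invented for illustration; this is not how VSTS actually tracks it:

```python
import itertools

# Hypothetical flags and DB schema versions; a real feature has its own.
FLAGS = ["new_dashboard", "fast_search"]
SCHEMAS = ["old_db", "new_db"]

def test_matrix():
    """Yield every (flag settings, schema) combination that needs coverage."""
    for values in itertools.product([False, True], repeat=len(FLAGS)):
        flag_state = dict(zip(FLAGS, values))
        for schema in SCHEMAS:
            yield flag_state, schema
```

Even two flags tested both ways against two schemas is already eight cases, which is why the teams have to track coverage explicitly.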
All right. So we talked about a failure case. In those days we had no notion of any kind of resiliency; resiliency meant "we'll try to fix it faster." We had none. There's going to be failure in the cloud, so how do we deal with that? There are two things that we primarily rely on so far, and it'll progress as we get further on down the road. The first one is circuit breakers, and this is something that actually originated at Netflix. Of course, everybody is familiar with Chaos Monkey and how aggressive they are at testing in production, which is something we aspire to. Manila is going to talk a bit about fault injection testing, but we're very early on in what we actually do in production. The whole goal of a circuit breaker, much like the ones you'd find in an electrical panel, is to stop the failure from cascading. You drop a hairdryer in a bathtub, and okay, the breaker pops and shuts it off, so the damage is limited to that one item and not the rest of the electrical system.

Circuit breakers help us protect against latency and concurrency. Latency meaning: if something takes too long, it's the same as it failing. If it takes you five minutes to save a work item, we're down. Technically you can do it, but you can't get your job done, so we're down. So we need protection from latency. And by the way, failing slowly is the most insidious thing. If you fail fast, you can deal with it reasonably; but if it takes you five minutes to fail, oh boy, dealing with that becomes a much more challenging problem. So I always tell people, when they think about failure, think about things that are simply too slow; they're harder to deal with than things that fail fast.
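One common way to turn a slow failure into a fast one is to cap how long you'll wait for a dependency. This is an illustrative Python sketch, not the VSTS code; the timeout value and the "degraded" fallback are made up:

```python
import concurrent.futures

def call_with_timeout(fn, timeout_seconds=2.0, fallback=lambda: "degraded"):
    """Treat 'too slow' the same as 'failed' by bounding the wait."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        return fallback()   # give up and degrade rather than fail slowly
    finally:
        pool.shutdown(wait=False)  # don't hold the caller while fn drags on
```

The point is that the caller's experience is bounded: it either gets a real answer quickly or a deliberate, fast failure it can handle.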
And then there's volume, concurrency. For SQL Azure, for other resources that we depend on, there's a limit on how many simultaneous connections you can have, for example to a SQL Azure database. If you go beyond that, problems start happening. How do we prevent issues that are just due to too many calls at once? We need to be able to shed load quickly. If something bad happens, a lot of times the load will sort of pile up. In the old days, before we had any circuit breakers, a database might have a problem. Let's say we roll out something that has a bug in it; the database is very unhealthy, the CPU's pegged. While that's happening, we're queuing up a bunch of calls in ASP.NET. Somebody figures out, oh, here's what I need to do to fix it, so we fix the problem. Then what happens next? Boom, here comes this whole wave of calls. We made it, quote, healthy, but now it's got to deal with so many simultaneous calls that it goes down again, and the cycle repeats. It's kind of a death spiral. So the whole point of circuit breakers is to shed load quickly: keep things from queuing up, limit how much can be pending in the system, in order to allow it to recover.

Another key with circuit breakers is: if you're going to fail, do you have a way to fall back and gracefully degrade? For some features that's easy; for some features that's hard. If we can't call AAD, there's no graceful way to do anything about it: you're either signed in and you're good to go, or you haven't signed in and there's nothing to do about it. But for other things you can fall back. You could, for example, decide: hey, if we can't get to the extension management system, we'll assume you have access to that extension and let you keep going, and not disrupt your work. So some things have choices, some don't.
diagram of what a circuit breaker looks
like how does it how does this work and
the key is as calls come in and it goes
through the circuit breaker normally the
circuit breaker is closed normally
things are flowing through and it's
looking at the failure rates when the
failure rates exceed some percentage in
a given window of time with a certain
volume it's gonna say oh something's
horribly wrong I'm gonna open and when
it opens it just starts failing calls
and this by the way is a blunt
instruments so you know you may have a
problem in in in in the code and that
problem might have in fact been
triggered by somebody's behavior but
we're gonna start failing all those
calls to save the system circuit
breakers are all about saving the system
to prevent the system from going down to
prevent failure from spreading through
the rest of it and more targeted version
of controlling these kinds of things
it's throttling and we'll talk about
that next but things are coming through
then things start to fail and when it
realizes it opens and it's gonna start
failing on the calls now if I fail all
the calls how do I know when to re
closed how do I know that it's safe so a
circuit breaker actually occasionally
lets something through as a test so in
you may have a thousand calls a second
let's say coming through let's imagine
and instead when it's open you may let
10 go through because you're trying to
figure out is healthy or is it not and
if that resource is some dependency of
ours the other thing that that's good
about the circuit breaker opening is we
take pressure off of that system
whatever that system is but we need to
feed a little bit of it through to find
out is it working because at some point
we need to reclose and go back to normal
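The cycle just described (closed, tripping open past a failure threshold, occasionally letting a probe through, re-closing on success) can be sketched as a toy state machine. This is purely illustrative: the names, the defaults, and the single-probe policy are my assumptions, not the actual VSTS implementation:

```python
import time

class CircuitBreaker:
    """Illustrative closed/open/half-open breaker, not production code."""
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, error_threshold=0.5, request_volume=20,
                 window_seconds=10.0, probe_interval=1.0, now=time.monotonic):
        self.error_threshold = error_threshold   # failure fraction that trips it
        self.request_volume = request_volume     # min calls before judging the window
        self.window_seconds = window_seconds     # window the rate is computed over
        self.probe_interval = probe_interval     # how often to probe while open
        self._now = now
        self.state = self.CLOSED
        self._events = []                        # (timestamp, succeeded)
        self._opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == self.OPEN:
            # Occasionally let a probe through to see if the dependency recovered.
            if self._now() - self._opened_at >= self.probe_interval:
                self.state = self.HALF_OPEN
            else:
                return fallback()
        try:
            result = fn()
        except Exception:
            self._record(False)
            return fallback()
        self._record(True)
        return result

    def _record(self, ok):
        if self.state == self.HALF_OPEN:
            # One probe decides: re-close on success, re-open on failure.
            if ok:
                self.state = self.CLOSED
                self._events.clear()
            else:
                self.state = self.OPEN
                self._opened_at = self._now()
            return
        self._events.append((self._now(), ok))
        cutoff = self._now() - self.window_seconds
        window = [e for e in self._events if e[0] >= cutoff]
        self._events = window
        if len(window) >= self.request_volume:
            failures = sum(1 for _, succeeded in window if not succeeded)
            if failures / len(window) >= self.error_threshold:
                self.state = self.OPEN
                self._opened_at = self._now()
```

With a fake clock you can watch it trip after a burst of failures, serve the fallback while open, and re-close once a probe succeeds.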
That's represented here in the state called half-open, where it lets a little bit of traffic through. Now, what does that look like in code? I literally copied this out of the code base. It's a little bit overwhelming on a slide, but it's not nearly as overwhelming as it looks: it's just a set of properties on a circuit breaker. There are defaults for all of these, and a few are particularly important. One is request volume: how much volume do I have to have coming through the circuit breaker for it to start analyzing things? Then there's my error threshold: what percentage of calls can fail and still stay closed, and at what point should it open? And then there's my time window: what window do I want to analyze this over? It could be seconds, it could be minutes, it could be any number of things, but you have to think about how the circuit breaker should analyze the calls that are coming through. And then I can go use this in my code. If I go back to the previous slide, this one was called installed extension settings, and here I'm going to make use of that. You can see it over here: this is fetch installed extensions, and it's going to look at that circuit breaker. The command sets up the circuit breaker, then I instantiate it over here and give it a fallback: if the extension mechanism fails, what should we do? This fallback method determines what actually happens when the circuit breaker opens, what responses the callers get. And this made a huge difference for us.
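A rough sketch of the shape of that pattern: per-command settings with defaults, plus a fallback, registered under a name. All of the names here (BreakerSettings, FetchInstalledExtensions, the specific default values) are invented for illustration; the real framework class is different:

```python
from dataclasses import dataclass

# Hypothetical settings mirroring the kinds of knobs on the slide.
@dataclass
class BreakerSettings:
    request_volume: int = 20        # min calls in the window before evaluating
    error_threshold_pct: int = 50   # % of failures at which the breaker opens
    time_window_seconds: int = 10   # window the failure rate is computed over

REGISTRY = {}                       # command name -> (settings, fallback)

def register(name, settings, fallback):
    REGISTRY[name] = (settings, fallback)

# e.g. if we can't reach extension management, assume the user has access
# and mark the response as degraded rather than failing the call outright.
register("FetchInstalledExtensions",
         BreakerSettings(error_threshold_pct=25),
         lambda: {"installed": True, "degraded": True})
```

The important part is that the fallback is declared alongside the breaker, so the "what happens when it opens" decision is made up front, not during an incident.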
So when we think about things like concurrency: if I get an overwhelming number of requests, based on the settings on the circuit breaker I can open the circuit breaker and say something's gone horribly wrong, the system's getting slammed, I'm going to protect the system. The best example, as I mentioned earlier, is SQL Azure: per database, there's a limit on how many connections you can have open, and we want to protect that. We also want to make sure that if something is optional versus critical, we can control that too. You can take circuit breakers and create what we might call bulkheads, where you say: okay, I'm going to allow only this many calls from this stuff that's kind of optional, so that I always make sure I've got some capacity left for connections from these critical things. Identity would certainly fall in the realm of critical. As I mentioned earlier, we invoke the fallback when there are too many requests.
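A bulkhead can be as simple as a bounded semaphore per traffic class, so optional work can never starve critical work of connections. This is an illustrative sketch, not the actual implementation, and the 30/70 split is invented:

```python
import threading

class Bulkhead:
    """Cap how many concurrent calls one class of work may have in flight."""
    def __init__(self, max_concurrent):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def try_call(self, fn, fallback):
        if not self._sem.acquire(blocking=False):
            return fallback()          # at capacity: shed this call
        try:
            return fn()
        finally:
            self._sem.release()

# e.g. out of a pool of 100 connections, optional work may use at most 30,
# so critical work (like identity) always has headroom.
optional = Bulkhead(max_concurrent=30)
critical = Bulkhead(max_concurrent=70)
```

Shedding happens immediately at the bulkhead boundary, which is exactly the "fail fast, don't queue up" behavior the circuit breaker section described.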
So let's take a look at an example. This was one where we had a slow DB at SPS. Whatever the bug was, I don't even remember at this point, but we hit the concurrency limits for one DB for two minutes. The way the circuit breaker was set up at that point, it opened once we had passed the concurrency limit of 100, and it started failing requests. Now, we don't take this lightly, because when a circuit breaker opens and starts failing requests, somebody's having a bad experience. If you're using the product, you're getting errors and you're going, but why? I'm just trying to save a work item, I'm just trying to do something, and you're getting errors because of this circuit breaker. But the system keeps working. Instead of devolving into some incident that affects, let's say, an entire partition DB with 40,000 accounts in it, we affect, let's say, a couple hundred people. It's bad for those couple hundred people, and we need to go figure out why it happened and go fix it, but it's also not a giant emergency. It becomes something that can be handled as a non-emergency, something we do normally as part of our daily work. And that's kind of the point: take the emergency out of this, keep the system healthy, don't let it go down. This is what circuit breakers do for us.
So I'm going to transition here. Oh wait, sorry, I'm going to talk about lessons learned before I transition to resource utilization. The first line here: tune in production. What does that mean? If you tell me, hey Buck, I've added a circuit breaker to my code, I'm resilient now, I'm going to look at you and go: have you tested it in production? If you say no, I don't believe you, because circuit breakers are only good if they work, and they are failure cases. Failure cases have to be tested. So we want to try this out in production, and the great thing about having these SU0 instances is we can go make them fail, we can go open them. Testing a circuit breaker has sort of two pieces. There's that set of parameters: the volume, the percentage, all that stuff I talked about earlier. There's also just what happens when you open it: did I contain the failure, did I take the system down? Because we all know if you get it wrong, you can take the system down. So we'll do things like just go open a circuit breaker on SPS SU0 or TFS SU0. We may also intentionally add a bunch of calls that intentionally fail; we go hook the code and have it just return failure every time, so that we can see whether it reacts the way we expect. The other thing that's interesting about circuit breakers is that we also have timeouts. The circuit breaker is looking at some period of time, and the timeouts might be 30 seconds, might be a hundred seconds, whatever. How do these two things interact with each other? How do the other parts of the system react when the circuit breaker opens? None of this stuff is believable to me until you've done it in production, and this is why Netflix has that whole Chaos Monkey, test-in-production mentality. It only matters if you can prove it. This also allows us to verify the fallback: I put some fallback in there, but does it really help, does it work? How are you going to know? Monitoring is a key piece.
So when you go test your fallback, remember what I said earlier on a different topic: the absence of failure is not success. When you open that circuit breaker and go test it, maybe nothing, quote, bad happens, which means your co-workers aren't complaining that they can't get their job done. That's good, but it's a very low bar. You need to go look: am I getting exceptions that I didn't expect? Do I see a spike in exceptions? If suddenly FooException starts going through the roof, okay, everybody's kind of okay, but there's something going wrong, and you need to go understand why and root-cause it. Because, again, you want to be able to depend on these in an emergency, and adversity is the wrong time to find out whether it works.

Make it easy to understand what causes a circuit breaker to trip. You say, yeah, but that's obvious; you'd be surprised. When we started using these, one of the things that took us a long time: they would open, and we'd try to figure out why, and we realized our telemetry wasn't very good at telling us what was causing it, because it's always going to be multiple layers. A circuit breaker opens, and then you've got to figure out what triggered it to open, and you've got to keep walking backwards until you find the root cause. But your very first step is: why did it open? And for a while that was hard to figure out. We've since made that much easier to do. When we introduced circuit breakers, we'd get people saying: ah, we've got a problem, circuit breakers are popping, we should close them! No, that's good: they're doing what they're supposed to, they're protecting the system. You have to go understand the root cause, the why. If circuit breakers are opening, it's always a symptom, never a cause; unless you have a bug in your circuit breaker code, and pretty soon you'll hammer that out and it won't be a problem. So the mindset shift here is: a circuit breaker being open is a problem that you need to go figure out, but the problem isn't the circuit breaker, it's whatever triggered it. Getting people to realize, hey, you've got a problem in your code, you've got to go figure it out. So at this point I'm going to transition to resource utilization.
Question: As a developer, when should I think about putting a circuit breaker in my code? Do you have some guidance, like, if you're building a service, make sure you have a circuit breaker, or certain cases, for your team?

It's a good question. Some of them people get, quote, for free: using the server framework, we have circuit breakers protecting SQL Azure, our major dependency. But the identity team, for example, had to go think through their calls to AAD and how they wanted to put circuit breakers in for that. You're really trying to think through: I'm doing something; if that something slows way down or starts to fail, what's the impact on the rest of the system? Oh, that could spread, let me put a circuit breaker in there. And then you've got to go tune it. But since circuit breakers aren't free, and this is kind of key to your question, you don't want to sprinkle them everywhere. Then you've got a different problem: they may in fact cause a problem where a problem didn't have to exist.

Question: Do you have a standard circuit breaker that everybody just uses, so it's not like anybody can just come up with their own implementation?

That's right. That dense slide I had with all the settings: there's a standard circuit breaker class as part of the framework, and everybody uses that one. And then you have to think it through. It's a lot like threat modeling for security, if you've heard of that; it's that same sort of mentality. Think about how your code could, quote, go wrong in some sense, be abused in production, and think about where you need to put circuit breakers in order to protect the system.
Question: You recommended LaunchDarkly for feature flags. Is there a reference implementation that you'd like us to use, for the FastTrack customers, for circuit breakers?

I don't know of a great one to use for .NET. The sort of canonical one is written in Java; it's called Hystrix, and it was put out by Netflix, so you can find Netflix's circuit breaker implementation up on GitHub. So certainly if you're talking to a customer who's using Java, you could go reuse that. There's also, by the way, a more complex diagram that explains it, and there are presentations by Michael Nygard that go into great detail on circuit breakers. I've got those in the notes for these slides; for anybody who's interested in really getting into circuit breakers, I recommend those. Unfortunately, and I should go look again, but certainly a couple of years ago when we started this, there wasn't a good .NET circuit breaker implementation that we could recommend.

Question: Any chance you'd open-source the one you have?

Good question, maybe. This one is not quite as intertwined as the feature flags, so maybe.
Question: How do you monitor the circuit breakers? If one opens up, how do you get that information and know which one tripped?

Ah, good question. As you might imagine, there's telemetry around this, and since there's a common class, everybody gets the circuit breaker telemetry. You name your circuit breaker, so you can tell which circuit breaker tripped in production. When circuit breakers pop, you can go look in the dashboards and see it. If circuit breakers pop repeatedly, we're actually going to send an email alert. It's not an alert that would go wake somebody out of bed, but we'll send email to whoever is on call, the DRI, the designated responsible individual, whoever is on call and therefore responsible for live site. You'll get an email that says, hey, the circuit breaker tripped, and it stayed open, let's say, for ten minutes last night. Okay, now I've got to go figure out why that was. So there are dashboards you can look at, but there are also email notifications if the problem is persistent or recurrent. Don't overlook these, because again, if they're popping, there's some reason. Sometimes it's a bug in the code; sometimes it might be somebody doing something abusive that we're just not sufficiently resilient to. Because again, circuit breakers are great, they protect the system, but they're also a blunt instrument: somebody's getting a bad experience, just not everybody. So it's important to go follow up on them.
Question: How do you manually close circuit breakers? Is it through PowerShell, a UI? How do you do that?

Yeah, good question. It's a PowerShell script, so you can go close one with a PowerShell script. And by the way, that's rare, because if it opened, it opened for a reason. There are occasions where we have chosen to close them, but it's rare. More likely, something bad happened in production and we weren't sufficiently resilient to it. But as a way to quickly mitigate, there have been times when we go manually open them, to mitigate the problem while we go figure out the root cause, get it fixed, and then put everything back to normal. So, kind of like feature flags, they do have the advantage of giving you a way to cut certain things off quickly.

Question: The patterns & practices team, in their cloud patterns, has put together a circuit breaker implementation in C#.

Okay, great, thank you for that.

Question: Similarly, there's a project called Polly, which just joined the .NET Foundation, the Polly Project. It has a pretty good implementation of circuit breaker and retry.

Oh really? That's great, perfect. A couple of recommendations there.
All right, so I'm going to move on to resource utilization next. This is also about resiliency, but it's a different form of resiliency: very much targeted resiliency, targeted controls. This is about limiting the load by identity. One of the best examples, and we get hit with this quite commonly: somebody's got a build running and it does all sorts of crazy stuff. Or, and this happens a lot internally, somebody, let's say in Windows, writes a new tool that queries the system for some piece of information, and everybody else goes, man, that's cool, I need that as well. So they start running this tool, something somebody's got running on a desktop under their desk, and they start adding all their buddies to it, to query work items or whatever they're doing, and pretty soon this tool is just making these queries, just running like mad. That one identity is consuming an outsized amount of resources, be it CPU or whatever, and we want to be able to react to that and not let it cause other people problems. As I mentioned, this is also much more fine-grained than circuit breakers. The goal of resource utilization is to target and limit the offender, not other people. As I mentioned before, there's the noisy neighbor problem, which is the common term for when you're able to use way too much of the system; in a multi-tenant system, that means your experience is coming at the expense of somebody else, and that's bad, so we want to prevent it. We also need to be able to let people know: when are you hitting the limits, when are you approaching the limits, what are the limits, how do I deal with this as a user? And we need to be flexible enough to respond to a range of issues.

So let's start with an example of what it looks like, and then I'll talk about how it works. Many of you have seen this for one reason or another; some of you work with customers who like to trip this. You can go to this page on your account and see the impact you're having. In this particular test example, if you will, this user here, Edwin, is getting throttled a lot, and what's happening, you can see it in the highlighted red box, is that it's delaying him. There are sort of two main mechanisms here: delay and blocking. The first thing you want to do is delay, because blocking can be downright cruel: if I start blocking you, everything you call just fails, and that's rough. Sometimes it's necessary, but we'd rather start with delays and see if that's enough to slow things down. And this is key to helping people understand: if you're getting throttled, why are you getting throttled? You can see in here, it'll show you the calls and give you some detail on what's going on. These are all the same command here; I had no idea what it is, somebody just generated this for me so I could get a screenshot. But if, let's say, you were calling a GET endpoint to query push history, you would see that here in that column, so you know what you're doing. So how does this work?
We've got multiple things in play, and as I mentioned earlier, resource utilization is actually quite complex. It's complex because we have a lot of dependencies. A big dependency, of course, is the database. For example, we want to be able to track database CPU time, and I'll show you how we do that; it's actually pretty cool. Then there's some window of time that you want to look at for how much a given user is consuming in your system, and as mentioned before, there are two pieces: delaying and blocking. To do this, we want to allow you as a caller to understand when you're getting close to the limits and when you're being blocked. You need to know, so that if you're writing a tool, the tool can react to it. There's a general concept in resiliency, the notion of back pressure: if I make a call to the server and the server is struggling, or whatever, and the server sends back information that lets you know, hey, the server needs you to back off, that's back pressure. It's pressure being pushed back to the client, and an intelligent client reacts to it. And again, we're talking about friendly clients here; we're not talking about abuse, this is not DDoS or anything. But if the client is well-intentioned, maybe it's something that you've written, then you can look at the headers and have the code take action: you might back off, you might pause your calls, or whatever. The other thing with throttling is that it needs to be highly configurable. One of the challenges is that we need to be able to do throttling for work item tracking and version control and release management and code search, and these things all work very differently. What's expensive for work item tracking is not necessarily expensive for version control; they're just completely different things going on, and some of that has to factor into the throttling.
To react on the client, we give you a set of response headers. You make a REST call, and we're going to let you know if you're in danger of getting throttled, or if you are being throttled; you can look at the headers and see what's going on. The server is going to tell you: hey, I delayed you this much, or, if you go beyond this limit, we're going to block you, we're going to flat-out throttle you. So: provide the client information, allow the client to react intelligently. And if you're flat-out blocked, we're going to give you a code so that you know. We'll send you a 429, and other systems do the same thing, it's not unique to us, but we'll give you a 429 so that you know it failed and why it failed: not because the request was wrong or had bad parameters, but because you were throttled.
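A well-behaved client can react to that back pressure roughly like this. Retry-After is the standard header that accompanies a 429; everything else here (the attempt count, the doubling backoff) is an illustrative choice, not the documented VSTS client behavior:

```python
import time

def call_with_backoff(do_request, max_attempts=5, sleep=time.sleep):
    """do_request returns (status, headers, body); retry politely on 429."""
    delay = 1.0
    for _ in range(max_attempts):
        status, headers, body = do_request()
        if status != 429:
            return status, body
        # The server told us to back off: honor Retry-After if present,
        # otherwise fall back to exponential backoff.
        retry_after = headers.get("Retry-After")
        wait = float(retry_after) if retry_after else delay
        sleep(wait)
        delay *= 2
    return status, body
```

This is the "intelligent client" half of the contract: the server sheds load with a 429, and the client spreads its own calls out instead of hammering the service.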
So, as I mentioned before, there are two flavors here: delaying and blocking. Delaying allows us to spread out the load. If I can simply slow it down, that might be enough. If you've got a tool that you've written, and maybe somebody's using your tool in a way you didn't expect, if we just slow it down it still succeeds; we just spread out the load. Sometimes that's enough, sometimes it's not. Sometimes enough of it piles up that we just have to start blocking. It can be a multi-threaded tool making tons of simultaneous calls, it could be any number of reasons, but at some point it gets overwhelming and we start blocking.
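That delay-then-block progression can be sketched as a two-threshold decision. The limits and the up-to-five-seconds delay curve are invented for the example; the real system computes usage very differently:

```python
def throttle_decision(usage, soft_limit, hard_limit):
    """Return (action, delay_seconds) for a caller's current usage."""
    if usage <= soft_limit:
        return "allow", 0.0
    if usage <= hard_limit:
        # Injected delay grows as usage approaches the hard limit.
        over = (usage - soft_limit) / (hard_limit - soft_limit)
        return "delay", round(over * 5.0, 2)
    return "block", 0.0   # past the hard limit: caller gets a 429
```

The key property is that most offenders are slowed, not failed; blocking is the last resort once delaying can no longer keep up.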
Now, the really interesting thing here is how you do this. It's all well and good to say delay, throttle, whatever, but how do you do it? Since so much of our stuff is dependent on SQL, this is really the key piece for us. There are other parts to it, but XEvents are the big thing. XEvents are a standard feature in SQL Azure, and from them you can determine all sorts of stuff, like keeping track of who's using CPU. Oh, okay: this particular call by this user, this command, turned around and called prc_QueryPushHistory, and that's been happening over and over for the last five minutes, and it's now consumed 90% of the CPU on this database. We need to either start inserting delays or maybe flat-out block if it's gotten too bad. This ability to accurately attribute that usage to the particular call, the particular identity that's causing it, is really key, because it lets us go after the offender and not just hit everybody with a hammer. The other thing that's really cool about XEvents is that they're very lightweight. This is a feature of SQL Azure, it's not us; you can go use it yourselves. They're very lightweight, which is key, because a lot of things that monitor for you eat some percentage of your cycles. The great thing about XEvents is they're so lightweight they don't really change anything for you; you don't have to go to the next SKU up or anything like that in SQL. They're also asynchronously collected, so they're not getting in the way of the responsiveness of user calls.
Here's a very quick diagram of roughly how it's done. The key here is SQL Azure: these databases that you see here as the cylinders are pumping the data into storage. We then take that and pump it into Azure Log Analytics. You will probably hear people refer to Kusto; that's the internal name for Azure Log Analytics, it's the same thing, Kusto is just an easier name to say. By pumping all this back into Azure Log Analytics, plumbing it back into Kusto, we can then run queries against that store to figure out interesting things about the usage in each account. So there's a delay here: when throttling kicks in, it's after a while, because all of this is going on in the background, grabbing the XEvents, grabbing other data (it's not the only piece of data), shoving it into Kusto, running queries against it. It doesn't react instantaneously, certainly not today. But this has been very valuable to us in helping make sure that every user has a good experience in a multi-tenant database.
learned out of this it's kind of
interesting, because we didn't always have throttling. As we started rolling out limits, some of you might call us and go, "Oh my gosh, you're throttling me, can you stop throttling me?" Well, okay, let's talk this through and kind of understand why. When you allow somebody to do things unlimited and then you start putting limits on, people get unhappy, and it's very reasonable: as a customer, I was paying for this and it was unlimited; now you tell me I'm paying the same amount of money and it's limited. What's going on? So trying to put this in afterwards has been an interesting challenge. It would be great to say the thing you should do is implement resource utilization from day one, along with everything else I've talked about, but it's just not practical. I do tell people today, since we have this: as you build new features (and we did this, by the way), think about the limits, think about how you need to put resource utilization in from day one, so that you don't have to come back and start negotiating with people about bringing the limits down. One of the biggest challenges with this has been with the Windows account. They've been using work item tracking for years, and as we started putting in limits to try to change and manage that load, it becomes an interesting conversation with them as our customer: hey, you're doing this, we need you to bring it down. You have to start talking to people (again, the example of somebody running a tool under the desk that's just pinging away, running a query all the time) to try to negotiate and get things back into a healthy state. Because, again, a lot of this comes back to how much we're spending and how we provide a good service for everybody at a reasonable cost.
As I mentioned, delaying is effective if it's a single thread. If somebody's going nuts and coming at you with many calls in parallel, you're going to have to block. We also need to be able to help you understand why: if I block you or delay you and you can't tell why, that's miserable, because it's a black box to you. Now, I would love to say the experience we have for this is great; we still need to improve it. I don't know if you've paid attention to the UI, but it has this notion of TSTUs, these fake units of load that we created. Nobody knows what a TSTU is, so we've got to do things to make it more understandable, but it's at least a start. You'll also notice a common theme here: tune in production. If you put in limits for resource utilization and you don't try them out in production, you may not have the limits you think. Resource utilization, circuit breakers, timeouts: all these things can interact in interesting ways. If I start throttling you in five seconds and your timeout is thirty seconds, you're eventually going to fall into blocking, because you're going to sit there and keep retrying. So these are the sorts of interesting interactions that happen in the system that you have to try out in production. Yes?
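One plausible reading of that five-second/thirty-second example can be simulated. The numbers come from the talk; the retry loop, the two-second base call time, and all the function names are illustrative, not the real service's behavior:

```python
# Illustrative numbers: the service injects a 5 s delay per throttling round
# applied to a call; the client gives up on any call after 30 s and retries.
INJECTED_DELAY = 5.0
CLIENT_TIMEOUT = 30.0
BASE_CALL_TIME = 2.0  # hypothetical un-throttled service time

def call_duration(throttle_rounds):
    """Total server-side time for one call: base work plus one injected
    delay per throttling round applied to that call."""
    return BASE_CALL_TIME + throttle_rounds * INJECTED_DELAY

def simulate(max_retries=5, throttle_rounds_per_call=7):
    """A client that retries on timeout. With 7 delay rounds the call takes
    2 + 7*5 = 37 s > 30 s, so every attempt times out: from the client's
    point of view, aggressive delaying has degraded into blocking."""
    attempts = 0
    for _ in range(max_retries):
        attempts += 1
        if call_duration(throttle_rounds_per_call) <= CLIENT_TIMEOUT:
            return ("success", attempts)
    return ("blocked", attempts)
```

With milder throttling (say two delay rounds, 12 s total) the first attempt succeeds; past the timeout boundary, every retry burns server time and the caller never sees a response, which is exactly the interaction you only discover by tuning in production.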
Do those limits depend on the account size or not? Like, are you giving the same limit to somebody who's running a thousand users as to somebody with just five users?

It's a great question, and right now the answer is mostly yes, they're the same. Over time, what's going to happen, very much as you're alluding to, is you pay more, you get more. So there is a monetary component that will go with it, and at some point there will no doubt also be a way to buy more capacity, so to speak. There are all these things we still have to work out. We're still relatively early in the journey of resource utilization. It's actually something we've been working on for about two years now, because it's a problem that on the face of it seems so simple: oh well, if somebody's using too much, you know, delay or block or whatever. But actually figuring all that out from the underlying system has been surprisingly hard. Over time, it's going to be like you described: as you pay more, you get more.
Same question about a reference implementation?

No, this one is very much tied into our system and how it works, and quite honestly the characteristics of our system, so this one would be difficult. Probably the best thing we could do is, at some point, really document how it works, so at least you could decide what ideas make sense for your system, go build something that would work for it, and at least steal some ideas.

Well, the motivation is, with the FastTrack program, us bringing the Microsoft story to customers, they're going to ask these questions, and if we're going to train the people at my company, we've got to be able to equip them with something.

Yeah, good question. I don't have a great answer for you on that, and unfortunately I need to give it some thought.

Question: do you publish the thresholds for your throttling, or is it done more on a case-by-case basis, so it would be a surprise to me when I was actually impacted?

Yes, good question. This is again one of our challenges. I'd love to be able to say, "If you make more than N calls per second, you'll get throttled." It's so simple, so easy to understand, and it's not that simple, because there are calls you can call like mad and it won't have much of an effect on us at all, and there are other things that are very expensive, where a few RPS is a significant load. There's just this wide variation in the cost of a given call, so at this point they're not published. Over time, we need to figure out how to give you guidance so that you can start to make sense of it and plan for it. We don't do a good job of that today.
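To illustrate why a flat "N calls per second" threshold can't capture this, here's a cost-weighted token bucket sketch: instead of counting requests, each call is charged by how expensive it is. The call names, costs, capacity, and refill rate are all invented for the example and are not the service's actual model:

```python
import time

# Hypothetical relative costs: a cheap read vs. an expensive query.
CALL_COST = {"get_item": 1.0, "run_big_query": 50.0}

class CostWeightedBucket:
    """Token bucket that charges per-call *cost*, not per-call count, so a
    single published RPS number is replaced by a spend rate."""

    def __init__(self, capacity=100.0, refill_per_sec=10.0):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self, call, now=None):
        now = time.monotonic() if now is None else now
        # Refill tokens for the time elapsed since the last decision.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        cost = CALL_COST.get(call, 1.0)
        if self.tokens >= cost:
            self.tokens -= cost
            return True   # serve the call
        return False      # delay or block the call
```

With these made-up numbers, a burst of a hundred cheap reads is fine, but just two expensive queries exhaust the same budget, which is the asymmetry that makes a single published threshold misleading.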
Wednesday, February 4, 2026
Cloud Patterns for Resiliency (Circuit Breakers and Throttling)