Akka, Spark or Kafka? Selecting The Right Streaming Engine For the Job

Happy Thursday, everyone, and welcome to today's webinar, "Akka, Spark or Kafka? Selecting the Right Streaming Engine for the Job." My name is Oliver White, Chief Storyteller and emcee at Lightbend, and joining me is my favorite streaming expert, Dr. Dean Wampler. He's our VP of Fast Data Engineering at Lightbend, not to mention a physicist, author, speaker, and excellent photographer.

While I take a moment to introduce today's session and allow any latecomers out there to get their popcorn ready, I'd like to ask our audience a quick poll question, and as an incentive to share your experiences, we'll be mailing a Lightbend t-shirt to a randomly selected poll respondent.

If you're joining us today, then possibly you're in the same position as a lot of us out there who are realizing that the batch-oriented architecture of big data, where data is captured in large, scalable stores and then processed later, is simply too slow. If you're looking for a competitive advantage, like our clients PayPal, HPE, Starbucks, and Capital One, then you've got to embrace streaming and fast data architectures, where data is processed as it arrives. But for most people we've talked to, there's rarely a one-size-fits-all technology that can handle all streaming use cases, so with so many stream processing engines out there, which ones should you choose? There are a few things to consider when making the right selection, such as latency: how low is necessary? Volume: how much are you expecting? Is complex event processing involved? Integrating with other tools: which ones? How is data processed, in bulk or as individual events? And so on. This is why Dean is joining us today: to help better equip you with the context and background to make good decisions when it comes to adopting streaming engines. Using our Fast Data Platform as an example, which supports a host of reactive and streaming technologies like Akka Streams, Kafka Streams, Apache Flink, Apache Spark, Mesosphere DC/OS, and our very own Reactive Platform, we'll take a look at how to serve particular needs and use cases in both fast data and microservices architectures.

Today's webinar is made possible only by Lightbend's hundreds of happy customers, especially our client Pivotus Ventures, a company that helps financial service providers offer smarter and more human options to their clients. Based in Menlo Park, California, they're looking for a lead site reliability engineer with experience in Java, Scala, and Akka, so that you can help scale their infrastructure to a whole new level. You can find out more about this position and their other open positions with our customers and partners on lightbend.com, under the Company tab.

As always, this webinar is being recorded, and it will be shared with you next week. If you have questions for our speaker, feel free to add them to the Questions tab in the GoToWebinar control panel, and we'll see if we can get around to them in the Q&A part of today's session. If you're an existing subscriber, then you can directly access our Developer Assist program, where our expert engineering teams are available to help you address any additional technical questions, how-tos, best practices, and what-ifs.

All right, that's all for me, so let's finally hear from our special guest, Dean Wampler. Dean, thanks for joining us!

Hi
Oliver, it's good to be back. Thanks, everyone, for coming. This is an updated version of a talk I did maybe a year ago, reflecting our evolving thinking about how to do this stuff properly. The core ideas are the same, but there's a lot of refinement that we've done to our thinking based on working with customers and really seeing the strengths and weaknesses of various tools, and that's what I want to leave you with today. Let me get the right button... there we go.

This is also based on a report I did some time ago called "Fast Data Architectures for Streaming Applications." The link at the bottom is actually the landing page for the Fast Data Platform that I lead the development team for, but it has a link to get to this report, if you'd like more details than we have time to cover today.

Let's put this in context a little bit; this will be fairly quick. I think most of you have probably dealt, one way or another, with classic batch architectures like Hadoop. One of the main reasons I'm going through this is to show how some of the ideas from here are carrying forward, while also being refined. This is a schematic diagram of what a Hadoop cluster might look like. We've got storage, which could be the Hadoop Distributed File System, but people also use things like S3, search engines like Elasticsearch, and databases of various kinds. Then you need something that can actually process this data: you land the data, as Oliver said, and then you want to process it later. Originally that was MapReduce; now Spark is used a lot for this, and there are other tools. Then there are layers on top of these tools, like SQL environments such as Hive. The other big piece of the Hadoop ecosystem is YARN, "Yet Another Resource Negotiator," which is sort of a first-generation resource manager; more modern versions of this concept are Mesos and Kubernetes, for example. And to complete the picture, I added a couple of other popular tools for getting data into the system, or even exporting it: Sqoop is great for databases, and Flume has been popular for log aggregation purposes.

But, as we discussed, it's kind of a competitive disadvantage if all you can do is wait for these batch jobs to run to analyze your data. Now, that doesn't mean that you're going to turn off your Hadoop clusters; there are still plenty of uses for them, for things like data warehousing and after-the-fact exploration, analysis, reporting, and so forth. But if you want to get answers out of your data more quickly, then you need something that works faster. So the fast data architecture has emerged:
it's basically stream data processing. This diagram is taken from the report; it's a bit busy, but we'll go through the pieces of it. The numbers correspond to numbered sections in the report. I have tweaked the diagram a little, so this is a slightly updated version, and I'll get to those details as we go.

Now we're going to have data coming in, in "real time," and I'm using that term very loosely. It could be streams of data, like sockets; here I'm using that as a placeholder for anything external to my system, such as the Twitter feed. It might be log exhaust, either from server logs or even things like clickstream logs, that I'm going to analyze. Some of the data even comes in through classic REST requests, where I might be processing user sessions but at the same time sending the data to my stream processing pipeline for doing things like machine learning recommendations and all that kind of stuff.

What we've seen emerge here is that building stream processing is not only different in the sense of doing things faster compared to Hadoop; it's actually raised the bar on the kind of services we build, in terms of their production profiles. They need to be scalable, resilient, and durable. If I'm going to stand up a stream pipeline that could run for months, it's going to encounter almost everything that could go wrong: network partitions, hardware failures, etc. So resiliency becomes a much bigger problem in stream processing versus batch processing, where the batch job might run for hours and may process terabytes of data, but it only needs to last for those few hours, and then I run another one later. That has forced these architectures for stream processing to look more like what we've learned about service architectures: microservices, CI/CD pipelines, all the buzzwords that you're perhaps used to if you come from that world. We do find that a lot of these systems need to be built with those kinds of capabilities in mind. As a corollary, if you're trying to get answers out of your data quickly and exploit it for advantage, that means you have to integrate your existing microservice infrastructure with your data processing infrastructure, so you can get that data into the services that need it.

So we see these sorts of converged architectures emerging, where I've got a bunch of microservices running. If I'm using something like Mesos or Kubernetes, where I can literally run everything on top of one big cluster, then I might be running my entire world on that one big cluster. Hopefully not literally one; there are still reasons why you might have duplicate clusters, for resiliency and whatnot, but nevertheless it's more of an integrated picture than having a Hadoop cluster off to the side, so to speak.

ZooKeeper is usually part of these sorts of things, for electing masters and storing shared data, and in fact it's mostly here because Kafka uses it. Kafka has emerged as a very hot item in this space because it's a great data backplane. It is, loosely speaking,
a message queue model, but it's really more of a log aggregation model, where you collect the streams of data in topics. Topics are analogous to queues in a message queue, but you actually traverse them differently: it's much easier to have the data stored in Kafka and then have multiple consumers traverse the same data, even go back and re-read it, and so forth.
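To make that log-versus-queue distinction concrete, here is a toy sketch in plain Python. This is not the Kafka API; the `Topic` and `Consumer` classes are invented for illustration. The point is that reading never removes data, each consumer just tracks its own offset, and rewinding to re-read is trivial:

```python
# Toy sketch of a Kafka-style topic: an append-only log that consumers
# traverse independently, unlike a queue where reading removes messages.
class Topic:
    def __init__(self):
        self.log = []                        # append-only record log

    def append(self, record):
        self.log.append(record)
        return len(self.log) - 1             # offset of the new record

class Consumer:
    def __init__(self, topic, offset=0):
        self.topic, self.offset = topic, offset

    def poll(self):
        records = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)    # advance past what we read
        return records

    def seek(self, offset):
        self.offset = offset                 # rewind to re-read old data

topic = Topic()
for r in ["a", "b", "c"]:
    topic.append(r)

fast, slow = Consumer(topic), Consumer(topic)
assert fast.poll() == ["a", "b", "c"]
assert slow.poll() == ["a", "b", "c"]   # same data, independent traversal
fast.seek(0)
assert fast.poll() == ["a", "b", "c"]   # going back and re-reading is natural
```

Because consumption is just "an offset into a shared log," a crashed consumer can also resume exactly where it left off, which is the resilience point made below.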
And Kafka does this very well at large scale, with large volumes of data; LinkedIn likes to point out that they're now running well over a trillion messages a day through their Kafka infrastructure. So we really like Kafka for this kind of backplane that you need.

A big advantage of Kafka, which I think I have on the next slide, is that it architecturally solves some problems for you. One is illustrated here: I've deliberately drawn a complete bipartite graph, to use the technical term, to illustrate a worst-case scenario where I have a very complex connection topology. It's hard to understand, but it's also rather fragile. Suppose service one on the right-hand side crashes: I could have data loss if I have these direct connections from all these other sources, because that data is now getting dropped; and whatever service one was doing, it can no longer do, and it won't be able to pick up where it left off whenever I replace it. As we all know, we solve any problem in computer science with another level of indirection, and that's what Kafka does for us in this case. By having it in the middle, we've gained a whole bunch of advantages. We've simplified things architecturally: we have one sort of universal way of connecting things together. As one of the services on the left, I only need to understand the Kafka APIs; I don't have to understand whatever particular ad hoc REST APIs the services on the right support, for example. And if we have the scenario where something crashes, the data will be reliably captured, and the services on the left won't even have to know that something went wrong; whenever I restart service one after it's gone down, it'll just work. So it's a very well-thought-out architectural piece of our system, and it adds a lot of value.

Then, analogous to what MapReduce and Spark were doing with HDFS in the Hadoop world, we've got all this data that's
flowing through Kafka, and maybe going into storage on the lower left, but what am I going to do with it? We looked at the landscape of streaming engines, and there's a whole lot of them that claim they do streaming, and we tried to decide on a minimal set that covers the full spectrum: tools that also have a very vibrant community and seem like they're going to be around for a while. We settled on four: Spark, Flink, Akka Streams, and Kafka Streams. We'll spend most of our time talking about the qualities of these engines and the kinds of requirements that they satisfy, and we'll get into Beam a little bit as well; that's here too.

Just to finish the diagram before we get into streaming: again, it can be "bring your own persistence." People do use HDFS here; it's still very useful. In fact, all the boxes here are the same as they were on the Hadoop slide; pretty much all the same persistence engines apply. It's a little bit different if you're doing stream processing: there are some new engines that are a little better at managing streams, and some more traditional databases are getting better at handling streaming data. The last thing on this chart is where we're going to run this stuff. It's really all pretty flexible; it's somewhat agnostic to how you run it. Obviously, as I mentioned, Mesos and Kubernetes are newer, more mature systems that give you much more flexibility for running a wide class of applications and services. YARN has been catching up to some degree, but I think the days are numbered for YARN in terms of building these kinds of new-generation architectures. But you can run this stuff anywhere you want: on premise, in the cloud, whatever.

OK, let's get to streaming engines then, in
particular. To start, I want to talk about the kinds of requirements you should think about when you're picking an engine, or features, if you want to call them that. Here's the list; I'm going to go through these one at a time, so I won't read the list now.

Let's start with low latency: how low? What's your budget for processing before you have to have a result to send downstream? It can range over quite a spectrum. If you're doing what is truly real-time, and we tend to abuse that term, if you're talking about picoseconds to microseconds, like the flight control software in the SpaceX picture, you're probably talking about custom systems that are really outside the scope of what most of us have to worry about today: custom hardware, low-level C programming, and so forth. When you get into the range of under a hundred microseconds or so, you can actually do some of this with very fast JVM libraries, and very effective use of even some REST protocols. This is more the domain of regular trading, as opposed to high-frequency trading, and if you're working on things like diagnostic medical devices, these would be the kinds of time frames where you need to be responding.

Ten microseconds: the reason there are pictures of credit cards here is that I once had a chat with some people from a bank who are responsible for authorizing credit card transactions for major websites. When you click "buy" on a website, there are all kinds of steps that that chain of transactions goes through, and the usual rule of thumb in usability is that you need to get an answer back to the user within, say, 100 or 200 milliseconds. (Note there's a typo here: it should be 10 milliseconds, not microseconds.) They told me that their budget, the time they had to actually authorize the card, was ten milliseconds; that's how long they had to make a decision.

Getting into the realm of hundreds of milliseconds, you've got a lot of flexibility. If you're serving a dashboard, that's obviously not a real-time problem, but for human attention you'd like to let people know within some reasonable amount of time that something's going on. If you're training machine learning models, or even scoring in some cases, then you often need more time, because these are more compute-intensive models. If you get up to seconds to minutes, now you're in the realm of model training, where I don't necessarily have to have my model up to date to the second, but maybe within every few minutes. This could actually go much longer, even hours: your spam filters don't need to change very fast, but they do need to change pretty regularly. Or if I'm doing classic ETL, where I'm just ingesting data and I want to park it somewhere, a really nice use of Kafka Streams, for example, is processing raw data and putting it into a much nicer format for downstream consumption; again, typically not something that requires low latency. And, to be frank, if you're going above a minute or so of latency, remember I mentioned how hard it is to make microservices run and stay healthy for months: you're probably better off just doing periodic, short batch jobs if your latency budget is really quite long. That will actually be more likely to stay healthy for you, just periodically kicking off short-running jobs.

All right, the next topic
would be: what's your volume? Really, this probably should be velocity: how much data are you processing in some unit of time? For a lot of what we've built in the past, especially if you've worked on websites, tens to hundreds of thousands of messages per second might be typical. If you use REST carefully, like non-blocking REST, you can meet these requirements pretty easily; you typically scale in parallel to address these needs and also provide the resilience that you need. When you're getting into the realm of millions of events per second, then you need to be a little more careful about the design, favoring non-blocking services. The reason the Nest thermostat picture is here is that Nest was implemented with Akka, and they once told me that they got a really interesting problem every day, in every time zone: between about 6 and 7 a.m. they would get a big spike in traffic, because that's when people typically got up in those time zones, and the environment flipped over to daytime settings. So they had to do all kinds of stuff to broaden and flatten that bump of traffic. This is a great place where you want to use non-blocking REST and non-blocking services, like Akka messaging and so forth.

What kind of processing do you want to do on the data?
What kind of analytics? A lot of people like to use SQL, and this is kind of amazing to me, actually. You can understand why people would want to use SQL in a batch context, because it's a great, concise way of asking questions of data, but it has actually emerged as a popular domain-specific language for stream processing, too. If I'm just somebody writing jobs that need to do some analytics on the fly, but I'm not a great programmer, it's really nice if I can use SQL to express the job, because I may know that language much better than, say, Java or Scala. So here's a couple of made-up examples of doing a GROUP BY over the zip code for some IoT data: a SQL query on the left, and on the right the Spark DSL that I might use to write it in Java or Scala code.
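As a rough illustration of "the same group-by, two ways," here is a sketch using SQLite and plain Python as stand-ins (the slide's actual Spark code isn't reproduced here, and the IoT schema is invented): the SQL form is concise and familiar, while the programmatic form spells out what a DSL call like Spark's `groupBy(...).agg(...)` does:

```python
import sqlite3

# Hypothetical IoT readings: (zip_code, temperature)
readings = [("60601", 10.0), ("60601", 14.0), ("94105", 20.0)]

# The SQL way: concise, and familiar to non-programmers.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (zip TEXT, temp REAL)")
db.executemany("INSERT INTO readings VALUES (?, ?)", readings)
sql_result = dict(db.execute(
    "SELECT zip, AVG(temp) FROM readings GROUP BY zip"))

# The programmatic/DSL way: what a groupBy + average amounts to.
groups = {}
for zip_code, temp in readings:
    groups.setdefault(zip_code, []).append(temp)
dsl_result = {z: sum(ts) / len(ts) for z, ts in groups.items()}

assert sql_result == dsl_result == {"60601": 12.0, "94105": 20.0}
```

Both produce the same answer; the choice is about which notation the person writing the job knows best.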
Next: dataflow processing. Here I'm thinking of a sort of factory model, where I'm going to send the data down a pipeline, and at each stage do some sort of transformation: filtering, splitting it, and so forth. This happens to be a section of an implementation of the inverted-index calculation, if you know that one, and I've highlighted in red the kinds of operators you use at each stage of this factory to process the data as it goes.
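A toy version of that kind of pipeline, with each factory stage as a separate transformation (plain Python standing in for the Scala on the slide; the document names and stages are invented):

```python
# Toy dataflow for an inverted index: each stage is one transformation,
# in the spirit of the highlighted operators (flatMap, groupBy, ...).
docs = {"doc1": "akka streams and kafka", "doc2": "kafka and spark"}

# Stage 1: tokenize into a stream of (word, doc_id) pairs (a flatMap).
pairs = [(word, doc_id)
         for doc_id, text in docs.items()
         for word in text.split()]

# Stage 2: group by word (a groupBy / reduce step).
index = {}
for word, doc_id in pairs:
    index.setdefault(word, set()).add(doc_id)

assert index["kafka"] == {"doc1", "doc2"}
assert index["spark"] == {"doc2"}
```

The streaming engines differ in syntax, but this "send records through a series of transformations" shape is common to all of them.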
We talked about ETL earlier; it's a typical, very common, if mundane, sort of data processing, where I might have data that needs to be cleaned and transformed into a format that's more useful downstream. And finally, of obvious great importance these days, is exploiting machine learning. There are really two phases to this. One is model training, which presumably you want to update periodically, either in some sort of streaming way, or, if it really doesn't need to be updated that often, you might do it offline with batch tools but then somehow integrate those model updates into the pipeline that's going to serve the model, or score data, as it flows through.

Another interesting decision point is: do I need to process these things individually? I'd like to use the word "events" for that, something with a unique identity, a unique sense of importance, where I need to do some complex event processing. Or is it more like bulk "records," where I can operate on things en masse in some way, even if in practice I actually do it one at a time? There I'm thinking about the operations in bulk, not so much treating each record, the term I'm using here, as having some sort of individual identity. Of course, in a real database most records have IDs, as in a SELECT where some ID happens to equal some value, but mostly in the streaming context we don't think that way; we just think of structured data flowing through the system, and we might want to do some bulk operations on it in some sort of small window of time.

Getting near the end of these
requirements, this one is actually new if you've seen this talk before. I began thinking about how Akka Streams and Kafka Streams are fundamentally different from Flink and Spark in the following way. Akka Streams and Kafka Streams are libraries that you embed in your applications, and that gives you a lot of flexibility and freedom, especially for integrating other kinds of processing into a flow that's also doing streaming; but it also puts more of the burden on you to actually run things. Whereas Spark and Flink run as services that you start up, just regular Unix-style daemons, and you submit work to them: they figure out how to distribute it over the cluster, monitoring job and task execution to make sure that things are restarted if they fail. They do a lot of work for you, but it's a more constrained model, so it may not be as flexible as you need.

All right, with that, let's talk about the
four streaming engines I mentioned, plus the fifth one as we go. We'll zoom into the part of the diagram that was on the right-hand side. The reason Beam is on this diagram is that Beam has emerged as an extremely influential streaming engine in terms of what the semantics of streaming should be. If I want to do as much processing as possible in a streaming context, what are the sorts of things I have to account for? Do I process things in windows? What do I do about data that arrives late? All of these are the kinds of questions that the Google Dataflow team, which is basically where Beam came out of, went through thinking about over a period of years. The way they did it, though, is that they open-sourced the top end: the Beam semantics and the Beam API, where you define these dataflows for processing, and then you run them with a runner like Flink or Spark, unless you happen to be running in Google Cloud, in which case you would use Dataflow itself. Today, Flink is the most mature runner that I know of in terms of supporting most of the Beam semantics. So Beam is really defining the state of the art; it's very influential on how all of these engines are implementing their capabilities and meeting people's needs. I'm not really sure how much Beam itself will actually get used in production; there are definitely some big production users already outside of Google, but it may be more influential in its ideas than in actual use.

Just to give you an example of what
I was referring to with "semantics": let's suppose, for the sake of argument, that I'm doing counting calculations every minute for some reason; maybe I'm calculating how many items I sell of each SKU in my catalogue per minute. If you think about how the data travels over my cluster to whatever node is doing this calculation, because the time of flight is not infinitely fast, there will always be some delay, so some messages will arrive in the next minute rather than by the end of the first minute, when I want to trigger the calculation. In this first column I have three events on server two and one event on server one, but, as indicated, one of those events on server two will arrive late. So what do I do about this? Well, Beam thought about this, and it gives you ways of specifying what to do about late arrival of data: how long to wait, and when to say "I'm not going to accept anything beyond a certain point," and so forth. Here I'm illustrating just the normal state of things, where just because of network latency I'm going to see some delays, but it would be even worse if, let's say, server two were lost to me because of a network partition: it's still running, it's accumulating data, but maybe hours later I'll start to see its data, instead of microseconds later or something. This is just a taste of what they've been thinking about in that project, and how other people are starting to adopt these ideas.
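Here is a small sketch of that late-data idea in plain Python; it's deliberately much simpler than Beam's actual model, and the names are invented. Events carry the time they *happened* (event time), windows are keyed by that time, and a configurable "allowed lateness" decides whether a straggler still counts or gets dropped:

```python
from collections import defaultdict

ALLOWED_LATENESS = 30          # seconds past the end of a window

def window_of(event_time, size=60):
    return event_time // size  # one-minute windows, keyed by event time

counts = defaultdict(int)      # per-window counts
dropped = []                   # events that arrived too late to count

def process(event_time, arrival_time):
    end_of_window = (window_of(event_time) + 1) * 60
    if arrival_time <= end_of_window + ALLOWED_LATENESS:
        counts[window_of(event_time)] += 1   # on time, or tolerably late
    else:
        dropped.append(event_time)           # beyond allowed lateness

process(10, 12)     # arrives in its own window
process(55, 70)     # late, but within allowed lateness: still counted
process(50, 200)    # far too late (e.g. after a network partition): dropped
assert counts[0] == 2 and dropped == [50]
```

Beam lets you tune exactly this trade-off (and far more, e.g. emitting corrected results when stragglers arrive), which is the semantic ground the talk is pointing at.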
So, Apache Flink is a project based out of Berlin, and it really started as a low-latency streaming engine; when we get to Spark, we'll talk about how that one actually started as a batch engine, and the implications of that. Of the engines we're going to discuss, Flink is the best Beam runner. It has a mature SQL engine, which came out last year and is now widely used as the programming API, so to speak, at some large companies in China. Flink is one of those tools you might pick if you're mostly building a streaming pipeline system and don't necessarily need a lot of Spark-style batch capabilities, in which case you might decide, "I'll just use Flink for this problem, and let the data scientists over there worry about using Spark or whatever for their analytics."

Spark
is probably the one you know best; it's certainly a very large, active project. As I said, it started as a batch-mode system, sort of a replacement for MapReduce, but as streaming got more popular, they first implemented a clever hack: wrap around the batch engine the ability to capture data for fixed time intervals, and then kick off little batch jobs at the end of those intervals. That was their mini-batch model, and the API was called Spark Streaming. But they've recently introduced a new, lower-latency engine that's more of a true stream processor, as part of their Structured Streaming API. So there are a lot of choices when you're working with Spark: mini-batch is obviously the most mature in terms of years of use, while Structured Streaming is new and maybe not as fully baked yet, but it's definitely the future for Spark streaming. Spark is also great for batch work and things like SQL processing; they have their own SQL dialect now, which I think is a superset of Hive's SQL, for example, plus a rich and growing machine learning library, and lots of third-party libraries are integrated with it now, like TensorFlow and Intel's BigDL.

So again, Spark and Flink are the ones you use as services: I'm going to run these services, submit big chunks of work to them, and just let them run and figure out how to partition the data over the cluster. But if you have really low latency requirements, maybe not quite as much volume, and you might need a lot more flexibility in how you both process the data and integrate the stream processing with other things you're doing, then you might consider using one of the library alternatives, Akka Streams or Kafka Streams.
Oh, there's a typo on this slide; boy, I should read these slides. Akka Streams does not implement a Beam runner; we've discussed that within Lightbend, but at this time it doesn't. We're also working on whether we should do SQL or not. Mostly, what we're using Akka Streams for today is very low-latency stream processing, with very rich semantics for expressing the dataflows you might need. One way we've used it in some of our client engagements (just clicking the button there) is for that model-serving case, where you can send messages to Akka actors to get new data and new model parameters into the stream pipeline, and then do scoring within the same JVM, with low latency. That's a really nice use case for Akka Streams.

Kafka Streams is really oriented towards Kafka, and that's often exactly what you want. What that means is that it's not designed to talk to databases or search appliances or REST requests, things like that; it's really oriented towards reading and writing Kafka topics, whereas Akka has a rich connectors library called Alpakka. But a lot of the time that's all you want: to read data from a Kafka topic, manipulate it, and write it to new topics. They've done a really good job thinking about common use cases, and so they provide both streaming semantics and a table abstraction, where a "table" might mean "I only really need to see the last value for a given key, not every value that has come through," or "I'm going to do some roll-ups," where what I'm really doing is keeping a running average or something, and I don't have to see all the values that fed it. So the table abstraction in Kafka Streams is very handy.
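A toy contrast of the stream view versus the table view, in plain Python rather than the Kafka Streams API (KStream and KTable are the real abstractions this mimics; the sensor data is invented):

```python
# Stream vs. table views over the same updates:
# a "stream" sees every record; a "table" keeps only the latest value
# per key; a roll-up keeps a running aggregate instead of all history.
updates = [("sensor1", 10), ("sensor2", 7), ("sensor1", 13)]

stream = list(updates)            # stream view: every event, in order
table = {}                        # table view: last value per key
running_sum = {}                  # roll-up: running aggregate per key

for key, value in updates:
    table[key] = value
    running_sum[key] = running_sum.get(key, 0) + value

assert len(stream) == 3
assert table == {"sensor1": 13, "sensor2": 7}
assert running_sum == {"sensor1": 23, "sensor2": 7}
```

The table and roll-up views are why you "don't have to see all the values": the abstraction keeps exactly the state you care about, compactly, per key.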
They also recently introduced a SQL tool, and this is a little different from Spark and Flink: whereas in Spark and Flink, SQL is a library, just an API, in the Kafka world it's actually a service that you run, which lets you do queries against running streams. Still very useful, but a different model.

So, to summarize a little bit: Spark and Flink give you rich analytics and lots of flexibility, and are best for working with massive data sets; Akka Streams and Kafka Streams are best for microservice integration, and for wider flexibility when that's important.
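The "streaming engine as a library" side of that summary can be mimicked with nothing but Python generators; this is only an analogy, not a real engine, meant to show how an Akka Streams or Kafka Streams pipeline is ordinary code embedded in your application:

```python
# Toy "library-style" streaming pipeline: each stage is a generator,
# composed like a dataflow, running inside the application process
# with no external cluster or service to submit work to.
def source(records):
    for r in records:
        yield r

def parse(stream):
    for line in stream:
        yield line.strip().lower()

def keep_errors(stream):
    for line in stream:
        if "error" in line:
            yield line

pipeline = keep_errors(parse(source(
    ["  ERROR: disk full ", "ok", "Error: timeout"])))
assert list(pipeline) == ["error: disk full", "error: timeout"]
```

Because the pipeline is just code in your process, it composes freely with everything else the application does, which is exactly the microservice-integration strength claimed above; the cost, as Dean notes, is that running, scaling, and restarting it is your job, not the engine's.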
So, once again, here's the link for that paper, the report, rather, that I wrote some time ago: lightbend.com/products/fast-data-platform. There you can also learn about what we're doing to provide commercial tooling around all of these open source projects, integrated testing and all that, to give you something that's reliable in production. And here's that link again. Now I'd be happy to take your questions.

All
right, thanks very much, Dean. You mentioned a couple of use cases in your slides, including Nest. I'm wondering if you can share another interesting use case that you've seen in the field when it comes to using streaming and fast data analytics.

Yeah. Actually, one of the reasons I highlighted the microservice angle a little more with Akka Streams and Kafka Streams in this talk, versus the ones I gave last year, is that we're finding that a lot of people really do find that to be the right way to do so-called stream processing. There's a telecom company in the US, I don't think I'm allowed to name them, that had a situation where their old system was largely based on Hadoop and some of the streaming tools that the Hadoop vendors provide. They were trying to do processing during major sporting events, where there was some promotion going on, so there would be these big bursts of traffic, and they needed to be able to keep up, and they were finding that their existing system couldn't do that. What they did was reimplement it using this platform, with Akka Streams as the streaming engine, and they got orders of magnitude performance improvements, a lot of orders of magnitude, actually. They were able to reduce the number of servers they needed in production by something like a factor of five or ten, and still have ten or a hundred times more throughput with the new system. A lot of times it really is about avoiding the big, heavyweight things and just focusing on something small and fast like Akka, tailored to the particular problem you're trying to solve.

All right, thanks. A few people in our audience
asked a similar question: what if you're only building a fast data analytics engine and you really just want to use Spark and Kafka and pretty much keep it clean? Why is it important to apply streaming to your microservices architecture as well, and kind of make sure all of these things are working together?

Yeah, I think with all of these things you sort of, you know, pick what you need and don't deploy what you don't need. For example, I don't expect anyone would use all four of these streaming engines in the same system. Mostly what we're trying to support here, and what we're seeing people need, is, you know, you may not be thinking so much that you're working with data today; you might be thinking more that you're managing user sessions and users are doing work. But as you grow bigger, you tend to get overwhelmed by the amount of data that's flowing through the system that you have to process. So that's why we think you need the flexibility to be able to grow in that direction, to exploit some of these big data tools in a streaming context even if you don't need them today. You might eventually decide that you really do have to pull in these things. And conversely, if you're the data science team and you've come up with all these cool analytics, your value to the company will obviously be accelerated, or magnified let's say, if you're able to get those results integrated into, you know, what people are actually doing right now on the website. So yeah, I think a lot of people either come at it from the classic services direction or from the big data world, and they'll probably meet in the middle one way or the other.

All right, so speaking of microservices
architecture, and how this combines with fast data and streaming architectures, where do concepts like event sourcing and CQRS play a role for streaming data?

Yeah, this is another area where there's just this kind of synergy emerging, and a lot of it depends on your perspective. In a way, Kafka supports the idea of event sourcing in the sense that it captures this stream of data for some configurable amount of time, and you can treat that as an event source that you replay as much as you want, or as different consumers come and go. In one case you might be replaying that as: I'm really treating these as individual events where I'm going to do some complex event processing. But then another consumer might really think of it in this sort of record-oriented way, where this is just a bunch of structured data that I'm going to do some, you know, global analytics over, or something. So in my view, a lot of these ideas like event sourcing and CQRS will really come down to what perspective you're bringing to your problem domain and the kind of problems you're solving, and whether or not the data that you're working with is something that you think of as events or just records; the same data can be seen in both ways depending on the consumer. So that's the way I think of it: all of these things make a lot of sense. As always with any pattern or idea, you just make sure you're not, like, drinking the Kool-Aid, but actually applying something that makes sense and delivers clarity and value to your system. But this sort of spectrum that we're laying out here actually covers that very nicely, as well as supporting the needs of people who don't even know what the heck you're talking about when you talk about event sourcing or CQRS.

Great, thanks for answering that.
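Dean's point that the same log can be consumed as individual events or as bulk records can be sketched in a few lines of plain Python. This is only an illustrative stand-in for a Kafka topic, an append-only log that any consumer can replay from the beginning; the record shapes and function names are invented for the example, not part of any Kafka API:

```python
# A minimal stand-in for a Kafka topic: an append-only list of records
# that any number of consumers can replay from the beginning.
log = []

def append(event):
    log.append(event)

# Producer: user-session events arriving over time.
append({"user": "alice", "action": "login"})
append({"user": "alice", "action": "purchase", "amount": 40})
append({"user": "bob", "action": "login"})
append({"user": "alice", "action": "logout"})

def purchases_per_session(events):
    """Consumer 1: treats each record as an *event* (a CEP-style view).

    Replays the log in order, tracking session state, and flags
    purchases that happen inside an active session.
    """
    alerts, active = [], set()
    for e in events:
        if e["action"] == "login":
            active.add(e["user"])
        elif e["action"] == "purchase" and e["user"] in active:
            alerts.append(e["user"])
        elif e["action"] == "logout":
            active.discard(e["user"])
    return alerts

def total_revenue(records):
    """Consumer 2: treats the same log as plain *records* for bulk
    analytics, ignoring event ordering entirely.
    """
    return sum(r.get("amount", 0) for r in records)

print(purchases_per_session(log))  # ['alice']
print(total_revenue(log))          # 40
```

Both consumers replay identical data; whether it is "events" or "records" is purely the consumer's perspective, which is the sense in which Kafka supports an event-sourcing style.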
There are quite a few questions appearing when it comes to, you know, resources and what's needed to even set up a test environment and start playing with some of these technologies. What are some of the differences between the infrastructure resources needed, and sizing, with fast data architectures compared to the previous batch model of Hadoop?

Right, one of the things about Spark that was just so great when it came out in the Hadoop world was that it actually was possible to easily run your Spark job just locally on your laptop as you're prototyping, working through the algorithm, working with a test data set. Then, just by flipping a switch, sort of metaphorically (we're really just telling it: all right, you were running locally, now I want you to run on this cluster over here), it suddenly was able to scale to massive scales. I think all of these tools support that to some degree. There's even a way of running Kafka as sort of an embedded library within your application, so you could set up your development environment so that it really is like one big JVM you're running, but it's got Kafka in there, it's got Spark, etc., so that you can work out the logic, and a little bit of the performance testing that you might want to do in development (this is really important when you're learning Spark: understanding the impact of various constructs), but then take that and repurpose it to run on a big cluster. And then in the middle, typically what people do is stand up a development cluster. Just as a reference point, my development team shares like a 40-node cluster that's got everything in our platform running on it, and it's plenty big for submitting big jobs versus small jobs, with plenty of storage. And again, this is a benefit of Mesos and Kubernetes: you can have people coming and going and needing resources, like, maybe tonight I need most of the cluster to do some big job analyzing last year's sales data or whatever, but then when I'm done all those resources are freed and somebody else can work. So that's the other thing: you don't actually need a big cluster, especially if you're not installing everything, and you can tune these things to be fairly small. So you have this spectrum of everything from the laptop all the way up to your big production cluster, and it fits very nicely into classic CI/CD pipelines.
Great! Well, Dean, thank you so much for joining us today. That's all the time we have for questions. I do note that many of you have questions that we were not able to get to; if you're a Lightbend subscriber, you can add those to the customer portal, and our engineering team, and possibly Dean himself, will be answering those questions for you. Also, if you feel like it's time to find out a little bit more about how Lightbend can help you ramp up with Fast Data Platform, you can simply write to me directly at oliver@lightbend.com, or you can contact us on our website and set up a 20-minute chat with someone. So Dean, thanks once again for joining us, and I wish everybody on the call today a wonderful Thursday. Thanks again, everybody!

3 COMMENTS

    Sorry to miss that valuable talk, but now I am happy to see this here and appreciate it; many thanks Lightbend!

    Dean's comments about Kafka Streams at 29' are incorrect. 1. Kafka Streams and KSQL are different products and runtimes: the former is embedded, application mode; the latter is run as a cluster that dynamically runs Kafka Streams topologies from SQL. 2. Kafka Streams' means of ingress/egress is via Kafka Connect.
