S2 E1: Bit: What's the hype with Hyper?
Everyone says Hyper is just faster, but speed is only one of four reasons it's quietly revolutionary.
- Hyper was acquired by Tableau around March 2016 and introduced in Tableau 10.5, replacing the older TDE extract format; it was built by academics in Munich, including query-optimisation expert Dr Thomas Neumann.
- Hyper's speed comes from four pillars, not one: query optimisation, query compilation via an LLVM intermediate step, doing as much processing as possible in memory, and morsel-driven parallelisation.
- Because RAM is dramatically cheaper and CPUs more capable than 20 years ago, Hyper could be designed from scratch to exploit modern hardware, something legacy enterprise databases struggle to retrofit.
- Morsel-driven parallelisation splits work into millions of tiny pieces rather than fixed per-core chunks, avoiding the diminishing returns Amdahl's Law predicts when you simply add more cores.
- Hyper aims to collapse separate transactional, analytical and beyond-relational systems into one, enabling features like Tableau Prep, live tooltips and loading data into extracts without a refresh.
- What Hyper is and where it came from 0:11
- Beyond speed: the four founding principles 3:00
- How modern hardware changed the game 6:30
- Query optimisation explained 13:25
- Query compilation and LLVM 18:09
- Why memory beats the hard drive 23:54
- Morsel-driven parallelisation 30:11
- Tableau's competitive advantage 35:33
- The future Hyper enables 41:51
- Further reading and wrap-up 46:46
0:00Hello and welcome to What So What Now What.
0:03If you missed our previous episode, you'll have missed a brief update about the new format we're going to be approaching
0:09this podcast with this year.
0:10And so today's actually a Bit.
0:13It's going to be about Hyper, and I'm joined by my co-host Ravi.
0:16Hello, hello, how are you doing?
0:18I'm good, I'm good.
0:19How's your new year?
0:20New Year was good.
0:21I'm slowly getting back into the swing of things.
0:24So it's a nice, easy way back in after a bit of the chaotic kid
0:30Christmas period.
0:31So we're going to be talking about Hyper.
0:33Absolutely.
0:33We're going to be talking about Hyper.
0:34I'm really excited about this because this technology has actually been around for a while and
0:40definitely in the Tableau sphere it's something that we've all started using, and we have almost forgotten when it was introduced and how it was introduced.
0:50So we just wanted to take a a
0:56Yeah, for sure.
0:57So it's interesting.
0:58So Hyper was dropped in Tableau 10.5, right?
1:01It was an acquisition, so I think they bought it in
1:06around Easter time 2016, I think it was March.
1:10And it's basically a technology that was developed in Munich by a couple of Germans.
1:16And it's basically an academic
1:20endeavor; there was a project that they worked on and they were looking for a bit more financial backing, and Tableau came in and purchased them because I think Tableau also wanted to move on from the TDE.
1:32So taking a quick step back,
1:34what these two technologies, a TDE and a Hyper, these two file types, are is extracts of data.
1:42So when you connect to a database in Tableau, or even an Excel spreadsheet,
1:47you have the ability to extract that data into what was formerly a Tableau Data Extract or what's now a Hyper extract.
1:55And when it was demoed, I think the first time it was demoed was in Austin, 2016.
2:01The biggest excitement was the speed increase,
2:04the speed of processing it brought and the fact that you can query millions and millions of records of data on the fly, and it's so much faster to extract and
2:14the processing is so much better.
2:16But I think when we dug a bit deeper into this we found out there's a lot more going on
2:23than just what we see. Like I mentioned in the previous podcast, I'm very much an "I just need to know what it is so I can explain it" person.
2:31You know, have that one-liner that you mentioned so you don't sound stupid.
2:37And for me it was like, oh yeah, it's just a lot faster, it processes an extract a lot quicker, but I've never really ended up digging into, like, okay, but
2:46what is it?
2:47Like, is it a database?
2:48Is it an extract?
2:49Like, how does that all actually work?
2:52So that's what Hyper is.
2:59Yeah, it's an interesting one because speed is often the thing that everyone talks about.
3:03And I think after that you just kind of put
3:07the technology to bed.
3:08And I think that's actually a mistake.
3:10Hyper is about a lot more than just speed.
3:14And to dig into it, you have to go back to its founding principles.
3:19It was actually derived from an academic background, and essentially it was spun off by, you know, two Germans who were trying to understand why today's databases
3:29were so slow and had so many trade-offs compared to some of the hardware and technologies that we have available to us today.
3:38And so the core principles were very, very simple.
3:41They wanted to build one system.
3:43This meant that you only needed to have one database that could serve the purposes of being a transactional or an analytical database, or sort of beyond-relational kinds of
3:55infrastructure as well.
3:57They wanted to have it in one state, so that you didn't have, for example, what you have with the TDE: when you add data to it, you can't extract the data and be
4:09reading data from it at the same time.
4:11So they wanted it to be accessible in multiple states.
4:15And then, on the other side of it, they wanted no trade-offs.
4:18Yeah.
4:19Given those requirements. And they wanted no delays.
4:22So speed is actually only one of the four key factors that they wanted to implement in this.
4:27And I think as we dig into this, I must stress we're going to be talking about concepts that are actually hard to follow unless you dig into the research papers and
4:36understand a lot of the terms around them.
4:38We're going to be massively simplifying some of these concepts just so that we can, A, have a discussion that you and me can
4:45keep up with, but also, B, relay them back to you in a way that kind of makes sense.
4:50So we're gonna be using a lot of analogies and metaphors and oversimplifications of a lot of concepts.
4:55But we just want to stress that up front before we go in.
4:59Sure.
4:59And I think it's also worth stressing that neither me nor Tim have a hard background in computing in terms of academia.
5:06I'm an economist, and Tim, what did you study again?
5:09I'm a generalist; I studied management.
5:12Right, exactly.
5:13So this sort of top-level view is
5:18really useful for us as well.
5:19So it's breaking down these concepts so we understand them, and then you, our listeners, can understand them a bit better as well.
5:26I think what you mentioned there about computational power is really quite interesting, because the rate of innovation has been so much faster in the last ten years, even versus the previous decade, right?
5:38I think the tweet going around a lot is that 1999 is now twenty years ago.
5:45Um
5:46So, you know, whether it isn't it?
5:48It's it's yeah, exactly.
5:49And the leaps and bounds that we've gone through in what we do and how we access content and information have changed so much.
5:57But
5:58fundamentally it's the computing power that makes all of these things possible. You know, the fact that I can play Fortnite on my phone is possible because, you know, the Apollo 11 computer that landed on the moon
6:10had like a thousand times less power than, like, the iPhone 5.
6:14We've condensed so much technology into smaller things, but the question is: has the wider enterprise level of technology kept up? And
6:25I'd argue no, based on what we found.
6:29Yeah, absolutely.
6:29And so we're kind of gently moving into the "so what" area here.
6:34And so to summarize, Hyper is, as Tableau terms it, a next-gen database system, right?
6:42It's not a database that you can buy off the shelf today.
6:45Tableau definitely haven't yet spun it off as a database, but it's the driving engine behind Tableau Server, Tableau Desktop
6:54and pretty much most of the new innovations from Tableau.
6:57And so if we do take that step back and start to think of the so what?
7:01Ravi, you touched on the computing landscape.
7:04And I think it's important to take that step back and just
7:07touch on a couple of points regarding hardware and hardware capabilities of today versus
7:15twenty years ago.
7:16So the simplest thing you can look at is hard drives.
7:18I mean, I remember buying a hard drive, you know, one of those USB pen drives, and the most you could get on it was four
7:25megabytes, and that was revolutionary because the floppy disk I used to have could only contain, you know, 500 kilobytes, right?
7:32That's what, three songs?
7:33Yeah, exactly.
7:36And now, you know, just to download one Instagram photo requires more than five hundred kilobytes or whatever you could fit on a floppy disk.
7:44And so
7:45the cost of computing has come down massively.
7:48And hard drives are one aspect.
7:50But the other aspect that people don't often talk about is memory.
7:53And by memory I don't mean, you know,
7:56storage on your hard drive; I'm talking about random access memory.
7:59So this is the kind of memory that your computer uses to store information whilst it processes it.
8:06Yeah.
8:07Yeah.
8:07So short-term information, right?
8:09Exactly, exactly.
8:10And this is kind of how computers were built.
8:13They were built with this sort of logic in mind.
8:16And so the way a typical computer
8:19works is you have information stored on your hard drive, and when your computer needs to process something, it takes some of that information and, whilst it's processing it, it leaves it in memory, so that if you need that information again, it can pick it up and
8:33carry on processing it.
8:34So if you do something like Photoshop editing, that is a mostly in-memory process, because it has to keep all that information in memory so you can keep on
8:41editing your image.
8:43And if you look at the price of memory from 1994 to pretty much where we are today, it's gone from costing roughly
8:53forty-five thousand dollars for a megabyte down to being under two dollars, to the point where today you can buy a one-terabyte RAM system for about fifteen thousand,
9:05which sounds like a lot of money, but for one terabyte of random access memory, that's just revolutionary.
9:10And so it brings a lot of capabilities into scope.
9:15The other aspect, obviously, is
9:19the computer, the thing that processes that information; that has also been progressing forward.
9:24So back in the day you used to have single-core computers.
9:28I remember some sort
9:30of revolution when you could start having dual-core CPUs and everything just got insanely faster.
9:36And ever since then what you've seen in computing
9:40is Moore's law starting to slow down, because we're reaching the theoretical limit of what's possible in terms of physical architecture
9:49today. So, Tim, what is Moore's Law?
9:51So Moore's Law is a simple law that this guy called Moore came up with, and it basically suggests that every two years the number of transistors that
10:01you can fit in a given space doubles, okay?
10:04And so, you know, back in the day, in 1970, that number was about
10:09a thousand, and where we are today that number is about twenty million, and it's actually getting harder and harder to squeeze more transistors
10:18into that space; it's actually levelling off.
10:21So it's no longer this exponential effect that you'd see as a straight line on a graph if you used a logarithmic scale.
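Show-notes sketch: the doubling rule described here can be written as count(year) = start_count * 2 ** ((year - start_year) / 2). The Python below is only an illustration of that arithmetic. The function name is made up for this note, and the starting figure of 1,000 in 1970 is simply the number mentioned in conversation, not verified chip data.

```python
def projected_transistors(start_count, start_year, year, doubling_years=2.0):
    """Project a transistor count assuming it doubles every `doubling_years`."""
    doublings = (year - start_year) / doubling_years
    return start_count * 2 ** doublings


# Hypothetical starting point taken from the conversation: ~1,000 in 1970.
for year in (1970, 1972, 1980, 1990):
    print(year, round(projected_transistors(1000, 1970, year)))
```

Plotted on a logarithmic axis, these projected values fall on a straight line, which is why a slowdown shows up as the line bending flat.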
10:27Yeah.
10:27So what's interesting about what you mentioned there: I was very young at the time, but I remember when the dual-core processor came out. You'd see the adverts saying, like, it's dual core, I can fit two into this.
10:40And now we're at, like, quad core and no one bats an eyelid, right?
10:43Like, yeah.
10:44It's like, oh yeah, of course it's quad core.
10:46What else would it be?
10:47In fact, the number of cores going up is almost now seen as
10:54a lesser technological advancement compared to being able to produce your circuitry at a
11:02smaller level.
11:02So you'll often hear the thing...
11:04Right, exactly, real estate.
11:05Yeah, real estate.
11:07So the difference between 14 nanometer circuitry and 7 nanometer
11:12circuitry.
11:12It's literally exactly as it sounds.
11:15The circuitry is much, much smaller on a 7 nanometer process
11:20compared to a 14 nanometer.
11:22And so what that allows you to do is use less power, because you don't have as much circuitry to run power through.
11:27And that means you
11:28have less resistance, all this, you know, wonderful physics stuff.
11:31But it also means you can fit more powerful units into a smaller amount of space.
11:36And that is what's more important for things like phones and computers and laptops.
11:40It means cooling is easier.
11:42It means
11:42you can get advantages from being able to do things in interesting ways, and so even things like neural networks and AI architecture
11:54are easier to do because you can fit them in smaller and smaller spaces.
11:58Yeah, and just a point on nanometers: a strand of DNA is two and a half nanometers in diameter.
12:05Yeah.
12:05So we're getting really, really small.
12:07And there are also differences between the different types of 7 nanometer processes
12:15you can get. So, for example, Intel's version of 7 nanometer isn't the same as Qualcomm's version of 7 nanometer.
12:22Oh really?
12:23It's not like an industry standard.
12:25It's simply the name for a process, but
12:28when you get into the nitty-gritty, each supplier or, you know, chip fabrication company will have slightly different ways of getting those
12:42benefits, even going into things like 3D architectures, which is just way beyond this podcast. Right, so we're kind of getting sidetracked there, but
12:52if we go back to the overarching story here, the simple fact is that computers are getting faster and more capable.
13:00The hardware has changed massively compared with 20 years
13:03ago.
13:04But many of the databases, many of the principles we rely on, are much more than 20 years old.
13:10And so if we drill into why Hyper is so revolutionary, we can actually start to address some of
13:17these concepts right from the top.
13:20And so let's maybe just go through some of those.
13:24There are really three or four key
13:27concepts here for why Hyper is fast, unlike its competitors. And I'll say competitors in this sense because, if you're going to compare it to other databases, you have to view
13:38other databases as competitors, even though it's actually not a product.
13:42And so the first one is query optimization.
13:45Now query optimization is something I think we're all familiar with in Tableau.
13:49When you're writing a query or when you're building a Tableau workbook,
13:53one of the things you might do is look at the way that Tableau's building the query.
13:57And you might find areas which aren't optimized.
14:00And so what you might do is change the way the chart's
14:03built so those queries are optimized, or you might change the way the database behaves so that that query is optimized.
14:09And exactly, better-optimized queries mean they're basically asking for just the right amount of data
14:16in the right way so that you can then work with it in your visualization or in your database.
14:21It's basically
14:23saying something in the most efficient manner, right?
14:26I think that's the best way to put it.
14:28Because once we start talking about queries, I always think of translation analogies as the easiest way to understand these sorts of things.
14:35And in this case you're basically saying that when you start learning German, for example, you're saying "can you point me in the left-hand direction
14:43to the pub", or something like that, in a long-winded way.
14:46But then when you start learning a bit more and optimizing your language, you're more likely to speak in colloquialisms, saying "where's the pub?"
14:54Exactly.
14:54Exactly.
14:55And and so you kind of hit the nail on the head there.
14:58The way you get better at query optimization is to surround yourself with people.
15:03Right.
15:04And it's because, in databases and even just in general in terms of research, it's very hard to yield results.
15:14And so there are not many researchers
15:16that are actually good at this, because it's simply the kind of thing that takes so long, and even if, let's say, you spent five years researching a certain type of query optimization, it might not actually yield any results that are beneficial in today's
15:30sort of enterprise context.
15:32And so when Tableau acquired Hyper, it just so happened that one of the leading
15:39thinkers on query optimization happened to also be on the Hyper project, and that was Dr.
15:44Thomas Neumann.
15:46And so in many ways
15:48that was something that Tableau got very lucky with.
15:50But uh we should also highlight that Tableau's own team have actually been, you know, from the top, very, very upfront about improving the way that query.
16:01So I'd say this is a small point, but it adds weight to Tableau's
16:06research and, you know, push to make better-optimized queries for a whole range of databases that Tableau interacts with.
16:15Exactly.
16:16I think this is the thing that Tableau's teams do very, very well: they'll push their own boundaries as far as they can go, and then they'll appreciate the fact that they need to bring in some external
16:26influences to help develop the products further.
16:29So, for example, Hyper was one, and then we had the natural language processing company that was bought a couple of years ago now, or last year,
16:39just to help push that thinking process further and move that needle along to keep up with the market and the changing
16:47world we live in.
16:47And also another thing to touch on is Tableau's research team in general.
16:51If you go on research.tableau.com
16:53you'll find a lot of the prominent thinkers in the space
16:57we work in, in data and data visualization and mapping and now databases and optimizations.
17:05Tableau has a lot of the leading thinkers in those fields.
17:11Like, I'm thinking
17:12Maureen Stone on colour theory, you've got Robert Kosara on storytelling, Sarah Battersby on mapping.
17:19So it's very much something that's on Tableau's mind in terms of innovation.
17:24It's part of the DNA of how they work.
17:26Very research driven and they they put as much effort into research as they do into building their product, which is what makes their product better.
17:33It's research-driven and innovation driven.
17:36Um that's the point.
17:37And so that's sort of the first step, right?
17:40When you're talking to a database, you build a query, and if that query is optimized, then the database is more likely to respond faster and better, and it should
17:51also take less time because you're getting it to do exactly what you need it to do.
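Show-notes sketch: one way to picture "asking for just the right amount of data" is to compare pulling every row back and adding it up yourself with asking the database for only the totals. The Python below uses SQLite from the standard library purely as a stand-in, since it is not Hyper and says nothing about how Hyper or Tableau build queries. The `sales` table and both function names are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with a hypothetical sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    rows = [("North", 10.0), ("South", 5.0), ("North", 2.5), ("South", 7.5)]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def totals_unoptimised(conn):
    """Fetch every row, then aggregate on the client side."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def totals_optimised(conn):
    """Ask the database for only the aggregated answer."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return dict(conn.execute(query).fetchall())


conn = build_demo_db()
print(totals_unoptimised(conn))
print(totals_optimised(conn))
```

Both functions return the same totals; the difference is how many rows have to travel back from the database before the answer is known.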
17:55Right.
17:55Now there is actually another step between the query optimization, so writing a better SQL query, and actually getting a response back
18:04from the database about that query.
18:08And for this I'm going to talk about query compilation.
18:11Okay?
18:12So when you write SQL, that is essentially
18:16code that you and me can read.
18:18In fact, in Tableau, if you record the performance of any workbook connected to a database
18:24and then you stop the recording, you'll get a view that shows you all the queries that were sent to the database and the SQL that was written to
18:33request that data.
18:35Now when you actually fire off that SQL query, the computer can't read that, and by that I mean the processor itself does not take your nicely
18:46human-readable SQL and process that.
18:50What it actually does is it converts that SQL query, essentially that code and instruction you've given it, into something called machine code.
19:06Yeah, neither of us are experts on the nooks and crannies of how machine code works, but again it's effectively a translation system.
19:13It's changing it from a language that you and me understand, cutting out the middleman of "let me just quickly stage this and change it", and basically saying, hey, let me help you communicate this better.
19:24Exactly, exactly.
19:25Exactly.
19:25So it's just ones and zeros, lots of numbers.
19:28If you saw it, you'd think, what is this?
19:30It looks like The Matrix.
19:31It literally looks like The Matrix.
19:33And so if we take a step back, what team Hyper discovered, and it's actually
19:40a paper that they refer to, which was written by the MonetDB team, MonetDB being another database, is that MonetDB essentially proved that their hand-coded C program-based queries
19:53performed faster than computer-generated or computer-compiled code.
19:59And so what I'm saying here is that when the team took a query
20:04and wrote it in machine code, essentially ones and zeros, and then got their database to try and do the same thing, the hand-coded version was still faster than what the machine
20:15created for itself.
20:17Which, if you think about it, actually doesn't make sense, right?
20:21Why can't a computer create machine code for itself that runs as fast as what we can write for it, if that makes sense?
20:29And so this is what Hyper tried to answer.
20:32And it actually goes back to just how the database itself is built.
20:38And to kind of keep this
20:41short and not too abstruse or abstract,
20:45what Hyper basically does is it kind of plays into this.
20:50When you send it a SQL query, it doesn't process that SQL.
20:54What it actually does is it converts that SQL into machine code.
21:00And it does that by writing something called LLVM.
21:03And LLVM is an acronym.
21:05But in summary, LLVM is the intermediate step between machine code and a programmer's code.
21:14So let's say you do HTML coding,
21:17or let's say you do SQL programming; what you write is code that then gets compiled into machine
21:26code for the computer to process.
21:27And LLVM is essentially much, much, much simpler.
21:31It's basically one step ahead of that.
21:33So if you were to send LLVM to a computer, it would then almost instantaneously
21:40convert that into machine code that it could then process straight away.
21:43So it's one level in between what a programmer writes and what a machine understands.
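Show-notes sketch: the difference between interpreting a query and compiling it can be shown in miniature. Below, a small filter expression is either walked as a tree for every row, or turned once into generated source code that Python compiles into a function. This is only an analogy: Python bytecode stands in for LLVM's intermediate representation, the tuple-based expression format and all function names are invented for this note, and none of it is Hyper's actual code.

```python
SYMBOLS = {"add": "+", "mul": "*", "gt": ">"}


def interpret(expr, row):
    """Interpreted execution: walk the expression tree again for each row."""
    op = expr[0]
    if op == "col":
        return row[expr[1]]
    if op == "const":
        return expr[1]
    left, right = interpret(expr[1], row), interpret(expr[2], row)
    if op == "add":
        return left + right
    if op == "mul":
        return left * right
    if op == "gt":
        return left > right
    raise ValueError(f"unknown operator: {op}")


def to_source(expr):
    """Translate the expression tree into one line of generated source."""
    op = expr[0]
    if op == "col":
        return f"row[{expr[1]!r}]"
    if op == "const":
        return repr(expr[1])
    if op not in SYMBOLS:
        raise ValueError(f"unknown operator: {op}")
    return f"({to_source(expr[1])} {SYMBOLS[op]} {to_source(expr[2])})"


def compile_expr(expr):
    """Compiled execution: generate code once, then reuse it for all rows."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<query>", "eval"))


# Hypothetical filter: price * qty > 100
expr = ("gt", ("mul", ("col", "price"), ("col", "qty")), ("const", 100))
rows = [{"price": 30, "qty": 4}, {"price": 10, "qty": 5}]
matches = compile_expr(expr)
print([interpret(expr, r) for r in rows], [matches(r) for r in rows])
```

The two paths give the same answers; the compiled one pays the translation cost once up front instead of re-reading the query structure on every row.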
21:49Exactly exactly.
21:50And that's that translation step that I keep referring back to.
21:53That's what it's doing effectively, right?
21:55Yeah, exactly.
21:56And so
21:57And so what happens here is that, because Hyper is doing this, you're actually able to get highly specialized and highly optimized queries generated
22:09for the CPU in question.
22:10Because the other thing to bear in mind here is that not every CPU is the same.
22:14Every single year new CPUs come out, and manufacturers build different instruction
22:19sets into their CPUs.
22:21And so being able to compile to instruction sets that are specific to that CPU allows you to do optimizations for that query that you could only take advantage of on that specific
22:33processor.
22:34So when I say CPU I mean processor.
22:36And you have to keep reminding me not to use acronyms.
22:40Yeah, right.
22:40And then the two main types of processors in the market today are, you've got AMD Athlon...
22:46Well, no,
22:46Athlons are a type; so AMD and Intel, of course.
22:51You've got another sort of split going on, which is between ARM-based processors and Intel-based processors,
23:00and all this stuff is massively in flux at the moment. And so the benefit here, you're probably wondering: okay, so what if it can compile this faster, so what if it can, you
23:14know, get this all done?
23:15What's the value, Tim?
23:16What value am I seeing as a user of Tableau that's better?
23:20Okay, so it essentially comes down to speed and basically how close you can get your SQL query
23:28to the computer's processing.
23:32Okay?
23:33And this is two-pronged.
23:35So number one, it's taking less time to write the
23:40SQL query because you're optimizing it.
23:43Number two, you're then translating that into machine code a lot faster than it would traditionally take, because you're doing it
23:50using that middle step I described called LLVM.
23:54And then earlier on we talked about memory.
23:56Okay.
23:57So memory is a really, really important thing in this game because when you're doing all that work, you need to be able to put
24:03what you're doing somewhere.
24:05So if you give me a pack of cards and you tell me to memorize the order in which they are, I need to be able to put what I memorize somewhere.
24:12And for me, that's my brain.
24:13But imagine if I had to write it down.
24:16That would take me longer to write that down.
24:18Yeah.
24:18I'm assuming that my brain and my memory is better.
24:25But assume I had to write it down: I'd actually physically have to see the card, write it down, make sure I've written it down correctly, go to the next card.
24:33And so you start to get a picture here.
24:35This
24:35is taking time.
24:36And so in computing terms, what this relates to is essentially how far away you are from the job that is actually
24:45being done.
24:46And by this I mean how far away the information that's needed is from the computer, from the actual processing
24:55that's being done.
24:56And it's hard to imagine this because in real terms this is all happening inside of your laptop.
25:02There's almost no delay and it's just happening.
25:06We're talking about infinitesimally
25:08small periods of time that massively scale up.
25:08small periods of time that massively scale up.
25:12And so we're just gonna sort of try and give an analogy, right?
25:15So if I put a bit of information inside of my processor, inside of my computer's processor
25:22That is the equivalent of having that information on my desk in terms of, you know, if I'm doing something and I need it, I need it on my desk, that's just like reaching across the table and grabbing it from my desk.
25:32Okay.
25:34So it's you getting your list of the cards you memorized, right?
25:37Exactly.
25:37I've just literally picked up the list from my desk and I'm now reading it
25:41straight away.
25:42Now imagine that that list that I've written down is now no longer on my desk.
25:47Imagine it's on the bookshelf.
25:49Okay, now that's still in the same room.
25:50So I just get up off my chair and I go to
25:53my bookshelf.
25:54Not so hard.
25:55Now imagine that that list is actually next door.
25:58So I have to get up, go out the door, go next door, get it, look on the right bookshelf, pick it up, come back, sit down, read it, and then process.
26:06Okay.
26:07You're starting to get the idea here.
26:08It's taking longer and longer and longer.
26:10More and more tenuous.
26:11Exactly.
26:12More and more tenuous.
26:13Now imagine we are where we need to be, which is RAM.
26:17So random access memory, what we talked about at the beginning
26:20Okay.
26:21Now RAM is the equivalent of being in the building next door.
26:24So imagine that I'm sitting in my office and I realize I need that
26:28list of cards.
26:29What I've got to do is get up, go out, go into the lift, or call the lift, wait for it to go down, go down, go out, go into the other building, go to reception.
26:38You know, do whatever you need to do, get to the floor that has your list, get the list, and then do all of that to come back again.
26:45That is actually what random access memory is like.
26:48Okay?
26:48And that sounds really, really slow.
26:50But wait till I explain how a hard drive actually works.
26:54You see, when your computer talks to a hard drive
26:57In real terms, that's the equivalent of having your list on the other side of the planet.
27:02Okay.
27:04So although your hard drives and SSDs are getting ridiculously fast, in computing terms
27:10the information is actually being stored incredibly far away.
27:15And this all gets in the way of processing data and processing
27:19information very, very quickly.
27:21And so because Hyper does as much of its processing in memory as it possibly can, it's effectively minimizing
27:30the travel between the data and where the processing is done.
27:34So it's essentially the difference between going to the building next door and going to the other side of the planet.
27:41That's basically
27:42the best way to think of it.
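Show-notes sketch: the desk, building next door and other side of the planet analogy can be put into rough numbers by stretching time so that one nanosecond becomes one second. The latencies below are commonly quoted order-of-magnitude figures, assumed here for illustration only; they are not measurements, they vary a lot between machines, and they were not given in the episode. The dictionary and function names are made up for this note.

```python
# Rough, assumed order-of-magnitude access times in nanoseconds.
APPROX_LATENCY_NS = {
    "CPU cache (on the desk)": 1,
    "Main memory, RAM (building next door)": 100,
    "SSD random read": 100_000,
    "Spinning disk seek (other side of the planet)": 10_000_000,
}


def human_scale(latency_ns, seconds_per_ns=1.0):
    """Describe a latency as if every nanosecond lasted `seconds_per_ns` seconds."""
    seconds = latency_ns * seconds_per_ns
    for unit, size in (("days", 86_400), ("hours", 3_600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.0f} seconds"


for name, latency in APPROX_LATENCY_NS.items():
    print(f"{name}: {human_scale(latency)}")
```

On this stretched clock, a trip to RAM is a matter of minutes while a trip to a spinning disk is a matter of months, which is the gap the in-memory approach is trying to avoid.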
27:44And so in summary, a lot of databases today use the hard drive option.
27:49They don't store a lot of information in memory.
27:52In fact they don't store much, if any, information
27:55in memory.
27:56It's all read from disks, all read from the hard drive.
27:59And then if you add in the factor that a lot of these databases are behind firewalls, they have security layers and
28:05authentication options, you start to see that when you're firing off a query and it's taking a while to come back to you, it's all
28:13relative to all of these different things added up together.
28:16So it might be on a hard drive, but then you also have to jump over all of these different things to get to the other side of the earth, in this case.
28:23Exactly.
28:24Um
28:25before you get your response, which is why you sort of get people's annoyance with the fact that it takes ages to query big data, as it were.
28:32Yeah, exactly.
28:33It's
28:34funny because you will get technologies, you get companies, who claim to have databases
28:42that are super fast.
28:43But what they're doing is essentially brute-forcing this approach, which is, they're storing all of this information
28:51in an architectural format where basically the computer knows where absolutely every single bit of data is.
28:57So it's actually cheating.
28:58It's a bit like having
29:00Uh yeah, yes, your data might actually be next door, but what you go and do is you go take a picture of it on your iPhone.
29:06So the next time you need it, you don't actually have to go next door.
29:10You just pull out your phone and look at the
29:11list, right?
29:12But it still takes you time to get your phone out, open it up and look at that list, and that's still not as fast as having it, you know,
29:20just right next to you.
29:21Okay.
29:21And so the other thing is that we're talking about having everything in memory.
29:27And that's not the sort of computing world we're in yet.
29:30We still all have SSDs and hard drives,
29:34whilst the cost of RAM-only systems is prohibitive.
29:37But uh you know, technologies like this already exist.
29:40Intel, for example, have a hard drive you can purchase.
29:44which is all based on RAM.
29:45It's just all RAM.
29:46It's a 200 gig RAM-based hard drive.
29:49There's no hard drive other than that.
29:52And it persists the information you store on it when you switch it off.
29:55And that's actually the
29:56technological challenge.
29:58How can you get these memory systems to keep the information they have in them so that they're continuously fast all the time?
30:04And so that's sort of the third thing in Hyper's
30:10book of tricks. Okay. How do you process information quickly, and how do you get it quickly? Now the last one, the last key technological
30:24achievement for Hyper is the way that it parallelizes tasks.
30:28I can never say that phrase properly, so excuse me.
30:31It's a bit of a tongue twister.
30:33Parallelization.
30:34I think that's
30:35I said that right this time.
30:36So parallelisation is the concept here.
30:39And in Hyper's academic paper they actually call it morsel-driven parallelisation.
30:47Okay.
30:47Mm-hmm.
30:48And so I'm just gonna go straight to analogies here because it's much simpler.
30:52Imagine I've got a cake, okay, and I have four people.
30:55Okay.
30:56Now, in modern terms, that's a four-core CPU.
30:59That's a
31:00central processing unit with four cores.
31:03Okay.
31:04And the most effective way of actually dividing that cake up is to cut it into four, right?
31:10And then I feed it, I feed each quarter to each person.
31:14Okay?
31:14And the task here is to finish eating the cake.
31:17Okay?
31:18Now you immediately get problems.
31:20For example, what if you're one of those people, Ravi, and you eat cake
31:24faster than everyone else?
31:25Okay?
31:25Yep.
31:26You're gonna finish your cake much sooner, and then, in computing terms, you're gonna have nothing to do whilst me and maybe two other people are being polite about eating our cake, you know?
31:36We're taking our time and meanwhile you've got nothing to do.
31:39Okay?
31:40Yep.
31:41The other problem here is that what if a fifth person turns up and I've already cut the cake in four?
31:46It's actually quite hard to
31:49re-cut the cake into five pieces once I've already cut it into four.
31:52I have to go figure out how much I take off each person's cake.
31:55And then that person gets, like, a part of a cake.
31:58Yeah, yeah.
31:58Exactly.
31:59And if you turn up halfway, people have already eaten cake, and if you're
32:03eating cake faster than everyone else, I've got even more complicated maths to do here.
32:07And although a computer could do it very quickly, it's still not an efficient way of doing it.
32:11Okay?
32:12And so
32:14The key challenge here is how do you parallelize a task, okay?
32:19And the way Hyper does it, to simplify it, is it actually cuts the cake into millions of pieces.
32:25Okay?
32:26And it basically puts those pieces onto a table.
32:29And then it looks to see, well, who's here to eat the cake?
32:32Then it just tells everyone, just start eating.
32:35Don't worry about which bits you eat.
32:36Just start eating the cake.
32:38And so everyone's eating cake.
32:40And because the bits are so small, if you eat cake faster than everyone else, Ravi, that means you get a bigger share of the cake.
32:46There's no sort of equal division based on the number of
32:49people in the room.
32:50It's just lots of pieces, off you go.
32:52And if a fifth person turns up midway, they can also contribute to the task.
32:56And if someone has to, you know, go away and do something else, which in computer
33:03Then that also works.
33:04And it essentially means that everyone still finishes eating cake at roughly the same time.
33:10Because if you're eating faster, you just eat more
33:13cake and if you eat slower you just eat less cake.
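The cake analogy can be sketched as a small simulation. This is only an illustration of the scheduling idea as described in the episode, not Hyper's actual implementation: the function names, the eating speeds and the morsel size are all made up for the example.

```python
import heapq


def static_split(total_work, speeds):
    """Cut the cake into equal slices up front; the slowest eater decides the finish time."""
    share = total_work / len(speeds)
    return max(share / speed for speed in speeds)


def morsel_driven(total_work, speeds, morsel_size):
    """Put small morsels on the table; whoever is free next grabs the next one."""
    # Heap of (time_when_free, worker_index); everyone is free at time 0.
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    remaining = total_work
    finish = 0.0
    while remaining > 0:
        now, worker = heapq.heappop(free_at)
        piece = min(morsel_size, remaining)
        remaining -= piece
        done = now + piece / speeds[worker]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, worker))
    return finish


# Hypothetical eaters: one eats four times faster than the other three.
speeds = [4.0, 1.0, 1.0, 1.0]
print("equal quarters:", static_split(1000, speeds))
print("tiny morsels:  ", morsel_driven(1000, speeds, morsel_size=1))
```

With equal quarters the fast eater sits idle while the others finish; with small morsels the fast eater simply takes more pieces, so the total time approaches the work divided by the combined speed.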
33:16And so this has actually got a wonderful sort of name to it.
33:20There's this guy called Amdahl who wrote a law.
33:24Okay.
33:24And when we actually apply it to
33:26computers you start to understand why this is a big challenge.
33:30If I take a very simple 32-core Tableau Server, okay, and
33:37let's assume we're just sending it one task and 95% of that task can be parallelized.
33:44Okay?
33:45And so by that I mean if I'm cutting a cake and the task is to
33:48eat the cake, there might be one task which is, let me know when the cake is finished.
33:53Okay.
33:53And that task can't be parallelized because
33:56it's essentially one person's responsibility to basically look at the cake and watch the table and let me know when the cake is finished.
34:05So if I parallelize a task to about 95
34:08percent, okay, across 32 cores, what actually ends up happening is that I only get about 40% utilization of my 32 cores. Because of the way the maths works and the way the parallelization
34:23has a drop-off rate, it basically means that even with more power you don't exactly see the benefits.
34:29And this is the problem with enterprise hardware and enterprise architecture: you don't actually
34:33get massively parallelized tasks. And this is how Hyper sort of combats this: by
34:42doing this morsel-driven parallelization, it's building resilience against something called skew, and skew is essentially the difference between how processes perceive a task and when they start and when they finish them.
34:54Okay.
34:54And so that's one of the big benefits.
34:58I would give another example, which is very simple.
35:01If you have a task that's nearly a hundred percent parallelized, so ninety-nine percent
35:06parallelization, and you have 200 cores, you only get a 63x speed improvement, even though you have 200 cores, compared
35:15to just a single core.
35:16And that means you get about 32% utilization.
35:19Again, because the drop-off rate is really bad.
35:22And so the more cores you shove in there, it doesn't actually mean the more speed you get.
35:27And so you're
35:27not taking full advantage of the CPU cores that are actually sitting on your computer.
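The two examples can be checked against the standard form of Amdahl's Law, speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction and n the number of cores. This is a minimal sketch with illustrative function names. Plugging in the episode's inputs gives roughly 12.5x (about 39% utilization) for 95% on 32 cores, and roughly 67x (about 33% utilization) for 99% on 200 cores, which is in the same ballpark as the figures quoted above.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's Law: the serial part of the task caps the achievable speedup."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def utilization(parallel_fraction, cores):
    """Speedup actually achieved as a share of the ideal `cores`-times speedup."""
    return amdahl_speedup(parallel_fraction, cores) / cores


for p, n in [(0.95, 32), (0.99, 200)]:
    print(
        f"{p:.0%} parallel on {n} cores: "
        f"{amdahl_speedup(p, n):.1f}x speedup, {utilization(p, n):.0%} utilization"
    )
```

Note that the law assumes the work is split perfectly; skew between cores makes real results worse, which is the gap morsel-driven parallelisation is trying to close.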
35:32I guess the tricky part is the fact that, you know,
35:35it's such a big change moving away from the old system as it was, right?
35:39Yeah.
35:39Like what we have here compared to what Hyper's doing,
35:44this is what's revolutionary.
35:46This I think this is the big crux of it, right?
35:48It's this processing speed and the fact it's compiling a query in such a different way
35:54Um
35:55that it's just completely different, and it's hard for a competing technology to do something similar unless it's a new competing technology, right?
36:03But I'm I'm talking about the old school, enterprise level um
36:11Exactly.
36:11In many ways, you could only really build a database like this if you started building it today,
36:17which is essentially what Hyper is.
36:19Hyper has been built out of a modern perspective and a modern way of thinking about this.
36:24And you might ask, well, why don't all databases do
36:27this? And the short answer is that it's a lot of work.
36:31If you've built an infrastructure that lots of enterprises rely upon
36:35and you suddenly try and change it, that's gonna cause bugs.
36:39It's gonna cause sort of poor performance for a period of time.
36:42So the only way you can get the real benefit is by building something from
36:46scratch.
36:46And you can obviously understand why that's not in the best interests of of some of the existing incumbents, you know?
36:53And you don't want to have to tell your customers, hey, by the way, we're building an entirely new way of working with
36:59databases, it might cause problems, but trust me, it's better.
37:02Whereas, you know, Tableau customers have, I think, traditionally shown that they are interested in new ways of doing things.
37:09And at least for now, Tableau can take advantage of the fact that they have a clean sheet.
37:14And they're building a database from scratch in a modern world compared to the world we lived in 20 years ago.
37:21Yeah, so
37:22What I think Tableau might struggle with here is just the adoption and understanding.
37:27Like, I think that a lot of the time when Hyper is mentioned as being this innovative process, it's not given its dues in terms of that advantage.
37:37Because of this thing, like we're saying it's faster, but you don't want to explain everything you just went through about query compilation.
37:44Like imagine saying that in an elevator pitch.
37:47If you're speaking to a C-suite guy, he's like, I don't care.
37:50I fundamentally don't care why it's faster.
37:53Is it faster and am I gonna get my results quicker?
37:56And with all due respect to the researchers
37:58and, you know, the smart minds behind this, I'm massively simplifying what I'm saying and I barely understand it. So I read the first page of one of the papers on Hyper.
38:12Yeah, and it's called Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems.
38:20And I was just like, whoa, well, MVCC for main memory database systems, of course.
38:27I was just like, what does that even mean?
38:29And
38:30And that's the thing, you're talking about an incredibly deep level of understanding.
38:38And, you know, what everyone takes away from it is "it's faster", and it's actually not just that.
38:43There's a lot more to it.
38:45And, you know,
38:46people just need to spend a bit more time with the... Yeah, but I think fundamentally "it's faster" is an easy way of doing it because, as we mentioned, Dr
38:55Thomas Neumann is one of, what,
38:57five people now? I don't know.
38:58I I don't know how handful of landscape is.
39:04At that level, um
39:06Which is why, again, a lot more should be made of the fact that in buying Hyper they also got one of the leading thinkers in this space.
39:13Um
39:14And it's people like him. It's sort of like when you speak to a developer at Tableau and they're asking you, oh, so what are your opinions on, you know, what would happen with this thing that we've done?
39:25It's like,
39:27I trust your judgment.
39:28And in a similar way, I trust this guy's judgment that what he's doing is the best and most optimal way, right?
39:33Like I'm not gonna be like
39:35wait a second, I've got a good idea about MVCC.
39:38I don't have the knowledge to do so.
39:41Um but understanding this allows me to be a bit more like, hey, when I'm troubleshooting or I'm trying to understand
39:48Is it me?
39:49No, no, it could just be the database.
39:51It could be the fact that when I'm making this query, it's not as optimized as it could be.
39:55Or when I'm exactly
39:56when I'm working with this database and I'm trying to convert it to Hyper, what could be happening is this.
40:02It gives me that sort of understanding about why things are happening.
40:06Um rather than sort of I'm now gonna go away and
40:09start working on my own um hybrid database, right?
40:13Exactly, exactly.
40:14And it's actually quite hard to do if you're an incumbent.
40:16Um exactly.
40:17So the funny thing is all this research is published, it's out there, and
40:21you know, in a funny way, Tableau's being very transparent about the innovation here because this is all available to the public and the ideas and the papers are all out there.
40:31Other people could very much just read these and think of
40:33similar concepts, maybe even better concepts.
40:45Because
40:45what people often forget is Tableau's developers and the product launch teams are like three versions ahead of what we see, right?
40:52So they might even be using completely different technology to what we are seeing and using today, the stable version.
40:58So
40:59I think what you said earlier was right.
41:01Like, I think the competitive advantage that Tableau have in this case is the fact that not every database can do this, at least the old school ones can't, because they'll have to fundamentally change the core of it, right?
41:12It's exactly how the processing works that has to be changed.
41:15I guess that's the hard thing, you know, steering the cruise liner once it's heading in a certain direction is really, really
41:24hard.
41:25And I think this leads us really nicely to the now what, which is like, okay, we've now talked about this bit about Hyper and the innovations and all these different things about query compilation, LLVM and all
41:37the complex stuff, which basically boils down to: it's a bit faster, it's translating a lot quicker and it's able to be more flexible
41:47Right.
41:48in how it approaches a task.
41:50But those are the three fundamental things, right?
41:53But for me, when we sort of start reading through this and finding out a bit more, it's the future that's really exciting, right?
42:00Right.
42:01The fact that Tableau's changed from a TDE to a .hyper file means that they're able to then say, okay, cool, so we've now got something that's doing things faster,
42:10in these nanoseconds. How can we use that to leverage and get a marginal increase in how we're approaching, like, viz in tooltips, right?
42:19How can we use this to make
42:22Tableau Prep so much better than it already is, right?
42:26Exactly.
42:26And that's the thing I think that is hard to miss here.
42:30You know, things like Tableau Prep are only really possible
42:34because of the innovation in Hyper. You know, charts in tooltips:
42:37again, only really possible because of the integration of Hyper.
42:42And it wasn't.
42:42And even then you've got a bit of lag, right?
42:44Like, there's still that tiny, tiny second when you hover over something and it's like, oh god, let me go fetch that chart.
42:50But it's still fetching that chart, doing the filtering.
42:53And everything you've set up to do for you in that moment.
42:57Yeah, exactly.
42:58Exactly.
42:58And I think if you're looking at the now what
43:02Um
43:03piece,
43:04the key thing I'd say here is that what Hyper ends up being, once you take all this into account, is it ends up enabling a type of database that hasn't traditionally
43:14existed before.
43:15In a typical enterprise context today, you have a transactional system.
43:20So this is basically your bread and butter.
43:22This is how all businesses work.
43:24These are the kind of, you know, systems
43:26that if they go down, you can't take transactions on a website and everything stops.
43:31You literally can't do anything else.
43:33And then analytics, which is a function of those
43:37transactions, is done on a separate layer.
43:40So you have maybe an ETL process or workflow that takes the data from your transaction-based systems and loads it into the analytical systems.
43:48And so what you're doing with analytical systems
43:51is scanning data.
43:53And what you're doing with transactional systems is actually editing and modifying individual rows.
43:59And those two things are massively different in terms of the requirements they have.
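The contrast between the two workloads can be shown with a tiny example. This is only a sketch: it uses Python's built-in `sqlite3` module as a stand-in database (not Hyper), and the `orders` table and its contents are invented for illustration.

```python
import sqlite3

# An in-memory SQLite database standing in for "the database" (this is not Hyper).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (id, region, amount) VALUES (?, ?, ?)",
    [(i, "north" if i % 2 else "south", 10.0) for i in range(1, 1001)],
)

# Transactional-style work: locate one row by its key and modify it.
conn.execute("UPDATE orders SET amount = ? WHERE id = ?", (25.0, 42))
conn.commit()

# Analytical-style work: scan every row and aggregate the result.
totals = dict(conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region"))
print(totals)
```

The update touches a single row found through its key, while the aggregate has to read the whole table, which is why systems have traditionally been tuned for one access pattern or the other.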
44:05And then you also have what's typically called beyond-relational activities.
44:10And so these are things like, you know, charting, Hadoop, data mining, things that happen another layer above analytics because
44:19they need to happen on much longer timescales.
44:22And it's actually hard to have all these systems in one place because they're all so far removed from the raw data.
44:30And so the real... And also this is where you get that latency between... Yeah.
44:34You know, the fact you've got all your transactional data going to one place.
44:39Then you have to do some ETL on top of that to put it into your analytics data warehouse.
44:44And then you can do your actual analytics, right?
44:47Like, it's those three-step processes that then add time and computational power and ultimately cost.
44:54Exactly.
44:54You have three instances of the same thing, basically,
44:57because of business continuity or, you know, risk factors. You can't afford to do analytics on the data you should be doing
45:06analytics on. And so Hyper tries to sort of solve this by basically offering one place to do this all. And I think that the easiest way to sort
45:16of draw people's attention to this is the demo in the Vegas conference where they had on stage a large amount of weather data from the
45:26US, and what they were doing is they were reading data from a Hyper extract.
45:33Number one, it created it really
45:34quickly, and even after it created it, what they ended up doing is loading data into the extract without having to refresh the extract.
45:43And so this is sort of this idea of being able to do analytics
45:48with transactional data and then being able to do other bits of analysis on top of that.
45:55And if you think of Tableau Prep with
45:57technologies such as Python and R being integrated, you start to be able to do data mining, analytics, and all of this stuff all in one place, in one sort of architecture, or as Tableau calls it, the Tableau platform.
46:12As it were.
46:13Exactly.
46:13And and this is this is the stuff that people often miss out, right?
46:16The fact that this technology is enabling so much in the back end, and it's helping
46:23the developers at Tableau understand a bit better, like, hang on, hang on, hang on.
46:26So wait, so if I compile this query in this way rather than the previous way,
46:31I'm now getting, what, a 2x, whatever-x improvement?
46:34Yeah.
46:35Why am I not doing it that way?
46:36And then you're suddenly unleashed into the wild and able to sort of build out further, right?
46:41So Exactly.
46:43Exactly.
46:44Okay.
46:45Cool.
46:46That was really dense, wasn't it?
46:48That was very dense.
46:49Yeah, that's something that really wakes you up in the morning. If the What So What Now What podcast is part of your morning commute, then
46:57This will definitely get your brain in gear and we read it.
47:08a whole bunch of really advanced concepts throughout the entire pod.
47:13But it's really interesting.
47:15You know, as Ravi said, definitely check out the research pages on the Tableau
47:19website.
47:20Hyper actually has its own website you can go to.
47:22If you just Google Hyper,
47:24Hyper database, you'll find their website, and on there they actually publish
47:29all their white papers. Going all the way back to when they were founded up until today, they have about fifty or sixty papers that have now been published
47:37that basically touch on everything we've touched on in way more detail.
47:42If you so wish to find out more, right.
47:45Yeah.
47:46The other gem, if you can get hold of it, is
47:49the twenty sixteen recording from conference. It's called Boom.
47:56Uh what is it called?
47:57Boom, there goes the database.
47:59It's basically a Hyper presentation from the Austin 2016 conference.
48:04And that basically summarizes how Hyper works and how it's different to traditional databases.
48:12It's about 25 minutes long, it's really good.
48:14And there's obviously some Q&A at the end.
48:17It's funny, most of the questions from that twenty sixteen session have now been answered.
48:24So it's a great thing to watch because we're now, you know, twenty
48:29sixteen, my word, we're nearly, yeah, so three years on, nearly four, and obviously we've seen Hyper now, but it also gives us a glimpse of what we haven't seen yet that is going
48:42to come.
48:42So that'll be really interesting.
48:44Exactly.
48:49Hang on a second, they've not talked about that for a while.
48:50And it's like, hang on.
48:51So maybe it's under wraps and they're going to do a big reveal.
48:54Exactly.
48:56But yeah, no, for me it's been really interesting, mainly because it's given me a lot more topics to talk about and touch on, and a better understanding of exactly what's going on and what's what.
49:08Um
49:09Which means I can speak a bit more intelligently about it rather than, you know, my soundbite of
49:13it's just faster,
49:14let me show you this video from Craig Bloodworth where he's comparing a TDE extraction to a Hyper extraction and how much faster it is.
49:21Um
49:22But yeah, there's a lot more going on that Tableau can leverage, and it's absolutely something that's quite exciting in terms of what's possible with it.
49:33So yeah, no, thank you for listening to this episode.
49:36You can obviously find us on our website, threewatspod.com.
49:40You can get in touch with us on Twitter or you can reach out to Ravi or myself on Twitter as well.
49:47Um do be sure to check out the show notes.
49:49We add lots of valuable information in there.
49:52Show notes are now available in Apple Podcasts as well as Overcast or any other
50:00podcast app of choice.
50:01What do you use, Ravi?
50:02I've actually moved on to Overcast.
50:05Yeah.
50:05Yeah, because for me the thing that's quite nice about Overcast is the speed controls.
50:10Oh yeah.
50:11I can 2x, and then also there's a smart speed function which, like,
50:14gets rid of silent periods in a podcast.
50:16So like you don't get the awkward silences when some people are recording.
50:19Um they just get cut out and it speeds up through it.
50:22So exactly, you feel like you are learning a bit more in a shorter amount of time.
50:26Yeah, my top recommendation on iOS would be Overcast, it's entirely free.
50:32If not, Apple Podcasts, and then on Android, I think Pocket Casts or TuneIn Radio, and also Google Podcasts now
50:41is a podcast app.
50:43So you can use any one of those apps.
50:45You can obviously add our feed yourself if you can't find the podcast on those apps:
50:53just head to our website and add the RSS feed and you'll be able to listen on any podcast app of choice.
51:00Thanks so much for listening.
51:01We'll catch you on the next podcast, which will be towards the end of Jan.
51:05Take it easy folks.
We’re back with our first main episode for 2019 discussing Hyper, the next-generation database acquired and now implemented by Tableau into its suite of products. We’ll talk about the theories behind why it’s faster than conventional databases and how it might enable more than just speed in future feature launches by Tableau.
Notes
• HyPer (https://hyper-db.de/) – A Hybrid OLTP&OLAP High Performance DBMS
• Tableau Acquires HyPer (https://www.tableau.com/about/press-releases/2016/tableau-acquires-hyper)
• Tableau conference 2018 session on Hyper (https://tc18.tableau.com/sites/default/files/session/assets/18BI-053_Boom%20goes%20the%20data%20engine.pdf) : Boom goes the data engine
Feedback welcome on Twitter to Ravi at @scribblr_42 or Tim at @tableautim - or e-mail us, at datumpodcast@gmail.com (mailto:datumpodcast@gmail.com)