The road to the data lakehouse – Protocol

Hello and welcome to Protocol Enterprise! Today: how open-source projects from big companies like Netflix and Uber helped create the data lakehouse, Knative finds a familiar home and here come the robots.

Spin up

It’s hard to overstate the impact that open-source software has had on enterprise tech over the last two decades, and it’s not slowing down. According to Red Hat’s State of Enterprise Open Source report, 80% of IT leaders plan to use open-source software as they adopt emerging enterprise technologies.

An open house on the lake

The big data compute team at Netflix was dealing with some pesky data aggravations a few years ago.

“Earlier this week, we had somebody go in and rename a column, and in one engine they were getting results, and in the other ones they were getting null,” said Daniel Weeks, then a Netflix engineering manager, speaking at a 2019 developers’ conference. Weeks, who headed that team, was working with others on a new way to resolve the data-processing engine complexities that had prevented smoother analysis of the data rushing into the Netflix streaming service.

The new approach that was under construction at Netflix, with help from developers at companies including Apple and Salesforce, became an open-source standard for table formats in analytic datasets called Apache Iceberg.

  • While most companies don’t need to perform business analytics on top of tens of petabytes of data the way Netflix does, data architectures including Iceberg and Hudi — a system incubated inside Uber to solve similar problems — now form the foundation of products sold to other enterprises as so-called data lakehouses.
  • Dremio, which calls itself a lakehouse company, announced Wednesday that its Dremio Cloud data lakehouse platform — based in part on Apache Iceberg — is now widely available.
  • “A lakehouse needs to be open source: That’s why Iceberg has started to get so much momentum,” said Tomer Shiran, founder and chief product officer at Dremio.

Right now, open-source data lakehouse architectures are following a pattern seen with other data standards built or used inside large Silicon Valley tech companies before businesses began moving data to the cloud.

  • What vendors today call “the lakehouse” is, to many data professionals, just an evolved version of the data lake that combines elements of the traditional data warehouse.
  • A data lake is essentially a receptacle for ingesting information, such as website activity data showing what movie content people perused, or data associated with trips taken through a ride-hailing app.
  • The lakehouse provides a structural layer on top of the otherwise raw and chaotic data stored in a data lake, allowing data scientists and others to perform analytics processes such as querying the data without having to move it first into a more structured warehouse environment.
  • “Moving data can be very expensive from system to system,” said Ben Ainscough, head of AI and Data Science at business intelligence tech company Domo.

“People have been shouting about data silos for effectively ever,” said Boris Jabes, founder of Census, which makes software to help companies operationalize data for analytics.

  • What’s different today, Jabes said, is that sales, marketing or other teams can each run their own data workloads separately on the same storage layer.
  • “There’s a lot more infrastructure that can be shared now,” he said.
  • When Vinoth Chandar, founder and CEO of Onehouse, worked at Uber as a senior staff engineer and manager of its data team starting in 2014, “we ran into this predicament,” he said: People in different divisions realized that one team’s data might reflect recent updates while another team’s did not.
  • That meant teams analyzing what was happening in specific cities had been working from different data.

At the time, Uber kept its data warehouse on-premises and used data infrastructure including Hadoop to manage the analytics and machine-learning algorithms it was building to do things like decide how trip prices should change when it rains. Its effort to fix those data inconsistencies became Hudi.

  • In the past it was only the Ubers or Facebooks of the world that could afford the hardware and software infrastructure necessary to use these types of technologies in their own data centers, but today the more widespread cloud-centric data ecosystem is ripe for broader adoption of those technologies by other businesses, said Rockset’s Venkat Venkataramani.
  • Because Iceberg and Hudi were designed to work in cloud environments, where companies can afford to manage large volumes of data and easily estimate costs of performing queries and analytics using that data, Venkataramani said, the barriers to adoption have been lifted.
  • “It’s the market demanding projects like Hudi and Iceberg,” he said.

That could bode well for Weeks, the former Netflix engineer who helped create Iceberg. Just last year, along with two other former Netflix data wranglers who also helped create Iceberg, he co-founded Tabular, a startup building a data platform using Iceberg.

— Kate Kaye


At HashiCorp, we believe infrastructure enables innovation. We help teams operate that infrastructure in the cloud. Organizations rely on our solutions to provision, secure, connect, and run their business-critical applications. Our products provide multi-cloud infrastructure automation, and underpin some of the most important applications for the world’s largest enterprises.


Knative reunited with Kubernetes

Please meet Donna Goodison, Protocol Enterprise’s new infrastructure reporter! Donna joined us from CRN this week and makes her Protocol Enterprise newsletter debut here.

The Google-founded Knative project is now officially in the hands of the Cloud Native Computing Foundation (CNCF).

CNCF’s Technical Oversight Committee voted to accept the open-source, Kubernetes-based platform, which allows developers to build serverless applications, as an incubating project. Incubating is the maturity level that CNCF projects reach before graduation.

Alibaba Cloud, Bloomberg, IBM and VMware are among the production users of Knative, which Google founded and launched four years ago and developed with IBM, Red Hat, SAP and VMware.

Google, which ceded direct control of Knative in 2020, announced in November that it wanted to hand over the platform, including its code, intellectual property and trademark, to the vendor-neutral CNCF after it reached version 1.0 status and was deemed stable for commercial use.

— Donna Goodison

Investors catch AI’s third wave

Expect to see more illustrations of robots and humans shaking hands in enterprise news reports over the next few years.

Sanctuary Cognitive Systems, a cognitive AI and robotics company based in Vancouver, Canada, aims to build robot software that attempts to enable memory, sight, sound and touch the way the human brain does. And, rather than devise robots for specific use cases, its goal is to create robotics that work for any purpose.

Now Sanctuary has a lot more money to put toward its lofty goal. The company announced Wednesday that it raised $58.5 million in series A funding from investors including Verizon Ventures, Canadian government-affiliated Export Development Canada and Canadian healthcare company SE Health.

Cognitive AI researchers say we can expect advances in so-called “third wave” AI in the next few years. Not only are these areas getting funded, but researchers at large companies like Intel are working on cognitive and neuromorphic AI that is adaptive and recognizes context in ways only humans and animals do today.

But don’t worry: Sanctuary believes that even if its robots think like humans, they will merely augment the special talents of the human workforce rather than replace people.

— Kate Kaye

Around the enterprise

Miguel de Icaza, who has played key roles on Microsoft’s AI and developer tools teams since it acquired his company Xamarin in 2016, is leaving the company.

Oracle and SAP announced that they would stop doing business in Russia as the fallout from the invasion of Ukraine continues.

Snowflake investors were not satisfied with a 101% jump in revenue and guidance predicting around 66% growth in product revenue during its upcoming fiscal year, underscoring how weird Wall Street can be.

Snowflake also announced that it had acquired Streamlit, which makes tools that let developers quickly build web apps based on data scripts, for $800 million.



Thanks for reading — see you tomorrow!
