The Race To Solve The Data-ML Divide: Designing The Next Data-Centric AI Platform That Can *Actually* Help Drive Value With Gen-AI
Basically 👩🏻‍💻 New Job Alert 👩🏻‍💻
Gen-AI: Highlighting The Need For A Reconciliation Between Data & ML People
As many of my talk attendees1 and Substack readers know, for a long time I've felt uneasy about how decoupled the data layer has been from the modeling layer.
We know the following:
✅ Great data is a necessary ingredient in great ML products, and there's an incredible amount of operational complexity in scaling a single, standalone unimodal application to one of many multimodal services as part of a comprehensive offering. As challenging as it is to scale ML in production with structured data, unstructured data like audio, images, text, geospatial, etc. brings additional challenges.
✅ Data can be a powerful differentiator & competitive advantage. But if everyone is training & serving on the same data, then there's no real difference between competitors.
✅ Unstructured data can be incredibly rich but has also been incredibly time & resource intensive to unlock. It requires a complex orchestration of labelers, data ops teams, & data engineers to build up the datasets, MLEs to develop & train models (even offline), & platform SWEs to deploy the model pipelines. Then, hopefully, the model performance gets logged someplace (maybe in an ELK stack) for a DS or MLE to manually review or script for evaluation, only to go back to the source in case there are issues with data quality or more labeled data is needed.
Some Titles Are Useful, All Are Wrong
At one point I was a growth hacker2.
Then, 1-2 yrs later, a data analyst.
Then a data scientist. Then an early-stage startup “analytics engineer”.
And after many pivots & sideways slides, I became an MLOps Engineer. Who leveraged content to become a DevRel. That still codes & helps consult & build ML platforms.
And over the years I’ve received tons of messages asking:
“How do I break into X/Y/Z career?”
“What master’s program should I do?”
“Which job should I pick?”
“Which tools should I use?”
“Which certificates are more valuable?”
Maybe it’s because I’m now closer to “older-than-dirt” territory than “spring chicken”, but I have fewer & fewer concrete answers to that category of questions.
And yet it’s important — being able to communicate, define and quantify your value ensures you get your next job, your next client, build your brand, connect with your peers and partners, find your group.
But a trend that I’ve started noticing as part of my responsibilities as a DevRel is that some of the most viral projects, libraries and tools in the LLM and Gen-AI space weren’t, in fact, built by data scientists or ML engineers.
They were built by individuals or groups of individuals that had the technical capabilities & curiosity without necessarily the domain expertise.
Different Day, Same Problems? I Think Not.
One of the most ridiculous statements I’d heard a few years ago, from an influencer friend who shall not be named & shamed because they’re otherwise a great source of inspiration, was that the problem with MLOps was “tools”.
Orly?
Like, a lack of great tooling that solves the *exact* problems that each of the unique individuals, teams and orgs in the world need solved.
I’m willing to forgive that person because frankly, it’s been a while since they’d built and shipped an ML system.
It’s easier than ever to build an incredible project, ship it, and then scale it up.
Don’t want to spin up a ton of infra? Great, you have an abundance of options including serverless.
Don’t want to spend money on a beefy computer? Great, cloud-based IDEs that allow you to ship pre-virtualized & containerized apps written in your language of choice are available.
Don’t want to use Airflow or Kubernetes? Every single one of the major cloud providers and even a bunch of startups offer alternative orchestration and scheduling options.
If you’re a solopreneur, a startup, or even an “intrapreneur”, you don’t need to be doing the same things that your kindred in enterprise land are doing. Don’t worry, eventually they’ll catch up to what you’re doing because that’s the circle of life when it comes to disruption.
On That Note…
I’m excited to announce that I've joined Labelbox as their Head of AI Developer Relations!
We have a unique opportunity to bridge the many gaps between the people of data, ML, and SWE.
I'm excited to do my part in helping upskill, empower, and grow the next vanguard of innovators, builders, and makers of ML.
Up Next…
But most importantly, I’m excited to have the backing of a company like Labelbox to do more of what I’ve been doing and want to keep doing.
And from the list of exciting projects I have in the works, I can say that I’ll be back to writing regularly on LinkedIn and here about MLOps, Data-Centric AI, the nitty-gritty of building ML Platforms & ML products, and how to navigate the evolving landscape of data & ML.
Additionally, I’m currently scheduled to speak at two virtual conferences in October:
Lesbians Who Tech — “The Fun-Sized MLOps Stack from Scratch” — Oct 16, 1:30 PM - 2:00 PM PDT (30 Min)
- ‘s Data Engineering And Machine Learning Summit 2023 — “MLOps Beyond LLMs” — Oct 25, 10:00 AM - 11:00 AM MDT (45 Min)
For Ben’s conference I’ll be (virtually) joining friends like
, etc.
Missed my Australia Data Eng Bytes talks? Want to take a crack at the slides? Check them out here: “MLOps Beyond LLM”, “The Full-Stack Data Scientist Is Still The Sexiest Job”, “Featurization & Feature Stores: A Crash Course In The ML Lifecycle & MLOps”
Video: “The Full-Stack Data Scientist Is Still The Sexiest Job”
Yeah, I know, still sounds like a job you’d only find in Insufferable Valley but it is a real role with impact.