Season 1 · Episode 11

Metadata, Graphs, and Responsible AI with Paco Nathan

Data science player and coach, author, and venture amplifier Paco Nathan talks with Sam Ramji about hybrid AI, mathematical reversibility, and using AI to solve the knowledge problems that the exponential growth of data will create for years to come. Join these two as they discuss how you can bring multiple data disciplines together using empathy and math.


Episode Guest

Paco Nathan
Data Scientist and Venture Amplifier

Episode Transcript

Sam Ramji:
Hi, this is Sam Ramji, and you're listening to Open||Source||Data. Today, I'm joined by Paco Nathan, known as a player and coach, with core expertise in data science, natural language, and cloud computing. Paco has spent 40 years in the tech industry; his experience ranges from Bell Labs to early-stage startups. He currently advises Amplify Partners, the IBM Data Science Community, Recognai, KUNGFU.AI, and Primer, in addition to being a lead committer on PyTextRank and kglab. Formerly, Paco was the Director of Community Evangelism for Databricks and Apache Spark, and in 2015 he was cited as one of the top 30 people in Big Data and Analytics by Innovation Enterprise. Welcome, Paco.

Paco Nathan:
Thank you, Sam. Really great to be joining you.

Sam Ramji:
I'm really excited to chat with you today. I like to start each episode by asking our guests, what does open-source data mean to you?

Paco Nathan:
Great question. In data science practice, there is so much on which we rely in terms of, on the one hand, the open-source tooling that we have, the PyData stack, if you will. There's a lot of pandas and scikit-learn, and these tools that we just use over and over, they become a lingua franca. And it's been such a great aspect of this field that we were able to find common ground, took a little while, but definitely found common ground there.

Paco Nathan:
On the other hand, there's a lot of aspects of how do we share data across partners, across departments, within an organization? How do we do public private partnerships with government agencies that are providing data? So there's this general problem of, how do we get across data silos? And so a lot of what that means to me is, what are the technical sides, but also what are the people's sides of crossing the silos and the implications that has right now, when we're talking about compliance and fairness and bias and ethics and security, but also just how do we get work done within an organization?

Sam Ramji:
There are a lot of different components that you touched on there. We've talked about things like, what would a container for data be? How do you make the technical layer of data flow as smoothly as we flow stateless compute? But there's also the computation of: what are the permissions you have on the data, what are the fields of use? And then just simply giving people the knowledge that the data even exists. There's a lot there that you've spent decades and decades digging into, starting from the math and ending up with almost the humanities of getting people to agree and work on this stuff together.

Sam Ramji:
I'd love to hear you talk just a little bit about the beginning, what you were doing in the early '80s as you were exploring math, and how that drew you into the world of applying mathematics to data.

Paco Nathan:
I started out in this strange interdisciplinary program in applied math at Stanford; it was kind of a prototypical data science program. It was actually based on a data science curriculum that one of the researchers at Bell Labs had put together years ago. They had this program called math sciences, and it was part statistics, part operations research, part computer science, on and on. So I had that as a foundation, and I really loved working with some of the areas of advanced math and what that meant for how we can transform data or how we can do optimization models.

Paco Nathan:
But then I shifted over into computer science and toward AI, working in an AI program. And it was interesting to me, because at the time we were working with machine learning early on (I would call the early '80s, mid '80s the early days for machine learning), and it wasn't really accepted by the establishment at the university. In fact, they eventually came down and said machine learning is just too empirical, it's not an academic pursuit right now. And yet at least one of those same professors was a billionaire ten years later because he had invested personal money in a little startup called Google. So it's just kind of crazy, this whole arc of where the industry has gone with machine learning and leveraging data. I saw some early parts of that 40 years ago, but then got to see other things that really changed along the way, like cloud computing, the adoption of data science, the adoption of machine learning, et cetera.

Sam Ramji:
And you got heavily involved in the Spark community after the emergence of cloud computing and the rapid fall in the unit costs of compute, and you started being able to do things that you'd only dreamed of in the '80s. I remember you had a Symbolics machine on your desk, specialized hardware, and you've also gone deep into writing microcode for the kinds of processors that might be able to do this kind of heavy lifting. And now we've got general-purpose computing that does all this stuff. So maybe talk a little bit about the shift in the cost of computation, how that changed what you're able to do, and maybe how that inspired some of what you did with Spark?

Paco Nathan:
So long, long ago, I was a grad student working at IBM Research on AI projects, and my lab partner later became a neurophysiologist, and she would talk on and on and on about neural networks, so eventually I was like, "Janet, I agree. I'll study neural networks." And I did; I spent about seven years in research and development in neural networks, but it was just too early. I was even on a project at Motorola building hardware accelerators for neural networks, writing microcode for the processing pipelines, but we just knew it was really way too early. We had a lot of great ideas out of the hard AI space, a lot of search problems being represented, obviously, but then also a lot of tools like planners and optimizers and solvers, and neural networks as a general-purpose approach; we just didn't have the compute power.

Paco Nathan:
But around 2005, late 2005, a friend out of Seattle was calling me up with these really bizarre questions, like, "Hey, if you had such and such service, how would you use it?" They would throw ideas at me and I was kind of a guinea pig. And then later they showed me a website to go and sign up, and it turned out to be something new called AWS. So I got to be in really early, building 100% cloud architectures; I was tech co-founder for a startup that year, and we dove all in on cloud. And it was really interesting, because suddenly this started to open up this world of what we thought was possible with AI. We hadn't had the hardware, and suddenly we started to have the hardware.

Paco Nathan:
How many years is that? 15 years now since cloud was introduced. It's been driven so much by this rapid evolution of hardware, to the point now where we're seeing intense GPUs, but also a lot of custom hardware, ASICs, and whatnot. And all of that is opening up a lot of potential for ideas that have been around for decades, in some cases even software that's been around for decades, and now you can really do something effective in the enterprise with it.

Sam Ramji:
Yeah. And you were working with Elastic MapReduce in the early days. Did it start to change what you think of as the structure of a computer, once you start to realize how fast we can go now? I was talking with some folks from Dell about their work on NVMe over Fibre Channel. That will start to really let us re-imagine what we think a computer is and tie in a lot of these ASICs, tensor processing units, the advanced GPUs, APUs, whatever, into something that was almost unimaginable probably 40 years ago. You're advising folks, you're involved in a bunch of projects, and you're advising a lot of startups, driving people to focus on the future of data, data architecture, and the use of it. I'd love to hear you talk a bit about hybrid AI, responsible AI, how you think about Knowledge Graphs. I think it's a pretty rich field for us to explore with you.

Paco Nathan:
I love this area of working with graphs. I had done some work for O'Reilly called Just Enough Math, which was basically showing advanced math to business executives, people who are going to be working with data, in organizations that depend on data and data at scale, and they're going to be using machine learning, or their people are. How can you get fluent talking with machine learning engineers, for instance? I did this course that was about getting confidence working with the math, working with just enough of the code to be able to build a neural network and solve a problem, or do some sort of matrix transformation on their data, all these things, so that the people who are really the stakeholders on the business side can understand what the tech people are doing.

Paco Nathan:
And I took a business school approach of case studies and histories and vignettes, put yourself in somebody else's shoes, and it really took off. But one thing that struck me was that I kept going back to this theme that we work a lot with graphs, and real-world data is connected data. For instance, in data science, when you do a workflow, you almost never use one dataset; you almost always use two or more that you have to join together and clean up and all this. The datasets are connected, and there's always a graph. And we talk about having a DAG when we do some sort of Big Data thing like Spark. But for years, that was always buried.

Paco Nathan:
And it really struck me about seven years ago, when I was going out and teaching these courses in industry: if I would talk about complex graphs and graph processing, or I would talk about things like ontology, or maybe I would talk about things like tensors, people would run screaming in the opposite direction. And the world is coming around, because now if I mention tensors, people nod and they go, "Oh, you're talking about neural networks," and I'm like, "Yeah, we're talking about neural networks."

Paco Nathan:
So I think that there's a really interesting opportunity now to be doing a lot of work with, for lack of a better term, let's just call it Knowledge Graph. There's a lot of connected data, there's a lot of metadata that we can leverage. And that's really the important part for having explainable AI. We push machine learning models out in production, but then they make decisions and we have to sometimes come back and answer hard questions. Why did they make those decisions? And unfortunately in the race to get really efficient machine learning out there, a lot of that context is thrown away via pipelines, usually. When you look at supervised learning models, literally you're generalizing off of things that happened in the past to try to predict things that might happen in the future and you're generalizing, which means you throw away context. It almost by definition makes explainability very difficult.

Paco Nathan:
Knowledge Graphs are a way of capturing a lot of context, and having that as a different thing that you can rely on. You can build models out of it, but you can also explore and explain what's going on. And more importantly, graphs are really good for what's emerging as a practice of an overlay on top of a lot of different data silos. So if you're an organization and you're working with some partner organizations, can you have a graph for the shared definitions of what you're working on? How do you translate the foreign keys from one system to another? And when you're looking at a critical metric, is that measured in time units of weeks, or days? All these little nuances that, at the end of the day, can compound into something that's disastrous.
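
As an aside for readers who want to see what such an overlay can look like in code, here is a minimal sketch, using rdflib, of a shared vocabulary that maps one concept across two hypothetical silos and records their differing units. The namespaces, field names, and units are illustrative assumptions, not anything from the conversation.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDFS, SKOS

# Hypothetical namespaces: one shared vocabulary, two silo schemas.
SHARED  = Namespace("http://example.org/shared#")
CRM     = Namespace("http://example.org/crm#")
BILLING = Namespace("http://example.org/billing#")

g = Graph()
g.bind("shared", SHARED)

# One shared concept, "customer churn window", mapped to each silo's local field.
g.add((SHARED.churnWindow, RDFS.label, Literal("customer churn window")))
g.add((SHARED.churnWindow, SHARED.unit, Literal("days")))           # canonical unit
g.add((CRM.churn_window_wk, SKOS.exactMatch, SHARED.churnWindow))   # CRM stores weeks
g.add((CRM.churn_window_wk, SHARED.unit, Literal("weeks")))
g.add((BILLING.churn_days, SKOS.exactMatch, SHARED.churnWindow))
g.add((BILLING.churn_days, SHARED.unit, Literal("days")))

# Query the overlay: which local fields map to the shared concept, and in what units?
q = """
SELECT ?field ?unit WHERE {
  ?field <http://www.w3.org/2004/02/skos/core#exactMatch> ?concept .
  ?field <http://example.org/shared#unit> ?unit .
  ?concept <http://www.w3.org/2000/01/rdf-schema#label> "customer churn window" .
}
"""
for field, unit in g.query(q):
    print(field, unit)
```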

Paco Nathan:
So Knowledge Graphs and the inferencing you can do with them, and the algorithms you can do with them, are really interesting for solving a lot of the data problems that we're seeing today in a world that is post 2018, post GDPR, post Cambridge Analytica, post all the hundreds of millions of accounts that were upended during 2018. 2020, of course, has seen an even bigger rash of security problems. So I think a lot of the dialogue for using data in business right now, is taking a step back and saying, "Hey, we can build big machine learning models." Great. But there are people issues and social issues and organizational issues that are really probably more important. And this is where graphs just hit it really hard.

Sam Ramji:
So graphs become almost an orchestration technology for, not just the data, but for bringing the people together?

Paco Nathan:
Very much so.

Paco Nathan:
I got to get involved with this event in December called Metadata Day. It was interesting, because some of the larger tech giants grumbled about GDPR, I saw that firsthand, but the, let's say, more AI-native, a little bit newer companies, like Netflix and Lyft and Airbnb and LinkedIn and others, had a common response. What happened was, they set up people working in data strategy for compliance to answer GDPR issues. And across the board, they started working with graph representations of that, and graph databases and whatnot. And then they started looking at what they could do once they started to pull in not just metadata about their datasets, but who's producing it.

Paco Nathan:
And realistically, if I want to have a conversation about policy in the company about who's producing these five data sets on which I rely, I'd probably need to have my boss talk with their boss or higher up. So I've got to bring in HR data and then I've got to bring in operational data and probably some customer success data. And once you start getting that 360-degree, graph-based view of how your organization uses data, you start to realize, "Wow, we're not only solving compliance problems, there's business upside. We could open up entirely new business lines that we could never see before." And so this was a common experience at a dozen different companies after 2018 because of GDPR. What was awesome was they all responded by building open-source projects. And then the project leads put their heads up and said, "Wow, we're doing the same thing. Let's get together and talk."

Paco Nathan:
So we had this event; the first one kicked off in December, with LinkedIn sponsoring it, and we're going to do this on a regular basis. It was really cool, because we brought in people who have decades of experience working with metadata and Knowledge Graphs, like Natasha Noy out of Google and Deborah McGuinness, people who really know what's going on, also some of the team out of Berkeley, but then we also brought in the open-source lead committers and the product managers out of these companies. And I think what we're seeing now is a wave of other enterprise companies adopting these practices. And it really bodes well that we're able to take something that's a corporate risk-compliance issue and turn it into business upside. I think there's a lot of road ahead on that one.

Sam Ramji:
It's an amazing outcome, because we often see the data siloing problem from a positive perspective, which is, "Oh, we've got so much velocity, because all these microservices are building along on their merry way." But then, the first problem is how do we audit across all of that? It's incredibly difficult. And if you don't have the data connected, you can't take a computational approach. But what you pointed out is, once you started to poke holes in all of those for the sake of audit, now you have this view of how data is actually being used across the enterprise. And you can almost run a hotspot analysis to understand where the most valuable data is, which is often one of the biggest challenges a company has: "I know I'm drowning in data, but I don't know which stuff is valuable, so I throw half of it away."

Paco Nathan:
And that's where the graphs come in handy, because identifying those hotspots is not something that you would do with typical statistics. The best way to do that, the easiest way to do that, is something called centrality. Yeah, we use eigenvectors and a lot of fancy math, but at the end of the day, it's basically PageRank, that's the other name for it. So really, take a look at this PageRank view of where the producer-consumer hotspots for datasets are in your organization. It's kind of like looking at how your website ranks; it's basically the same math.
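
For readers who want to try the centrality idea, here is a minimal sketch using networkx: build a small producer/consumer graph of pipelines, datasets, and downstream consumers, then rank the hotspots with PageRank. The node names and edges are purely illustrative.

```python
import networkx as nx

# Hypothetical producer -> consumer edges: a pipeline writes a dataset,
# which dashboards and models read downstream. Names are made up.
edges = [
    ("etl_orders",      "dataset_orders"),
    ("dataset_orders",  "dashboard_sales"),
    ("dataset_orders",  "model_churn"),
    ("etl_clickstream", "dataset_events"),
    ("dataset_events",  "model_churn"),
    ("model_churn",     "dashboard_retention"),
]

g = nx.DiGraph(edges)

# PageRank-style centrality: which nodes sit at the producer/consumer hotspots?
scores = nx.pagerank(g, alpha=0.85)
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {node}")
```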

Sam Ramji:
You talked very eloquently about some of the math that can be used to do responsible AI. One of the things that's come up is your ability to reverse, or one's ability to reverse, the processing. As you pointed out, as you go through a pipeline, you throw away data, right? It's sort of a lossy compression as you go through. And one comment you made was that when you're using tensors, often that's a tensor realization of a graph.

Paco Nathan:
Right.

Sam Ramji:
And you've thrown away some information, but you've also dug into the math to say, what would it take to reverse that and be able to walk backwards from the conclusions into the tensor, back to the graph, and make it more whole? Can you talk about that a little bit? Because we're on the verge of a decade of explosions of data, where every company is going to be wallowing in exabytes. And the practical applications of the work you've done for decades are probably just about to come into full force.

Paco Nathan:
The transforms, on a social level or organizational level, translate into integration and collaboration. In my weird head, I think of these mathematical transforms and the social implications of them. If I have a complex graph that describes how multiple business units are collaborating on a big customer problem, I can take that graph and characterize it, and now I've got a vector and I can run some statistics. I've thrown away a lot of information, but I've made it so I can number-crunch efficiently. Now I could also build out a matrix, and there are different ways of doing a matrix representation; in math we talk about factorization, and frankly, PageRank and all those are based off of that.

Paco Nathan:
I can go a step further and do an n-dimensional matrix; it's called a tensor. So I can just take the data and turn it into a numerical representation inside a tensor. And by the way, this is how we work with neural networks these days. The problem is I've lost a lot of context. Without diving too much into the math, there's an area called algebraic graph theory, and there are these things called algebraic objects; vectors and matrices and tensors are algebraic objects. If you've done object-oriented programming, it's very much like what we talk about in terms of class hierarchies and inheritance. There's a lot of math to manage that kind of thing.
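
As a small illustration of those algebraic objects, here is a sketch using networkx and numpy on a toy graph: the same graph characterized as a flat vector of summary statistics, and kept with more structure as an adjacency matrix. The particular statistics chosen are illustrative, not a recipe.

```python
import networkx as nx
import numpy as np

# A toy collaboration graph; the structure is illustrative only.
g = nx.karate_club_graph()

# Characterize the graph as a flat vector of summary statistics:
# cheap to number-crunch, but most of the structure is thrown away.
vec = np.array([
    g.number_of_nodes(),
    g.number_of_edges(),
    nx.density(g),
    nx.average_clustering(g),
])

# Or keep more structure as an adjacency matrix (a 2-D algebraic object);
# stacking one slice per relation type would give an n-dimensional tensor.
adj = nx.to_numpy_array(g)

print(vec)
print(adj.shape)
```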

Paco Nathan:
So one of the first things that we did with kglab, this integration project for different graph libraries, is to have a class hierarchy where we can run these math transforms, but we can also do inverse transforms. So I can take a graph and project some data out into a tensor and then run node2vec, do some deep learning on it, and there are really cool insights coming out of those models. But then I can turn the numbers back into the graph side, the inferential side of what we're doing. That really helps for things like explainable AI, of course, and audits and all the rest. But what it really means is... I've thought of this as a great tragedy, that the people who are doing deep learning on graphs, the embedding people, they're off in their own camp. The people who are doing graph algorithms like PageRank, they're off in their own camp. The people who are doing Markov networks and probabilistic graphs, they're in their camp. There's all these little camps that don't have a lingua franca.
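
To illustrate the forward and inverse transforms being described, here is a minimal round-trip sketch using plain networkx and numpy (not kglab's actual API): project a graph into a numeric array that a deep learning library could consume, then walk the numbers back to nodes and edges.

```python
import networkx as nx
import numpy as np

g = nx.karate_club_graph()  # stand-in for a knowledge graph projection

# Forward transform: project the graph into a numeric array
# (here just the adjacency matrix; one slice per relation type
# would make it a 3-D tensor).
adj = nx.to_numpy_array(g)

# ... embeddings / deep learning would run on `adj` here ...

# Inverse transform: walk the numbers back to nodes and edges, so results
# can be attached to the symbolic, inferential side of the graph again.
g2 = nx.from_numpy_array(adj)

# No structural information was lost in this particular round trip.
assert nx.is_isomorphic(g, g2)
```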

Paco Nathan:
So a big part of what we've been doing on our open-source project is to leverage the math and develop that common language, so we can go back and forth between these different technologies and leverage them together. And it has a lot of bearing on things like data quality. You go out and do a lot of NLP work, you build up some annotations, you're going to go use these, but how would you do a unit test on that data? Well, it turns out there are some really great probabilistic methods; they're over in this obscure area. What we've worked really hard to do is take obscure area X, turn it into something that is basically pandas, then go join that data with another obscure area Y that also looks like pandas, and put them all in a scikit-learn workflow.
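
As a toy illustration of unit-testing annotation data through a pandas-shaped interface, here is a short sketch; the checks and the crude agreement metric are stand-ins for, not examples of, the probabilistic methods being alluded to.

```python
import pandas as pd

# Hypothetical NLP annotations: each row is one labeled span from an annotator.
anno = pd.DataFrame({
    "doc_id":    ["d1", "d1", "d2", "d2", "d2"],
    "label":     ["ORG", "PERSON", "ORG", "ORG", "PERSON"],
    "start":     [0, 15, 3, 3, 40],
    "end":       [8, 25, 12, 12, 55],
    "annotator": ["a1", "a1", "a1", "a2", "a2"],
})

def test_spans_are_well_formed(df: pd.DataFrame) -> None:
    # Unit-test style checks tied to the use case, not to a benchmark.
    assert (df["end"] > df["start"]).all(), "empty or reversed spans"
    assert df["label"].isin({"ORG", "PERSON", "LOC"}).all(), "unknown label"

def agreement_rate(df: pd.DataFrame) -> float:
    # Crude inter-annotator agreement: fraction of (doc, span) pairs on which
    # the annotators assign the same label. A rough proxy for probabilistic QA.
    pivot = df.pivot_table(index=["doc_id", "start", "end"],
                           columns="annotator", values="label", aggfunc="first")
    both = pivot.dropna()
    return float((both.nunique(axis=1) == 1).mean()) if len(both) else float("nan")

test_spans_are_well_formed(anno)
print(f"agreement: {agreement_rate(anno):.2f}")
```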

Sam Ramji:
That is the big problem of our age: being able to connect people who know a ton about different things but don't know how to talk with each other, right? There's intersectionality... I recently ran across someone raising the bar above interdisciplinary or multidisciplinary, who said they are anti-disciplinary. So you've got some skills there in stitching together these very deeply educated, knowledgeable folks into coherent communities. In this particular case, or in a more general case, what are you finding that works, that people can copy?

Paco Nathan:
That's an interesting question. It's really hard, because the people who have the most depth in their fields are usually focused on benchmarks and publishing about how they're beating some benchmark. They're not typically as focused on use cases. And so a lot of the practice that I've worked out with the people I'm working with on kglab, frankly, is about the software engineering aspects of AI: how can we really inject a good dose of software engineering, and the things we've learned by using Kubernetes and other practices over on the systems side, so we can bring that in and have more regular practices in the open-source software that integrates a lot of different areas of AI. That's first and foremost, that's the thing.

Paco Nathan:
So, can you have a suite of unit tests that are tied to known use cases? Because frankly, there's probably not a lot of that. There might be benchmark testing; there might be unit tests toward benchmarks, but not toward the end use cases, or toward the integration of different libraries. And that may sound subtle to people who are not software engineers, but it's a real hard problem. It actually takes a lot of work.

Sam Ramji:
There's a lot of need for cross-domain appropriation from DevOps into this environment, right? What does MLOps look like? You can see technologies like Kubeflow, or ways to figure out how you get your models back into production. How are your data scientists sharing features? What is your feature store approach? But a lot of these things still have to develop as fields of practice to become a little bit more routine.

Paco Nathan:
Unfortunately, a lot of the semantic technologies came out a decade before Big Data really got going. And so the code, even though it may have been written last year in some cases, still looks like httpd. It's written by people who think that XML's a really good idea. And so what we're doing is coming back and saying, "Hey, there's this thing called Parquet, and we can have sharding based off of partitioning within the files. And by the way, we get a lot of compression and predicate pushdown and all these techniques that we use on the DevOps and the MLOps side." We're trying to inject that back into the AI practice, so that for the data scientist it's, "Hey, it's a couple of lines of Python," it's not something you really have to think about much, but it makes your MLOps team happy.
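
For example, the "couple of lines of Python" might look something like this sketch with pandas and a pyarrow backend: write a toy table as partitioned Parquet, then read it back with a filter so the predicate can be pushed down. The data, column names, and path are illustrative.

```python
import pandas as pd

# Toy graph edges; in practice this might be RDF triples or annotations
# serialized to a columnar format instead of XML.
df = pd.DataFrame({
    "subject":   ["acct:1", "acct:2", "acct:3", "acct:4"],
    "predicate": ["owns",   "owns",   "follows", "owns"],
    "object":    ["item:9", "item:7", "acct:1",  "item:2"],
    "region":    ["us",     "eu",     "us",      "eu"],
})

# Write Parquet partitioned (sharded) by a column; compression is on by default.
df.to_parquet("triples/", partition_cols=["region"])

# Read back with a filter: with a pyarrow backend the predicate is pushed down,
# so only the matching partitions and row groups get scanned.
eu_only = pd.read_parquet("triples/", filters=[("region", "=", "eu")])
print(eu_only)
```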

Sam Ramji:
So you're doing a lot with startups, and I was wondering if you would be open to talking about Primer.AI, or Stemma or Metaphor, a little bit about what they're doing and what gets you excited about seeing them develop?

Paco Nathan:
Yeah. Stemma and Metaphor, and there's another one, I can't give the name out just yet, that I'm also working with. Those are examples of what I was mentioning about Metadata Day and this graph-based practice for understanding where your datasets are being used and what the metadata is and all that. These are a couple of spin-offs now from Lyft and LinkedIn and others, as new ventures. So this is kind of a new category of companies working in that area, and I think it has a long road ahead, because they're still pulling the data sources and the metadata together. There's this whole other aspect of, once you get that, you can start applying graph machine learning on top.

Paco Nathan:
Primer is one that I'd cite as a really good example, one of my favorite examples. I've known them since it was like four or five people in a room, and now it's a much larger company. Primer is working in natural language. And the idea there goes back to this hybrid AI that we were talking about, where they do a lot of work with deep learning, but they also do work with Knowledge Graphs and other kinds of systems, some just more "traditional" machine learning models. What Primer is addressing is the case in enterprise where you've just got torrents of text coming in. One of the things they talk about on their website is Walmart: say you're on Walmart's legal team, and think about how many thousands of inbound email messages that legal team gets every day. And a lot of those have consequence.

Paco Nathan:
And really we're talking about, whether you're on a legal team at a big corporation, or you're doing logistics for a big company, or maybe you're at the DoD, or wherever, you're having to read all these reports. It's beyond human scale. There's no way that people can do this, and it's actually really challenging even for teams of people to handle this scale and dimensionality. So what Primer is doing is sort of this hybrid approach to AI; I would consider them to be more human-in-the-loop, in a way, where they can take a first pass through the whole fire hose of incoming email and start to pull out: what are the themes going on today? Who are the people? What are the connected parts of that? And put that graph, if you will, that map, in front of the person who has to take action.
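
As a loose illustration of that first pass over a text fire hose (not Primer's actual system), here is a minimal sketch using spaCy with PyTextRank, the library mentioned in the intro, to rank the key phrases in a single hypothetical inbound message.

```python
import spacy
import pytextrank  # registers the "textrank" spaCy pipeline component

# Hypothetical inbound message; in practice this loop would run over a firehose.
text = (
    "Our supplier in Rotterdam reports a two-week delay on the container "
    "shipment. Legal asks whether the penalty clause in the Acme contract "
    "applies, and logistics wants an updated delivery forecast."
)

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

doc = nlp(text)

# Graph-based ranking of key phrases: a first pass at "what are the themes,
# who are the actors", to put in front of the person who has to act.
for phrase in doc._.phrases[:5]:
    print(f"{phrase.rank:.3f}  {phrase.text}")
```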

Paco Nathan:
That's getting used in finance, it's getting used in manufacturing, it's getting used in a lot of places. But I think we've definitely reached the limits in a global economy of what humans can or can't do when it comes to communication. And this is where AI can really augment and bring the most salient parts back down to human scale.

Sam Ramji:
I feel like the Xzibit meme is appropriate here: I heard you like AI, so I put some AI in your AI so you can AI while you AI. Because computationally, you look at the fact we've got 196 countries in the world recognized by the UN, many of them have multiple states, and they're all coming up with regulations about what happens with data. So you can think of all these different ways that metadata ought to get applied, and you fast-forward 10 years to 2031 and you think, "My gosh, we're in a post-zettabyte world." You're going to have to have a computational approach to prove, or hope, that you're doing the right thing with most of your data. So that's why I get a lot of inspiration from the work you're doing and from things that Primer is doing.

Paco Nathan:
I like this notion of the human-in-the-loop. We're seeing more sophisticated data science workflows, where teams take a much more sophisticated view of leveraging AI for human-in-the-loop data preparation; things like Snorkel are a good example of using that human expertise along with the machine learning. And I really think that this is the path forward. So I think Primer is a good example of it, what they're doing, and I can mention a number of other companies too. But you've got to take into account the people in your organization and what they know, and the fact that if you get a job in this field these days, you're probably going to be doing something else in 18 months.
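
As a hedged sketch of that human-in-the-loop idea with Snorkel: encode a couple of expert rules as labeling functions over hypothetical messages and let the label model combine them into probabilistic training labels. The rules, labels, and data here are illustrative only.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NOT_URGENT, URGENT = -1, 0, 1

# Hypothetical inbound messages to triage; the "labels" come from encoded
# human expertise rather than from hand-labeling every row.
df = pd.DataFrame({"text": [
    "Regulator deadline is tomorrow, please advise immediately.",
    "FYI: newsletter draft attached for next month.",
    "Breach notification required within 72 hours per GDPR.",
    "Lunch and learn rescheduled to Friday.",
]})

@labeling_function()
def lf_deadline(x):
    # Expert rule: deadline language suggests urgency.
    words = ("deadline", "immediately", "72 hours")
    return URGENT if any(w in x.text.lower() for w in words) else ABSTAIN

@labeling_function()
def lf_fyi(x):
    # Expert rule: "FYI" messages are usually not urgent.
    return NOT_URGENT if x.text.lower().startswith("fyi") else ABSTAIN

L = PandasLFApplier(lfs=[lf_deadline, lf_fyi]).apply(df=df)

# The label model combines the noisy expert rules into probabilistic labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=100, seed=42)
print(label_model.predict(L))
```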

Sam Ramji:
It's so true. And it's interesting to think how much things have changed. You left AI for a bit during the first AI winter; I graduated into the second AI winter. For many years, I didn't practice it at all. But maybe we're in the endless summer of AI because of these multidisciplinary or anti-disciplinary constructs that you're talking about, bringing in all these concerns about humans making better promises, and maybe we'll think about it as augmented intelligence, as opposed to magic.

Paco Nathan:
Very well said. And to dig into that, there was a really, really fascinating series of conferences that set the stage for this. Back in the late '40s and early '50s, they were called the Macy Conferences, and the notions of AI really emerged out of them. The idea was that people who were involved with really complex systems, whether you're talking about building commercial airframes or nuclear power, were dealing with complex systems and seeing that we were reaching the limits of human scale, that we could build systems that were bigger and more complex than we could handle. And the implications were just terrifying.

Paco Nathan:
And what came out of that was something called cybernetics. Second-order cybernetics was this idea of describing how people fit into this picture, working with a lot of complex systems. And it's really amazing when I look back at that, because there were really fascinating people in that space who came out with a lot of blueprints for what we would later start calling parts of AI. Think of that as a touchstone or a watershed that we can go back to and look at, sort of the social-systems view of AI.

Sam Ramji:
Yeah. And a lot of demand for simulation, for people who can work with AIs. I saw at Autodesk a couple of years ago, they built a new system for designing airframes, for example, called generative design. And so the computer can come up with all these ways to create lighter, faster, stronger airframes, but a person still has to look at it and say, "Could that be certified? What are these other fields of expertise?"

Sam Ramji:
So in closing, we're at the end of our show, and I would love to ask you as an expert in this field of data and data science and math and Knowledge Graphs, what's one resource or piece of advice you think our audience should walk away with?

Paco Nathan:
I really love this notion of what's emerging called graph-based data science. I think there are a lot of practices there that can augment your existing data science team and its practices, but really leverage this area of hybrid AI. And again, bring the people back into the question, the social aspects of this. And I would point toward... I'm very much involved with this community, it's called the Knowledge Graph Conference, and we have a lot of view toward the industry applications, the people who are out on the front lines implementing things for supply chain or whatnot, companies like Bosch and Siemens and AstraZeneca, all those. The KGC conference happens every year and I'm part of that; I'm running tutorials this year.

Paco Nathan:
But there's a Slack board where there are over a thousand experts working with graphs. And so you can come in and ask questions and network and find people. I think the learning right now is recognizing that there are a lot of graph problems, and the opportunity is to find other people who have different areas of expertise, again, that multidisciplinary or anti-disciplinary view. It's like, let's break down the silos, talk with other people who are seeing similar problems, and find ways to collaborate.

Sam Ramji:
That is awesome. Paco, it has been an absolute privilege having you on the show. Thank you so much.

Paco Nathan:
Thank you so much, Sam. Wonderful talking to you.
