First we knew it as Xbox’s response to the Wii, with mysterious rumors pegging it as everything from a wand-like input device to a revolutionary motion-sensing camera. Then, the latter of those two prospects got introduced as Project Natal, the head-scratching codename for Microsoft’s controller-free initiative. A media circus (literally) during this year’s E3 revealed the actual product name this summer. Since then, silence and anticipation built steadily until Kinect’s debut in stores today.
Microsoft claims that Kinect will change everything. Much of that line of thought comes from Alex Kipman, the company’s Director of Incubation and the gentleman parting the waters into the promised land of super-immersion. Kipman’s been barnstorming for the last few weeks preaching the Kinect gospel and spoke to IFC News about the philosophy and practical considerations behind Microsoft’s next big bet.
I know a lot of research has gone into Kinect and it’s finally out of development. Now that you guys can send the product off to launch, can you talk about how long the journey’s been to get here? Has it been a two-year cycle? Longer than that?
One could say it was a million years in the making. This was about making you into the controller, right?
I would say that what you see here is a combination of us understanding a moment in time. Of us understanding that computers, as a whole, are transitioning from this world where all of us had to understand technology, into this world where technology fundamentally understands us.
But I see Kinect as the peak of that journey, of that transitioning moment, of the catalyst that brings us from this old world, to this new world that will be.
From that perspective, why hadn’t we had this before today? And the answer is because we haven’t been able to get the algorithms to a level of sophistication around the various elements–computer vision, machine learning or voice recognition–to a point where we could transition science fiction into science fact.
So, to give you the idea of time, which is what you’re asking, I need to mention that we have a huge branch of research at Microsoft. And if I were to add the many years of people with domain expertise in these fields, it’s decades’ worth of work.
Right. So you were already doing biometrics research and stuff like that?
Any number of things like that. Generally, we pick the key experts in the world in all of these fields and fuse them together to really make sure we can get a very strong platform that really lives up to, “Hey, simply step in front of the sensor, and it recognizes you.” It knows the difference between you and I, you and your family, you and your friends. Start moving, and the sensor understands fundamentally your human movement. Knows when you kick a soccer ball, or gesture to move between UI [user interface] screens. Then, it can tell when you move around to do tai chi poses as in “Your Shape: Fitness Evolved.”
And, finally, when you use your voice, you’ll have voice recognition work in a natural way. So, if you’re watching entertainment, you can simply say “Xbox, pause,” or “Xbox, why don’t you suggest a movie for me?” Voice commands and things along those lines.
Those three pillars create the palette, the paint colors and the paint brushes that allow us to create these unique experiences that land us in this new world, where technology disappears and Kinect fundamentally understands you. Now that’s half of the story.
The other half of the story is how the combination of research and technology–paint colors and paint brushes, if you will–lets players be painters and paint pictures. We think the seventeen experiences that we bring to market at launch will let you do that. All the launch games were really created from the beginning to get you up on your feet, get everybody collaborating, cheering each other on, playing together, having fun and laughing together.
So what’s the nature of the challenges in the various Kinect games?
These experiences were designed to be simple, fun, and approachable for everyone. Now, that doesn’t mean that they are simple in every way. They’re simple to start, but they’re still skill-based. They take forever to get good at. It’s like golf. You and I can go to the golf course today. I know the rules. I can swing my arms and I can hit a ball. I never played before. To beat Tiger Woods, I’m going to have to spend a little bit more time to get to that level. Same thing here.
All of our experiences can be described as simple but approachable. Super-easy to get into and go. But it takes work and skill before you can get really, really good at it.
Can you talk a little bit about integrating the research? Like you said, you’ve got three pillars here that are all combining to create essentially a seamless experience. Can you talk about the different directions that you guys could have gone, for things like voice recognition or the body scanning?
Yes and no. The reason nobody has been able to crack this problem before is because everybody goes down a route, after trying to figure out a pre-set path. That’s the very engineering way as opposed to the artistic way of approaching the problem. I always say to people, “I’m the Kirk in a world of Spocks.” The world of Spocks requires you to choose something. It’s zeros and ones. It’s true and false. It’s black and white. It’s yes and no.
The answer to your question, which is the Kirk answer, is the more emotional and artistic answer…
[overlapping] You got some of the William Shatner body language going on, too.
[laughs] The point I’m trying to make is that the human body and human expression represent a system that’s analog. It’s not yes and no. It’s maybe. It’s not black and white. It’s gray. You’re moving to a world that’s not “what you know” but to a world of probabilities where all of these probabilities exist all the time. Your brain’s job is to create a language that allows you to know what to choose out of all these probabilities and when to choose it.
That was a whole bunch of philosophical blah-blah-blah. Let me give you some concrete examples. Take identity recognition. I can reduce that entire space to a signal-to-noise problem. Why haven’t I had identity recognition that works in the past? It’s because people choose a way of doing it: either a face, a voice or a fingerprint gets added. It turns out that if you and I get in front of a camera sensor right now, we’re very different people. Kinect is going to use that facial recognition data to lock us in.
Now that facial recognition is signal, everything else is noise. Still, it turns out that in the living room, Darwin is against us. You are genetically similar to your family. At that point, facial recognition sucks. So, then, facial recognition just became noise, and I need something else to be signal.
So it’s really about trying to create hardware and software that look at the world in terms of “ands” instead of “ors.” It’s not about choosing a path. It’s about realizing that no one path will get you to the Promised Land, and you need to create a language that tells you that everything’s probable. You need to have some language around confidence. You need to know when you know something. You need to know when you don’t know something.
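The "ands instead of ors" idea can be illustrated with a small sketch. It is not Kinect's actual method, which the interview does not disclose. It assumes each cue (face, voice, body shape) independently reports a probability for every candidate, multiplies the cues together in log space, and declines to answer when the fused confidence falls below a threshold. The function names, the cue names, the weights and the 0.8 threshold are all hypothetical.

```python
import math


def fuse_cues(cues, weights=None):
    """Combine per-cue probability distributions over the same candidates.

    cues: dict mapping cue name -> dict mapping candidate -> probability.
    weights: optional dict mapping cue name -> reliability weight (default 1.0).
    Returns a normalized dict mapping candidate -> fused probability.
    """
    weights = weights or {}
    candidates = list(next(iter(cues.values())).keys())
    log_scores = {c: 0.0 for c in candidates}
    for name, dist in cues.items():
        w = weights.get(name, 1.0)
        for c in candidates:
            # Floor tiny probabilities so log() never sees zero.
            log_scores[c] += w * math.log(max(dist[c], 1e-9))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


def identify(cues, weights=None, min_confidence=0.8):
    """Return (candidate, confidence), or (None, confidence) when unsure."""
    fused = fuse_cues(cues, weights)
    best = max(fused, key=fused.get)
    if fused[best] < min_confidence:
        return None, fused[best]  # the system knows that it does not know
    return best, fused[best]


# Two siblings: the face cue is nearly useless, the other cues are not.
face = {"alice": 0.52, "bob": 0.48}
voice = {"alice": 0.90, "bob": 0.10}
body = {"alice": 0.80, "bob": 0.20}

face_only = identify({"face": face})
all_cues = identify({"face": face, "voice": voice, "body": body})
```

With the face cue alone the sketch abstains, because 0.52 is below the threshold. With all three cues it settles on "alice" with high confidence, since the near-uniform face distribution barely moves the result while voice and body carry the signal.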
All of that sounds ridiculously difficult to program a computer to do…
It gets better, because there’s a second derivative to it, which is that you need to be confident about your confidence. Because if my [computer vision] system says “Hey, I’m really 100% sure that this is a head,” and it’s really a foot, well, it’s not really confident about its confidence. So, the entirety of Kinect is designed to be this probabilistic, statistical-based system that really looks at everything–identity, motion, and voice–in terms of a signal-to-noise world. And it knows when to focus on the signal, when to throw away the noise, much like your brain works.
Our minds are essentially massive signal-to-noise machines that are way more complicated, complex and sophisticated than Kinect. Like, right now, your attention is focused on me and my voice, relegating all the voices in the other rooms to the background. All of our efforts for what we want to do on the console have been to basically replicate a similar means of judging and filtering multiple streams of data, to figure out the most probable conclusion for which user you are, what you might be saying and how you might be moving.
So that goes back to what I said about no single path for decision-making. It’s about all possible paths. And it’s about being confident about your confidence so we can believe in the choice. Traditionally, it’s super-simple to create an artificial intelligence system that knows something. Now, to have the artificial intelligence system know when it’s stupid and when it doesn’t know something, that’s the hard problem. And Kinect does that with something that’s uniquely ours, something that we invented, which is this language to be able to describe these very analog concepts in a robust way.
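One common way to put a number on being "confident about your confidence" is calibration: checking whether predictions made with, say, 90% confidence turn out right about 90% of the time. The sketch below computes a standard expected calibration error over a batch of predictions. It is a generic technique offered as illustration, not the in-house "language" Kipman refers to, and the function name and bin count are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Average gap between stated confidence and observed accuracy.

    confidences: list of floats in [0, 1], one per prediction.
    correct: list of bools, True where the prediction was right.
    Returns 0.0 for a perfectly calibrated batch; larger is worse.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0
    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Weight each bin's gap by the share of predictions it holds.
        ece += (len(items) / total) * abs(mean_conf - accuracy)
    return ece


# "100% sure this is a head" four times, right only once: badly calibrated.
overconfident = expected_calibration_error(
    [1.0, 1.0, 1.0, 1.0], [True, False, False, False]
)
# 75% sure four times, right three times: confidence matches reality.
well_calibrated = expected_calibration_error(
    [0.75, 0.75, 0.75, 0.75], [True, True, True, False]
)
```

The overconfident batch scores 0.75 and the well-calibrated batch scores 0.0, which is the head-versus-foot example above expressed as a measurement.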
One last question. This conceptual framework, this architecture for the algorithms that you’re talking about, is this something that we can expect to see rolled out on Xbox in different ways, or even onto the PC platform? Because it sounds flexible enough to kind of reinvent user interfaces altogether.
I meant what I said. The entire computer world is changing. And when we look at Kinect, it’s the beginning of the journey. It’s not an end of a journey. And we begin the journey very focused in the living room, and in gaming and entertainment as a whole, but it would be silly of us to not be looking at this in a broader sense.
We don’t have the time or the wish to think about that broader space right now. We need to have an amazing consumer launch on November 4th and have an amazing device for everyone in the living room, but, as you say, we believe fundamentally in ushering in this new era of computers, and we see Kinect as the pinnacle of that transition right now.