Xbox One's voice control technology is not as awe-inspiring as you hoped it would be. And that's only half the story.
It’s easy to learn the wrong lessons from success. When Microsoft releases the Xbox One at the end of this week, one of the world’s biggest companies will take a major step toward cementing voice control as a central computer interface for its most successful hardware brand.
Microsoft’s Kinect was originally released in 2010 as a reaction to the unexpected success of Nintendo’s Wii, and the combination of full body motion control along with simplified voice controls made the peripheral a fast success. With more than 24 million Kinects sold over three years, Microsoft assumed the 3D camera and microphone combo was the starting point for a bright future of communicating with computers in the same way we communicate with each other, and so a more powerful version of the device has been made the centerpiece of the new Xbox One.
Ironically, early reactions to Microsoft’s new game machine seem to suggest Kinect has turned into a noose the company has unwittingly slipped around its neck. While the original peripheral sold well, the games released for it were mostly disastrous. Hopes for new ways of playing were dashed against frustrating experiences like Kinect Star Wars, Steel Battalion: Heavy Armor, Fable: The Journey, and Rise of Nightmares. Yet, the presumptive ease of interface offered by voice and arm swipes to move between menus, search Bing, and open apps buried layers deep in a machine without mouse or keyboard was too appealing to let go of.
Voice controls rely on the worst elements of both computers and humans, so it’s strange to see them turned into a primary mode of interaction. Even between humans, verbal communication is hugely unreliable, and confusions abound. The amount of attention and energy necessary to translate a non-verbal thought (e.g. hunger) into an articulate command (e.g. I’d like a turkey sandwich with mustard, mayonnaise, and pickles) is significant when compared to the instantaneity of button or key commands committed to muscle memory. In many cases it’s a matter of milliseconds compared to seconds, but that can feel like a huge encumbrance when you know there’s a shorter and less demanding way of doing things: you could hit alt-tab six times in the time it takes to say “Xbox, switch.”
More off-putting is the way Kinect reduces language to directives and simple questions built around a small basket of action words. In the same way that we speak differently to a grandparent than we might to a best friend or co-worker, the spread of voice controls has taught us a new, intimate vernacular of artificially vague commands that would seem cryptic when directed at another person. Switch to what, exactly? Why do you want to know the surface temperature on Mars?
There are definitely uses of voice commands where articulating a command line is slightly more efficient than a tactile interface—while driving, for instance, or to avoid typing a search term with a game controller or touchscreen keyboard. And the technology is incredibly useful for people with accessibility obstacles like blindness or limited arm or hand mobility. Yet, these use cases show the benefits of voice control as mixed at best, sometimes better for a small subset of people and in other cases significantly less efficient than what’s presently available.
Apple’s commitment to Siri has helped popularize voice control as much as or more than any other consumer technology, yet its consciously limited approach to the feature is telling. Since releasing the feature with 2011’s iPhone 4S, Apple has avoided pushing people to make the service the primary mode of interaction with their phones, and it has never charged more for it. Instead, Siri is an add-on feature that easily fades into the background for most users, who on average called on Siri only once a month, and only half of whom are satisfied by the overall experience. Apple has been even more conservative in bringing Siri to its laptops and desktop computers, limiting its use to dictation in applications with text capabilities.
Microsoft’s deep dive into voice control for Xbox One seems less like a step toward a new way of communicating with computers and more like an effect of trying to sell an expensive piece of technology that is neither revolutionary nor paradigm-shifting. After the unexpected success of the Wii and the follow-on of the simple pleasures of iOS games, the traditional games industry has been beset with all sorts of mixed messages about just what its audience will want next, and as both PS4 and Xbox One have shown, there are no especially revolutionary new experiences to be had at present. That’s not bad, per se, but a marker of how widely accepted videogames have become.
“Our gizmos and gadgets, our phones and laptops and tablets and video games just aren’t very special anymore,” Ian Bogost wrote at the official announcement of the PlayStation 4 earlier this year. “And counter-intuitively, that’s what’s so new and special about them: their familiarity, their ordinariness.” Over-estimating the importance of Kinect in Xbox One is a way of trying to postpone the onset of this ordinariness, making something perfunctory and mundane seem revolutionary.
It’s not that Kinect doesn’t have interesting uses nor that all of its functions are pointless, but that they’re marginal, making basic system features slightly more convenient in some limited ways, the kind of curatorial upkeep one would expect from a service provider but not something to sell a new machine with. And, like all voice control, the more accurate it becomes, the less exciting it seems, and without that exciting gloss of newness, most users will be left with the awkward and unanswerable question of why they’re trying to talk to their televisions in the first place.
Michael Thomsen is Complex's tech columnist. He has written for Slate, The Atlantic, The New Inquiry, n+1, and Billboard, and is the author of Levitate the Primate: Handjobs, Internet Dating, and Other Issues for Men. He tweets often at @mike_thomsen.