why_at 24 hours ago [-]
My first impression coming away from this is skepticism.
Anything with voice controls for routine use is a pretty tough sell. Doing this when you're not completely alone would be annoying to everyone around you.
Most of their examples seem like they could have been done with a right-click drop-down menu, so they don't really need to "re-invent the mouse pointer".
So is this thing talking to Google's servers all the time for the AI integration? So it won't work if you're not connected to the internet? Privacy concerns are obvious; now Google wants to have an AI watching literally everything you do on your computer?
Does it cost the user anything for the LLM use? If it's free, will it stay free forever? That's quite a lot to give away if they're expecting people to use it to change a single word like in one of their examples. I guess they're expecting to make the money back by gathering data about literally everything you do on your computer.
There might be a killer app for AI integration with personal computers that has yet to be invented, but this doesn't look like it.
jasonjayr 14 hours ago [-]
The killer app was conceived as early as the 1980s: an agent running on your computer, organizing your files, your schedule, your messages, your bills, bank accounts, etc. All the parts of your life that were routine drudgery should be able to be offloaded to a smart agent, based on your preference, to bring you the information you needed with natural language queries, contextualized to what you were doing at the time, when you need it.
What's being delivered now is an agent running on someone else's computer, copying your data to someone else's database, with zero responsibility or mandate to protect that data and not share it with anyone else (in fact, they almost always promise to share it with their thousand partners), offering suggestions and preferences based on someone else's so-called recommendations, influenced by whoever pays the agent's operators, and increasing pressure to make using someone else's computers + agents the only way to interact with other people and systems.
There is no doubt that LLMs can do amazing things, but the current environment seems to make it nearly impossible to do anything with them that doesn't let someone else inspect, influence, and even restrict everything you are doing with these systems.
Animats 7 hours ago [-]
> What's being delivered now is an agent running on someone else's computer, copying your data to someone else's database, with zero responsibility or mandate to protect that data and not share it with anyone else (in fact, they almost always promise to share it with their thousand partners), offering suggestions and preferences based on someone else's so-called recommendations, influenced by whoever pays the agent's operators, and increasing pressure to make using someone else's computers + agents the only way to interact with other people and systems.
If we're going to have AI regulation, this is where to start. If a company's AI service acts for a user, the company has non-disclaimable financial responsibility for anything that goes wrong. There's an area of law called "agency", which covers the liability of an employer for the actions of its employees. The law of agency should apply to AI agents. One court already did that. An airline AI gave wrong but reasonable sounding advice on fares, a customer made a decision based on that advice, and the court held that the AI's advice was binding on the company, even though it cost the company money.
This is something lawyers and politicians can understand, because there's settled law on this for human agents.
pronik 3 hours ago [-]
> an agent running on your computer, organizing your files, your schedule, your messages, your bills, bank accounts, etc. All the parts of your life that were routine drudgery should be able to be offloaded to a smart agent, based on your preference, to bring you the information you needed with natural language queries, contextualized to what you were doing at the time, when you need it.
The hard reality is that you are still responsible for all of these things. If anything goes wrong at all, you are liable. Might not be devastating if it's just your shopping list or your photos mangled, but with taxes or bills? Even if the agent is running completely locally in your home, you still won't trust it fully if your livelihood depended on it.
The killer app is only possible if software is fully reliable, which we all know is not the case. Software is just that: software. It still has bugs, undefined behaviour, etc. Agents are the same; they just break in different ways, and fixing them might be even more difficult.
Bottom line: you will always be liable for things happening in your name, and we were sold a fairy tale a very long time ago.
jeswin 13 hours ago [-]
A few decades back, a lot of computer use was emails. And it was stored on someone else's servers - with everyone from server operators along the route, to the government potentially having access to it. Even HTTPS is a relatively recent thing.
I guess what I'm saying is - we've always had this problem.
kjs3 5 hours ago [-]
Speaking as an email admin from a few decades back: there's just a tiny bit of difference between "my corp or school has a mail server and holds my email and an admin could look at it" and "Google and a tiny number of other companies hold most everyone's email and always look at it".
MaxfordAndSons 13 hours ago [-]
Yea there have always been gaps in privacy, but nowadays it's several orders of magnitude easier for corporations to exploit that private data at scale.
skydhash 7 hours ago [-]
Snail mail is also not secure and can be tampered with. I don’t mind someone hosting my mail. But I do mind Google doing it (based on their behavior).
lubujackson 3 hours ago [-]
This low effort AI shoveling reminds me of how they keep trying to make fridges that tell you when you need to buy milk or auto-buy milk. They have been pushing that idea for decades now, but it ignores the biggest issue that AI has laid bare: it sucks without accurate context.
What if I am going on vacation next week? What if I need extra milk for a dinner I am planning? What if my kid puts the milk in the fridge sideways and it no longer detects it?
"Easy fixes" to easy problems never work because they add mental load to tasks we already manage capably. Yes we no longer have to think about buying milk when it gets low, which was a stable pattern. But we replace it with a nondeterministic "milk state" that we need to be constantly vigilant about and manually adjust any time our routines are altered - exactly when we don't want to stack on more overhead.
AI is discretely useful, tremendously so, but big tech loves to default to umbrella solutions before there is a rich context to reasonably support them. The real world is messy.
General AI product tip: show your tool fixing a messy problem, not a happy-path problem. That's where AI is impactful!
Windchaser 3 hours ago [-]
> Most of their examples seem like they could have been done with a right click drop down menu
Right-click menus can get cumbersome. I've seen a lot of software that suffers from function bloat - not that the functions don't work, or don't play well together, but that the user interface becomes too overwhelming for users as the number of available actions explodes. This is particularly tough for new users.
This is where voice controls could shine: as we interact with computers in more and more complex ways, we need a way to specify our desires simply and easily. And if we can't do so easily, the software has to remain simple to be usable.
concinds 21 hours ago [-]
The second half of your comment is a go-to-market concern but doesn't feel so relevant for a research prototype. It could be done with a private local model too, maybe not by Google.
But I don't think the voice problem is surmountable. I closed their image editing demo when I saw it required a mic.
It would be appealing as a Spotlight-like text pop-up interface where you type instructions, which would work in social/office environments, but that might only appeal to power users.
aquariusDue 21 hours ago [-]
This will sound like another brick in the paved road to dystopia but I'm kinda bullish on equipment that can recognize subvocalization. Or at least let me have a small drawing tablet with a stylus (think etch-a-sketch or Wacom Intuos) because at this point I'd rather practice writing and do away with typing altogether (even though I enjoy typing for typing's sake via MonkeyType).
NDlurker 15 hours ago [-]
I've been dreaming about that for 20 years. And then use it for people to communicate while sleeping.
why_at 20 hours ago [-]
Yeah I think there could be something to the integration of AI in an operating system so that it can handle things going on in different applications the same way you can already copy and paste between things.
But if it's going to require phoning home to some Google/OpenAI/whoever then forget it. I don't want a constant connection to my OS from one of these companies.
anon84873628 19 hours ago [-]
It seems that if we ultimately want to "move at the speed of thought," it will require speech.
Swizec 18 hours ago [-]
> It seems that if we ultimately want to "move at the speed of thought," it will require speech.
Except for the large majority of people who read, type, and click way faster than they can talk. Especially for visual things it’s way faster to drag a rectangle than to describe what you want.
A lot of us also aren’t linear verbal thinkers. It would take minutes to hours to verbalize concepts we can grasp visually/schematically in seconds.
Isn't "drag the rectangle" and visual interaction exactly the point of the research in the article? Speech is the perfect side channel to this interaction, not a context switch to text.
Also, I doubt DeepMind is designing for existing programmers and savvy computer users. They are thinking about the other billions of people in the world. Speech is the skill people will already have, not typing.
svachalek 17 hours ago [-]
Most people speak at about 150 wpm, but very few can type that fast. But reading and gesturing are fast, which is what TFA is about, combining reading and gesturing with speech.
Swizec 17 hours ago [-]
You rarely need 150wpm when typing. If you try dictation, you'll notice that half those words are error correction, checksum bits, and turn-taking filler.
I usually convey the same meaning with 80wpm typing. Makes it faster to read too
Maybe I'm just slightly ADHD, but listening to people talk drives me crazy. Get to the point! Much easier if they type it out.
dkersten 12 hours ago [-]
> listening to people talk drives me crazy.
People have so many verbal tics and filler words too. Anthropic’s Dario says “you know” after every third word, for example.
Or they meander around unrelated/unimportant details.
atq2119 15 hours ago [-]
There's the adage that writing is thinking, but even more accurately at least for me, editing is thinking.
Neither typing speed nor dictation speed is a true bottleneck, but editing speech seems like it'd be harder than editing text.
Though there may be some hybrid approach that can work well.
anon84873628 2 hours ago [-]
I suppose the idea is that the AI is going to do the "editing" for you (with all the consequences for "thinking" that implies).
You don't have to think about the design of your app. You just say what you want and the AI makes it appear. If you don't like something, you tell the AI to change it. You iterate live until you get the final result you want.
This is what writing docs has become for me. I have the agent make a draft, then tell it which sections to rewrite, combine, etc. I tell it the ideas I forgot to include. I manually make certain word-choice changes. The question is how you extend this flow to non-pure-text scenarios. For most people, just talking about what you see is probably the easiest.
jwx48 9 hours ago [-]
> editing is thinking.
I hadn’t realized until just now how accurate that is for me as well. Thank you.
unsupp0rted 4 hours ago [-]
> Doing this when you're not completely alone would be annoying to everyone around you
I'm surprised sub-vocal HCI isn't better developed by now. Perhaps because of this stuff coming out it will be.
Humans speaking to one another is literally telepathy: I'm putting my thoughts in your head, with lots of ambiguity and noise, of course.
With better sub-vocal tech we can control our devices without bothering each other.
You should look into how often people are using tools like WisprFlow and SuperWhisper. Voice is a very natural mechanism. Most people working in open floor plans are wearing headphones anyway. As long as you're not screaming, it's probably fine. Maybe we'll move away from open-plan offices in the bid for efficiency, which I would welcome.
shostack 15 hours ago [-]
I am moving fully remote because dictation is such a better input mechanism for most of my AI interactions that I have become less efficient sitting at my open-floorplan desk at the office, because I cannot dictate there and the latency adds up. Typing is just achingly slow these days.
Melatonic 15 hours ago [-]
I feel like I can type faster than I can talk but I could be totally wrong?
el_benhameen 14 hours ago [-]
I also feel this way, but more importantly, I feel like my sentences are more coherent when typed because typing allows for corrections and modifications of ideas. Do whispr people just … get coherent, finalized ideas out in a single shot without any misspoken words?
gtowey 14 hours ago [-]
They are not.
It's like a hidden curse of LLMs -- they're so good at parsing intended meaning from non-grammatically-correct language that we don't have to be very good at clear communication.
Eventually all LLMs will be controlled by humans uttering terse guttural grunts. We will all become Neanderthals, with machines that deliver our every whim.
jiehong 8 hours ago [-]
Transcription gets post-processed by an LLM (with different styles, based on prompts), so that it removes fillers, fixes homophones, changes the style, etc.
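As a toy illustration of the kind of cleanup that post-processing pass performs (a crude regex stand-in of my own; the real thing is a prompt to an LLM, which can also fix homophones and restyle, which this cannot):

```python
import re

# Naive filler-word stripper. A prompt-driven LLM pass would handle far
# more (homophones, style), but this shows the basic "remove fillers" step.
FILLERS = re.compile(r"\b(um+|uh+|you know|i mean)\b\s*", re.IGNORECASE)

def strip_fillers(transcript: str) -> str:
    cleaned = FILLERS.sub("", transcript)
    # Collapse any double spaces left behind by the removals.
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

For example, `strip_fillers("um so move the crab uh to the left")` yields `"so move the crab to the left"`.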
I recommend the youtube channel @afadingthought to see what people come up with (like v=283-z29TXeM).
jchw 15 hours ago [-]
You should look into how often people are using rectangles with buttons on them. They may be a bit archaic, but they are my preferred input method. For example, thanks to rectangles with buttons, the other people in my vicinity do not need to hear about the inane internet arguments I routinely involve myself in.
I dunno how I can express this best, but I found out a very long time ago that my problem with voice input wasn't that it wasn't good enough. My problem with voice input is that I don't want it. I am very happy for people who use these tools that they exist. I will not be them. Yes I am sure.
And yes, I know SuperWhisper can run offline, but it is a notable benefit that versus many modern speech recognition tools my keyboard does not require an always-active Internet connection, a subscription payment, or several teraflops of compute power.
I am not a flat-out luddite. I do use LLMs in some capacity, for whatever it is worth. Ethical issues or not, they are useful and probably here to stay. But my God, there are so many ways in which I am very happy to be "left behind".
shimman 15 hours ago [-]
I'm sorry, but if you think more than 1% of workers in the office use voice controls, you are in a massive bubble, my dude.
Don't know if you're making a joke, but a call center worker using a phone is not the same thing as a call center worker doing all their work on a phone. I worked in a call center for 4 years; one thing everyone needed after their shift was to just STFU for a few hours to decompress.
fny 19 hours ago [-]
It's possible to rely on mouth movements instead of sound. I've been tweaking visual speech recognition (VSR) models for the past few weeks so that I can "talk" to my agents at the office without pissing everyone off. It works okay. Limiting language to "move this" / "clear that" alongside context cues vastly simplifies the problem and makes it far more feasible on-device.
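A rough sketch of why a closed vocabulary simplifies things so much: recognition reduces to matching a tiny fixed grammar against whatever the pointer is over. (All names here are illustrative, not fny's actual code.)

```python
from dataclasses import dataclass

# Closed grammar: a handful of verbs plus deictic words ("this", "that")
# that are resolved by pointer/gaze context rather than description.
COMMANDS = {"move": "MOVE", "clear": "CLEAR"}
DEICTICS = {"this", "that", "it"}

@dataclass
class Action:
    verb: str
    target: object  # whatever the pointer is currently over

def parse_command(phrase, pointer_target):
    words = phrase.lower().split()
    if len(words) == 2 and words[0] in COMMANDS and words[1] in DEICTICS:
        return Action(COMMANDS[words[0]], pointer_target)
    return None  # outside the grammar: refuse rather than guess
```

So `parse_command("Move this", some_object)` resolves "this" to the pointed-at object, while anything outside the vocabulary is rejected, which is what keeps the recognition problem tractable on-device.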
I think it's brilliant UX.
swiftcoder 6 hours ago [-]
> I've been tweaking visual speech recognition models (VSR) for the past few weeks so that I can "talk" to my agents at the office without pissing everyone off.
Wouldn't SilentWhisper do just as good a job?
makeitdouble 18 hours ago [-]
No UX needs to be perfect for everyone, but this doesn't sound trivial to make reliable.
First things that came to mind:
- facial hair
- getting people to learn to make bigger mouth movements and not mumble
- we're constantly self-correcting our speech as we hear our voice. This removes the feedback loop.
- non english languages (god forbid bilingualism)
- camera angles and head movement
And that's thinking about it for 30s. I'm sure there are some really good use cases, but will any research group/company push through for years and years to make it really good even if the response is lukewarm?
encom 15 hours ago [-]
>non english languages (god forbid bilingualism)
In my experience, any combination of computers + speech + Danish has, so far without exception, been terrible. Last time I tested ChatGPT, it couldn't understand me at all. I spoke both in my local dialect and as close to Rigsdansk [π] as I could manage. Unusable performance, and in any case I should be able to talk normally, or there's no point. It was about a year ago - it may have improved but I doubt it. I'm completely done trying to talk to machines.
Yeah, I'd hate to use this in an open-plan office (which is like 99% of offices these days) and even using it alone at home would feel awkward. I don't really want to talk to the computer despite what 1950s sci-fi books led us to believe.
It's a cool idea for the future when we have reliable EEG headsets or Neuralink or whatever though.
Windchaser 3 hours ago [-]
> I don't really want to talk to the computer despite what 1950s sci-fi books led us to believe.
I'll talk to a computer, even in an office setting, if it adds enough value. But it's got to be a lot of value. Handsfree while driving is great, Iron Man talking to Jarvis while he's flying around makes sense. Many of us here are developers, engineers, or scientists, and our work has already been co-optimized with mouse and keyboard and whatever software we're in.
But when the software is less well-developed, or when it's not just dealing with technical data dumps, I imagine that a voice interface might be more useful.
So I think this idea has legs. But a successful implementation might well be decades out.
PAndreew 10 hours ago [-]
The only place I'd ever talk to a machine is my car. Instead of huge flashy screens that distract and kill thousands of people, maybe they could build a buttons + voice agent system that could actually be useful and durable. I hate having to tap Waze/Maps/etc. every time I go somewhere, and that I cannot comfortably switch to specific songs en route without risking my life...
schnitzelstoat 9 hours ago [-]
I connect my iPhone to my car and it requires Siri to be enabled which I can then use to change songs, Google Maps destinations etc. without having to touch anything.
The Siri voice transcription is pretty awful compared to what I've experienced with ChatGPT though, and it's weird going back almost to the pre-LLM world where you have to give such rigid, computer-coded voice commands.
ei23 12 hours ago [-]
>Anything with voice controls for routine use is a pretty tough sell. Doing this when you're not completely alone would be annoying to everyone around you.
Reads like the argument against cell phones where you don't have a cabinet around you...
schnitzelstoat 12 hours ago [-]
I wouldn't sit in the office talking on my phone next to my colleagues, that would be really annoying.
I'd go and find a small meeting room or conference call booth in the office and take it there.
Essentially, a cabinet.
ei23 2 hours ago [-]
I was referencing the discussion back in the day when telephone booths were a thing and cell phones came up.
hirako2000 12 hours ago [-]
The argument is against human to machine control. Not human to human communication.
In fact, when humans happen to order other humans, it's typically done in writing.
>We present a general-purpose implementation of Grossman and Balakrishnan's Bubble Cursor [broken link], the fastest general pointing facilitation technique in the literature. Our implementation functions with any application on the Windows 7 desktop. Our implementation functions across this infinite range of applications by analyzing pixels and by leveraging human corrections when it fails.
Transcript:
>We present the general-purpose implementation of the bubble cursor. The bubble cursor is an area cursor that expands to ensure that the nearest target is always selected. Our implementation functions on the Windows 7 desktop and any application for that platform. The bubble cursor was invented in 2005 by Grossman and Balakrishnan. However, a general-purpose implementation of this cursor, one that works with any application on a desktop, has not been deployed or evaluated. In fact, the bubble cursor is representative of a large body of target-aware techniques that remain difficult to deploy in practice. This is because techniques like the bubble cursor require knowledge of the locations and sizes of targets in an interface. [...]
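For anyone curious what the bubble cursor actually computes, here's a minimal sketch (my own illustrative Python, not the paper's code): select the target whose edge is nearest the cursor, then grow the bubble to contain that target without touching the next-nearest one.

```python
import math

def bubble_cursor(cursor, targets):
    """Select the target whose edge is nearest the cursor and size the
    bubble accordingly. cursor: (x, y); targets: list of (x, y, r) circles.
    Returns (selected_index, bubble_radius)."""
    def center_dist(t):
        return math.hypot(cursor[0] - t[0], cursor[1] - t[1])

    # Rank targets by distance from the cursor to their nearest edge.
    order = sorted(range(len(targets)),
                   key=lambda i: center_dist(targets[i]) - targets[i][2])
    nearest = order[0]
    # Bubble grows to engulf the nearest target's far edge, capped so it
    # never reaches the second-nearest target's near edge.
    radius = center_dist(targets[nearest]) + targets[nearest][2]
    if len(order) > 1:
        second = order[1]
        radius = min(radius, center_dist(targets[second]) - targets[second][2])
    return nearest, radius
```

The deployment difficulty the transcript mentions is exactly the `targets` argument: on a real desktop, nothing hands you those positions and sizes, which is why their implementation has to infer them from pixels.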
Right — it does seem cool but the voice is patching over a major gap. If I'm talking already, why wouldn't I just describe what I'm looking at and have the AI grab it for me?
dgently7 15 hours ago [-]
Pull up any moderately busy picture with more than a trivial number of objects; pictures of "traffic" or similar repetition are great for this demo. Pick one specific object (like a specific tire on one car) in the image and write (or say) out all the words you'd need to specify that exact object. Now take the same image and point at the object with your mouse or circle it with an annotation tool. It's often very, very hard to describe accurately which object you are talking about; you will often resort to vague "location" words anyway, like "on the upper left", that try to define the position in a coarse way that requires careful parsing to understand. Pointing/annotating is massively superior in brevity, clarity, and speed.
jdougan 11 hours ago [-]
Nothing new under the sun. "Put that there" demo, 1982.
I think they answer that question pretty convincingly: because if what you're looking at is already on the screen, it's much easier to point to it and say "that" than to describe it.
(And if it's an abstract entity like a file, it might not even be possible to describe it, short of rattling off the entire file path)
YeGoblynQueenne 8 hours ago [-]
Yes, it does seem kinda ... pointless.
nolist_policy 24 hours ago [-]
The "Edit an Image" Demo at the bottom is pretty fun. Maybe this is just Google flexing their LLM inference capacity.
maccard 21 hours ago [-]
That demo was an absolute disaster for me on Firefox on Mac. It just fundamentally didn't work - the voice was way behind my pointer, there were multiple agents speaking over each other saying conflicting things, and it couldn't even move the crab to the bottom right of the image. Embarrassingly bad, I would say!
Our_Benefactors 16 hours ago [-]
Yup - what google is suggesting here will never materialize beyond being a slopfeature. People who want these bespoke workflows will build them or seek out specific tools that enable them, not trusting some overarching daemon that contextually watches their cursor. I don't trust google one bit to execute correctly on something like this.
hansmayer 11 hours ago [-]
Well, you see, to really, really sell it to the common folks, they need to convince you that chatbots are the "Intelligence". So they are coming up with all sorts of crap, like this one. The TV advertisements for Gemini and co. are indicative of how they see the average user: as an idiot of sorts who needs the shit-device for pretty much anything. Oh, you spilled some water on the countertop? Quick, ask Gemini what to do! You are a 20-something individual home alone? Quick, lay on the couch and ask Gemini if you can really talk to it, omg, it's so exciting! You were on holiday all alone, but in the middle of a really large crowd? Gemini to the rescue: cut those people out and make it look like it was an exclusive spot, just for you! Nobody else was there. So this proposal is going in the same direction - probably targeting the average office "idiot".
tim-projects 1 hour ago [-]
I looked at the first example and I'm astonished that they took a standard click and drag mouse move, injected the need to speak into an llm and then acted like it was revolutionary.
Imagine trying to convince someone in the 90s that that's a step forward.
arjie 24 hours ago [-]
Oh interesting, this is very cool. At first I thought it was just focus-follows-mouse but it's more interesting. You have certain keywords trigger "add to prompt". Ignoring the voice functionality (which is admittedly crucial currently because other inputs currently take over focus), I've often wanted to just have a continuous conversation with the LLM as I 'point and click' (or tab over and select) at various things. Might be neat to have text input focus continue to go to the LLM where I'm typing text etc.
Sometimes I go to a different page to take a screenshot and other times I'm browsing for a file, and other times I'm highlighting some log lines. Cursor did this well, with selecting text in the terminal auto-focusing the Cursor agent textbox so you could talk to the agent and then select some text and you didn't have to re-select the original agent textbox again. The agent is a top-level function in that system not "just another app I have to switch to" to take my context with.
I have some small amount of bias because I've always felt input-constrained on computers. I have to move my hands to go places and that's exasperating. I've tried head tracking, had a vim pedal for a while, and used tiling WMs, and things like this to aid but while my vim-fu is pretty good and I function inside things very well with it, my cross-application interface isn't.
In the end, perhaps we all have our home offices with our Apple Vision Pros and we talk to them like this to maneuver faster through our machines and get our ideas into them.
Cool research. I wonder what we'll end up with.
skydhash 7 hours ago [-]
> I have some small amount of bias because I've always felt input-constrained on computers. I have to move my hands to go places and that's exasperating. I've tried head tracking, had a vim pedal for a while, and used tiling WMs, and things like this to aid but while my vim-fu is pretty good and I function inside things very well with it, my cross-application interface isn't
Why not constrain your computing? It will require some programming chops, but you can note down your common tasks, figure out where actual input are required, and automate the rest.
footy 22 hours ago [-]
you can really tell the people building these tools spend a lot of time alone. I work from a home office 90% of the time and I wouldn't want this to be my workflow. I don't want to talk to my computer, I want to listen to music while I work, and I want to not sound deranged and disturb everyone around me when I am working from a coffeeshop or the open-plan office or the airport or the train or whatever.
and that's aside from the obvious privacy problems.
lorecore 20 hours ago [-]
Agreed. A lot of Google products feel anti-social (like Google glass). They are definitely missing a human touch. Perhaps a byproduct of elitism and leet code grind filtering of employees mixed with founder personality.
schnitzelstoat 12 hours ago [-]
I can actually see more use-cases for Google Glass tbh - having a HUD would be useful when using Google Maps for example, like a fighter pilot.
And being able to take photos/videos with the glasses (like the Meta ones nowadays) is really useful with my kid because he often does funny or cute stuff and I don't have time to pull my phone out to take a video/photo of it. I guess it could be useful for video calls too so my parents can see him.
But I just don't see anyone sitting in an office, or even at home, talking to their computer. It's really only useful for hands-free settings like when you are driving, or in the kitchen etc.
lorecore 6 hours ago [-]
It's anti-social because other people don't want to be filmed by you.
skydhash 7 hours ago [-]
> he often does funny or cute stuff and I don't have time to pull my phone out to take a video/photo of it.
IMO, that's the antisocial part. Why is a phone butting in on your relationship with your kid? I only have a few pictures taken from when I was a kid (and most were for some grand occasion). And I'm happy that was the case, because of the many cringeworthy things I did.
_def 20 hours ago [-]
Nothing else expected - one of the examples in this very article marks some text in a doc and prompts "make this more human"
sipjca 18 hours ago [-]
thats incredible
notatoad 17 hours ago [-]
somebody the other day described google products as being a 22 year old imagining what their dad would want, and that feels very true to me.
order-matters 21 hours ago [-]
yeah, i understand the frustration of needing to do all the communication through typing and clicking and that it can feel limiting - but i want the computer to be less demanding of my physical reality, not more. i want to be able to talk to someone on the phone, work on something with my hands, and still successfully manage my compute tasks. improvement can only be made by requiring less attention to the screen and fewer hand movements, not adding in anything new like voice
mikeocool 17 hours ago [-]
Yeah, I talk to someone on google meet who will seamlessly transition between talking to me and talking to Claude while on the call, and it is extremely annoying.
skydhash 21 hours ago [-]
> i want to be able to talk to someone on the phone, work on something with my hands, and still successfully manage my compute tasks.
Maybe you can share a scenario for that one? I can’t figure a scenario where all of this needs to be true. It seems like a recipe for accidents.
duskdozer 16 hours ago [-]
It's not exactly "compute tasks" in this sense, but online gaming is similar. People are constantly talking to each other while managing multiple other kb/mouse inputs at the same time.
chromacity 23 hours ago [-]
My reaction to the first demo (recipe) is that it was slower than typing the same thing on your keyboard.
The second demo seems to be a wash: there's no time saved in saying "move this" versus "move crab". And an app-specific contextual menu would probably be faster.
The third demo doesn't seem to warrant the use of a pointer at all, since there is only one way to interpret the prompt.
None of this means that this approach will not be successful, but there's a reason why so many attempts to revolutionize user interfaces ended up going nowhere. Talking to your computer was always supposed to be the future, but in practice, it's slower and more finicky than typing.
In fact, the only new UI paradigm of the past 28+ years appears to have been touchscreens and swipe gestures on phones. But they are a matter of necessity. No one wants to finger-paint on a desktop screen.
joe5150 23 hours ago [-]
Talking to your computer can only ever work for people in atomized work-from-home silos, surely. I can't really imagine living in a world where everybody is just muttering commands to the computer all the time.
bonsai_spool 20 hours ago [-]
This happens daily in radiology departments around the world
zbentley 17 hours ago [-]
Aren’t radiologists dictating notes rather than issuing commands to the computer? From supporting them for a few years, I recall them having pretty good facility using the computer to zoom/filter/isolate parts of images, and most of the muttering was speech to text or a good old tape recorder for their writeups.
bonsai_spool 16 hours ago [-]
The dictation happens while reviewing images and the dictation software includes support for voice macros ranging from edits to adding information from the chart and other applications. Not quite the same as just recording.
wffurr 8 hours ago [-]
Working near the Android Assistant team is like this. It's pretty chaotic.
raincole 7 hours ago [-]
> My reaction to the first demo (recipe) is that it was slower than typing the same thing on your keyboard.
For you and me, who have used keyboards for more than 1,000 or even 10,000 hours of our lives.
There was a brief period when typing slowed people down, because they could write the same information faster with pen & paper; that period eventually passed.
big-chungus4 12 hours ago [-]
But you have been using your mouse and keyboard for many years, so you know well how to use them quickly. I don't think you should expect to be quick with a new input type when you've only tried it for a few minutes.
AuthAuth 18 hours ago [-]
All these problems don't need a conversation with an AI to solve. If you make the text selectable, then you can do these actions quickly and efficiently. This only becomes a product as they make the web shittier and less friendly toward PC workflows.
It's wild that they even put this out as a demo. It should have been picked apart in the internal meeting. There is no way I'd ever show my product taking 5s to change a 1 to a 2 in a piece of text the user was already hovering over, or taking 10s to drag and drop a line of text from one box to another. Even the example of finding a route between two images could be done quickly if images were auto-OCR'd, which is a setting in most image viewers.
kjellsbells 24 hours ago [-]
I sense a privacy problem brewing.
It reminds me of Microsoft Recall in the sense that some portion of the screen is going to be continuously transmitted outside of the users control.
What happens when someone browses something very private (planning a surprise engagement, looking at medical data, planning a protest)? All that data gets slurped to Google and subjected to a warrant or discovery, or used to build your advertising fingerprint.
Maybe the idea is that the data is sent to AI only when you right click, but that seems like a very thin firewall that a product manager will breach in the interests of delivering "predictive AI" via some kind of precomputed results.
xigoi 9 hours ago [-]
That’s the whole point. Users who want this “feature” will give Google permission to permanently record their screen.
accurrent 21 hours ago [-]
I hope we will eventually reach a point where these megacorps figure out that running these things locally might be more cost-effective. FWIW, I think the local models I run on my MacBook are good enough for most of the tasks that this kind of interaction would ask for.
mrhottakes 22 hours ago [-]
> What happens when someone browses something very private?
Profit!
Nathanba 17 hours ago [-]
Kind of incredible how consistently terrible Google is at everything they do in the AI space. So they choose this demo, write a big blog post, and advertise it... and the demo is horrible; it doesn't work. It doesn't track what I circle with the mouse, just, apparently, where the pointer landed at the end, and only exactly where it landed. Multiple times it said "Got it, I'll move this empty space between the clouds over here" or "Got it, I'll convert this empty area to a sunhat" despite my mouse being only a few pixels away from an actual hat.
torben-friis 18 hours ago [-]
Do you know the deep frustration of watching a tech illiterate person use a PC? When they type google in the omnibar and click on Google.com for every search, for example?
Now you get to hear every person in the office do that around you.
Like, good tech, but do googlers live in the real world? Do they genuinely like the idea of an open office full of people talking to their computers? Do they all live alone without human contact?
ImaCake 20 hours ago [-]
I think this falls flat for a technical audience because we already know how to do this stuff. But there are a lot of people who don't know how to copy-paste, or use reverse image search, or apply a filter to a table. Being able to use plain language to do these things is a game changer for them. Sure, it's inefficient and inelegant, but it's an interface that will do for basic technical stuff what the iPad touch screen did for the mouse and keyboard.
8note 20 hours ago [-]
one step back, for both technical and non-technical users, is the knowledge that that's even a problem you have.
the agent occasionally spots your real problem, like an experienced engineer would
fatata123 14 hours ago [-]
[dead]
juancn 23 hours ago [-]
Please don't.
I like text selection exactly how it is. I want precise controls.
It's fine for a touch interface like a phone, but on a computer I expect precision. As much as I can get.
exploderate 4 hours ago [-]
Ah wow, that must be the most expensive mouse pointer ever.
I wonder if it can also click links. I just tell AI to "click this" and it will figure out what is under the pointer, query a graph of UI widgets and trigger a "click" event for me. Maybe with a sub-agent that has application context, "this is a browser".
If we manage that, my plans for a pure XML based shell might not be too futuristic '<in><ls/></in><out><tree><file date="CDATA[...]">' ...
tintor 1 days ago [-]
Of course, it isn't a Google demo if you can't use it to book a table at a restaurant. (Shown at the bottom of the page.)
rhet0rica 14 hours ago [-]
My thoughts exactly. Why the fuck is this the primary concern of EVERY person who works at Google?!
cmrdporcupine 7 hours ago [-]
What else can you do in the Bay Area sprawl other than work and go to restaurants? Is there anything else?
bitwize 13 hours ago [-]
Aside from work—commerce, entertainment, and social communication are pretty much the only things most normies will ever use a computer for. Maybe light accounting and scheduling as well. This has been true since the 90s when the normie market for computers opened way up due to the internet becoming widely available, and companies have been demoing commercial applications for their equipment and software since then if not even before. Ordering food for delivery was the big one back then. Here's Jennifer Aniston and Matthew Perry ordering Chinese via fax with the new Windows 95 in 1995: https://www.youtube.com/watch?v=vLlWrt-zmTo&t=685s
I'm sure Don Hopkins can tell you a long annotated tale about the NeWS pizza ordering app that displayed a real-time dynamically-updated rotating pie on the screen as you filled out your order.
wffurr 8 hours ago [-]
"Wiggle" the mouse cursor to do something - isn't this incredibly easy to trigger by accident? I remember turning on a "find cursor" mode in Windows once upon a time where it would zoom the mouse cursor on a wiggle, and it fired all the time by accident. Imagine an older person or child who is even less accurate with the mouse, too.
botanrice 22 hours ago [-]
while these examples might be easy fodder for criticism, i do feel like this whole idea of talking to an LLM across multiple applications, with anything your pointer is on giving it context, is a pretty powerful and cool idea.
I'm imagining a webpage with a link - instead of opening a new tab to quickly google something, or opening three new tabs based on hyperlinks, i can point at a paragraph or line and ask it to tell me about it.
Maybe I can point at a song on Spotify and have it find me the youtube video, or vice versa (of course this is assuming a tool like this wouldn't stay locked into one ecosystem.. which it will).
Point is that the concept of talking to the computer with the mouse as a pointer is pretty cool, and i guess a step closer to that whole sci-fi "look at this part of the screen and do something"
Nathanba 16 hours ago [-]
I agree that AI audio interfaces will be the future but not because they are better UIs for users as we understand the term "user" today. The future users of UI are not users of UI at all, they want nothing to do with learning UI or what buttons to press or where to type something. They want to go to the shopping site and instead of typing anything into a search field they want to say "Find me some boots for the summer, I wanna look fresh" and then tell it to complete the purchase via voice as well once it found something. At most they'll still click some filters in the results page and on individual results but that will be it.
holoduke 22 hours ago [-]
Yeah. We just need 10x more compute. But constant ai analysing of everything is the ultimate direction.
botanrice 1 hours ago [-]
I wonder what the smallest, still practical, application of this would be now if you could implement it locally on your machine assuming you have decent hardware.
Maybe something with the file system? Like hover over any file and, instead of seeing a snapshot or some details/metadata, you get a quick 1-2 line description of what it is. I suppose you'd want to save that somewhere as well, to cache it... but I'm def not an expert.
This would solve the always difficult issue of finding that one document! I still have trouble with document search on both macOS and Windows sometimes.
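A local-only version of this could be fairly small. A minimal sketch of the caching part, assuming some local model behind a `summarize(path)` callable (the names and cache shape here are mine, purely illustrative):

```python
import os


class FileSummaryCache:
    """Cache short file descriptions keyed by (path, mtime), so the
    expensive summarization only reruns when a file actually changes."""

    def __init__(self, summarize):
        # `summarize` is any callable(path) -> short description,
        # e.g. a local model; here it is a placeholder.
        self.summarize = summarize
        self.cache = {}

    def describe(self, path):
        key = (path, os.path.getmtime(path))
        if key not in self.cache:
            self.cache[key] = self.summarize(path)
        return self.cache[key]
```

Hovering the same unchanged file twice would then hit the cache, which is what makes the interaction feel instant after the first pass.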
devhouse 20 hours ago [-]
What I actually would be more willing to allow is a version of this that is built into macOS, runs locally, never phones home. If Apple Intelligence made an "AI sees everything on my screen", I might turn it on.
gobdovan 23 hours ago [-]
This is how I always imagined FE development would work once ChatGPT 3 came out. Then Cursor appeared and seeing how successful they were with just a chat and a few tool calls, I thought I was over-complicating things.
Anyway, I built a prototype on this idea, but instead of relying only on hover, I press Option to select a node in a custom AST-ish semantic layer I designed around a minimalist UI grammar, and Option + up/down arrows to move to the parent/child node. This way, I have an accurate pointer to the element I want to talk about, plus a minimal context window (parent component, state, a few navigation-related queries).
What I learned from using it, though, is that the killer use case isn't necessarily the flashy "talk to this UI element" interaction shown in the Google demos. I do use it that way too; I have `Option + Shift + click` to copy a selector to the clipboard, so I can give an LLM connected to the live medium a precise reference to the element I want to discuss.
But the place where it has been most useful day to day is much simpler: source navigation. Point at the thing in the UI, jump to the code that is responsible for it. The difficult part is jumping to the code you care about (the code for UI or for the semantic element?), but in my system that distinction turned out to be usually obvious, which is what makes the interaction useful.
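The "accurate pointer plus minimal context" part can be sketched in a few lines. This is my own illustrative reconstruction, not the commenter's actual code, with nodes modeled as plain dicts linked by a `parent` key:

```python
def element_context(node):
    """Build a minimal context window for a selected UI node: a
    CSS-ish selector derived from the parent chain, plus the node's
    local state, small enough to hand straight to an LLM."""
    chain = []
    cur = node
    while cur is not None:
        chain.append(cur)
        cur = cur.get("parent")
    # Root-first selector, e.g. "App > Panel > Button"
    selector = " > ".join(n["name"] for n in reversed(chain))
    return {"selector": selector, "state": node.get("state", {})}
```

The same selector string doubles as the clipboard payload for the `Option + Shift + click` interaction described above.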
jiehong 8 hours ago [-]
Now, with vim/helix key bindings:
1. select text
2. dictate action
Feels very similar to Helix's select text and act on it.
I think text selection could also be voice controlled (with a modal voice input), so one could say: "select sentence, action mode, copy and paste it in my list and remove duplicates"
danielbusnz123 5 hours ago [-]
I started building an open-source Rust version of this idea for Linux/Hyprland last week, before the announcement. Different design (OS-level instead of browser-only, Anthropic Haiku 4.5 + Computer Use instead of Gemini, sub-3s release-to-speech). Code: github.com/danielbusnz-lgtm/aegis
iandanforth 6 hours ago [-]
Deepmind hype is the worst hype. They do genuinely cool stuff and talk about it publicly, but don't make it available. Or it's only available to a tiny select few. Just shut up about the things you're doing until they are ready. You're part of a consumer products company, not a university PR department.
TonyAlicea10 17 hours ago [-]
At the core this is recreating the right click via voice.
Interesting but not “reimagining” anything.
I think the real story here is how vibe coding now enables flashy demo sites like this to be built for a concept that hasn’t yet earned it.
I've been iterating on some 3D models for various wacky garage projects I have. It's fun. I've often wished I could click on an arbitrary place and say "add an eye bolt here" or somesuch.
Of course learning proper cad software is probably the right thing here, but having Claude write python scripts which generate HTML files which reference three.js to provide a 3d view has gotten me surprisingly far. If something could take my pointer click and reverse whatever coordinate transforms are between the source code and my screen such that the model sees my click in terms of the same coordinate system it's writing python in, well that would be pretty slick.
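The "reverse the coordinate transforms" step is mostly bookkeeping. A minimal sketch of the simplest case, an axis-aligned orthographic view (the function and parameter names are mine; a real three.js scene with a perspective camera would use a raycaster instead):

```python
def screen_to_model(px, py, width, height, center, half_extent):
    """Map a pixel click to model-space coordinates for an orthographic
    view centered on `center` and spanning +/- half_extent.
    (Aspect ratio is ignored to keep the sketch short.)"""
    # pixel -> normalized device coordinates in [-1, 1];
    # screen y grows downward, so flip it
    ndx = (px / width) * 2 - 1
    ndy = 1 - (py / height) * 2
    # NDC -> model space
    cx, cy = center
    return (cx + ndx * half_extent, cy + ndy * half_extent)
```

With something like this inverting the view transform, a click on the rendered model lands in the same coordinate system the generating script writes in, which is what would let "add an eye bolt here" resolve to a concrete position.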
accurrent 21 hours ago [-]
Hah, Ive been thinking the same thing. Recently I prototyped a 2d paint app to validate the idea using chrome's Prompt API: https://arjo129.github.io/apps/voice_paint.html
Honestly, what'd be epic is if this could be made to work with an XR headset. Imagine using the headset to capture the piece you are working on and saying "hey, can you drill some holes over here?"
__MatrixMan__ 19 hours ago [-]
That would be pretty excellent. We need a layer agnostic notion of "here".
Until then I've just had it list every surface in a legend, each colored differently, so I can say "three inches down from the top of pole six, and rotate it so the hoop part of the bolt faces northwest."
walrus01 21 hours ago [-]
Can I use this AI mouse pointer to tell the difference between hotdog or not hotdog?
Legend2440 21 hours ago [-]
Indeed you could!
AbuAssar 1 days ago [-]
so Google will be monitoring whatever is on the screen continuously, or only when the user says the magic words (this, that, here, there)?
EdgeExplorer 1 days ago [-]
Indeed. "AI-enabled pointer" is misdirection. This isn't an AI-enabled pointer; it's sending screen to AI, which yes, includes pointer position. The AI doesn't live in the pointer. The AI lives, apparently, so thoroughly in the system that it can see and do anything, and the pointer is just a way of giving it context.
OtomotO 1 days ago [-]
Google Recall. Hey, it's all about the marketing.
nolist_policy 1 days ago [-]
Wiggle at CAPTCHAs, wiggle at Termux, wiggle at Emacs, wiggle at the Godot Editor, wiggle at my remote desktop.
(Not going to happen)
iamcalledrob 12 hours ago [-]
Nitpick, but it bothers me: The human factors of their demo video don't stack up.
Horizontal dragging with a mouse is actually really hard. Nobody's going to use it like that.
Your arm can easily move your hand and cursor up/down by pivoting your shoulder, but there's no mechanism for left/right movement. It's always an arc.
Or put another way: selection will be a lot slower and more tedious than the demo.
loaderchips 1 days ago [-]
It's beautiful how the human mind can take something very obvious but overlooked and make it into this fantastic innovation. Fab stuff.
maheenaslam 23 hours ago [-]
The concept is good, but accuracy in cluttered environments can be a concern, and misinterpreting context can be a problem.
grumbelbart2 12 hours ago [-]
What is going on with the font in this article? My Firefox renders it as a weird mixture of lower- and upper-case letters, all with the same height. Completely unreadable. Culprit seems to be this:
font-feature-settings: "ss02" on;
ungreased0675 15 hours ago [-]
To me, this is a bizarre and weird way of using a computer, but I’m glad they’re doing research and trying new things.
chamomeal 15 hours ago [-]
Kinda related, but this reminded me of the guy who made a voice controlled text editing language. It's kinda like vim with your voice. Super cool talk here:
CLI is peak computing. No way this is more convenient than a terminal.
lifis 9 hours ago [-]
Not clear what it actually does, but seems equivalent to a global right click menu with "Chat with AI about this"
vicentwu 9 hours ago [-]
Good work! Context-awareness has huge potential. I don't think this demo hit the right mark, but it definitely shed some light.
robot-wrangler 22 hours ago [-]
A zigzag merge gesture is obviously a terrible idea until/unless everything is a touch screen. Did they even think about this stuff at all? Ergonomics and RSI aside, if a horizontal drag means add, why not just make vertical drag mean merge. Not a fan of voice interaction generally, but it's something we'll all be grateful for as we get older. No need to accelerate it
dandaka 23 hours ago [-]
Next generation of OS should have constant video and audio recognition by on device LLM. This will provide valuable context for a lot of scenarios. So instead of frequent copy-pasting we are used to, we can let agents access context of our whole workflows from different apps.
But Google is a very ill positioned candidate for such OS. I would rather trust Apple and local-first on-device models.
mrhottakes 22 hours ago [-]
Next generation OS should absolutely -not- have always-on surveillance like you describe.
dandaka 9 hours ago [-]
why not? if context stays on device and the operator is in full control, what downsides are there? there is no observer here, only the operator, with the full context of his activity at his fingertips
>Sousveillance (/suːˈveɪləns/ soo-VAY-lənss) is the recording of an activity by a member of the public, rather than a person or organisation in authority, typically by way of small wearable or portable personal technologies.[14] The term, coined by Steve Mann,[15] stems from the contrasting French words sur, meaning "above", and sous, meaning "below", i.e. "surveillance" denotes the "eye in the sky" watching from above, whereas "sousveillance" denotes bringing the means of observation down to human level, either physically (by mounting cameras on people rather than on buildings) or hierarchically (with ordinary people observing, rather than by higher authorities or by architectural means).[16][17][23]
dandaka 9 hours ago [-]
The goal is not to record the activities of others; it is to have full context of the operator's own activity. More like a personal knowledge base than a camera pointed at a cop.
vjvjvjvjghv 16 hours ago [-]
I think combining voice and mouse will be perfect for a lot of design work like photo editing or CAD. Like pointing at an edge and saying “put a 2mm chamfer here”. To me this would be a really nice workflow.
ianbicking 22 hours ago [-]
I've been doing something similar to this in a personal claude code frontend, though not particularly "magical".
I'm mostly using my system to make comments on long AI-generated documents (especially design documents). I find it works well to have the AI generate something, and then I read through it, making comments along the way.
You can get pretty far just repeating the things you see... "I'm reading [heading] and [comments]". But I do find some use in selecting content and saying "I don't agree with this" or whatever else.
The result is just an augmented message. It looks like:
<transcript>
Let's see what we've got here.
<selection doc="proposal.md" location="paragraph 3">
The system already...
</selection>
No, I don't like how this is approaching the problem, ...
</transcript>
Then I just send this as a user message. Claude Code (and I'm guessing any of the agentic systems) picks up on the markup very easily. It also helps to label it as a transcript, as it can understand there may be errors, and things like spelling and punctuation are inferred not deliberate. (Some additional instruction is necessary to help it understand, for example, that it should look for homophones that might make more sense in context.)
It makes reviewing feel pretty relaxed and natural. I've played around with similar note taking systems, which I think could be great for studying in school, but haven't had the focus on that particular problem to take it very far.
But I think the best thing really is giving the agent a richer understanding of what the user is experiencing and doing and just creating a rich representation of that. The keywords can be useful, but almost only as checkpoints: a keyword can identify the moment to take the transcript and package it up and deliver it.
One difference perhaps in design motivation: I have really embraced long latency interactions. I use ChatGPT with extended thinking by default, and just suck it up when the answer didn't really require thinking. I deliver 10 points of feedback at once instead of little by little. (Often halfway through I explicitly contradict myself, because I'm thinking out loud and my ideas are developing.) I just don't stress out about latency or feedback, and so low-latency but lower-intelligence interactions don't do it for me (such as ChatGPT's advanced voice mode, or probably Thinking Machine's work). I think this focus is in part a value statement: I'm trying to do higher quality work, not faster work.
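Assembling such an augmented message is straightforward. One possible sketch (the data shapes are mine, inferred from the example transcript earlier in the comment):

```python
def augment_transcript(parts):
    """Interleave spoken text with <selection> blocks marking what the
    reviewer had selected while speaking, producing one user message."""
    lines = ["<transcript>"]
    for part in parts:
        if isinstance(part, dict):  # a selection: doc, location, text
            lines.append(
                f'<selection doc="{part["doc"]}" location="{part["location"]}">'
            )
            lines.append(part["text"])
            lines.append("</selection>")
        else:  # plain spoken text
            lines.append(part)
    lines.append("</transcript>")
    return "\n".join(lines)
```

The resulting string is sent as an ordinary user message; the agent picks up the markup without any special tooling on its side.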
8note 20 hours ago [-]
this is pretty built in with vscode plugins.
you select text in vscode, and write a comment, and the llm gets both
altern8 12 hours ago [-]
I can type pretty fast, so this seems like it would slow me down. Also, everyone in the office would get annoyed pretty quickly.
hmokiguess 24 hours ago [-]
Don't build these things, instead build protocols and expose system level APIs for application developers to build things.
SilverSurfer972 13 hours ago [-]
If we want to close the Human <-> Machine loop as much as possible (pre-Neuralink):
Assuming that today the most efficient way for human to transfer information to machines is via voice.
Assuming for machines to convey rich information to humans that's by printing html.
Then a combination of screen + eye tracking + voice is all you need.
The mouse doesn't make sense anymore.
I don't see how the examples given are much better than just natural language. With support for chaining multiple thought+capture then this could be pretty expressive and a nice accessible tool. I could see how eye gaze could be incorporated as well.
jaccola 1 days ago [-]
This seems like one of those things that is used infrequently enough to be forgotten/poorly developed/never used. (Even before accounting for the actual failure rate of the LLM, which will be non-zero.)
Perhaps a text box and file upload isn’t the perfect interface for every use case but it is versatile which is a huge barrier to overcome.
amelius 21 hours ago [-]
Reimagine the chat interface first. For example, let the user click where the LLM went off the rails.
kixiQu 21 hours ago [-]
I'm pretty sure all these models have terms of service that make the user assert that they have permission to use the content you're feeding into them (clickwrap infringement-is-the-user's-fault). This kind of integration makes a mockery of that.
1970-01-01 22 hours ago [-]
How about you give me my normal white cursor and an "AI enhanced" orange cursor only when I'm doing AI things. To use their words, that would be "intuitive AI that meets users across all the tools they use, without interrupting their flow"
thih9 11 hours ago [-]
I guess Apple is taking notes and launching a version of that later in visionOS.
iridione 1 days ago [-]
Interesting! I wonder how UI will evolve in the long-term? If there are browser-use/computer-use and clicky-clones automating pointer actions, do we really need complex UI anymore? If yes, when?
Ancapistani 24 hours ago [-]
I've been playing with writing a visionOS app that allows an AI agent to be aware of what you're looking at at any given time.
At some point I fully expect eye tracking (or attention tracking) to be common enough to be a first-class input method.
strgrd 1 days ago [-]
No thanks
SirFatty 1 days ago [-]
It only took Google and their AI offering to come up with Graffiti.
chatmasta 19 hours ago [-]
I’m having flashbacks to Windows 7 gadgets. I can already imagine some developer marketplace for creating cursor prompts.
ivanjermakov 22 hours ago [-]
I spent quite some effort to _completely_ get rid of mouse usage in my computer workflows and I believe it paid off.
mvdtnz 1 days ago [-]
Both of the text based demos would have been simpler and faster with traditional mouse and keyboard interactions. What is the AI adding?
layer8 22 hours ago [-]
The AI is adding that you don’t need to provide a button or menu item or keyboard shortcut for each possible action, or for the user having to know that the command is there and how to access it.
I prefer keyboard operation myself, but I can see how this could become useful in the future, for certain use cases.
What would be useful as well then is if you could bind such a repeatedly-used AI command to button/menu item/keyboard shortcut in a way that it can still be used with pointing “this” and “that”.
duskdozer 16 hours ago [-]
It seems like the natural extension of the (unfortunate) simplification of UIs where everything is "cleanly" hidden behind multiple menus if not removed.
hapticmonkey 18 hours ago [-]
Maybe there’s potential here for certain accessibility scenarios for users who would benefit. But otherwise, leave the mouse cursor alone.
mrhottakes 22 hours ago [-]
> What is the AI adding?
More $$$ for the PM who launched the product.
hyperhello 1 days ago [-]
They’re going to take your abilities to do anything and spread it across many places so you have to run around to do them, same as all the moneyed technology.
wartywhoa23 23 hours ago [-]
Hype-flavored surveillance!
dieselgate 22 hours ago [-]
Yeah it's a gimmick
AuthAuth 18 hours ago [-]
latency, cost and undefined behavior
dfxm12 24 hours ago [-]
It tracks what's on the screen and sends it back to Alphabet. If you're watching a video about BBQ, enjoy a bunch of ads for Omaha steaks and big green egg in your Gmail.
On a less serious note, the audience for this is people who want to optimize for what seems like the least amount of effort.
slopinthebag 1 days ago [-]
It feels like everything modern is like this. No value added, just the appearance of it.
beepbooptheory 4 hours ago [-]
This page forced me to hard reboot my whole (older, Android) phone?
promiseofbeans 17 hours ago [-]
Their video demo is interesting. If that were to be useful, it would need to work on sites like Netflix, and for that to work, they would presumably have to axe DRM. I am fully in favour of removing the pointless energy tax we pay as a society for the highly flawed and ineffective system of video DRM.
Unless of course, their AI gets the same special privileges as the gpu in accessing drm content, and everything else is still locked out.
pshirshov 7 hours ago [-]
Could we play bullshit bingo?
Bullshit!
You don't need any new metaphors to support such (questionable) flows - at all.
Swipes instead of selection rectangles are annoying - you don't see the traces of the swipes on these demo gifs! So, you've effectively "selected" something - but you have to keep in mind WHAT you selected.
Total ridiculous bullshit.
jaredcwhite 20 hours ago [-]
Haha, April Fools! Good one.
Wait…it's May. Ugh, I'm so confused. :spiral eyes emoji:
imdsm 12 hours ago [-]
Actually don't hate the concept
felixsebastian 5 hours ago [-]
i think it's beautiful
elteto 20 hours ago [-]
The image editing demo was fun... the model is not very well censored.
LocalH 1 days ago [-]
do not want
adrianwaj 17 hours ago [-]
I like the idea of a touch screen that you don't touch. Just a few centimeters (variable) off would be fine - they do exist, but I've never used one. I think what's required is a way to slightly flex one finger to activate cursor movement, then a pinky/thumb twitch for a button press. Maybe wear two magnetic silicone nail bands? Current air-touch technologies seem complex and power-hungry, I don't know.
Would be tiresome though to hold hand out all day - but good for mobile and handwriting/drawing. Need zero latency.
Furthermore, the mousepad could become the magnetic sensor and not the screen in order to rest the palm. The nail bands then become the equivalent of the mouse so it's a hall effect mouse. But could the pad detect finger twitch for the buttons, though?
RamblingCTO 10 hours ago [-]
What a load of horsecrap. Google was never good at usability or UX, but this is a new low. This is as ambiguous as it gets, and good UX is the opposite of that. If I need to undo half the stuff that happened, or an AI starts doing stuff I don't want just because I'm moving my mouse in a certain way, I'd just get angry and turn it off.
xiphias2 23 hours ago [-]
Google needs to beat OpenAI and Anthropic in coding models, because that's where the big money is going. I love using the Gemini Pro model for quick questions, but that's not where I'm spending the real money.
They have so many great software engineers, but they are unable to use them to speed up coding-AI research. Hopefully with Sergey's focus it will get better.
This cursor thing is just another experiment nobody cares about.
PufPufPuf 21 hours ago [-]
That example with the recipe is funny. Did they really need AI to copy two lines and then compute 2×1?
IG_Semmelweiss 16 hours ago [-]
50-year-old tech (the mouse) still hangs around because it is effective. Whatever this Google slop is, it will likely not replace mice. The mouse has passed the test of time; most of Google's own hardware products don't last 1/20th as long.
There's a reason chairs are still around. They're 2,000+ years old and still waiting to be replaced.
One should be extremely skeptical of claims of replacing tech that has been around for a very long time.
xigoi 7 hours ago [-]
You’re just giving them ideas for AI chairs.
Joker_vD 24 hours ago [-]
Just seven hours ago there was a plea on HN [0] to please not do this. Seriously, what are they smoking at Google right now?
There must be some gargantuan black hole around there that kills creativity.
dwa3592 20 hours ago [-]
I tried it in AI Studio and it was extremely disappointing. It did not follow directions. I pointed at the door of the sand castle and said "make the door of the sand castle bigger"; it created another very big sand castle on the side. Then I pointed at one flag of the castle and said "turn this flag into a blue flag". It turned two other random flags blue, but not the one I pointed at.
OtomotO 1 days ago [-]
Like a dream come true...
Nightmares are dreams as well and this is a nightmare like Windows Recall.
Technically wonderful though.
scotty79 12 hours ago [-]
Just a few days ago, I found myself marking a part of my screen with a rectangle and pasting what was there into ChatGPT to find out more about what was referenced there.
This has real utility.
themafia 1 days ago [-]
> We’ve been exploring new AI-powered capabilities to help the pointer not only understand what it’s pointing at, but also why it matters to the user.
We couldn't quite track you well enough before. So we're fixing that under the guise of "AI powered capabilities."
jinkuan 1 days ago [-]
being able to make precise edits would be huge for AI
andrethegiant 18 hours ago [-]
don't mess with the mouse
righthand 20 hours ago [-]
Google made a Microsoft Kinect.
delusional 21 hours ago [-]
It's like watching a demo from the old Xerox PARC, except everybody has only bad ideas. Like an opposite Xerox PARC.
walrus01 21 hours ago [-]
It's like a demo from Xerox PARC in an alternate universe where everything is run by marketing department MBAs. Oh wait, that's the one we live in now.
throwuxiytayq 21 hours ago [-]
Nice, cute, silly little feel-good demo so we can all pretend we're going to be making decisions and micro-managing AIs by pointing at things in 5 years. It's going to be great! The future is bright!
etchalon 22 hours ago [-]
I don't understand why we need to move from an explicit operation like, say, circling something, to a fuzzy one where you have to hope the machine understands what you're pointing at.
I also don't think people want to constantly talk to their computers.
xigoi 4 hours ago [-]
> I also don't think people want to constantly talk to their computers.
People don’t. Tech companies think they do.
homeonthemtn 22 hours ago [-]
This is pretty neat
pronik 3 hours ago [-]
One thing I'm constantly baffled by in current technology is how it seems to be targeted at one pretty precise subset of the human population: young, healthy, single people without any problems, with perfectly pre-defined schedules and organized lives. Speaking English, having OCD and being relatively wealthy is a bonus.
Only when you live alone might you be comfortable constantly speaking with your devices. Only if your life is perfectly predefined can you let your fridge order the same food that has just gone stale or been eaten. And only when you are young and healthy and not in any way differing from the "standard" would you be capable of working like these "researchers" imagine you to.
I'm not that person. I'm constantly failing at doing "triple-finger-taps" whenever I'm in need of one. I have a smartwatch with pedestrian navigation and never bothered to remember which vibration pattern means which turn. I don't configure different vibration patterns for different callers on the phone. I have a folding phone, but I almost never do side-by-side windows and when I do, I need to find out how to do that first -- and then how to leave that mode without losing my mind. I almost never use AI features on my phone not because I don't want to, but because I never remember how to activate them. I don't re-configure my gadgets to "fit my mood". I hate recommendations like "you like X, here's Y, it's the same!" I hate that I can't rest my mouse cursor on websites anymore without selecting something actionable, moving, animating or autoplaying.
All of the examples on the linked page are workflows I would never do this way. I won't be talking to my shopping list to double the ingredients. I won't be drawing gestures with my mouse on a document to activate a voice command. I won't use voice commands in general because, as it turns out, I'm not capable of getting out a complete, coherent sentence without pausing and/or changing my mind and/or realizing I'm wrong at least once.
I appreciate those demos for the progress they are showing. It's impressive and astonishing to see restaurants getting extracted from videos or pictures getting expanded or text edited better than I ever could. It's all modern-day magic in a way. One thing it all isn't is a product. We don't have those anymore -- all we get are gimmicks. We don't do common interfaces anymore either, we are separating people in Google/Apple/Xiaomi camps.
And most importantly we don't use that technology for good except for a bunch of people writing e-mails all day, doing shopping lists and booking one of top restaurants in Tokyo for the same evening on a whim. We are long overdue for a remake of "American Psycho", but this time it will be a documentary instead of a satire.
simondw 24 hours ago [-]
Maybe I'm misunderstanding, but what is new about the pointer itself? Seems to be functionally the same as selecting + tooltips / context menus.
kwertyoowiyop 24 hours ago [-]
Shush, how is anyone going to get promoted with that kind of talk!?
DaiPlusPlus 24 hours ago [-]
> but what is new about the pointer itself?
I'm hoping for a const-reference joke.
justvugg 22 hours ago [-]
Really interesting! This change could make for a faster UX.
pmarreck 24 hours ago [-]
There's already a product that does this lol
Aaaaand now I can't remember the name of it
CivBase 6 hours ago [-]
They replaced the context menu with voice commands. They replaced cursor drag selection with a wavey gesture. They expect us to use LLMs for something as trivial as copy+paste.
This reads like an April Fools joke. Even the title sounds like satire.
protocolture 21 hours ago [-]
Please leave the pointer alone. He's been with us so long without enshittification.
brgsk 24 hours ago [-]
what the hell is going on at google
ori_b 22 hours ago [-]
AI.
SirMaster 24 hours ago [-]
Thanks, I hate it
ekjhgkejhgk 21 hours ago [-]
Thanks, I hate it.
pyaamb 21 hours ago [-]
If it's offline, I love it. Otherwise I hate it.
mcookly 1 days ago [-]
I wonder what sort of monstrous power would be unleashed if Google used Plan9 as a foundation.
bitwize 24 hours ago [-]
They'd half-finish it then bury it, like they did with Fuchsia which is heavily Plan-9-inspired.
There is no doubt that LLMs can do amazing things, but the current environment seems to make it nearly impossible to do anything with them that doesn't let someone else inspect, influence, and even restrict everything you are doing with these systems.
If we're going to have AI regulation, this is where to start. If a company's AI service acts for a user, the company has non-disclaimable financial responsibility for anything that goes wrong. There's an area of law called "agency", which covers the liability of an employer for the actions of its employees. The law of agency should apply to AI agents. One court already did that. An airline AI gave wrong but reasonable sounding advice on fares, a customer made a decision based on that advice, and the court held that the AI's advice was binding on the company, even though it cost the company money.
This is something lawyers and politicians can understand, because there's settled law on this for human agents.
The hard reality is that you are still responsible for all of these things. If anything goes wrong at all, you are liable. Might not be devastating if it's just your shopping list or your photos mangled, but with taxes or bills? Even if the agent is running completely locally in your home, you still won't trust it fully if your livelihood depended on it.
The killer app is only possible if software is fully reliable, which we all know is not the case. Software is just that: software. It still has bugs, undefined behaviour, etc. Agents are the same; they just break in different ways, and fixing them might be even more difficult.
Bottom line: you will always be liable for things happening in your name and we've been sold a fairy tale a very long time ago.
I guess what I'm saying is - we've always had this problem.
What if I am going on vacation next week? What if I need extra milk for a dinner I am planning? What if my kid puts the milk in the fridge sideways and it no longer detects it?
"Easy fixes" to easy problems never work because they add mental load to tasks we already manage capably. Yes we no longer have to think about buying milk when it gets low, which was a stable pattern. But we replace it with a nondeterministic "milk state" that we need to be constantly vigilant about and manually adjust any time our routines are altered - exactly when we don't want to stack on more overhead.
AI is discretely useful, tremendously so, but big tech loves to default to umbrella solutions before there is a rich context to reasonably support it. The real world is messy.
General AI product tip: show your tool fixing a messy problem not a happy path problem. That's where AI is impactful!
Right-click menus can get cumbersome. I've seen a lot of software that suffers from function bloat - not that the functions don't work, or don't play well together, but that the user interface becomes too overwhelming for users as the number of available actions explodes. This is particularly tough for new users.
This is where voice controls could shine: as we interact with computers in more and more complex ways, we need a way to specify our desires simply and easily. And if we can't do so easily, the software has to remain simple to be usable.
But I don't think the voice problem is surmountable. I closed their image editing demo when I saw it required a mic.
It would be appealing as a Spotlight-like text pop-up interface where you type instructions, which would work in social/office environments, but that might only appeal to power users.
But if it's going to require phoning home to some Google/OpenAI/whoever then forget it. I don't want a constant connection to my OS from one of these companies.
Except for the large majority of people who read, type, and click way faster than they can talk. Especially for visual things it’s way faster to drag a rectangle than to describe what you want.
A lot of us also aren’t linear verbal thinkers. It would take minutes to hours to verbalize concepts we can grasp visually/schematically in seconds.
Great book on the topic: https://www.goodreads.com/book/show/60149558-visual-thinking
Also, I doubt DeepMind is designing for existing programmers and savvy computer users. They are thinking about the other billions of people in the world. Speech is the skill people will already have, not typing.
I usually convey the same meaning with 80wpm typing. Makes it faster to read too
Maybe I’m just slightly adhd – listening to people talk drives my crazy. Get to the point! Much easier if they type it out
People have so many verbal tics and filler words too. Anthropic’s Dario says “you know” after every third word, for example.
Or they meander around unrelated/unimportant details.
Neither typing speed nor dictation speed is a true bottleneck, but editing speech seems like it'd be harder than editing text.
Though there may be some hybrid approach that can work well.
You don't have to think about the design of your app. You just say what you want and the AI makes it appear. If you don't like something, you tell the AI to change it. You iterate live until you get the final result you want.
This is what writing docs has become for me. I have the agent make a draft, then tell it which sections to rewrite, combine, etc. I tell it the ideas I forgot to include. I manually make certain word choice changes. The question is how do you extend this flow to non-pure-text scenarios. For most people, just talking about what you see is probably the easiest.
I hadn’t realized until just now how accurate that is for me as well. Thank you.
I'm surprised sub-vocal HCI isn't better developed by now. Perhaps because of this stuff coming out it will be.
Humans speaking to one another is literally telepathy: I'm putting my thoughts in your head, with lots of ambiguity and noise, of course.
With better sub-vocal tech we can control our devices without bothering each other.
https://www.media.mit.edu/articles/exclusive-startup-lets-yo...
Depends on how many hands you've recently broken?
I assume they're using on-device Gemini Nano: https://developer.android.com/ai/gemini-nano
It's like a hidden curse of LLMs -- they're so good at parsing intended meaning from non-grammatically-correct language that we don't have to be very good at clear communication.
Eventually all LLMs will be controlled by humans uttering terse guttural grunts. We will all become neanderthals, with machines that deliver our every whim.
I recommend the youtube channel @afadingthought to see what people come up with (like v=283-z29TXeM).
I dunno how I can express this best, but I found out a very long time ago that my problem with voice input wasn't that it wasn't good enough. My problem with voice input is that I don't want it. I am very happy for people who use these tools that they exist. I will not be them. Yes I am sure.
And yes, I know SuperWhisper can run offline, but it is a notable benefit that, unlike many modern speech recognition tools, my keyboard does not require an always-active Internet connection, a subscription payment, or several teraflops of compute power.
I am not a flat-out luddite. I do use LLMs in some capacity, for whatever it is worth. Ethical issues or not, they are useful and probably here to stay. But my God, there are so many ways in which I am very happy to be "left behind".
https://www.youtube.com/watch?v=XthLQZWIshQ
https://en.wikipedia.org/wiki/Sorry_to_Bother_You
I think its brilliant UX.
Wouldn't SilentWhisper do just as good a job?
First things that came to mind:
And that's just from thinking about it for 30 seconds. I'm sure there are some really good use cases, but will any research group/company push through for years and years to make it really good even if the response is lukewarm?

In my experience, any combination of computers + speech + Danish has, so far without exception, been terrible. Last time I tested ChatGPT, it couldn't understand me at all. I spoke both in my local dialect and as close to Rigsdansk [π] as I could manage. Unusable performance, and in any case I should be able to talk normally, or there's no point. It was about a year ago - it may have improved, but I doubt it. I'm completely done trying to talk to machines.
Pre-emptive kamelåså: https://www.youtube.com/watch?v=s-mOy8VUEBk
[π] https://en.wikipedia.org/wiki/Danish_language#Dialects
It's a cool idea for the future when we have reliable EEG headsets or Neuralink or whatever though.
I'll talk to a computer, even in an office setting, if it adds enough value. But it's got to be a lot of value. Handsfree while driving is great, Iron Man talking to Jarvis while he's flying around makes sense. Many of us here are developers, engineers, or scientists, and our work has already been co-optimized with mouse and keyboard and whatever software we're in.
But when the software is less well-developed, or when it's not just dealing with technical data dumps, I imagine that a voice interface might be more useful.
So I think this idea has legs. But a successful implementation might also well be decades out.
The Siri voice transcription is pretty awful compared to what I've experienced with ChatGPT though and it's weird going back almost to the pre-LLM world where you have to give such clear sort of computer-coded voice commands.
Reads like the argument against cell phones where you don't have a cabinet around you...
I'd go and find a small meeting room or conference call booth in the office and take it there.
Essentially, a cabinet.
In fact, when humans happen to order other humans, it's typically done in writing.
https://www.youtube.com/watch?v=46EopD_2K_4
>We present a general-purpose implementation of Grossman and Balakrishnan's Bubble Cursor [broken link] the fastest general pointing facilitation technique in the literature. Our implementation functions with any application on the Windows 7 desktop. Our implementation functions across this infinite range of applications by analyzing pixels and by leveraging human corrections when it fails.
Transcript:
>We present the general purpose implementation of the bubble cursor. The bubble cursor is an area cursor that expands to ensure that the nearest target is always selected. Our implementation functions on the Windows 7 desktop and any application for that platform. The bubble cursor was invented in 2005 by Grossman and Balakrishnan. However a general purpose implementation of this cursor one that works with any application on a desktop has not been deployed or evaluated. In fact the bubble cursor is representative of a large body of target aware techniques that remain difficult to deploy in practice. This is because techniques like the bubble cursor require knowledge of the locations and sizes of targets in an interface. [...]
https://www.dgp.toronto.edu/~ravin/papers/chi2005_bubblecurs...
>The Bubble Cursor: Enhancing Target Acquisition by Dynamic Resizing of the Cursor’s Activation Area
>Tovi Grossman, Ravin Balakrishnan; Department of Computer Science; University of Toronto
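The selection rule at the heart of the bubble cursor (pick the target whose edge, not center, is nearest to the cursor) is simple enough to sketch. This is a minimal illustration assuming circular targets, not the pixel-analyzing implementation described in the talk:

```python
import math

def bubble_select(cursor, targets):
    """Bubble-cursor pick: choose the target whose edge is closest to
    the cursor, i.e. distance to the center minus the target's radius.
    `targets` is a list of ((x, y), radius) tuples."""
    cx, cy = cursor

    def edge_distance(target):
        (x, y), r = target
        return math.hypot(x - cx, y - cy) - r

    return min(targets, key=edge_distance)

# The big target at (0, 0) wins: its edge is nearer to the cursor at
# (26, 0), even though the small target's *center* is closer.
targets = [((0, 0), 20), ((40, 0), 2)]
print(bubble_select((26, 0), targets))  # → ((0, 0), 20)
```

The point of the edge-distance metric is that large targets get no unfair advantage from their centers, while the cursor's activation area can still "snap" to whatever is nearest.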
I've written more about Morgan Dixon's work on Prefab (pre-LLM pattern recognition, which is much more relevant with LLMs now).
https://news.ycombinator.com/item?id=11520967
https://news.ycombinator.com/item?id=14182061
https://news.ycombinator.com/item?id=18797818
https://news.ycombinator.com/item?id=29105919
https://www.media.mit.edu/publications/put-that-there-voice-...
https://www.youtube.com/watch?v=RyBEUyEtxQo
(And if it's an abstract entity like a file, it might not even be possible to describe it, short of rattling off the entire file path)
Imagine trying to convince someone in the 90s that that's a step forward.
Sometimes I go to a different page to take a screenshot and other times I'm browsing for a file, and other times I'm highlighting some log lines. Cursor did this well, with selecting text in the terminal auto-focusing the Cursor agent textbox so you could talk to the agent and then select some text and you didn't have to re-select the original agent textbox again. The agent is a top-level function in that system not "just another app I have to switch to" to take my context with.
I have some small amount of bias because I've always felt input-constrained on computers. I have to move my hands to go places and that's exasperating. I've tried head tracking, had a vim pedal for a while, and used tiling WMs, and things like this to aid but while my vim-fu is pretty good and I function inside things very well with it, my cross-application interface isn't.
In the end, perhaps we all have our home offices with our Apple Vision Pros and we talk to them like this to maneuver faster through our machines and get our ideas into them.
Cool research. I wonder what we'll end up with.
Why not constrain your computing? It will require some programming chops, but you can note down your common tasks, figure out where actual inputs are required, and automate the rest.
and that's aside from the obvious privacy problems.
And being able to take photos/videos with the glasses (like the Meta ones nowadays) is really useful with my kid because he often does funny or cute stuff and I don't have time to pull my phone out to take a video/photo of it. I guess it could be useful for video calls too so my parents can see him.
But I just don't see anyone sitting in an office, or even at home, talking to their computer. It's really only useful for hands-free settings like when you are driving, or in the kitchen etc.
IMO, that's the anti-social part. Why is a phone butting into your relationship with your kid? I only have a few pictures taken when I was a kid (and most were for some grand occasion). And I'm happy that was the case because of so many cringeworthy things I did.
Maybe you can share a scenario for that one? I can’t figure a scenario where all of this needs to be true. It seems like a recipe for accidents.
The second demo seems to be a wash: there's no time saved in saying "move this" versus "move crab". And an app-specific contextual menu would probably be faster.
The third demo doesn't seem to warrant the use of a pointer at all, since there is only one way to interpret the prompt.
None of this means that this approach will not be successful, but there's a reason why so many attempts to revolutionize user interfaces ended up going nowhere. Talking to your computer was always supposed to be the future, but in practice, it's slower and more finicky than typing.
In fact, the only new UI paradigm of the past 28+ years appears to have been touchscreens and swipe gestures on phones. But they are a matter of necessity. No one wants to finger-paint on a desktop screen.
For you and me, who have used keyboard in our lives for more than 1,000 or even 10,000 hours.
There was a brief period when typing slowed people down because they could write the same information down with pen&paper, and that period eventually passed.
It's wild that they even put this out as a demo. It should have been picked apart in the internal meeting. There is no way I'd ever show my product taking 5s to change a 1 to a 2 in a piece of text the user was already hovering over, or taking 10s to drag and drop a line of text from one box to another. Even the image example of finding a route between two images could be done quickly if images were auto-OCR'd, which is a setting on most image viewers.
It reminds me of Microsoft Recall in the sense that some portion of the screen is going to be continuously transmitted outside of the user's control.
What happens when someone browses something very private (planning a surprise engagement, looking at medical data, planning a protest)? All that data gets slurped to Google and subject to a warrant or discovery or building your advertising fingerprint.
Maybe the idea is that the data is sent to AI only when you right click, but that seems like a very thin firewall that a product manager will breach in the interests of delivering "predictive AI" via some kind of precomputed results.
Profit!
Now you get to hear every person in the office do that around you.
Like, good tech, but do googlers live in the real world? Do they genuinely like the idea of an open office full of people talking to their computers? Do they all live alone without human contact?
the agent occasionally spots your real problem like an experienced engineer
I like text selection exactly how it is. I want precise controls.
It's fine for a touch interface like a phone, but on a computer I expect precision. As much as I can get.
If we manage that, my plans for a pure XML-based shell might not be too futuristic: '<in><ls/></in><out><tree><file date="CDATA[...]">' ...
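For what it's worth, a toy sketch of that idea with well-formed output; the element names are made up, just echoing the fragment above:

```python
# Toy sketch of an XML-structured shell exchange: the command that was
# run and its output are wrapped in well-formed elements, so other
# tools (or an AI) can consume the result without re-parsing text.
import xml.etree.ElementTree as ET

def xml_ls(entries):
    session = ET.Element("session")
    ET.SubElement(ET.SubElement(session, "in"), "ls")
    tree = ET.SubElement(ET.SubElement(session, "out"), "tree")
    for name, date in entries:
        ET.SubElement(tree, "file", {"name": name, "date": date})
    return ET.tostring(session, encoding="unicode")

print(xml_ls([("notes.txt", "2024-01-01")]))
```

This is essentially what PowerShell does with objects instead of XML: structured output beats scraping `ls` with `awk`.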
I'm sure Don Hopkins can tell you a long annotated tale about the NeWS pizza ordering app that displayed a real-time dynamically-updated rotating pie on the screen as you filled out your order.
I'm imagining a webpage with a link - instead of opening a new link to quickly google something or opening three new tabs based on hyperlinks, I can point at a paragraph or line and ask it to tell me about it.
Maybe I can point at a song on Spotify and have it find me the youtube video, or vice versa (of course this is assuming a tool like this wouldn't stay locked into one ecosystem.. which it will).
Point is, the concept of talking to the computer with the mouse as pointer is pretty cool, and I guess a step closer to that whole sci-fi "look at this part of the screen and do something".
Maybe something with the file system? Like hover over any file and instead of seeing a snapshot or some details/metadata you could get a quick 1-2 line sentence on what it is. I suppose you may want to have that saved somewhere as well to cache it... but I'm def not an expert.
This would solve the always difficult issue of finding that one document! I still have trouble with document search on both mac & windows OS sometimes.
Anyway, I built a prototype of this idea, but instead of relying only on hover, I press Option to select a node in a custom AST-ish semantic layer I designed around a minimalist UI grammar, and Option + up/down arrows to move to the parent/child node. This way, I have an accurate pointer to the element I want to talk about, plus a minimal context window (parent component, state, a few navigation-related queries).
What I learned from using it, though, is that the killer use case isn't necessarily the flashy "talk to this UI element" interaction shown in the Google demos. I do use it that way too; I have `Option + Shift + click` to copy a selector to the clipboard, so I can give an LLM connected to the live medium a precise reference to the element I want to discuss.
But the place where it has been most useful day to day is much simpler: source navigation. Point at the thing in the UI, jump to the code that is responsible for it. The difficult part is jumping to the code you care about (the code for UI or for the semantic element?), but in my system that distinction turned out to be usually obvious, which is what makes the interaction useful.
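A minimal sketch of the kind of semantic layer being described; every name here is invented for illustration, not taken from the actual prototype:

```python
# Tiny tree of semantic UI nodes with parent/child navigation and a
# stable selector path that can be handed to an LLM as a precise
# reference. (Illustrative only; not the prototype described above.)
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def selector(self):
        """Root-to-node path, e.g. 'app/sidebar/search-box'."""
        parts, node = [], self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))

root = Node("app")
sidebar = Node("sidebar", root)
box = Node("search-box", sidebar)
print(box.selector())   # → app/sidebar/search-box
print(box.parent.name)  # "Option + Up" in the scheme above → sidebar
```

The selector string is what makes the "copy a reference to the clipboard" interaction work: it names the element unambiguously without dragging the whole UI state into the prompt.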
1. select text
2. dictate action
Feels very similar to Helix's select text and act on it.
I think text selection could also be voice controlled (with a modal voice input), so one could say: "select sentence, action mode, copy and paste it in my list and remove duplicates"
Interesting but not “reimagining” anything.
I think the real story here is how vibe coding now enables flashy demo sites like this to be built for a concept that hasn’t yet earned it.
And the paper https://dam-prod.media.mit.edu/uuid/8e6d934b-6c6f-48e4-b0a1-...
Also featured in the Starfire vision video from 1992: https://youtu.be/jhe1DFY-SsQ?t=286
https://en.wikipedia.org/wiki/MacPlaymate
>The game features a panic button that when clicked on will cover the computer screen with a fake spreadsheet. The player can also choose to print out Maxie's current pose as a pinup.
https://archive.org/details/mac_MacPlaymate
Geraldo interviews Chuck Farnham about getting sued by Playboy:
https://news.ycombinator.com/item?id=42571845
https://www.upi.com/Archives/1989/02/09/Playboy-sues-over-se...
Of course learning proper CAD software is probably the right thing here, but having Claude write Python scripts which generate HTML files which reference three.js to provide a 3D view has gotten me surprisingly far. If something could take my pointer click and reverse whatever coordinate transforms sit between the source code and my screen, such that the model sees my click in the same coordinate system it's writing Python in, well, that would be pretty slick.
Until then I've just had it list every surface in a legend, each colored differently, so I can say "three inches down from the top of pole six, and rotate it so the hoop part of the bolt faces northwest."
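In the simplest case, undoing those transforms is just inverting an affine map. A minimal sketch, assuming the only transforms between model and screen are a uniform scale and a translation (a real three.js perspective camera is considerably more involved):

```python
def screen_to_model(click, scale, offset):
    """Invert the model→screen mapping: screen = model * scale + offset."""
    x, y = click
    ox, oy = offset
    return ((x - ox) / scale, (y - oy) / scale)

# A click at screen (320, 240), with the scene drawn at 2x scale and
# shifted by (100, 40), maps back to model point (110.0, 100.0).
print(screen_to_model((320, 240), 2, (100, 40)))  # → (110.0, 100.0)
```

For a real 3D scene you would instead cast a ray from the click through the camera (three.js ships a `Raycaster` for exactly this) and report the hit point in world coordinates.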
(Not going to happen)
Horizontal dragging with a mouse is actually really hard. Nobody's going to use it like that.
Your arm can easily move your hand and cursor up/down by pivoting your shoulder, but there's no mechanism for left/right movement. It's always an arc.
Or put another way: selection will be a lot slower and more tedious than the demo.
https://www.youtube.com/watch?v=NcUJnmBqHTY
But Google is a very ill positioned candidate for such OS. I would rather trust Apple and local-first on-device models.
>Sousveillance (/suːˈveɪləns/ soo-VAY-lənss) is the recording of an activity by a member of the public, rather than a person or organisation in authority, typically by way of small wearable or portable personal technologies.[14] The term, coined by Steve Mann,[15] stems from the contrasting French words sur, meaning "above", and sous, meaning "below", i.e. "surveillance" denotes the "eye in the sky" watching from above, whereas "sousveillance" denotes bringing the means of observation down to human level, either physically (by mounting cameras on people rather than on buildings) or hierarchically (with ordinary people observing, rather than by higher authorities or by architectural means).[16][17][23]
I'm mostly using my system to make comments on long AI-generated documents (especially design documents). I find it works well to have the AI generate something, and then I read through it, making comments along the way.
You can get pretty far just repeating the things you see... "I'm reading [heading] and [comments]". But I do find some use in selecting content and saying "I don't agree with this" or whatever else.
The result is just an augmented message. It looks like:
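A hypothetical sketch of the shape (the tag names here are my own invention, not a real format):

```
<transcript note="dictated; spelling and punctuation are inferred">
I'm reading the "Error handling" section and I don't think the retry
logic described here matches the code.
<selection>Failed requests are retried up to three times.</selection>
Please reconcile this with what the client actually does.
</transcript>
```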
Then I just send this as a user message. Claude Code (and I'm guessing any of the agentic systems) picks up on the markup very easily. It also helps to label it as a transcript, so it understands there may be errors, and that things like spelling and punctuation are inferred, not deliberate. (Some additional instruction is necessary to help it understand, for example, that it should look for homophones that might make more sense in context.)

It makes reviewing feel pretty relaxed and natural. I've played around with similar note-taking systems, which I think could be great for studying in school, but haven't had the focus on that particular problem to take it very far.
But I think the best thing really is giving the agent a richer understanding of what the user is experiencing and doing and just creating a rich representation of that. The keywords can be useful, but almost only as checkpoints: a keyword can identify the moment to take the transcript and package it up and deliver it.
One difference perhaps in design motivation: I have really embraced long latency interactions. I use ChatGPT with extended thinking by default, and just suck it up when the answer didn't really require thinking. I deliver 10 points of feedback at once instead of little by little. (Often halfway through I explicitly contradict myself, because I'm thinking out loud and my ideas are developing.) I just don't stress out about latency or feedback, and so low-latency but lower-intelligence interactions don't do it for me (such as ChatGPT's advanced voice mode, or probably Thinking Machine's work). I think this focus is in part a value statement: I'm trying to do higher quality work, not faster work.
you select text in VS Code, and write a comment, and the LLM gets both
Assuming that today the most efficient way for humans to transfer information to machines is via voice. Assuming that for machines to convey rich information to humans, it's by printing HTML.
Then a combination of screen + eye tracking + voice is all you need. The mouse doesn't make sense anymore.
Links: https://x.com/trq212/status/2052809885763747935
Perhaps a text box and file upload isn’t the perfect interface for every use case but it is versatile which is a huge barrier to overcome.
At some point I fully expect eye tracking (or attention tracking) to be common enough to be a first-class input method.
I prefer keyboard operation myself, but I can see how this could become useful in the future, for certain use cases.
What would be useful as well then is if you could bind such a repeatedly-used AI command to button/menu item/keyboard shortcut in a way that it can still be used with pointing “this” and “that”.
More $$$ for the PM who launched the product.
On a less serious note, the audience for this is people who want to optimize for what seems like the least amount of effort.
Unless of course, their AI gets the same special privileges as the gpu in accessing drm content, and everything else is still locked out.
Bullshit!
You don't need any new metaphors to support such (questionable) flows - at all.
Swipes instead of selection rectangles are annoying - you don't see the traces of the swipes on these demo gifs! So, you've effectively "selected" something - but you have to keep in mind WHAT you selected.
Totally ridiculous bullshit.
Wait…it's May. Ugh, I'm so confused. :spiral eyes emoji:
It would be tiresome, though, to hold your hand out all day - but good for mobile and handwriting/drawing. It would need zero latency.
Furthermore, the mousepad, rather than the screen, could house the magnetic sensor so the palm can rest. The nail bands would then be the equivalent of the mouse, making it a Hall-effect mouse. But could the pad detect a finger twitch for the buttons, though?
They have so many great software engineers, but seem unable to use them to speed up coding AI research. Hopefully with Sergey's focus it will get better.
This cursor thing is just another experiment nobody cares about.
There's a reason chairs are still around. They're more than 2,000 years old and still waiting to be replaced.
One should be extremely skeptical of claims of replacing tech that has been around for a very long time.
[0] https://news.ycombinator.com/item?id=48107027
Nightmares are dreams as well and this is a nightmare like Windows Recall.
Technically wonderful though.
This has good utility.
We couldn't quite track you well enough before. So we're fixing that under the guise of "AI powered capabilities."
I also don't think people want to constantly talk to their computers.
People don’t. Tech companies think they do.
Only when you live alone might you be comfortable constantly speaking with your devices. Only if your life is perfectly predefined can you let your fridge reorder the same food that has just gone stale or been eaten. And only when you are young, healthy, and not differing in any way from the "standard" would you be capable of working the way these "researchers" imagine you to.
I'm not that person. I'm constantly failing at doing "triple-finger-taps" whenever I'm in need of one. I have a smartwatch with pedestrian navigation and never bothered to remember which vibration pattern means which turn. I don't configure different vibration patterns for different callers on the phone. I have a folding phone, but I almost never do side-by-side windows and when I do, I need to find out how to do that first -- and then how to leave that mode without losing my mind. I almost never use AI features on my phone not because I don't want to, but because I never remember how to activate them. I don't re-configure my gadgets to "fit my mood". I hate recommendations like "you like X, here's Y, it's the same!" I hate that I can't rest my mouse cursor on websites anymore without selecting something actionable, moving, animating or autoplaying.
All of the examples on the linked page are workflows I would never do this way. I won't be talking to my shopping list to double the ingredients. I won't be drawing gestures with my mouse on a document to activate a voice command. I won't use voice commands in general because as it turns out, I'm not capable of bringing out a complete coherent sentence without pausing and/or changing my mind and/or realizing I'm wrong once.
I appreciate those demos for the progress they are showing. It's impressive and astonishing to see restaurants getting extracted from videos or pictures getting expanded or text edited better than I ever could. It's all modern-day magic in a way. One thing it all isn't is a product. We don't have those anymore -- all we get are gimmicks. We don't do common interfaces anymore either, we are separating people in Google/Apple/Xiaomi camps.
And most importantly we don't use that technology for good except for a bunch of people writing e-mails all day, doing shopping lists and booking one of top restaurants in Tokyo for the same evening on a whim. We are long overdue for a remake of "American Psycho", but this time it will be a documentary instead of a satire.
I'm hoping for a const-reference joke.
Aaaaand now I can't remember the name of it
This reads like an April Fools joke. Even the title sounds like satire.