Screen awareness
Harness can see the screen you share with it and reason about what is actually there. This is the core of what makes it different from a chat box in a tab.
What it feels like
When you share your screen, Harness analyzes it on your device and answers from what is on screen right now. Ask “what is this error,” “what should I click,” or “summarize this page,” and the answer is grounded in the actual pixels in front of you rather than a guess.
Your screen history
Harness keeps a private, on-device timeline of what has been on your screen, so you can ask about the past as well as the present:
- By meaning: “the page about the missing API key.”
- By look: “the green candlestick chart.”
- By exact text: “the tab with HYPE/USDC.”
- By time: “what was I looking at yesterday around 11pm.”
Harness finds the right moments in your timeline and shows you the actual frames, then reasons from them. It never invents what a past screen said; if the history does not show what you asked about, it tells you.
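As a rough illustration of the four lookup styles above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Keyframe` fields, the function names, the tiny hand-written vectors standing in for real embeddings, and the `min_score` cutoff. It is not Harness's actual code, which runs in the browser. It only shows the idea that each kind of question can map to a different index, and that a search with no good match returns nothing instead of a guess.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence


@dataclass
class Keyframe:
    """Hypothetical record for one committed frame in the timeline."""
    ts: float                    # capture time, seconds since epoch
    label: str                   # short searchable label
    ocr_text: str                # raw text read from the screen
    text_vec: Sequence[float]    # stand-in for a dense-text embedding
    image_vec: Sequence[float]   # stand-in for a CLIP image embedding


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search_by_vector(frames: List[Keyframe], query_vec: Sequence[float],
                     field: str, min_score: float = 0.8) -> List[Keyframe]:
    """By meaning (field='text_vec') or by look (field='image_vec').

    Frames scoring below min_score are dropped, so a query with no good
    match yields an empty list instead of the "least bad" frame.
    """
    scored = [(cosine(getattr(f, field), query_vec), f) for f in frames]
    hits = [(s, f) for s, f in scored if s >= min_score]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [f for _, f in hits]


def search_exact(frames: List[Keyframe], needle: str) -> List[Keyframe]:
    """By exact text: case-insensitive substring match over the raw OCR."""
    needle = needle.lower()
    return [f for f in frames if needle in f.ocr_text.lower()]


def search_time(frames: List[Keyframe], start: float, end: float) -> List[Keyframe]:
    """By time: every frame captured in the closed interval [start, end]."""
    return [f for f in frames if start <= f.ts <= end]
```

In this sketch, an empty result is the signal to tell you that the history does not show what you asked about.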
How it works
The vision pipeline runs entirely in your browser, on WebGPU where available and with WASM as a fallback, so your screen never has to leave your device to be understood.
Each captured frame moves through a staged pipeline built to be cheap by default and only spend compute when the screen actually changes:
- Motion gate. A pixel-level difference check discards near-identical frames before any model runs.
- Scene gate. A CLIP image embedding compares the frame against the last committed keyframe. Only a meaningful change commits a new keyframe, which keeps the timeline dense with signal and sparse with duplicates.
- Text extraction. On-device OCR (PaddleOCR PP-OCRv5) reads the text on screen, with an automatic INT8 quantization path so recognition stays fast on machines without a strong GPU.
- Labeling. Each keyframe gets a short, searchable label derived from its CLIP scene tags and the OCR text, so the timeline carries a description of every frame without running a heavier model.
- Embedding. Each keyframe is embedded three ways so it can be retrieved three ways: a CLIP image embedding (visual similarity), a dense-text embedding over the label and OCR (semantic meaning), and a lexical index over the raw OCR (exact strings, domains, tickers).
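The first two stages above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the shipped pipeline: frames are flat lists of grayscale pixels, `embed` is a hypothetical stand-in for the CLIP image encoder, both thresholds are made-up values, and the motion gate compares against the last frame that passed it (the docs do not say which frame is used as the reference).

```python
from math import sqrt
from typing import Callable, Optional, Sequence

Frame = Sequence[int]  # flat grayscale pixels, 0-255 (assumption for this sketch)


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two same-sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / max(len(a), 1)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class KeyframeGate:
    """Two cheap checks that decide whether a frame becomes a keyframe."""

    def __init__(self, embed: Callable[[Frame], Sequence[float]],
                 motion_threshold: float = 2.0,   # hypothetical value
                 scene_threshold: float = 0.9):   # hypothetical value
        self.embed = embed
        self.motion_threshold = motion_threshold
        self.scene_threshold = scene_threshold
        self.last_frame: Optional[Frame] = None
        self.last_key_vec: Optional[Sequence[float]] = None
        self.embed_calls = 0  # counts how often the (expensive) model ran

    def offer(self, frame: Frame) -> bool:
        """Return True if the frame is committed as a new keyframe."""
        # Motion gate: drop near-identical frames before any model runs.
        if (self.last_frame is not None
                and mean_abs_diff(frame, self.last_frame) < self.motion_threshold):
            return False
        self.last_frame = frame

        # Scene gate: embed, then compare with the last committed keyframe.
        vec = self.embed(frame)
        self.embed_calls += 1
        if (self.last_key_vec is not None
                and cosine(vec, self.last_key_vec) >= self.scene_threshold):
            return False  # still the same scene; no new keyframe

        self.last_key_vec = vec
        return True
```

The point of the ordering is cost: the pixel difference is trivially cheap, so the embedding model only runs on frames that have visibly changed, and the later stages (OCR, labeling, the three embeddings) would only run on frames for which `offer` returns True.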
The keyframe buffer is a doubly-linked timeline, so, beyond content search, Harness can also walk it: step forward or back from a known frame, or jump to the next scene boundary, to reconstruct a sequence of events.
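To make the walking idea concrete, here is a small Python sketch of a doubly-linked timeline. The names (`Node`, `Timeline`, `next_scene_boundary`) and the `scene_id` field are hypothetical; in particular, how Harness marks a scene boundary is not described here, so the sketch simply assumes each keyframe carries an integer scene identifier.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # compare nodes by identity, not by field values
class Node:
    ts: float       # capture time
    label: str      # short label for the keyframe
    scene_id: int   # hypothetical: frames in the same scene share an id
    prev: Optional["Node"] = field(default=None, repr=False)
    next: Optional["Node"] = field(default=None, repr=False)


class Timeline:
    """Append-only doubly-linked list of keyframes, oldest first."""

    def __init__(self) -> None:
        self.head: Optional[Node] = None
        self.tail: Optional[Node] = None

    def append(self, ts: float, label: str, scene_id: int) -> Node:
        node = Node(ts, label, scene_id)
        if self.tail is None:
            self.head = self.tail = node
        else:
            node.prev = self.tail
            self.tail.next = node
            self.tail = node
        return node


def next_scene_boundary(node: Node) -> Optional[Node]:
    """First later keyframe whose scene differs, or None at the end."""
    cur = node.next
    while cur is not None and cur.scene_id == node.scene_id:
        cur = cur.next
    return cur
```

Stepping forward or back is just following `next` or `prev` from a frame that a search returned, which is how a sequence such as "login page, then the error, then the dashboard" could be reconstructed in order.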
See Models for the specific on-device models the pipeline uses.
Control
You are always in control of what Harness can see. You can stop sharing at any time; once you do, Harness cannot see your screen until you share again.