
A privacy-preserving AI surveillance system


Eigensense is a privacy-preserving surveillance system. The camera uses AI to generate a text description of what it sees. This text is sent to an AI agent, which evaluates the situation and alerts humans if attention is needed. Video is processed entirely on-device; it is never stored or transmitted outside the camera.

For example, the camera monitors a patient’s room, periodically sending descriptions like “There is a person lying on a bed”, which are evaluated by the AI agent. If the camera instead sends a message like “There is a person lying on the floor”, the AI agent alerts hospital staff.

This is an experimental proof-of-concept, not a production-ready system.

Why?

Privacy vs Security

There is a natural tension between privacy and security. People have a reasonable expectation of privacy even in public settings, and don’t want high-resolution video of their every move to be viewable by strangers and stored indefinitely. On the other hand, there are situations where security and safety require monitoring to identify dangerous situations - for example, people entering construction zones without protective clothing.

Eigensense resolves this tension by ensuring that video is never recorded or transmitted out of the camera. The text descriptions that are sent out describe the scene in general terms and do not personally identify individuals.

Security Autopilot

There is a limit to how much a single person can effectively monitor without becoming fatigued or missing important cues. An AI agent works consistently without tiring, and can be instructed to look for complex situations spanning multiple cameras. This helps scale out security and safety. Like fly-by-wire and autopilot technology, it augments human ability, reserving human attention for where it is needed most rather than depleting it on routine monitoring.

Explainability and Transparency

The data flowing between the elements of the system, and the instructions to the agents, are in simple natural language, so they can be examined by anyone, not merely technical experts. The system architecture, the prompts, the messages, and the alerts can all be disclosed to affected individuals on demand without compromising other people’s privacy. This explainability and transparency makes the system more likely to secure the willing consent of the people being monitored.

How?

There are many possible implementations of this system; we describe one of the simplest, in a corporate setting.

The sensor is an Apple device, an iPhone or iPad running iOS 18, and the sensor software is a web application. Image processing and conversion to text is done by Florence-2 base, a 230-million-parameter vision model released by Microsoft. This model converts images to text descriptions purely locally, on-device, without sending any video or image data outside.
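To illustrate the captioning step, here is a minimal sketch using the Hugging Face transformers API on a desktop machine. The real sensor runs the model in-browser on the iOS device, so this is only an approximation of the on-device pipeline, and the file name frame.jpg is a placeholder.

    # Sketch: caption one camera frame with Florence-2 base via Hugging Face
    # transformers. The real sensor runs the model in-browser on iOS; this
    # desktop sketch only illustrates the image-to-text step. frame.jpg is a
    # placeholder for a single camera snapshot.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = "microsoft/Florence-2-base"
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open("frame.jpg")
    task = "<MORE_DETAILED_CAPTION>"  # Florence-2 task token for captioning
    inputs = processor(text=task, images=image, return_tensors="pt")

    generated = model.generate(input_ids=inputs["input_ids"],
                               pixel_values=inputs["pixel_values"],
                               max_new_tokens=64)
    raw = processor.batch_decode(generated, skip_special_tokens=False)[0]
    caption = processor.post_process_generation(
        raw, task=task, image_size=(image.width, image.height))[task]
    print(caption)  # e.g. "There is a person lying on a bed"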

The sensor posts the text description to a corporate Slack channel. (Slack is a team communication platform, similar to Microsoft Teams.)
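As a sketch of this posting step, assuming a bot token in SLACK_BOT_TOKEN and a channel named #sensor-feed (both illustrative), a description can be sent with the official slack_sdk client:

    # Sketch: post a description to a Slack channel with the official slack_sdk
    # client. The token variable and channel name are illustrative assumptions.
    import os
    from slack_sdk import WebClient

    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
    client.chat_postMessage(channel="#sensor-feed",
                            text="There is a person lying on a bed")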

An AI bot running within the corporate network listens to the channel. It has been configured with a natural language prompt specifying its role and what constitutes an alarming situation. For example: “You are an expert safety inspector. Below is a description of a factory floor. All people should be wearing yellow hard hats for safety. If the description mentions any people without hard hats, say ALARM and explain why, otherwise say OK. Your output will be relayed to safety personnel.”

On receipt of a new message, the bot uses a large language model (LLM) to evaluate whether the situation requires humans to be alerted. If an alert is warranted, a warning is posted to a separate channel, which results in humans being notified on their mobiles. The LLM can be run locally so that no surveillance text leaves the corporate network.
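Here is a minimal sketch of the evaluation loop, assuming Slack’s Bolt framework for listening and a local OpenAI-compatible endpoint such as one served by Ollama. The channel name, model name, tokens, and endpoint are all illustrative assumptions, not part of the reference setup.

    # Sketch: evaluate each new description with a local LLM and escalate alarms.
    # The channel name, model name, tokens, and endpoint are illustrative.
    import os
    from openai import OpenAI
    from slack_bolt import App

    PROMPT = ("You are an expert safety inspector. Below is a description of a "
              "factory floor. All people should be wearing yellow hard hats for "
              "safety. If the description mentions any people without hard hats, "
              "say ALARM and explain why, otherwise say OK. Your output will be "
              "relayed to safety personnel.")

    # Local OpenAI-compatible endpoint, e.g. one served by Ollama.
    llm = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
    app = App(token=os.environ["SLACK_BOT_TOKEN"],
              signing_secret=os.environ["SLACK_SIGNING_SECRET"])

    @app.message("")  # matches every message in channels the bot has joined
    def evaluate(message):
        verdict = llm.chat.completions.create(
            model="llama3",  # illustrative local model
            messages=[{"role": "system", "content": PROMPT},
                      {"role": "user", "content": message["text"]}],
        ).choices[0].message.content
        if "ALARM" in verdict:
            app.client.chat_postMessage(channel="#safety-alerts", text=verdict)

    if __name__ == "__main__":
        app.start(port=3000)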

Humans can join both channels and observe both the descriptions being posted and the evaluations being performed by the AI bot.

Limitations and Tradeoffs

The system’s core value comes from running the image-to-text pipeline entirely within the sensor device, which restricts us to vision models that can run within the limited compute of mobile and edge devices. While Florence-2 is very good for its size, it can still make mistakes, both false positives and false negatives, that humans would not make.

The LLM judges whether a description calls for an alert; it can make mistakes and miss crucial situations as well. Small local models (e.g. Microsoft’s Phi and Meta’s Llama) make more mistakes than state-of-the-art models running in the cloud (e.g. ChatGPT). There is a tradeoff between accuracy, privacy, and speed of response.

The system does not process video, but a series of still images separated by seconds. While it is possible for the LLM to consider the last n snapshots to deduce changes in the situation, this is far less effective at capturing dynamic, fast-moving events than a video-processing model or a human would be.
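As a sketch of that mitigation, the bot could keep the last n descriptions in a small ring buffer and include them in the prompt so the LLM can reason about changes over time; the value of n and the prompt wording here are illustrative.

    # Sketch: keep the last n descriptions so the LLM can compare successive
    # snapshots. n = 5 and the prompt wording are illustrative.
    from collections import deque

    history = deque(maxlen=5)

    def build_prompt(new_caption: str) -> str:
        history.append(new_caption)
        numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(history))
        return ("Successive descriptions of the same scene, oldest first:\n"
                + numbered
                + "\nHas the situation changed in a way that requires an alert?")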

Many of these limitations will diminish over time as AI capability becomes a focus of mobile hardware innovation. In situations where high requirements for privacy intersect with high requirements for safety, this system can deliver better results than the status quo.

Try it out!

See the GitHub repo for instructions on how to try out the sensor app demo and set up your own instance of Eigensense.

The software is open source and can be forked and customised as needed.