acoustsee

Future Features

Sorted by priority

6/16/2025

Considering adding OpenCV support

Adapting OpenCV for AcoustSee

To meet the 66ms constraint and aid a blind user via audio cues, here’s how OpenCV can be tailored for AcoustSee:

Pipeline Design

Camera Input: Capture frames at 15–30 FPS (66–33.3 ms per frame) at 480p or 720p to reduce processing load.

Use browser-based APIs (e.g., getUserMedia) for WebAssembly compatibility.
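A minimal capture sketch in TypeScript (the project targets the browser, so the sketches below all use TypeScript). The 640×480 / 30 FPS constraints mirror the numbers above; the "webcam" element id is an assumption.

    // Request a 480p camera stream at up to 30 FPS and attach it to a
    // <video> element so frames can be drawn to a canvas for processing.
    async function startCamera(): Promise<HTMLVideoElement> {
      const stream = await navigator.mediaDevices.getUserMedia({
        audio: false,
        video: {
          width: { ideal: 640 },
          height: { ideal: 480 },
          frameRate: { ideal: 30, max: 30 },
          facingMode: "environment", // rear camera on phones
        },
      });
      const video = document.getElementById("webcam") as HTMLVideoElement; // assumed id
      video.srcObject = stream;
      await video.play();
      return video;
    }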

Object Detection: Use a lightweight model like MobileNet-SSD or YOLO-Tiny (pre-trained for common objects: sidewalk, wall, car, swing).

Processing time: ~15–30 ms on mid-range devices for 720p.

Output: Bounding boxes and labels for objects (e.g., “sidewalk: center, wall: left”).
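A sketch of this step, assuming an opencv.js build that includes the dnn module with a MobileNet-SSD Caffe model already loaded (model loading itself is omitted); the parsing follows the standard SSD output layout [1, 1, N, 7].

    declare const cv: any; // opencv.js attaches a global `cv` object

    interface Detection { label: number; score: number;
                          x: number; y: number; w: number; h: number; }

    // Run MobileNet-SSD on one RGB frame (cv.Mat) and collect boxes above
    // a confidence threshold. `net` comes from e.g. cv.readNetFromCaffe.
    function detectObjects(frame: any, net: any, minScore = 0.5): Detection[] {
      // MobileNet-SSD expects 300x300 inputs, scaled and mean-subtracted.
      const blob = cv.blobFromImage(frame, 0.007843, new cv.Size(300, 300),
                                    new cv.Scalar(127.5, 127.5, 127.5), false, false);
      net.setInput(blob);
      const out = net.forward(); // rows of [.., label, score, x1, y1, x2, y2]
      const detections: Detection[] = [];
      for (let i = 0; i < out.data32F.length; i += 7) {
        const score = out.data32F[i + 2];
        if (score < minScore) continue;
        const x1 = out.data32F[i + 3] * frame.cols, y1 = out.data32F[i + 4] * frame.rows;
        const x2 = out.data32F[i + 5] * frame.cols, y2 = out.data32F[i + 6] * frame.rows;
        detections.push({ label: out.data32F[i + 1], score,
                          x: x1, y: y1, w: x2 - x1, h: y2 - y1 });
      }
      blob.delete(); out.delete();
      return detections;
    }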

Feature Extraction: Analyze color and texture within bounding boxes using OpenCV’s image processing (e.g., HSV color histograms, edge detection).

Example: Sidewalk = smooth, gray (low-frequency hum); wall = flat, textured (mid-frequency tone).

Processing time: ~5–10 ms.
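A rough opencv.js sketch of such features: mean hue from an HSV histogram plus Canny edge density, with `roi` assumed to be an RGBA cv.Mat already cropped to one bounding box.

    declare const cv: any; // opencv.js global

    // Crude color/texture summary for one detection's ROI:
    // low edge density + gray hue ~ sidewalk; high edge density ~ textured wall.
    function extractFeatures(roi: any): { meanHue: number; edgeDensity: number } {
      const rgb = new cv.Mat(), hsv = new cv.Mat();
      const gray = new cv.Mat(), edges = new cv.Mat();
      cv.cvtColor(roi, rgb, cv.COLOR_RGBA2RGB);
      cv.cvtColor(rgb, hsv, cv.COLOR_RGB2HSV);

      // Hue histogram (channel 0), 30 bins over OpenCV's [0, 180) hue range.
      const srcVec = new cv.MatVector();
      srcVec.push_back(hsv);
      const hist = new cv.Mat();
      const mask = new cv.Mat();
      cv.calcHist(srcVec, [0], mask, hist, [30], [0, 180], false);
      let weighted = 0, total = 0;
      for (let bin = 0; bin < 30; bin++) {
        const count = hist.data32F[bin];
        weighted += count * (bin * 6 + 3); // bin center in hue units
        total += count;
      }

      // Edge density: fraction of ROI pixels that Canny marks as edges.
      cv.cvtColor(rgb, gray, cv.COLOR_RGB2GRAY);
      cv.Canny(gray, edges, 50, 150);
      const edgeDensity = cv.countNonZero(edges) / (edges.rows * edges.cols);

      [rgb, hsv, gray, edges, srcVec, hist, mask].forEach(m => m.delete());
      return { meanHue: weighted / Math.max(total, 1), edgeDensity };
    }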

Depth Estimation: Use a lightweight monocular depth model (e.g., MiDaS Small, optimized for TFLite) to estimate object distances.

Example: Swing at 2m = loud, broad sound; at 5m = quiet, narrow sound.

Processing time: ~20–30 ms on flagships, ~30–50 ms on mid-range.

Alternative: Use motion cues (e.g., optical flow) for faster processing (~10–20 ms).
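A sketch of the optical-flow alternative in opencv.js: dense Farneback flow between consecutive grayscale frames, averaged inside a bounding box; a larger mean magnitude suggests a closer or faster-approaching object.

    declare const cv: any; // opencv.js global

    // Mean flow magnitude inside `box`, sampling every 4th pixel for speed.
    function boxFlowMagnitude(prevGray: any, currGray: any,
                              box: { x: number; y: number; w: number; h: number }): number {
      const flow = new cv.Mat(); // 2-channel float Mat of per-pixel (dx, dy)
      cv.calcOpticalFlowFarneback(prevGray, currGray, flow,
                                  0.5, 3, 15, 3, 5, 1.2, 0);
      let sum = 0, n = 0;
      for (let y = box.y; y < box.y + box.h; y += 4) {
        for (let x = box.x; x < box.x + box.w; x += 4) {
          const i = (y * flow.cols + x) * 2;
          sum += Math.hypot(flow.data32F[i], flow.data32F[i + 1]);
          n++;
        }
      }
      flow.delete();
      return n ? sum / n : 0;
    }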

Audio Mapping: Map visual features to audio cues using the Web Audio API:

Position: Stereo panning (left-right based on bounding box x-coordinate).

Depth: Volume (louder for closer) and spectral complexity (broader for closer).

Object type: Unique spectral signatures (e.g., hum for sidewalk, tone for wall, broadband for swing).

Generate 3–6 cues per 66ms frame, each ~5–10 ms, to align with auditory resolution.
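A minimal Web Audio sketch of this mapping; the 600 Hz default pitch and the distance-to-gain curve are placeholder choices, and the oscillator type stands in for the per-object signatures listed under Audio Cue Design below.

    // Play one short cue: pan from the box's horizontal center,
    // gain from estimated distance, duration within the 5–10 ms budget.
    function playCue(ctx: AudioContext, xCenter: number, frameWidth: number,
                     depthMeters: number, type: OscillatorType = "sine"): void {
      const osc = ctx.createOscillator();
      osc.type = type;
      osc.frequency.value = 600; // placeholder pitch

      const panner = ctx.createStereoPanner();
      panner.pan.value = (xCenter / frameWidth) * 2 - 1; // 0..width -> -1..+1

      const gain = ctx.createGain();
      gain.gain.value = Math.min(1, 1 / Math.max(depthMeters, 0.5)); // closer = louder

      osc.connect(gain).connect(panner).connect(ctx.destination);
      const t = ctx.currentTime;
      osc.start(t);
      osc.stop(t + 0.008); // ~8 ms cue
    }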

Total Latency: For example, MobileNet-SSD (~20 ms) + feature extraction (~10 ms) + depth estimation (~30 ms) ≈ 60 ms on a flagship device at 720p.

Optimizations (e.g., 480p input, quantized models, GPU acceleration) can reduce this to ~40–50 ms, fitting within the 66ms budget.

Optimizations for Real-Time

Lower resolution: Use 480p (640×480) instead of 720p/1080p to cut processing time by ~30–50%.

Lightweight models: Use quantized TFLite models (e.g., MobileNet-SSD, MiDaS Small) for 2–3x speedup.

Frame skipping: Process every other frame (effective 15 FPS) if needed, while interpolating audio cues between processed frames (see the sketch below).

GPU/Neural acceleration: Leverage OpenCV’s DNN module with OpenCL or mobile neural engines.

Asynchronous processing: Run image processing in parallel with audio synthesis to reduce perceived latency.
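A sketch combining the frame-skipping and asynchronous ideas above: the main thread keeps drawing preview frames and synthesizing audio, while every second frame's pixels are transferred to a Web Worker for detection. The worker file name is hypothetical.

    const visionWorker = new Worker("vision-worker.js"); // hypothetical file
    let frameIndex = 0;

    // Draw every camera frame, but hand only every other one to the worker,
    // so audio synthesis never blocks on vision processing.
    function processLoop(video: HTMLVideoElement, canvas: HTMLCanvasElement): void {
      const ctx2d = canvas.getContext("2d")!;
      ctx2d.drawImage(video, 0, 0, canvas.width, canvas.height);

      if (frameIndex % 2 === 0) { // effective 15 FPS from a 30 FPS camera
        const { data, width, height } =
          ctx2d.getImageData(0, 0, canvas.width, canvas.height);
        // Transfer the pixel buffer instead of copying it.
        visionWorker.postMessage({ buf: data.buffer, width, height }, [data.buffer]);
      }
      frameIndex++;
      requestAnimationFrame(() => processLoop(video, canvas));
    }

    visionWorker.onmessage = (e: MessageEvent) => {
      // e.data would carry detections; update/interpolate audio cues here
      // for the skipped frames.
    };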

Audio Cue Design

Number of cues: Limit to 3–6 per 66ms frame (e.g., sidewalk, wall, swing) to ensure auditory clarity, based on the ~5–10 ms per-cue limit noted above.

Spectral signatures:

Sidewalk: Low-pass noise (100–200 Hz), center-panned, steady.

Wall: Sine wave (500–1000 Hz), left-panned, constant.

Swing: Sawtooth wave (200–2000 Hz), center-panned, dynamic volume/filter.

Car: Bandpass noise (500–5000 Hz), panned based on position.

Dynamic updates: Adjust volume and filter cutoff every 66ms based on depth/motion (e.g., swing closer = louder, broader spectrum).
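Two of these signatures sketched with the Web Audio API: the sidewalk's low-passed noise hum (a looped random buffer through a low-pass filter) and the wall's left-panned sine tone. The caller starts/stops the returned nodes and retunes filter cutoff or gain every 66ms for the dynamic updates.

    // Sidewalk: looped white noise low-passed to the 100–200 Hz region.
    function sidewalkHum(ctx: AudioContext): AudioBufferSourceNode {
      const buffer = ctx.createBuffer(1, ctx.sampleRate, ctx.sampleRate); // 1 s
      const data = buffer.getChannelData(0);
      for (let i = 0; i < data.length; i++) data[i] = Math.random() * 2 - 1;

      const src = ctx.createBufferSource();
      src.buffer = buffer;
      src.loop = true;

      const lowpass = ctx.createBiquadFilter();
      lowpass.type = "lowpass";
      lowpass.frequency.value = 200; // cutoff; retune per frame for dynamics

      src.connect(lowpass).connect(ctx.destination);
      return src; // caller runs src.start() / src.stop()
    }

    // Wall: steady sine tone in the 500–1000 Hz band, panned hard left.
    function wallTone(ctx: AudioContext): OscillatorNode {
      const osc = ctx.createOscillator();
      osc.type = "sine";
      osc.frequency.value = 750;
      const pan = ctx.createStereoPanner();
      pan.pan.value = -1;
      osc.connect(pan).connect(ctx.destination);
      return osc; // caller runs osc.start() / osc.stop()
    }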

Example of a UI template:

Confirmation is done by pressing the central vertical rectangle, which also serves as the webcam feed preview/canvas.

Starting and stopping navigation is done by pressing the bottom trapezoid.
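A hypothetical wiring sketch for these two controls; the element ids and handler names are assumptions, not existing AcoustSee code.

    const previewCanvas = document.getElementById("preview") as HTMLCanvasElement;
    const trapezoidBtn = document.getElementById("nav-toggle") as HTMLButtonElement;
    let navigating = false;

    // Central canvas: webcam preview that also confirms on tap.
    previewCanvas.addEventListener("click", () => confirmAction());

    // Bottom trapezoid: toggles navigation on and off.
    trapezoidBtn.addEventListener("click", () => {
      navigating = !navigating;
      if (navigating) startNavigation();
      else stopNavigation();
    });

    // Placeholders for the app's real handlers.
    function confirmAction(): void {}
    function startNavigation(): void {}
    function stopNavigation(): void {}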

A live console-log view with a copy feature is also being considered.