FAQs | Ambarella

Yes. The training code and pruning recipes will be released on GitHub soon. Users can train these models on their own datasets and at different input image resolutions.

Choose models based on your task category.

Image classification: ResNet, EfficientNetV2 — assign a single label to an image.
Object detection: YOLOX, RTMDet — locate and classify multiple objects.
Segmentation: DeepLabv3+, TopFormer — assign a class to each pixel.
Vision-Language Models (VLMs): OWL-ViT, LLaVA OneVision — enable multimodal reasoning and can be applied across classification, detection, and segmentation tasks.

The model garden provides curated recommendations per task, including compute and memory requirements to help match your silicon budget. VLMs are more resource-intensive, but offer strong generalization and often work without task-specific training.
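The matching step described above can be sketched as a simple filter over a catalog. This is a minimal illustration, not the model garden's actual interface: the `ModelEntry` type, the `candidates` function, the entry names, and every compute and memory figure below are hypothetical placeholders, not measurements of any real model.

```python
from dataclasses import dataclass


@dataclass
class ModelEntry:
    name: str          # hypothetical entry name
    task: str          # "classification", "detection", "segmentation", "vlm"
    gmacs: float       # compute per inference (placeholder value)
    memory_mb: float   # memory footprint (placeholder value)


# Placeholder catalog; all numbers are invented for illustration only.
CATALOG = [
    ModelEntry("detector-small", "detection", 5.0, 40.0),
    ModelEntry("detector-large", "detection", 60.0, 300.0),
    ModelEntry("classifier-small", "classification", 1.0, 20.0),
    ModelEntry("vlm-base", "vlm", 400.0, 4000.0),
]


def candidates(catalog, task, max_gmacs, max_memory_mb):
    """Return entries for a task that fit the budget, cheapest compute first."""
    fits = [
        m for m in catalog
        if m.task == task
        and m.gmacs <= max_gmacs
        and m.memory_mb <= max_memory_mb
    ]
    return sorted(fits, key=lambda m: m.gmacs)


names = [m.name for m in candidates(CATALOG, "detection", 10.0, 100.0)]
print(names)
```

With real figures from the per-task recommendations in place of the placeholders, the same filter narrows the list to models that fit a given compute and memory budget.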

Yes — every model in the garden comes with a pre-validated runtime package, including compatible ONNX/quantized models, pre-processing pipelines (resize, normalization, tokenization), and post-processing (NMS, depth scaling, keypoint decoding, etc.).
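As an illustration of the kind of post-processing step mentioned above, here is a minimal sketch of greedy non-maximum suppression (NMS) in plain Python. It is not the code shipped in the runtime package; the function names, the `(x1, y1, x2, y2)` box layout, and the 0.5 threshold are assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed, while the third box has no overlap and is kept.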

Quantization (INT8, W4A8, mixed precision, etc.) and unstructured sparsity reduce memory footprint and latency, but can introduce accuracy losses, usually small, depending on the model and data distribution. The models in the model garden are curated by Ambarella, with data formats and pruning budgets balanced to recover accuracy. Model accuracy can be verified on boards.
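To make the source of quantization error concrete, the sketch below applies symmetric per-tensor INT8 quantization to a small list of weights and measures the round-trip error. It is a generic textbook scheme, not Ambarella's toolchain; the function names and sample weights are assumptions for illustration. Comparing float reference values with their quantized counterparts in this way is one common starting point when tracking down an accuracy drop.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float values."""
    return [x * scale for x in q]


def max_abs_error(reference, approx):
    """Largest absolute difference between two equal-length sequences."""
    return max(abs(r - a) for r, a in zip(reference, approx))


weights = [0.02, -0.5, 0.31, 1.27, -1.0]  # illustrative values only
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
err = max_abs_error(weights, restored)
print(q, scale, err)
```

For values inside the clipping range, the round-trip error of this scheme is bounded by half the scale, so a tensor with a few large outliers gets a coarse scale and loses precision on its small values.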

For VLMs like OWL-ViT, LongCLIP, or LLaVA OneVision, prompts should be explicit, structured, and task-specific (e.g., “Describe the objects in this image,” “Locate all emergency vehicles,” “Answer the question based only on the image”). Accuracy can be evaluated using standardized benchmarks (VQA, COCO retrieval, phrase grounding, open-vocabulary detection) or domain-specific metrics such as answer correctness, grounding precision, or retrieval recall. The SDK includes evaluation utilities to run these tests locally on the silicon for consistent measurement.
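Two of the metrics named above, answer correctness and retrieval recall, can be sketched in a few lines. This is not the SDK's evaluation utility; the function names, the exact-match normalization rule, and the sample data are assumptions for illustration.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def answer_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference after normalization."""
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(references)


def recall_at_k(rankings, relevant, k):
    """Fraction of queries whose relevant item appears in the top-k ranked results."""
    hits = sum(1 for ranked, rel in zip(rankings, relevant) if rel in ranked[:k])
    return hits / len(relevant)


# Illustrative data: three VQA-style answers and two retrieval queries.
acc = answer_accuracy(["Two", "red ", "a dog"], ["two", "red", "cat"])
rec = recall_at_k([[3, 1, 2], [2, 3, 1]], relevant=[1, 1], k=2)
print(acc, rec)
```

Here two of the three answers match after normalization, and the relevant item is in the top two results for one of the two queries. Published benchmarks typically define their own normalization and scoring rules, which should be followed when reporting comparable numbers.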

Frequently Asked Questions

Can I fine-tune a model and still deploy it on the silicon?

How do I select the right model for my application (detection/classification vs VLM)?

Do these models run out-of-the-box on the silicon, including pre/post processing?

How do pruning and quantization affect model accuracy, and how can I debug accuracy drops on silicon?

How should I write prompts for Vision-Language Models, and how do I measure their accuracy?