Deployment and Inference Evaluation

This section documents the practical deployment of the fine-tuned, LoRA-merged Qwen-2.5-1.5B model as a LiteRT-LM artifact: the infrastructure used, the inference setup on consumer hardware, measured performance, and key constraints observed in real-world edge execution.

Deployment Infrastructure

Conversion Environment (PyTorch → .tflite)

Conversion was executed using Google AI Edge Torch on a high-memory CPU-only AWS EC2 instance:

  • Instance type: r6i.4xlarge
  • Compute: 16 vCPUs
  • Memory: 128 GB RAM
  • GPU: None required
  • OS: Ubuntu 24.04 LTS
  • Static KV-cache length: 4096 tokens

Conversion completed in ~30–35 minutes with peak memory usage of ~30–35 GB, producing a ~1.6 GB .tflite artifact. No GPU resources were needed; the process is memory-bound, driven by graph lowering and materialization of the static KV cache. Including ~10 minutes of environment setup, the total conversion cost was ~0.60 USD at the instance's ~1.00 USD/hour on-demand rate.
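
For reference, the conversion entry point followed the pattern published in the ai-edge-torch generative examples. The sketch below is illustrative rather than exact: the checkpoint and output paths are placeholders, and module paths and argument names (e.g., prefill_seq_len, tflite_path) shift between ai-edge-torch releases.

```python
# Illustrative conversion sketch, based on the ai-edge-torch generative
# examples. Module paths and argument names vary across releases; the
# checkpoint and output paths are placeholders.
from ai_edge_torch.generative.examples.qwen import qwen
from ai_edge_torch.generative.utilities import converter

# Rebuild the merged (LoRA-applied) Qwen-2.5-1.5B model from a local
# checkpoint, with the static 4096-token KV cache used above.
pytorch_model = qwen.build_1_5b_model(
    "/path/to/merged-qwen2.5-1.5b",  # placeholder checkpoint directory
    kv_cache_max_len=4096,
)

# Lower prefill + decode signatures into a single quantized .tflite graph.
# This step is CPU- and RAM-bound; no GPU is involved.
converter.convert_to_tflite(
    pytorch_model,
    tflite_path="qwen2.5-1.5b-merged.tflite",
    prefill_seq_len=1024,  # assumption: a typical prefill signature length
    quantize=True,
)
```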

LiteRT-LM Packaging

Packaging of the .tflite model into .litertlm format (for LiteRT-LM inference) was performed on a lighter instance:

  • Instance type: r6i.2xlarge
  • Compute: 8 vCPUs
  • Memory: 64 GB RAM
  • OS: Ubuntu 24.04 LTS

This step was lightweight, completing in minutes with minimal resource demands; the LiteRT-LM builder handled metadata alignment for Qwen tokenizer conventions. Including ~10 minutes of environment setup, the total packaging cost was ~0.075 USD at the instance's ~0.50 USD/hour on-demand rate.
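
Conceptually, the builder bundles the quantized .tflite graph, the tokenizer, and prompt/stop-token metadata into a single .litertlm file. The sketch below is a hypothetical stand-in for that step: the actual builder ships with the google-ai-edge/LiteRT-LM repository and its interface differs, so build_litertlm and its parameters are illustrative only.

```python
# Hypothetical sketch of .litertlm packaging; the real builder lives in the
# google-ai-edge/LiteRT-LM repository and its interface differs. Only the
# Qwen special tokens below are real.
from pathlib import Path

def build_litertlm(tflite_path: Path, tokenizer_path: Path, out_path: Path,
                   stop_tokens: list[str]) -> None:
    """Stand-in: bundles the .tflite graph, the tokenizer, and stop-token
    metadata into one .litertlm artifact."""
    ...

build_litertlm(
    tflite_path=Path("qwen2.5-1.5b-merged.tflite"),
    tokenizer_path=Path("tokenizer.json"),        # Qwen's HF tokenizer file
    out_path=Path("qwen2.5-1.5b-merged.litertlm"),
    stop_tokens=["<|im_end|>", "<|endoftext|>"],  # Qwen chat conventions
)
```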

Inference Platform and Performance

Inference evaluation was conducted locally on consumer-grade hardware to reflect realistic offline/on-device use (e.g., farmer-facing mobile tools or extension devices in low-connectivity areas).

  • Device: Mac mini (Apple M4)
  • CPU: 10-core
  • GPU: 10-core (utilized via LiteRT-LM GPU backend)
  • Neural Engine: 16-core (not leveraged by current LiteRT-LM as of Dec 2025)
  • Unified memory: 16 GB
  • OS: macOS 26.1 "Tahoe"

Runtime configuration:

  • Backend: LiteRT-LM (with GPU acceleration on Apple Silicon)
  • Context length: 4096 tokens (static KV cache)
  • Precision: Quantized graph as produced by AI Edge Torch
  • Decoding: Conservative sampling parameters aligned with Qwen conventions (see the example below)
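
For concreteness, "conservative parameters aligned with Qwen conventions" refers to sampling settings in the range Qwen-2.5 publishes in its own generation_config; the values below are an assumed example of that range, not the exact runtime configuration used.

```python
# Assumed decoding settings, mirroring the sampling defaults Qwen-2.5
# ships in generation_config.json. How LiteRT-LM exposes these knobs is
# runtime- and version-dependent.
decode_config = {
    "temperature": 0.7,
    "top_p": 0.8,               # nucleus sampling, per Qwen's default
    "top_k": 20,
    "repetition_penalty": 1.05,
    "max_new_tokens": 256,      # headroom above the 50-150 token responses
}
```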

Measured performance (typical farmer advisory queries, 50–150 token responses):

  • Time-to-first-token (TTFT): < 1 second
  • Token generation: Incremental decoding, stable throughput
  • End-to-end response time: ~2.5–4 seconds

These latencies support interactive, offline advisory scenarios without requiring constant connectivity.
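
The figures above reflect simple wall-clock timing around the runtime's streaming output. A minimal sketch of that measurement logic follows; generate_stream is a hypothetical stand-in for whatever streaming iterator the LiteRT-LM binding exposes, and the timing logic is the point of the example.

```python
# Minimal TTFT / end-to-end latency harness. generate_stream() is a
# hypothetical stand-in for a LiteRT-LM streaming interface that yields
# decoded tokens incrementally.
import time
from typing import Callable, Iterable, Tuple

def measure_latency(
    generate_stream: Callable[[str], Iterable[str]], prompt: str
) -> Tuple[float, float, int]:
    """Returns (time_to_first_token_s, end_to_end_s, tokens_generated)."""
    start = time.perf_counter()
    ttft = 0.0
    n_tokens = 0
    for _token in generate_stream(prompt):
        if n_tokens == 0:
            ttft = time.perf_counter() - start  # first decoded token
        n_tokens += 1
    total = time.perf_counter() - start
    return ttft, total, n_tokens
```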

Summary of Deployment Findings

  • The fine-tuned Qwen-2.5-1.5B model is deployable as a LiteRT-LM artifact using public AI Edge and LiteRT-LM tooling.
  • Conversion requires only high-RAM CPU instances (no GPUs needed), enabling cost-effective cloud workflows.
  • On modern consumer hardware (e.g., Apple M4), interactive latencies (~2.5–4 s end-to-end) are achieved, suitable for offline agricultural advisory use.
  • Generation characteristics remain runtime-dependent: LiteRT-LM outputs showed reduced coherence relative to the PyTorch baseline, which underscores the importance of runtime-specific validation.

All deployment scripts, configurations, and artifacts are included in the repository for full reproducibility.