Deployment and Inference Evaluation
This section documents the practical deployment of the fine-tuned Qwen-2.5-1.5B LoRA-merged model as a LiteRT-LM artifact, covering the conversion and packaging infrastructure, the inference setup on consumer hardware, measured performance, and key constraints observed in real-world edge execution.
Deployment Infrastructure
Conversion Environment (PyTorch → .tflite)
Conversion was executed using Google AI Edge Torch on a high-memory CPU-only AWS EC2 instance:
- Instance type: r6i.4xlarge
- Compute: 16 vCPUs
- Memory: 128 GB RAM
- GPU: None required
- OS: Ubuntu 24.04 LTS
- Static KV-cache length: 4096 tokens
Conversion completed in ~30–35 minutes with peak memory usage of ~30–35 GB, producing a ~1.6 GB .tflite artifact. No GPU resources were needed; the process is memory-bound due to graph lowering and KV-cache materialization. Including ~10 minutes of environment setup, the total conversion cost was ~0.60 USD at the instance's ~1.00 USD-per-hour rate.
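Conversions of this kind typically follow the pattern of the published AI Edge Torch generative examples. The sketch below is illustrative rather than the exact script used: the `build_1_5b_model` builder and `converter.convert_to_tflite` call follow the Qwen example in the ai-edge-torch repository, module paths and argument names may differ between releases, and all paths are placeholders.

```python
# Illustrative conversion sketch based on the ai-edge-torch generative
# examples for Qwen. Module paths and argument names may differ between
# releases; all paths are placeholders.
from ai_edge_torch.generative.examples.qwen import qwen
from ai_edge_torch.generative.utilities import converter

CHECKPOINT = "/path/to/qwen2.5-1.5b-lora-merged"  # placeholder checkpoint dir

# Re-author the model with a static 4096-token KV cache, matching the
# deployment configuration described above.
pytorch_model = qwen.build_1_5b_model(CHECKPOINT, kv_cache_max_len=4096)

# Lower to a quantized .tflite graph; this is the memory-bound step
# (~30-35 GB peak on the r6i.4xlarge instance).
converter.convert_to_tflite(
    pytorch_model,
    tflite_path="/tmp/qwen2.5-1.5b_ekv4096.tflite",  # placeholder output
    prefill_seq_len=1024,  # assumption: prefill chunk length used
    quantize=True,
)
```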
LiteRT-LM Packaging
Packaging of the .tflite model into .litertlm format (for LiteRT-LM inference) was performed on a lighter instance:
- Instance type: r6i.2xlarge
- Compute: 8 vCPUs
- Memory: 64 GB RAM
- OS: Ubuntu 24.04 LTS
This step was lightweight, completing in minutes with minimal resource demands. The LiteRT-LM builder handled metadata alignment for Qwen tokenizer conventions. Including environment setup (~10 minutes), the total packaging cost was ~0.075 USD at the instance's ~0.50 USD-per-hour rate.
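The packaging step itself reduces to a single builder invocation. The following is a hypothetical sketch: `litertlm_builder` stands in for the actual LiteRT-LM packaging entry point, whose name and flags depend on the LiteRT-LM release in use, and all paths are placeholders.

```python
# Hypothetical packaging sketch: "litertlm_builder" and its flags stand in
# for the actual LiteRT-LM packaging tool, whose exact invocation depends on
# the release in use. All paths are placeholders.
import subprocess

subprocess.run(
    [
        "litertlm_builder",                           # assumed tool name
        "--tflite=/tmp/qwen2.5-1.5b_ekv4096.tflite",  # converted graph
        "--tokenizer=/path/to/qwen/tokenizer.json",   # Qwen tokenizer assets
        "--output=/tmp/qwen2.5-1.5b.litertlm",        # packaged artifact
    ],
    check=True,  # fail loudly if the builder reports an error
)
```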
Inference Platform and Performance
Inference evaluation was conducted locally on consumer-grade hardware to reflect realistic offline/on-device use (e.g., farmer-facing mobile tools or extension devices in low-connectivity areas).
- Device: Mac mini (Apple M4)
- CPU: 10-core
- GPU: 10-core (utilized via LiteRT-LM GPU backend)
- Neural Engine: 16-core (not leveraged by current LiteRT-LM as of Dec 2025)
- Unified memory: 16 GB
- OS: macOS 26.1 "Tahoe"
Runtime configuration:
- Backend: LiteRT-LM (with GPU acceleration on Apple Silicon)
- Context length: 4096 tokens (static KV cache)
- Precision: Quantized graph as produced by AI Edge Torch
- Decoding: Conservative parameters aligned with Qwen conventions
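This configuration can be exercised through the LiteRT-LM runtime binary. A minimal invocation sketch follows, assuming a `litert_lm_main` binary built from the LiteRT-LM repository; the flag names follow the repository's examples but may vary between releases, and the prompt and paths are placeholders.

```python
# Minimal invocation sketch, assuming a litert_lm_main binary built from the
# LiteRT-LM repository. Flag names follow the repository's examples and may
# differ between releases; paths and prompt are placeholders.
import subprocess

subprocess.run(
    [
        "./litert_lm_main",
        "--model_path=/tmp/qwen2.5-1.5b.litertlm",  # packaged artifact
        "--backend=gpu",  # GPU backend on Apple Silicon
        "--input_prompt=When should I irrigate young maize?",
    ],
    check=True,
)
```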
Measured performance (typical farmer advisory queries, 50–150 token responses):
- Time-to-first-token (TTFT): < 1 second
- Token generation: Incremental decoding, stable throughput
- End-to-end response time: ~2.5–4 seconds
These latencies support interactive, offline advisory scenarios without requiring constant connectivity.
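Figures like these can be gathered with simple wall-clock instrumentation around the token stream. A runtime-agnostic sketch of such a harness is shown below; `token_stream` is any iterator of decoded tokens, and the runtime-specific stream source is not shown.

```python
import time
from typing import Dict, Iterator, Optional

def measure_latency(token_stream: Iterator[str]) -> Dict[str, Optional[float]]:
    """Measure TTFT and end-to-end latency over a stream of decoded tokens."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _token in token_stream:
        if ttft is None:
            # Time-to-first-token: delay until the first token arrives.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    # Steady-state decode rate over the tokens after the first one.
    decode_tps = (count - 1) / (total - ttft) if count > 1 and total > ttft else None
    return {"ttft_s": ttft, "total_s": total, "tokens": float(count),
            "decode_tok_per_s": decode_tps}
```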
Summary of Deployment Findings
- The fine-tuned Qwen-2.5-1.5B model is deployable as a LiteRT-LM artifact using public AI Edge and LiteRT-LM tooling.
- Conversion requires only high-RAM CPU instances (no GPUs needed), enabling cost-effective cloud workflows.
- On modern consumer hardware (e.g., Apple M4), interactive latencies (~2.5–4 s end-to-end) are achieved, suitable for offline agricultural advisory use.
- Generation characteristics remain runtime-dependent: LiteRT-LM outputs showed reduced coherence relative to PyTorch, underscoring the importance of runtime-specific validation (see the parity-check sketch below).
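One lightweight form of runtime-specific validation is to run the same prompts through both runtimes and compare outputs side by side. The sketch below is generic: `generate_pytorch` and `generate_litertlm` are hypothetical callables wrapping the two runtimes, and the token-overlap ratio is a crude stand-in for a proper coherence or quality metric.

```python
# Generic parity-check sketch. generate_pytorch and generate_litertlm are
# hypothetical callables wrapping the two runtimes; token-overlap ratio is
# a crude stand-in for a proper coherence/quality metric.
from typing import Callable, List

def parity_report(prompts: List[str],
                  generate_pytorch: Callable[[str], str],
                  generate_litertlm: Callable[[str], str]) -> None:
    for prompt in prompts:
        ref_tokens = set(generate_pytorch(prompt).split())
        out_tokens = set(generate_litertlm(prompt).split())
        overlap = len(ref_tokens & out_tokens) / max(len(ref_tokens), 1)
        print(f"{prompt[:40]!r}: token overlap {overlap:.2f}")
```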
All deployment scripts, configurations, and artifacts are included in the repository for full reproducibility.