Running Local LLMs on Apple Silicon: M4 24GB Setup & Performance

5 min
5/11/2026
apple silicon, local llm, machine learning, artificial intelligence

Local AI on Apple Silicon: A New Frontier for Developers

The promise of running powerful large language models (LLMs) locally, free from cloud dependencies, is becoming a tangible reality for developers equipped with modern Apple Silicon Macs. As detailed in a recent hands-on experiment, a MacBook Pro with an M4 chip and 24GB of unified memory can successfully host and run quantized models like Qwen 3.5-9B, enabling offline coding assistance, research, and planning.

This shift towards local AI is gaining urgency as Apple grapples with a severe global memory shortage. According to MacRumors, Apple has removed higher RAM configurations for Mac mini and Mac Studio models, with M4 Mac minis now capped at 24GB. This makes optimizing local AI performance on available hardware more critical than ever.

The Hardware Landscape: Memory Constraints and Choices

The feasibility of local LLMs is intrinsically tied to hardware memory. Apple's unified memory architecture offers high bandwidth but finite capacity. The 24GB configuration in base M4 Pro MacBooks, as highlighted by Wccftech, provides a sweet spot for multitasking while running LLMs.

However, the ecosystem is under strain. Apple has reportedly ceased offering Mac mini models with 32GB and 64GB of RAM, and Mac Studio configurations face delivery delays of up to 4-5 months. This scarcity, driven by surging demand from AI server builds, is pushing users to maximize the potential of their existing 24GB systems.

Software Stack: Ollama, llama.cpp, and LM Studio

Choosing the right inference engine is the first major hurdle. The primary options are Ollama, llama.cpp, and LM Studio. Each comes with distinct quirks, limitations, and model support, requiring careful evaluation based on the user's specific needs.

For the M4 with 24GB setup, LM Studio emerged as a successful platform. It provided the necessary balance of performance, configurability, and compatibility with client applications like Pi and OpenCode, which act as AI agent frameworks.
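
Getting a first response out of the local server is straightforward. The sketch below is a minimal smoke test against LM Studio's OpenAI-compatible endpoint, assuming the default local address (http://localhost:1234/v1) and a placeholder model id; check the /v1/models endpoint for the identifier of whatever you actually have loaded.

  # Minimal smoke test against LM Studio's OpenAI-compatible local server.
  # Assumes the server is running on its default port (1234) and that a
  # model is already loaded; the model id below is a placeholder.
  import requests

  BASE_URL = "http://localhost:1234/v1"

  resp = requests.post(
      f"{BASE_URL}/chat/completions",
      json={
          "model": "qwen3.5-9b",  # hypothetical id; query /v1/models for yours
          "messages": [{"role": "user", "content": "Say hello in one sentence."}],
      },
      timeout=120,
  )
  resp.raise_for_status()
  print(resp.json()["choices"][0]["message"]["content"])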

Model Selection: Finding the Sweet Spot

Model choice is a delicate balance of capability, size, and context window. The experiment tested several models, including Qwen 3.6 Q3, GPT-OSS 20B, and Devstral Small 24B. While these technically fit within 24GB, they were found to be unusably slow in practice.

The winner was Qwen 3.5-9B quantized to 4-bit (Q4_K_S). This model delivered approximately 40 tokens per second with thinking enabled, supported successful tool use, and offered a 128K context window—all while leaving sufficient memory for other applications.
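
A rough back-of-the-envelope estimate shows why a 4-bit 9B model is comfortable in 24GB while 20B-plus models crowd the machine. The figures in the sketch below (layer count, KV heads, head dimension, cache precision) are illustrative assumptions, not published Qwen 3.5-9B specifications.

  # Rough memory estimate for a quantized model: weights plus KV cache.
  # All architecture numbers below are illustrative placeholders.
  def estimate_gb(params_b, bits_per_weight, n_layers, n_kv_heads,
                  head_dim, context_len, kv_bytes=2):
      weights_gb = params_b * 1e9 * bits_per_weight / 8 / 1e9
      # KV cache: a K and a V tensor per layer, per token, at kv_bytes each.
      kv_gb = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes / 1e9
      return weights_gb, kv_gb

  w, kv = estimate_gb(params_b=9, bits_per_weight=4.5,   # Q4_K is ~4.5 bits/weight
                      n_layers=36, n_kv_heads=8, head_dim=128,
                      context_len=32_768)
  print(f"weights ~{w:.1f} GB, KV cache ~{kv:.1f} GB at 32K context")
  # Roughly 5 GB of weights plus a context-dependent cache leaves most of the
  # 24GB for macOS and other apps; doubling the parameter count roughly doubles
  # the weight footprint, which is why 20B-class models fit but leave little headroom.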

Google's recently announced Gemma 4 models also present a compelling option. As reported by Ars Technica, these open models incorporate "speculative decoding" (MTP), which can accelerate inference by up to 2.5x on Apple M4 silicon, potentially making larger models more accessible.
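
The mechanism behind speculative decoding is easy to sketch: a cheap draft model proposes several tokens ahead, and the expensive target model verifies them, keeping the longest prefix it agrees with, so each expensive step can yield several tokens at once. The toy sketch below uses the simplified greedy-acceptance variant with stand-in models; it illustrates the idea, not Gemma 4's actual MTP implementation.

  # Toy greedy speculative decoding: a small draft model guesses k tokens,
  # the big target model checks them, and we keep the longest agreeing prefix.
  # Stand-in "models"; for illustration only.
  def speculative_step(prompt, draft_next, target_next, k=4):
      # Draft proposes k tokens autoregressively (cheap).
      drafted, ctx = [], list(prompt)
      for _ in range(k):
          t = draft_next(ctx)
          drafted.append(t)
          ctx.append(t)
      # Target checks each drafted position (a real implementation scores
      # them all in a single forward pass).
      accepted, ctx = [], list(prompt)
      for t in drafted:
          best = target_next(ctx)       # what the target would emit here
          if best != t:
              accepted.append(best)     # replace the first mismatch...
              break                     # ...and discard the rest
          accepted.append(t)
          ctx.append(t)
      return accepted                   # always >= 1 target-quality token

  # Example with trivial "models" that just complete a fixed phrase.
  phrase = list("hello world")
  draft_next  = lambda ctx: phrase[len(ctx)] if len(ctx) < len(phrase) else "."
  target_next = lambda ctx: phrase[len(ctx)] if len(ctx) < len(phrase) else "."
  print(speculative_step(list("hell"), draft_next, target_next, k=4))  # ['o', ' ', 'w', 'o']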

Optimal Configuration for Coding Tasks

Fine-tuning the model's parameters is essential for quality output, especially for coding. The recommended settings for thinking mode on Qwen 3.5-9B are:

  • temperature=0.6
  • top_p=0.95
  • top_k=20
  • min_p=0.0
  • presence_penalty=0.0
  • repetition_penalty=1.0
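
These settings can be applied per-model in LM Studio or passed on each request. Below is a sketch of a chat-completion call carrying the values above, again assuming the default local server and a placeholder model id; servers that don't recognize fields such as top_k or min_p will typically ignore them, and the same values can be set in LM Studio's sampling settings instead.

  # Chat request carrying the recommended thinking-mode sampling settings.
  # Assumes LM Studio's default local server; the model id is a placeholder.
  import requests

  payload = {
      "model": "qwen3.5-9b",            # placeholder; use your loaded model id
      "messages": [
          {"role": "user", "content": "Explain Elixir pattern matching briefly."}
      ],
      "temperature": 0.6,
      "top_p": 0.95,
      "top_k": 20,
      "min_p": 0.0,
      "presence_penalty": 0.0,
      "repetition_penalty": 1.0,
  }
  resp = requests.post("http://localhost:1234/v1/chat/completions",
                       json=payload, timeout=300)
  print(resp.json()["choices"][0]["message"]["content"])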

Enabling thinking mode in LM Studio required a specific tweak: adding {%- set enable_thinking = true %} to the prompt template in the Inference configuration tab.

Integration with AI Agent Frameworks

To make the local model practical, it must integrate with tools that facilitate real-world tasks. The setup successfully connected the LM Studio-hosted Qwen model to two agent frameworks:

Pi: Configured via a ~/.pi/agent/models.json file pointing to the local LM Studio server. A "hideThinkingBlock": true setting in settings.json improves the user interface by concealing the model's internal reasoning process.

OpenCode: Configured through ~/.config/opencode/opencode.json, similarly directing the client to the local inference endpoint and specifying the model's capabilities, including a 131,072-token context window.
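
Under the hood, both clients drive the same OpenAI-style tool-calling flow against the local endpoint. The sketch below shows a single round trip with a hypothetical read_file tool (the tool name and schema are invented for illustration); a real agent would execute the call and feed the result back as a tool message.

  # Minimal tool-use round trip in the OpenAI-compatible style that agent
  # clients such as Pi and OpenCode drive. Hypothetical tool; assumes LM
  # Studio's default local port and a placeholder model id.
  import json, requests

  BASE = "http://localhost:1234/v1/chat/completions"
  tools = [{
      "type": "function",
      "function": {
          "name": "read_file",                    # hypothetical tool
          "description": "Return the contents of a file in the project.",
          "parameters": {
              "type": "object",
              "properties": {"path": {"type": "string"}},
              "required": ["path"],
          },
      },
  }]

  resp = requests.post(BASE, json={
      "model": "qwen3.5-9b",                      # placeholder model id
      "messages": [{"role": "user", "content": "What does mix.exs declare?"}],
      "tools": tools,
  }, timeout=300).json()

  msg = resp["choices"][0]["message"]
  if msg.get("tool_calls"):                       # model chose to call the tool
      call = msg["tool_calls"][0]
      print(call["function"]["name"], json.loads(call["function"]["arguments"]))
  else:                                           # or it answered directly
      print(msg.get("content"))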

Performance and Practical Utility: A Real-World Assessment

It's crucial to temper expectations. A local 9B parameter model is not a replacement for cloud-based State-of-the-Art (SOTA) models like GPT-4 or Gemini. It is more easily distracted, can get stuck in loops, and is more prone to misinterpreting requests.

The effective workflow is highly interactive. The model excels as a research assistant, a "rubber duck" for debugging, and a reference for programming language details. It cannot autonomously solve complex, multi-step problems but can assist significantly when guided step-by-step.

In testing, the model successfully analyzed and suggested fixes for Elixir Credo linter warnings and identified simple Git merge conflict resolutions. However, it sometimes failed to execute edits correctly, highlighting the need for user oversight.

Why It Matters: The Shift to Sustainable, Private AI

The drive towards local LLMs is multifaceted. It offers privacy (no data sent to the cloud), cost predictability (no subscription fees, just electricity), and offline capability. As noted in the experiment, it also reduces dependence on large US tech companies.

Furthermore, while the environmental cost of training these models is significant, using local hardware for inference shifts compute away from massive, energy-intensive data centers. The permissive Apache 2.0 license of newer models like Gemma 4, as reported by Ars Technica, further lowers the barrier to experimentation and innovation.

Conclusion: A Viable Niche in an AI-Dominated Landscape

Running local LLMs on an M4 Mac with 24GB of RAM is a viable and rewarding endeavor for developers willing to navigate the initial setup complexity and accept the performance trade-offs. The global RAM shortage makes efficient use of available hardware paramount.

The ecosystem of open models (Qwen, Gemma), efficient inference engines (LM Studio, llama.cpp), and agent frameworks (Pi, OpenCode) is maturing rapidly. For tasks requiring instant recall, basic coding assistance, and offline operation, a local setup provides a powerful, private, and engaging alternative to cloud-based AI, carving out a sustainable niche in the broader AI revolution.