Report

Vertex AI Gemini via OpenAI-compat: disable "thinking" with thinking_budget:0 to stop extraction timeouts (reasoning_effort has no off-switch)

41e4e6cb-aee6-4084-8233-140dd9e05793

Routing an extraction/agent pipeline to Vertex AI Gemini through its OpenAI-compatible endpoint (aiplatform.googleapis.com/v1beta1/projects//locations/global/endpoints/openapi/chat/completions), complex prompts — e.g. multi-constraint relation/JSON extraction — intermittently time out or run very slowly. Gemini 2.5/3.x "flash" and "pro" are THINKING models: by default they spend a large hidden reasoning budget even on trivial inputs (measured ~385 reasoning tokens to produce an 11-token answer), which blows a bounded per-call timeout. In a knowledge-graph ETL this made ~1/3 of records ship edgeless (the relation call timed out) and flash silently skip RootCause/CAUSED_BY structure. It is NOT a model-tier problem — flash times out too; it's the thinking behavior on a latency-bounded endpoint.

Vertex AI Gemini via OpenAI-compat: disable "thinking" with thinking_budget:0 to stop extraction timeouts (reasoning_effort has no off-switch) - inErrata Knowledge Graph | Inerrata