[{"data":1,"prerenderedAt":1215},["ShallowReactive",2],{"post-\u002Fblog\u002Fbuilding-ml-inference-part-3":3},{"id":4,"title":5,"body":6,"book":1203,"date":1204,"description":1205,"extension":1206,"meta":1207,"navigation":1208,"path":1209,"seo":1210,"stem":1211,"tags":1212,"__hash__":1214},"blog\u002Fblog\u002Fbuilding-ml-inference-part-3.md","Building an ML Inference API, Part III",{"type":7,"value":8,"toc":1191},"minimark",[9,14,31,62,70,77,80,85,137,141,148,154,157,167,187,201,210,221,232,236,243,246,288,291,336,339,347,351,358,364,370,383,394,411,429,435,438,446,453,457,467,485,495,511,514,547,550,558,561,567,637,647,658,661,667,717,720,732,745,789,797,820,823,827,838,844,855,858,864,875,878,900,905,911,926,929,947,950,961,972,976,979,993,1000,1006,1012,1025,1031,1048,1051,1056,1067,1078,1082,1187],[10,11,13],"h2",{"id":12},"template","Template",[15,16,17,18,23,24,30],"p",{},"As a follow-up to ",[19,20,22],"a",{"href":21},"\u002Fblog\u002Fbuilding-ml-inference-part-2","Part II",": if you are looking for a template to start building an ML Inference API engine, check out ",[19,25,29],{"href":26,"rel":27},"https:\u002F\u002Fgithub.com\u002Fhaiilong\u002Fml-inference-api-template",[28],"nofollow","haiilong\u002Fml-inference-api-template"," on GitHub. It covers:",[32,33,34,38,41,44,47,50,53,56,59],"ol",{},[35,36,37],"li",{},"Project structure",[35,39,40],{},"Sample code with 2 different endpoints",[35,42,43],{},"2 sample models",[35,45,46],{},"Middleware setup",[35,48,49],{},"Dockerfile",[35,51,52],{},"uv as Python package manager",[35,54,55],{},"Unit tests (pytest), load tests (k6), e2e tests (node-fetch)",[35,57,58],{},"Deployment (Docker registry, Kubernetes cluster, gitlab-ci)",[35,60,61],{},"How to extend and apply to your use case",[15,63,64,65,69],{},"This structure is ",[66,67,68],"strong",{},"production-ready",". 
Of course, you can always add your own pieces, such as HPA, VPA, or other middleware, if needed.",[15,71,72,73,76],{},"PLEASE READ THE README CAREFULLY IF YOU WANT TO USE THIS TEMPLATE ",[74,75],"d",{},".",[15,78,79],{},"In this post, I explain the technical decisions behind this template, which we eventually used for all of our ML inference APIs.",[15,81,82],{},[66,83,84],{},"Sections:",[32,86,87,93,99,105,111,125,131],{},[35,88,89],{},[19,90,92],{"href":91},"#_1-why-fastapi-uvicorn-gunicorn-not-flask-gevent","Why FastAPI \u002F Uvicorn \u002F Gunicorn (not Flask + gevent)",[35,94,95],{},[19,96,98],{"href":97},"#_2-why-orjson-over-the-default-json-encoder","Why orjson over the default JSON encoder",[35,100,101],{},[19,102,104],{"href":103},"#_3-why-uv-over-conda","Why uv over conda",[35,106,107],{},[19,108,110],{"href":109},"#_4-anyio-vs-asyncio","anyio vs asyncio",[35,112,113],{},[19,114,116,120,121,124],{"href":115},"#_5-async-def-plus-run_in_threadpool-the-importance-of-the-offload",[117,118,119],"code",{},"async def"," plus ",[117,122,123],{},"run_in_threadpool"," (the importance of the offload)",[35,126,127],{},[19,128,130],{"href":129},"#_6-why-an-initcontainer-loads-models-from-a-pvc-in-kubernetes","Why an initContainer loads models from a PVC in Kubernetes",[35,132,133],{},[19,134,136],{"href":135},"#_7-hot-reload-models-in-production-or-redeploy","Hot reload models in production, or redeploy?",[10,138,140],{"id":139},"_1-why-fastapi-uvicorn-gunicorn-not-flask-gevent","1. Why FastAPI \u002F Uvicorn \u002F Gunicorn (not Flask + gevent)",[15,142,143,144,147],{},"A common Python ML serving stack is Flask running under gunicorn with gevent workers. It works, it has been deployed everywhere, and for very simple cases it is fine. The template uses ",[66,145,146],{},"FastAPI on Uvicorn workers under Gunicorn"," instead. 
Reasons:",[15,149,150,153],{},[66,151,152],{},"Concurrency model."," Gevent achieves concurrency by monkey-patching the standard library to make blocking I\u002FO cooperatively yield. That model is great for I\u002FO-bound web apps but has well-known sharp edges with C extensions. NumPy, scikit-learn, XGBoost, and CatBoost all spend most of their time inside C code that does not yield to gevent's scheduler, so a single slow inference still blocks every request that the worker is multiplexing. You end up tuning around the very thing you wanted concurrency for.",[15,155,156],{},"FastAPI on Uvicorn uses an asyncio event loop. CPU-bound inference is offloaded explicitly to a bounded thread pool (see section 5), which means the loop never blocks even when prediction takes 200 ms. It is the same end goal as gevent but with explicit boundaries instead of monkey-patched implicit ones.",[15,158,159,162,163,166],{},[66,160,161],{},"Validation and schema for free."," FastAPI builds on Pydantic. Defining a request body as a ",[117,164,165],{},"BaseModel"," gives you:",[168,169,170,173,180],"ul",{},[35,171,172],{},"automatic 422 responses with field-level error messages on bad input,",[35,174,175,176,179],{},"generated OpenAPI \u002F Swagger UI at ",[117,177,178],{},"\u002F",",",[35,181,182,183,186],{},"typed request handlers, so editor tooling and ",[117,184,185],{},"mypy"," actually understand the code.",[15,188,189,190,193,194,193,197,200],{},"In Flask, you write that boilerplate by hand or pull in extensions (",[117,191,192],{},"marshmallow",", ",[117,195,196],{},"flask-pydantic",[117,198,199],{},"flask-smorest","). For an inference API where the request shape is the contract with the caller, the FastAPI default is much closer to \"what you would build anyway.\"",[15,202,203,206,207,166],{},[66,204,205],{},"Why Gunicorn at all if Uvicorn can serve directly."," Uvicorn is the ASGI server (one process, one event loop). 
Gunicorn is a process supervisor that manages multiple worker processes, handles graceful reload on SIGHUP, restarts crashed workers, listens on a single socket and load-balances across workers, and integrates with most ops tooling. The combination ",[117,208,209],{},"gunicorn -k uvicorn.workers.UvicornWorker",[168,211,212,215,218],{},[35,213,214],{},"process-level isolation across CPU cores (one event loop per worker),",[35,216,217],{},"production-grade signal handling and graceful shutdown,",[35,219,220],{},"the ASGI runtime you actually want.",[15,222,223,224,227,228,231],{},"Tuning: set ",[117,225,226],{},"GUNICORN_WORKERS"," to roughly ",[117,229,230],{},"min(cpu_count, target_concurrency \u002F threads_per_worker)",". For inference workloads, one worker per CPU core is usually a good starting point.",[10,233,235],{"id":234},"_2-why-orjson-over-the-default-json-encoder","2. Why orjson over the default JSON encoder",[15,237,238,239,242],{},"Python's stdlib ",[117,240,241],{},"json"," module is relatively slow, even with the C accelerator that CPython ships. For inference APIs that return predictions as JSON, the encoder can become a measurable share of total request time, especially when responses contain numpy floats or longer batch outputs.",[15,244,245],{},"orjson is written in Rust and:",[168,247,248,254,268,285],{},[35,249,250,251,253],{},"Serializes typically 3 to 10 times faster than stdlib ",[117,252,241],{}," and 2 to 4 times faster than ujson.",[35,255,256,257,260,261,263,264,267],{},"Returns ",[117,258,259],{},"bytes"," directly, which is the native ASGI write type. 
Stdlib ",[117,262,241],{}," produces a ",[117,265,266],{},"str"," that has to be encoded to bytes again.",[35,269,270,271,193,274,193,277,280,281,284],{},"Handles ",[117,272,273],{},"datetime",[117,275,276],{},"UUID",[117,278,279],{},"dataclasses",", and (via the OPT_SERIALIZE_NUMPY option) numpy arrays without a custom ",[117,282,283],{},"default="," function.",[35,286,287],{},"Is strict about the JSON spec (no NaN by default, deterministic key order if asked).",[15,289,290],{},"We wire it in once on the app object:",[292,293,298],"pre",{"className":294,"code":295,"language":296,"meta":297,"style":297},"language-python shiki shiki-themes one-light one-dark-pro","app = FastAPI(default_response_class=ORJSONResponse, ...)\n","python","",[117,299,300],{"__ignoreMap":297},[301,302,305,309,313,317,320,324,326,329,333],"span",{"class":303,"line":304},"line",1,[301,306,308],{"class":307},"s5ixo","app ",[301,310,312],{"class":311},"sknuh","=",[301,314,316],{"class":315},"slOjB"," FastAPI",[301,318,319],{"class":307},"(",[301,321,323],{"class":322},"sp7wS","default_response_class",[301,325,312],{"class":311},[301,327,328],{"class":307},"ORJSONResponse, ",[301,330,332],{"class":331},"sYebD","...",[301,334,335],{"class":307},")\n",[15,337,338],{},"After that, every endpoint returns orjson-encoded responses with no per-route changes.",[15,340,341,342,346],{},"When ",[343,344,345],"em",{},"not"," to bother: if your responses are tiny (a single float) and your throughput is low (under 100 RPS), the win is negligible and you should not care. For batch predictions, payload sizes grow linearly with batch size and orjson is the right default.",[10,348,350],{"id":349},"_3-why-uv-over-conda","3. Why uv over conda",[15,352,353,354,357],{},"ML projects often default to conda because that is what notebook environments use. For a deployable inference service, ",[117,355,356],{},"uv"," is dramatically better.",[15,359,360,363],{},[66,361,362],{},"Speed."," uv is written in Rust. 
A clean dependency install for this template takes a few seconds. The equivalent conda env solve (especially with custom channels and pinned versions) routinely takes minutes. CI and Docker builds inherit that delta.",[15,365,366,369],{},[66,367,368],{},"Simpler container images."," With conda, you typically end up with two Dockerfiles:",[168,371,372,378],{},[35,373,374,377],{},[117,375,376],{},"Dockerfile.base"," that installs micromamba and creates the env.",[35,379,380,382],{},[117,381,49],{}," that copies source on top of the base.",[15,384,385,386,389,390,393],{},"Plus a ",[117,387,388],{},"Dockerfile.local"," and a ",[117,391,392],{},"Dockerfile.test"," to keep things consistent. The base image has to be pre-built and cached in a registry to keep main builds fast.",[15,395,396,397,400,401,404,405,407,408,410],{},"With uv, the production image is one stage, no base, no separate registry artifact. ",[117,398,399],{},"uv sync --frozen --no-dev"," installs into ",[117,402,403],{},"\u002Fopt\u002Fvenv"," quickly enough that you do not need a pre-built base. The template uses a single ",[117,406,49],{}," (plus a tiny ",[117,409,392],{},").",[15,412,413,416,417,420,421,424,425,428],{},[66,414,415],{},"First-class lockfile."," ",[117,418,419],{},"uv.lock"," is committed and reproducible. ",[117,422,423],{},"uv lock --check"," in CI fails fast if anyone forgot to update it. Conda's lockfile story (",[117,426,427],{},"conda-lock",") works but is a separate tool with separate pitfalls.",[15,430,431,434],{},[66,432,433],{},"Pure PyPI."," No custom channels. No channel ordering bugs. 
No \"which channel does this come from\" mysteries.",[15,436,437],{},"When conda still wins:",[168,439,440,443],{},[35,441,442],{},"Hard non-Python dependencies that conda packages provide and PyPI does not (some MKL builds, some GIS stacks, some CUDA-pinned tooling).",[35,444,445],{},"Notebook \u002F data science workflows where you want one tool managing kernels, Python versions, and packages together.",[15,447,448,449,452],{},"For an ML inference ",[343,450,451],{},"service"," whose dependencies are joblib, scikit-learn, xgboost, and FastAPI, those gaps do not apply.",[10,454,456],{"id":455},"_4-anyio-vs-asyncio","4. anyio vs asyncio",[15,458,459,462,463,466],{},[117,460,461],{},"asyncio"," is the stdlib async runtime. ",[117,464,465],{},"anyio"," is a higher-level wrapper that runs on top of asyncio (or trio) and adds:",[168,468,469,472,479,482],{},[35,470,471],{},"structured concurrency (task groups with proper cancellation propagation),",[35,473,474,475,478],{},"a global thread limiter so ",[117,476,477],{},"anyio.to_thread.run_sync"," does not spawn unbounded threads,",[35,480,481],{},"cleaner cancel scopes and timeout primitives,",[35,483,484],{},"a portable API: the same code runs on asyncio or trio.",[15,486,487,488,490,491,494],{},"You do not need to import ",[117,489,465],{}," directly. ",[66,492,493],{},"Starlette and FastAPI use anyio internally",", which means:",[168,496,497,505,508],{},[35,498,499,502,503,76],{},[117,500,501],{},"fastapi.concurrency.run_in_threadpool"," is a thin wrapper over ",[117,504,477],{},[35,506,507],{},"The thread pool that runs your offloaded inference is anyio's, capped by anyio's global limiter (default 40 threads).",[35,509,510],{},"Dependency injection, background tasks, and lifespan all run on the anyio runtime.",[15,512,513],{},"Practical implications:",[168,515,516,529,536],{},[35,517,518,519,522,523,525,526,528],{},"Do not write ",[117,520,521],{},"asyncio.to_thread(...)"," in handlers. 
It bypasses the limiter and runs on asyncio's default executor, a separate pool that anyio cannot govern. Use ",[117,524,123],{}," (or ",[117,527,477],{}," directly).",[35,530,531,532,535],{},"If you need to raise the thread limit for high-concurrency CPU-bound inference, do it once at startup with ",[117,533,534],{},"anyio.to_thread.current_default_thread_limiter().total_tokens = N",". Be deliberate; threads have memory overhead and switching cost.",[35,537,538,539,542,543,546],{},"Mixing in ",[117,540,541],{},"asyncio.create_task"," is fine, but prefer ",[117,544,545],{},"anyio.create_task_group"," for anything where you want clean cancellation on error.",[15,548,549],{},"In short: anyio is the runtime; asyncio is the engine underneath. Code against the FastAPI \u002F anyio surface and you stay portable and bounded.",[10,551,553,554,120,556,124],{"id":552},"_5-async-def-plus-run_in_threadpool-the-importance-of-the-offload","5. ",[117,555,119],{},[117,557,123],{},[15,559,560],{},"The README covers the basic recipe. This section explains the \"why it matters\" part.",[15,562,563,566],{},[66,564,565],{},"The pitfall."," A handler written like this looks innocent:",[292,568,570],{"className":294,"code":569,"language":296,"meta":297,"style":297},"@app.post(\"\u002Fpredict\u002Fprice\")\nasync def post_predict_price(request: PricePredictionRequest):\n    return calculate_price(request)  # WRONG: blocking call inside async handler\n",[117,571,572,592,621],{"__ignoreMap":297},[301,573,574,578,581,584,586,590],{"class":303,"line":304},[301,575,577],{"class":576},"sAdtL","@app",[301,579,76],{"class":580},"siaei",[301,582,583],{"class":576},"post",[301,585,319],{"class":307},[301,587,589],{"class":588},"sDhpE","\"\u002Fpredict\u002Fprice\"",[301,591,335],{"class":307},[301,593,595,599,602,605,607,611,614,618],{"class":303,"line":594},2,[301,596,598],{"class":597},"sLKXg","async",[301,600,601],{"class":597}," def",[301,603,604],{"class":576}," 
post_predict_price",[301,606,319],{"class":307},[301,608,610],{"class":609},"so_Uh","request",[301,612,613],{"class":307},":",[301,615,617],{"class":616},"sxymB"," PricePredictionRequest",[301,619,620],{"class":307},"):\n",[301,622,624,627,630,633],{"class":303,"line":623},3,[301,625,626],{"class":597},"    return",[301,628,629],{"class":315}," calculate_price",[301,631,632],{"class":307},"(request)  ",[301,634,636],{"class":635},"sW2Sy","# WRONG: blocking call inside async handler\n",[15,638,639,642,643,646],{},[117,640,641],{},"calculate_price"," calls ",[117,644,645],{},"model.predict(...)",", which is a synchronous CPU-bound C extension call. The event loop is blocked for the entire duration of that call. While it runs:",[168,648,649,652,655],{},[35,650,651],{},"no other request handler on this worker can make progress,",[35,653,654],{},"no health check can be answered,",[35,656,657],{},"no timeout, cancellation, or middleware can interleave.",[15,659,660],{},"If predictions take 50 ms and you receive 100 concurrent requests, the 100th caller waits 5 seconds for what should be a 50 ms operation. 
The event loop's whole value proposition (cheap concurrency) is wasted.",[15,662,663,666],{},[66,664,665],{},"The fix."," Offload the blocking call to a worker thread:",[292,668,670],{"className":294,"code":669,"language":296,"meta":297,"style":297},"@app.post(\"\u002Fpredict\u002Fprice\")\nasync def post_predict_price(request: PricePredictionRequest):\n    return await run_in_threadpool(calculate_price, request)\n",[117,671,672,686,704],{"__ignoreMap":297},[301,673,674,676,678,680,682,684],{"class":303,"line":304},[301,675,577],{"class":576},[301,677,76],{"class":580},[301,679,583],{"class":576},[301,681,319],{"class":307},[301,683,589],{"class":588},[301,685,335],{"class":307},[301,687,688,690,692,694,696,698,700,702],{"class":303,"line":594},[301,689,598],{"class":597},[301,691,601],{"class":597},[301,693,604],{"class":576},[301,695,319],{"class":307},[301,697,610],{"class":609},[301,699,613],{"class":307},[301,701,617],{"class":616},[301,703,620],{"class":307},[301,705,706,708,711,714],{"class":303,"line":623},[301,707,626],{"class":597},[301,709,710],{"class":597}," await",[301,712,713],{"class":315}," run_in_threadpool",[301,715,716],{"class":307},"(calculate_price, request)\n",[15,718,719],{},"Now the event loop awaits the thread pool and stays responsive. Other requests are served concurrently up to the size of the thread pool. The Python GIL still serializes pure-Python work between threads, but ML libraries (numpy, sklearn, xgboost) release the GIL inside their C kernels, so multi-threaded inference does scale.",[15,721,722],{},[66,723,724,725,728,729,731],{},"Why not let FastAPI do it implicitly with ",[117,726,727],{},"def"," instead of ",[117,730,119],{},"?",[15,733,734,735,738,739,741,742,744],{},"If you write a ",[343,736,737],{},"sync"," handler (",[117,740,727],{},", not ",[117,743,119],{},"), FastAPI auto-runs it in the threadpool for you. 
So you could write:",[292,746,748],{"className":294,"code":747,"language":296,"meta":297,"style":297},"@app.post(\"\u002Fpredict\u002Fprice\")\ndef post_predict_price(request: PricePredictionRequest):\n    return calculate_price(request)\n",[117,749,750,764,780],{"__ignoreMap":297},[301,751,752,754,756,758,760,762],{"class":303,"line":304},[301,753,577],{"class":576},[301,755,76],{"class":580},[301,757,583],{"class":576},[301,759,319],{"class":307},[301,761,589],{"class":588},[301,763,335],{"class":307},[301,765,766,768,770,772,774,776,778],{"class":303,"line":594},[301,767,727],{"class":597},[301,769,604],{"class":576},[301,771,319],{"class":307},[301,773,610],{"class":609},[301,775,613],{"class":307},[301,777,617],{"class":616},[301,779,620],{"class":307},[301,781,782,784,786],{"class":303,"line":623},[301,783,626],{"class":597},[301,785,629],{"class":315},[301,787,788],{"class":307},"(request)\n",[15,790,791,792,120,794,796],{},"and it would behave correctly. The reason the template prefers explicit ",[117,793,119],{},[117,795,123],{}," anyway:",[168,798,799,805,814],{},[35,800,801,804],{},[66,802,803],{},"Boundary visibility."," The reader sees exactly which call is blocking and which is async. With auto-offload, every reader has to remember the FastAPI rule.",[35,806,807,810,811,813],{},[66,808,809],{},"Mixed work."," The moment you need to await something else (fetch a fresh feature from a cache, log to an async sink, call a remote service) you must convert the handler to ",[117,812,119],{},". Starting that way avoids the rewrite.",[35,815,816,819],{},[66,817,818],{},"Middleware composition."," Async middleware (like the access log middleware in this template) interleaves more cleanly when handlers are async and only the inference itself is offloaded.",[15,821,822],{},"Both patterns are valid. The template picks the more explicit one.",[10,824,826],{"id":825},"_6-why-an-initcontainer-loads-models-from-a-pvc-in-kubernetes","6. 
Why an initContainer loads models from a PVC in Kubernetes",[15,828,829,830,833,834,837],{},"The deployment spec uses an ",[117,831,832],{},"initContainer"," that copies model files from a PVC into a shared ",[117,835,836],{},"emptyDir",", which the app container then mounts read-only. Why not just bake the model into the image?",[15,839,840,843],{},[66,841,842],{},"Image size and rebuild cost."," Model files are often hundreds of MB to several GB. Baking them into the image means:",[168,845,846,849,852],{},[35,847,848],{},"every deploy pushes those bytes to the registry,",[35,850,851],{},"every node pulls them on cold start,",[35,853,854],{},"a model update requires a full image rebuild, registry round-trip, and rollout.",[15,856,857],{},"For frequently retrained models, this turns deploys into slow, expensive operations even when the code did not change.",[15,859,860,863],{},[66,861,862],{},"Decoupled lifecycle."," Models are produced by a training pipeline. Code is produced by an application repo. Tying them together in a single image means you cannot:",[168,865,866,869,872],{},[35,867,868],{},"update a model without code review for the API repo,",[35,870,871],{},"run two pods with different model versions for A\u002FB testing without two images,",[35,873,874],{},"roll back the API independently of the model.",[15,876,877],{},"The PVC + initContainer pattern lets the training pipeline write to a known location (object storage backed by a PVC, or a model registry mount) and the API simply consumes whatever is there at pod start.",[15,879,880,883,884,887,888,891,892,895,896,899],{},[66,881,882],{},"Failure-loud at startup."," The ",[117,885,886],{},"cp ... || echo \"not found\""," pattern in the initContainer is intentionally permissive about ",[343,889,890],{},"which"," files are present, but the ",[117,893,894],{},"lifespan"," handler in ",[117,897,898],{},"app.py"," is strict: a missing model crashes startup. 
That gives you the right behavior:",[168,901,902],{},[35,903,904],{},"if the PVC mount is broken, the pod fails its readiness probe and the rolling deploy stops, so you never serve traffic with a partially loaded model.",[15,906,907,910],{},[66,908,909],{},"Trade-offs."," This pattern has costs:",[168,912,913,920,923],{},[35,914,915,916,919],{},"a deployment is no longer fully described by ",[117,917,918],{},"image:tag"," alone; you also need to know what was in the PVC at the moment of pod start. For audit \u002F compliance, capture the model checksum at startup and log it.",[35,921,922],{},"the cluster needs shared storage. In simple single-node setups, baking the model into the image is fine.",[35,924,925],{},"cold start is slower (initContainer pulls model files into the emptyDir).",[15,927,928],{},"When to bake the model into the image instead:",[168,930,931,934,937],{},[35,932,933],{},"the model is small (under ~100 MB) and changes only when code changes,",[35,935,936],{},"you do not have shared storage in the target cluster,",[35,938,939,940,943,944,946],{},"you specifically ",[343,941,942],{},"want"," the deployment to be reproducible from ",[117,945,918],{}," alone (regulated environments).",[15,948,949],{},"When to use a model registry (MLflow, Vertex AI, S3 prefixes) instead of a PVC:",[168,951,952,955,958],{},[35,953,954],{},"multiple clusters need the same model files,",[35,956,957],{},"you want versioned model URIs in the config rather than \"whatever is on the PVC,\"",[35,959,960],{},"you want rollback to a specific historical version.",[15,962,963,964,967,968,971],{},"The PVC pattern in the template is the smallest k8s-native version of \"model lives outside the image.\" Swap the PVC for a ",[117,965,966],{},"gcsfuse"," or ",[117,969,970],{},"s3fs"," mount, or replace the initContainer with an in-process download from a registry, when you outgrow it.",[10,973,975],{"id":974},"_7-hot-reload-models-in-production-or-redeploy","7. 
Hot reload models in production, or redeploy?",[15,977,978],{},"Two designs:",[168,980,981,987],{},[35,982,983,986],{},[66,984,985],{},"(A) Hot reload."," A background task watches a model registry (or a file path, or polls an HTTP endpoint), and atomically swaps the in-memory model reference when a new version arrives. Pods stay up. Optionally, an admin endpoint triggers a manual reload.",[35,988,989,992],{},[66,990,991],{},"(B) Redeploy."," Model paths or versions are pinned in source \u002F config. Updating the model means rolling out a new pod (with a new initContainer fetch, or a new image).",[15,994,995,996,999],{},"This template uses ",[66,997,998],{},"(B)",". The trade-offs:",[15,1001,1002,1005],{},[66,1003,1004],{},"Reproducibility."," With (B), a git SHA plus an image tag (plus, for this template, the model file checksum logged at startup) fully describes runtime behavior. When something goes wrong in production at 02:00, you can reconstruct exactly what was running. With (A), you also need \"which model version was loaded in this pod at the moment of the request,\" which means more logging discipline and more places for drift.",[15,1007,1008,1011],{},[66,1009,1010],{},"Atomic rollout."," Kubernetes already gives you a great rollout primitive: rolling deploys with health checks, automatic rollback on failed readiness, traffic shifting. With (B), updating a model uses that machinery for free. With (A), you reinvent it: you need a per-pod swap protocol, a way to drain in-flight requests off the old reference, and a rollback mechanism that is not just \"swap back.\"",[15,1013,1014,1017,1018,193,1021,1024],{},[66,1015,1016],{},"Multi-pod consistency."," During a hot reload, different pods will briefly serve different model versions. Usually fine, occasionally surprising (especially if the model output range changed between versions and downstream consumers care). 
Rolling redeploys still cause this transiently, but the Deployment's rolling-update strategy gives you knobs (",[117,1019,1020],{},"maxSurge",[117,1022,1023],{},"maxUnavailable",") to bound the window. Hot reload across N pods does not.",[15,1026,1027,1030],{},[66,1028,1029],{},"Failure surface."," Hot reload adds:",[168,1032,1033,1036,1039,1042,1045],{},[35,1034,1035],{},"a background polling thread or scheduler,",[35,1037,1038],{},"a model registry client,",[35,1040,1041],{},"an admin endpoint or signal handler (which needs auth),",[35,1043,1044],{},"atomic swap logic that is correct under concurrent reads from N request handlers,",[35,1046,1047],{},"monitoring for \"did the swap actually happen on every pod?\"",[15,1049,1050],{},"Each of those is a place a bug can live. Redeploys reuse Kubernetes machinery you already trust.",[15,1052,1053],{},[66,1054,1055],{},"When you would still want hot reload:",[168,1057,1058,1061,1064],{},[35,1059,1060],{},"Models retrain hourly or faster and redeploys are expensive (multi-GB images, slow startup with large models, many pods).",[35,1062,1063],{},"You need A\u002FB testing where the routing changes at runtime, not at deploy time.",[35,1065,1066],{},"You operate at a fleet scale where triggering N redeploys is itself a problem.",[15,1068,1069,1070,1073,1074,1077],{},"For a template aimed at \"first ML inference service,\" (B) is correct: it is simpler, more reproducible, and gives you Kubernetes-native rollout for free. 
If you outgrow it, the structure of the template (a ",[117,1071,1072],{},"MLModels"," singleton with explicit accessor methods) makes adding a ",[117,1075,1076],{},"reload()"," method localized and safe.",[10,1079,1081],{"id":1080},"summary-table","Summary table",[1083,1084,1085,1101],"table",{},[1086,1087,1088],"thead",{},[1089,1090,1091,1095,1098],"tr",{},[1092,1093,1094],"th",{},"Decision",[1092,1096,1097],{},"Choice",[1092,1099,1100],{},"Main reason",[1102,1103,1104,1116,1130,1140,1151,1165,1176],"tbody",{},[1089,1105,1106,1110,1113],{},[1107,1108,1109],"td",{},"Web framework",[1107,1111,1112],{},"FastAPI on Uvicorn under Gunicorn",[1107,1114,1115],{},"Native async, Pydantic validation, OpenAPI for free, no monkey-patching",[1089,1117,1118,1121,1127],{},[1107,1119,1120],{},"JSON encoder",[1107,1122,1123,1124],{},"orjson via ",[117,1125,1126],{},"ORJSONResponse",[1107,1128,1129],{},"3 to 10x faster, returns bytes, handles numpy and datetime cleanly",[1089,1131,1132,1135,1137],{},[1107,1133,1134],{},"Package \u002F env manager",[1107,1136,356],{},[1107,1138,1139],{},"Fast, single Dockerfile, first-class lockfile, no custom channels",[1089,1141,1142,1145,1148],{},[1107,1143,1144],{},"Async runtime",[1107,1146,1147],{},"anyio (via FastAPI)",[1107,1149,1150],{},"Bounded thread pool, structured concurrency, what FastAPI uses anyway",[1089,1152,1153,1156,1162],{},[1107,1154,1155],{},"Inference call style",[1107,1157,1158,120,1160],{},[117,1159,119],{},[117,1161,123],{},[1107,1163,1164],{},"Explicit offload boundary, composes with async middleware",[1089,1166,1167,1170,1173],{},[1107,1168,1169],{},"Model loading in k8s",[1107,1171,1172],{},"initContainer plus PVC plus emptyDir",[1107,1174,1175],{},"Decouples model lifecycle from image lifecycle, smaller images",[1089,1177,1178,1181,1184],{},[1107,1179,1180],{},"Model update strategy",[1107,1182,1183],{},"Redeploy, not hot reload",[1107,1185,1186],{},"Reproducibility, atomic rollout, fewer failure 
modes",[1188,1189,1190],"style",{},"html pre.shiki code .s5ixo, html code.shiki .s5ixo{--shiki-default:#383A42;--shiki-dark:#ABB2BF}html pre.shiki code .sknuh, html code.shiki .sknuh{--shiki-default:#383A42;--shiki-dark:#56B6C2}html pre.shiki code .slOjB, html code.shiki .slOjB{--shiki-default:#383A42;--shiki-dark:#61AFEF}html pre.shiki code .sp7wS, html code.shiki .sp7wS{--shiki-default:#986801;--shiki-default-font-style:inherit;--shiki-dark:#E06C75;--shiki-dark-font-style:italic}html pre.shiki code .sYebD, html code.shiki .sYebD{--shiki-default:#383A42;--shiki-dark:#D19A66}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html.dark .shiki span {color: var(--shiki-dark);background: var(--shiki-dark-bg);font-style: var(--shiki-dark-font-style);font-weight: var(--shiki-dark-font-weight);text-decoration: var(--shiki-dark-text-decoration);}html pre.shiki code .sAdtL, html code.shiki .sAdtL{--shiki-default:#4078F2;--shiki-dark:#61AFEF}html pre.shiki code .siaei, html code.shiki .siaei{--shiki-default:#4078F2;--shiki-dark:#ABB2BF}html pre.shiki code .sDhpE, html code.shiki .sDhpE{--shiki-default:#50A14F;--shiki-dark:#98C379}html pre.shiki code .sLKXg, html code.shiki .sLKXg{--shiki-default:#A626A4;--shiki-dark:#C678DD}html pre.shiki code .so_Uh, html code.shiki 
.so_Uh{--shiki-default:#986801;--shiki-default-font-style:inherit;--shiki-dark:#D19A66;--shiki-dark-font-style:italic}html pre.shiki code .sxymB, html code.shiki .sxymB{--shiki-default:#986801;--shiki-dark:#ABB2BF}html pre.shiki code .sW2Sy, html code.shiki .sW2Sy{--shiki-default:#A0A1A7;--shiki-default-font-style:italic;--shiki-dark:#7F848E;--shiki-dark-font-style:italic}",{"title":297,"searchDepth":594,"depth":594,"links":1192},[1193,1194,1195,1196,1197,1198,1200,1201,1202],{"id":12,"depth":594,"text":13},{"id":139,"depth":594,"text":140},{"id":234,"depth":594,"text":235},{"id":349,"depth":594,"text":350},{"id":455,"depth":594,"text":456},{"id":552,"depth":594,"text":1199},"5. async def plus run_in_threadpool (the importance of the offload)",{"id":825,"depth":594,"text":826},{"id":974,"depth":594,"text":975},{"id":1080,"depth":594,"text":1081},null,"2025-09-05","Technical design choices for ML Inference API","md",{},true,"\u002Fblog\u002Fbuilding-ml-inference-part-3",{"title":5,"description":1205},"blog\u002Fbuilding-ml-inference-part-3",[1213],"tech","2ze6JSE-YMpOJ1SLCozMf_nHeDWV4Xy8MX0cik5K3ZM",1778998257279]