
OpenAI Proxy Server

A local, fast, and lightweight OpenAI-compatible server to call 100+ LLM APIs.

info

We want to learn how to make the proxy better! Meet the founders or join our Discord.

Usage

$ pip install litellm
$ litellm --model ollama/codellama 

#INFO: Ollama running on http://0.0.0.0:8000

Test

In a new shell, run:

$ litellm --test

Replace openai base

import openai 

openai.api_key = "any-string-here" # required by the openai SDK; the proxy ignores it
openai.api_base = "http://0.0.0.0:8000"

print(openai.ChatCompletion.create(model="test", messages=[{"role": "user", "content": "Hey!"}]))

Other supported models:

# Assuming you're running vllm locally
$ litellm --model vllm/facebook/opt-125m

Jump to Code

[Tutorial]: Use with Continue-Dev/Aider/AutoGen/Langroid/etc.

Here's how to use the proxy to test codellama/mistral/etc. models for different GitHub repos.

$ pip install litellm
$ ollama pull codellama # our local CodeLlama

$ litellm --model ollama/codellama --temperature 0.3 --max_tokens 2048

Implementation for different repos

Continue-Dev brings ChatGPT to VSCode. See how to install it here.

In config.py, set this as your default model.

default=OpenAI(
    api_key="IGNORED",
    model="fake-model-name",
    context_length=2048, # customize if needed for your model
    api_base="http://localhost:8000" # your proxy server url
),

Credit to @vividfog for this tutorial.

note

Contribute: Using this server with a project? Contribute your tutorial here!

Advanced

Save API Keys

$ litellm --api_key OPENAI_API_KEY=sk-...

LiteLLM will save this to a locally stored config file and persist it across sessions.

The LiteLLM proxy supports API keys for all LiteLLM-supported providers. To add a key for a specific provider, check this list:

$ litellm --add_key HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]

Create a proxy for multiple LLMs

$ litellm

#INFO: litellm proxy running on http://0.0.0.0:8000

Send a request to your proxy

import openai 

openai.api_key = "any-string-here"
openai.api_base = "http://0.0.0.0:8000" # your proxy url

# call gpt-3.5-turbo
response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hey"}])

print(response)

# call ollama/llama2
response = openai.ChatCompletion.create(model="ollama/llama2", messages=[{"role": "user", "content": "Hey"}])

print(response)
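
You can also stream responses through the proxy using the standard OpenAI streaming interface. A minimal sketch, assuming the pre-1.0 openai Python SDK and a backing model/provider that supports streaming:

import openai

openai.api_key = "any-string-here"
openai.api_base = "http://0.0.0.0:8000" # your proxy url

# stream tokens as they arrive instead of waiting for the full response
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hey"}],
    stream=True,
)

for chunk in response:
    # each chunk carries a partial "delta"; not every chunk contains content
    content = chunk["choices"][0].get("delta", {}).get("content")
    if content:
        print(content, end="", flush=True)
print()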

Logs

$ litellm --logs

This will return the most recent log (the call that went to the LLM API + the received response).

LiteLLM Proxy will also save your logs to a file called api_logs.json in the current directory.
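
To inspect the log file programmatically, something like the following works. This is a sketch that assumes api_logs.json is a single JSON object keyed by timestamp; check the file on disk, since the exact layout may differ between versions:

import json

# load the proxy's log file from the current directory
with open("api_logs.json") as f:
    logs = json.load(f)

# pretty-print the most recent entry (assumes keys sort chronologically)
latest_key = sorted(logs.keys())[-1]
print(json.dumps(logs[latest_key], indent=2))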

Configure Proxy

If you need to:

  • save API keys
  • set litellm params (e.g. drop unmapped params, set fallback models, etc.)
  • set model-specific params (max tokens, temperature, api base, prompt template)

You can set these just for that session (via the CLI), or persist them across restarts (via the config file).

E.g.: Set api base, max tokens and temperature.

For that session:

litellm --model ollama/llama2 \
--api_base http://localhost:11434 \
--max_tokens 250 \
--temperature 0.5

# OpenAI-compatible server running on http://0.0.0.0:8000

Across restarts:
Create a file called litellm_config.toml and paste the following into it:

[model."ollama/llama2"] # run via `litellm --model ollama/llama2`
max_tokens = 250 # set max tokens for the model
temperature = 0.5 # set temperature for the model
api_base = "http://localhost:11434" # set a custom api base for the model

Save it to the proxy with:

$ litellm --config -f ./litellm_config.toml 

LiteLLM will save a copy of this file in its package, so it can persist these settings across restarts.

Complete Config File

### API KEYS ### 
[keys]
# HUGGINGFACE_API_KEY="" # Uncomment to save your Hugging Face API key
# OPENAI_API_KEY="" # Uncomment to save your OpenAI API Key
# TOGETHERAI_API_KEY="" # Uncomment to save your TogetherAI API key
# NLP_CLOUD_API_KEY="" # Uncomment to save your NLP Cloud API key
# ANTHROPIC_API_KEY="" # Uncomment to save your Anthropic API key
# REPLICATE_API_KEY="" # Uncomment to save your Replicate API key
# AWS_ACCESS_KEY_ID = "" # Uncomment to save your Bedrock/Sagemaker access keys
# AWS_SECRET_ACCESS_KEY = "" # Uncomment to save your Bedrock/Sagemaker access keys

### LITELLM PARAMS ###
[general]
# add_function_to_prompt = True # e.g: Ollama doesn't support functions, so add it to the prompt instead
# drop_params = True # drop any params not supported by the provider (e.g. Ollama)
# default_model = "gpt-4" # route all requests to this model
# fallbacks = ["gpt-3.5-turbo", "gpt-3.5-turbo-16k"] # models to fall back to if the completion call fails (remember: add relevant keys)

### MODEL PARAMS ###
[model."ollama/llama2"] # run via `litellm --model ollama/llama2`
# max_tokens = "" # set max tokens for the model
# temperature = "" # set temperature for the model
# api_base = "" # set a custom api base for the model

[model."ollama/llama2".prompt_template] # [OPTIONAL] LiteLLM can automatically formats the prompt - docs: https://docs.litellm.ai/docs/completion/prompt_formatting
# MODEL_SYSTEM_MESSAGE_START_TOKEN = "[INST] <<SYS>>\n" # This does not need to be a token, can be any string
# MODEL_SYSTEM_MESSAGE_END_TOKEN = "\n<</SYS>>\n [/INST]\n" # This does not need to be a token, can be any string

# MODEL_USER_MESSAGE_START_TOKEN = "[INST] " # This does not need to be a token, can be any string
# MODEL_USER_MESSAGE_END_TOKEN = " [/INST]\n" # Applies only to user messages. Can be any string.

# MODEL_ASSISTANT_MESSAGE_START_TOKEN = "" # Applies only to assistant messages. Can be any string.
# MODEL_ASSISTANT_MESSAGE_END_TOKEN = "\n" # Applies only to assistant messages. Can be any string.

# MODEL_PRE_PROMPT = "You are a good bot" # Applied at the start of the prompt
# MODEL_POST_PROMPT = "Now answer as best as you can" # Applied at the end of the prompt
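
To make the prompt template settings above concrete, here is an illustrative Python sketch of how the per-role start/end tokens plus the pre/post prompts could be stitched into a single prompt string. This is not LiteLLM's actual formatting code (see the prompt formatting docs linked above for the real behavior); it only shows the idea, using Llama-2-style tokens:

# illustrative only: combine per-role start/end tokens and pre/post prompts
template = {
    "system": ("[INST] <<SYS>>\n", "\n<</SYS>>\n [/INST]\n"),
    "user": ("[INST] ", " [/INST]\n"),
    "assistant": ("", "\n"),
}
pre_prompt = "You are a good bot"
post_prompt = "Now answer as best as you can"

def format_prompt(messages):
    parts = [pre_prompt]
    for message in messages:
        start, end = template[message["role"]]
        parts.append(f"{start}{message['content']}{end}")
    parts.append(post_prompt)
    return "".join(parts)

print(format_prompt([
    {"role": "system", "content": "Answer in one sentence."},
    {"role": "user", "content": "What is LiteLLM?"},
]))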

🔥 [Tutorial] modify a model prompt on the proxy

Clone Proxy

To create a local instance of the proxy, run:

$ litellm --create_proxy

This will create a local project called litellm-proxy in your current directory, which contains:

  • proxy_cli.py: Runs the proxy
  • proxy_server.py: Contains the API calling logic
    • /chat/completions: receives openai.ChatCompletion.create call.
    • /completions: receives openai.Completion.create call.
    • /models: receives openai.Model.list() call
  • secrets.toml: Stores your api keys, model configs, etc.

Run it by doing:

$ cd litellm-proxy
$ python proxy_cli.py --model ollama/llama # replace with your model name
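
Once the cloned proxy is running, you can exercise each of the routes listed above with the pre-1.0 openai SDK. A sketch, assuming the default port 8000 and that you replace the model name with whatever you passed to proxy_cli.py:

import openai

openai.api_key = "any-string-here" # ignored by the local proxy
openai.api_base = "http://0.0.0.0:8000" # url printed when proxy_cli.py starts

# /chat/completions
print(openai.ChatCompletion.create(
    model="ollama/llama2",
    messages=[{"role": "user", "content": "Hey!"}],
))

# /completions
print(openai.Completion.create(model="ollama/llama2", prompt="Hey!"))

# /models
print(openai.Model.list())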

Tracking costs

By default, the litellm proxy writes cost logs to litellm/proxy/costs.json.

How can the proxy be better? Let us know here

{
  "Oct-12-2023": {
    "claude-2": {
      "cost": 0.02365918,
      "num_requests": 1
    }
  }
}

You can view costs on the CLI using:

litellm --cost
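
If you prefer to work with the raw file, the per-day, per-model layout shown above is easy to aggregate. A small sketch, assuming the default litellm/proxy/costs.json path and the structure shown above:

import json

# aggregate total cost and request count per day from the proxy's cost log
with open("litellm/proxy/costs.json") as f:
    costs = json.load(f)

for day, models in costs.items():
    total_cost = sum(entry["cost"] for entry in models.values())
    total_requests = sum(entry["num_requests"] for entry in models.values())
    print(f"{day}: ${total_cost:.4f} across {total_requests} request(s)")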

Deploy Proxy

Use this to deploy an OpenAI-compatible server for local models, powered by Ollama.

It works for models like Mistral, Llama2, CodeLlama, etc. (any model supported by Ollama).

Usage

docker run --name ollama litellm/ollama

More details 👉 https://hub.docker.com/r/litellm/ollama

Support / Talk with founders