OpenAI Proxy Server
A local, fast, and lightweight OpenAI-compatible server to call 100+ LLM APIs.
Usage
pip install litellm
$ litellm --model ollama/codellama
#INFO: Ollama running on http://0.0.0.0:8000
Test
In a new shell, run:
$ litellm --test
Replace openai base
import openai
openai.api_base = "http://0.0.0.0:8000"
print(openai.ChatCompletion.create(model="test", messages=[{"role":"user", "content":"Hey!"}]))
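Streaming also works through the proxy. A minimal check, assuming the proxy started above on port 8000 and the same pre-1.0 openai SDK used in the snippet above:
import openai
openai.api_base = "http://0.0.0.0:8000"
openai.api_key = "anything"  # the proxy doesn't need a real key
# stream=True yields chunks as the model generates them
for chunk in openai.ChatCompletion.create(
    model="test",
    messages=[{"role": "user", "content": "Hey!"}],
    stream=True,
):
    print(chunk)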
Other supported models:

VLLM
$ litellm --model vllm/facebook/opt-125m

OpenAI Compatible Server
$ litellm --model openai/<model_name> --api_base <your-api-base>

Huggingface
$ export HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]
$ litellm --model huggingface/bigcode/starcoder

Anthropic
$ export ANTHROPIC_API_KEY=my-api-key
$ litellm --model claude-instant-1

TogetherAI
$ export TOGETHERAI_API_KEY=my-api-key
$ litellm --model together_ai/lmsys/vicuna-13b-v1.5-16k

Replicate
$ export REPLICATE_API_KEY=my-api-key
$ litellm \
  --model replicate/meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3

Petals
$ litellm --model petals/meta-llama/Llama-2-70b-chat-hf

Palm
$ export PALM_API_KEY=my-palm-key
$ litellm --model palm/chat-bison

Azure OpenAI
$ export AZURE_API_KEY=my-api-key
$ export AZURE_API_BASE=my-api-base
$ litellm --model azure/my-deployment-name

AI21
$ export AI21_API_KEY=my-api-key
$ litellm --model j2-light

Cohere
$ export COHERE_API_KEY=my-api-key
$ litellm --model command-nightly
[Tutorial]: Use with Continue-Dev/Aider/AutoGen/Langroid/etc.
Here's how to use the proxy to test codellama/mistral/etc. models across different GitHub repos:
pip install litellm
$ ollama pull codellama # our local CodeLlama
$ litellm --model ollama/codellama --temperature 0.3 --max_tokens 2048
Implementation for different repos
- ContinueDev
- Aider
- AutoGen
- Langroid
- GPT-Pilot
- guidance
ContinueDev
Continue-Dev brings ChatGPT to VSCode. See how to install it here.
In config.py, set this as your default model:
default=OpenAI(
api_key="IGNORED",
model="fake-model-name",
context_length=2048, # customize if needed for your model
api_base="http://localhost:8000" # your proxy server url
),
Credits @vividfog for this tutorial.
Aider
$ pip install aider-chat
$ aider --openai-api-base http://0.0.0.0:8000 --openai-api-key fake-key
AutoGen
pip install pyautogen
from autogen import AssistantAgent, UserProxyAgent, oai
config_list=[
{
"model": "my-fake-model",
"api_base": "http://localhost:8000", #litellm compatible endpoint
"api_type": "open_ai",
"api_key": "NULL", # just a placeholder
}
]
response = oai.Completion.create(config_list=config_list, prompt="Hi")
print(response) # works fine
llm_config={
"config_list": config_list,
}
assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent("user_proxy")
user_proxy.initiate_chat(assistant, message="Plot a chart of META and TESLA stock price change YTD.", config_list=config_list)
Credits @victordibia for this tutorial.
Langroid
pip install langroid
from langroid.language_models.openai_gpt import OpenAIGPTConfig, OpenAIGPT
# configure the LLM
my_llm_config = OpenAIGPTConfig(
# format: "local/[URL where LiteLLM proxy is listening]"
chat_model="local/localhost:8000",
chat_context_length=2048, # adjust based on model
)
# create llm, one-off interaction
llm = OpenAIGPT(my_llm_config)
response = llm.chat("What is the capital of China?", max_tokens=50)
# Create an Agent with this LLM, wrap it in a Task, and
# run it as an interactive chat app:
from langroid.agent.base import ChatAgent, ChatAgentConfig
from langroid.agent.task import Task
agent_config = ChatAgentConfig(llm=my_llm_config, name="my-llm-agent")
agent = ChatAgent(agent_config)
task = Task(agent, name="my-llm-task")
task.run()
Credits @pchalasani and Langroid for this tutorial.
GPT-Pilot
In your .env, set the OpenAI endpoint to your local server:
OPENAI_ENDPOINT=http://0.0.0.0:8000
OPENAI_API_KEY=my-fake-key
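To sanity-check that the endpoint GPT-Pilot will use is live, you can point the same pre-1.0 openai client at those values; a minimal sketch (OPENAI_ENDPOINT and OPENAI_API_KEY are the variable names from the .env above, mapped onto the client here):
import os
import openai
# read the same values GPT-Pilot reads from .env
openai.api_base = os.environ.get("OPENAI_ENDPOINT", "http://0.0.0.0:8000")
openai.api_key = os.environ.get("OPENAI_API_KEY", "my-fake-key")
print(openai.ChatCompletion.create(
    model="ollama/codellama",
    messages=[{"role": "user", "content": "ping"}],
))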
guidance
NOTE: Guidance sends additional params like stop_sequences, which can cause some models to fail if they don't support them.
Fix: Start your proxy using the --drop_params flag:
litellm --model ollama/codellama --temperature 0.3 --max_tokens 2048 --drop_params
import guidance
# set api_base to your proxy
# set api_key to anything
gpt4 = guidance.llms.OpenAI("gpt-4", api_base="http://0.0.0.0:8000", api_key="anything")
experts = guidance('''
{{#system~}}
You are a helpful and terse assistant.
{{~/system}}
{{#user~}}
I want a response to the following question:
{{query}}
Name 3 world-class experts (past or present) who would be great at answering this?
Don't answer the question yet.
{{~/user}}
{{#assistant~}}
{{gen 'expert_names' temperature=0 max_tokens=300}}
{{~/assistant}}
''', llm=gpt4)
result = experts(query='How can I be more productive?')
print(result)
Contribute
Using this server with a project? Contribute your tutorial here!
Advanced
Save API Keys
$ litellm --api_key OPENAI_API_KEY=sk-...
LiteLLM will save this to a locally stored config file, and persist this across sessions.
The LiteLLM proxy supports all LiteLLM-supported API keys. To add keys for a specific provider, check this list:
Huggingface
$ litellm --add_key HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]

Anthropic
$ litellm --add_key ANTHROPIC_API_KEY=my-api-key

TogetherAI
$ litellm --add_key TOGETHERAI_API_KEY=my-api-key

Replicate
$ litellm --add_key REPLICATE_API_KEY=my-api-key

Bedrock
$ litellm --add_key AWS_ACCESS_KEY_ID=my-key-id
$ litellm --add_key AWS_SECRET_ACCESS_KEY=my-secret-access-key

Palm
$ litellm --add_key PALM_API_KEY=my-palm-key

Azure OpenAI
$ litellm --add_key AZURE_API_KEY=my-api-key
$ litellm --add_key AZURE_API_BASE=my-api-base

AI21
$ litellm --add_key AI21_API_KEY=my-api-key

Cohere
$ litellm --add_key COHERE_API_KEY=my-api-key
Create a proxy for multiple LLMs
$ litellm
#INFO: litellm proxy running on http://0.0.0.0:8000
Send a request to your proxy
import openai
openai.api_key = "any-string-here"
openai.api_base = "http://0.0.0.0:8000" # your proxy url
# call gpt-3.5-turbo
response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hey"}])
print(response)
# call ollama/llama2
response = openai.ChatCompletion.create(model="ollama/llama2", messages=[{"role": "user", "content": "Hey"}])
print(response)
Logs
$ litellm --logs
This will return the most recent log (the call that went to the LLM API + the received response).
LiteLLM Proxy will also save your logs to a file called api_logs.json
in the current directory.
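If you want to inspect the log file programmatically, here is a minimal sketch. It assumes api_logs.json parses as a single JSON object keyed per request, which may differ across litellm versions, so check the file yours actually writes:
import json
with open("api_logs.json") as f:
    logs = json.load(f)
print(f"{len(logs)} logged calls")
# print the most recent entry (assumes insertion-ordered keys)
for key in list(logs)[-1:]:
    print(key, logs[key])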
Configure Proxy
If you need to:
- save API keys
- set litellm params (e.g. drop unmapped params, set fallback models, etc.)
- set model-specific params (max tokens, temperature, api base, prompt template)
You can set these just for that session (via the CLI), or persist them across restarts (via a config file).
E.g.: set the api base, max tokens, and temperature.
For that session:
litellm --model ollama/llama2 \
--api_base http://localhost:11434 \
--max_tokens 250 \
--temperature 0.5
# OpenAI-compatible server running on http://0.0.0.0:8000
Across restarts:
Create a file called litellm_config.toml
and paste this in there:
[model."ollama/llama2"] # run via `litellm --model ollama/llama2`
max_tokens = 250 # set max tokens for the model
temperature = 0.5 # set temperature for the model
api_base = "http://localhost:11434" # set a custom api base for the model
Save it to the proxy with:
$ litellm --config -f ./litellm_config.toml
LiteLLM will save a copy of this file in its package, so it can persist these settings across restarts.
Complete Config File
### API KEYS ###
[keys]
# HUGGINGFACE_API_KEY="" # Uncomment to save your Hugging Face API key
# OPENAI_API_KEY="" # Uncomment to save your OpenAI API Key
# TOGETHERAI_API_KEY="" # Uncomment to save your TogetherAI API key
# NLP_CLOUD_API_KEY="" # Uncomment to save your NLP Cloud API key
# ANTHROPIC_API_KEY="" # Uncomment to save your Anthropic API key
# REPLICATE_API_KEY="" # Uncomment to save your Replicate API key
# AWS_ACCESS_KEY_ID = "" # Uncomment to save your Bedrock/Sagemaker access keys
# AWS_SECRET_ACCESS_KEY = "" # Uncomment to save your Bedrock/Sagemaker access keys
### LITELLM PARAMS ###
[general]
# add_function_to_prompt = True # e.g: Ollama doesn't support functions, so add it to the prompt instead
# drop_params = True # drop any params not supported by the provider (e.g. Ollama)
# default_model = "gpt-4" # route all requests to this model
# fallbacks = ["gpt-3.5-turbo", "gpt-3.5-turbo-16k"] # models you want to fallback to in case completion call fails (remember: add relevant keys)
### MODEL PARAMS ###
[model."ollama/llama2"] # run via `litellm --model ollama/llama2`
# max_tokens = "" # set max tokens for the model
# temperature = "" # set temperature for the model
# api_base = "" # set a custom api base for the model
[model."ollama/llama2".prompt_template] # [OPTIONAL] LiteLLM can automatically format the prompt - docs: https://docs.litellm.ai/docs/completion/prompt_formatting
# MODEL_SYSTEM_MESSAGE_START_TOKEN = "[INST] <<SYS>>\n" # This does not need to be a token, can be any string
# MODEL_SYSTEM_MESSAGE_END_TOKEN = "\n<</SYS>>\n [/INST]\n" # This does not need to be a token, can be any string
# MODEL_USER_MESSAGE_START_TOKEN = "[INST] " # This does not need to be a token, can be any string
# MODEL_USER_MESSAGE_END_TOKEN = " [/INST]\n" # Applies only to user messages. Can be any string.
# MODEL_ASSISTANT_MESSAGE_START_TOKEN = "" # Applies only to assistant messages. Can be any string.
# MODEL_ASSISTANT_MESSAGE_END_TOKEN = "\n" # Applies only to assistant messages. Can be any string.
# MODEL_PRE_PROMPT = "You are a good bot" # Applied at the start of the prompt
# MODEL_POST_PROMPT = "Now answer as best as you can" # Applied at the end of the prompt
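For intuition, here is a purely illustrative sketch of how the start/end strings above would wrap a chat conversation into a single Llama-2-style prompt; it mirrors the idea behind these settings, not litellm's actual implementation (see the prompt formatting docs linked above):
# Illustration only: wrap each message with its role's start/end strings
SYSTEM_START, SYSTEM_END = "[INST] <<SYS>>\n", "\n<</SYS>>\n [/INST]\n"
USER_START, USER_END = "[INST] ", " [/INST]\n"
ASSISTANT_START, ASSISTANT_END = "", "\n"

def format_prompt(messages):
    wrappers = {
        "system": (SYSTEM_START, SYSTEM_END),
        "user": (USER_START, USER_END),
        "assistant": (ASSISTANT_START, ASSISTANT_END),
    }
    out = "You are a good bot"  # MODEL_PRE_PROMPT
    for m in messages:
        start, end = wrappers[m["role"]]
        out += start + m["content"] + end
    return out + "Now answer as best as you can"  # MODEL_POST_PROMPT

print(format_prompt([
    {"role": "system", "content": "Always answer in haiku."},
    {"role": "user", "content": "Hey"},
]))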
🔥 [Tutorial] modify a model prompt on the proxy
Clone Proxy
To create a local instance of the proxy run:
$ litellm --create_proxy
This will create a local project called litellm-proxy
in your current directory, that has:
- proxy_cli.py: Runs the proxy
- proxy_server.py: Contains the API calling logic
  - /chat/completions: receives openai.ChatCompletion.create calls
  - /completions: receives openai.Completion.create calls
  - /models: receives openai.Model.list() calls
- secrets.toml: Stores your api keys, model configs, etc.
Run it by doing:
$ cd litellm-proxy
$ python proxy_cli.py --model ollama/llama # replace with your model name
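Once it's running, the endpoints listed above accept the same pre-1.0 openai client calls used earlier. A minimal sketch, assuming the cloned proxy listens on the default port 8000:
import openai
openai.api_key = "any-string-here"
openai.api_base = "http://0.0.0.0:8000"
print(openai.Model.list())  # handled by /models
print(openai.ChatCompletion.create(  # handled by /chat/completions
    model="ollama/llama",
    messages=[{"role": "user", "content": "Hey"}],
))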
Tracking costs
By default, the LiteLLM proxy writes cost logs to litellm/proxy/costs.json.
(How can the proxy be better? Let us know here.)
{
"Oct-12-2023": {
"claude-2": {
"cost": 0.02365918,
"num_requests": 1
}
}
}
You can view costs on the CLI using:
litellm --cost
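If you'd rather total the spend in code, here is a short sketch based on the costs.json structure shown above (adjust the path if your setup differs):
import json
with open("litellm/proxy/costs.json") as f:
    costs = json.load(f)
# costs.json is keyed by date, then by model, as in the example above
total = sum(entry["cost"] for day in costs.values() for entry in day.values())
print(f"Total spend: ${total:.6f}")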
Deploy Proxy
- Ollama/OpenAI Docker
- Self-Hosted
- LiteLLM-Hosted
Ollama/OpenAI Docker
It works for models like Mistral, Llama2, CodeLlama, etc. (any model supported by Ollama).
Usage:
docker run --name ollama litellm/ollama
More details 👉 https://hub.docker.com/r/litellm/ollama
Self-Hosted
Step 1: Clone the repo
git clone https://github.com/BerriAI/liteLLM-proxy.git
Step 2: Put your API keys in .env
Copy the .env.template and put in the relevant keys (e.g. OPENAI_API_KEY="sk-..").
Step 3: Test your proxy
Start your proxy server:
cd litellm-proxy && python3 main.py
Make your first call
import openai
openai.api_key = "sk-litellm-master-key"
openai.api_base = "http://0.0.0.0:8080"
response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hey"}])
print(response)
LiteLLM-Hosted
Deploy the proxy to https://api.litellm.ai:
$ export ANTHROPIC_API_KEY=sk-ant-api03-1..
$ litellm --model claude-instant-1 --deploy
#INFO: Uvicorn running on https://api.litellm.ai/44508ad4
This will host a ChatCompletions API at: https://api.litellm.ai/44508ad4
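You can then point the same pre-1.0 openai client at the URL returned by --deploy; a minimal sketch (the /44508ad4 suffix is just the example ID printed above, use your own):
import openai
openai.api_key = "any-string-here"
openai.api_base = "https://api.litellm.ai/44508ad4"  # use the URL printed by --deploy
print(openai.ChatCompletion.create(
    model="claude-instant-1",
    messages=[{"role": "user", "content": "Hey"}],
))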
Support / talk with founders
- Schedule Demo 👋
- Community Discord 💭
- Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai