# LLM Basics

Master the fundamentals of Large Language Models.
## Quick Start

### Using OpenAI API

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain transformers briefly."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
```
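The response object also reports how many tokens the request consumed, which is useful for tracking cost. A minimal sketch, assuming the `response` from the snippet above and the v1.x Python SDK's field names:

```python
# Token accounting returned alongside the completion
usage = response.usage
print(f"prompt tokens:     {usage.prompt_tokens}")
print(f"completion tokens: {usage.completion_tokens}")
print(f"total tokens:      {usage.total_tokens}")
```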
### Using Hugging Face

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Hello, how are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```
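By default `generate` decodes greedily; sampling parameters only take effect once sampling is enabled. A short sketch building on the model and tokenizer above (the parameter values are illustrative):

```python
# Sampled generation: do_sample=True activates temperature/top_p/top_k
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,   # switch from greedy decoding to sampling
    temperature=0.7,  # soften the next-token distribution
    top_p=0.9,        # nucleus sampling
    top_k=50,         # keep only the 50 most likely tokens
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```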
## Core Concepts

### Transformer Architecture

```
Input → Embedding → [N × Transformer Block] → Output

Transformer Block:
┌───────────────────────────┐
│ Multi-Head Self-Attention │
├───────────────────────────┤
│ Layer Normalization       │
├───────────────────────────┤
│ Feed-Forward Network      │
├───────────────────────────┤
│ Layer Normalization       │
└───────────────────────────┘
```
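Multi-head self-attention is built from scaled dot-product attention. A minimal single-head sketch in PyTorch, with no masking and illustrative names:

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (seq, seq) similarity matrix
    weights = F.softmax(scores, dim=-1)             # attention weights sum to 1 per query
    return weights @ v                              # weighted sum of value vectors

# Toy example: sequence of 4 tokens, model dimension 8
x = torch.randn(4, 8)
out = attention(x, x, x)  # self-attention: Q = K = V come from the same sequence
print(out.shape)          # torch.Size([4, 8])
```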
### Tokenization

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Hello, world!"

# Encode
tokens = tokenizer.encode(text)
print(tokens)  # [15496, 11, 995, 0]

# Decode
decoded = tokenizer.decode(tokens)
print(decoded)  # "Hello, world!"
```
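For OpenAI models, token counts (and therefore cost) can be estimated locally with the `tiktoken` library, an extra dependency not used elsewhere in this guide:

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4 and GPT-3.5-Turbo
enc = tiktoken.get_encoding("cl100k_base")
n_tokens = len(enc.encode("Explain transformers briefly."))
print(n_tokens)
```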
### Key Parameters

```python
# Generation parameters
params = {
    'temperature': 0.7,      # Randomness (0-2)
    'max_tokens': 1000,      # Output length limit
    'top_p': 0.9,            # Nucleus sampling
    'top_k': 50,             # Top-k sampling
    'frequency_penalty': 0,  # Reduce repetition
    'presence_penalty': 0    # Encourage new topics
}
```
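Temperature rescales the model's logits before sampling: values below 1 sharpen the distribution, values above 1 flatten it. A small NumPy illustration with made-up logits:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = np.asarray(logits) / temperature
    exp = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5, 0.1]                 # hypothetical next-token logits
print(softmax_with_temperature(logits, 0.5))  # peaky: strongly favors the top token
print(softmax_with_temperature(logits, 1.0))  # unchanged distribution
print(softmax_with_temperature(logits, 1.5))  # flatter: more diverse sampling
```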
Model Comparison
| Model |
Parameters |
Context |
Best For |
| GPT-4 |
~1.7T |
128K |
Complex reasoning |
| GPT-3.5 |
175B |
16K |
General tasks |
| Claude 3 |
N/A |
200K |
Long context |
| Llama 2 |
7-70B |
4K |
Open source |
| Mistral 7B |
7B |
32K |
Efficient inference |
## Local Inference

### With Ollama

```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Run a model
ollama run llama2

# API usage
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'
```
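The same endpoint can be called from Python. A minimal sketch using `requests`, assuming Ollama is running locally on its default port; setting `"stream": false` returns a single JSON object instead of streamed chunks:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "Why is the sky blue?",
        "stream": False,  # return one JSON object instead of a stream of chunks
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])  # generated text
```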
### With vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling = SamplingParams(temperature=0.8, max_tokens=100)
outputs = llm.generate(["Hello, my name is"], sampling)

# Each result holds the prompt and its generated completions
for output in outputs:
    print(output.outputs[0].text)
```
## Best Practices

- **Start simple**: Use an API before local deployment
- **Mind context**: Stay within context window limits
- **Temperature tuning**: Lower for facts, higher for creativity
- **Token efficiency**: Shorter prompts = lower costs
- **Streaming**: Use for better UX in applications (see the sketch after this list)
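Streaming lets the UI render tokens as they arrive instead of waiting for the full completion. A minimal sketch with the OpenAI SDK (`stream=True`); chunk handling follows the v1.x SDK's delta format:

```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain transformers briefly."}],
    stream=True,  # yield partial chunks as they are generated
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk carries no content
        print(delta, end="", flush=True)
print()
```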
## Error Handling & Retry

```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(prompt: str) -> str:
    response = client.chat.completions.create(...)  # fill in model/messages as in Quick Start
    return response.choices[0].message.content
```
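To avoid retrying requests that can never succeed (for example, malformed ones), `tenacity` can be restricted to the exceptions worth retrying. A hedged sketch, assuming the OpenAI v1.x SDK's `RateLimitError` and `APITimeoutError` exception classes and the `client` from the Quick Start:

```python
import openai
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type((openai.RateLimitError, openai.APITimeoutError)),
    stop=stop_after_attempt(5),
    wait=wait_exponential(min=1, max=30),  # 1s, 2s, 4s, ... capped at 30s
)
def call_llm_safely(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```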
## Troubleshooting

| Symptom           | Cause             | Solution                |
|-------------------|-------------------|-------------------------|
| Rate limit errors | Too many requests | Add exponential backoff |
| Empty response    | max_tokens=0      | Check parameter values  |
| High latency      | Large model       | Use smaller model       |
| Timeout           | Prompt too long   | Reduce input size       |
## Unit Test Template

```python
def test_llm_completion():
    response = call_llm("Hello")
    assert response is not None
    assert len(response) > 0
```
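Calling a live API from unit tests is slow, costly, and flaky; the client can be swapped for a mock instead. A sketch using `unittest.mock`, assuming a hypothetical `my_app` module that exposes the `client` and a `call_llm` helper returning the message content:

```python
from unittest.mock import MagicMock, patch

import my_app  # hypothetical module wrapping the OpenAI client

def test_llm_completion_mocked():
    fake_response = MagicMock()
    fake_response.choices[0].message.content = "Hi there!"

    # Replace the real client so no network call is made
    with patch.object(my_app, "client") as mock_client:
        mock_client.chat.completions.create.return_value = fake_response
        assert my_app.call_llm("Hello") == "Hi there!"
```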