| name | process_resume_skill |
| description | Process candidate resume PDFs to extract Name, Gender, Nationality and URR status |
Process Candidate Resumes
This skill processes candidate resume PDFs and extracts key information using an LLM.
What it does:
- Processes PDF resumes directly using Gemini API (handles both text-based and image-based PDFs)
- Uses Gemini 2.0 Flash to extract: Name, Gender, Country of Nationality, and URR status
- Saves results to a CSV file
Usage:
Basic Usage
Run the process script with default settings:
python .claude/skills/process_resume_skill/scripts/process_resume.py
With Custom Options
conda activate agent
python .claude/skills/process_resume_skill/scripts/process_resume.py \
--pdf_folder /path/to/resume/folder \
--output_file /path/to/output.csv \
--model gemini-2.0-flash-exp \
--skip_test
Command-line Arguments:
--pdf_folder: Path to folder containing PDF resumes (default:/ephemeral/home/xiong/data/Fund/Resumes/current)--output_file: Path to output CSV file (default:candidates_info.csvin pdf_folder)--model: Gemini model to use (default:gemini-2.0-flash-exp)--skip_test: Skip the Gemini API connection unit test
Key Functions:
The script provides several functions that can be used programmatically:
process_single_resume(pdf_path: str, client: genai.Client, model: str) -> Dict[str, str]: Process a single resume and extract candidate information using Gemini APIprocess_resumes(pdf_folder: str, client: genai.Client, model: str) -> pd.DataFrame: Process all resumes in a folder and return a DataFrameparse_json_response(response_text: str) -> Dict[str, str]: Parse JSON from Gemini response, handling markdown code blocksprint_statistics(df: pd.DataFrame) -> None: Print summary statistics about processed candidatesunit_test(client: genai.Client, model: str) -> bool: Test the Gemini API connectionget_extraction_prompt() -> str: Get the prompt for information extraction
Configuration:
- Requires
.envfile in project root withGOOGLE_API_KEY - Uses Google Gemini API with direct PDF processing (no text extraction needed)
- Handles both text-based and image-based (scanned) PDFs automatically
- Handles errors gracefully by marking them as "error" in the output CSV
Output:
The script generates a CSV file with the following columns:
- Name: Candidate's full name
- Gender: Male/Female (inferred from name and experience)
- Nationality: Country of nationality
- URR: "yes" or "no" indicating if from Under-Represented Region
URR Countries:
The script identifies candidates from Under-Represented Regions (URR) based on a predefined list including: Afghanistan, Algeria, Angola, Bahrain, Benin, Botswana, Brunei Darussalam, Burkina Faso, Cabo Verde, Cambodia, Cameroon, Central African Republic, Chad, China, Comoros, Côte d'Ivoire, Democratic Republic of the Congo, Djibouti, Egypt, and many others (see full list in process_resume.py:33-40).
Statistics:
After processing, the script displays:
- Total candidates processed
- Gender distribution (Female/Male counts and percentages)
- URR vs Non-URR distribution
- Processing errors (if any)