Autopilot for Data Science
Data science competitions and real-world projects involve repetitive tasks: exploring data, setting up baselines, tuning hyperparameters, and logging results. Managing these experiments while keeping track of code versions and insights can be overwhelming. DataPilot solves this by acting as an intelligent pair programmer that can plan, execute, and log experiments autonomously.
The project is designed as a modular agent framework:
config.py: Centralized configuration for API keys, model parameters, and paths.system_instructions.md: The "brain" of the agent, defining its persona and capabilities.tools.py: A suite of safe, sandboxed functions for data loading, code execution, and logging.agent_core.py: The main logic that orchestrates the ReAct (Reasoning + Acting) loop.agent_main.py: Entry point for CLI and Notebook usage.
-
Install Dependencies:
pip install -r requirements.txt
-
Configure API Key: Set your Gemini API key as an environment variable:
export GEMINI_API_KEY="your_api_key_here"
In Kaggle, use the "Secrets" add-on to securely store
GEMINI_API_KEY.
Import the runner function and start a session:
from agent_main import run_agent_session
# Define your goal
goal = "Load the Titanic dataset, clean missing values, and train a Random Forest classifier."
# Run the agent
run_agent_session(goal=goal, data_path="/kaggle/input/titanic")python agent_main.py --goal "Analyze the housing dataset" --offline(Use --offline to test without an API key using mock responses)
DataPilot follows a ReAct pattern:
- Thought: It analyzes the user request and plans the next step.
- Action: It calls a specific tool (e.g.,
load_dataorexecute_code). - Observation: It reads the tool's output (e.g., dataframe columns or model accuracy).
- Iteration: It repeats this cycle until the goal is met.
- Sandboxed Execution: Code is executed in a restricted environment, but users should still review generated code.
- No Internet Access (Default): The agent relies on local libraries and data.
- Hallucinations: Like all LLMs, it may occasionally generate incorrect code or facts. Always verify important results.
This project demonstrates:
- Real-world relevance: Directly aids the core activity of Kaggle users.
- Modular Design: Clean separation of concerns.
- Reproducibility: Structured logging ensures experiments are trackable.