DataPilot

Autopilot for Data Science

Problem Statement

Data science competitions and real-world projects involve repetitive tasks: exploring data, setting up baselines, tuning hyperparameters, and logging results. Managing these experiments while keeping track of code versions and insights can be overwhelming. DataPilot addresses this by acting as an intelligent pair programmer that plans, executes, and logs experiments autonomously.

Architecture

The project is designed as a modular agent framework:

config.py: Centralized configuration for API keys, model parameters, and paths.
system_instructions.md: The "brain" of the agent, defining its persona and capabilities.
tools.py: A suite of safe, sandboxed functions for data loading, code execution, and logging.
agent_core.py: The main logic that orchestrates the ReAct (Reasoning + Acting) loop.
agent_main.py: Entry point for CLI and Notebook usage.

Setup

Install Dependencies:
```
pip install -r requirements.txt
```
Configure API Key: Set your Gemini API key as an environment variable:
```
export GEMINI_API_KEY="your_api_key_here"
```
In Kaggle, use the "Secrets" add-on to securely store GEMINI_API_KEY.
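
The two options above can be combined into a single lookup. The following is a minimal sketch, not DataPilot's own code: the helper name `load_api_key` is hypothetical, and it assumes the `kaggle_secrets` module that Kaggle provides inside notebook sessions.

```python
import os


def load_api_key(name="GEMINI_API_KEY"):
    """Return the key from the environment, falling back to Kaggle Secrets."""
    key = os.environ.get(name)
    if key:
        return key
    try:
        # Only importable inside a Kaggle notebook session.
        from kaggle_secrets import UserSecretsClient
    except ImportError:
        return None  # Not on Kaggle and no environment variable set.
    return UserSecretsClient().get_secret(name)
```

Checking the environment variable first keeps local and CLI runs independent of Kaggle; a `None` result means neither source provided a key.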

Usage

In a Kaggle Notebook

Import the runner function and start a session:

```
from agent_main import run_agent_session

# Define your goal
goal = "Load the Titanic dataset, clean missing values, and train a Random Forest classifier."

# Run the agent
run_agent_session(goal=goal, data_path="/kaggle/input/titanic")
```

From Command Line

```
python agent_main.py --goal "Analyze the housing dataset" --offline
```

Use --offline to test without an API key; the agent then runs on mock responses.
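
For readers curious how such flags are typically wired up, here is a small sketch using `argparse`. It is an assumption about how `--goal` and `--offline` could be parsed; the actual parsing in agent_main.py may differ, and `parse_args` is a hypothetical name.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags shown in the command above."""
    parser = argparse.ArgumentParser(description="Run a DataPilot session.")
    parser.add_argument("--goal", required=True,
                        help="Natural-language goal for the agent.")
    parser.add_argument("--offline", action="store_true",
                        help="Use mock responses instead of calling the API.")
    return parser.parse_args(argv)


args = parse_args(["--goal", "Analyze the housing dataset", "--offline"])
print(args.goal, args.offline)
```

Because `--offline` is a `store_true` flag, it defaults to `False` when omitted.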

Agentic Behavior

DataPilot follows a ReAct pattern:

Thought: It analyzes the user request and plans the next step.
Action: It calls a specific tool (e.g., load_data or execute_code).
Observation: It reads the tool's output (e.g., dataframe columns or model accuracy).
Iteration: It repeats this cycle until the goal is met.
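
The cycle above can be sketched in a few lines of Python. This is an illustration of the ReAct pattern under stated assumptions, not the code in agent_core.py: `mock_model` is a scripted stand-in for the LLM, the `load_data` tool here is a stub that returns a fixed string, and the step format (a dict with `thought`, `action`, `input`) is hypothetical.

```python
def load_data(path):
    # Stand-in tool: pretend to load a dataset and report its columns.
    return f"loaded {path} with columns ['Age', 'Fare', 'Survived']"


TOOLS = {"load_data": load_data}


def mock_model(goal, history):
    """Scripted stand-in for the LLM: one tool call, then finish."""
    if not history:
        return {"thought": "I need to look at the data first.",
                "action": "load_data", "input": "titanic.csv"}
    return {"thought": "The goal is met.", "action": "finish",
            "input": "Data loaded."}


def react_loop(goal, model, tools, max_steps=5):
    """Alternate Thought/Action/Observation until the model finishes."""
    history = []
    for _ in range(max_steps):
        step = model(goal, history)            # Thought plus chosen Action
        if step["action"] == "finish":
            return step["input"], history
        tool = tools.get(step["action"])
        if tool is None:
            observation = f"unknown tool: {step['action']}"
        else:
            observation = tool(step["input"])  # Observation
        history.append((step["thought"], step["action"], observation))
    return None, history                       # Step budget exhausted


answer, trace = react_loop("Load the Titanic dataset", mock_model, TOOLS)
```

The `max_steps` cap is a design choice worth keeping in any real loop: it bounds cost if the model never decides the goal is met.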

Safety & Limitations

Sandboxed Execution: Code is executed in a restricted environment, but users should still review generated code.
No Internet Access (Default): The agent relies on local libraries and data.
Hallucinations: Like all LLMs, it may occasionally generate incorrect code or facts. Always verify important results.
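
As a rough illustration of restricted execution, the sketch below runs a snippet in a separate Python process with a timeout. This is not DataPilot's sandbox, whose mechanism is not described here, and `run_snippet` is a hypothetical helper. A child process with a time limit isolates crashes and runaway loops, but it is not a security boundary, so generated code should still be reviewed.

```python
import subprocess
import sys


def run_snippet(code, timeout=5):
    """Run code in a separate Python process and capture its output."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child process is killed when the timeout expires.
        return {"ok": False, "stdout": "", "stderr": "timed out"}
    return {"ok": result.returncode == 0,
            "stdout": result.stdout, "stderr": result.stderr}
```

Returning stderr alongside stdout lets the agent read a traceback as an observation and attempt a fix on the next step.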

Kaggle Competition Alignment

This project demonstrates:

Real-world relevance: Directly aids the core activity of Kaggle users.
Modular Design: Clean separation of concerns.
Reproducibility: Structured logging ensures experiments are trackable.
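
To make the reproducibility point concrete, here is one possible shape for structured experiment logging: one JSON record per line, appended to a file. The function names and record fields are assumptions for illustration and may not match the logging in tools.py.

```python
import json
import time
from pathlib import Path


def log_experiment(log_path, name, params, metrics):
    """Append one experiment record as a JSON line and return it."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "name": name,
        "params": params,
        "metrics": metrics,
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def read_experiments(log_path):
    """Load every logged record back as a list of dicts."""
    with Path(log_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

An append-only, line-per-record file survives interrupted runs and loads directly into a dataframe for comparing experiments.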
