0.7 alpha 2 by Chenglong-MS · Pull Request #258 · microsoft/data-formulator

Chenglong-MS · 2026-03-18T21:36:06Z

PR Summary

Agents & AI Pipeline

Unified data agents: Consolidated agent_py_data_rec, agent_sql_data_rec, agent_py_data_transform, agent_sql_data_transform, agent_concept_derive, agent_py_concept_derive, agent_data_clean, and agent_exploration into three unified agents: data_agent.py, agent_data_rec.py, and agent_data_transform.py
Semantic type system: New semantic_types.py backend module and full frontend type registry (src/lib/agents-chart/core/type-registry.ts, field-semantics.ts, semantic-types.ts) with domain shape inference, tick constraints, zero-baseline classification, and snap-to-bound heuristics
Chart insight agent: New agent_chart_insight.py for AI-generated chart takeaways
Language agent: New agent_language.py for i18n-aware prompts
Diagnostics agent: New agent_diagnostics.py with unified diagnostic information builder for better error reporting
Improved agent robustness: Better handling of missing output blocks, output variable detection, multimodal fallback for text-only models

Visualization

Agents-chart library: Complete new chart rendering library (src/lib/agents-chart/, 120 files, ~44K lines) with multi-backend support for Vega-Lite, ECharts, Chart.js, and GoFish — includes template system, semantic-aware axis/domain/tick handling, color decisions, layout computation, faceting, and overflow filtering
Chart gallery: New ChartGallery.tsx with expanded chart type support including pie, US map, world map, bump, candlestick, density, lollipop, pyramid, radar, rose, streamgraph, strip plot, waterfall, and more
Chart render service: New ChartRenderService.tsx replacing static SVG rendering with vega-embed for interactive charts
Insight panel redesign: Insight takeaways now display as styled cards (matching concept explanation style) with 2-column grid layout instead of bullet lists
Chart recommendations: New SimpleChartRecBox.tsx and chartRecommendation.ts for improved chart suggestion workflow
Score tick fix: Score type with small domain spans (e.g., [0,1]) no longer forces integer-only ticks, preserving intermediate decimal ticks
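The score tick fix above can be illustrated with a minimal sketch. The real logic lives in the TypeScript type registry; the function name `integer_ticks_only` and the `min_span` threshold below are hypothetical and chosen only to show the idea that a narrow Score domain such as [0, 1] should keep decimal ticks.

```python
from typing import Tuple


def integer_ticks_only(
    semantic_type: str,
    domain: Tuple[float, float],
    min_span: float = 5.0,  # assumed threshold, not taken from the PR
) -> bool:
    """Return True when this rule would restrict an axis to integer ticks.

    Only the Score type is considered here; other types are left to
    whatever rules apply to them and always yield False from this sketch.
    """
    lo, hi = domain
    if semantic_type != "Score":
        return False
    # A narrow span (for example [0, 1]) would collapse to just two
    # integer ticks, so decimal ticks are kept in that case.
    return (hi - lo) >= min_span
```

With these assumptions, a Score field on [0, 1] keeps intermediate ticks such as 0.25 and 0.5, while a Score field on [0, 100] still gets integer ticks.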

Data Thread & Workflow

Hybrid thread redesign: Unified data thread with reports integrated into threads (DataThread.tsx rewrite, new DataThreadCards.tsx, InteractionEntryCard.tsx)
Unified formulate data hook: New useFormulateData.ts consolidating data derivation logic
Report editor: New Tiptap-based report editor (TiptapReportEditor.tsx) with richer editing support

Data Loading & Management

Unified upload dialog: New UnifiedDataUploadDialog.tsx replacing the old table selection view — supports file upload, URL, paste, database, and sample datasets in a single dialog with loading state indicators
Multi-table preview: New MultiTablePreview.tsx for previewing multiple tables before loading
Unified table loading thunk: New tableThunks.ts handling all data source types with server-side workspace storage
Live data & refresh: New useDataRefresh.tsx with auto-refresh, stream data sources, and RefreshDataDialog.tsx
Virtual table sorting: Server-side sorting now returns original row IDs (#rowId) via ROW_NUMBER() in DuckDB and pandas paths, preserving original row positions after sort
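The row ID behavior described for virtual table sorting can be sketched in plain Python, standing in for the DuckDB and pandas paths. The helper name `sort_preserving_row_ids` is hypothetical; only the `#rowId` column name comes from the description above. Each row is numbered from 1 before sorting, mirroring what ROW_NUMBER() would produce, so the original position survives the sort.

```python
from typing import Any, Dict, List


def sort_preserving_row_ids(
    rows: List[Dict[str, Any]], key: str, descending: bool = False
) -> List[Dict[str, Any]]:
    """Sort rows by `key` while keeping each row's original 1-based position."""
    # Tag every row with its position before sorting (ROW_NUMBER() analogue).
    tagged = [{**row, "#rowId": i} for i, row in enumerate(rows, start=1)]
    # sorted() is stable, so ties keep their original relative order.
    return sorted(tagged, key=lambda r: r[key], reverse=descending)


rows = [{"name": "b", "score": 2}, {"name": "a", "score": 3}, {"name": "c", "score": 1}]
result = sort_preserving_row_ids(rows, "score")
```

After sorting by `score`, the `#rowId` values still point back to where each row sat in the unsorted table.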

Data Loaders (Database Plugins)

New data loaders: Added Athena, BigQuery, and MongoDB data loaders
Enhanced existing loaders: Improved MySQL, PostgreSQL, MSSQL, S3, Azure Blob, and Kusto loaders with better error handling, connection cleanup, and password sanitization

Datalake / Workspace Backend

New workspace system: Complete datalake/ package with workspace.py, azure_blob_workspace.py, cached_azure_blob_workspace.py, file_manager.py, metadata.py, cache_manager.py, parquet_utils.py, and table_names.py
Workspace factory: New workspace_factory.py for configuration-driven workspace initialization
Session management: New session_routes.py for session-level API endpoints
Unicode & encoding: Support for Unicode filenames, path traversal checks, safe filename processing, UTF-8/GBK encoding detection
Atomic metadata updates: Prevent lost updates in concurrent scenarios
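As an illustration of the UTF-8/GBK encoding detection mentioned above, here is a minimal sketch that tries candidate encodings in order. The function name `decode_with_fallback` and the final lossy fallback are assumptions; the PR does not describe how the workspace code actually performs detection.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    raw: bytes, encodings: Sequence[str] = ("utf-8", "gbk")
) -> Tuple[str, str]:
    """Decode bytes with the first encoding that succeeds.

    Returns the decoded text and the name of the encoding that worked.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Assumed last resort: decode as UTF-8 and replace undecodable bytes.
    return raw.decode("utf-8", errors="replace"), "utf-8"
```

UTF-8 is tried first because valid UTF-8 is rarely misread, whereas GBK accepts many byte sequences and would mask UTF-8 input if tried first.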

Security

Code signing: New code_signing.py for generated code integrity verification
Auth module: New auth.py for authentication handling
URL allowlist: New url_allowlist.py for URL validation
Error sanitization: New sanitize.py to prevent leaking sensitive info in error messages
Sandbox system: New sandbox/ package with local_sandbox.py, docker_sandbox.py, not_a_sandbox.py, and Dockerfile.sandbox replacing the old py_sandbox.py
Identity management: New identity.ts with browser-based identity for multi-user support
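The PR does not say how code_signing.py verifies generated code. One common approach, shown here purely as an assumption, is an HMAC-SHA256 signature computed server-side when code is generated and checked again before execution. The function names `sign_code` and `verify_code` are hypothetical.

```python
import hashlib
import hmac


def sign_code(code: str, secret: bytes) -> str:
    """Return a hex HMAC-SHA256 signature for a piece of generated code."""
    return hmac.new(secret, code.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_code(code: str, signature: str, secret: bytes) -> bool:
    """Check that the code was not altered after it was signed."""
    expected = sign_code(code, secret)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, signature)
```

Under this scheme, code that a client sends back for re-execution is only run if its signature still matches, so a tampered payload is rejected.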

Internationalization (i18n)

Full i18n framework: Added react-i18next with English and Chinese locale files across 7 namespaces (common, chart, encoding, messages, model, navigation, upload)
Translation guide: Comprehensive TRANSLATION_GUIDE.md for contributors

UI & Design System

Design tokens: New tokens.ts with centralized color, spacing, shadow, transition, and radius tokens
Canvas redesign: Refactored DataFormulator.tsx and App.tsx with TopNavButton, AppShell navigation, and model management UI
Encoding shelf updates: Reworked EncodingShelfCard.tsx and EncodingShelfThread.tsx
Removed legacy components: Deleted ConceptCard.tsx, ConceptShelf.tsx, DerivedDataDialog.tsx

Model Management

Server-side global models: New model_registry.py for managing model configurations server-side
Model selection dialog: Enhanced ModelSelectionDialog.tsx with multi-model support

Infrastructure & DevOps

Docker support: New Dockerfile, docker-compose.yml, docker-compose.test.yml with volume permissions and sandbox user handling
Updated dev container: Refreshed .devcontainer/devcontainer.json
Dependency management: Migrated from npm to yarn, added uv.lock, updated pyproject.toml and requirements.txt

Testing

Comprehensive test suite: 69 new test files (~8K lines) covering backend unit, integration, contract, security, plugin, and frontend unit tests
Test infrastructure: New vitest.config.ts, pytest.ini, conftest.py, frontend setup, and test_plan.md
Database plugin tests: Docker-based test harnesses for MySQL, PostgreSQL, MongoDB, and BigQuery

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

Bumps [tornado](https://github.com/tornadoweb/tornado) from 6.5.4 to 6.5.5.
- [Changelog](https://github.com/tornadoweb/tornado/blob/master/docs/releases.rst)
- [Commits](tornadoweb/tornado@v6.5.4...v6.5.5)

---
updated-dependencies:
- dependency-name: tornado
  dependency-version: 6.5.5
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>

Bump cryptography from 46.0.4 to 46.0.6

Bump pygments from 2.19.2 to 2.20.0

Bump aiohttp from 3.13.3 to 3.13.4

Bump litellm from 1.81.6 to 1.83.0

Bump tornado from 6.5.4 to 6.5.5

Dev

…vider model, transitioning from a multi-provider chain to a single provider with anonymous fallback. Updated sections on protocol support and plugin design patterns for improved clarity and structure.

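- Remove the duplicated plugin interface definition from 1-sso-plugin-architecture.md and reference 1-data-source-plugin-architecture.md instead
- Add 2-external-dataloader-enhancements.md, detailing three improvements to external data loaders:
  1. Database metadata fetching (P0)
  2. SSO token passthrough (P1)
  3. Credential persistence (P2)
- Clarify per-database implementation details and priority planning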

Dev

minor

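Add a detailed development roadmap document covering implementation plans and test strategies for SSO authentication, the data source plugin framework, the credential vault, and related features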

Add migration notes for the Superset integration code, renumber the steps, and add documentation delivery requirements

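Add the design decisions and implementation approach for data provenance descriptions, which are assembled from templates rather than generated by AI to ensure accuracy and refreshability. Each description covers the source, filter conditions, time range, and similar details, and is automatically stored in loader_metadata for use by the frontend and AI.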

…ching issues Ensure table key values are validated when building catalog tree Propagate table_key to metadata for frontend use Fix frontend annotation data fetching path

Add SandboxSession class to support sharing a Python namespace across multiple explore/execute_python calls within the same agent turn. Main changes include:
1. Extend worker protocol to support 4-tuple messages and the __clear_ns__ directive
2. Add SandboxSession context manager encapsulating persistence logic
3. Integrate session management into DataAgent and DataLoadingAgent
4. Support cross-turn namespace save/restore (via parquet+manifest)
5. Add related test cases and security checks

Resolves excessive tool call issues and improves multi-step data analysis efficiency.
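The namespace sharing idea behind SandboxSession can be sketched with a small context manager. This is not the actual class: `NamespaceSession` is a hypothetical stand-in that runs code in-process with `exec`, which provides no isolation, and it omits the worker protocol and the parquet+manifest persistence entirely. It only shows how variables from one call stay visible to the next until the session ends.

```python
from typing import Any, Dict


class NamespaceSession:
    """Keep one Python namespace alive across several execute() calls."""

    def __init__(self) -> None:
        self._ns: Dict[str, Any] = {}

    def __enter__(self) -> "NamespaceSession":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Analogue of a clear-namespace directive: drop all state on exit.
        self._ns.clear()
        return False  # do not swallow exceptions

    def execute(self, code: str) -> None:
        # NOTE: plain exec() is NOT a sandbox; it is used here only to
        # demonstrate the shared-namespace behavior.
        exec(code, self._ns)

    def get(self, name: str) -> Any:
        return self._ns.get(name)


with NamespaceSession() as session:
    session.execute("total = sum([1, 2, 3])")
    session.execute("doubled = total * 2")  # sees `total` from the first call
    doubled_inside = session.get("doubled")
```

Without a shared namespace the second call would have to recompute `total`, which is the kind of repeated work the commit message attributes to excessive tool calls.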

Feature/plugin architecture

zhb-ai · 2026-05-01T10:57:17Z

+
+    The response body must not include endpoint-specific top-level fields.
+    """
+    return jsonify({"status": "success", "data": data}), status_code


General fix: never include raw exception text ({e}, {exc}) in client-facing response fields. Log full exceptions server-side (logger.exception(...)) and return stable generic messages (optionally with safe error codes). This preserves functionality while preventing information leakage.

Best targeted fix here:

File: py-src/data_formulator/sandbox/docker_sandbox.py

In both except Exception as exc blocks, keep logging but replace detail argument currently built from exception text with a generic diagnostic string.

File: py-src/data_formulator/sandbox/local_sandbox.py

Import sanitize_error_message.

In the two exception handlers returning "content" with {e}, replace with generic fixed user messages.

Add server-side logging via logger.exception(...).

Optionally keep a sanitized diagnostic field to preserve troubleshooting value without exposing raw internals.

This removes the source of tainted exception data for all listed variants and keeps outward API behavior (error status/content shape) intact.

Won't fix for now. The flagged exception text is used as part of the sandbox execution feedback loop, not just as a generic user-facing error. Sandbox failures are fed back into the agent so the model can repair generated Python code based on concrete diagnostics such as NameError, KeyError, SyntaxError, missing columns, missing output variables, or DataFrame type mismatches.

Replacing these messages with a generic error would likely reduce repair quality and could break existing agent workflows where the next model turn depends on the specific execution failure. The current behavior is therefore intentional for code-execution diagnostics.

We acknowledge that raw host/sandbox infrastructure exceptions should not be exposed broadly. A future targeted fix should preserve sanitized user-code execution errors for model repair while separately replacing infrastructure-level failures, such as worker communication or Docker startup failures, with stable generic messages and server-side logging. @Chenglong-MS
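The targeted fix proposed in the reply above could look roughly like the following sketch. The function name `error_response`, the `USER_CODE_ERRORS` tuple, and the generic message text are assumptions; the exception types listed are the ones named in the reply. Sanitization of the user-code message is left out here and would still be needed in practice.

```python
import logging

logger = logging.getLogger(__name__)

# Errors that come from the generated code itself and help the model repair it.
USER_CODE_ERRORS = (NameError, KeyError, SyntaxError)


def error_response(exc: Exception) -> dict:
    """Build a sandbox error payload without exposing infrastructure details."""
    if isinstance(exc, USER_CODE_ERRORS):
        # Keep concrete diagnostics so the next model turn can fix the code.
        return {"status": "error", "content": f"{type(exc).__name__}: {exc}"}
    # Infrastructure failure: log the details server-side, return a stable message.
    logger.error("Sandbox infrastructure failure", exc_info=exc)
    return {"status": "error", "content": "Sandbox infrastructure error."}
```

The split keeps the agent's repair loop informed about problems in generated code while worker communication or container startup failures surface only as a fixed string.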

- Use ConfinedDir to restrict file operation scope and prevent symlink escape
- Strictly validate origin and safely render JWT templates in SSO bridge
- Add test cases to verify security protection measures
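The confinement idea can be sketched as follows. This is not the ConfinedDir implementation, whose interface is not shown in this PR; `resolve_inside` is a hypothetical helper that resolves symlinks first and then rejects any path that ends up outside the base directory.

```python
import os


def resolve_inside(base: str, relative: str) -> str:
    """Resolve `relative` under `base`, refusing paths that escape it."""
    base_real = os.path.realpath(base)
    # realpath follows symlinks and collapses "..", so the check below
    # applies to the real destination rather than the requested name.
    target = os.path.realpath(os.path.join(base_real, relative))
    if os.path.commonpath([base_real, target]) != base_real:
        raise ValueError(f"path escapes confined directory: {relative!r}")
    return target
```

Checking the resolved path rather than the raw string is what catches both `../` traversal and a symlink inside the directory that points elsewhere.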

Remove function return value and parameter type annotations to support older Python runtime environments. These modifications primarily affect origin validation functionality related to SSO bridging.

Feature/plugin architecture

…ring - Add safe path check function to prevent open redirects - Improve error message handling, filter sensitive information and stack traces - Implement unified secure error responses in Docker sandbox and agents - Add test cases to verify security filtering functionality

…rning functionality Add sanitize_error_message feature in error handling module to securely sanitize all error details returned to clients Add streaming warning handling mechanism, including stream_warning_event and collect_stream_warning/flush_stream_warnings utilities Update development documentation, supplement streaming warning specifications and usage instructions

fix(oidc): use issuer returned by discovery endpoint to update configuration
docs: update OIDC security guide, add GitHub private email and state validation instructions
feat(github): add state parameter validation and private email retrieval functionality
test: add contract tests for mainstream SSO providers

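refactor(tests): optimize BigQuery emulator probe logic to avoid blocking
docs(tests): update README to describe the new test service management approach
fix(tests): remove unnecessary row_count checks from MySQL and PostgreSQL tests
style(tests): format Superset test code and update API endpoints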

Feature/plugin architecture

…ensitive query parameters refactor(frontend): extract rule pre-content processing logic into independent module docs: update documentation for error handling and log desensitization test: add unit tests for log desensitization and rule pre-content

fix(security): enhance log desensitization functionality to support s…

Add test cases to verify automatic date-based log rotation
Refactor ReasoningLogger class, extracting log file path related properties as instance variables
Add _ensure_fd_for_today method to handle log file switching when the date changes
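Date-based rotation of this kind can be sketched as below. `DailyLog` is a hypothetical class, not ReasoningLogger; its `_ensure_file_for_today` method is modeled on the `_ensure_fd_for_today` name from the commit message, and the injectable `today` callable and the `YYYY-MM-DD.log` file naming are assumptions made so the behavior can be tested.

```python
import datetime as dt
import os
from typing import Callable, Optional, TextIO


class DailyLog:
    """Append lines to one file per calendar day."""

    def __init__(self, directory: str, today: Callable[[], dt.date] = dt.date.today):
        self._dir = directory
        self._today = today  # injectable clock for tests
        self._date: Optional[dt.date] = None
        self._fh: Optional[TextIO] = None

    def _ensure_file_for_today(self) -> None:
        current = self._today()
        if current != self._date:
            # The date changed (or this is the first write): switch files.
            if self._fh is not None:
                self._fh.close()
            path = os.path.join(self._dir, f"{current.isoformat()}.log")
            self._fh = open(path, "a", encoding="utf-8")
            self._date = current

    def write(self, line: str) -> None:
        self._ensure_file_for_today()
        self._fh.write(line + "\n")
        self._fh.flush()

    def close(self) -> None:
        if self._fh is not None:
            self._fh.close()
            self._fh = None
```

Checking the date on every write means a long-running process rolls over at midnight without a scheduler or restart.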

…ing document Add document 15.3 with detailed planning for Agent knowledge injection matrix and search tool implementation Update document 15 table format and add link to document 15.3

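1. Refactor the metadata sync flow to support reading rich metadata from the cache
2. Automatically inject catalog metadata (table/column descriptions, tags, etc.) into data summaries
3. Show merged metadata status and user annotation hints in the frontend
4. Add tests covering metadata sync and display logic
- Catalog cache supports two write modes (replace/seed_if_missing)
- Search tool returns source_id/table_key for metadata reads
- Improve the row-count truncation notice shown during data loading
- Unify the metadata status enum and frontend multi-language support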

…on field support Add support for verbose_name and expression fields in Agent context, while preserving dual-source descriptions from source_description and user_description

Feature/plugin architecture

Add _build_sort_clause static method for building ORDER BY clauses
Modify fetch_data_as_arrow method to support sorting options
Add related unit tests to verify sorting functionality
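A sort clause builder along these lines might look like the sketch below. The signature of the real `_build_sort_clause` is not shown in this PR, so `build_sort_clause`, its `(column, direction)` input shape, and the column allowlist are all assumptions; the allowlist and identifier quoting are included because sort columns typically originate from client input.

```python
from typing import Iterable, Sequence, Tuple


def build_sort_clause(
    sort: Sequence[Tuple[str, str]], allowed_columns: Iterable[str]
) -> str:
    """Build an ORDER BY clause from (column, direction) pairs."""
    allowed = set(allowed_columns)
    parts = []
    for column, direction in sort:
        if column not in allowed:
            raise ValueError(f"unknown sort column: {column!r}")
        normalized = direction.upper()
        if normalized not in ("ASC", "DESC"):
            raise ValueError(f"invalid sort direction: {direction!r}")
        # Quote the identifier and escape embedded double quotes.
        quoted = '"' + column.replace('"', '""') + '"'
        parts.append(f"{quoted} {normalized}")
    return ("ORDER BY " + ", ".join(parts)) if parts else ""
```

Validating against known columns, instead of concatenating the raw name, keeps a crafted column string from altering the query.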

… security design document Add ISSUE-005 design document, detailing the current status of sorting capabilities and column name concatenation security issues in data loaders, root cause analysis, fix plan, and testing strategy

Feature/plugin architecture

… library

Merge skills and experiences directories into a unified library directory, keeping rules independent. Main changes include:
- Modify constant definitions and initialization logic in knowledge/store.py
- Update category references in API routes and Agent tools
- Adjust frontend type definitions and state management
- Optimize ExperienceDistillAgent prompts to distill general methodologies
- Update related test cases

…periences
- Merge original three directories rules/skills/experiences into rules/experiences
- KnowledgeStore.search() automatically skips rules with alwaysApply=true
- Update ExperienceDistillAgent to distill general methodologies rather than specific cases
- Frontend simplified from three tabs to two sections, Rules and Experiences
- Injection text uses semantic tags [knowledge]/[rule] instead of directory names

…ience distillation Add timeout parameter configuration in API, backend routes, and agent, frontend sets request timeout based on configuration

…gorithm

Refactor knowledge base rule injection logic, unifying duplicate code into the KnowledgeStore.format_rules_block() method, with support for preloading data to avoid secondary disk reads. Improve the search algorithm to tokenization plus multi-field weighted matching, with support for Chinese-English mixed query splitting and a table name tag bonus. Update related documentation and test cases.
- Add format_rules_block() and load_always_apply_rules() methods
- Implement _tokenize_query() supporting Chinese-English mixed tokenization
- Improve _match_score() weighted algorithm and table name tag bonus
- Update DataLoadingAgent to use last user message as search query
- Refactor rule injection code in 6 Agents
- Improve test coverage and development documentation
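The tokenization and weighted matching described above can be sketched as follows. These are not the `_tokenize_query()` and `_match_score()` implementations: the per-character splitting of CJK text, the field names, and the weights are all assumptions, and the table name tag bonus is omitted.

```python
import re
from typing import Dict, List

# Runs of ASCII word characters become one token; each CJK character
# becomes its own token (an assumed strategy for mixed-language queries).
_TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|[\u4e00-\u9fff]")

# Hypothetical field weights: a title hit counts more than a body hit.
FIELD_WEIGHTS = {"title": 3.0, "tags": 2.0, "body": 1.0}


def tokenize_query(query: str) -> List[str]:
    """Split a mixed Chinese-English query into lowercase tokens."""
    return [token.lower() for token in _TOKEN_RE.findall(query)]


def match_score(query: str, entry: Dict[str, str]) -> float:
    """Sum field weights for every query token found in each field."""
    tokens = tokenize_query(query)
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        text = entry.get(field, "").lower()
        score += weight * sum(1 for token in tokens if token in text)
    return score
```

Splitting the query first lets a mixed query such as an English table name followed by Chinese keywords match entries that contain only part of it, which a whole-string match would miss.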

- Update Superset SSO configuration example, add DF Token Exchange endpoint
- Supplement SSO token exchange mode documentation, including flow, deployment steps, and security notes
- Mark user metadata feature as completed, abandon imported table editing approach
- Streamline knowledge system documentation, archive completed content
- Update user isolation design document, record implemented portions
- Adjust knowledge injection planning document, highlight core conclusions and implemented portions

Add Chinese and English translations for Agent logs, including status messages and expand/collapse functionality
Implement folding and expanding of long messages to improve message display
Enhance Agent step display with distinct icons for error, warning, and info states
Fix JSON parsing issues when weak models call tools, and add validation logic

Feature/plugin architecture

github-advanced-security AI found potential problems Mar 24, 2026

Comment thread py-src/data_formulator/agent_routes.py Fixed

Comment thread py-src/data_formulator/agent_routes.py Fixed

Comment thread py-src/data_formulator/tables_routes.py Fixed

Chenglong-MS requested a review from Copilot March 24, 2026 20:34

Copilot AI reviewed Mar 24, 2026

Chenglong-MS requested a review from Copilot March 24, 2026 20:36

Copilot AI reviewed Mar 24, 2026

github-advanced-security AI found potential problems Mar 25, 2026

Comment thread py-src/data_formulator/routes/agents.py Fixed

Comment thread py-src/data_formulator/tables_routes.py Fixed

Comment thread py-src/data_formulator/tables_routes.py Fixed

Comment thread py-src/data_formulator/tables_routes.py Fixed

dependabot Bot and others added 16 commits April 4, 2026 09:22

add test infra

06c3744

Merge pull request #272 from microsoft/dependabot/uv/cryptography-46.0.6

026f8ae

Bump cryptography from 46.0.4 to 46.0.6

Merge pull request #274 from microsoft/dependabot/uv/pygments-2.20.0

07dea4d

Bump pygments from 2.19.2 to 2.20.0

Merge pull request #275 from microsoft/dependabot/uv/aiohttp-3.13.4

ca71f43

Bump aiohttp from 3.13.3 to 3.13.4

Merge pull request #276 from microsoft/dependabot/uv/litellm-1.83.0

0ab9c78

Bump litellm from 1.81.6 to 1.83.0

Merge pull request #277 from microsoft/dependabot/uv/tornado-6.5.5

ad2823f

Bump tornado from 6.5.4 to 6.5.5

workflow refactor

6a5f2d2

hybrid thread

4d5cbc1

Merge pull request #279 from microsoft/dev

7d26881

Dev

Refactor SSO plugin architecture documentation to clarify the AuthPro…

5074a5d

…vider model, transitioning from a multi-provider chain to a single provider with anonymous fallback. Updated sections on protocol support and plugin design patterns for improved clarity and structure.

report unified to threads

bc395c4

reports update

ff1e252

fixes and improvements

59a6bd7

fix issues from zhb

e0597f4

Chenglong-MS requested a review from zhb-ai April 8, 2026 05:46

zhb-ai and others added 7 commits April 8, 2026 14:20

Merge pull request #281 from microsoft/dev

f47ada0

Dev

minor

346643a

Merge pull request #282 from microsoft/dev

ee274db

minor

docs: add SSO and data source plugin development roadmap document

c4ae334

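Add a detailed development roadmap document covering implementation plans and test strategies for SSO authentication, the data source plugin framework, the credential vault, and related features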

docs(design docs): update development roadmap document content

503a531

Add migration notes for the Superset integration code, renumber the steps, and add documentation delivery requirements

new workspace design

5fcf0df

zhb-y-agent and others added 3 commits May 1, 2026 03:47

fix: fix data catalog tree building and frontend table annotation fet…

e0690ae

…ching issues Ensure table key values are validated when building catalog tree Propagate table_key to metadata for frontend use Fix frontend annotation data fetching path

Merge pull request #322 from microsoft/feature/plugin-architecture

9393d53

Feature/plugin architecture

github-advanced-security AI found potential problems Apr 30, 2026

zhb-y-agent and others added 3 commits May 1, 2026 05:26

fix(security): prevent directory traversal and XSS attacks

283fd39

- Use ConfinedDir to restrict file operation scope, prevent symlink escape - Strictly validate origin and safely render JWT templates in SSO bridge - Add test cases to verify security protection measures

refactor: remove type annotations for legacy Python compatibility

7fd637a

Remove function return value and parameter type annotations to support older Python runtime environments. These modifications primarily affect origin validation functionality related to SSO bridging.

Merge pull request #323 from microsoft/feature/plugin-architecture

464abd5

Feature/plugin architecture

github-advanced-security AI found potential problems Apr 30, 2026

Comment thread docs-cn/config-examples/superset/oauth_config.py Fixed

zhb-y-agent and others added 22 commits May 1, 2026 16:57

Merge pull request #324 from microsoft/feature/plugin-architecture

4251221

Feature/plugin architecture

Merge pull request #325 from microsoft/feature/plugin-architecture

92ab564

fix(security): enhance log desensitization functionality to support s…

docs(knowledge system): add 15.3 knowledge injection and search plann…

fe596c6

…ing document Add document 15.3 with detailed planning for Agent knowledge injection matrix and search tool implementation Update document 15 table format and add link to document 15.3

feat(agents): enhance metadata display with verbose_name and expressi…

563e5e7

…on field support Add support for verbose_name and expression fields in Agent context, while preserving dual-source descriptions from source_description and user_description

Merge pull request #326 from microsoft/feature/plugin-architecture

9dcc45b

Feature/plugin architecture

feat(data loader): add sorting functionality to Superset data loader

58567c8

Add _build_sort_clause static method for building ORDER BY clauses Modify fetch_data_as_arrow method to support sorting options Add related unit tests to verify sorting functionality

Merge pull request #327 from microsoft/feature/plugin-architecture

1fbe1b2

Feature/plugin architecture

feat(knowledge distillation): add timeout parameter support for exper…

2244e46

…ience distillation Add timeout parameter configuration in API, backend routes, and agent, frontend sets request timeout based on configuration

Merge pull request #328 from microsoft/feature/plugin-architecture

6a9740e

Feature/plugin architecture

@@ -17,6 +17,7 @@
             import pandas as pd
+            from data_formulator.security.sanitize import sanitize_error_message
             from .base import Sandbox
             logger = logging.getLogger(__name__)
@@ -535,9 +536,13 @@
                                 }
                         except Exception as e:
+                            logger.exception("[LocalSandbox] Error during execution setup")
                             return {
                                 "status": "error",
-                                "content": f"Error during execution setup: {type(e).__name__} - {e}",
+                                "content": "Error during execution setup.",
+                                "diagnostics": {
+                                    "safe_detail": sanitize_error_message(f"{type(e).__name__}: {e}")
+                                },
                             }
                 # ------------------------------------------------------------------
@@ -569,7 +572,14 @@
                         _worker_pool.release(proc, conn)
                         return result
                     except Exception as e:
+                        logger.exception("[LocalSandbox] Worker communication failed")
                         _worker_pool.discard(proc, conn)
-                        return {"status": "error", "content": f"Error: worker communication failed - {e}"}
+                        return {
+                            "status": "error",
+                            "content": "Error: worker communication failed.",
+                            "diagnostics": {
+                                "safe_detail": sanitize_error_message(f"{type(e).__name__}: {e}")
+                            },
+                        }

@@ -193,7 +193,7 @@
                             self._cleanup(tmpdir)
                             return _safe_error_response(
                                 "Failed to start Docker container.",
-                                f"{type(exc).__name__}: {exc}",
+                                "Sandbox startup failed",
                             )
                         stdout = proc.stdout or ""
@@ -243,7 +243,7 @@
                             self._cleanup(tmpdir)
                             return _safe_error_response(
                                 "Failed to read output parquet.",
-                                f"{type(exc).__name__}: {exc}",
+                                "Sandbox output read failed",
                             )
                         self._cleanup(tmpdir)

Chenglong-MS wants to merge 384 commits into main from dev

7 participants
