Data Source Support¶
sktime-mcp now supports loading data from multiple sources beyond the built-in demo datasets!
Supported Data Sources¶
- SQL Databases - PostgreSQL, MySQL, SQLite, MSSQL
- Files - CSV, TSV, Excel, Parquet
Quick Start¶
1. CSV File¶
result = executor.load_data_source({
"type": "file",
"path": "/path/to/data.csv",
"time_column": "date",
"target_column": "sales",
"exog_columns": ["temperature", "promotion"], # Optional
})
2. SQL Database¶
# SQLite
result = executor.load_data_source({
"type": "sql",
"connection_string": "sqlite:///path/to/database.db",
"query": "SELECT date, sales FROM sales_table",
"time_column": "date",
"target_column": "sales",
})
# PostgreSQL
result = executor.load_data_source({
"type": "sql",
"connection_string": "postgresql://user:pass@host:5432/db",
"query": "SELECT * FROM sales WHERE date >= '2020-01-01'",
"time_column": "date",
"target_column": "sales",
})
MCP Tools¶
load_data_source¶
Load data from any supported source.
Arguments:
- config (dict): Data source configuration
Returns:
- success (bool): Whether loading succeeded
- data_handle (str): Handle to reference the loaded data
- metadata (dict): Information about the data (rows, columns, frequency, etc.)
- validation (dict): Data quality validation results
Example:
{
"config": {
"type": "pandas",
"data": {"date": [...], "value": [...]},
"time_column": "date",
"target_column": "value"
}
}
fit_predict_with_data¶
Fit a model and generate predictions using custom data.
Arguments:
- estimator_handle (str): Handle from instantiate_estimator
- data_handle (str): Handle from load_data_source
- horizon (int): Forecast horizon (default: 12)
Example:
list_data_sources¶
List all available data source types.
Returns:
- sources (list): Available source types
- descriptions (dict): Description for each source
list_data_handles¶
List all currently loaded data handles.
Returns:
- count (int): Number of loaded data handles
- handles (list): List of data handle information
release_data_handle¶
Release a data handle and free memory.
Arguments:
- data_handle (str): Handle to release
Configuration Options¶
Pandas Adapter¶
{
"type": "pandas",
"data": df_or_dict, # DataFrame or dict
"time_column": "date", # Optional, will try to detect
"target_column": "sales", # Optional, defaults to first column
"exog_columns": ["temp", "promo"], # Optional
"frequency": "D" # Optional, will try to infer
}
SQL Adapter¶
{
"type": "sql",
# Option 1: Connection string
"connection_string": "postgresql://user:pass@host:5432/db",
# Option 2: Individual components
"dialect": "postgresql", # postgresql, mysql, sqlite, mssql
"host": "localhost",
"port": 5432,
"database": "mydb",
"username": "user",
"password": "pass",
# Query
"query": "SELECT * FROM sales", # Direct SQL query
# OR
"table": "sales", # Table name
"filters": {"region": "North"}, # Optional filters
# Column mapping
"time_column": "date",
"target_column": "sales",
"exog_columns": ["temperature"],
# Optional
"parse_dates": ["date"],
"frequency": "D"
}
File Adapter¶
{
"type": "file",
"path": "/path/to/data.csv",
"format": "csv", # csv, excel, parquet (auto-detected if not specified)
# Column mapping
"time_column": "date",
"target_column": "sales",
"exog_columns": ["temperature"],
# CSV-specific options
"csv_options": {
"sep": ",",
"header": 0,
"encoding": "utf-8"
},
# Excel-specific options
"excel_options": {
"sheet_name": 0,
"header": 0
},
# Common options
"parse_dates": True,
"frequency": "D"
}
Data Validation¶
All data sources are automatically validated for:
- ✅ DatetimeIndex presence
- ✅ No duplicate time indices
- ✅ Sufficient data points
- ✅ Missing value detection
- ✅ Frequency inference
Validation results are included in the response:
{
"valid": True,
"errors": [], # Critical issues that prevent usage
"warnings": ["Missing values detected: {'sales': 5.0}"] # Non-critical issues
}
Installation¶
Core (Pandas support included)¶
With SQL support¶
With file format support¶
With all optional dependencies¶
Examples¶
See the examples/ directory for complete working examples:
pandas_example.py- Loading from pandas DataFramescsv_example.py- Loading from CSV/TSV filessql_example.py- Loading from SQL databases (SQLite, PostgreSQL, MySQL)
Architecture¶
Data Source Layer
├── base.py # Abstract adapter interface
├── registry.py # Adapter registry
└── adapters/
├── pandas_adapter.py # In-memory DataFrames
├── sql_adapter.py # SQL databases
└── file_adapter.py # CSV, Excel, Parquet
Each adapter implements:
- load() - Load data from source
- validate() - Check data quality
- to_sktime_format() - Convert to (y, X) format
Metadata¶
Each loaded data source provides rich metadata:
{
"source": "sql",
"rows": 100,
"columns": ["sales", "temperature"],
"frequency": "D",
"start_date": "2020-01-01",
"end_date": "2020-04-09",
"missing_values": {"sales": 0, "temperature": 2}
}
Error Handling¶
All operations return structured error information:
{
"success": False,
"error": "Could not convert index to datetime",
"error_type": "ValueError",
"validation": {
"valid": False,
"errors": ["Index must be DatetimeIndex"]
}
}
💡 LLM Guidelines & Best Practices¶
When using these tools to handle user data, follow these best practices to avoid common execution errors:
1. Column Identification¶
- Target vs. Features: Identify which column is the target (
y) and which are exogenous features (X). - Numeric Only: Ensure the target column contains only numeric data. Forecasting models will fail if string/categorical columns are accidentally included in
y. - User Confirmation: If the dataset has multiple columns and the user hasn't specified which one to forecast, ask for clarification instead of guessing.
2. Strategic Loading¶
- Use
usecols: When loading from CSV or Excel, usecsv_options: {"usecols": ["Column1", "Column2"]}to load only the necessary columns. This prevents string-based columns from being accidentally passed to the model. - Time Column Mapping: Always map the
time_columnif one exists. If it doesn't, sktime will default to aRangeIndex, which is often safer for simple sequences.
3. Frequency Hurdles¶
- Month-Start vs Month-End: If you encounter errors like
<MonthBegin> is not supported as period frequency, it is likely due tostatsmodelslimitations. In such cases, consider dropping the timestamp index and using a simple integer index (RangeIndex) by not specifying atime_column.
4. Validation First¶
- Status Check: Always inspect the
validationobject returned byload_data_source. Ifvalidisfalse, read theerrorslist to understand why the data might break the model.
Troubleshooting¶
"No module named 'sqlalchemy'"¶
"No module named 'openpyxl'" (for Excel files)¶
"No module named 'pyarrow'" (for Parquet files)¶
"Could not infer frequency"¶
Specify frequency explicitly in config:
"Index must be DatetimeIndex"¶
Ensure your time column is specified correctly: