Data Source Support¶

sktime-mcp now supports loading data from multiple sources beyond the built-in demo datasets!

Supported Data Sources¶

SQL Databases - PostgreSQL, MySQL, SQLite, MSSQL
Files - CSV, TSV, Excel, Parquet

Quick Start¶

1. CSV File¶

result = executor.load_data_source({
    "type": "file",
    "path": "/path/to/data.csv",
    "time_column": "date",
    "target_column": "sales",
    "exog_columns": ["temperature", "promotion"],  # Optional
})

2. SQL Database¶

# SQLite
result = executor.load_data_source({
    "type": "sql",
    "connection_string": "sqlite:///path/to/database.db",
    "query": "SELECT date, sales FROM sales_table",
    "time_column": "date",
    "target_column": "sales",
})

# PostgreSQL
result = executor.load_data_source({
    "type": "sql",
    "connection_string": "postgresql://user:pass@host:5432/db",
    "query": "SELECT * FROM sales WHERE date >= '2020-01-01'",
    "time_column": "date",
    "target_column": "sales",
})

MCP Tools¶

`load_data_source`¶

Load data from any supported source.

Arguments: - config (dict): Data source configuration

Returns: - success (bool): Whether loading succeeded - data_handle (str): Handle to reference the loaded data - metadata (dict): Information about the data (rows, columns, frequency, etc.) - validation (dict): Data quality validation results

Example:

{
  "config": {
    "type": "pandas",
    "data": {"date": [...], "value": [...]},
    "time_column": "date",
    "target_column": "value"
  }
}

`fit_predict_with_data`¶

Fit a model and generate predictions using custom data.

Arguments: - estimator_handle (str): Handle from instantiate_estimator - data_handle (str): Handle from load_data_source - horizon (int): Forecast horizon (default: 12)

Example:

{
  "estimator_handle": "est_abc123",
  "data_handle": "data_xyz789",
  "horizon": 7
}

`list_data_sources`¶

List all available data source types.

Returns: - sources (list): Available source types - descriptions (dict): Description for each source

`list_data_handles`¶

List all currently loaded data handles.

Returns: - count (int): Number of loaded data handles - handles (list): List of data handle information

`release_data_handle`¶

Release a data handle and free memory.

Arguments: - data_handle (str): Handle to release

Configuration Options¶

Pandas Adapter¶

{
    "type": "pandas",
    "data": df_or_dict,  # DataFrame or dict
    "time_column": "date",  # Optional, will try to detect
    "target_column": "sales",  # Optional, defaults to first column
    "exog_columns": ["temp", "promo"],  # Optional
    "frequency": "D"  # Optional, will try to infer
}

SQL Adapter¶

{
    "type": "sql",

    # Option 1: Connection string
    "connection_string": "postgresql://user:pass@host:5432/db",

    # Option 2: Individual components
    "dialect": "postgresql",  # postgresql, mysql, sqlite, mssql
    "host": "localhost",
    "port": 5432,
    "database": "mydb",
    "username": "user",
    "password": "pass",

    # Query
    "query": "SELECT * FROM sales",  # Direct SQL query
    # OR
    "table": "sales",  # Table name
    "filters": {"region": "North"},  # Optional filters

    # Column mapping
    "time_column": "date",
    "target_column": "sales",
    "exog_columns": ["temperature"],

    # Optional
    "parse_dates": ["date"],
    "frequency": "D"
}

File Adapter¶

{
    "type": "file",
    "path": "/path/to/data.csv",
    "format": "csv",  # csv, excel, parquet (auto-detected if not specified)

    # Column mapping
    "time_column": "date",
    "target_column": "sales",
    "exog_columns": ["temperature"],

    # CSV-specific options
    "csv_options": {
        "sep": ",",
        "header": 0,
        "encoding": "utf-8"
    },

    # Excel-specific options
    "excel_options": {
        "sheet_name": 0,
        "header": 0
    },

    # Common options
    "parse_dates": True,
    "frequency": "D"
}

Data Validation¶

All data sources are automatically validated for:

✅ DatetimeIndex presence
✅ No duplicate time indices
✅ Sufficient data points
✅ Missing value detection
✅ Frequency inference

Validation results are included in the response:

{
    "valid": True,
    "errors": [],  # Critical issues that prevent usage
    "warnings": ["Missing values detected: {'sales': 5.0}"]  # Non-critical issues
}

Installation¶

Core (Pandas support included)¶

pip install -e .

With SQL support¶

pip install -e ".[sql]"

With file format support¶

pip install -e ".[files]"

With all optional dependencies¶

pip install -e ".[all]"

Examples¶

See the examples/ directory for complete working examples:

pandas_example.py - Loading from pandas DataFrames
csv_example.py - Loading from CSV/TSV files
sql_example.py - Loading from SQL databases (SQLite, PostgreSQL, MySQL)

Architecture¶

Data Source Layer
├── base.py                 # Abstract adapter interface
├── registry.py             # Adapter registry
└── adapters/
    ├── pandas_adapter.py   # In-memory DataFrames
    ├── sql_adapter.py      # SQL databases
    └── file_adapter.py     # CSV, Excel, Parquet

Each adapter implements: - load() - Load data from source - validate() - Check data quality - to_sktime_format() - Convert to (y, X) format

Metadata¶

Each loaded data source provides rich metadata:

{
    "source": "sql",
    "rows": 100,
    "columns": ["sales", "temperature"],
    "frequency": "D",
    "start_date": "2020-01-01",
    "end_date": "2020-04-09",
    "missing_values": {"sales": 0, "temperature": 2}
}

Error Handling¶

All operations return structured error information:

{
    "success": False,
    "error": "Could not convert index to datetime",
    "error_type": "ValueError",
    "validation": {
        "valid": False,
        "errors": ["Index must be DatetimeIndex"]
    }
}

💡 LLM Guidelines & Best Practices¶

When using these tools to handle user data, follow these best practices to avoid common execution errors:

1. Column Identification¶

Target vs. Features: Identify which column is the target (y) and which are exogenous features (X).
Numeric Only: Ensure the target column contains only numeric data. Forecasting models will fail if string/categorical columns are accidentally included in y.
User Confirmation: If the dataset has multiple columns and the user hasn't specified which one to forecast, ask for clarification instead of guessing.

2. Strategic Loading¶

Use usecols: When loading from CSV or Excel, use csv_options: {"usecols": ["Column1", "Column2"]} to load only the necessary columns. This prevents string-based columns from being accidentally passed to the model.
Time Column Mapping: Always map the time_column if one exists. If it doesn't, sktime will default to a RangeIndex, which is often safer for simple sequences.

3. Frequency Hurdles¶

Month-Start vs Month-End: If you encounter errors like <MonthBegin> is not supported as period frequency, it is likely due to statsmodels limitations. In such cases, consider dropping the timestamp index and using a simple integer index (RangeIndex) by not specifying a time_column.

4. Validation First¶

Status Check: Always inspect the validation object returned by load_data_source. If valid is false, read the errors list to understand why the data might break the model.

Troubleshooting¶

"No module named 'sqlalchemy'"¶

pip install sqlalchemy

"No module named 'openpyxl'" (for Excel files)¶

pip install openpyxl

"No module named 'pyarrow'" (for Parquet files)¶

pip install pyarrow

"Could not infer frequency"¶

Specify frequency explicitly in config:

config = {
    ...
    "frequency": "D"  # Daily, "W" for weekly, "M" for monthly, etc.
}

"Index must be DatetimeIndex"¶

Ensure your time column is specified correctly:

config = {
    ...
    "time_column": "date",  # Name of your date/time column
    "parse_dates": True
}